Commit 1ea779ca
Authored Aug 24, 2018 by tensor-tang

Merge remote-tracking branch 'ups/develop' into refine/jit

Parents: e3bb98eb, d82453fb

Showing 4 changed files with 230 additions and 16 deletions (+230, -16)

doc/fluid/design/dist_train/dist_train_nccl2.md (+6, -6)
doc/fluid/howto/cluster/nccl2_rdma_training.md (+10, -10)
python/paddle/fluid/tests/unittests/test_program_code.py (+81, -0)
python/paddle/fluid/transpiler/details/program_utils.py (+133, -0)

doc/fluid/design/dist_train/dist_train_nccl2.md
View file @ 1ea779ca

 # Distributed Training with NCCL2

 We design a pattern that can enable training with `ParallelExecutor` and
-using [NCCL2](https://developer.nvidia.com/nccl) as it's collective
+use [NCCL2](https://developer.nvidia.com/nccl) as it's collective
 communication library.

 In `ParallelExecutor` we can use `AllReduce` or `Reduce` and `Broadcast`
@@ -9,14 +9,14 @@ to do multi GPU training. And if we initialize NCCL2 communicators as
 ranks in a distributed environment, we can simply run the `ParallelExecutor`
 as a distributed program! The only thing that may be different than in
 the single node version is that we need to broadcast the NCCL unique ID
-to all the nodes, and initialize communicators using that ID, so NCCL2
-will know each other as ranks.
+to all the nodes and initialize communicators using that ID, so NCCL2
+can know each other as ranks.

 To achieve this feature, we introduce a new operator: `gen_nccl_id` op,
 so we are ***not*** "bind to" running NCCL2 with MPI, we can run it in
-what ever platform you like.
+whatever platform you like.

-It have two running modes:
+It has two running modes:

 1. Generate and broadcast mode, which should be used on trainer 0;
 1. Listen and fetch mode, which should be used on trainers other than 0.
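
The two running modes described in this hunk can be illustrated with a small simulation. This is only a sketch under stated assumptions and does not reproduce the `gen_nccl_id` op: plain TCP sockets, threads standing in for trainers, and random bytes standing in for the ID that NCCL's `ncclGetUniqueId` would produce are all illustrative choices, and the function names are hypothetical.

```python
import os
import socket
import threading

NCCL_ID_BYTES = 128  # assumption: treat the unique ID as an opaque 128-byte blob


def listen_and_fetch(server_sock, result, rank):
    """Listen-and-fetch mode: wait for trainer 0 to push the ID."""
    conn, _ = server_sock.accept()
    with conn:
        chunks = []
        remaining = NCCL_ID_BYTES
        while remaining > 0:
            data = conn.recv(remaining)
            if not data:
                break
            chunks.append(data)
            remaining -= len(data)
    result[rank] = b"".join(chunks)


def generate_and_broadcast(endpoints):
    """Generate-and-broadcast mode: create the ID and send it to every peer."""
    unique_id = os.urandom(NCCL_ID_BYTES)  # stand-in for a real NCCL unique ID
    for host, port in endpoints:
        with socket.create_connection((host, port)) as conn:
            conn.sendall(unique_id)
    return unique_id


def simulate(num_other_trainers=2):
    """Run trainer 0 and the other trainers as threads on localhost."""
    received = {}
    servers, threads = [], []
    for rank in range(1, num_other_trainers + 1):
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        srv.listen(1)
        servers.append(srv)
        worker = threading.Thread(
            target=listen_and_fetch, args=(srv, received, rank))
        worker.start()
        threads.append(worker)
    endpoints = [srv.getsockname() for srv in servers]
    unique_id = generate_and_broadcast(endpoints)
    for worker in threads:
        worker.join()
    for srv in servers:
        srv.close()
    return unique_id, received
```

After `simulate()` returns, every non-zero trainer holds the same bytes that trainer 0 generated, which is the precondition for initializing the communicators as ranks of one group.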
@@ -29,7 +29,7 @@ initialize NCCL communicator objects.
 <img src="src/ncc2_design.png">

 The above figure indicates the general process when training with NCCL2
-distributed. Each trainer have the number of communicators equal to the
+distributed. Each trainer has the number of communicators equal to the
 number of GPUs, but the ranks should match the global ranks number: here
 we have total 8 GPUs, so `nranks==8`, for each trainer, the ranks should
 be from 0 ~ 3 on trainer 0 and 4 ~ 7 on trainer 1.
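
The rank layout in this hunk can be written out as a short calculation. This is a minimal sketch, assuming every trainer has the same number of GPUs and that ranks are assigned contiguously per trainer; `global_ranks` is a hypothetical helper, not part of Paddle.

```python
def global_ranks(trainer_id, gpus_per_trainer):
    """Return the global NCCL ranks owned by one trainer."""
    start = trainer_id * gpus_per_trainer
    return list(range(start, start + gpus_per_trainer))


num_trainers = 2
gpus_per_trainer = 4
nranks = num_trainers * gpus_per_trainer  # 8 in the example from the doc

for trainer_id in range(num_trainers):
    print(trainer_id, global_ranks(trainer_id, gpus_per_trainer))
```

With two trainers and four GPUs each, trainer 0 owns ranks 0 to 3 and trainer 1 owns ranks 4 to 7, matching the text.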

doc/fluid/howto/cluster/nccl2_rdma_training.md
View file @ 1ea779ca

 # Distributed Training with NCCL2 and RDMA

-When doing distributed multi-GPU training, network bandwith often becomes the
-bottle neck. We introduce a way to use NCCL2 to do such training job to
-achieve best performace.
+When doing distributed multi-GPU training, network bandwidth often becomes the
+bottleneck. We introduce a way to use NCCL2 to do such training job to
+achieve best performance.

-## Prepare Hardwares with RDMA and Multiple GPUs
+## Prepare Hardware with RDMA and Multiple GPUs

-I'm using two Linux servers each of them is installed with 8 GPUs and
+I'm using two Linux servers each of them installed with 8 GPUs and
 one 100Gb RDMA card.
 Base environment is:
@@ -25,7 +25,7 @@ In general, the steps including:
 1. Use docker to run tests and make sure GPUs and RDMA can work inside
    the container.

-I'll ommit section "Install GPU drivers" because we can find it easily
+I'll omit the section "Install GPU drivers" because we can find it easily
 somewhere else.

 ### Install RDMA drivers
@@ -33,7 +33,7 @@ somewhere else.
 For my case, I've got two machines with device
 "Mellanox Technologies MT27700 Family [ConnectX-4]" installed. The OS was
 "CentOS 7.4" and I updated the kernel to version 4.4 so that docker can
-work with latest overlay2 filesystem.
+work with the latest overlay2 filesystem.

 ***NOTE: before you start, make sure you have a way to get a console
 of the server other than ssh because we may need to re-configure the
@@ -45,14 +45,14 @@ network device.***
 1. Run `./mlnxofedinstall --add-kernel-support` in the software package.
 1. Run `/etc/init.d/openibd restart` to make everything work, note that
    this operation may cause the network goes down if you are using this
-   RDMA device as default network device and use ssh to login the server.
+   RDMA device as default network device and use ssh to log in the server.
 1. Re-configure the network interface, for example:
    `ifconfig eth2 192.168.16.30/20 up`, then add routes if needed:
    `ip route add default via 192.168.16.1 dev eth2`.
 1. Do the same thing on the other node.
 1. Use `ping` to test if the two nodes have typical ICMP connection.
 1. Use either `udaddy` or `ib_write_bw` to test the network connection is
-   ready and have the desired bandwith.
+   ready and have the desired bandwidth.

 ### Prepare Docker Image to Run RDMA Programs
@@ -60,7 +60,7 @@ network device.***
    package in it.
 1. Start a docker container and mount GPU driver libs into it (you can
    skip this step if you are using nvidia-docker).
-1. Mount RDMA dirvers and libs into the docker image (see below section),
+1. Mount RDMA drivers and libs into the docker image (see below section),
    also `udaddy` and `ib_write_bw` if needed.
 1. Mount GPU devices and RDMA devices into the container using `--device`
    or just use privileged mode `--privileged`.
...
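
The last step in this hunk, exposing devices with `--device` or falling back to `--privileged`, could be scripted roughly as follows. This is a sketch only: the helper name, the image name, and the device node paths are assumptions for illustration, and the actual device nodes depend on the installed drivers.

```python
def docker_run_args(image, devices=None, privileged=False):
    """Build an argv list for `docker run` that exposes GPU and RDMA devices."""
    args = ["docker", "run", "-it"]
    if privileged:
        # Privileged mode exposes all host devices, so no --device flags needed.
        args.append("--privileged")
    else:
        for dev in devices or []:
            args.extend(["--device", dev])
    args.append(image)
    return args


# Hypothetical device nodes and image name; check /dev on the host first.
devices = [
    "/dev/nvidia0",
    "/dev/nvidiactl",
    "/dev/infiniband/uverbs0",
    "/dev/infiniband/rdma_cm",
]
cmd = docker_run_args("my-rdma-image", devices=devices)
print(" ".join(cmd))
```

Listing devices explicitly keeps the container's access narrow; `privileged=True` trades that for convenience.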
python/paddle/fluid/tests/unittests/test_program_code.py
0 → 100644
View file @ 1ea779ca
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import time
import unittest
from multiprocessing import Process
import signal

import numpy

import paddle.fluid as fluid
import paddle.fluid.layers as layers
from paddle.fluid.layers.io import ListenAndServ
from paddle.fluid.layers.io import Recv
from paddle.fluid.layers.io import Send

from paddle.fluid.transpiler.details import program_to_code


class TestProgram2Code(unittest.TestCase):
    def test_print(self):
        place = fluid.CPUPlace()
        self.init_serv(place)
        self.init_client(place, 9123)

    def init_serv(self, place):
        main = fluid.Program()

        with fluid.program_guard(main):
            serv = ListenAndServ("127.0.0.1:0", ["X"], optimizer_mode=False)
            with serv.do():
                out_var = main.global_block().create_var(
                    name="scale_0.tmp_0",
                    persistable=True,
                    dtype="float32",
                    shape=[32, 32])
                x = layers.data(
                    shape=[32, 32],
                    dtype='float32',
                    name="X",
                    append_batch_size=False)
                fluid.initializer.Constant(value=1.0)(x, main.global_block())
                layers.scale(x=x, scale=10.0, out=out_var)

        program_to_code(main)

    def init_client(self, place, port):
        main = fluid.Program()
        with fluid.program_guard(main):
            x = layers.data(
                shape=[32, 32],
                dtype='float32',
                name='X',
                append_batch_size=False)
            fluid.initializer.Constant(value=2.3)(x, main.global_block())
            get_var = main.global_block().create_var(
                name="scale_0.tmp_0",  # server side var
                dtype="float32",
                persistable=False,
                shape=[32, 32])
            fluid.initializer.Constant(value=2.3)(get_var, main.global_block())
            Send("127.0.0.1:%d" % port, [x])
            o = Recv("127.0.0.1:%d" % port, [get_var])

        program_to_code(main)


if __name__ == "__main__":
    unittest.main()

python/paddle/fluid/transpiler/details/program_utils.py
View file @ 1ea779ca
@@ -16,6 +16,9 @@ from __future__ import print_function
import six

from paddle.fluid import core
import paddle


def delete_ops(block, ops):
    try:
@@ -39,3 +42,133 @@ def find_op_by_output_arg(block, arg_name):
        if arg_name in op.output_arg_names:
            return index
    return -1


def get_indent_space(indent, space_num=4):
    ret = ""
    for i in range(0, indent * space_num):
        ret += " "

    return ret


def variable_to_code(var):
    """
    Get readable codes of fluid variable.

    Args:
        var: A fluid variable.

    Returns:
        string: The formatted string.
    """

    var_str = "{name} : fluid.{type}.shape{shape}.astype({dtype})".format(
        name=var.name, type=var.type, shape=var.shape, dtype=var.dtype)

    # Parameters are labelled by trainability; everything else is a plain var.
    if type(var) == paddle.fluid.framework.Parameter:
        if var.trainable:
            var_str = "trainable parameter " + var_str
        else:
            var_str = "parameter " + var_str
    else:
        var_str = "var " + var_str

    if var.persistable:
        var_str = "persist " + var_str

    return var_str


def op_to_code(op):
    """
    Get readable codes of fluid operator.

    Args:
        op: A fluid operator.

    Returns:
        string: The formatted string.
    """

    outputs_str = "{"
    for i in range(0, len(op.output_names)):
        outputs_str += "{name}=".format(name=op.output_names[i])
        o = op.output(op.output_names[i])
        outputs_str += "{value}".format(value=o)
        if i != len(op.output_names) - 1:
            outputs_str += ", "
    outputs_str += "}"

    inputs_str = "{"
    for i in range(0, len(op.input_names)):
        inputs_str += "{name}=".format(name=op.input_names[i])
        o = op.input(op.input_names[i])
        inputs_str += "{value}".format(value=o)

        if i != len(op.input_names) - 1:
            inputs_str += ", "
    inputs_str += "}"

    attrs_str = ""
    for i in range(0, len(op.attr_names)):
        name = op.attr_names[i]

        attr_type = op.desc.attr_type(name)
        if attr_type == core.AttrType.BLOCK:
            a = "{name} = block[{value}]".format(
                name=name, value=op.block_attr_id(name))
        elif attr_type == core.AttrType.BLOCKS:
            a = "{name} = blocks{value}".format(
                name=name, value=op.blocks_attr_ids(name))
        else:
            a = "{name} = {value}".format(
                name=name, value=op.desc.attr(name))

        attrs_str += a
        # Append the separator after every attribute, block attributes included.
        if i != len(op.attr_names) - 1:
            attrs_str += ", "

    if outputs_str != "{}":
        op_str = "{outputs} = {op_type}(inputs={inputs}, {attrs})".format(
            outputs=outputs_str,
            op_type=op.type,
            inputs=inputs_str,
            attrs=attrs_str)
    else:
        op_str = "{op_type}(inputs={inputs}, {attrs})".format(
            op_type=op.type, inputs=inputs_str, attrs=attrs_str)
    return op_str


def program_to_code(prog):
    """
    Print readable codes of fluid program.

    Args:
        prog : A fluid program.

    An example result like below:
    https://github.com/PaddlePaddle/Paddle/pull/12673
    """
    indent = 0
    block_idx = 0
    for block in prog.blocks:
        print("{0}{1} // block {2}".format(
            get_indent_space(indent), '{', block_idx))
        indent += 1
        # sort all vars by name; six.iteritems works on Python 2 and 3
        all_vars = sorted(six.iteritems(block.vars), key=lambda x: x[0])
        for var in all_vars:
            print("{}{}".format(
                get_indent_space(indent), variable_to_code(var[1])))

        if len(all_vars) > 0:
            print("")

        for op in block.ops:
            print("{}{}".format(get_indent_space(indent), op_to_code(op)))
        indent -= 1
        print("{0}{1}".format(get_indent_space(indent), '}'))
        block_idx += 1
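
The output layout these helpers produce can be previewed without a Paddle install. The block below is only a sketch under stated assumptions: `FakeVar`, `describe_var`, `describe_op` and `render_block` are hypothetical stand-ins that mirror the format strings in this hunk, the `type` and `dtype` values are placeholder strings rather than real fluid enums, and parameters are left out for brevity.

```python
from collections import namedtuple

# Hypothetical stand-in for a fluid variable: only the fields the formatter reads.
FakeVar = namedtuple("FakeVar", ["name", "type", "shape", "dtype", "persistable"])


def describe_var(var):
    """Mimic the layout variable_to_code uses for a non-parameter variable."""
    text = "var {name} : fluid.{type}.shape{shape}.astype({dtype})".format(
        name=var.name, type=var.type, shape=var.shape, dtype=var.dtype)
    return "persist " + text if var.persistable else text


def describe_op(op_type, inputs, outputs, attrs):
    """Mimic the layout of op_to_code, using plain dicts of slot -> value."""
    def slots(mapping):
        return "{" + ", ".join(
            "{}={}".format(k, v) for k, v in mapping.items()) + "}"

    attrs_str = ", ".join("{} = {}".format(k, v) for k, v in attrs.items())
    call = "{}(inputs={}, {})".format(op_type, slots(inputs), attrs_str)
    return "{} = {}".format(slots(outputs), call) if outputs else call


def render_block(block_idx, variables, op_lines, space_num=4):
    """Lay out one block: header, variables sorted by name, blank line, ops."""
    pad = " " * space_num
    lines = ["{ // block %d" % block_idx]
    for var in sorted(variables, key=lambda v: v.name):
        lines.append(pad + describe_var(var))
    if variables:
        lines.append("")
    lines.extend(pad + text for text in op_lines)
    lines.append("}")
    return lines


x = FakeVar("X", "LOD_TENSOR", (32, 32), "FP32", False)
out = FakeVar("scale_0.tmp_0", "LOD_TENSOR", (32, 32), "FP32", True)
scale_op = describe_op("scale", {"X": ["X"]}, {"Out": ["scale_0.tmp_0"]},
                       {"scale": 10.0})
rendered = render_block(0, [out, x], [scale_op])
print("\n".join(rendered))
```

Each block is printed as a brace-delimited section with its variables listed first in name order, followed by one line per operator in program order.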