Commit 7df74a5e authored by sneaxiy

Merge develop

# Distributed Training with NCCL2

We design a pattern that can enable training with `ParallelExecutor` and
use [NCCL2](https://developer.nvidia.com/nccl) as its collective
communication library.

In `ParallelExecutor` we can use `AllReduce` or `Reduce` and `Broadcast`
@@ -9,14 +9,14 @@ to do multi GPU training. And if we initialize NCCL2 communicators as
ranks in a distributed environment, we can simply run the `ParallelExecutor`
as a distributed program! The only thing that may be different than in
the single node version is that we need to broadcast the NCCL unique ID
to all the nodes and initialize communicators using that ID, so the NCCL2
communicators can know each other as ranks.

To achieve this feature, we introduce a new operator: `gen_nccl_id` op,
so we are ***not*** "bound to" running NCCL2 with MPI; we can run it on
whatever platform we like.

It has two running modes:

1. Generate and broadcast mode, which should be used on trainer 0;
1. Listen and fetch mode, which should be used on trainers other than 0.
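
The two modes can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the `gen_nccl_id` implementation: plain TCP sockets on localhost stand in for the real transport, `os.urandom` stands in for the call that generates the NCCL unique ID, and the names `generate_and_broadcast`, `listen_and_fetch`, and `demo_broadcast` are hypothetical. The 128-byte size matches `NCCL_UNIQUE_ID_BYTES` in NCCL2.

```python
import os
import socket
import threading

NCCL_UNIQUE_ID_BYTES = 128  # size of ncclUniqueId in NCCL2


def listen_and_fetch(server_sock, received, index):
    """Trainers other than 0: wait for trainer 0 and read the ID."""
    conn, _ = server_sock.accept()
    with conn:
        buf = b""
        while len(buf) < NCCL_UNIQUE_ID_BYTES:
            chunk = conn.recv(NCCL_UNIQUE_ID_BYTES - len(buf))
            if not chunk:
                break
            buf += chunk
    received[index] = buf


def generate_and_broadcast(endpoints):
    """Trainer 0: create the ID and send it to every other trainer."""
    unique_id = os.urandom(NCCL_UNIQUE_ID_BYTES)  # stand-in for the real ID
    for host, port in endpoints:
        with socket.create_connection((host, port), timeout=5) as sock:
            sock.sendall(unique_id)
    return unique_id


def demo_broadcast(num_other_trainers=2):
    """Run both modes in one process, one thread per listening trainer."""
    servers = []
    for _ in range(num_other_trainers):
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.settimeout(5)
        srv.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        srv.listen(1)
        servers.append(srv)

    received = [None] * num_other_trainers
    threads = [
        threading.Thread(
            target=listen_and_fetch, args=(srv, received, i), daemon=True)
        for i, srv in enumerate(servers)
    ]
    for t in threads:
        t.start()

    unique_id = generate_and_broadcast([s.getsockname() for s in servers])

    for t in threads:
        t.join(timeout=5)
    for srv in servers:
        srv.close()
    return unique_id, received
```

After `demo_broadcast()` returns, every listening trainer holds the same bytes as trainer 0, which is the precondition for initializing the communicators as ranks of one group.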
@@ -29,7 +29,7 @@ initialize NCCL communicator objects.
<img src="src/ncc2_design.png">

The above figure shows the general process when training with NCCL2
distributed. Each trainer has as many communicators as it has GPUs, but
the ranks must follow the global rank numbering: here we have 8 GPUs in
total, so `nranks==8`, and the ranks should be 0 ~ 3 on trainer 0 and
4 ~ 7 on trainer 1.
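
The rank numbering described above can be worked out as follows. This is a minimal sketch that assumes every trainer has the same number of GPUs; the helper name `global_ranks` is hypothetical and not part of Paddle.

```python
def global_ranks(trainer_id, gpus_per_trainer):
    """Return the global NCCL ranks owned by one trainer."""
    start = trainer_id * gpus_per_trainer
    return list(range(start, start + gpus_per_trainer))


num_trainers = 2
gpus_per_trainer = 4
nranks = num_trainers * gpus_per_trainer  # 8 in the example above

for trainer_id in range(num_trainers):
    print(trainer_id, global_ranks(trainer_id, gpus_per_trainer))
# 0 [0, 1, 2, 3]
# 1 [4, 5, 6, 7]
```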
# Distributed Training with NCCL2 and RDMA

When doing distributed multi-GPU training, network bandwidth often becomes the
bottleneck. We introduce a way to use NCCL2 for such training jobs to
achieve the best performance.

## Prepare Hardware with RDMA and Multiple GPUs

I'm using two Linux servers, each installed with 8 GPUs and
one 100Gb RDMA card.
Base environment is:
@@ -25,7 +25,7 @@ In general, the steps including:
1. Use docker to run tests and make sure GPUs and RDMA can work inside
   the container.

I'll omit the section "Install GPU drivers" because we can find it easily
somewhere else.

### Install RDMA drivers

@@ -33,7 +33,7 @@ somewhere else.
For my case, I've got two machines with device
"Mellanox Technologies MT27700 Family [ConnectX-4]" installed. The OS was
"CentOS 7.4" and I updated the kernel to version 4.4 so that docker can
work with the latest overlay2 filesystem.

***NOTE: before you start, make sure you have a way to get a console
of the server other than ssh because we may need to re-configure the
@@ -45,14 +45,14 @@ network device.***
1. Run `./mlnxofedinstall --add-kernel-support` in the software package.
1. Run `/etc/init.d/openibd restart` to make everything work. Note that
   this operation may cause the network to go down if you are using this
   RDMA device as the default network device and use ssh to log in to the
   server.
1. Re-configure the network interface, for example:
   `ifconfig eth2 192.168.16.30/20 up`, then add routes if needed:
   `ip route add default via 192.168.16.1 dev eth2`.
1. Do the same thing on the other node.
1. Use `ping` to test whether the two nodes have basic ICMP connectivity.
1. Use either `udaddy` or `ib_write_bw` to test that the network connection
   is ready and has the desired bandwidth.
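
Before adding the default route in the steps above, it can help to confirm that the gateway actually lies inside the subnet configured on the interface. The following is a small sketch using Python's standard `ipaddress` module with the example addresses from the list; the function name `gateway_is_reachable` is hypothetical.

```python
import ipaddress


def gateway_is_reachable(interface_cidr, gateway):
    """Return True if the gateway is inside the interface's subnet."""
    iface = ipaddress.ip_interface(interface_cidr)
    return ipaddress.ip_address(gateway) in iface.network


iface = ipaddress.ip_interface("192.168.16.30/20")
print(iface.network)   # 192.168.16.0/20
print(iface.netmask)   # 255.255.240.0
print(gateway_is_reachable("192.168.16.30/20", "192.168.16.1"))  # True
```

If the check returns `False`, the `ip route add default via ...` command would point at a gateway that the interface cannot reach directly.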
### Prepare Docker Image to Run RDMA Programs

@@ -60,7 +60,7 @@ network device.***
   package in it.
1. Start a docker container and mount GPU driver libs into it (you can
   skip this step if you are using nvidia-docker).
1. Mount RDMA drivers and libs into the docker image (see below section),
   also `udaddy` and `ib_write_bw` if needed.
1. Mount GPU devices and RDMA devices into the container using `--device`
   or just use privileged mode `--privileged`.
......
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import time
import unittest
from multiprocessing import Process
import signal

import numpy

import paddle.fluid as fluid
import paddle.fluid.layers as layers
from paddle.fluid.layers.io import ListenAndServ
from paddle.fluid.layers.io import Recv
from paddle.fluid.layers.io import Send

from paddle.fluid.transpiler.details import program_to_code


class TestProgram2Code(unittest.TestCase):
    def test_print(self):
        place = fluid.CPUPlace()
        self.init_serv(place)
        self.init_client(place, 9123)

    def init_serv(self, place):
        main = fluid.Program()

        with fluid.program_guard(main):
            serv = ListenAndServ("127.0.0.1:0", ["X"], optimizer_mode=False)
            with serv.do():
                out_var = main.global_block().create_var(
                    name="scale_0.tmp_0",
                    persistable=True,
                    dtype="float32",
                    shape=[32, 32])
                x = layers.data(
                    shape=[32, 32],
                    dtype='float32',
                    name="X",
                    append_batch_size=False)
                fluid.initializer.Constant(value=1.0)(x, main.global_block())
                layers.scale(x=x, scale=10.0, out=out_var)

        # Print the server-side program in readable form.
        program_to_code(main)

    def init_client(self, place, port):
        main = fluid.Program()
        with fluid.program_guard(main):
            x = layers.data(
                shape=[32, 32],
                dtype='float32',
                name='X',
                append_batch_size=False)
            fluid.initializer.Constant(value=2.3)(x, main.global_block())
            get_var = main.global_block().create_var(
                name="scale_0.tmp_0",  # server side var
                dtype="float32",
                persistable=False,
                shape=[32, 32])
            fluid.initializer.Constant(value=2.3)(get_var, main.global_block())
            Send("127.0.0.1:%d" % port, [x])
            o = Recv("127.0.0.1:%d" % port, [get_var])

        # Print the client-side program in readable form.
        program_to_code(main)


if __name__ == "__main__":
    unittest.main()
@@ -16,6 +16,9 @@ from __future__ import print_function
import six

from paddle.fluid import core
import paddle


def delete_ops(block, ops):
    try:
@@ -39,3 +42,133 @@ def find_op_by_output_arg(block, arg_name):
        if arg_name in op.output_arg_names:
            return index
    return -1


def get_indent_space(indent, space_num=4):
    ret = ""
    for i in range(0, indent * space_num):
        ret += " "

    return ret


def variable_to_code(var):
    """
    Get readable codes of fluid variable.

    Args:
        var: A fluid variable.

    Returns:
        string: The formatted string.
    """

    var_str = "{name} : fluid.{type}.shape{shape}.astype({dtype})".\
        format(name=var.name, type=var.type, shape=var.shape, dtype=var.dtype)

    # Parameters are labeled separately from ordinary variables.
    if isinstance(var, paddle.fluid.framework.Parameter):
        if var.trainable:
            var_str = "trainable parameter " + var_str
        else:
            var_str = "parameter " + var_str
    else:
        var_str = "var " + var_str

    if var.persistable:
        var_str = "persist " + var_str

    return var_str


def op_to_code(op):
    """
    Get readable codes of fluid operator.

    Args:
        op: A fluid operator.

    Returns:
        string: The formatted string.
    """

    outputs_str = "{"
    for i in range(0, len(op.output_names)):
        outputs_str += "{name}=".format(name=op.output_names[i])
        o = op.output(op.output_names[i])
        outputs_str += "{value}".format(value=o)
        if i != len(op.output_names) - 1:
            outputs_str += ", "
    outputs_str += "}"

    inputs_str = "{"
    for i in range(0, len(op.input_names)):
        inputs_str += "{name}=".format(name=op.input_names[i])
        o = op.input(op.input_names[i])
        inputs_str += "{value}".format(value=o)
        if i != len(op.input_names) - 1:
            inputs_str += ", "
    inputs_str += "}"

    # Collect the attributes in a list and join them, so that block
    # attributes are also followed by a ", " separator.
    attrs = []
    for name in op.attr_names:
        attr_type = op.desc.attr_type(name)
        if attr_type == core.AttrType.BLOCK:
            attrs.append("{name} = block[{value}]".format(
                name=name, value=op.block_attr_id(name)))
        elif attr_type == core.AttrType.BLOCKS:
            attrs.append("{name} = blocks{value}".format(
                name=name, value=op.blocks_attr_ids(name)))
        else:
            attrs.append("{name} = {value}".format(
                name=name, value=op.desc.attr(name)))
    attrs_str = ", ".join(attrs)

    if outputs_str != "{}":
        op_str = "{outputs} = {op_type}(inputs={inputs}, {attrs})".\
            format(outputs=outputs_str, op_type=op.type, inputs=inputs_str, attrs=attrs_str)
    else:
        op_str = "{op_type}(inputs={inputs}, {attrs})".\
            format(op_type=op.type, inputs=inputs_str, attrs=attrs_str)
    return op_str


def program_to_code(prog):
    """
    Print readable codes of fluid program.

    Args:
        prog : A fluid program.

    An example result like below:
    https://github.com/PaddlePaddle/Paddle/pull/12673
    """
    indent = 0
    block_idx = 0
    for block in prog.blocks:
        print("{0}{1} // block {2}".format(
            get_indent_space(indent), '{', block_idx))
        indent += 1
        # sort all vars by name; six.iteritems works on Python 2 and 3
        all_vars = sorted(six.iteritems(block.vars), key=lambda x: x[0])
        for var in all_vars:
            print("{}{}".format(
                get_indent_space(indent), variable_to_code(var[1])))

        if len(all_vars) > 0:
            print("")

        for op in block.ops:
            print("{}{}".format(get_indent_space(indent), op_to_code(op)))
        indent -= 1
        print("{0}{1}".format(get_indent_space(indent), '}'))
        block_idx += 1
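
To give a feel for the text these helpers produce, here is a self-contained sketch that mirrors the same format strings without importing Paddle. Everything in it is a stand-in: `describe_variable` and `format_op` are hypothetical names, `SimpleNamespace` objects replace real fluid variables, and the `type` value is a plain string rather than the enum a real variable carries.

```python
from types import SimpleNamespace


def describe_variable(var, is_parameter=False):
    """Mirror the variable format: prefix, name, type, shape, dtype."""
    text = "{name} : fluid.{type}.shape{shape}.astype({dtype})".format(
        name=var.name, type=var.type, shape=var.shape, dtype=var.dtype)
    if is_parameter:
        prefix = "trainable parameter " if var.trainable else "parameter "
    else:
        prefix = "var "
    text = prefix + text
    if var.persistable:
        text = "persist " + text
    return text


def format_op(op_type, inputs, outputs, attrs):
    """Mirror the operator format using plain dicts for the slots."""

    def format_slots(slots):
        body = ", ".join("{}={}".format(k, v) for k, v in slots.items())
        return "{" + body + "}"

    attrs_str = ", ".join("{} = {}".format(k, v) for k, v in attrs.items())
    inputs_str = format_slots(inputs)
    outputs_str = format_slots(outputs)
    if outputs_str != "{}":
        return "{} = {}(inputs={}, {})".format(
            outputs_str, op_type, inputs_str, attrs_str)
    return "{}(inputs={}, {})".format(op_type, inputs_str, attrs_str)


x = SimpleNamespace(name="X", type="LOD_TENSOR", shape=(32, 32),
                    dtype="float32", persistable=False, trainable=False)
print(describe_variable(x))
# var X : fluid.LOD_TENSOR.shape(32, 32).astype(float32)

print(format_op("scale", {"X": ["X"]}, {"Out": ["scale_0.tmp_0"]},
                {"scale": 10.0}))
# {Out=['scale_0.tmp_0']} = scale(inputs={X=['X']}, scale = 10.0)
```

The real output of `program_to_code` will differ in detail, since actual variables and operators carry more attributes than these stand-ins.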