diff --git a/doc/fluid/design/dist_train/dist_train_nccl2.md b/doc/fluid/design/dist_train/dist_train_nccl2.md
index aa7455ec5de0d46d7c2b0cef3b7ebf4754af3cb1..b8b8427811cddcddf872db5badfd37c96a76c3e3 100644
--- a/doc/fluid/design/dist_train/dist_train_nccl2.md
+++ b/doc/fluid/design/dist_train/dist_train_nccl2.md
@@ -1,7 +1,7 @@
 # Distributed Training with NCCL2

 We design a pattern that can enable training with `ParallelExecutor` and
-using [NCCL2](https://developer.nvidia.com/nccl) as it's collective
+use [NCCL2](https://developer.nvidia.com/nccl) as its collective
 communication library.

 In `ParallelExecutor` we can use `AllReduce` or `Reduce` and `Broadcast`
@@ -9,14 +9,14 @@ to do multi GPU training. And if we initialize NCCL2 communicators as
 ranks in a distributed environment, we can simply run the `ParallelExecutor`
 as a distributed program! The only thing that may be different than in
 the single node version is that we need to broadcast the NCCL unique ID
-to all the nodes, and initialize communicators using that ID, so NCCL2
-will know each other as ranks.
+to all the nodes and initialize communicators using that ID, so that the
+NCCL2 ranks can recognize each other.

 To achieve this feature, we introduce a new operator: `gen_nccl_id` op,
 so we are ***not*** "bind to" running NCCL2 with MPI, we can run it in
-what ever platform you like.
+whatever platform you like.

-It have two running modes:
+It has two running modes:

 1. Generate and broadcast mode, which should be used on trainer 0;
 1. Listen and fetch mode, which should be used on trainers other than 0.
@@ -29,7 +29,7 @@ initialize NCCL communicator objects.

 The above figure indicates the general process when training with NCCL2
-distributed. Each trainer have the number of communicators equal to the
+distributed. Each trainer has a number of communicators equal to the
 number of GPUs, but the ranks should match the global ranks number: here
 we have total 8 GPUs, so `nranks==8`, for each trainer, the ranks should
 be from 0 ~ 3 on trainer 0 and 4 ~ 7 on trainer 1.

diff --git a/doc/fluid/howto/cluster/nccl2_rdma_training.md b/doc/fluid/howto/cluster/nccl2_rdma_training.md
index cecd5c3a7a7339e3be6772543a534728ec132105..8adaf324fccb4cda7af16b9bace559c0642ae444 100644
--- a/doc/fluid/howto/cluster/nccl2_rdma_training.md
+++ b/doc/fluid/howto/cluster/nccl2_rdma_training.md
@@ -1,12 +1,12 @@
 # Distributed Training with NCCL2 and RDMA

-When doing distributed multi-GPU training, network bandwith often becomes the
-bottle neck. We introduce a way to use NCCL2 to do such training job to
-achieve best performace.
+When doing distributed multi-GPU training, network bandwidth often becomes the
+bottleneck. We introduce a way to use NCCL2 for such training jobs to
+achieve the best performance.

-## Prepare Hardwares with RDMA and Multiple GPUs
+## Prepare Hardware with RDMA and Multiple GPUs

-I'm using two Linux servers each of them is installed with 8 GPUs and
+I'm using two Linux servers, each installed with 8 GPUs and
 one 100Gb RDMA card.
 Base environment is:

@@ -25,7 +25,7 @@ In general, the steps including:
 1. Use docker to run tests and make sure GPUs and RDMA can work inside
    the container.

-I'll ommit section "Install GPU drivers" because we can find it easily
+I'll omit the section "Install GPU drivers" because we can find it easily
 somewhere else.

 ### Install RDMA drivers
@@ -33,7 +33,7 @@ somewhere else.
 For my case, I've got two machines with device
 "Mellanox Technologies MT27700 Family [ConnectX-4]" installed.
The OS was "CentOS 7.4" and I updated the kernel to version 4.4 so that docker can -work with latest overlay2 filesystem. +work with the latest overlay2 filesystem. ***NOTE: before you start, make sure you have a way to get a console of the server other than ssh because we may need to re-configure the @@ -45,14 +45,14 @@ network device.*** 1. Run `./mlnxofedinstall --add-kernel-support` in the software package. 1. Run `/etc/init.d/openibd restart` to make everything work, note that this operation may cause the network goes down if you are using this - RDMA device as default network device and use ssh to login the server. + RDMA device as default network device and use ssh to log in the server. 1. Re-configure the network interface, for example: `ifconfig eth2 192.168.16.30/20 up`, then add routes if needed: `ip route add default via 192.168.16.1 dev eth2`. 1. Do the same thing on the other node. 1. Use `ping` to test if the two nodes have typical ICMP connection. 1. Use either `udaddy` or `ib_write_bw` to test the network connection is - ready and have the desired bandwith. + ready and have the desired bandwidth. ### Prepare Docker Image to Run RDMA Programs @@ -60,7 +60,7 @@ network device.*** package in it. 1. Start a docker container and mount GPU driver libs into it (you can skip this step if you are using nvidia-docker). -1. Mount RDMA dirvers and libs into the docker image (see below section), +1. Mount RDMA drivers and libs into the docker image (see below section), also `udaddy` and `ib_write_bw` if needed. 1. Mount GPU devices and RDMA devices into the container using `--device` or just use privileged mode `--privileged`. diff --git a/python/paddle/fluid/tests/unittests/test_program_code.py b/python/paddle/fluid/tests/unittests/test_program_code.py new file mode 100644 index 0000000000000000000000000000000000000000..e9c2b928617dce3904ca119896ca81454256e82e --- /dev/null +++ b/python/paddle/fluid/tests/unittests/test_program_code.py @@ -0,0 +1,81 @@ +# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+
+import os
+import time
+import unittest
+from multiprocessing import Process
+import signal
+
+import numpy
+
+import paddle.fluid as fluid
+import paddle.fluid.layers as layers
+from paddle.fluid.layers.io import ListenAndServ
+from paddle.fluid.layers.io import Recv
+from paddle.fluid.layers.io import Send
+
+from paddle.fluid.transpiler.details import program_to_code
+
+
+class TestProgram2Code(unittest.TestCase):
+    def test_print(self):
+        place = fluid.CPUPlace()
+        self.init_serv(place)
+        self.init_client(place, 9123)
+
+    def init_serv(self, place):
+        main = fluid.Program()
+
+        with fluid.program_guard(main):
+            serv = ListenAndServ("127.0.0.1:0", ["X"], optimizer_mode=False)
+            with serv.do():
+                out_var = main.global_block().create_var(
+                    name="scale_0.tmp_0",
+                    persistable=True,
+                    dtype="float32",
+                    shape=[32, 32])
+                x = layers.data(
+                    shape=[32, 32],
+                    dtype='float32',
+                    name="X",
+                    append_batch_size=False)
+                fluid.initializer.Constant(value=1.0)(x, main.global_block())
+                layers.scale(x=x, scale=10.0, out=out_var)
+
+        program_to_code(main)
+
+    def init_client(self, place, port):
+        main = fluid.Program()
+        with fluid.program_guard(main):
+            x = layers.data(
+                shape=[32, 32],
+                dtype='float32',
+                name='X',
+                append_batch_size=False)
+            fluid.initializer.Constant(value=2.3)(x, main.global_block())
+            get_var = main.global_block().create_var(
+                name="scale_0.tmp_0",  # server side var
+                dtype="float32",
+                persistable=False,
+                shape=[32, 32])
+            fluid.initializer.Constant(value=2.3)(get_var, main.global_block())
+            Send("127.0.0.1:%d" % port, [x])
+            o = Recv("127.0.0.1:%d" % port, [get_var])
+
+        program_to_code(main)
+
+
+if __name__ == "__main__":
+    unittest.main()

diff --git a/python/paddle/fluid/transpiler/details/program_utils.py b/python/paddle/fluid/transpiler/details/program_utils.py
index 640dbf4bbed58edf746456419af18c75241fa03c..420ae6dfd4b75b507dd01bb947fa707bca5cdb08 100644
--- a/python/paddle/fluid/transpiler/details/program_utils.py
+++ b/python/paddle/fluid/transpiler/details/program_utils.py
@@ -16,6 +16,9 @@ from __future__ import print_function

 import six

+from paddle.fluid import core
+import paddle
+

 def delete_ops(block, ops):
     try:
@@ -39,3 +42,133 @@ def find_op_by_output_arg(block, arg_name):
         if arg_name in op.output_arg_names:
             return index
     return -1
+
+
+def get_indent_space(indent, space_num=4):
+    ret = ""
+    for i in range(0, indent * space_num):
+        ret += " "
+
+    return ret
+
+
+def variable_to_code(var):
+    """
+    Get readable codes of fluid variable.
+
+    Args:
+        var: A fluid variable.
+
+    Returns:
+        string: The formatted string.
+    """
+
+    var_str = "{name} : fluid.{type}.shape{shape}.astype({dtype})".\
+        format(i="{", e="}", name=var.name, type=var.type, shape=var.shape, dtype=var.dtype)
+
+    if type(var) == paddle.fluid.framework.Parameter:
+        if var.trainable:
+            var_str = "trainable parameter " + var_str
+        else:
+            var_str = "parameter " + var_str
+    else:
+        var_str = "var " + var_str
+
+    if var.persistable:
+        var_str = "persist " + var_str
+
+    return var_str
+
+
+def op_to_code(op):
+    """
+    Get readable codes of fluid operator.
+
+    Args:
+        op: A fluid operator.
+
+    Returns:
+        string: The formatted string.
+    """
+
+    outputs_str = "{"
+    for i in range(0, len(op.output_names)):
+        outputs_str += "{name}=".format(name=op.output_names[i])
+        o = op.output(op.output_names[i])
+        outputs_str += "{value}".format(value=o)
+        if i != len(op.output_names) - 1:
+            outputs_str += ", "
+    outputs_str += "}"
+
+    inputs_str = "{"
+    for i in range(0, len(op.input_names)):
+        inputs_str += "{name}=".format(name=op.input_names[i])
+        o = op.input(op.input_names[i])
+        inputs_str += "{value}".format(value=o)
+
+        if i != len(op.input_names) - 1:
+            inputs_str += ", "
+    inputs_str += "}"
+
+    attrs_str = ""
+    for i in range(0, len(op.attr_names)):
+        name = op.attr_names[i]
+
+        attr_type = op.desc.attr_type(name)
+        if attr_type == core.AttrType.BLOCK:
+            a = "{name} = block[{value}]".format(
+                name=name, type=attr_type, value=op.block_attr_id(name))
+            attrs_str += a
+            continue
+
+        if attr_type == core.AttrType.BLOCKS:
+            a = "{name} = blocks{value}".format(
+                name=name, type=attr_type, value=op.blocks_attr_ids(name))
+            attrs_str += a
+            continue
+
+        a = "{name} = {value}".format(
+            name=name, type=attr_type, value=op.desc.attr(name))
+        attrs_str += a
+        if i != len(op.attr_names) - 1:
+            attrs_str += ", "
+
+    if outputs_str != "{}":
+        op_str = "{outputs} = {op_type}(inputs={inputs}, {attrs})".\
+            format(outputs = outputs_str, op_type=op.type, inputs=inputs_str, attrs=attrs_str)
+    else:
+        op_str = "{op_type}(inputs={inputs}, {attrs})".\
+            format(op_type=op.type, inputs=inputs_str, attrs=attrs_str)
+    return op_str
+
+
+def program_to_code(prog):
+    """
+    Print readable codes of fluid program.
+
+    Args:
+        prog : A fluid program.
+
+    An example of the result can be found at:
+    https://github.com/PaddlePaddle/Paddle/pull/12673
+    """
+    indent = 0
+    block_idx = 0
+    for block in prog.blocks:
+        print("{0}{1} // block {2}".format(
+            get_indent_space(indent), '{', block_idx))
+        indent += 1
+        # sort variables by name so the printed order is deterministic
+        all_vars = sorted(six.iteritems(block.vars), key=lambda x: x[0])
+        for var in all_vars:
+            print("{}{}".format(
+                get_indent_space(indent), variable_to_code(var[1])))
+
+        if len(all_vars) > 0:
+            print("")
+
+        for op in block.ops:
+            print("{}{}".format(get_indent_space(indent), op_to_code(op)))
+        indent -= 1
+        print("{0}{1}".format(get_indent_space(indent), '}'))
+        block_idx += 1
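For reference, below is a minimal usage sketch of the `program_to_code` helper introduced in this patch. It mirrors the scale example from `test_program_code.py`; the variable names and shapes are illustrative only, and it assumes a Fluid build that includes the patch above.

```python
# Minimal sketch: build a tiny fluid program and print it as pseudo-code.
# Assumes paddle.fluid.transpiler.details.program_to_code is available,
# i.e. a build that contains the patch above.
import paddle.fluid as fluid
import paddle.fluid.layers as layers
from paddle.fluid.transpiler.details import program_to_code

main = fluid.Program()
with fluid.program_guard(main):
    # Same shapes as the unit test: a 32x32 float input scaled by 10.
    x = layers.data(
        name="X", shape=[32, 32], dtype="float32", append_batch_size=False)
    y = layers.scale(x=x, scale=10.0)

# Prints each block wrapped in braces: its variables first, then its ops.
program_to_code(main)
```

The output roughly lists each variable in the `variable_to_code` format (`var NAME : fluid.TYPE.shape(...).astype(...)`) and each operator in the `op_to_code` format (`{outputs} = op_type(inputs={...}, attrs)`), one block per pair of braces.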