Commit 535245cd, authored by Wu Yi on May 10, 2018; committed by GitHub on May 10, 2018.
Add NCCL2 rdma train doc (#10561)
* add rdma train doc
* update by comment
* fix table
Parent: 705e7345
Showing 1 changed file with 110 additions and 0 deletions:
doc/fluid/howto/cluster/nccl2_rdma_training.md (new file, 0 → 100644)
# Distributed Training with NCCL2 and RDMA
When doing distributed multi-GPU training, network bandwidth often becomes the
bottleneck. This document introduces a way to use NCCL2 for such training jobs to
achieve the best performance.
## Prepare Hardware with RDMA and Multiple GPUs
I'm using two Linux servers, each with 8 GPUs and one 100Gb RDMA card installed.
The base environment is:
* OS: CentOS 7.4
* RDMA device: "Mellanox Technologies MT27700 Family [ConnectX-4]"
* Kernel version: `4.4.88-1.el7.elrepo.x86_64`
* Docker version: `1.12.6`
* Docker storage driver: `overlay2`
* IP addresses: 192.168.16.30,192.168.16.34
In general, the steps are:
1. Install GPU drivers
1. Install RDMA drivers
1. Install "InfiniBand Support"
1. Use docker to run tests and make sure GPUs and RDMA can work inside
   the container.

I'll omit the section "Install GPU drivers" because instructions for it are
easy to find elsewhere.
### Install RDMA drivers
In my case, I have two machines with the device
"Mellanox Technologies MT27700 Family [ConnectX-4]" installed. The OS is
"CentOS 7.4", and I updated the kernel to version 4.4 so that docker can
work with the latest overlay2 filesystem.

***NOTE: before you start, make sure you have a way to get a console
of the server other than ssh, because we may need to re-configure the
network device.***

1. Go to http://www.mellanox.com/page/products_dyn?product_family=26,
   download the `MLNX_OFED` software at the bottom of the page, and upload it
   onto the server.
1. Run `./mlnxofedinstall --add-kernel-support` in the software package.
1. Run `/etc/init.d/openibd restart` to make everything work. Note that
   this operation may cause the network to go down if you are using this
   RDMA device as the default network device and logging in to the server over ssh.
1. Re-configure the network interface, for example:
   `ifconfig eth2 192.168.16.30/20 up`, then add routes if needed:
   `ip route add default via 192.168.16.1 dev eth2`.
1. Do the same thing on the other node.
1. Use `ping` to test whether the two nodes have basic ICMP connectivity.
1. Use either `udaddy` or `ib_write_bw` to test that the network connection is
   ready and has the desired bandwidth.
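
The bandwidth check in the last step can be sketched as follows. This is a minimal sketch, assuming `ib_write_bw` (from the perftest tools) is available on both nodes and that the RDMA device is named `mlx5_0`, which is an assumed name (list the real ones with `ibv_devices`). The helper `bw_test_cmd` is hypothetical and only prints the command for each side, so it is safe to run on a machine without RDMA hardware.

```bash
# Print the ib_write_bw command for one side of a bandwidth test.
# Usage: bw_test_cmd server|client DEVICE [SERVER_IP]
# (bw_test_cmd is a hypothetical helper; mlx5_0 below is an assumed device name.)
bw_test_cmd() {
  bw_role="$1"
  bw_dev="$2"
  bw_server_ip="${3:-}"
  case "$bw_role" in
    server)
      # The server side waits for a client to connect.
      echo "ib_write_bw -d $bw_dev"
      ;;
    client)
      if [ -z "$bw_server_ip" ]; then
        echo "client role needs the server IP" >&2
        return 1
      fi
      # The client side connects to the server and reports the bandwidth.
      echo "ib_write_bw -d $bw_dev $bw_server_ip"
      ;;
    *)
      echo "unknown role: $bw_role" >&2
      return 1
      ;;
  esac
}

# Run the first printed command on 192.168.16.30, then the second on 192.168.16.34.
bw_test_cmd server mlx5_0
bw_test_cmd client mlx5_0 192.168.16.30
```

Start the server side first; the client fails to connect if nothing is listening yet.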
### Prepare Docker Image to Run RDMA Programs
1. Build a docker image using a CUDA base image like
   `nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04` and install the paddlepaddle whl
   package in it.
1. Start a docker container and mount GPU driver libs into it (you can
   skip this step if you are using nvidia-docker).
1. Mount RDMA drivers and libs into the docker container (see the section below),
   and also `udaddy` and `ib_write_bw` if needed.
1. Mount GPU devices and RDMA devices into the container using `--device`,
   or just use privileged mode `--privileged`.
1. Start the container using host network mode: `--net=host`
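
The container start-up described in these steps can be sketched as a single `docker run` command. This is only a sketch under assumptions: the image name `paddle-rdma:latest` is hypothetical, the helper `rdma_docker_cmd` is hypothetical, and it uses `--privileged` rather than listing each device with `--device`. The helper prints the command instead of running it, so you can review it first.

```bash
# Print a `docker run` command line for an RDMA-capable training container.
# Usage: rdma_docker_cmd IMAGE [extra docker args...]
# (rdma_docker_cmd and the image name used below are hypothetical.)
rdma_docker_cmd() {
  rdc_image="$1"
  shift
  # --net=host: share the host network stack, so the RDMA interface is visible.
  # --privileged: expose all host devices instead of listing --device flags.
  rdc_cmd="docker run -it --net=host --privileged"
  for rdc_arg in "$@"; do
    rdc_cmd="$rdc_cmd $rdc_arg"
  done
  echo "$rdc_cmd $rdc_image /bin/bash"
}

# Example: also mount the MLNX_OFED library directory into the container.
rdma_docker_cmd paddle-rdma:latest \
  -v /usr/lib64/mlnx_ofed/valgrind:/usr/lib64/mlnx_ofed/valgrind
```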
### RDMA Library Files Needed
Usually, `MLNX_OFED` installs the latest supported libs under
`/usr/lib64/mlnx_ofed/valgrind`. Other libs needed to run RDMA programs
are listed below. These libs must be mounted into the docker container.
* Libs under `/usr/lib64/mlnx_ofed/valgrind`
  * libibcm.so
  * libibverbs.so
  * libmlx4.so
  * libmlx5.so
  * libmlx5-rdmav2.so
  * librdmacm.so
* Other libs:
  * libnl-3.so.200
  * libnl-route-3.so.200
  * libnuma.so.1
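
One way to mount the libraries listed above is to generate a `-v` flag for each file. The sketch below assumes that the "other libs" live in `/usr/lib64` on the host (the list does not give their location, so check your system) and that the container's dynamic linker searches the mounted paths, for example through `LD_LIBRARY_PATH`. The helper `lib_mount_flags` is hypothetical.

```bash
# Print read-only `-v` mount flags for library files in one directory.
# Usage: lib_mount_flags DIR LIB [LIB...]
# (lib_mount_flags is a hypothetical helper, not part of docker or MLNX_OFED.)
lib_mount_flags() {
  lmf_dir="$1"
  shift
  lmf_out=""
  for lmf_lib in "$@"; do
    # Mount each file at the same path inside the container, read-only.
    lmf_out="$lmf_out -v $lmf_dir/$lmf_lib:$lmf_dir/$lmf_lib:ro"
  done
  # Drop the leading space before printing.
  echo "${lmf_out# }"
}

lib_mount_flags /usr/lib64/mlnx_ofed/valgrind \
  libibcm.so libibverbs.so libmlx4.so libmlx5.so libmlx5-rdmav2.so librdmacm.so
# /usr/lib64 is an assumed location for the remaining libraries.
lib_mount_flags /usr/lib64 libnl-3.so.200 libnl-route-3.so.200 libnuma.so.1
```

The printed flags can be pasted into the `docker run` command used to start the container.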
## Start to Run the Training Job
Set the following NCCL environment variables to turn NCCL features on and off:
| Env Name | Description |
| --- | --- |
| NCCL_SOCKET_IFNAME | The network interface of the RDMA device, e.g. eth2 |
| NCCL_P2P_DISABLE | Set to 1 to disable P2P transfer between GPUs |
| NCCL_IB_DISABLE | Set to 1 to disable using RDMA |
| NCCL_IB_CUDA_SUPPORT | Set to 1 to enable GPU Direct if supported |
| NCCL_DEBUG | Set debug level: VERSION, WARN, INFO |
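
As an illustration, the variables in the table could be exported like this before launching the trainer. The values are one plausible combination for an RDMA run, not settings confirmed by this document, and `nccl_rdma_env` is a hypothetical helper name.

```bash
# Export NCCL settings for an RDMA run.
# Usage: nccl_rdma_env [IFNAME]   (IFNAME defaults to eth2, the example in the table)
# (nccl_rdma_env is a hypothetical helper; the values are illustrative choices.)
nccl_rdma_env() {
  export NCCL_SOCKET_IFNAME="${1:-eth2}"
  export NCCL_IB_DISABLE=0       # leave RDMA enabled (1 would disable it)
  export NCCL_IB_CUDA_SUPPORT=1  # try GPU Direct; only helps if the hardware supports it
  export NCCL_DEBUG=INFO         # verbose logs, useful to see which transport NCCL picks
}

nccl_rdma_env eth2
# Show what was set.
env | grep '^NCCL_' | sort
```
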
My two servers are `192.168.16.30` and `192.168.16.34`. On node 1, run:
```bash
PADDLE_TRAINER_ID=0 \
PADDLE_PORT=48372 \
PADDLE_WORKERS=192.168.16.30,192.168.16.34 \
POD_IP=192.168.16.30 \
stdbuf -oL python vgg16.py
```
On node 2, run:
```bash
PADDLE_TRAINER_ID=1 \
PADDLE_PORT=48372 \
PADDLE_WORKERS=192.168.16.30,192.168.16.34 \
POD_IP=192.168.16.34 \
stdbuf -oL python vgg16.py
```
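
The two commands differ only in `PADDLE_TRAINER_ID` and `POD_IP`, and in this example the ID matches the position of the node's IP in `PADDLE_WORKERS`. Assuming that convention holds for your job, the ID can be derived instead of typed by hand. The helper `trainer_id` below is hypothetical and is only a sketch of that idea.

```bash
# Print the zero-based position of IP in the comma-separated WORKERS list.
# Usage: trainer_id WORKERS IP
# (trainer_id is a hypothetical helper; it assumes the ID equals the list position.)
trainer_id() {
  tid_index=0
  tid_old_ifs="$IFS"
  IFS=','
  # With IFS set to a comma, the unquoted $1 splits into one word per worker.
  for tid_worker in $1; do
    if [ "$tid_worker" = "$2" ]; then
      IFS="$tid_old_ifs"
      echo "$tid_index"
      return 0
    fi
    tid_index=$((tid_index + 1))
  done
  IFS="$tid_old_ifs"
  echo "IP $2 is not in worker list $1" >&2
  return 1
}

PADDLE_WORKERS=192.168.16.30,192.168.16.34
POD_IP=192.168.16.34
echo "PADDLE_TRAINER_ID=$(trainer_id "$PADDLE_WORKERS" "$POD_IP")"  # prints PADDLE_TRAINER_ID=1
```

With this, the same launch script can be used on every node as long as `POD_IP` is set to that node's own address.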