Crayon鑫 / Paddle (forked from PaddlePaddle / Paddle)
Commit 34eb27a1 (unverified)
Authored on Aug 11, 2020 by danleifeng
Committed by GitHub on Aug 11, 2020
ps worker-ports are optional for users for fleetrun command; test=develop (#26090)
Parent 615e8a20
Showing 2 changed files with 37 additions and 10 deletions (+37 −10)
python/paddle/fleet/launch.py (+28 −9)
python/paddle/fluid/tests/unittests/test_fleet_launch.sh (+9 −1)
python/paddle/fleet/launch.py
...
...
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
-paddle.distributed.launch is a module that spawns multiple distributed
+fleetrun is a module that spawns multiple distributed
 process on each training node for gpu training and cpu training.
 Usage:
     In both of single node training or multiple node training, this module
...
...
@@ -31,16 +31,26 @@ launch a process on each of the given gpu card or cpu machine.
         your_training_py (arg1 arg2 and all others)
     CPU training:
     1. for single node training with multi servers and workers:
-        fleetrun --server_num=1 --worker_num=4 your_training_py (arg1 arg2 and all others)
+        fleetrun --server_num=2 --worker_num=2 your_training_py (arg1 arg2 and all others)
     2. for multiple node training such as two node:192.168.0.16, 192.168.0.17 \
-        with 2 servers and
-        4 workers.
+        with 2 servers and 4 workers.
         on 192.168.0.16:
-            fleetrun --servers="192.168.0.16:6170,192.168.0.17:6171" \
-                --workers="192.168.0.16:6172,192.168.0.17:6173,192.168.0.16:6174,192.168.0.17:6175" \
+            fleetrun --servers="192.168.0.16:6170,192.168.0.17:6170" \
+                --workers="192.168.0.16,192.168.0.17,192.168.0.16,192.168.0.17" \
                 your_training_py (arg1 arg2 and all others)
         on 192.168.0.17:
             fleetrun --servers="192.168.0.16:6170,192.168.0.17:6171" \
-                --workers="192.168.0.16:6172,192.168.0.17:6173,192.168.0.16:6174,192.168.0.17:6175" \
+                --workers="192.168.0.16,192.168.0.17,192.168.0.16,192.168.0.17" \
                 your_training_py (arg1 arg2 and all others)
+    3. use gloo backend for multiple node training such as two node:192.168.0.16, 192.168.0.17 \
+        with 2 servers and 4 workers. (workers should set port)
+        on 192.168.0.16:
+            fleetrun --servers="192.168.0.16:6170,192.168.0.17:6170" \
+                --workers="192.168.0.16:6171,192.168.0.17:6171,192.168.0.16:6172,192.168.0.17:6172" \
+                your_training_py (arg1 arg2 and all others)
+        on 192.168.0.17:
+            fleetrun --servers="192.168.0.16:6170,192.168.0.17:6170" \
+                --workers="192.168.0.16:6171,192.168.0.17:6171,192.168.0.16:6172,192.168.0.17:6172" \
+                your_training_py (arg1 arg2 and all others)
 """
...
...
@@ -215,6 +225,7 @@ def launch_collective(args):
 def launch_ps(args):
     ports = None
+    start_port = 6170
     if args.server_num:
         server_num = args.server_num
         ports = get_ports(server_num, 0)
...
...
@@ -240,11 +251,19 @@ def launch_ps(args):
     worker_endpoints_ips = [
         x.strip().split(":")[0] for x in worker_endpoints.split(",")
     ]
-    worker_endpoints_port = [
-        x.strip().split(":")[1] for x in worker_endpoints.split(",")
-    ]
     worker_num = len(worker_endpoints_ips)
     node_ips = list(set(server_endpoints_ips + worker_endpoints_ips))
+    worker_endpoints_len = [
+        len(x.strip().split(":")) for x in worker_endpoints.split(",")
+    ]
+    if 1 in worker_endpoints_len:
+        # if no port value in worker_endpoints, will set default port values.
+        worker_endpoints_port = range(start_port + server_num,
+                                      start_port + server_num + worker_num, 1)
+    else:
+        worker_endpoints_port = [
+            x.strip().split(":")[1] for x in worker_endpoints.split(",")
+        ]
 
     # local train
     if len(set(node_ips)) == 1:
...
...
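Note (not part of the commit): the port-defaulting behavior added to launch_ps above can be read as a small standalone routine. The sketch below is hypothetical — resolve_worker_ports is not a Paddle API; it only mirrors the hunk's logic under the diff's assumptions (start_port of 6170, server_num already known).

# Hypothetical sketch of the worker-port defaulting introduced in launch_ps();
# resolve_worker_ports is not a Paddle function, it only mirrors the diff above.
def resolve_worker_ports(worker_endpoints, server_num, start_port=6170):
    """Return (ips, ports) for --workers, assigning default ports when omitted."""
    entries = [x.strip() for x in worker_endpoints.split(",")]
    ips = [x.split(":")[0] for x in entries]
    worker_num = len(ips)
    if any(len(x.split(":")) == 1 for x in entries):
        # At least one worker endpoint has no ":port": hand out consecutive
        # ports right after the server port block, as the new code does.
        ports = list(range(start_port + server_num,
                           start_port + server_num + worker_num))
    else:
        # Every endpoint already carries "ip:port": keep the given ports.
        ports = [x.split(":")[1] for x in entries]
    return ips, ports

# "ip-only" workers get default ports 6172 and 6173 when server_num is 2.
print(resolve_worker_ports("192.168.0.16,192.168.0.17", server_num=2))
# Workers with explicit ports are left untouched.
print(resolve_worker_ports("192.168.0.16:6172,192.168.0.17:6173", server_num=2))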
python/paddle/fluid/tests/unittests/test_fleet_launch.sh
...
...
@@ -11,7 +11,15 @@ function test_launch_ps(){
        exit -1
    fi
 
-    fleetrun --servers="120.0.0.1:6780,120.0.0.1:6781" --workers="120.0.0.1:6782,120.0.0.1:6783" fleet_ps_training.py 2> ut.elog
+    fleetrun --servers="127.0.0.1:6780,127.0.0.1:6781" --workers="127.0.0.1:6782,127.0.0.1:6783" fleet_ps_training.py 2> ut.elog
+    if grep -q "server are killed" ut.elog; then
+        echo "test pserver launch succeed"
+    else
+        echo "test pserver launch failed"
+        exit -1
+    fi
+
+    fleetrun --servers="127.0.0.1:6780,127.0.0.1:6781" --workers="127.0.0.1,127.0.0.1" fleet_ps_training.py 2> ut.elog
     if grep -q "server are killed" ut.elog; then
         echo "test pserver launch succeed"
     else
...
...
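Note (not part of the commit): the second fleetrun call added above is what exercises the new behavior, since its --workers list carries no ports. Assuming two servers and the default start_port of 6170 (the code that derives server_num from --servers is elided here), the earlier hypothetical resolve_worker_ports sketch would assign the two local workers the default ports 6172 and 6173:

# Reusing the hypothetical resolve_worker_ports sketch from after launch.py's diff.
ips, ports = resolve_worker_ports("127.0.0.1,127.0.0.1", server_num=2)
print(ips, ports)  # ['127.0.0.1', '127.0.0.1'] [6172, 6173]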