BaiXuePrincess / Paddle (forked from PaddlePaddle / Paddle; in sync with the fork source)
Commit 876e2ff1 (unverified)
Authored July 18, 2022 by caozhou
Committed by GitHub on July 18, 2022

[auto parallel] remove comm init control (#44385)

Parent: c0a7830f

Showing 1 changed file with 1 addition and 55 deletions

python/paddle/distributed/auto_parallel/engine.py (+1, -55)
...
@@ -324,65 +324,11 @@ class Engine:
         # instantiate communication by process_mapping.
         all_process_groups = get_all_process_groups()
-        has_recv_by_socket = []
-        # This is a magic number and the rank number for training is usually less than 5000
-        magic_num = 5000
-        genv = _get_global_env()
-        cur_rank_ip, cur_rank_port = genv.current_endpoint.split(":")
-        cur_rank_recv_port = int(cur_rank_port) + magic_num
-        server_socket = None
-        # Large enough for recv rank
-        buff_size = 1024
-        server_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
-        server_socket.bind((cur_rank_ip, cur_rank_recv_port))
-        # The 10 is an empirical value
-        server_socket.listen(10)
-        client_sockets = {}
+        # NOTE: add the comm init control in the future for auto search
         for process_group in all_process_groups:
             if self._cur_rank not in process_group.ranks:
                 continue
-            if len(process_group.ranks) == 2:
-                index = process_group.ranks.index(self._cur_rank)
-                is_send = True if index == 0 else False
-                if is_send:
-                    recv_rank = process_group.ranks[1]
-                    recv_rank_ip, recv_rank_port = genv.trainer_endpoints[
-                        recv_rank].split(":")
-                    connect_port = int(recv_rank_port) + magic_num
-                    client_socket = socket.socket(socket.AF_INET,
-                                                  socket.SOCK_STREAM)
-                    client_socket.connect((recv_rank_ip, connect_port))
-                    client_socket.send(str(self._cur_rank).encode('utf-8'))
-                    rank = client_socket.recv(buff_size).decode('utf-8')
-                    rank = int(rank)
-                    if rank != recv_rank:
-                        raise ValueError(
-                            "Please check comm pair, the recv rank should be {} but got {}."
-                            .format(recv_rank, rank))
-                    else:
-                        print("It is able to instantiate {} as sender now.".
-                              format(process_group.ranks))
-                    client_socket.close()
-                else:
-                    send_rank = process_group.ranks[0]
-                    while True:
-                        if send_rank not in has_recv_by_socket:
-                            client_socket, recv_addr = server_socket.accept(
-                            )
-                            rank = int(
-                                client_socket.recv(buff_size).decode())
-                            client_sockets[rank] = client_socket
-                            has_recv_by_socket.append(rank)
-                        else:
-                            client_sockets[send_rank].send(
-                                str(self._cur_rank).encode("utf-8"))
-                            client_sockets[send_rank].close()
-                            print(
-                                "It is able to instantiate {} as recver now."
-                                .format(process_group.ranks))
-                            break
             process_group.instantiate()
-        server_socket.close()

         self._place = _get_device()
         if isinstance(self._place, fluid.CUDAPlace):
...
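For readers who want to see what the removed "comm init control" did, the two-rank handshake can be sketched in isolation. This is a minimal, standalone illustration and not Paddle code: the function names (`receiver`, `sender`, `run_handshake`) and the `result` dictionary are hypothetical. Both ranks run in one process on the loopback interface, and the listening port is chosen by the OS instead of being derived from the rank's endpoint port plus the 5000 offset used in the deleted lines.

```python
import socket
import threading

BUFF_SIZE = 1024  # large enough to hold a rank id encoded as text


def receiver(server_socket, my_rank, result):
    """Accept one connection, read the sender's rank, reply with our own rank."""
    conn, _addr = server_socket.accept()
    with conn:
        sender_rank = int(conn.recv(BUFF_SIZE).decode("utf-8"))
        conn.sendall(str(my_rank).encode("utf-8"))
    result["receiver_saw"] = sender_rank


def sender(host, port, my_rank, expected_receiver, result):
    """Connect to the receiver, announce our rank, and verify the rank it echoes back."""
    with socket.create_connection((host, port)) as client:
        client.sendall(str(my_rank).encode("utf-8"))
        peer_rank = int(client.recv(BUFF_SIZE).decode("utf-8"))
    if peer_rank != expected_receiver:
        raise ValueError(
            "Please check comm pair, the recv rank should be {} but got {}."
            .format(expected_receiver, peer_rank))
    result["sender_saw"] = peer_rank


def run_handshake(send_rank=0, recv_rank=1):
    """Run both sides of the handshake locally and return what each side observed."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Assumption for this sketch: port 0 lets the OS pick a free port.
    server.bind(("127.0.0.1", 0))
    server.listen(10)
    host, port = server.getsockname()

    result = {}
    recv_thread = threading.Thread(target=receiver,
                                   args=(server, recv_rank, result))
    recv_thread.start()
    try:
        sender(host, port, send_rank, recv_rank, result)
    finally:
        recv_thread.join()
        server.close()
    return result
```

Calling `run_handshake(send_rank=0, recv_rank=1)` returns a dictionary recording the rank each side received from its peer. In the deleted code the same exchange gated `process_group.instantiate()` for every two-rank group. After this commit, each group containing the current rank is instantiated directly.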