PaddlePaddle / PARL
Commit 4c312dab
Authored Aug 29, 2019 by Bo Zhou
Committed by GitHub on Aug 29, 2019

fix the thread-safe problem of parl.client (#141)

* fix the thread-safe problem of parl.client
* yapf

Parent: 001c4dba

Showing 2 changed files with 24 additions and 4 deletions (+24 -4)

parl/remote/client.py  +6 -0
parl/remote/tests/cluster_monitor_3_test.py  +18 -4
parl/remote/client.py @ 4c312dab

@@ -202,14 +202,18 @@ class Client(object):
                     logger.error(
                         'Job {} exceeds max memory usage, will stop this job.'
                         .format(job_address))
+                    self.lock.acquire()
                     self.actor_num -= 1
+                    self.lock.release()
                     job_is_alive = False
                 else:
                     time.sleep(remote_constants.HEARTBEAT_INTERVAL_S)

             except zmq.error.Again as e:
                 job_is_alive = False
+                self.lock.acquire()
                 self.actor_num -= 1
+                self.lock.release()

             except zmq.error.ZMQError as e:
                 break
@@ -248,7 +252,9 @@ class Client(object):
                 check_result = self._check_and_monitor_job(
                     job_heartbeat_address, ping_heartbeat_address)
                 if check_result:
+                    self.lock.acquire()
                     self.actor_num += 1
+                    self.lock.release()
                     return job_address

             # no vacant CPU resources, cannot submit a new job
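The change above wraps every update of `self.actor_num` in `self.lock.acquire()` / `self.lock.release()`, presumably because the monitoring threads and the job-submission path can modify the counter at the same time. The following is a minimal sketch of the same idea, not PARL's code: `ActorCounter` and `run_workers` are hypothetical names, and it assumes the lock is a `threading.Lock`, which the diff does not show.

```python
import threading


class ActorCounter:
    """Hypothetical stand-in for the counter that Client keeps in actor_num."""

    def __init__(self):
        self.lock = threading.Lock()  # assumed lock type; not shown in the diff
        self.actor_num = 0

    def add(self, delta):
        # `with` acquires the lock and releases it even if the body raises,
        # which an explicit acquire()/release() pair does not guarantee.
        with self.lock:
            self.actor_num += delta


def run_workers(counter, n_threads=8, n_iters=10000):
    """Each thread increments and then decrements the counter n_iters times."""

    def work():
        for _ in range(n_iters):
            counter.add(1)
            counter.add(-1)

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.actor_num


final_count = run_workers(ActorCounter())
print(final_count)
```

Because every read-modify-write happens while the lock is held, the increments and decrements always balance. The `with` form is a design alternative to the explicit calls in the commit; it behaves the same on the normal path and additionally releases the lock on an exception.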
parl/remote/tests/cluster_monitor_3_test.py @ 4c312dab

@@ -79,15 +79,29 @@ class TestClusterMonitor(unittest.TestCase):
         time.sleep(1)
         self.assertEqual(20, len(cluster_monitor.data['workers']))

+        # check if the number of workers drops by 10
         for i in range(10):
             workers[i].exit()
-        time.sleep(60)
-        self.assertEqual(10, len(cluster_monitor.data['workers']))
+
+        check_flag = False
+        for _ in range(10):
+            if 10 == len(cluster_monitor.data['workers']):
+                check_flag = True
+                break
+            time.sleep(10)
+        self.assertTrue(check_flag)

         for i in range(10, 20):
             workers[i].exit()
-        time.sleep(60)
-        self.assertEqual(0, len(cluster_monitor.data['workers']))
+
+        # check if the number of workers drops to 0
+        check_flag = False
+        for _ in range(10):
+            if 0 == len(cluster_monitor.data['workers']):
+                check_flag = True
+                break
+            time.sleep(10)
+        self.assertTrue(check_flag)

         master.exit()
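The test change replaces a fixed `time.sleep(60)` followed by a single assertion with a polling loop that checks the worker count up to 10 times, 10 seconds apart, and passes as soon as the expected count is seen. The same pattern can be factored into a helper, sketched below. `wait_until` and its parameters are hypothetical and not part of PARL; the `sleep` argument is injectable only so the example runs instantly.

```python
import time


def wait_until(predicate, retries=10, interval_s=10.0, sleep=time.sleep):
    """Poll predicate() up to `retries` times, sleeping between failed checks.

    Returns True as soon as the predicate holds, False if it never does.
    """
    for _ in range(retries):
        if predicate():
            return True
        sleep(interval_s)
    return False


# Simulated cluster state: a fake sleep drops five workers per interval.
workers = list(range(20))


def fake_sleep(_seconds):
    del workers[-5:]


reached_ten = wait_until(lambda: len(workers) == 10, sleep=fake_sleep)
never_true = wait_until(lambda: False, retries=3, sleep=lambda _s: None)
print(reached_ten, never_true)
```

In a test case this would read `self.assertTrue(wait_until(lambda: 10 == len(cluster_monitor.data['workers'])))`, keeping the retry count and interval used in the commit as the defaults.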