Commit c1db7e32
Authored on April 27, 2021 by ShenLiang; committed by GitHub on April 27, 2021
[HybridParallel] Fix amp bug in ModelParallel (#32579)
* fix amp bug
* fix name of wordsize
Parent: 9930a582
Showing 2 changed files with 9 additions and 8 deletions (+9 -8)
python/paddle/distributed/fleet/meta_optimizers/dygraph_optimizer/hybrid_parallel_gradscaler.py (+4 -3)
python/paddle/distributed/fleet/meta_parallel/parallel_layers/pp_layers.py (+5 -5)
python/paddle/distributed/fleet/meta_optimizers/dygraph_optimizer/hybrid_parallel_gradscaler.py
@@ -67,10 +67,11 @@ class HybridParallelGradScaler:
         # allreduce_max found_inf in check_group
         if self._is_mp:
             self._found_inf = paddle.cast(self._found_inf, dtype="int32")
+            # TODO(shenliang03) Since the minimize call in the optimizer is
+            # after the gradscaler, check_finite needs to synchronize global
+            # information. In the future, we should use check_group
             paddle.distributed.all_reduce(
-                self._found_inf,
-                op=paddle.distributed.ReduceOp.MAX,
-                group=self._hcg.get_check_parallel_group())
+                self._found_inf, op=paddle.distributed.ReduceOp.MAX, group=None)
             self._found_inf = paddle.cast(self._found_inf, dtype="bool")
 
     def __getattr__(self, item):
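The effect of the MAX all-reduce on `found_inf` in the hunk above can be illustrated without Paddle. The following is a minimal pure-Python simulation, not the Paddle API: `all_reduce_max` and `sync_found_inf` are hypothetical helper names, and each list element stands in for the flag held by one rank. It assumes that `group=None` makes the reduction span all ranks, so that every rank reaches the same decision about skipping the optimizer step.

```python
from typing import List


def all_reduce_max(values: List[int]) -> List[int]:
    """Simulate an all-reduce with MAX: every rank receives the global maximum."""
    reduced = max(values)
    return [reduced] * len(values)


def sync_found_inf(found_inf_per_rank: List[bool]) -> List[bool]:
    """Mirror the cast -> all_reduce(MAX) -> cast sequence from the diff."""
    # bool -> int, standing in for paddle.cast(..., dtype="int32")
    as_int = [int(flag) for flag in found_inf_per_rank]
    reduced = all_reduce_max(as_int)
    # int -> bool, standing in for paddle.cast(..., dtype="bool")
    return [bool(value) for value in reduced]


# Only rank 2 saw an inf/nan gradient locally.
local_flags = [False, False, True, False]
synced = sync_found_inf(local_flags)
print(synced)  # [True, True, True, True]
```

Because MAX over 0/1 flags behaves like a logical OR, a single rank that detects a non-finite gradient is enough to make all participating ranks report `found_inf`.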
python/paddle/distributed/fleet/meta_parallel/parallel_layers/pp_layers.py
@@ -77,7 +77,7 @@ class PipelineLayer(Layer):
         self.layers = layers
         self._loss_fn = loss_fn
         self._topo = topology
-        word_size = dist.get_world_size()
+        world_size = dist.get_world_size()
         self.global_rank = dist.get_rank()
 
         if self._topo:
@@ -88,11 +88,11 @@ class PipelineLayer(Layer):
                 self._num_stages)
         else:
             # construct default topology
-            if word_size % num_stages != 0:
+            if world_size % num_stages != 0:
                 raise ValueError("should provide correct num_stages({}) "
-                                 "which can be divided by word_size({})".format(
-                                     num_stages, word_size))
-            dp_num = word_size // num_stages
+                                 "which can be divided by world_size({})".
+                                 format(num_stages, world_size))
+            dp_num = world_size // num_stages
             self._topo = fleet.CommunicateTopology(["data", "pipe", "model"],
                                                    [dp_num, num_stages, 1])
             self._stage_id = self._topo.get_coord(self.global_rank).pipe
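The default-topology branch in the second hunk can be sketched in plain Python. This is an illustration under stated assumptions, not the `fleet.CommunicateTopology` implementation: `default_topology_dims` and `rank_to_coord` are hypothetical helpers, and the rank-to-coordinate mapping assumes ranks are numbered in row-major order over the `["data", "pipe", "model"]` axes.

```python
from itertools import product
from typing import Dict, List, Tuple


def default_topology_dims(world_size: int, num_stages: int) -> List[int]:
    """Return [dp_num, num_stages, 1], validating divisibility as the diff does."""
    if world_size % num_stages != 0:
        raise ValueError(
            "world_size({}) must be divisible by num_stages({})".format(
                world_size, num_stages))
    dp_num = world_size // num_stages
    return [dp_num, num_stages, 1]


def rank_to_coord(dims: List[int]) -> Dict[int, Tuple[int, ...]]:
    """Map each rank to a (data, pipe, model) coordinate.

    Assumption: ranks enumerate coordinates in row-major order.
    """
    coords = product(*(range(d) for d in dims))
    return {rank: coord for rank, coord in enumerate(coords)}


dims = default_topology_dims(world_size=8, num_stages=4)
coords = rank_to_coord(dims)
print(dims)          # [2, 4, 1]
print(coords[5])     # (1, 1, 0)
print(coords[5][1])  # pipeline stage id of rank 5 under this assumption: 1
```

With 8 ranks and 4 pipeline stages, the data-parallel degree comes out as 2; a `world_size` of 6 with the same `num_stages` would raise `ValueError`.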