机器未来 / Paddle
Forked from PaddlePaddle / Paddle (in sync with the upstream project)
Commit 884011a4 (unverified)
Authored Sep 01, 2021 by zhaoyingli
Committed by GitHub, Sep 01, 2021

reverse xpu adamw to the combination of ops version. (#35286)
Parent: 572bad8a

Showing 1 changed file with 67 additions and 2 deletions.

python/paddle/optimizer/adamw.py (+67 -2)
@@ -162,6 +162,7 @@ class AdamW(Adam):
        self._params_name = set()
        self._apply_decay_param_fun = apply_decay_param_fun
        self._coeff = coeff
        self._lr_to_coeff = dict()
        super(AdamW, self).__init__(
            learning_rate=learning_rate,
@@ -177,6 +178,9 @@ class AdamW(Adam):
        self.type = "adamw"
        if core.is_compiled_with_xpu():
            self.type = "adam"
        # Use _auxiliary_vars together with _set_auxiliary_var/_get_auxiliary_var to achieve that.
        self._auxiliary_vars = dict()
@@ -189,7 +193,63 @@ class AdamW(Adam):
        else:
            return None

    def _append_decoupled_weight_decay(self, block, param_and_grad):
        """
        Add decoupled weight decay op.
            parameter = parameter - parameter * coeff * lr
        Args:
            block: block in which variable is to be created
            param_and_grad: (parameters, gradients) pairs,
                the parameters need to decay.
        Raises:
            Exception: The type of coeff and parameter is not consistent.
        """
        if isinstance(param_and_grad, dict):
            param_and_grad = self._update_param_group(param_and_grad)
        param, grad = param_and_grad

        if self._apply_decay_param_fun is not None \
                and not self._apply_decay_param_fun(param.name):
            return

        if isinstance(self._learning_rate, float):
            learning_rate = self._learning_rate
        else:
            # NOTE. We add this function to the _append_optimize_op(),
            # for we must make sure _create_param_lr() be called after
            # optimizer._create_global_learning_rate().
            learning_rate = self._create_param_lr(param_and_grad)

        with block.program._optimized_guard(
                [param, grad]), framework.name_scope('weight decay'):
            self._params_name.add(param.name)

            # If it has been calculated, the result will be reused.
            # NOTE(wangxi): In dygraph mode, apply_gradient will be executed
            # every step, so need clear _lr_to_coeff every step,
            # we do this in _create_optimization_pass
            decay_coeff = self._lr_to_coeff.get(learning_rate, None)
            if decay_coeff is None:
                # NOTE(wangxi): for pipeline to set device:all
                with paddle.static.device_guard(None):
                    decay_coeff = 1.0 - learning_rate * self._coeff
                self._lr_to_coeff[learning_rate] = decay_coeff

            find_master = (self._multi_precision and
                           param.dtype == core.VarDesc.VarType.FP16)
            if find_master:
                master_weight = self._master_weights[param.name]
                scaled_param = master_weight * decay_coeff
                paddle.fluid.layers.assign(
                    input=scaled_param, output=master_weight)
            else:
                scaled_param = param * decay_coeff
                paddle.fluid.layers.assign(input=scaled_param, output=param)
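
The `find_master` branch above applies the decay to the master weight instead of the FP16 parameter. The following is a minimal standalone sketch of why that distinction can matter. It assumes illustrative values for `lr` and `coeff`, uses the standard library's half-precision `struct` format as a stand-in for an FP16 tensor, and uses a Python float as a stand-in for the higher-precision master copy. The helper names are hypothetical and are not part of Paddle.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def decay_in_fp16(param: float, decay_coeff: float, steps: int) -> float:
    """Apply the decay directly to an FP16 value, rounding after every step."""
    for _ in range(steps):
        param = to_fp16(param * decay_coeff)
    return param


def decay_via_master(param: float, decay_coeff: float, steps: int) -> float:
    """Apply the decay to a higher-precision copy and cast once at the end."""
    master = param  # Python float (FP64) standing in for the master weight
    for _ in range(steps):
        master *= decay_coeff
    return to_fp16(master)


# Hypothetical hyperparameters, chosen only for illustration.
lr, coeff = 1e-3, 1e-2
decay_coeff = 1.0 - lr * coeff  # 0.99999, very close to 1

fp16_result = decay_in_fp16(1.0, decay_coeff, steps=1000)
master_result = decay_via_master(1.0, decay_coeff, steps=1000)
```

With a decay coefficient this close to 1, each FP16 product rounds back to the starting value, so the parameter never shrinks, whereas the higher-precision copy accumulates the decay across steps.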

    def _append_optimize_op(self, block, param_and_grad):
        if paddle.is_compiled_with_xpu():
            self._append_decoupled_weight_decay(block, param_and_grad)
            return super(AdamW, self)._append_optimize_op(block, param_and_grad)

        assert isinstance(block, framework.Block)
        if isinstance(param_and_grad, dict):
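
On XPU builds, `_append_optimize_op` first scales the parameter and then delegates to the parent Adam update. A small scalar sketch of that two-step composition follows, compared against a single-expression AdamW update. It assumes a textbook Adam step with bias correction, which is not necessarily the exact arithmetic of Paddle's `adam` kernel, and all function names and hyperparameter values are hypothetical.

```python
import math


def adam_step(param, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam update on scalars; returns (param, m, v)."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


def decay_then_adam(param, grad, m, v, t, lr, coeff):
    """Two-step form: scale the parameter first, then run plain Adam."""
    param = param * (1.0 - lr * coeff)
    return adam_step(param, grad, m, v, t, lr)


def single_step_adamw(param, grad, m, v, t, lr, coeff):
    """AdamW written as one update: Adam step minus the decoupled decay term."""
    decay = lr * coeff * param
    new_param, m, v = adam_step(param, grad, m, v, t, lr)
    return new_param - decay, m, v


# Illustrative inputs only.
two_step = decay_then_adam(0.5, 0.1, 0.0, 0.0, t=1, lr=1e-3, coeff=1e-2)
one_step = single_step_adamw(0.5, 0.1, 0.0, 0.0, t=1, lr=1e-3, coeff=1e-2)
```

Because the Adam step depends on the parameter only through the final subtraction, scaling the parameter beforehand gives the same result as subtracting `lr * coeff * param` afterwards, up to floating-point rounding.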
@@ -201,8 +261,6 @@ class AdamW(Adam):
        if self._apply_decay_param_fun is not None \
                and not self._apply_decay_param_fun(param.name):
            with_decay = False
        else:
            self._params_name.add(param.name)

        moment1 = self._get_accumulator(self._moment1_acc_str,
                                        param_and_grad[0])
@@ -291,6 +349,13 @@ class AdamW(Adam):
        return adamw_op

    def _create_optimization_pass(self, parameters_and_grads):
        optimize_ops = super(
            AdamW, self)._create_optimization_pass(parameters_and_grads)
        # In dygraph mode, clear _lr_to_coeff after applied gradient
        self._lr_to_coeff = dict()
        return optimize_ops

    def __str__(self):
        return " ".join(["Weight Decay, params:", ",".join(self._params_name)])
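
The `_lr_to_coeff` dictionary is reused within one optimization pass and reset at the end of it. A minimal sketch of that caching pattern, assuming plain float learning rates as keys, is shown here. The class `DecayCoeffCache`, its `computed` counter, and `end_of_pass` are hypothetical illustrations and not Paddle API.

```python
class DecayCoeffCache:
    """Hypothetical stand-in for the `_lr_to_coeff` bookkeeping."""

    def __init__(self, coeff):
        self._coeff = coeff
        self._lr_to_coeff = {}
        self.computed = 0  # how many times a coefficient was actually built

    def decay_coeff(self, learning_rate):
        """Return 1 - lr * coeff, reusing the value cached for this lr."""
        cached = self._lr_to_coeff.get(learning_rate)
        if cached is None:
            cached = 1.0 - learning_rate * self._coeff
            self._lr_to_coeff[learning_rate] = cached
            self.computed += 1
        return cached

    def end_of_pass(self):
        """Drop cached coefficients so the next pass recomputes them."""
        self._lr_to_coeff = {}


cache = DecayCoeffCache(coeff=0.01)
for _ in range(3):  # three parameters sharing one learning rate
    cache.decay_coeff(0.001)
after_first_pass = cache.computed

cache.end_of_pass()
cache.decay_coeff(0.001)  # recomputed because the cache was cleared
after_second_pass = cache.computed
```

Within a pass, parameters that share a learning rate share one coefficient. Clearing the dictionary afterwards prevents a value from an earlier step from being reused once the learning rate has changed.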