Commit ad74de01 (unverified)
Authored on May 21, 2019 by pkpk; committed via GitHub on May 21, 2019

add two dygraph examples of reinforcement learning

Parent: 6b6f23da
Showing 3 changed files with 488 additions and 0 deletions (+488, -0)
dygraph/reinforcement_learning/README.md        +105  -0
dygraph/reinforcement_learning/actor_critic.py  +199  -0
dygraph/reinforcement_learning/reinforce.py     +184  -0
dygraph/reinforcement_learning/README.md
0 → 100644
# Reinforcement Learning

This page describes how to implement typical reinforcement learning algorithms with PaddlePaddle in DyGraph mode, covering [Installation](#installation), [Training](#training-a-model), [Log Output](#log), and [Model Evaluation](#evaluation).
---
## Contents

- [Installation](#installation)
- [Training](#training-a-model)
- [Log Output](#log)
- [Model Evaluation](#evaluation)
## Installation

Running the sample code in this directory requires PaddlePaddle Fluid v1.4.0 or later. If the PaddlePaddle version in your environment is lower than this, please update PaddlePaddle following the instructions in the installation documentation.

In addition to paddle, please install the gym simulator (https://gym.openai.com/ ), which provides a rich set of simulation environments and greatly facilitates reinforcement learning research.
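As a quick sanity check (a minimal sketch, not part of the committed examples; it assumes `paddle.__version__` is exposed by your PaddlePaddle build), you can verify that both dependencies are importable and that the CartPole environment is available:

```
import paddle
import gym

print(paddle.__version__)      # expect 1.4.0 or later
env = gym.make('CartPole-v0')  # confirms the gym simulator is installed
print(env.observation_space, env.action_space)
```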
## Training

The tutorial uses `CartPole` as the simulation environment (about CartPole: https://gym.openai.com/envs/CartPole-v0 ) and includes two typical reinforcement learning algorithms, reinforce.py and actor_critic.py. Training can be launched as follows:

```
env CUDA_VISIBLE_DEVICES=0 python reinforce.py
```

or

```
env CUDA_VISIBLE_DEVICES=0 python actor_critic.py
```
## Log Output

After training starts, you will see output similar to the following.
```
env CUDA_VISIBLE_DEVICES=0 python reinforce.py
Episode 10 Last reward: 20.00 Average reward: 14.96
Episode 20 Last reward: 35.00 Average reward: 20.56
Episode 30 Last reward: 26.00 Average reward: 23.18
Episode 40 Last reward: 21.00 Average reward: 28.68
Episode 50 Last reward: 21.00 Average reward: 30.06
Episode 60 Last reward: 27.00 Average reward: 37.21
Episode 70 Last reward: 67.00 Average reward: 47.69
Episode 80 Last reward: 46.00 Average reward: 55.25
Episode 90 Last reward: 113.00 Average reward: 80.11
Episode 100 Last reward: 124.00 Average reward: 89.36
Episode 110 Last reward: 97.00 Average reward: 98.29
Episode 120 Last reward: 200.00 Average reward: 110.29
Episode 130 Last reward: 200.00 Average reward: 142.01
Episode 140 Last reward: 157.00 Average reward: 162.18
Episode 150 Last reward: 101.00 Average reward: 165.37
Episode 160 Last reward: 119.00 Average reward: 156.74
Episode 170 Last reward: 114.00 Average reward: 146.62
Episode 180 Last reward: 149.00 Average reward: 140.74
Episode 190 Last reward: 114.00 Average reward: 149.52
Episode 200 Last reward: 124.00 Average reward: 130.40
Episode 210 Last reward: 103.00 Average reward: 119.44
Episode 220 Last reward: 200.00 Average reward: 120.50
Episode 230 Last reward: 172.00 Average reward: 126.33
Episode 240 Last reward: 187.00 Average reward: 139.02
Episode 250 Last reward: 170.00 Average reward: 154.12
Episode 260 Last reward: 172.00 Average reward: 167.44
Episode 270 Last reward: 195.00 Average reward: 175.00
Episode 280 Last reward: 200.00 Average reward: 178.56
Episode 290 Last reward: 200.00 Average reward: 187.16
Episode 300 Last reward: 200.00 Average reward: 192.32
Solved! Running reward is now 195.156645521 and the last episode runs to 200 time steps!
```
or
```
env CUDA_VISIBLE_DEVICES=0 python actor_critic.py
Episode 10 Last reward: 131.00 Average reward: 23.54
Episode 20 Last reward: 89.00 Average reward: 31.96
Episode 30 Last reward: 108.00 Average reward: 76.43
Episode 40 Last reward: 20.00 Average reward: 83.57
Episode 50 Last reward: 19.00 Average reward: 56.94
Episode 60 Last reward: 53.00 Average reward: 48.44
Episode 70 Last reward: 147.00 Average reward: 82.04
Episode 80 Last reward: 90.00 Average reward: 94.94
Episode 90 Last reward: 144.00 Average reward: 97.71
Episode 100 Last reward: 200.00 Average reward: 133.73
Episode 110 Last reward: 200.00 Average reward: 158.69
Episode 120 Last reward: 159.00 Average reward: 162.60
Episode 130 Last reward: 150.00 Average reward: 159.57
Episode 140 Last reward: 195.00 Average reward: 163.27
Episode 150 Last reward: 143.00 Average reward: 157.88
Episode 160 Last reward: 113.00 Average reward: 151.82
Episode 170 Last reward: 147.00 Average reward: 146.14
Episode 180 Last reward: 199.00 Average reward: 150.11
Episode 190 Last reward: 200.00 Average reward: 168.77
Episode 200 Last reward: 200.00 Average reward: 177.60
Episode 210 Last reward: 102.00 Average reward: 174.29
Episode 220 Last reward: 189.00 Average reward: 171.91
Episode 230 Last reward: 200.00 Average reward: 169.92
Episode 240 Last reward: 200.00 Average reward: 181.99
Episode 250 Last reward: 200.00 Average reward: 189.22
Episode 260 Last reward: 200.00 Average reward: 188.75
Episode 270 Last reward: 180.00 Average reward: 192.27
Episode 280 Last reward: 200.00 Average reward: 175.83
Episode 290 Last reward: 200.00 Average reward: 185.53
Episode 300 Last reward: 200.00 Average reward: 191.33
Episode 310 Last reward: 200.00 Average reward: 194.81
Solved! Running reward is now 195.071295316 and the last episode runs to 200 time steps!
```
## Model Evaluation

Reinforcement learning models are typically trained online: the agent keeps acting in the environment while it learns and improves from the resulting feedback. The quality of a reinforcement learning algorithm can therefore be judged simply by watching how the reward evolves during training.

In gym, each game usually defines its own solve threshold (`env.spec.reward_threshold`); training is considered complete once the model reaches this threshold.
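For reference, this is exactly the stopping check used in both scripts below: a running reward (an exponential moving average of episode rewards) is compared against the environment's threshold. A minimal sketch, where the episode rewards are illustrative values only:

```
import gym

env = gym.make('CartPole-v0')
threshold = env.spec.reward_threshold  # 195.0 for CartPole-v0

running_reward = 10
for ep_reward in [20.0, 35.0, 26.0]:   # rewards collected during training
    running_reward = 0.05 * ep_reward + 0.95 * running_reward
    if running_reward > threshold:
        print("Solved!")
        break
```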
dygraph/reinforcement_learning/actor_critic.py
0 → 100644
import argparse
import gym
import numpy as np
from itertools import count
from collections import namedtuple
import math

import paddle
import paddle.fluid as fluid
import paddle.fluid.dygraph.nn as nn
import paddle.fluid.framework as framework

parser = argparse.ArgumentParser(description='PyTorch REINFORCE example')
parser.add_argument(
    '--gamma',
    type=float,
    default=0.99,
    metavar='G',
    help='discount factor (default: 0.99)')
parser.add_argument(
    '--seed',
    type=int,
    default=543,
    metavar='N',
    help='random seed (default: 543)')
parser.add_argument(
    '--render', action='store_true', help='render the environment')
parser.add_argument(
    '--log-interval',
    type=int,
    default=10,
    metavar='N',
    help='interval between training status logs (default: 10)')
args = parser.parse_args()

env = gym.make('CartPole-v0')
env.seed(args.seed)

SavedAction = namedtuple('SavedAction', ['log_prob', 'value'])


class Policy(fluid.dygraph.Layer):
    # Actor-critic network: a shared hidden layer feeding an action head
    # (policy) and a value head (critic).
    def __init__(self, name_scope):
        super(Policy, self).__init__(name_scope)

        self.affine1 = nn.FC(self.full_name(), size=128)
        self.action_head = nn.FC(self.full_name(), size=2)
        self.value_head = nn.FC(self.full_name(), size=1)

        self.saved_actions = []
        self.rewards = []

    def forward(self, x):
        x = fluid.layers.reshape(x, shape=[1, 4])
        x = self.affine1(x)
        x = fluid.layers.relu(x)
        action_scores = self.action_head(x)
        state_values = self.value_head(x)
        return fluid.layers.softmax(action_scores, axis=-1), state_values


with fluid.dygraph.guard():
    policy = Policy("PolicyModel")

    eps = np.finfo(np.float32).eps.item()
    optimizer = fluid.optimizer.AdamOptimizer(learning_rate=3e-2)

    def get_mean_and_std(values=[]):
        # Mean and (population) standard deviation of a list of returns.
        n = 0.
        s = 0.
        for val in values:
            s += val
            n += 1
        mean = s / n

        std = 0.
        for val in values:
            std += (val - mean) * (val - mean)
        std /= n
        std = math.sqrt(std)

        return mean, std

    def sample_action(probs):
        # Sample an action index from the given probabilities and return it
        # together with a one-hot mask over the action space.
        sample = np.random.random()
        idx = 0
        while idx < len(probs) and sample > probs[idx]:
            sample -= probs[idx]
            idx += 1
        mask = [0.] * len(probs)
        mask[idx] = 1.
        return idx, np.array([mask]).astype("float32")

    def choose_best_action(probs):
        # Greedy action selection; not used in the training loop below.
        idx = 0 if probs[0] > probs[1] else 1
        mask = [1., 0.] if idx == 0 else [0., 1.]
        return idx, np.array([mask]).astype("float32")

    def select_action(state):
        # Run the policy on the current state, save the log-probability of the
        # sampled action and the predicted state value, and return the action.
        state = fluid.dygraph.base.to_variable(state)
        state.stop_gradient = True
        probs, state_value = policy(state)
        np_probs = probs.numpy()
        action, _mask = sample_action(np_probs[0])

        mask = fluid.dygraph.base.to_variable(_mask)
        mask.stop_gradient = True

        loss_probs = fluid.layers.log(probs)
        loss_probs = fluid.layers.elementwise_mul(loss_probs, mask)
        loss_probs = fluid.layers.reduce_sum(loss_probs, dim=-1)

        policy.saved_actions.append(SavedAction(loss_probs, state_value))
        return action

    def finish_episode():
        # Compute normalized discounted returns, build the actor and critic
        # losses, and take one optimizer step.
        R = 0
        saved_actions = policy.saved_actions
        policy_losses = []
        value_losses = []
        returns = []
        for r in policy.rewards[::-1]:
            R = r + args.gamma * R
            returns.insert(0, R)

        mean, std = get_mean_and_std(returns)
        returns = np.array(returns).astype("float32")
        returns = (returns - mean) / (std + eps)

        for (log_prob, value), R in zip(saved_actions, returns):
            advantage = R - value[0][0]

            log_prob_numpy = log_prob.numpy()
            R_numpy = np.ones_like(log_prob_numpy).astype("float32")
            _R = -1 * advantage * R_numpy
            _R = fluid.dygraph.base.to_variable(_R)
            _R.stop_gradient = True
            policy_loss = fluid.layers.elementwise_mul(_R, log_prob)
            policy_losses.append(policy_loss)

            _R2 = np.ones_like(value.numpy()).astype("float32") * R
            _R2 = fluid.dygraph.base.to_variable(_R2)
            _R2.stop_gradient = True
            value_loss = fluid.layers.smooth_l1(value, _R2, sigma=1.0)
            value_losses.append(value_loss)

        all_policy_loss = fluid.layers.concat(policy_losses)
        all_policy_loss = fluid.layers.reduce_sum(all_policy_loss)

        all_value_loss = fluid.layers.concat(value_losses)
        all_value_loss = fluid.layers.reduce_sum(all_value_loss)

        loss = all_policy_loss + all_value_loss

        loss.backward()
        optimizer.minimize(loss)
        policy.clear_gradients()

        del policy.rewards[:]
        del policy.saved_actions[:]

        return returns

    running_reward = 10
    for i_episode in count(1):
        state, ep_reward = env.reset(), 0
        for t in range(1, 10000):  # Don't infinite loop while learning
            state = np.array(state).astype("float32")
            action = select_action(state)
            state, reward, done, _ = env.step(action)
            if args.render:
                env.render()
            policy.rewards.append(reward)
            ep_reward += reward
            if done:
                break

        running_reward = 0.05 * ep_reward + (1 - 0.05) * running_reward
        returns = finish_episode()
        if i_episode % args.log_interval == 0:
            print('Episode {}\tLast reward: {:.2f}\tAverage reward: {:.2f}'.
                  format(i_episode, ep_reward, running_reward))
            #print(returns)
        if running_reward > env.spec.reward_threshold:
            print("Solved! Running reward is now {} and "
                  "the last episode runs to {} time steps!".format(
                      running_reward, t))
            break
dygraph/reinforcement_learning/reinforce.py
0 → 100644
import argparse
import gym
import numpy as np
from itertools import count
import math

import paddle
import paddle.fluid as fluid
import paddle.fluid.dygraph.nn as nn
import paddle.fluid.framework as framework

parser = argparse.ArgumentParser(description='PyTorch REINFORCE example')
parser.add_argument(
    '--gamma',
    type=float,
    default=0.99,
    metavar='G',
    help='discount factor (default: 0.99)')
parser.add_argument(
    '--seed',
    type=int,
    default=543,
    metavar='N',
    help='random seed (default: 543)')
parser.add_argument(
    '--render', action='store_true', help='render the environment')
parser.add_argument(
    '--log-interval',
    type=int,
    default=10,
    metavar='N',
    help='interval between training status logs (default: 10)')
args = parser.parse_args()

env = gym.make('CartPole-v0')
env.seed(args.seed)


class Policy(fluid.dygraph.Layer):
    # REINFORCE policy network: one hidden layer with dropout, followed by a
    # softmax over the two CartPole actions.
    def __init__(self, name_scope):
        super(Policy, self).__init__(name_scope)

        self.affine1 = nn.FC(self.full_name(), size=128)
        self.affine2 = nn.FC(self.full_name(), size=2)
        self.dropout_ratio = 0.6

        self.saved_log_probs = []
        self.rewards = []

    def forward(self, x):
        x = fluid.layers.reshape(x, shape=[1, 4])
        x = self.affine1(x)
        x = fluid.layers.dropout(x, self.dropout_ratio)
        x = fluid.layers.relu(x)
        action_scores = self.affine2(x)

        self._x_for_debug = x

        return fluid.layers.softmax(action_scores, axis=1)


with fluid.dygraph.guard():
    policy = Policy("PolicyModel")

    eps = np.finfo(np.float32).eps.item()
    optimizer = fluid.optimizer.AdamOptimizer(learning_rate=1e-2)

    def get_mean_and_std(values=[]):
        # Mean and (population) standard deviation of a list of returns.
        n = 0.
        s = 0.
        for val in values:
            s += val
            n += 1
        mean = s / n

        std = 0.
        for val in values:
            std += (val - mean) * (val - mean)
        std /= n
        std = math.sqrt(std)

        return mean, std

    def sample_action(probs):
        # Sample an action index from the given probabilities and return it
        # together with a one-hot mask over the action space.
        sample = np.random.random()
        idx = 0
        while idx < len(probs) and sample > probs[idx]:
            sample -= probs[idx]
            idx += 1
        mask = [0.] * len(probs)
        mask[idx] = 1.
        return idx, np.array([mask]).astype("float32")

    def choose_best_action(probs):
        # Greedy action selection; not used in the training loop below.
        idx = 0 if probs[0] > probs[1] else 1
        mask = [1., 0.] if idx == 0 else [0., 1.]
        return idx, np.array([mask]).astype("float32")

    def select_action(state):
        # Run the policy on the current state, save the log-probability of the
        # sampled action, and return the action.
        state = fluid.dygraph.base.to_variable(state)
        state.stop_gradient = True
        loss_probs = policy(state)
        probs = loss_probs.numpy()
        action, _mask = sample_action(probs[0])

        mask = fluid.dygraph.base.to_variable(_mask)
        mask.stop_gradient = True

        loss_probs = fluid.layers.log(loss_probs)
        loss_probs = fluid.layers.elementwise_mul(loss_probs, mask)
        loss_probs = fluid.layers.reduce_sum(loss_probs, dim=-1)

        policy.saved_log_probs.append(loss_probs)
        return action

    def finish_episode():
        # Compute normalized discounted returns, build the REINFORCE loss, and
        # take one optimizer step.
        R = 0
        policy_loss = []
        returns = []
        for r in policy.rewards[::-1]:
            R = r + args.gamma * R
            returns.insert(0, R)

        mean, std = get_mean_and_std(returns)
        returns = np.array(returns).astype("float32")
        returns = (returns - mean) / (std + eps)

        for log_prob, R in zip(policy.saved_log_probs, returns):
            log_prob_numpy = log_prob.numpy()
            R_numpy = np.ones_like(log_prob_numpy).astype("float32")
            _R = -1 * R * R_numpy
            _R = fluid.dygraph.base.to_variable(_R)
            _R.stop_gradient = True
            curr_loss = fluid.layers.elementwise_mul(_R, log_prob)
            policy_loss.append(curr_loss)

        policy_loss = fluid.layers.concat(policy_loss)
        policy_loss = fluid.layers.reduce_sum(policy_loss)

        policy_loss.backward()
        optimizer.minimize(policy_loss)

        dy_grad = policy._x_for_debug.gradient()

        policy.clear_gradients()
        del policy.rewards[:]
        del policy.saved_log_probs[:]

        return returns

    running_reward = 10
    for i_episode in count(1):
        state, ep_reward = env.reset(), 0
        for t in range(1, 10000):  # Don't infinite loop while learning
            state = np.array(state).astype("float32")
            action = select_action(state)
            state, reward, done, _ = env.step(action)
            if args.render:
                env.render()
            policy.rewards.append(reward)
            ep_reward += reward
            if done:
                break

        running_reward = 0.05 * ep_reward + (1 - 0.05) * running_reward
        returns = finish_episode()
        if i_episode % args.log_interval == 0:
            print('Episode {}\tLast reward: {:.2f}\tAverage reward: {:.2f}'.
                  format(i_episode, ep_reward, running_reward))
            #print(returns)
        if running_reward > env.spec.reward_threshold:
            print("Solved! Running reward is now {} and "
                  "the last episode runs to {} time steps!".format(
                      running_reward, t))
            break