Commit 49b0e706
Authored on Sep 25, 2019 by LI Yunxiang
Committed by Bo Zhou on Sep 25, 2019

add dygraph pg (#155)

* add dygraph pg
* update acc. comments
* update comments
Parent: 89c3366b

Showing 7 changed files with 273 additions and 0 deletions (+273, −0)
examples/EagerMode/QuickStart/README.md (+32, −0)
examples/EagerMode/QuickStart/cartpole_agent.py (+49, −0)
examples/EagerMode/QuickStart/cartpole_model.py (+28, −0)
examples/EagerMode/QuickStart/policy_gradient.py (+44, −0)
examples/EagerMode/QuickStart/train.py (+76, −0)
examples/EagerMode/QuickStart/utils.py (+36, −0)
examples/QuickStart/utils.py (+8, −0)
examples/EagerMode/QuickStart/README.md
0 → 100644
## Dygraph Quick Start
Train an agent with PARL to solve the CartPole problem, a classical benchmark in RL. This is the dygraph version of [QuickStart][origin].
## How to use
### Dependencies:
+ [paddlepaddle>=1.5.1](https://github.com/PaddlePaddle/Paddle)
+ [parl](https://github.com/PaddlePaddle/PARL)
+ gym
### Start Training:
```
# Install dependencies
pip install paddlepaddle
# Or use Cuda: pip install paddlepaddle-gpu
pip install gym
git clone https://github.com/PaddlePaddle/PARL.git
cd PARL
pip install .
# Train model
cd examples/EagerMode/QuickStart/
python train.py
```
### Expected Result
<img src="https://github.com/PaddlePaddle/PARL/blob/develop/examples/QuickStart/performance.gif" width="300" height="200" alt="result"/>
The agent can get around 200 points in a few minutes.
[origin]: https://github.com/PaddlePaddle/PARL/tree/develop/examples/QuickStart
examples/EagerMode/QuickStart/cartpole_agent.py
0 → 100644
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import paddle.fluid as fluid
from parl.utils import machine_info


class CartpoleAgent(object):
    def __init__(self, alg, obs_dim, act_dim):
        self.alg = alg
        self.obs_dim = obs_dim
        self.act_dim = act_dim

    def sample(self, obs):
        # Stochastic action for training: draw from the predicted distribution.
        obs = np.expand_dims(obs, axis=0)
        act_prob = self.alg.predict(obs).numpy()
        act_prob = np.squeeze(act_prob, axis=0)
        act = np.random.choice(self.act_dim, p=act_prob)
        return act

    def predict(self, obs):
        # Greedy action for evaluation: pick the most probable action.
        obs = np.expand_dims(obs, axis=0)
        act_prob = self.alg.predict(obs).numpy()
        act_prob = np.squeeze(act_prob, axis=0)
        act = np.argmax(act_prob)
        return act

    def learn(self, obs, act, reward):
        # Reshape actions and rewards to column vectors, then update the policy.
        act = np.expand_dims(act, axis=-1)
        reward = np.expand_dims(reward, axis=-1)
        cost = self.alg.learn(obs, act, reward)
        return cost
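Note the split between exploration and evaluation above: sample() draws an action from the softmax output during training, while predict() takes the argmax at test time. A minimal, self-contained sketch of that distinction, using a hypothetical fixed distribution in place of a real model output:

```python
# Illustrative only: a fixed action distribution stands in for the model.
import numpy as np

act_prob = np.array([0.7, 0.3])            # hypothetical softmax output
sampled = np.random.choice(2, p=act_prob)  # sample(): stochastic, explores
greedy = int(np.argmax(act_prob))          # predict(): deterministic, exploits
print(sampled, greedy)
```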
examples/EagerMode/QuickStart/cartpole_model.py
0 → 100644
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle.fluid as fluid


class CartpoleModel(fluid.dygraph.Layer):
    def __init__(self, name_scope, act_dim):
        super(CartpoleModel, self).__init__(name_scope)
        hid1_size = act_dim * 10
        # Two fully connected layers: a tanh hidden layer and a
        # softmax output that yields one probability per action.
        self.fc1 = fluid.FC('fc1', hid1_size, act='tanh')
        self.fc2 = fluid.FC('fc2', act_dim, act='softmax')

    def forward(self, obs):
        out = self.fc1(obs)
        out = self.fc2(out)
        return out
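With ACT_DIM = 2 this is a 4 → 20 → 2 MLP. A numpy sketch of the same forward pass, with hypothetical random weights, just to make the shapes concrete (not the Paddle implementation):

```python
import numpy as np

N, OBS_DIM, ACT_DIM = 5, 4, 2
hid1_size = ACT_DIM * 10  # 20, as in CartpoleModel

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(OBS_DIM, hid1_size)), np.zeros(hid1_size)
W2, b2 = rng.normal(size=(hid1_size, ACT_DIM)), np.zeros(ACT_DIM)

obs = rng.normal(size=(N, OBS_DIM))        # batch of observations
h = np.tanh(obs @ W1 + b1)                 # fc1: (N, 20)
logits = h @ W2 + b2                       # fc2 pre-activation: (N, 2)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
assert np.allclose(probs.sum(axis=1), 1.0)  # each row is a distribution
```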
examples/EagerMode/QuickStart/policy_gradient.py
0 → 100644
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import paddle.fluid as fluid
import paddle.fluid.layers as layers


class PolicyGradient(object):
    def __init__(self, model, lr):
        self.model = model
        self.optimizer = fluid.optimizer.Adam(learning_rate=lr)

    def predict(self, obs):
        obs = fluid.dygraph.to_variable(obs)
        obs = layers.cast(obs, dtype='float32')
        return self.model(obs)

    def learn(self, obs, action, reward):
        obs = fluid.dygraph.to_variable(obs)
        obs = layers.cast(obs, dtype='float32')
        act_prob = self.model(obs)
        action = fluid.dygraph.to_variable(action)
        reward = fluid.dygraph.to_variable(reward)

        # cross_entropy returns -log(pi(a|s)); weighting it by the
        # (discounted, normalized) return gives the REINFORCE loss.
        log_prob = layers.cross_entropy(act_prob, action)
        cost = log_prob * reward
        cost = layers.cast(cost, dtype='float32')
        cost = layers.reduce_mean(cost)

        cost.backward()
        self.optimizer.minimize(cost)
        self.model.clear_gradients()
        return cost
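The learn() method above is REINFORCE: layers.cross_entropy(act_prob, action) computes $-\log \pi_\theta(a_t \mid s_t)$, and weighting it by the discounted, normalized return $G_t$ from utils.py before averaging gives

$$\text{cost} = -\frac{1}{T}\sum_{t=0}^{T-1} \log \pi_\theta(a_t \mid s_t)\, G_t,$$

so minimizing cost with Adam is stochastic gradient ascent on the policy-gradient estimate $\frac{1}{T}\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$ (standard REINFORCE notation, not symbols from the source).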
examples/EagerMode/QuickStart/train.py
0 → 100644
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import gym
import numpy as np
import paddle.fluid as fluid
from parl.utils import logger

from cartpole_model import CartpoleModel
from cartpole_agent import CartpoleAgent
from policy_gradient import PolicyGradient
from utils import calc_discount_norm_reward

OBS_DIM = 4
ACT_DIM = 2
GAMMA = 0.99
LEARNING_RATE = 1e-3


def run_episode(env, agent, train_or_test='train'):
    obs_list, action_list, reward_list = [], [], []
    obs = env.reset()
    while True:
        obs_list.append(obs)
        if train_or_test == 'train':
            action = agent.sample(obs)
        else:
            action = agent.predict(obs)
        action_list.append(action)

        obs, reward, done, _ = env.step(action)
        reward_list.append(reward)
        if done:
            break
    return obs_list, action_list, reward_list


def main():
    env = gym.make('CartPole-v0')
    model = CartpoleModel(name_scope='noIdeaWhyNeedThis', act_dim=ACT_DIM)
    alg = PolicyGradient(model, LEARNING_RATE)
    agent = CartpoleAgent(alg, OBS_DIM, ACT_DIM)

    with fluid.dygraph.guard():
        for i in range(1000):  # 1000 episodes
            obs_list, action_list, reward_list = run_episode(env, agent)
            if i % 10 == 0:
                logger.info("Episode {}, Reward Sum {}.".format(
                    i, sum(reward_list)))

            batch_obs = np.array(obs_list)
            batch_action = np.array(action_list)
            batch_reward = calc_discount_norm_reward(reward_list, GAMMA)

            agent.learn(batch_obs, batch_action, batch_reward)

            if (i + 1) % 100 == 0:
                _, _, reward_list = run_episode(
                    env, agent, train_or_test='test')
                total_reward = np.sum(reward_list)
                logger.info('Test reward: {}'.format(total_reward))


if __name__ == '__main__':
    main()
examples/EagerMode/QuickStart/utils.py
0 → 100644
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np


def calc_discount_norm_reward(reward_list, gamma):
    '''
    Calculate the discounted reward list according to the discount
    factor gamma, and normalize it.

    Args:
        reward_list(list): a list containing the rewards along the trajectory.
        gamma(float): the discount factor for accumulated reward computation.

    Returns:
        an array containing the discounted, normalized rewards.
    '''
    discount_norm_reward = np.zeros_like(reward_list)

    # Accumulate returns from the last step backwards:
    # G_t = r_t + gamma * G_{t+1}.
    discount_cumulative_reward = 0
    for i in reversed(range(0, len(reward_list))):
        discount_cumulative_reward = (
            gamma * discount_cumulative_reward + reward_list[i])
        discount_norm_reward[i] = discount_cumulative_reward
    # Normalize to zero mean and unit variance to reduce gradient variance.
    discount_norm_reward = discount_norm_reward - np.mean(discount_norm_reward)
    discount_norm_reward = discount_norm_reward / np.std(discount_norm_reward)
    return discount_norm_reward
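A quick worked example of the function above, assuming it is importable from this utils.py: for three steps of reward 1.0 with gamma = 0.99, the discounted returns are G2 = 1.0, G1 = 1.99, and G0 = 2.9701; normalization then shifts and scales them to zero mean and unit standard deviation.

```python
# Worked example (values rounded); assumes utils.py is on the path.
from utils import calc_discount_norm_reward

out = calc_discount_norm_reward([1.0, 1.0, 1.0], gamma=0.99)
print(out)                    # approx [ 1.223  0.004 -1.227]
print(out.mean(), out.std())  # approx 0.0 and 1.0
```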
examples/QuickStart/utils.py
@@ -16,6 +16,14 @@ import numpy as np

def calc_discount_norm_reward(reward_list, gamma):
    '''
    Calculate the discounted reward list according to the discount
    factor gamma, and normalize it.

    Args:
        reward_list(list): a list containing the rewards along the trajectory.
        gamma(float): the discount factor for accumulated reward computation.

    Returns:
        an array containing the discounted, normalized rewards.
    '''
    discount_norm_reward = np.zeros_like(reward_list)

    discount_cumulative_reward = 0