Commit a7670972 authored by LI Yunxiang, committed by Hongsheng Zeng

add new_alg.rst (#123)

* add new_alg.rst

* rename LiftSim_demo as LiftSim_baseline

* Update new_alg.rst

* Update new_alg.rst
Parent bc2c3ad3
@@ -38,7 +38,8 @@ Algorithm
In this tutorial, we solve the benchmark `Cartpole` using the `Policy Gradient` algorithm, which has been implemented in our repository.
Thus, we can simply use this algorithm by importing it from ``parl.algorithms``.
We have also published various algorithms in PARL; please visit this `page <https://parl.readthedocs.io/en/latest/implementations.html>`_ for more details.
For those who want to implement a new algorithm, please follow this `tutorial <https://parl.readthedocs.io/en/latest/new_alg.html>`_.

.. code-block:: python
@@ -59,6 +59,7 @@ Abstractions
   :caption: Tutorial

   getting_started.rst
   new_alg.rst

.. toctree::
   :maxdepth: 2
Create Customized Algorithms
===============================

Goal of this tutorial:

- Learn how to implement your own algorithms.

Overview
-----------

To build a new algorithm, you need to inherit from ``parl.Algorithm`` and implement three basic methods: ``sample``, ``predict`` and ``learn``.

Methods
-----------

- ``__init__``

  Since an algorithm updates the weights of its models, this method should define the models it operates on, which inherit from ``parl.Model`` (for example, ``self.model`` below).
  You can also set hyperparameters here, such as the learning rate, reward decay and action dimension, which will be used in the following steps.

- ``predict``

  This method defines how to choose actions. For instance, you can use a policy model to predict actions.

- ``sample``

  Based on the ``predict`` method, ``sample`` generates actions with exploration noise. Use this method for exploration if needed.

- ``learn``

  Define the loss function in ``learn``; it is used to update the weights of ``self.model``.
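
Putting these four methods together, a customized algorithm might look like the minimal sketch below.
``MyAlgorithm``, its placeholder loss and its hyperparameters are illustrative only and not part of PARL; the imports assume the fluid-based API used in the DQN example that follows.

.. code-block:: python

    # Minimal sketch of a customized algorithm (illustrative only);
    # MyAlgorithm and its placeholder loss are not part of PARL.
    import parl
    from parl import layers          # PARL's wrapped fluid layers
    import paddle.fluid as fluid


    class MyAlgorithm(parl.Algorithm):
        def __init__(self, model, act_dim=None, lr=None):
            # the parl.Model whose weights this algorithm updates
            self.model = model
            self.act_dim = act_dim
            self.lr = lr

        def predict(self, obs):
            # choose actions with the model's forward network
            return self.model.value(obs)

        def sample(self, obs):
            # exploration: reuse predict and add noise if needed
            return self.predict(obs)

        def learn(self, obs, action, reward, next_obs, terminal):
            # define a loss and minimize it to update self.model
            pred_value = self.model.value(obs)
            cost = layers.reduce_mean(pred_value)  # placeholder loss
            optimizer = fluid.optimizer.Adam(learning_rate=self.lr)
            optimizer.minimize(cost)
            return cost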

Example: DQN
--------------

This example shows how to implement the DQN algorithm by subclassing ``parl.Algorithm``, following the steps above.
Within the class ``DQN(Algorithm)``, we define the following methods:

- ``__init__(self, model, act_dim=None, gamma=None, lr=None)``

  Here we define ``self.model`` and ``self.target_model`` of DQN, both instances of ``parl.Model``.
  We also set the hyperparameters ``act_dim``, ``gamma`` and ``lr``, which are used later in the ``learn`` method.

.. code-block:: python

    def __init__(self,
                 model,
                 act_dim=None,
                 gamma=None,
                 lr=None):
        """ DQN algorithm

        Args:
            model (parl.Model): model defining the forward network of the Q function
            act_dim (int): dimension of the action space
            gamma (float): discount factor for reward computation
            lr (float): learning rate
        """
        self.model = model
        self.target_model = copy.deepcopy(model)

        assert isinstance(act_dim, int)
        assert isinstance(gamma, float)
        assert isinstance(lr, float)
        self.act_dim = act_dim
        self.gamma = gamma
        self.lr = lr

- ``predict(self, obs)``

  This method uses the forward network defined in ``self.model`` to predict action values directly from observations.

.. code-block:: python

    def predict(self, obs):
        """ use value model self.model to predict the action value
        """
        return self.model.value(obs)

- ``learn(self, obs, action, reward, next_obs, terminal)``

  The ``learn`` method computes the cost of the value function from the predicted value and the target value.
  The ``Agent`` then uses this cost to update the weights in ``self.model``.

.. code-block:: python

    def learn(self, obs, action, reward, next_obs, terminal):
        """ update value model self.model with DQN algorithm
        """
        # Q values predicted by the current model
        pred_value = self.model.value(obs)

        # bootstrap target from the frozen target network
        next_pred_value = self.target_model.value(next_obs)
        best_v = layers.reduce_max(next_pred_value, dim=1)
        best_v.stop_gradient = True
        target = reward + (
            1.0 - layers.cast(terminal, dtype='float32')) * self.gamma * best_v

        # pick the predicted Q value of the executed action
        action_onehot = layers.one_hot(action, self.act_dim)
        action_onehot = layers.cast(action_onehot, dtype='float32')
        pred_action_value = layers.reduce_sum(
            layers.elementwise_mul(action_onehot, pred_value), dim=1)

        # mean squared TD error, minimized with Adam
        cost = layers.square_error_cost(pred_action_value, target)
        cost = layers.reduce_mean(cost)
        optimizer = fluid.optimizer.Adam(self.lr, epsilon=1e-3)
        optimizer.minimize(cost)
        return cost

- ``sync_target(self)``

  This method synchronizes the weights in ``self.target_model`` with those in ``self.model``.
  It is a step specific to the DQN algorithm.

.. code-block:: python

    def sync_target(self, gpu_id=None):
        """ sync weights of self.model to self.target_model
        """
        self.model.sync_weights_to(self.target_model)
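
With these methods defined, the algorithm can be instantiated and handed to an ``Agent``.
The snippet below is a hypothetical usage sketch: ``ToyQModel`` is a placeholder for any ``parl.Model`` subclass that implements ``value(obs)``, and the hyperparameter values are illustrative.
In the fluid-based API, ``learn``, ``predict`` and ``sync_target`` are normally called inside the Agent's ``build_program``.

.. code-block:: python

    import parl
    from parl import layers

    # ToyQModel is only an illustrative parl.Model; any model that
    # implements value(obs) works here.
    class ToyQModel(parl.Model):
        def __init__(self, act_dim):
            self.fc1 = layers.fc(size=128, act='relu')
            self.fc2 = layers.fc(size=act_dim)

        def value(self, obs):
            h = self.fc1(obs)
            return self.fc2(h)

    model = ToyQModel(act_dim=2)
    algorithm = DQN(model, act_dim=2, gamma=0.99, lr=1e-3)

    # refresh the target network periodically during training
    # (typically inside the Agent's build_program)
    algorithm.sync_target()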