Commit a7670972 authored by LI Yunxiang, committed by Hongsheng Zeng

add new_alg.rst (#123)

* add new_alg.rst

* rename LiftSim_demo as LiftSim_baseline

* Update new_alg.rst

* Update new_alg.rst
Parent bc2c3ad3
@@ -38,7 +38,8 @@ Algorithm
In this tutorial, we solve the benchmark `Cartpole` using the `Policy Gradient` algorithm, which has been implemented in our repository.
Thus, we can simply use this algorithm by importing it from ``parl.algorithms``.
We have also published various algorithms in PARL; please visit this `page <https://parl.readthedocs.io/en/latest/implementations.html>`_ for more details.
For those who want to implement a new algorithm, please follow this `tutorial <https://parl.readthedocs.io/en/latest/new_alg.html>`_.

.. code-block:: python
@@ -59,6 +59,7 @@ Abstractions
   :caption: Tutorial

   getting_started.rst
   new_alg.rst

.. toctree::
   :maxdepth: 2
Create Customized Algorithms
===============================

Goal of this tutorial:

- Learn how to implement your own algorithms.

Overview
-----------

To build a new algorithm, you need to inherit from ``parl.Algorithm`` and implement three basic methods: ``sample``, ``predict`` and ``learn``.

Methods
-----------

- ``__init__``

  Since an algorithm updates the weights of its models, this method should define the models it operates on, which inherit from ``parl.Model`` (for example, ``self.model`` below).
  You can also set hyperparameters here, such as the learning rate, reward decay and action dimension, which will be used in the following steps.

- ``predict``

  This method defines how to choose actions. For instance, you can use a policy model to predict actions.

- ``sample``

  Based on the ``predict`` method, ``sample`` generates actions with exploration noise. Use this method for exploration if needed.

- ``learn``

  Define the loss function in ``learn``; it is used to update the weights of ``self.model``.
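
Putting these four methods together, a customized algorithm might look like the minimal sketch below.
``MyAlgorithm``, its placeholder loss and its hyperparameters are illustrative only and not part of PARL; the imports assume the fluid-based API used in the DQN example that follows.

.. code-block:: python

    # Minimal sketch of a customized algorithm (illustrative only);
    # MyAlgorithm and its placeholder loss are not part of PARL.
    import parl
    from parl import layers          # PARL's wrapped fluid layers
    import paddle.fluid as fluid


    class MyAlgorithm(parl.Algorithm):
        def __init__(self, model, act_dim=None, lr=None):
            # the parl.Model whose weights this algorithm updates
            self.model = model
            self.act_dim = act_dim
            self.lr = lr

        def predict(self, obs):
            # choose actions with the model's forward network
            return self.model.value(obs)

        def sample(self, obs):
            # exploration: reuse predict and add noise if needed
            return self.predict(obs)

        def learn(self, obs, action, reward, next_obs, terminal):
            # define a loss and minimize it to update self.model
            pred_value = self.model.value(obs)
            cost = layers.reduce_mean(pred_value)  # placeholder loss
            optimizer = fluid.optimizer.Adam(learning_rate=self.lr)
            optimizer.minimize(cost)
            return cost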

Example: DQN
--------------

This example shows how to implement the DQN algorithm by subclassing ``parl.Algorithm``, following the steps above.
Within the class ``DQN(Algorithm)``, we define the following methods:

- ``__init__(self, model, act_dim=None, gamma=None, lr=None)``

  Here we define ``self.model`` and ``self.target_model`` of DQN, both instances of ``parl.Model``.
  We also set the hyperparameters ``act_dim``, ``gamma`` and ``lr``, which are used later in the ``learn`` method.

.. code-block:: python

    def __init__(self,
                 model,
                 act_dim=None,
                 gamma=None,
                 lr=None):
        """ DQN algorithm

        Args:
            model (parl.Model): model defining the forward network of the Q function
            act_dim (int): dimension of the action space
            gamma (float): discount factor for reward computation
            lr (float): learning rate
        """
        self.model = model
        self.target_model = copy.deepcopy(model)

        assert isinstance(act_dim, int)
        assert isinstance(gamma, float)
        assert isinstance(lr, float)
        self.act_dim = act_dim
        self.gamma = gamma
        self.lr = lr

- ``predict(self, obs)``

  This method uses the forward network defined in ``self.model`` to predict action values directly from observations.

.. code-block:: python

    def predict(self, obs):
        """ use value model self.model to predict the action value
        """
        return self.model.value(obs)

- ``learn(self, obs, action, reward, next_obs, terminal)``

  The ``learn`` method computes the cost of the value function from the predicted value and the target value.
  The ``Agent`` then uses this cost to update the weights in ``self.model``.

.. code-block:: python

    def learn(self, obs, action, reward, next_obs, terminal):
        """ update value model self.model with DQN algorithm
        """
        # Q values predicted by the current model
        pred_value = self.model.value(obs)

        # bootstrap target from the frozen target network
        next_pred_value = self.target_model.value(next_obs)
        best_v = layers.reduce_max(next_pred_value, dim=1)
        best_v.stop_gradient = True
        target = reward + (
            1.0 - layers.cast(terminal, dtype='float32')) * self.gamma * best_v

        # pick the predicted Q value of the executed action
        action_onehot = layers.one_hot(action, self.act_dim)
        action_onehot = layers.cast(action_onehot, dtype='float32')
        pred_action_value = layers.reduce_sum(
            layers.elementwise_mul(action_onehot, pred_value), dim=1)

        # mean squared TD error, minimized with Adam
        cost = layers.square_error_cost(pred_action_value, target)
        cost = layers.reduce_mean(cost)
        optimizer = fluid.optimizer.Adam(self.lr, epsilon=1e-3)
        optimizer.minimize(cost)
        return cost

- ``sync_target(self)``

  This method synchronizes the weights in ``self.target_model`` with those in ``self.model``.
  It is a step specific to the DQN algorithm.

.. code-block:: python

    def sync_target(self, gpu_id=None):
        """ sync weights of self.model to self.target_model
        """
        self.model.sync_weights_to(self.target_model)
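
With these methods defined, the algorithm can be instantiated and handed to an ``Agent``.
The snippet below is a hypothetical usage sketch: ``ToyQModel`` is a placeholder for any ``parl.Model`` subclass that implements ``value(obs)``, and the hyperparameter values are illustrative.
In the fluid-based API, ``learn``, ``predict`` and ``sync_target`` are normally called inside the Agent's ``build_program``.

.. code-block:: python

    import parl
    from parl import layers

    # ToyQModel is only an illustrative parl.Model; any model that
    # implements value(obs) works here.
    class ToyQModel(parl.Model):
        def __init__(self, act_dim):
            self.fc1 = layers.fc(size=128, act='relu')
            self.fc2 = layers.fc(size=act_dim)

        def value(self, obs):
            h = self.fc1(obs)
            return self.fc2(h)

    model = ToyQModel(act_dim=2)
    algorithm = DQN(model, act_dim=2, gamma=0.99, lr=1e-3)

    # refresh the target network periodically during training
    # (typically inside the Agent's build_program)
    algorithm.sync_target()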