Commit bf2f7f0e authored by PaParaZz1

Deploying to gh-pages from @ eb6c60cc 🚀

Parent be6af82a
......@@ -7,4 +7,6 @@ RL Environments Tutorial
    atari
    overcooked
-   bsuite
\ No newline at end of file
+   bsuite
+   slime_volleyball
\ No newline at end of file
Slime Volleyball
~~~~~~~~~~~~~~~~~
Overview
============
Slime Volleyball is a two-player, match-based environment that offers two kinds of observation spaces: a vector form and a pixel (image) form. The action space is usually simplified to a discrete one, and the environment is commonly used as a basic benchmark for ``self-play``-related algorithms. It is actually a collection of environments (there are 3 sub-environments: ``SlimeVolley-v0``, ``SlimeVolleyPixel-v0`` and ``SlimeVolleyNoFrameskip-v0``); the ``SlimeVolley-v0`` game is shown in the figure below.
.. image:: ./images/slime_volleyball.gif
:align: center
Installation
===============
Installation Methods
------------------------
Install ``slimevolleygym``. You can install it directly with ``pip``, or build it together with **DI-engine**.

.. code:: shell

    # Method 1: install directly with pip
    pip install slimevolleygym
Installation Check
------------------------
After the installation is complete, you can verify it with the following commands:

.. code:: python

    import gym
    import slimevolleygym

    env = gym.make("SlimeVolley-v0")
    obs = env.reset()
    print(obs.shape)  # (12, )
DI-engine Images
--------------------

Since Slime Volleyball is easy to install, DI-engine does not provide a dedicated Docker image for it. You can build your own image on top of the base image ``opendilab/ding:nightly``, or visit the `docker
hub <https://hub.docker.com/repository/docker/opendilab/ding>`__ for more images.
.. _Original Environment Space:
Original Environment
========================
Note: ``SlimeVolley-v0`` is used as the example here, since benchmarking ``self-play``-related algorithms naturally favors the simplest environment. If you want to use the other two environments, you can check the original repository and adapt them according to `DI-engine's env API <https://di-engine-docs.readthedocs.io/en/main-zh/feature/env_overview.html>`_.
.. _Observation Space-1:
Observation Space
--------------------------
- The observation space is a vector of size ``(12, )`` that contains the absolute coordinates of the agent, the opponent and the ball, with two consecutive frames concatenated together. The data type is ``float64``,
  i.e. ``(x_agent, y_agent, x_agent_next, y_agent_next, x_ball, y_ball, x_ball_next, y_ball_next, x_opponent, y_opponent, x_opponent_next, y_opponent_next)`` (a small unpacking sketch is given below).
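For readability, the minimal sketch below unpacks one observation into named variables that follow the layout just described. The variable names are purely illustrative and are not part of the environment's API.

.. code:: python

    import gym
    import slimevolleygym

    env = gym.make("SlimeVolley-v0")
    obs = env.reset()
    # Unpack the 12-dim vector following the layout described above.
    (x_agent, y_agent, x_agent_next, y_agent_next,
     x_ball, y_ball, x_ball_next, y_ball_next,
     x_opponent, y_opponent, x_opponent_next, y_opponent_next) = obs
    print(x_ball, y_ball)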
.. _Action Space-1:
Action Space
------------------
- The original action space of ``SlimeVolley-v0`` is ``MultiBinary(3)``, i.e. three kinds of actions that can be executed simultaneously. Each action corresponds to two cases: 0 (not executed) and 1 (executed),
  e.g. ``(1, 0, 1)`` means that the first and third actions are executed at the same time. The data type is ``int``, and the action should be passed in as a python list (or a 1-dimensional np array of size 3, e.g. ``np.array([0, 1, 0])``).
- The actual implementation does not strictly limit the values to 0 and 1: values greater than 0 are treated as 1, while values less than or equal to 0 are treated as 0.
- In the ``SlimeVolley-v0`` environment, the three basic actions are:

  - 0: forward
  - 1: backward
  - 2: jump

- In the ``SlimeVolley-v0`` environment, the combined actions are listed below (a minimal mapping sketch follows this list):

  - [0, 0, 0], NOOP
  - [1, 0, 0], LEFT (forward)
  - [1, 0, 1], UPLEFT (forward jump)
  - [0, 0, 1], UP (jump)
  - [0, 1, 1], UPRIGHT (backward jump)
  - [0, 1, 0], RIGHT (backward)
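The sketch below shows one possible way to map a discrete action index onto the ``MultiBinary(3)`` action expected by the original environment. The ordering simply follows the table above; it is an assumption for illustration, not necessarily the exact ordering used inside DI-engine's wrapper.

.. code:: python

    import gym
    import slimevolleygym

    # Hypothetical mapping from a discrete index to the MultiBinary(3) action;
    # the ordering follows the table above.
    DISCRETE_TO_MULTIBINARY = [
        [0, 0, 0],  # NOOP
        [1, 0, 0],  # LEFT (forward)
        [1, 0, 1],  # UPLEFT (forward jump)
        [0, 0, 1],  # UP (jump)
        [0, 1, 1],  # UPRIGHT (backward jump)
        [0, 1, 0],  # RIGHT (backward)
    ]

    env = gym.make("SlimeVolley-v0")
    obs = env.reset()
    # Execute the "forward jump" combination for one step.
    obs, rew, done, info = env.step(DISCRETE_TO_MULTIBINARY[2])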
Reward Space
-----------------
- The reward is the score of the game: if the ball lands on the ground of your own field, the reward is -1; if it lands on the ground of the opponent's field, the reward is +1; while the rally is still in progress, the reward is 0.
.. _Other-1:
Other
--------
- The end of the game corresponds to the end of an episode. There are two ending conditions:

  - 1. One side's life points drop to 0 (the initial value is 5 by default).
  - 2. The maximum number of environment steps is reached (3000 by default).

- The game supports two kinds of matches: agent vs. built-in bot (bot on the left, agent on the right) and agent vs. agent.
- The built-in bot is a very simple agent trained with an RNN: `bot_link <https://blog.otoro.net/2015/03/28/neural-slime-volleyball/>`_
- By default, only one side's observation is returned. The other side's observation and related information can be found in the ``info`` field (see the sketch below).
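The sketch below shows how the opponent-side observation can be read from ``info`` in the original ``slimevolleygym`` environment. The key name ``otherObs`` follows the upstream repository and should be treated as an assumption; check the repository if it differs.

.. code:: python

    import gym
    import slimevolleygym

    env = gym.make("SlimeVolley-v0")
    obs_right = env.reset()                      # observation of the controlled (right) agent
    obs, rew, done, info = env.step([0, 0, 0])   # NOOP action
    obs_left = info.get('otherObs')              # opponent-side observation, if provided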
Key Facts
==========

1. 1-dimensional vector observation space (of size ``(12, )``), with information given in absolute coordinates
2. ``MultiBinary`` action space
3. Sparse rewards (maximum life value of 5, maximum of 3000 steps; a non-zero reward is obtained only when a life point is lost)
.. _RL Environment Space:
RL Environment Space
======================
.. _Observation Space-2:
Observation Space
--------------------

- Transform the observation vector into a one-dimensional np array of size ``(12, )``. The data type is ``np.float32``.
Action Space
--------------

- Transform the ``MultiBinary`` action space into a discrete action space of size 6 (enumerating the meaningful combinations listed above). The final action is a one-dimensional np array of size ``(1, )``. The data type is ``np.int64``.
.. _Reward Space-2:
Reward Space
--------------

- Transform the reward into a one-dimensional np array of size ``(1, )``. The data type is ``np.float32``, and the value is one of ``-1``, ``0`` or ``1``.
The spaces above, described in 'OpenAI Gym' format:

.. code:: python

    import gym
    import numpy as np

    obs_space = gym.spaces.Box(low=-np.inf, high=np.inf, shape=(12, ), dtype=np.float32)
    act_space = gym.spaces.Discrete(6)
    rew_space = gym.spaces.Box(low=-1, high=1, shape=(1, ), dtype=np.float32)
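As a quick sanity check, the sketch below creates the DI-engine wrapper (the same ``SlimeVolleyEnv`` used later on this page) and prints the shapes and dtypes of the transformed spaces. The values in the comments are expectations based on the descriptions above, not guarantees.

.. code:: python

    from easydict import EasyDict
    from dizoo.slime_volley.envs.slime_volley_env import SlimeVolleyEnv

    env = SlimeVolleyEnv(EasyDict({'env_id': 'SlimeVolley-v0', 'agent_vs_agent': False}))
    obs = env.reset()
    print(obs.shape, obs.dtype)        # expected: (12,) float32
    action = env.random_action()       # expected: np.int64 array of shape (1,)
    timestep = env.step(action)
    print(timestep.reward.shape, timestep.reward.dtype)  # expected: (1,) float32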
.. _Other-2:
Other
------

- The ``info`` dict returned by the environment's ``step`` method must contain the ``final_eval_reward`` key, which holds the evaluation metric of the whole episode (the difference in life points between the two players).
- The space definitions above all describe the single-agent case. In the agent-vs-agent (multi-agent) case, the corresponding obs/action/reward information of both sides is stacked together,
  i.e. the observation space changes from ``(12, )`` to ``(2, 12)``, representing the observations of both sides (see the sketch below).
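The sketch below assumes that setting ``agent_vs_agent=True`` enables the two-agent mode described above and checks the stacked observation shape; verify the exact interface against the dizoo source.

.. code:: python

    from easydict import EasyDict
    from dizoo.slime_volley.envs.slime_volley_env import SlimeVolleyEnv

    env = SlimeVolleyEnv(EasyDict({'env_id': 'SlimeVolley-v0', 'agent_vs_agent': True}))
    obs = env.reset()
    print(obs.shape)  # expected: (2, 12), one row of observations per side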
.. _Other-3:
Other
======

Lazy Initialization
----------------------

In order to support environment vectorization, environment instances are often initialized lazily: the ``__init__`` method does not create the real original environment, but only stores the corresponding parameters and configuration. The real original environment is created the first time ``reset`` is called. A generic sketch of this pattern is given below.
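This is a generic sketch of the lazy-initialization pattern, not DI-engine's actual ``SlimeVolleyEnv`` implementation; the class and attribute names are illustrative only.

.. code:: python

    import gym
    import slimevolleygym


    class LazySlimeVolleyEnvSketch:

        def __init__(self, cfg):
            # Only store the config; do not create the heavy gym env yet.
            self._cfg = cfg
            self._env = None
            self._init_flag = False

        def reset(self):
            if not self._init_flag:
                # The real original environment is created on the first reset.
                self._env = gym.make(self._cfg['env_id'])
                self._init_flag = True
            return self._env.reset()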
Random Seed
------------------
- There are two kinds of random seeds in the environment: one is the random seed of the original environment; the other is the random seed used by the various environment-space transformations (e.g. ``random``, ``np.random``).
- As a user, you only need to set these seeds through the ``seed`` method, without caring about the implementation details.
- Implementation detail: the seed of the original environment is set inside the RL env's ``reset`` method, right before the original env's own ``reset`` is called.
- Implementation detail: the seeds for ``random`` / ``np.random`` are set inside the env's ``seed`` method.
Difference between training env and evaluation env
----------------------------------------------------------
- The training env uses a dynamic random seed, i.e. every episode uses a different random seed produced by a random generator, while the seed of this generator is fixed by the env's ``seed`` method and stays the same throughout an experiment. The evaluation env uses a static random seed, i.e. every episode uses the same random seed, which is set directly by the ``seed`` method. A minimal sketch of this mechanism follows this list.
- The training env and the evaluation env use different pre-processing wrappers: ``episode_life`` and ``clip_reward`` are not used in the evaluation env.
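Below is a minimal sketch, not DI-engine's actual implementation, of how a dynamic seed (training) and a static seed (evaluation) can be handled; class and attribute names are illustrative.

.. code:: python

    import random
    import numpy as np


    class SeededEnvSketch:

        def seed(self, seed: int, dynamic_seed: bool = True) -> None:
            self._seed = seed
            self._dynamic_seed = dynamic_seed
            # Seed the python / numpy RNGs used by the space transformations.
            random.seed(self._seed)
            np.random.seed(self._seed)

        def reset(self):
            if self._dynamic_seed:
                # Training env: a different seed every episode, drawn from a
                # generator that was itself seeded with self._seed.
                episode_seed = self._seed + np.random.randint(0, 2 ** 20)
            else:
                # Evaluation env: the same seed every episode.
                episode_seed = self._seed
            # The original env is seeded right before its own reset is called:
            # self._env.seed(episode_seed); obs = self._env.reset()
            return episode_seed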
Save the replay video
----------------------------
After the env is created and before it is reset, call the ``enable_save_replay`` method to specify where the replay video should be saved. The environment then automatically saves the replay after each episode finishes (internally ``gym.wrappers.Monitor`` is used, which depends on ``ffmpeg``). The code below runs one episode and saves the replay video to a file like ``./video/xxx.mp4``.

.. code:: python

    from easydict import EasyDict
    from dizoo.slime_volley.envs.slime_volley_env import SlimeVolleyEnv

    env = SlimeVolleyEnv(EasyDict({'env_id': 'SlimeVolley-v0', 'agent_vs_agent': False}))
    env.enable_save_replay(replay_path='./video')
    obs = env.reset()

    while True:
        action = env.random_action()
        timestep = env.step(action)
        if timestep.done:
            print('Episode is over, final eval reward is: {}'.format(timestep.info['final_eval_reward']))
            break
DI-zoo runnable code
====================
Complete training configuration can be found on `github
link <https://github.com/opendilab/DI-engine/tree/main/dizoo/slime_volley/entry>`__.
For a specific entry file, e.g. ``slime_volley_selfplay_ppo_main.py``, you can run the demo as shown below:
.. code:: python

    import os
    import gym
    import numpy as np
    import copy
    import torch
    from tensorboardX import SummaryWriter
    from functools import partial

    from ding.config import compile_config
    from ding.worker import BaseLearner, BattleSampleSerialCollector, NaiveReplayBuffer, InteractionSerialEvaluator
    from ding.envs import SyncSubprocessEnvManager
    from ding.policy import PPOPolicy
    from ding.model import VAC
    from ding.utils import set_pkg_seed
    from dizoo.slime_volley.envs import SlimeVolleyEnv
    from dizoo.slime_volley.config.slime_volley_ppo_config import main_config


    def main(cfg, seed=0, max_iterations=int(1e10)):
        cfg = compile_config(
            cfg,
            SyncSubprocessEnvManager,
            PPOPolicy,
            BaseLearner,
            BattleSampleSerialCollector,
            InteractionSerialEvaluator,
            NaiveReplayBuffer,
            save_cfg=True
        )
        # Collector envs run in agent-vs-agent (self-play) mode, while evaluator
        # envs run against the built-in bot.
        collector_env_num, evaluator_env_num = cfg.env.collector_env_num, cfg.env.evaluator_env_num
        collector_env_cfg = copy.deepcopy(cfg.env)
        collector_env_cfg.agent_vs_agent = True
        evaluator_env_cfg = copy.deepcopy(cfg.env)
        evaluator_env_cfg.agent_vs_agent = False
        collector_env = SyncSubprocessEnvManager(
            env_fn=[partial(SlimeVolleyEnv, collector_env_cfg) for _ in range(collector_env_num)], cfg=cfg.env.manager
        )
        evaluator_env = SyncSubprocessEnvManager(
            env_fn=[partial(SlimeVolleyEnv, evaluator_env_cfg) for _ in range(evaluator_env_num)], cfg=cfg.env.manager
        )
        collector_env.seed(seed)
        evaluator_env.seed(seed, dynamic_seed=False)
        set_pkg_seed(seed, use_cuda=cfg.policy.cuda)

        # One shared policy plays both sides during self-play collection.
        model = VAC(**cfg.policy.model)
        policy = PPOPolicy(cfg.policy, model=model)
        tb_logger = SummaryWriter(os.path.join('./{}/log/'.format(cfg.exp_name), 'serial'))
        learner = BaseLearner(
            cfg.policy.learn.learner, policy.learn_mode, tb_logger, exp_name=cfg.exp_name, instance_name='learner1'
        )
        collector = BattleSampleSerialCollector(
            cfg.policy.collect.collector,
            collector_env, [policy.collect_mode, policy.collect_mode],
            tb_logger,
            exp_name=cfg.exp_name
        )
        evaluator_cfg = copy.deepcopy(cfg.policy.eval.evaluator)
        evaluator_cfg.stop_value = cfg.env.stop_value
        evaluator = InteractionSerialEvaluator(
            evaluator_cfg,
            evaluator_env,
            policy.eval_mode,
            tb_logger,
            exp_name=cfg.exp_name,
            instance_name='builtin_ai_evaluator'
        )

        learner.call_hook('before_run')
        for _ in range(max_iterations):
            # Periodically evaluate against the built-in bot and stop once the
            # stop value is reached.
            if evaluator.should_eval(learner.train_iter):
                stop_flag, reward = evaluator.eval(learner.save_checkpoint, learner.train_iter, collector.envstep)
                if stop_flag:
                    break
            # Collect data from both sides and train on the merged batch.
            new_data, _ = collector.collect(train_iter=learner.train_iter)
            train_data = new_data[0] + new_data[1]
            learner.train(train_data, collector.envstep)
        learner.call_hook('after_run')


    if __name__ == "__main__":
        main(main_config)
Note: To run the agent vs. built-in bot mode, run ``python slime_volley_ppo_config.py``.
Note: For some specific algorithms, use the corresponding dedicated entry function.
Algorithm Benchmark
====================

- SlimeVolley-v0 (an average reward greater than 1 against the built-in bot is considered a good agent)
- SlimeVolley-v0 + PPO + vs Bot
.. image:: images/slime_volleyball_ppo_vsbot.png
:align: center
- SlimeVolley-v0 + PPO + self-play
.. image:: images/slime_volleyball_ppo_selfplay.png
:align: center
:scale: 70%