Commit bf2f7f0e authored by PaParaZz1

Deploying to gh-pages from @ eb6c60cc 🚀

Parent be6af82a
......@@ -7,4 +7,6 @@ RL Environments Tutorial
    atari
    overcooked
-   bsuite
\ No newline at end of file
+   bsuite
+   slime_volleyball
\ No newline at end of file
Slime Volleyball
~~~~~~~~~~~~~~~~~
Overview
============
Slime Volleyball is a two-player, match-based environment that offers two kinds of observation spaces: a vector form and a pixel (image) form. The action space is usually simplified to a discrete one, and the environment is commonly used as a basic benchmark for ``self-play``-related algorithms. It is actually a collection of environments (there are 3 sub-environments: ``SlimeVolley-v0``, ``SlimeVolleyPixel-v0`` and ``SlimeVolleyNoFrameskip-v0``); the ``SlimeVolley-v0`` game is shown in the figure below.
.. image:: ./images/slime_volleyball.gif
:align: center
Installation
===============
Installation Methods
------------------------
Install ``slimevolleygym``. You can install it directly with ``pip``, or build it together with **DI-engine**.

.. code:: shell

    # Method 1: install directly with pip
    pip install slimevolleygym
Installation Check
------------------------
After the installation is complete, you can verify it with the following commands:

.. code:: python

    import gym
    import slimevolleygym

    env = gym.make("SlimeVolley-v0")
    obs = env.reset()
    print(obs.shape)  # (12, )
DI-engine Images
--------------------

Since Slime Volleyball is easy to install, DI-engine does not provide a dedicated Docker image for it. You can build your own image on top of the base image ``opendilab/ding:nightly``, or visit the `docker
hub <https://hub.docker.com/repository/docker/opendilab/ding>`__ for more images.
.. _Original Environment Space:
Original Environment
========================
Note: ``SlimeVolley-v0`` is used as the example here, since benchmarking ``self-play``-related algorithms naturally favors the simplest environment. If you want to use the other two environments, you can check the original repository and adapt them according to `DI-engine's env API <https://di-engine-docs.readthedocs.io/en/main-zh/feature/env_overview.html>`_.
.. _Observation Space-1:
Observation Space
--------------------------
- The observation space is a vector of size ``(12, )`` that contains the absolute coordinates of the agent, the opponent and the ball, with two consecutive frames concatenated together. The data type is ``float64``,
  i.e. ``(x_agent, y_agent, x_agent_next, y_agent_next, x_ball, y_ball, x_ball_next, y_ball_next, x_opponent, y_opponent, x_opponent_next, y_opponent_next)`` (a small unpacking sketch is given below).
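For readability, the minimal sketch below unpacks one observation into named variables that follow the layout just described. The variable names are purely illustrative and are not part of the environment's API.

.. code:: python

    import gym
    import slimevolleygym

    env = gym.make("SlimeVolley-v0")
    obs = env.reset()
    # Unpack the 12-dim vector following the layout described above.
    (x_agent, y_agent, x_agent_next, y_agent_next,
     x_ball, y_ball, x_ball_next, y_ball_next,
     x_opponent, y_opponent, x_opponent_next, y_opponent_next) = obs
    print(x_ball, y_ball)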
.. _Action Space-1:
Action Space
------------------
- The original action space of ``SlimeVolley-v0`` is ``MultiBinary(3)``, i.e. three kinds of actions that can be executed simultaneously. Each action corresponds to two cases: 0 (not executed) and 1 (executed),
  e.g. ``(1, 0, 1)`` means that the first and third actions are executed at the same time. The data type is ``int``, and the action should be passed in as a python list (or a 1-dimensional np array of size 3, e.g. ``np.array([0, 1, 0])``).
- The actual implementation does not strictly limit the values to 0 and 1: values greater than 0 are treated as 1, while values less than or equal to 0 are treated as 0.
- In the ``SlimeVolley-v0`` environment, the three basic actions are:

  - 0: forward
  - 1: backward
  - 2: jump

- In the ``SlimeVolley-v0`` environment, the combined actions are listed below (a minimal mapping sketch follows this list):

  - [0, 0, 0], NOOP
  - [1, 0, 0], LEFT (forward)
  - [1, 0, 1], UPLEFT (forward jump)
  - [0, 0, 1], UP (jump)
  - [0, 1, 1], UPRIGHT (backward jump)
  - [0, 1, 0], RIGHT (backward)
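The sketch below shows one possible way to map a discrete action index onto the ``MultiBinary(3)`` action expected by the original environment. The ordering simply follows the table above; it is an assumption for illustration, not necessarily the exact ordering used inside DI-engine's wrapper.

.. code:: python

    import gym
    import slimevolleygym

    # Hypothetical mapping from a discrete index to the MultiBinary(3) action;
    # the ordering follows the table above.
    DISCRETE_TO_MULTIBINARY = [
        [0, 0, 0],  # NOOP
        [1, 0, 0],  # LEFT (forward)
        [1, 0, 1],  # UPLEFT (forward jump)
        [0, 0, 1],  # UP (jump)
        [0, 1, 1],  # UPRIGHT (backward jump)
        [0, 1, 0],  # RIGHT (backward)
    ]

    env = gym.make("SlimeVolley-v0")
    obs = env.reset()
    # Execute the "forward jump" combination for one step.
    obs, rew, done, info = env.step(DISCRETE_TO_MULTIBINARY[2])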
Reward Space
-----------------
- The reward is the score of the game: if the ball lands on the ground of your own field, the reward is -1; if it lands on the ground of the opponent's field, the reward is +1; while the rally is still in progress, the reward is 0.
.. _Other-1:
Other
--------
- The end of the game corresponds to the end of an episode. There are two ending conditions:

  - 1. One side's life points drop to 0 (the initial value is 5 by default).
  - 2. The maximum number of environment steps is reached (3000 by default).

- The game supports two kinds of matches: agent vs. built-in bot (bot on the left, agent on the right) and agent vs. agent.
- The built-in bot is a very simple agent trained with an RNN: `bot_link <https://blog.otoro.net/2015/03/28/neural-slime-volleyball/>`_
- By default, only one side's observation is returned. The other side's observation and related information can be found in the ``info`` field (see the sketch below).
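The sketch below shows how the opponent-side observation can be read from ``info`` in the original ``slimevolleygym`` environment. The key name ``otherObs`` follows the upstream repository and should be treated as an assumption; check the repository if it differs.

.. code:: python

    import gym
    import slimevolleygym

    env = gym.make("SlimeVolley-v0")
    obs_right = env.reset()                      # observation of the controlled (right) agent
    obs, rew, done, info = env.step([0, 0, 0])   # NOOP action
    obs_left = info.get('otherObs')              # opponent-side observation, if provided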
Key Facts
==========

1. 1-dimensional vector observation space (of size ``(12, )``), with information given in absolute coordinates
2. ``MultiBinary`` action space
3. Sparse rewards (maximum life value of 5, maximum of 3000 steps; a non-zero reward is obtained only when a life point is lost)
.. _RL Environment Space:
RL Environment Space
======================
.. _Observation Space-2:
Observation Space
--------------------

- Transform the observation vector into a one-dimensional np array of size ``(12, )``. The data type is ``np.float32``.
Action Space
--------------

- Transform the ``MultiBinary`` action space into a discrete action space of size 6 (enumerating the meaningful combinations listed above). The final action is a one-dimensional np array of size ``(1, )``. The data type is ``np.int64``.
.. _Reward Space-2:
Reward Space
--------------

- Transform the reward into a one-dimensional np array of size ``(1, )``. The data type is ``np.float32``, and the value is one of ``-1``, ``0`` or ``1``.
The spaces above, described in 'OpenAI Gym' format:

.. code:: python

    import gym
    import numpy as np

    obs_space = gym.spaces.Box(low=-np.inf, high=np.inf, shape=(12, ), dtype=np.float32)
    act_space = gym.spaces.Discrete(6)
    rew_space = gym.spaces.Box(low=-1, high=1, shape=(1, ), dtype=np.float32)
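As a quick sanity check, the sketch below creates the DI-engine wrapper (the same ``SlimeVolleyEnv`` used later on this page) and prints the shapes and dtypes of the transformed spaces. The values in the comments are expectations based on the descriptions above, not guarantees.

.. code:: python

    from easydict import EasyDict
    from dizoo.slime_volley.envs.slime_volley_env import SlimeVolleyEnv

    env = SlimeVolleyEnv(EasyDict({'env_id': 'SlimeVolley-v0', 'agent_vs_agent': False}))
    obs = env.reset()
    print(obs.shape, obs.dtype)        # expected: (12,) float32
    action = env.random_action()       # expected: np.int64 array of shape (1,)
    timestep = env.step(action)
    print(timestep.reward.shape, timestep.reward.dtype)  # expected: (1,) float32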
.. _Other-2:
Other
------

- The ``info`` dict returned by the environment's ``step`` method must contain the ``final_eval_reward`` key, which holds the evaluation metric of the whole episode (the difference in life points between the two players).
- The space definitions above all describe the single-agent case. In the agent-vs-agent (multi-agent) case, the corresponding obs/action/reward information of both sides is stacked together,
  i.e. the observation space changes from ``(12, )`` to ``(2, 12)``, representing the observations of both sides (see the sketch below).
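The sketch below assumes that setting ``agent_vs_agent=True`` enables the two-agent mode described above and checks the stacked observation shape; verify the exact interface against the dizoo source.

.. code:: python

    from easydict import EasyDict
    from dizoo.slime_volley.envs.slime_volley_env import SlimeVolleyEnv

    env = SlimeVolleyEnv(EasyDict({'env_id': 'SlimeVolley-v0', 'agent_vs_agent': True}))
    obs = env.reset()
    print(obs.shape)  # expected: (2, 12), one row of observations per side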
.. _Other-3:
Other
======

Lazy Initialization
----------------------

In order to support environment vectorization, environment instances are often initialized lazily: the ``__init__`` method does not create the real original environment, but only stores the corresponding parameters and configuration. The real original environment is created the first time ``reset`` is called. A generic sketch of this pattern is given below.
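This is a generic sketch of the lazy-initialization pattern, not DI-engine's actual ``SlimeVolleyEnv`` implementation; the class and attribute names are illustrative only.

.. code:: python

    import gym
    import slimevolleygym


    class LazySlimeVolleyEnvSketch:

        def __init__(self, cfg):
            # Only store the config; do not create the heavy gym env yet.
            self._cfg = cfg
            self._env = None
            self._init_flag = False

        def reset(self):
            if not self._init_flag:
                # The real original environment is created on the first reset.
                self._env = gym.make(self._cfg['env_id'])
                self._init_flag = True
            return self._env.reset()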
Random Seed
------------------
- There are two kinds of random seeds in the environment: one is the random seed of the original environment; the other is the random seed used by the various environment-space transformations (e.g. ``random``, ``np.random``).
- As a user, you only need to set these seeds through the ``seed`` method, without caring about the implementation details.
- Implementation detail: the seed of the original environment is set inside the RL env's ``reset`` method, right before the original env's own ``reset`` is called.
- Implementation detail: the seeds for ``random`` / ``np.random`` are set inside the env's ``seed`` method.
Difference between training env and evaluation env
----------------------------------------------------------
- The training env uses a dynamic random seed, i.e. every episode uses a different random seed produced by a random generator, while the seed of this generator is fixed by the env's ``seed`` method and stays the same throughout an experiment. The evaluation env uses a static random seed, i.e. every episode uses the same random seed, which is set directly by the ``seed`` method. A minimal sketch of this mechanism follows this list.
- The training env and the evaluation env use different pre-processing wrappers: ``episode_life`` and ``clip_reward`` are not used in the evaluation env.
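Below is a minimal sketch, not DI-engine's actual implementation, of how a dynamic seed (training) and a static seed (evaluation) can be handled; class and attribute names are illustrative.

.. code:: python

    import random
    import numpy as np


    class SeededEnvSketch:

        def seed(self, seed: int, dynamic_seed: bool = True) -> None:
            self._seed = seed
            self._dynamic_seed = dynamic_seed
            # Seed the python / numpy RNGs used by the space transformations.
            random.seed(self._seed)
            np.random.seed(self._seed)

        def reset(self):
            if self._dynamic_seed:
                # Training env: a different seed every episode, drawn from a
                # generator that was itself seeded with self._seed.
                episode_seed = self._seed + np.random.randint(0, 2 ** 20)
            else:
                # Evaluation env: the same seed every episode.
                episode_seed = self._seed
            # The original env is seeded right before its own reset is called:
            # self._env.seed(episode_seed); obs = self._env.reset()
            return episode_seed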
Save the replay video
----------------------------
After the env is created and before it is reset, call the ``enable_save_replay`` method to specify where the replay video should be saved. The environment then automatically saves the replay after each episode finishes (internally ``gym.wrappers.Monitor`` is used, which depends on ``ffmpeg``). The code below runs one episode and saves the replay video to a file like ``./video/xxx.mp4``.

.. code:: python

    from easydict import EasyDict
    from dizoo.slime_volley.envs.slime_volley_env import SlimeVolleyEnv

    env = SlimeVolleyEnv(EasyDict({'env_id': 'SlimeVolley-v0', 'agent_vs_agent': False}))
    env.enable_save_replay(replay_path='./video')
    obs = env.reset()

    while True:
        action = env.random_action()
        timestep = env.step(action)
        if timestep.done:
            print('Episode is over, final eval reward is: {}'.format(timestep.info['final_eval_reward']))
            break
DI-zoo runnable code
====================
Complete training configuration can be found on `github
link <https://github.com/opendilab/DI-engine/tree/main/dizoo/slime_volley/entry>`__.
For a specific entry file, e.g. ``slime_volley_selfplay_ppo_main.py``, you can run the demo as shown below:
.. code:: python

    import os
    import gym
    import numpy as np
    import copy
    import torch
    from tensorboardX import SummaryWriter
    from functools import partial

    from ding.config import compile_config
    from ding.worker import BaseLearner, BattleSampleSerialCollector, NaiveReplayBuffer, InteractionSerialEvaluator
    from ding.envs import SyncSubprocessEnvManager
    from ding.policy import PPOPolicy
    from ding.model import VAC
    from ding.utils import set_pkg_seed
    from dizoo.slime_volley.envs import SlimeVolleyEnv
    from dizoo.slime_volley.config.slime_volley_ppo_config import main_config


    def main(cfg, seed=0, max_iterations=int(1e10)):
        cfg = compile_config(
            cfg,
            SyncSubprocessEnvManager,
            PPOPolicy,
            BaseLearner,
            BattleSampleSerialCollector,
            InteractionSerialEvaluator,
            NaiveReplayBuffer,
            save_cfg=True
        )
        # Collector envs run in agent-vs-agent (self-play) mode, while evaluator
        # envs run against the built-in bot.
        collector_env_num, evaluator_env_num = cfg.env.collector_env_num, cfg.env.evaluator_env_num
        collector_env_cfg = copy.deepcopy(cfg.env)
        collector_env_cfg.agent_vs_agent = True
        evaluator_env_cfg = copy.deepcopy(cfg.env)
        evaluator_env_cfg.agent_vs_agent = False
        collector_env = SyncSubprocessEnvManager(
            env_fn=[partial(SlimeVolleyEnv, collector_env_cfg) for _ in range(collector_env_num)], cfg=cfg.env.manager
        )
        evaluator_env = SyncSubprocessEnvManager(
            env_fn=[partial(SlimeVolleyEnv, evaluator_env_cfg) for _ in range(evaluator_env_num)], cfg=cfg.env.manager
        )
        collector_env.seed(seed)
        evaluator_env.seed(seed, dynamic_seed=False)
        set_pkg_seed(seed, use_cuda=cfg.policy.cuda)

        # One shared policy plays both sides during self-play collection.
        model = VAC(**cfg.policy.model)
        policy = PPOPolicy(cfg.policy, model=model)
        tb_logger = SummaryWriter(os.path.join('./{}/log/'.format(cfg.exp_name), 'serial'))
        learner = BaseLearner(
            cfg.policy.learn.learner, policy.learn_mode, tb_logger, exp_name=cfg.exp_name, instance_name='learner1'
        )
        collector = BattleSampleSerialCollector(
            cfg.policy.collect.collector,
            collector_env, [policy.collect_mode, policy.collect_mode],
            tb_logger,
            exp_name=cfg.exp_name
        )
        evaluator_cfg = copy.deepcopy(cfg.policy.eval.evaluator)
        evaluator_cfg.stop_value = cfg.env.stop_value
        evaluator = InteractionSerialEvaluator(
            evaluator_cfg,
            evaluator_env,
            policy.eval_mode,
            tb_logger,
            exp_name=cfg.exp_name,
            instance_name='builtin_ai_evaluator'
        )

        learner.call_hook('before_run')
        for _ in range(max_iterations):
            # Periodically evaluate against the built-in bot and stop once the
            # stop value is reached.
            if evaluator.should_eval(learner.train_iter):
                stop_flag, reward = evaluator.eval(learner.save_checkpoint, learner.train_iter, collector.envstep)
                if stop_flag:
                    break
            # Collect data from both sides and train on the merged batch.
            new_data, _ = collector.collect(train_iter=learner.train_iter)
            train_data = new_data[0] + new_data[1]
            learner.train(train_data, collector.envstep)
        learner.call_hook('after_run')


    if __name__ == "__main__":
        main(main_config)
Note: To run the agent vs. built-in bot mode, run ``python slime_volley_ppo_config.py``.
Note: For some specific algorithms, use the corresponding dedicated entry function.
Algorithm Benchmark
====================

- SlimeVolley-v0 (an average reward greater than 1 against the built-in bot is considered a good agent)
- SlimeVolley-v0 + PPO + vs Bot
.. image:: images/slime_volleyball_ppo_vsbot.png
:align: center
- SlimeVolley-v0 + PPO + self-play
.. image:: images/slime_volleyball_ppo_selfplay.png
:align: center
:scale: 70%