From bf2f7f0e9bdfdcb4183dc22bea444878d59debf4 Mon Sep 17 00:00:00 2001
From: PaParaZz1
Date: Thu, 23 Dec 2021 15:21:54 +0000
Subject: [PATCH] =?UTF-8?q?Deploying=20to=20gh-pages=20from=20=20@=20eb6c6?=
=?UTF-8?q?0cc38f58fd356e221b29c4755e85b73f503=20=F0=9F=9A=80?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
---
_sources/env_tutorial/index.rst.txt | 4 +-
.../env_tutorial/slime_volleyball.rst.txt | 327 ++++++++++
distributed/index.html | 4 +-
env_tutorial/atari.html | 1 +
env_tutorial/bsuite.html | 5 +-
env_tutorial/index.html | 2 +
env_tutorial/overcooked.html | 1 +
env_tutorial/slime_volleyball.html | 559 ++++++++++++++++++
index.html | 1 +
objects.inv | Bin 14411 -> 14543 bytes
searchindex.js | 2 +-
11 files changed, 900 insertions(+), 6 deletions(-)
create mode 100644 _sources/env_tutorial/slime_volleyball.rst.txt
create mode 100644 env_tutorial/slime_volleyball.html
diff --git a/_sources/env_tutorial/index.rst.txt b/_sources/env_tutorial/index.rst.txt
index d3ae999..3388eeb 100644
--- a/_sources/env_tutorial/index.rst.txt
+++ b/_sources/env_tutorial/index.rst.txt
@@ -7,4 +7,6 @@ RL Environments Tutorial
atari
overcooked
- bsuite
\ No newline at end of file
+ bsuite
+ slime_volleyball
+
\ No newline at end of file
diff --git a/_sources/env_tutorial/slime_volleyball.rst.txt b/_sources/env_tutorial/slime_volleyball.rst.txt
new file mode 100644
index 0000000..c6368ea
--- /dev/null
+++ b/_sources/env_tutorial/slime_volleyball.rst.txt
@@ -0,0 +1,327 @@
+Slime Volleyball
+~~~~~~~~~~~~~~~~~
+
+Overview
+============
+
+Slime Volleyball is a two-player, match-based environment with two types of observation spaces: vector form and image form. The action space is often simplified to a discrete action space, and the environment is commonly used as a basic benchmark for ``self-play``-related algorithms. It is a collection of environments (there are 3 sub-environments, namely ``SlimeVolley-v0``, ``SlimeVolleyPixel-v0``, ``SlimeVolleyNoFrameskip-v0``), of which the ``SlimeVolley-v0`` game is shown in the figure below.
+
+.. image:: ./images/slime_volleyball.gif
+ :align: center
+
+Installation
+===============
+
+Installation Methods
+------------------------
+
+Install ``slimevolleygym``. You can install it via ``pip``, or through **DI-engine**.
+
+.. code:: shell
+
+ # Method1: Install Directly
+ pip install slimevolleygym
+
+
+Installation Check
+------------------------
+
+After completing the installation, you can check whether it was successful with the following commands:
+
+.. code:: python
+
+ import gym
+ import slimevolleygym
+ env = gym.make("SlimeVolley-v0")
+ obs = env.reset()
+ print(obs.shape) # (12, )
+
+DI-engine Mirrors
+--------------------
+
+Since Slime Volleyball is easy to install, DI-engine does not provide a dedicated mirror for it. You can customize your build on top of the base mirror ``opendilab/ding:nightly``, or visit the `docker
+hub `__ for more mirrors.
+
+.. _Original Environment:
+
+Original Environment
+========================
+Note: ``SlimeVolley-v0`` is used as the example here, because benchmarking the ``self-play`` series of algorithms naturally gives priority to simplicity. If you want to use the other two environments, you can check the original repository and adapt the environment according to the `DI-engine API `_.
+
+.. _Observation Space-1:
+
+Observation Space
+--------------------------
+
+- The observation space is a vector of size ``(12, )`` containing the absolute coordinates of the agent, the opponent, and the ball, with two consecutive frames stitched together. The data type is ``float64``,
+  i.e. ``(x_agent, y_agent, x_agent_next, y_agent_next, x_ball, y_ball, x_ball_next, y_ball_next, x_opponent, y_opponent, x_opponent_next, y_opponent_next)``
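As a quick sketch, the 12-dim vector can be split into its three parts (the field order is assumed from the description above, not verified against the environment):

```python
import numpy as np

# Dummy 12-dim observation in the documented order (assumed layout):
# agent (x, y, x_next, y_next), ball (x, y, x_next, y_next),
# opponent (x, y, x_next, y_next)
obs = np.arange(12, dtype=np.float64)

agent, ball, opponent = obs[0:4], obs[4:8], obs[8:12]
print(agent.shape, ball.shape, opponent.shape)  # (4,) (4,) (4,)
```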
+
+.. _Action Space-1:
+
+Action Space
+------------------
+
+- The original action space of ``SlimeVolley-v0`` is defined as ``MultiBinary(3)``, with three kinds of actions. More than one action can be performed at the same time. Each action corresponds to two cases: 0 (not executed) and 1 (executed),
+  i.e. ``(1, 0, 1)`` represents executing the first and third actions at the same time. The data type is ``int``; the action needs to be passed in as a python list object (or a 1-dimensional np array of size 3, e.g. ``np.array([0, 1, 0])``).
+
+- The actual implementation does not strictly limit the action to 0 and 1. It treats values greater than 0 as 1, while values less than or equal to 0 as 0.
+
+- In the ``SlimeVolley-v0`` environment, the basic action is meant to be
+
+ - 0: forward (forward)
+
+ - 1: backward (backward)
+
+ - 2: jump (jump)
+
+- In the ``SlimeVolley-v0`` environment, the combined action is meant to be
+
+ - [0, 0, 0], NOOP
+
+ - [1, 0, 0], LEFT (forward)
+
+ - [1, 0, 1], UPLEFT (forward jump)
+
+ - [0, 0, 1], UP (jump)
+
+ - [0, 1, 1], UPRIGHT (backward jump)
+
+  - [0, 1, 0], RIGHT (backward)
+
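The thresholding rule described above can be sketched as follows (a minimal illustration, not the environment's actual code):

```python
import numpy as np

def to_binary_action(raw):
    # Values greater than 0 are treated as 1 (executed);
    # values less than or equal to 0 are treated as 0 (not executed).
    return (np.asarray(raw) > 0).astype(np.int64)

print(to_binary_action([2, -1, 0]))    # [1 0 0]
print(to_binary_action([0.5, 1, -3]))  # [1 1 0]
```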
+
+Reward Space
+-----------------
+
+- The reward is the score of the game. If the ball lands on the ground of your side, -1 is given; if it lands on the ground of the opponent's side, +1 is given; if the game is still in progress, 0 is given.
+
+.. _Other-1:
+
+Other
+--------
+
+- The end of the game marks the end of an episode. There are two ending conditions:
+
+  - 1. One side's life points drop to 0 (the default life value is 5).
+
+  - 2. The maximum number of environment steps is reached (default is 3000).
+
+- The game supports two kinds of matchups: agent vs. the built-in bot (bot on the left, agent on the right), and agent vs. agent.
+
+- The built-in bot is a very simple agent trained with an RNN `bot_link `_
+
+- Only one side's obs is returned by default. The other side's obs and information can be found in the ``info`` field.
+
+Key Facts
+==========
+
+1. 1-dimensional vector observation space (of size (12, )) with information in absolute coordinates
+
+2. ``MultiBinary`` action space
+
+3. Sparse rewards (maximum life value of 5, maximum of 3,000 steps; a reward is gained only when a life point is deducted)
+
+.. _RL Environment Space:
+
+RL Environment Space
+======================
+
+.. _Observation Space-2:
+
+Observation Space
+--------------------
+
+- Transform the space vector into a one-dimensional np array of size ``(12, )``. The data type is ``np.float32``.
+
+Action Space
+------------------
+
+- Transform the ``MultiBinary`` action space into a discrete action space of size 6 (a simple Cartesian product is sufficient). The final result is a one-dimensional np array of size ``(1, )``. The data type is ``np.int64``.
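One way to realize this transformation, using the six combined actions listed in the previous section (a sketch, assuming forward and backward are never pressed together, which is why only 6 of the 2**3 = 8 binary combinations are kept):

```python
import numpy as np

# The six meaningful combined actions from the section above.
DISCRETE_TO_BINARY = [
    [0, 0, 0],  # NOOP
    [1, 0, 0],  # LEFT (forward)
    [1, 0, 1],  # UPLEFT (forward jump)
    [0, 0, 1],  # UP (jump)
    [0, 1, 1],  # UPRIGHT (backward jump)
    [0, 1, 0],  # RIGHT (backward)
]

def discrete_to_multibinary(action):
    # action: np array of size (1, ) and dtype np.int64, as described above
    return np.array(DISCRETE_TO_BINARY[int(action)], dtype=np.int64)

print(discrete_to_multibinary(np.array([2])))  # [1 0 1]
```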
+
+.. _Reward Space-2:
+
+Reward Space
+-----------------
+
+- Transform the reward into a one-dimensional np array of size ``(1, )``. The data type is ``np.float32``, with values in ``{-1, 0, 1}``.
+
+The Slime Volleyball spaces above, expressed in OpenAI Gym format:
+
+.. code:: python
+
+    import gym
+    import numpy as np
+
+    obs_space = gym.spaces.Box(low=-np.inf, high=np.inf, shape=(12, ), dtype=np.float32)
+    act_space = gym.spaces.Discrete(6)
+    rew_space = gym.spaces.Box(low=-1, high=1, shape=(1, ), dtype=np.float32)
+
+.. _Other-2:
+
+Other
+--------
+
+- The ``info`` returned from the environment's ``step`` method must contain the ``final_eval_reward`` key-value pair, which represents the evaluation metric for the entire episode: the total reward for the episode (the life-value difference between the two players).
+
+- The above space definitions all describe a single agent. The multi-agent space concatenates the corresponding obs/action/reward information,
+
+  i.e. the observation space changes from ``(12, )`` to ``(2, 12)``, which represents the observation information of both sides.
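For illustration, the concatenation described above behaves like this (dummy data, not real observations):

```python
import numpy as np

obs_left = np.zeros(12, dtype=np.float32)   # one agent's observation
obs_right = np.ones(12, dtype=np.float32)   # the other side's observation

# In agent-vs-agent mode, both observations are stacked along a new axis:
obs = np.stack([obs_left, obs_right])
print(obs.shape)  # (2, 12)
```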
+
+.. _Other-3:
+
+Other
+========
+
+Lazy initialization
+--------------------
+
+In order to support environment vectorization, an environment instance is often initialized lazily. In this way, the ``__init__`` method does not really initialize the original environment, but only sets the corresponding parameters and configurations. The real original environment is initialized the first time the ``reset`` method is called.
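The pattern can be sketched as follows (a hypothetical minimal class, not the actual DI-engine implementation):

```python
class LazyEnv:
    # Minimal sketch of the lazy-initialization pattern.
    def __init__(self, env_fn):
        # Only store how to build the env; do not build it yet.
        self._env_fn = env_fn
        self._env = None
        self._init_flag = False

    def reset(self):
        # The real environment is created on the first reset only.
        if not self._init_flag:
            self._env = self._env_fn()
            self._init_flag = True
        return self._env.reset()


class DummyEnv:
    # Stand-in for the real original environment.
    def reset(self):
        return [0.0] * 12


env = LazyEnv(DummyEnv)
print(env._env is None)  # True: __init__ built nothing
obs = env.reset()
print(len(obs))          # 12
```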
+
+Random Seed
+------------------
+
+- There are two random seeds in the environment. One is the original environment's random seed; the other is the random seed required by many environment space transformations (e.g. ``random``, ``np.random``).
+
+- As a user, you only need to set these two random seeds by calling the ``seed`` method, and do not need to care about the implementation details.
+
+- Implementation details: the original environment's random seed is set within the RL env's ``reset`` method, just before the original env's ``reset`` method is called.
+
+- Implementation details: the seed for ``random`` / ``np.random`` is set within the env's ``seed`` method.
+
+Difference between training env and evaluation env
+----------------------------------------------------------
+
+- The training env uses a dynamic random seed, i.e. every episode has a different random seed, generated by one random generator. However, this random generator's own seed is fixed by the env's ``seed`` method and stays fixed throughout an experiment. The evaluation env uses a static random seed, i.e. every episode has the same random seed, which is set directly by the ``seed`` method.
+
+- The training env and the evaluation env use different pre-processing wrappers: ``episode_life`` and ``clip_reward`` are not used in the evaluation env.
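The dynamic vs. static seeding described above can be sketched as follows (a hypothetical class mirroring the described behavior, not DI-engine's actual implementation):

```python
import random

class SeededEnvSketch:
    def __init__(self):
        self._seed = None
        self._dynamic_seed = True

    def seed(self, seed, dynamic_seed=True):
        # Master seed, fixed for the whole experiment;
        # random / np.random would also be seeded here.
        self._seed = seed
        self._dynamic_seed = dynamic_seed
        random.seed(seed)

    def episode_seed(self):
        # Called inside reset, before the original env's own reset.
        if self._dynamic_seed:
            return self._seed + random.randint(0, 2 ** 20)  # new seed per episode
        return self._seed  # same seed every episode (evaluation)

eval_env = SeededEnvSketch()
eval_env.seed(0, dynamic_seed=False)
print(eval_env.episode_seed() == eval_env.episode_seed())  # True
```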
+
+Save the replay video
+----------------------------
+
+After the env is initialized and before it is reset, call the ``enable_save_replay`` method to specify where the replay video will be saved. The environment will automatically save the replay video after each episode is completed. (The default implementation calls ``gym.wrappers.Monitor`` and depends on ``ffmpeg``.) The code below runs one environment episode and saves the replay video to a file like ``./video/xxx.mp4``.
+
+.. code:: python
+
+ from easydict import EasyDict
+ from dizoo.slime_volley.envs.slime_volley_env import SlimeVolleyEnv
+
+ env = SlimeVolleyEnv(EasyDict({'env_id': 'SlimeVolley-v0', 'agent_vs_agent': False}))
+ env.enable_save_replay(replay_path='./video')
+ obs = env.reset()
+
+ while True:
+ action = env.random_action()
+ timestep = env.step(action)
+ if timestep.done:
+ print('Episode is over, final eval reward is: {}'.format(timestep.info['final_eval_reward']))
+ break
+
+DI-zoo runnable code
+====================
+
+The complete training configuration can be found in the `github
+link `__.
+For a specific configuration file, e.g. ``slime_volley_selfplay_ppo_main.py``, you can run the demo as shown below:
+
+.. code:: python
+
+ import os
+ import gym
+ import numpy as np
+ import copy
+ import torch
+ from tensorboardX import SummaryWriter
+ from functools import partial
+
+ from ding.config import compile_config
+ from ding.worker import BaseLearner, BattleSampleSerialCollector, NaiveReplayBuffer, InteractionSerialEvaluator
+ from ding.envs import SyncSubprocessEnvManager
+ from ding.policy import PPOPolicy
+ from ding.model import VAC
+ from ding.utils import set_pkg_seed
+ from dizoo.slime_volley.envs import SlimeVolleyEnv
+ from dizoo.slime_volley.config.slime_volley_ppo_config import main_config
+
+
+ def main(cfg, seed=0, max_iterations=int(1e10)):
+ cfg = compile_config(
+ cfg,
+ SyncSubprocessEnvManager,
+ PPOPolicy,
+ BaseLearner,
+ BattleSampleSerialCollector,
+ InteractionSerialEvaluator,
+ NaiveReplayBuffer,
+ save_cfg=True
+ )
+ collector_env_num, evaluator_env_num = cfg.env.collector_env_num, cfg.env.evaluator_env_num
+ collector_env_cfg = copy.deepcopy(cfg.env)
+ collector_env_cfg.agent_vs_agent = True
+ evaluator_env_cfg = copy.deepcopy(cfg.env)
+ evaluator_env_cfg.agent_vs_agent = False
+ collector_env = SyncSubprocessEnvManager(
+ env_fn=[partial(SlimeVolleyEnv, collector_env_cfg) for _ in range(collector_env_num)], cfg=cfg.env.manager
+ )
+ evaluator_env = SyncSubprocessEnvManager(
+ env_fn=[partial(SlimeVolleyEnv, evaluator_env_cfg) for _ in range(evaluator_env_num)], cfg=cfg.env.manager
+ )
+
+ collector_env.seed(seed)
+ evaluator_env.seed(seed, dynamic_seed=False)
+ set_pkg_seed(seed, use_cuda=cfg.policy.cuda)
+
+ model = VAC(**cfg.policy.model)
+ policy = PPOPolicy(cfg.policy, model=model)
+
+ tb_logger = SummaryWriter(os.path.join('./{}/log/'.format(cfg.exp_name), 'serial'))
+ learner = BaseLearner(
+ cfg.policy.learn.learner, policy.learn_mode, tb_logger, exp_name=cfg.exp_name, instance_name='learner1'
+ )
+ collector = BattleSampleSerialCollector(
+ cfg.policy.collect.collector,
+ collector_env, [policy.collect_mode, policy.collect_mode],
+ tb_logger,
+ exp_name=cfg.exp_name
+ )
+ evaluator_cfg = copy.deepcopy(cfg.policy.eval.evaluator)
+ evaluator_cfg.stop_value = cfg.env.stop_value
+ evaluator = InteractionSerialEvaluator(
+ evaluator_cfg,
+ evaluator_env,
+ policy.eval_mode,
+ tb_logger,
+ exp_name=cfg.exp_name,
+ instance_name='builtin_ai_evaluator'
+ )
+
+ learner.call_hook('before_run')
+ for _ in range(max_iterations):
+ if evaluator.should_eval(learner.train_iter):
+ stop_flag, reward = evaluator.eval(learner.save_checkpoint, learner.train_iter, collector.envstep)
+ if stop_flag:
+ break
+ new_data, _ = collector.collect(train_iter=learner.train_iter)
+ train_data = new_data[0] + new_data[1]
+ learner.train(train_data, collector.envstep)
+ learner.call_hook('after_run')
+
+
+ if __name__ == "__main__":
+ main(main_config)
+
+Note: To run the agent vs. built-in bot mode, run ``python slime_volley_ppo_config.py``.
+
+Note: For specific algorithms, use the corresponding entry function.
+
+Algorithm Benchmark
+====================
+
+- SlimeVolley-v0 (an average reward greater than 1 against the built-in bot is considered a good agent)
+
+ - SlimeVolley-v0 + PPO + vs Bot
+ .. image:: images/slime_volleyball_ppo_vsbot.png
+ :align: center
+
+ - SlimeVolley-v0 + PPO + self-play
+ .. image:: images/slime_volleyball_ppo_selfplay.png
+ :align: center
+ :scale: 70%
+