Rainbow

Overview

Rainbow was proposed in Rainbow: Combining Improvements in Deep Reinforcement Learning. It combines many independent improvements to DQN, including: double Q-learning (with a target network), prioritized experience replay, dueling head, multi-step TD loss, distributional RL (C51) and noisy net.

Quick Facts

  1. Rainbow is a model-free and value-based RL algorithm.

  2. Rainbow only supports discrete action spaces.

  3. Rainbow is an off-policy algorithm.

  4. Usually, Rainbow uses eps-greedy, multinomial sampling or noisy net for exploration.

  5. Rainbow can be equipped with RNN.

  6. The DI-engine implementation of Rainbow supports multi-discrete action spaces.

Double Q-learning

Double Q-learning maintains a target Q network, which is periodically updated from the current Q network. Double Q-learning reduces the over-estimation of Q-values by decoupling action selection from value estimation: the action is selected with the current Q network, but its value is estimated with the target network. Formally, the loss is:

\[\left(R_{t+1}+\gamma_{t+1} q_{\bar{\theta}}\left(S_{t+1}, \underset{a^{\prime}}{\operatorname{argmax}} q_{\theta}\left(S_{t+1}, a^{\prime}\right)\right)-q_{\theta}\left(S_{t}, A_{t}\right)\right)^{2}\]
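
As a minimal sketch (not DI-engine's actual implementation), the double-Q target above can be computed in PyTorch as follows; double_q_target, q_online, and q_target are illustrative names:

import torch

def double_q_target(reward, done, next_obs, q_online, q_target, gamma=0.99):
    # reward, done: (B,) tensors; next_obs: (B, obs_dim); q_online / q_target: modules returning (B, A)
    with torch.no_grad():
        next_action = q_online(next_obs).argmax(dim=-1, keepdim=True)    # select action with the current net
        next_q = q_target(next_obs).gather(-1, next_action).squeeze(-1)  # evaluate it with the target net
        return reward + gamma * (1.0 - done) * next_q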

Prioritized Experience Replay(PER)

DQN samples uniformly from the replay buffer. Ideally, we want to sample more frequently those transitions from which there is much to learn. As a proxy for learning potential, prioritized experience replay samples transitions with probabilities relative to the last encountered absolute TD error, formally:

\[p_{t} \propto\left|R_{t+1}+\gamma_{t+1} \max _{a^{\prime}} q_{\bar{\theta}}\left(S_{t+1}, a^{\prime}\right)-q_{\theta}\left(S_{t}, A_{t}\right)\right|^{\omega}\]

In the original PER paper, the authors show that PER achieves improvements on most of the 57 Atari games, especially Gopher, Atlantis, James Bond 007, and Space Invaders.
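
Below is a minimal sketch of the proportional sampling implied by the formula above; per_sample, td_errors, and omega are illustrative names, and the importance-sampling correction normally used alongside PER is omitted:

import numpy as np

def per_sample(td_errors, batch_size, omega=0.6, eps=1e-6):
    # td_errors: last-encountered TD errors of transitions in the buffer
    priorities = (np.abs(td_errors) + eps) ** omega      # p_t proportional to |TD error|^omega
    probs = priorities / priorities.sum()
    idx = np.random.choice(len(td_errors), size=batch_size, p=probs)
    return idx, probs[idx]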

Dueling Network

The dueling network is a neural network architecture designed for value-based RL. It features two streams of computation, the value and advantage streams, sharing a convolutional encoder and merged by a special aggregator. This corresponds to the following factorization of action values:

\[q_{\theta}(s, a)=v_{\eta}\left(f_{\xi}(s)\right)+a_{\psi}\left(f_{\xi}(s), a\right)-\frac{\sum_{a^{\prime}} a_{\psi}\left(f_{\xi}(s), a^{\prime}\right)}{N_{\text {actions }}}\]

The network architecture of Rainbow is a dueling network architecture adapted for use with return distributions. The network has a shared representation, which is then fed into a value stream \(v_\eta\) with \(N_{atoms}\) outputs, and into an advantage stream \(a_{\psi}\) with \(N_{atoms} \times N_{actions}\) outputs, where \(a_{\psi}^i(a)\) will denote the output corresponding to atom i and action a. For each atom \(z_i\), the value and advantage streams are aggregated, as in dueling DQN, and then passed through a softmax layer to obtain the normalized parametric distributions used to estimate the returns’ distributions:

\[p_{\theta}^{i}(s, a)=\frac{\exp \left(v_{\eta}^{i}(\phi)+a_{\psi}^{i}(\phi, a)-\bar{a}_{\psi}^{i}(s)\right)}{\sum_{j} \exp \left(v_{\eta}^{j}(\phi)+a_{\psi}^{j}(\phi, a)-\bar{a}_{\psi}^{j}(s)\right)}, \quad \text{where } \phi=f_{\xi}(s) \text{ and } \bar{a}_{\psi}^{i}(s)=\frac{1}{N_{\text{actions}}} \sum_{a^{\prime}} a_{\psi}^{i}\left(\phi, a^{\prime}\right)\]
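
A minimal sketch of this aggregation, assuming a value stream of shape (B, n_atom) and an advantage stream of shape (B, n_action, n_atom); the function name is illustrative:

import torch
import torch.nn.functional as F

def distributional_dueling_aggregate(v, a):
    # v: (B, n_atom) value stream; a: (B, n_action, n_atom) advantage stream
    a_mean = a.mean(dim=1, keepdim=True)        # \bar{a}_\psi^i(s): average over actions, per atom
    logits = v.unsqueeze(1) + a - a_mean        # (B, n_action, n_atom)
    dist = F.softmax(logits, dim=-1)            # normalize over atoms for each (s, a)
    return logits, dist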

Multi-step Learning

A multi-step variant of DQN is then defined by minimizing the alternative loss:

\[\left(R_{t}^{(n)}+\gamma_{t}^{(n)} \max _{a^{\prime}} q_{\bar{\theta}}\left(S_{t+n}, a^{\prime}\right)-q_{\theta}\left(S_{t}, A_{t}\right)\right)^{2}\]

where the truncated n-step return is defined as:

\[R_{t}^{(n)} \equiv \sum_{k=0}^{n-1} \gamma_{t}^{(k)} R_{t+k+1}\]

In the paper Revisiting Fundamentals of Experience Replay, the authors show that a larger replay buffer capacity substantially improves performance when multi-step learning is used. They attribute this to the fact that multi-step learning increases the variance of the target, which a larger replay buffer helps to compensate.
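
A minimal sketch of the truncated n-step return for a single transition, assuming a constant per-step discount (the \(\gamma_{t}^{(k)}\) in the formula is the product of per-step discounts, which also handles episode termination):

def n_step_return(rewards, gamma=0.99):
    # rewards[k] corresponds to R_{t+k+1}; returns sum_{k=0}^{n-1} gamma^k * R_{t+k+1}
    ret, discount = 0.0, 1.0
    for r in rewards:
        ret += discount * r
        discount *= gamma
    return ret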

Noisy Net

Noisy Nets use a noisy linear layer that combines a deterministic and noisy stream:

\[\boldsymbol{y}=(\boldsymbol{b}+\mathbf{W} \boldsymbol{x})+\left(\boldsymbol{b}_{\text {noisy }} \odot \epsilon^{b}+\left(\mathbf{W}_{\text {noisy }} \odot \epsilon^{w}\right) \boldsymbol{x}\right)\]

Over time, the network can learn to ignore the noisy stream, but at different rates in different parts of the state space, allowing state-conditional exploration with a form of self-annealing. It usually improves over epsilon-greedy when the action space is large, e.g. in Montezuma’s Revenge, because epsilon-greedy tends to converge quickly to a one-hot distribution before enough reward signal has been collected for the large number of actions. In our implementation, the noise is resampled before each forward pass, both during data collection and during training. When double Q-learning is used, the target network also resamples its noise before each forward pass. During noise sampling, the noise is first drawn from N(0, 1), and then its magnitude is modulated by a square-root function with the sign preserved, i.e. x -> x.sign() * x.abs().sqrt().
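
A minimal sketch of such a noisy linear layer with factorized Gaussian noise (an assumption; DI-engine's actual module may differ), where reset_noise() corresponds to the noise resampling before each forward pass described above:

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Linear layer with a deterministic and a noisy stream, as in the equation above."""

    def __init__(self, in_features: int, out_features: int, sigma0: float = 0.4):
        super().__init__()
        bound = in_features ** -0.5
        # deterministic stream: b + W x
        self.w_mu = nn.Parameter(torch.empty(out_features, in_features).uniform_(-bound, bound))
        self.b_mu = nn.Parameter(torch.zeros(out_features))
        # noisy stream: b_noisy * eps_b + (W_noisy * eps_w) x
        self.w_sigma = nn.Parameter(torch.full((out_features, in_features), sigma0 * bound))
        self.b_sigma = nn.Parameter(torch.full((out_features,), sigma0 * bound))
        self.register_buffer('eps_w', torch.zeros(out_features, in_features))
        self.register_buffer('eps_b', torch.zeros(out_features))
        self.reset_noise()

    @staticmethod
    def _f(x: torch.Tensor) -> torch.Tensor:
        # magnitude modulated by sqrt, sign preserved
        return x.sign() * x.abs().sqrt()

    def reset_noise(self) -> None:
        # factorized Gaussian noise (an assumption): outer product of two modulated noise vectors
        eps_in = self._f(torch.randn(self.w_mu.shape[1]))
        eps_out = self._f(torch.randn(self.w_mu.shape[0]))
        self.eps_w.copy_(torch.outer(eps_out, eps_in))
        self.eps_b.copy_(eps_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weight = self.w_mu + self.w_sigma * self.eps_w
        bias = self.b_mu + self.b_sigma * self.eps_b
        return F.linear(x, weight, bias)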

Extensions

Rainbow can be combined with:
  • RNN

Implementation

The default config is defined as follows:

class ding.policy.rainbow.RainbowDQNPolicy(cfg: dict, model: Optional[Union[type, torch.nn.modules.module.Module]] = None, enable_field: Optional[List[str]] = None)[source]
Overview:
Rainbow DQN contains several improvements upon DQN, including:
  • target network

  • dueling architecture

  • prioritized experience replay

  • n_step return

  • noise net

  • distribution net

Therefore, the RainbowDQNPolicy class inherits from the DQNPolicy class.

Config:

| ID | Symbol | Type | Default Value | Description | Other (Shape) |
|----|--------|------|---------------|-------------|---------------|
| 1 | type | str | rainbow | RL policy register name, refer to registry POLICY_REGISTRY | this arg is optional, a placeholder |
| 2 | cuda | bool | False | Whether to use cuda for network | this arg can be different from modes |
| 3 | on_policy | bool | False | Whether the RL algorithm is on-policy or off-policy | |
| 4 | priority | bool | True | Whether to use priority (PER) | priority sample, update priority |
| 5 | model.v_min | float | -10 | Value of the smallest atom in the support set | |
| 6 | model.v_max | float | 10 | Value of the largest atom in the support set | |
| 7 | model.n_atom | int | 51 | Number of atoms in the support set of the value distribution | |
| 8 | other.eps.start | float | 0.05 | Start value for epsilon decay. It's small because Rainbow uses noisy net | |
| 9 | other.eps.end | float | 0.05 | End value for epsilon decay | |
| 10 | discount_factor | float | 0.97, [0.95, 0.999] | Reward's future discount factor, aka. gamma | may be 1 in sparse-reward envs |
| 11 | nstep | int | 3, [3, 5] | N-step reward discount sum for target q_value estimation | |
| 12 | learn.update_per_collect | int | 3 | How many updates (iterations) to train after one collection by the collector. Only valid in serial training | this arg can vary across envs; a bigger value means more off-policy |
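
As a hedged illustration (not a verified DI-engine config file), the fields in the table above could be assembled into a nested config dict following the dotted symbols shown there:

rainbow_policy_config = dict(
    type='rainbow',
    cuda=False,
    on_policy=False,
    priority=True,
    discount_factor=0.97,
    nstep=3,
    model=dict(v_min=-10, v_max=10, n_atom=51),      # C51 support set
    learn=dict(update_per_collect=3),
    other=dict(eps=dict(start=0.05, end=0.05)),      # small eps because noisy net handles exploration
)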

The network interface used by Rainbow is defined as follows:

class ding.model.template.q_learning.RainbowDQN(obs_shape: Union[int, ding.utils.type_helper.SequenceType], action_shape: Union[int, ding.utils.type_helper.SequenceType], encoder_hidden_size_list: ding.utils.type_helper.SequenceType = [128, 128, 64], head_hidden_size: Optional[int] = None, head_layer_num: int = 1, activation: Optional[torch.nn.modules.module.Module] = ReLU(), norm_type: Optional[str] = None, v_min: Optional[float] = - 10, v_max: Optional[float] = 10, n_atom: Optional[int] = 51)[source]
Overview:

RainbowDQN network (C51 + Dueling + Noisy Block)

Note

RainbowDQN contains dueling architecture by default

__init__(obs_shape: Union[int, ding.utils.type_helper.SequenceType], action_shape: Union[int, ding.utils.type_helper.SequenceType], encoder_hidden_size_list: ding.utils.type_helper.SequenceType = [128, 128, 64], head_hidden_size: Optional[int] = None, head_layer_num: int = 1, activation: Optional[torch.nn.modules.module.Module] = ReLU(), norm_type: Optional[str] = None, v_min: Optional[float] = - 10, v_max: Optional[float] = 10, n_atom: Optional[int] = 51) None[source]
Overview:

Init the Rainbow Model according to arguments.

Arguments:
  • obs_shape (Union[int, SequenceType]): Observation space shape.

  • action_shape (Union[int, SequenceType]): Action space shape.

  • encoder_hidden_size_list (SequenceType): Collection of hidden_size to pass to Encoder

  • head_hidden_size (Optional[int]): The hidden_size to pass to Head.

  • head_layer_num (int): The num of layers used in the network to compute Q value output

  • activation (Optional[nn.Module]): The type of activation function to use in the MLP after each layer_fn; if None, it defaults to nn.ReLU().

  • norm_type (Optional[str]): The type of normalization to use, see ding.torch_utils.fc_block for more details.

  • n_atom (Optional[int]): Number of atoms in the prediction distribution.

forward(x: torch.Tensor) Dict[source]
Overview:

Use the observation tensor to predict Rainbow's output (logit and return distribution).

Arguments:
  • x (torch.Tensor):

    The observation tensor with shape (B, N), where N corresponds to obs_shape.

Returns:
  • outputs (Dict):

    Run MLP with RainbowHead setups and return the result prediction dictionary.

ReturnsKeys:
  • logit (torch.Tensor): Logit tensor of size (B, M), where M is action_shape.

  • distribution (torch.Tensor): Distribution tensor of size (B, M, n_atom)

Shapes:
  • x (torch.Tensor): \((B, N)\), where B is batch size and N is obs_shape.

  • logit (torch.FloatTensor): \((B, M)\), where M is action_shape.

  • distribution(torch.FloatTensor): \((B, M, P)\), where P is n_atom.

Examples:
>>> model = RainbowDQN(64, 64) # arguments: 'obs_shape' and 'action_shape'
>>> inputs = torch.randn(4, 64)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
>>> # default n_atom: int =51
>>> assert outputs['distribution'].shape == torch.Size([4, 64, 51])

The benchmark result of Rainbow implemented in DI-engine is shown in the Benchmark section.

Experiments on Rainbow Tricks

We conduct experiments on the LunarLander environment using the rainbow (dqn) policy to compare the performance of the n-step, dueling, priority, and priority_IS tricks against the baseline. The code link for the experiments is here. Note that the config file is set for dqn by default. To adopt the rainbow policy, we need to change the policy type as below.

lunarlander_dqn_create_config = dict(
    env=dict(
        type='lunarlander',
        import_names=['dizoo.box2d.lunarlander.envs.lunarlander_env'],
    ),
    env_manager=dict(type='subprocess'),
    policy=dict(type='rainbow'),
)

The detailed experiment settings are stated below.

| Experiment setting | Remark |
|--------------------|--------|
| base | one-step DQN (n-step=1, dueling=False, priority=False, priority_IS=False) |
| n-step | n-step DQN (n-step=3, dueling=False, priority=False, priority_IS=False) |
| dueling | use dueling head trick (n-step=3, dueling=True, priority=False, priority_IS=False) |
| priority | use prioritized experience replay buffer (n-step=3, dueling=False, priority=True, priority_IS=False) |
| priority_IS | use importance sampling trick (n-step=3, dueling=False, priority=True, priority_IS=True) |

  1. reward_mean over training iterations is used as the evaluation metric.

  2. Each experiment setting is run three times with random seeds 0, 1, and 2, and the results are averaged to account for stochasticity.

if __name__ == "__main__":
    serial_pipeline([main_config, create_config], seed=0)

  3. By setting exp_name in the config file, the experiment results can be saved in a specified path. Otherwise, they will be saved in the './default_experiment' directory.

from easydict import EasyDict
from ding.entry import serial_pipeline

nstep = 1
lunarlander_dqn_default_config = dict(
    exp_name='lunarlander_exp/base-one-step2',
    env=dict(
        ......

The result is shown in the figure below. As we can see, with the tricks enabled, convergence is considerably faster. In this experiment setting, the dueling trick contributes the most to performance.

../_images/rainbow_exp.png

References

Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, David Silver: “Rainbow: Combining Improvements in Deep Reinforcement Learning”, 2017; arXiv:1710.02298, http://arxiv.org/abs/1710.02298.

William Fedus, Prajit Ramachandran, Rishabh Agarwal, Yoshua Bengio, Hugo Larochelle, Mark Rowland, Will Dabney: “Revisiting Fundamentals of Experience Replay”, 2020; arXiv:2007.06700, http://arxiv.org/abs/2007.06700.