DQN

Overview

DQN was first proposed in Playing Atari with Deep Reinforcement Learning, and it combines Q-learning with deep neural networks. Unlike earlier methods, DQN uses a deep neural network to estimate the Q-values, and the network is updated by minimizing a TD-loss with gradient descent.

Quick Facts

  1. DQN is a model-free and value-based RL algorithm.

  2. DQN only supports discrete action spaces.

  3. DQN is an off-policy algorithm.

  4. Usually, DQN uses epsilon-greedy or multinomial sampling for exploration.

  5. DQN + RNN = DRQN.

  6. The DI-engine implementation of DQN supports multi-discrete action spaces.

Key Equations or Key Graphs

The TD-loss used in DQN is:

\[L(w)=\mathbb{E}\left[(\underbrace{r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}, w\right)}_{\text {Target }}-Q(s, a, w))^{2}\right]\]
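The following is a minimal PyTorch sketch of this loss; the function and tensor names (q_net, obs, action, ...) are illustrative, not DI-engine's actual interfaces:

import torch
import torch.nn.functional as F

def one_step_td_loss(q_net, obs, action, reward, next_obs, done, gamma=0.99):
    # Q(s, a, w) for the actions that were actually taken; action is a (B,) long tensor
    q = q_net(obs).gather(1, action.unsqueeze(1)).squeeze(1)
    # Target: r + gamma * max_a' Q(s', a', w); done is 1.0 for terminal s', else 0.0
    with torch.no_grad():
        target = reward + gamma * (1 - done) * q_net(next_obs).max(dim=1).values
    return F.mse_loss(q, target)

In practice the target term is usually computed with a separate target network (see the Note below).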

Pseudo-code

../_images/DQN.png

Note

Compared with the vanilla version, DQN has been significantly modified in both the algorithm and the implementation. On the algorithm side, n-step TD-loss, PER, the target network and the dueling head are widely used. On the implementation side, the value of epsilon is annealed from a high initial value to a low final value during training, according to the env step (the number of policy interactions with the environment), rather than kept constant.
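A sketch of such an annealing schedule is given below; the function and its parameters are illustrative and simply mirror the other.eps fields listed in the config table of the Implementations section:

import math

def epsilon_by_env_step(env_step, start=0.95, end=0.1, decay=10000, kind='exp'):
    # Anneal epsilon from `start` toward `end`; `decay` (in env steps) controls how fast.
    if kind == 'linear':
        return max(end, start - (start - end) * env_step / decay)
    # exponential decay
    return end + (start - end) * math.exp(-env_step / decay)

During collection, a random action is taken with probability epsilon and the greedy action otherwise.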

Extensions

DQN can be combined with:

  • PER (Prioritized Experience Replay)

    Prioritized Experience Replay replaces the uniform sampling of a replay buffer with sampling according to a priority, which can be defined by various metrics such as the absolute TD error or the novelty of the observation. Priority sampling can considerably improve the convergence speed and the final performance of DQN.

    One implementation of PER is shown below (a minimal sampling sketch is also given after this list):

    ../_images/PERDQN.png
  • Multi-step TD-loss

    Note

    In the one-step setting, Q-learning learns \(Q(s,a)\) with the Bellman target \(r(s,a)+\gamma \max_{a^*}Q(s',a^*)\), while in the n-step setting the target is \(\sum_{t=0}^{n-1}\gamma^t r(s_t,a_t) + \gamma^n \max_{a^*}Q(s_n,a^*)\). One issue with n-step Q-learning is that, when epsilon-greedy exploration is adopted, the Q-value estimate is biased, because the rewards \(r(s_t,a_t)\) for \(t \ge 1\) are collected under the epsilon-greedy behaviour policy rather than the policy being learned. In practice, however, multi-step returns combined with epsilon-greedy exploration generally improve DQN (a sketch of the n-step target is given after this list).

  • Double (target) network

    Double DQN, proposed in Deep Reinforcement Learning with Double Q-learning, is a common variant of DQN. This method maintains a second Q-network, called the target network, whose parameters are copied from the current network at a fixed frequency (measured in update times).

    Instead of directly taking the maximum Q-value over the whole discrete action space, Double DQN first finds the action with the highest Q-value under the current network, and then queries the target network for the Q-value of this selected action. This decoupling mitigates the overestimation of the target Q-value and reduces the upward bias (see the target-computation sketch after this list).

    Note

    The overestimation can be caused by function approximation error (the neural network approximating the Q-table), environment noise, numerical instability and other factors.

  • Dueling head

    In Dueling Network Architectures for Deep Reinforcement Learning, the dueling head architecture decomposes the Q-value into a state value and a per-action advantage, and combines these two parts to form the final Q-value. This decomposition is better at evaluating states in which not all actions can be sampled or matter equally (a minimal head sketch is given after this list).

    The specific architecture is shown in the following graph:

    ../_images/Dueling_DQN.png
  • RNN (DRQN, R2D2)
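For the PER extension above, a minimal sketch of proportional prioritized sampling is shown below (a hypothetical helper, not DI-engine's buffer API): priorities such as the absolute TD error define sampling probabilities \(P(i)=p_i^\alpha / \sum_k p_k^\alpha\), and importance-sampling weights \((N \cdot P(i))^{-\beta}\) correct the induced bias.

import numpy as np

def sample_prioritized(priorities, batch_size, alpha=0.6, beta=0.4):
    # priorities: |TD error| + small constant for every transition in the buffer
    probs = priorities ** alpha
    probs = probs / probs.sum()
    idx = np.random.choice(len(priorities), size=batch_size, p=probs)
    # importance-sampling weights, normalized by their maximum for stability
    weights = (len(priorities) * probs[idx]) ** (-beta)
    weights = weights / weights.max()
    return idx, weights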
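For the multi-step extension, the n-step target described in the note above can be sketched as follows (an illustrative function with scalar inputs for clarity):

def n_step_target(rewards, next_q_max, done, gamma=0.99):
    # rewards: [r_0, ..., r_{n-1}] collected along the sampled trajectory
    # next_q_max: max_a Q(s_n, a); done: whether the episode terminated within these n steps
    ret = sum((gamma ** t) * r for t, r in enumerate(rewards))
    return ret + (gamma ** len(rewards)) * (1.0 - float(done)) * next_q_max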
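For the double (target) network extension, the decoupled target computation can be sketched as follows (hypothetical names: q_net is the current network, target_net the periodically copied one):

import torch

def double_dqn_target(q_net, target_net, reward, next_obs, done, gamma=0.99):
    with torch.no_grad():
        # action selection with the current network ...
        next_action = q_net(next_obs).argmax(dim=1, keepdim=True)
        # ... value evaluation with the target network
        next_q = target_net(next_obs).gather(1, next_action).squeeze(1)
    return reward + gamma * (1 - done) * next_q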
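For the dueling head extension, the value/advantage decomposition shown in the graph above can be sketched as a small module (illustrative only, not DI-engine's DuelingHead):

import torch.nn as nn

class SimpleDuelingHead(nn.Module):
    # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
    def __init__(self, hidden_size, action_dim):
        super().__init__()
        self.value = nn.Linear(hidden_size, 1)
        self.advantage = nn.Linear(hidden_size, action_dim)

    def forward(self, feature):
        v = self.value(feature)                     # (B, 1)
        a = self.advantage(feature)                 # (B, A)
        # subtract the mean advantage so the decomposition is identifiable
        return v + a - a.mean(dim=1, keepdim=True)  # (B, A)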

Implementations

The default config of DQNPolicy is defined as follows:

class ding.policy.dqn.DQNPolicy(cfg: dict, model: Optional[Union[type, torch.nn.modules.module.Module]] = None, enable_field: Optional[List[str]] = None)[source]
Overview:

Policy class of DQN algorithm, extended by Double DQN/Dueling DQN/PER/multi-step TD.

Config:

ID | Symbol | Type | Default Value | Description | Other (Shape)
1 | type | str | dqn | RL policy register name, refer to registry POLICY_REGISTRY | This arg is optional, a placeholder
2 | cuda | bool | False | Whether to use cuda for the network | This arg can be different between modes
3 | on_policy | bool | False | Whether the RL algorithm is on-policy or off-policy |
4 | priority | bool | False | Whether to use priority (PER) | Priority sample, update priority
5 | priority_IS_weight | bool | False | Whether to use Importance Sampling Weight to correct the biased update. If True, priority must be True. |
6 | discount_factor | float | 0.97, [0.95, 0.999] | Reward's future discount factor, aka. gamma | May be 1 for sparse reward envs
7 | nstep | int | 1, [3, 5] | N-step reward discount sum for target q_value estimation |
8 | learn.update_per_collect | int | 3 | How many updates (iterations) to train after one collection of the collector. Only valid in serial training | This arg can vary across envs. A bigger value means more off-policy
9 | learn.multi_gpu | bool | False | Whether to use multi gpu during training |
10 | learn.batch_size | int | 64 | The number of samples of an iteration |
11 | learn.learning_rate | float | 0.001 | Gradient step length of an iteration |
12 | learn.target_update_freq | int | 100 | Frequency of target network update | Hard (assign) update
13 | learn.ignore_done | bool | False | Whether to ignore done for target value calculation | Enable it for envs with fake termination
14 | collect.n_sample | int | [8, 128] | The number of training samples of one call of the collector | It varies across envs
15 | collect.unroll_len | int | 1 | Unroll length of an iteration | For RNN, unroll_len > 1
16 | other.eps.type | str | exp | Exploration rate decay type | Supports ['exp', 'linear']
17 | other.eps.start | float | 0.95 | Start value of the exploration rate | [0, 1]
18 | other.eps.end | float | 0.1 | End value of the exploration rate | [0, 1]
19 | other.eps.decay | int | 10000 | Decay length of exploration | Greater than 0. decay=10000 means the exploration rate decays from the start value to the end value over 10000 steps
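As a reading aid, a hypothetical config fragment exercising the fields above might look like the following; the nesting mirrors the dotted names in the table and the values are illustrative, not a tuned configuration:

dqn_config = dict(
    type='dqn',
    cuda=False,
    on_policy=False,
    priority=False,
    discount_factor=0.97,
    nstep=3,
    learn=dict(
        update_per_collect=3,
        batch_size=64,
        learning_rate=0.001,
        target_update_freq=100,
        ignore_done=False,
    ),
    collect=dict(n_sample=8, unroll_len=1),
    other=dict(eps=dict(type='exp', start=0.95, end=0.1, decay=10000)),
)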

The network interface DQN used is defined as follows:

class ding.model.template.q_learning.DQN(obs_shape: Union[int, ding.utils.type_helper.SequenceType], action_shape: Union[int, ding.utils.type_helper.SequenceType], encoder_hidden_size_list: ding.utils.type_helper.SequenceType = [128, 128, 64], dueling: bool = True, head_hidden_size: Optional[int] = None, head_layer_num: int = 1, activation: Optional[torch.nn.modules.module.Module] = ReLU(), norm_type: Optional[str] = None)[source]
__init__(obs_shape: Union[int, ding.utils.type_helper.SequenceType], action_shape: Union[int, ding.utils.type_helper.SequenceType], encoder_hidden_size_list: ding.utils.type_helper.SequenceType = [128, 128, 64], dueling: bool = True, head_hidden_size: Optional[int] = None, head_layer_num: int = 1, activation: Optional[torch.nn.modules.module.Module] = ReLU(), norm_type: Optional[str] = None) None[source]
Overview:

Init the DQN (encoder + head) Model according to input arguments.

Arguments:
  • obs_shape (Union[int, SequenceType]): Observation space shape, such as 8 or [4, 84, 84].

  • action_shape (Union[int, SequenceType]): Action space shape, such as 6 or [2, 3, 3].

  • encoder_hidden_size_list (SequenceType): Collection of hidden_size values to pass to the Encoder; the last element must match head_hidden_size.

  • dueling (bool): Whether to use DuelingHead or DiscreteHead; defaults to True (DuelingHead).

  • head_hidden_size (Optional[int]): The hidden_size of head network.

  • head_layer_num (int): The number of layers used in the head network to compute the Q-value output.

  • activation (Optional[nn.Module]): The type of activation function in the networks; if None, it defaults to nn.ReLU().

  • norm_type (Optional[str]): The type of normalization in networks, see ding.torch_utils.fc_block for more details.

forward(x: torch.Tensor) → Dict[source]
Overview:

DQN forward computation graph, input observation tensor to predict q_value.

Arguments:
  • x (torch.Tensor): Observation inputs

Returns:
  • outputs (Dict): DQN forward outputs, such as q_value.

ReturnsKeys:
  • logit (torch.Tensor): Discrete Q-value output of each action dimension.

Shapes:
  • x (torch.Tensor): \((B, N)\), where B is batch size and N is obs_shape

  • logit (torch.FloatTensor): \((B, M)\), where B is batch size and M is action_shape

Examples:
>>> model = DQN(32, 6)  # arguments: 'obs_shape' and 'action_shape'
>>> inputs = torch.randn(4, 32)
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict) and outputs['logit'].shape == torch.Size([4, 6])

The benchmark result of DQN implemented in DI-engine is shown in Benchmark.

Reference

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller: “Playing Atari with Deep Reinforcement Learning”, 2013; arXiv:1312.5602. https://arxiv.org/abs/1312.5602