DQN

Overview

DQN was first proposed in Playing Atari with Deep Reinforcement Learning, which combines Q-learning with deep neural networks. Unlike earlier methods, DQN uses a deep neural network to estimate the q-values, and the network is trained by minimizing a TD-loss with gradient descent.

Quick Facts

DQN is a model-free and value-based RL algorithm.
DQN only supports discrete action spaces.
DQN is an off-policy algorithm.
DQN usually uses eps-greedy or multinomial sampling for exploration.
DQN + RNN = DRQN.
The DI-engine implementation of DQN also supports multi-discrete action spaces.

Key Equations or Key Graphs

The TD-loss used in DQN is:

\[L(w)=\mathbb{E}\left[(\underbrace{r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}, w\right)}_{\text {Target }}-Q(s, a, w))^{2}\right]\]
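To make the loss above concrete, the following is a minimal sketch that evaluates it on a tiny hand-written batch using plain Python. The helper names `td_target` and `td_loss` are hypothetical, and the `done` flag (which drops the bootstrap term on terminal transitions) is an assumption that does not appear in the equation itself.

```python
def td_target(reward, next_q_values, gamma, done=False):
    """One-step target: r + gamma * max_a' Q(s', a', w).

    Assumption: no bootstrapping on terminal transitions (done=True).
    """
    bootstrap = 0.0 if done else max(next_q_values)
    return reward + gamma * bootstrap


def td_loss(batch, gamma):
    """Mean squared TD error over a batch of transitions.

    Each item is (q_sa, reward, next_q_values, done), where q_sa is
    Q(s, a, w) for the action that was actually taken.
    """
    errors = [td_target(r, nq, gamma, d) - q_sa for q_sa, r, nq, d in batch]
    return sum(e * e for e in errors) / len(errors)


batch = [
    (1.0, 0.5, [0.2, 1.0, 0.4], False),  # target = 0.5 + 0.9 * 1.0 = 1.4
    (2.0, 1.0, [3.0, 0.0, 1.0], True),   # terminal, so target = 1.0
]
loss = td_loss(batch, gamma=0.9)  # mean of 0.4**2 and (-1.0)**2
```

In a real training loop the expectation is approximated by a mini-batch sampled from the replay buffer, and the gradient is taken only through \(Q(s, a, w)\), with the target treated as a constant.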

Pseudo-code

Note

Compared with the vanilla version, DQN has been substantially modified in both algorithm and implementation. On the algorithm side, n-step TD-loss, PER, target network and dueling head are widely used. On the implementation side, the value of epsilon is annealed from a high value to a low value during training rather than kept constant, and the schedule is driven by the env step (the number of policy interactions with the env).
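The epsilon annealing mentioned in the note can be sketched as follows. This is an illustration only: `epsilon_at` and `eps_greedy_action` are hypothetical helpers, the default values are taken from the `other.eps` rows of the config table in the Implementations section, and the exact decay formulas are assumed forms that may differ from what DI-engine computes.

```python
import math
import random


def epsilon_at(step, start=0.95, end=0.1, decay=10000, kind="exp"):
    """Exploration rate after `step` env steps (assumed schedule shapes)."""
    if kind == "exp":
        # Decays smoothly from `start` towards `end` with time constant `decay`.
        return end + (start - end) * math.exp(-step / decay)
    if kind == "linear":
        # Reaches `end` after `decay` steps and stays there afterwards.
        return max(end, start - (start - end) * step / decay)
    raise ValueError(f"unknown decay type: {kind}")


def eps_greedy_action(q_values, eps, rng=random):
    """With probability eps pick a uniformly random action, else the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


eps_early = epsilon_at(0)        # equals `start`
eps_late = epsilon_at(50000)     # close to `end`
action = eps_greedy_action([0.1, 0.9, 0.3], eps=0.0)  # greedy when eps is 0
```

Driving the schedule by env step rather than by training iteration keeps the amount of exploration tied to how much experience has actually been collected.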

Extensions

DQN can be combined with:

PER (Prioritized Experience Replay)

Prioritized Experience Replay replaces uniform sampling from the replay buffer with sampling according to a priority, which can be defined by various metrics such as the absolute TD error or the novelty of the observation. Priority sampling can considerably improve both the convergence speed and the final performance of DQN.
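As a rough illustration of priority sampling, the sketch below draws indices with probability proportional to the absolute TD error raised to a power `alpha`, and returns importance-sampling weights controlled by `beta`. The function name, the values of `alpha`, `beta` and `eps`, and the choice to normalise weights within the sampled batch are all assumptions made for this example; it is not DI-engine's buffer implementation, and a production buffer would typically use a sum-tree instead of a linear scan.

```python
import random


def sample_prioritized(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6, rng=random):
    """Sample indices with probability proportional to (|TD error| + eps)^alpha.

    Returns (indices, weights). The weights are importance-sampling
    corrections, normalised here so that the largest weight in the batch is 1.
    """
    priorities = [(abs(e) + eps) ** alpha for e in td_errors]
    total = sum(priorities)
    probs = [p / total for p in priorities]
    indices = rng.choices(range(len(td_errors)), weights=probs, k=batch_size)
    n = len(td_errors)
    raw = [(n * probs[i]) ** (-beta) for i in indices]
    max_w = max(raw)
    return indices, [w / max_w for w in raw]


random.seed(0)
indices, weights = sample_prioritized([0.0, 0.0, 10.0], batch_size=100)
```

The importance-sampling weights correspond to the role of `priority_IS_weight` in the config table: they scale each sample's loss to compensate for the non-uniform sampling.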

One implementation of PER is described below:

Multi-step TD-loss

Note

In the one-step setting, Q-learning learns \(Q(s,a)\) with the Bellman target \(r(s,a)+\gamma \max\limits_{a^*}Q(s',a^*)\). In the n-step setting the target becomes \(\sum_{t=0}^{n-1}\gamma^t r(s_t,a_t) + \gamma^n \max\limits_{a^*}Q(s_n,a^*)\). One issue with n-step Q-learning is that, when epsilon-greedy is used, the q-value estimate is biased, because the rewards \(r(s_t,a_t)\) for \(t \geq 1\) are collected under the epsilon-greedy behavior policy rather than under the greedy policy being learned. In practice, however, multi-step targets combined with epsilon-greedy generally improve DQN.
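The n-step target in the note can be computed directly from a short reward sequence. The sketch below is a minimal illustration; `nstep_target` is a hypothetical helper, and the `done` flag that removes the bootstrap term is an assumption not present in the formula.

```python
def nstep_target(rewards, next_q_values, gamma, done=False):
    """sum_{t=0}^{n-1} gamma^t r_t + gamma^n max_a Q(s_n, a), with n = len(rewards).

    Assumption: if the episode ended within the n steps, drop the bootstrap term.
    """
    n = len(rewards)
    discounted = sum((gamma ** t) * r for t, r in enumerate(rewards))
    bootstrap = 0.0 if done else max(next_q_values)
    return discounted + (gamma ** n) * bootstrap


one_step = nstep_target([1.0], [2.0, 0.5], gamma=0.9)               # 1 + 0.9 * 2
three_step = nstep_target([1.0, 1.0, 1.0], [2.0, 0.5], gamma=0.9)   # 2.71 + 0.729 * 2
```

With `n = 1` the function reduces to the one-step Bellman target, which is a quick way to check an n-step implementation.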

Double (target) network

Double DQN, proposed in Deep Reinforcement Learning with Double Q-learning, is a common variant of DQN. It maintains a second Q-network, called the target network, which is updated from the current network at a fixed frequency (counted in update times).

Instead of directly taking the maximum q_value over the whole discrete action space, Double DQN first finds the action whose q_value is highest in the current network, and then reads the q_value of that selected action from the target network. This variant mitigates the overestimation of the target q_value and reduces upward bias.

Note

The overestimation can be caused by function approximation error (a neural network standing in for the q table), environment noise, numerical instability and other factors.
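The difference between the two ways of forming the target can be seen on a single transition. The sketch below is illustrative only; the function names are hypothetical, and the q-value lists stand in for the outputs of the current network and the target network at the next state.

```python
def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda a: values[a])


def max_target(reward, next_q_target, gamma):
    """Target that takes the maximum over one network's own estimates."""
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_current, next_q_target, gamma):
    """Double DQN: the current network selects the action,
    the target network evaluates it."""
    best_action = argmax(next_q_current)
    return reward + gamma * next_q_target[best_action]


next_q_current = [1.0, 2.0, 0.5]   # current network prefers action 1
next_q_target = [1.5, 1.2, 3.0]    # target network has a noisy high value at action 2
plain = max_target(0.0, next_q_target, gamma=1.0)
double = double_dqn_target(0.0, next_q_current, next_q_target, gamma=1.0)
```

Because the selected action's target-network value can never exceed the target network's maximum, the Double DQN target is never larger than the max-based one, which is how the upward bias is reduced.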

Dueling head

In Dueling Network Architectures for Deep Reinforcement Learning, the dueling head architecture decomposes the q_value into a state value and a per-action advantage, and then combines these two parts to construct the final q_value. This helps evaluate the value of states in which not all actions can be sampled.
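The combination step can be sketched in a few lines. This example uses the mean-subtracted aggregation \(Q(s,a) = V(s) + A(s,a) - \frac{1}{|A|}\sum_{a'} A(s,a')\) from the dueling paper; the function name is hypothetical, and the inputs are plain numbers standing in for the outputs of the value and advantage streams rather than a real network head.

```python
def dueling_q_values(state_value, advantages):
    """Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + adv - mean_adv for adv in advantages]


q_a = dueling_q_values(2.0, [1.0, 0.0, -1.0])
q_b = dueling_q_values(2.0, [2.0, 1.0, 0.0])  # advantages shifted by +1
```

Subtracting the mean makes the decomposition identifiable: shifting every advantage by the same constant leaves the resulting q_values unchanged, so the value stream alone carries the overall level of the state.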

The specific architecture is shown in the following graph:

RNN (DRQN, R2D2)

Implementations

The default config of DQNPolicy is defined as follows:

class ding.policy.dqn.DQNPolicy(cfg: dict, model: Optional[Union[type, torch.nn.modules.module.Module]] = None, enable_field: Optional[List[str]] = None)[source]

Overview:

Policy class of DQN algorithm, extended by Double DQN/Dueling DQN/PER/multi-step TD.

Config:

ID	Symbol	Type	Default Value	Description	Other (Shape)
1	`type`	str	dqn	RL policy register name, refer to registry `POLICY_REGISTRY`	This arg is optional, a placeholder
2	`cuda`	bool	False	Whether to use cuda for network	This arg can be different from modes
3	`on_policy`	bool	False	Whether the RL algorithm is on-policy or off-policy
4	`priority`	bool	False	Whether to use priority (PER)	Priority sample, update priority
5	`priority_IS_weight`	bool	False	Whether to use Importance Sampling weights to correct the biased update. If True, priority must be True.
6	`discount_factor`	float	0.97, [0.95, 0.999]	Reward’s future discount factor, aka gamma	May be 1 in sparse-reward envs
7	`nstep`	int	1, [3, 5]	N-step reward discount sum for target q_value estimation
8	`learn.update_per_collect`	int	3	How many updates (iterations) to train after one collection by the collector. Only valid in serial training	This arg can vary across envs. A bigger value means more off-policy
9	`learn.multi_gpu`	bool	False	Whether to use multi-GPU during training
10	`learn.batch_size`	int	64	The number of samples in an iteration
11	`learn.learning_rate`	float	0.001	Gradient step length of an iteration
12	`learn.target_update_freq`	int	100	Frequency of target network update	Hard (assign) update
13	`learn.ignore_done`	bool	False	Whether to ignore done for target value calculation	Enable it for envs with fake termination
14	`collect.n_sample`	int	[8, 128]	The number of training samples in one call of the collector	It varies across envs
15	`collect.unroll_len`	int	1	Unroll length of an iteration	In RNN, unroll_len>1
16	`other.eps.type`	str	exp	Exploration rate decay type	Supports ['exp', 'linear']
17	`other.eps.start`	float	0.95	Start value of the exploration rate	[0,1]
18	`other.eps.end`	float	0.1	End value of the exploration rate	[0,1]
19	`other.eps.decay`	int	10000	Decay length of exploration	Greater than 0. Setting decay=10000 means the exploration rate decays from the start value to the end value over the decay length
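For orientation, the table can be read as a nested dictionary in which the dotted symbols become nested keys. The sketch below is an assumption about that layout built only from the rows above: it uses a plain `dict`, picks `n_sample=8` from the listed range, and adds a hypothetical `check_dqn_config` helper that encodes two constraints stated in the descriptions. It is not the policy's actual default config object.

```python
dqn_config = dict(
    type="dqn",
    cuda=False,
    on_policy=False,
    priority=False,
    priority_IS_weight=False,
    discount_factor=0.97,
    nstep=1,
    learn=dict(
        update_per_collect=3,
        multi_gpu=False,
        batch_size=64,
        learning_rate=0.001,
        target_update_freq=100,
        ignore_done=False,
    ),
    # n_sample has no single default in the table; 8 is chosen from [8, 128].
    collect=dict(n_sample=8, unroll_len=1),
    other=dict(eps=dict(type="exp", start=0.95, end=0.1, decay=10000)),
)


def check_dqn_config(cfg):
    """Hypothetical sanity checks derived from the table's descriptions."""
    if cfg["priority_IS_weight"] and not cfg["priority"]:
        raise ValueError("priority_IS_weight=True requires priority=True")
    eps = cfg["other"]["eps"]
    if eps["type"] not in ("exp", "linear"):
        raise ValueError("other.eps.type must be 'exp' or 'linear'")
    if not (0.0 <= eps["start"] <= 1.0 and 0.0 <= eps["end"] <= 1.0):
        raise ValueError("other.eps.start and other.eps.end must lie in [0, 1]")
    if eps["decay"] <= 0:
        raise ValueError("other.eps.decay must be greater than 0")
    return cfg


checked = check_dqn_config(dqn_config)
```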

The network interface DQN used is defined as follows:

class ding.model.template.q_learning.DQN(obs_shape: Union[int, ding.utils.type_helper.SequenceType], action_shape: Union[int, ding.utils.type_helper.SequenceType], encoder_hidden_size_list: ding.utils.type_helper.SequenceType = [128, 128, 64], dueling: bool = True, head_hidden_size: Optional[int] = None, head_layer_num: int = 1, activation: Optional[torch.nn.modules.module.Module] = ReLU(), norm_type: Optional[str] = None)[source]

__init__(obs_shape: Union[int, ding.utils.type_helper.SequenceType], action_shape: Union[int, ding.utils.type_helper.SequenceType], encoder_hidden_size_list: ding.utils.type_helper.SequenceType = [128, 128, 64], dueling: bool = True, head_hidden_size: Optional[int] = None, head_layer_num: int = 1, activation: Optional[torch.nn.modules.module.Module] = ReLU(), norm_type: Optional[str] = None) → None[source]

Overview:

Init the DQN (encoder + head) Model according to input arguments.

Arguments:

obs_shape (Union[int, SequenceType]): Observation space shape, such as 8 or [4, 84, 84].
action_shape (Union[int, SequenceType]): Action space shape, such as 6 or [2, 3, 3].
encoder_hidden_size_list (SequenceType): Collection of hidden_size to pass to Encoder, the last element must match head_hidden_size.
dueling (bool): Whether to use DuelingHead (True, the default) or DiscreteHead (False).
head_hidden_size (Optional[int]): The hidden_size of head network.
head_layer_num (int): The number of layers used in the head network to compute Q value output.
activation (Optional[nn.Module]): The type of activation function in networks; if None, it defaults to nn.ReLU().
norm_type (Optional[str]): The type of normalization in networks, see ding.torch_utils.fc_block for more details.

forward(x: torch.Tensor) → Dict[source]

Overview:

DQN forward computation graph, input observation tensor to predict q_value.

Arguments:

x (torch.Tensor): Observation inputs

Returns:

outputs (Dict): DQN forward outputs, such as q_value.

ReturnsKeys:

logit (torch.Tensor): Discrete Q-value output of each action dimension.

Shapes:

x (torch.Tensor): \((B, N)\), where B is batch size and N is obs_shape
logit (torch.FloatTensor): \((B, M)\), where B is batch size and M is action_shape

Examples:

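>>> import torch
>>> from ding.model.template.q_learning import DQN
>>> model = DQN(32, 6)  # arguments: 'obs_shape' and 'action_shape'
>>> inputs = torch.randn(4, 32)  # batch of 4 observations, each of size 32
>>> outputs = model(inputs)
>>> assert isinstance(outputs, dict) and outputs['logit'].shape == torch.Size([4, 6])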

The benchmark results of DQN implemented in DI-engine are shown in Benchmark.

Reference

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller: “Playing Atari with Deep Reinforcement Learning”, 2013; arXiv:1312.5602. https://arxiv.org/abs/1312.5602