A2C

Overview

A2C (Advantage Actor-Critic) is an actor-critic RL algorithm that combines the policy gradient method with an advantage function to reduce the variance of the gradient estimate.

Quick Facts

  1. A2C is a model-free and policy-based RL algorithm.

  2. A2C supports both discrete and continuous action spaces.

  3. A2C supports both off-policy and on-policy modes.

  4. A2C can be equipped with Recurrent Neural Network (RNN).

Key Equations or Key Graphs

A2C uses advantage estimation in the policy gradient:

\[\nabla_{\theta^{\prime}} \log \pi\left(a_{t} \mid {s}_{t} ; \theta^{\prime}\right) A\left(s_{t}, {a}_{t} ; \theta, \theta_{v}\right)\]

where the n-step advantage function is defined as:

\[\sum_{i=0}^{k-1} \gamma^{i} r_{t+i}+\gamma^{k} V\left(s_{t+k} ; \theta_{v}\right)-V\left(s_{t} ; \theta_{v}\right)\]
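
As an illustration (not the DI-engine implementation), the following minimal sketch computes this n-step advantage for a single state from a length-k reward segment and the two value estimates; the function name and arguments are assumptions made for the example:

import torch

def n_step_advantage(rewards, value_t, value_tk, gamma=0.99):
    # rewards:  (k,) tensor containing r_t, ..., r_{t+k-1}
    # value_t:  scalar tensor, V(s_t; theta_v)
    # value_tk: scalar tensor, V(s_{t+k}; theta_v), the bootstrap value
    k = rewards.shape[0]
    discounts = gamma ** torch.arange(k, dtype=rewards.dtype)
    # sum_{i=0}^{k-1} gamma^i r_{t+i} + gamma^k V(s_{t+k}) - V(s_t)
    return (discounts * rewards).sum() + gamma ** k * value_tk - value_t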

Pseudo-code

(Image: A2C pseudo-code, ../_images/A2C.png)

Note

Unlike Q-learning, A2C (and other actor-critic methods) alternates between policy evaluation and policy improvement.
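
As an illustration only (not the DI-engine training loop), the self-contained sketch below shows this alternation on random dummy transitions: the current critic first evaluates the policy through a one-step advantage, and the actor and critic are then improved from the resulting losses. All layer sizes and hyper-parameters here are arbitrary assumptions.

import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 8, 4, 0.99
actor = nn.Linear(obs_dim, act_dim)   # maps observation to action logits
critic = nn.Linear(obs_dim, 1)        # maps observation to state value V(s)
optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

for _ in range(10):
    # "policy evaluation": estimate advantages of a (dummy) batch of transitions
    obs, next_obs = torch.randn(32, obs_dim), torch.randn(32, obs_dim)
    reward = torch.randn(32)
    dist = torch.distributions.Categorical(logits=actor(obs))
    action = dist.sample()
    value = critic(obs).squeeze(-1)
    with torch.no_grad():
        target = reward + gamma * critic(next_obs).squeeze(-1)  # one-step return
    advantage = (target - value).detach()
    # "policy improvement": policy-gradient step weighted by the advantage,
    # plus value regression towards the return target
    policy_loss = -(dist.log_prob(action) * advantage).mean()
    value_loss = (value - target).pow(2).mean()
    optimizer.zero_grad()
    (policy_loss + 0.5 * value_loss).backward()
    optimizer.step()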

Extensions

A2C can be combined with:
  • Multi-step learning

  • RNN

  • Generalized Advantage Estimation (GAE): GAE is proposed in High-Dimensional Continuous Control Using Generalized Advantage Estimation. It uses an exponentially-weighted average of n-step advantage estimators to trade off the bias and variance of the advantage estimate:

    \[\hat{A}_{t}^{\mathrm{GAE}(\gamma, \lambda)}:=(1-\lambda)\left(\hat{A}_{t}^{(1)}+\lambda \hat{A}_{t}^{(2)}+\lambda^{2} \hat{A}_{t}^{(3)}+\ldots\right)\]

    where the k-step advantage estimator \(\hat{A}_t^{(k)}\) is defined as:

    \[\hat{A}_{t}^{(k)}:=\sum_{l=0}^{k-1} \gamma^{l} \delta_{t+l}^{V}=-V\left(s_{t}\right)+r_{t}+\gamma r_{t+1}+\cdots+\gamma^{k-1} r_{t+k-1}+\gamma^{k} V\left(s_{t+k}\right)\]

    When k=1, the estimator \(\hat{A}_t^{(1)}\) is the naive advantage estimator:

    \[\hat{A}_{t}^{(1)}:=\delta_{t}^{V} \quad=-V\left(s_{t}\right)+r_{t}+\gamma V\left(s_{t+1}\right)\]

    When GAE is used, \(\lambda\) is typically set in the range [0.8, 1.0].
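
A minimal computational sketch (not DI-engine code) of the GAE recursion above, which accumulates the TD residuals \(\delta_t^V\) backwards along a single trajectory; it ignores episode-termination masks for simplicity, and the function name and arguments are assumptions made for the example:

import torch

def gae(rewards, values, bootstrap_value, gamma=0.99, lam=0.95):
    # rewards:         (T,) tensor with r_0, ..., r_{T-1}
    # values:          (T,) tensor with V(s_0), ..., V(s_{T-1})
    # bootstrap_value: scalar tensor with V(s_T)
    T = rewards.shape[0]
    values_ext = torch.cat([values, bootstrap_value.view(1)])
    advantages = torch.zeros(T)
    gae_t = 0.0
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values_ext[t + 1] - values_ext[t]
        # exponentially-weighted accumulation with decay gamma * lambda
        gae_t = delta + gamma * lam * gae_t
        advantages[t] = gae_t
    return advantages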

Implementation

The default config is defined as follows:

class ding.policy.a2c.A2CPolicy(cfg: dict, model: Optional[Union[type, torch.nn.modules.module.Module]] = None, enable_field: Optional[List[str]] = None)[source]
Overview:

Policy class of A2C algorithm.

The network interface A2C used is defined as follows:

class ding.model.template.vac.VAC(obs_shape: Union[int, ding.utils.type_helper.SequenceType], action_shape: Union[int, ding.utils.type_helper.SequenceType], share_encoder: bool = True, continuous: bool = False, encoder_hidden_size_list: ding.utils.type_helper.SequenceType = [128, 128, 64], actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Optional[torch.nn.modules.module.Module] = ReLU(), norm_type: Optional[str] = None, sigma_type: Optional[str] = 'independent', bound_type: Optional[str] = None)[source]
Overview:

The VAC model.

Interfaces:

__init__, forward, compute_actor, compute_critic

__init__(obs_shape: Union[int, ding.utils.type_helper.SequenceType], action_shape: Union[int, ding.utils.type_helper.SequenceType], share_encoder: bool = True, continuous: bool = False, encoder_hidden_size_list: ding.utils.type_helper.SequenceType = [128, 128, 64], actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Optional[torch.nn.modules.module.Module] = ReLU(), norm_type: Optional[str] = None, sigma_type: Optional[str] = 'independent', bound_type: Optional[str] = None) None[source]
Overview:

Init the VAC Model according to arguments.

Arguments:
  • obs_shape (Union[int, SequenceType]): Observation’s space.

  • action_shape (Union[int, SequenceType]): Action’s space.

  • share_encoder (bool): Whether the actor and critic share the encoder.

  • continuous (bool): Whether the action space is continuous.

  • encoder_hidden_size_list (SequenceType): The collection of hidden sizes to pass to the Encoder.

  • actor_head_hidden_size (Optional[int]): The hidden_size to pass to actor-nn’s Head.

  • actor_head_layer_num (int):

    The number of layers used in the actor head network to compute the action logit output.

  • critic_head_hidden_size (Optional[int]): The hidden_size to pass to critic-nn’s Head.

  • critic_head_layer_num (int):

    The number of layers used in the critic head network to compute the value output.

  • activation (Optional[nn.Module]):

    The type of activation function to use in the MLP after each layer_fn; if None, defaults to nn.ReLU().

  • norm_type (Optional[str]):

    The type of normalization to use; see ding.torch_utils.fc_block for more details.
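
For instance, a usage sketch constructing a VAC network for a discrete action space with the arguments listed above (the observation and action sizes below are arbitrary):

>>> import torch
>>> from ding.model.template.vac import VAC
>>> model = VAC(obs_shape=8, action_shape=4, continuous=False)
>>> obs = torch.randn(2, 8)
>>> outputs = model(obs, 'compute_actor_critic')
>>> assert outputs['logit'].shape == torch.Size([2, 4])
>>> assert outputs['value'].shape == torch.Size([2])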

compute_actor(x: torch.Tensor) Dict[source]
Overview:

Execute the forward computation of the 'compute_actor' mode: use the encoded embedding tensor to predict the actor output.

Arguments:
  • inputs (torch.Tensor):

    The encoded embedding tensor, determined with given hidden_size, i.e. (B, N=hidden_size). hidden_size = actor_head_hidden_size

Returns:
  • outputs (Dict):

    The output computed by running the encoder and the actor head.

ReturnsKeys:
  • logit (torch.Tensor): Logit tensor, with the same batch size as the input x.

Shapes:
  • logit (torch.FloatTensor): \((B, N)\), where B is batch size and N is action_shape

Examples:
>>> model = VAC(64,64)
>>> inputs = torch.randn(4, 64)
>>> actor_outputs = model(inputs,'compute_actor')
>>> assert actor_outputs['logit'].shape == torch.Size([4, 64])
compute_actor_critic(x: torch.Tensor) Dict[source]
Overview:

Execute the forward computation of the 'compute_actor_critic' mode: use the encoded embedding tensor to predict both the actor and the critic output.

Arguments:
  • inputs (torch.Tensor): The encoded embedding tensor.

Returns:
  • outputs (Dict):

    The output computed by running the encoder and both heads.

ReturnsKeys:
  • logit (torch.Tensor): Logit tensor, with the same batch size as the input x.

  • value (torch.Tensor): State value tensor whose size equals the batch size.

Shapes:
  • logit (torch.FloatTensor): \((B, N)\), where B is batch size and N is action_shape

  • value (torch.FloatTensor): \((B, )\), where B is batch size.

Examples:
>>> model = VAC(64,64)
>>> inputs = torch.randn(4, 64)
>>> outputs = model(inputs,'compute_actor_critic')
>>> outputs['value']
tensor([0.0252, 0.0235, 0.0201, 0.0072], grad_fn=<SqueezeBackward1>)
>>> assert outputs['logit'].shape == torch.Size([4, 64])

Note

The compute_actor_critic interface aims to save computation when the actor and critic share the encoder, returning the combined dictionary.

compute_critic(x: torch.Tensor) Dict[source]
Overview:

Execute the forward computation of the 'compute_critic' mode: use the encoded embedding tensor to predict the value output.

Arguments:
  • inputs (torch.Tensor):

    The encoded embedding tensor, determined with given hidden_size, i.e. (B, N=hidden_size). hidden_size = critic_head_hidden_size

Returns:
  • outputs (Dict):

    The output computed by running the encoder and the critic head.

    Necessary Keys:
    • value (torch.Tensor): State value tensor whose size equals the batch size.

Shapes:
  • value (torch.FloatTensor): \((B, )\), where B is batch size.

Examples:
>>> model = VAC(64,64)
>>> inputs = torch.randn(4, 64)
>>> critic_outputs = model(inputs,'compute_critic')
>>> critic_outputs['value']
tensor([0.0252, 0.0235, 0.0201, 0.0072], grad_fn=<SqueezeBackward1>)
forward(inputs: Union[torch.Tensor, Dict], mode: str) Dict[source]
Overview:

Use the encoded embedding tensor to predict the output. The mode argument selects which of VAC's forward computation graphs is executed.

Arguments:
Forward with 'compute_actor' or 'compute_critic':
  • inputs (torch.Tensor):

    The encoded embedding tensor, determined by the given hidden_size, i.e. (B, N=hidden_size). Whether hidden_size corresponds to actor_head_hidden_size or critic_head_hidden_size depends on mode.

Returns:
  • outputs (Dict):

    The output computed by running the encoder and the corresponding head.

    Forward with 'compute_actor', Necessary Keys:
    • logit (torch.Tensor): Logit tensor, with the same batch size as the input x.

    Forward with 'compute_critic', Necessary Keys:
    • value (torch.Tensor): State value tensor whose size equals the batch size.

Shapes:
  • inputs (torch.Tensor): \((B, N)\), where B is batch size and N is the corresponding hidden_size

  • logit (torch.FloatTensor): \((B, N)\), where B is batch size and N is action_shape

  • value (torch.FloatTensor): \((B, )\), where B is batch size.

Actor Examples:
>>> model = VAC(64,128)
>>> inputs = torch.randn(4, 64)
>>> actor_outputs = model(inputs,'compute_actor')
>>> assert actor_outputs['logit'].shape == torch.Size([4, 128])
Critic Examples:
>>> model = VAC(64,64)
>>> inputs = torch.randn(4, 64)
>>> critic_outputs = model(inputs,'compute_critic')
>>> critic_outputs['value']
tensor([0.0252, 0.0235, 0.0201, 0.0072], grad_fn=<SqueezeBackward1>)
Actor-Critic Examples:
>>> model = VAC(64,64)
>>> inputs = torch.randn(4, 64)
>>> outputs = model(inputs,'compute_actor_critic')
>>> outputs['value']
tensor([0.0252, 0.0235, 0.0201, 0.0072], grad_fn=<SqueezeBackward1>)
>>> assert outputs['logit'].shape == torch.Size([4, 64])

The policy gradient and value update of A2C are implemented as follows (the imports and the a2c_loss namedtuple definition are included for completeness):

from collections import namedtuple
import torch
import torch.nn.functional as F

a2c_loss = namedtuple('a2c_loss', ['policy_loss', 'value_loss', 'entropy_loss'])

def a2c_error(data: namedtuple) -> namedtuple:
    logit, action, value, adv, return_, weight = data
    if weight is None:
        weight = torch.ones_like(value)
    # categorical policy distribution built from the actor logits
    dist = torch.distributions.categorical.Categorical(logits=logit)
    logp = dist.log_prob(action)
    entropy_loss = (dist.entropy() * weight).mean()
    # policy gradient loss: -log pi(a|s) weighted by the advantage
    policy_loss = -(logp * adv * weight).mean()
    # value regression towards the return target
    value_loss = (F.mse_loss(return_, value, reduction='none') * weight).mean()
    return a2c_loss(policy_loss, value_loss, entropy_loss)
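
A hedged usage sketch: assuming a companion a2c_data namedtuple with the field order unpacked above (the tensors here are random placeholders), the three losses can be computed as:

>>> from collections import namedtuple
>>> import torch
>>> a2c_data = namedtuple('a2c_data', ['logit', 'action', 'value', 'adv', 'return_', 'weight'])
>>> B, N = 4, 3  # batch size, number of discrete actions
>>> data = a2c_data(
...     logit=torch.randn(B, N),
...     action=torch.randint(0, N, (B,)),
...     value=torch.randn(B),
...     adv=torch.randn(B),
...     return_=torch.randn(B),
...     weight=None,
... )
>>> loss = a2c_error(data)
>>> assert all(l.shape == torch.Size([]) for l in loss)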

The Benchmark results of A2C implemented in DI-engine can be found in Benchmark.

References

Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu: “Asynchronous Methods for Deep Reinforcement Learning”, ICML 2016; arXiv:1602.01783. https://arxiv.org/abs/1602.01783