A2C¶
Overview¶
A2C (Advantage Actor-Critic) is an actor-critic RL algorithm that combines the policy gradient method with an advantage function to reduce the variance of the gradient estimate.
Quick Facts¶
A2C is a model-free and policy-based RL algorithm.
A2C supports both discrete and continuous action spaces.
A2C supports both off-policy and on-policy modes.
A2C can be equipped with Recurrent Neural Network (RNN).
Key Equations or Key Graphs¶
A2C uses advantage estimation in the policy gradient:
\[\nabla_{\theta} J(\theta)=\mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right) \hat{A}\left(s_{t}, a_{t}\right)\right]\]
where the n-step advantage function is defined as:
\[\hat{A}\left(s_{t}, a_{t}\right)=\sum_{k=0}^{n-1} \gamma^{k} r_{t+k}+\gamma^{n} V\left(s_{t+n}\right)-V\left(s_{t}\right)\]
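As a concrete illustration, the n-step advantage can be computed over a finite trajectory as in the following minimal sketch (this is not DI-engine code; the function name n_step_advantage and the reward/value tensor layout are assumptions made for the example):

import torch

def n_step_advantage(rewards: torch.Tensor, values: torch.Tensor, n: int, gamma: float) -> torch.Tensor:
    # rewards: (T,) rewards r_t along a trajectory
    # values: (T + 1,) value estimates V(s_t), including the bootstrap value of the final state
    T = rewards.shape[0]
    adv = torch.zeros(T)
    for t in range(T):
        k = min(n, T - t)  # truncate the n-step return at the end of the trajectory
        ret = sum(gamma ** i * rewards[t + i] for i in range(k)) + gamma ** k * values[t + k]
        adv[t] = ret - values[t]
    return adv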
Pseudo-code¶
Note
Different from Q-learning, A2C (and other actor-critic methods) alternates between policy evaluation and policy improvement.
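The original pseudo-code figure is not reproduced here. The following self-contained sketch illustrates this alternation on a toy synthetic environment with a 1-step return (the environment, network sizes, and loss weights are illustrative assumptions, not the DI-engine pipeline):

import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim, gamma = 4, 2, 0.99
actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, act_dim))
critic = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def env_step(obs, action):
    # toy dynamics: reward 1 when the action matches the sign of the first observation feature
    reward = (action == (obs[:, 0] > 0).long()).float()
    return reward, torch.randn_like(obs)

for iteration in range(100):
    # 1. collect a batch of on-policy transitions with the current policy
    obs = torch.randn(64, obs_dim)
    dist = torch.distributions.Categorical(logits=actor(obs))
    action = dist.sample()
    reward, next_obs = env_step(obs, action)
    # 2. policy evaluation: 1-step bootstrapped return and advantage from the critic
    with torch.no_grad():
        target = reward + gamma * critic(next_obs).squeeze(-1)
    value = critic(obs).squeeze(-1)
    adv = (target - value).detach()
    # 3. policy improvement: advantage-weighted policy gradient, value regression, entropy bonus
    policy_loss = -(dist.log_prob(action) * adv).mean()
    value_loss = F.mse_loss(value, target)
    loss = policy_loss + 0.5 * value_loss - 0.01 * dist.entropy().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()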
Extensions¶
- A2C can be combined with:
Multi-step learning
RNN
Generalized Advantage Estimation (GAE). GAE is proposed in High-Dimensional Continuous Control Using Generalized Advantage Estimation. It uses an exponentially-weighted average of advantage estimators with different step lengths to trade off the variance and bias of the advantage estimate (a computational sketch follows after this list):
\[\hat{A}_{t}^{\mathrm{GAE}(\gamma, \lambda)}:=(1-\lambda)\left(\hat{A}_{t}^{(1)}+\lambda \hat{A}_{t}^{(2)}+\lambda^{2} \hat{A}_{t}^{(3)}+\ldots\right)\]
where the k-step advantage estimator \(\hat{A}_t^{(k)}\) is defined as:
\[\hat{A}_{t}^{(k)}:=\sum_{l=0}^{k-1} \gamma^{l} \delta_{t+l}^{V}=-V\left(s_{t}\right)+r_{t}+\gamma r_{t+1}+\cdots+\gamma^{k-1} r_{t+k-1}+\gamma^{k} V\left(s_{t+k}\right)\]
When k=1, the estimator \(\hat{A}_t^{(1)}\) is the naive advantage estimator:
\[\hat{A}_{t}^{(1)}:=\delta_{t}^{V}=-V\left(s_{t}\right)+r_{t}+\gamma V\left(s_{t+1}\right)\]
When GAE is used, the common values of \(\lambda\) usually belong to [0.8, 1.0].
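In practice the GAE estimator is usually computed with a backward recursion over the TD residuals \(\delta_t^V\); here is a minimal sketch (the function name gae and the tensor layout are illustrative, not DI-engine's API):

import torch

def gae(rewards: torch.Tensor, values: torch.Tensor, gamma: float = 0.99, lambd: float = 0.95) -> torch.Tensor:
    # rewards: (T,) rewards r_t; values: (T + 1,) value estimates V(s_t), the last entry bootstraps the tail
    deltas = rewards + gamma * values[1:] - values[:-1]  # one-step TD residuals delta_t^V
    adv = torch.zeros_like(rewards)
    running = torch.tensor(0.)
    for t in reversed(range(rewards.shape[0])):
        running = deltas[t] + gamma * lambd * running  # equivalent to the exponentially-weighted sum above
        adv[t] = running
    return adv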
Implementation¶
The default config is defined as follows:
- class ding.policy.a2c.A2CPolicy(cfg: dict, model: Optional[Union[type, torch.nn.modules.module.Module]] = None, enable_field: Optional[List[str]] = None)[source]
- Overview:
Policy class of A2C algorithm.
The network interface used by A2C is defined as follows:
- class ding.model.template.vac.VAC(obs_shape: Union[int, ding.utils.type_helper.SequenceType], action_shape: Union[int, ding.utils.type_helper.SequenceType], share_encoder: bool = True, continuous: bool = False, encoder_hidden_size_list: ding.utils.type_helper.SequenceType = [128, 128, 64], actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Optional[torch.nn.modules.module.Module] = ReLU(), norm_type: Optional[str] = None, sigma_type: Optional[str] = 'independent', bound_type: Optional[str] = None)[source]
- Overview:
The VAC model.
- Interfaces:
__init__, forward, compute_actor, compute_critic
- __init__(obs_shape: Union[int, ding.utils.type_helper.SequenceType], action_shape: Union[int, ding.utils.type_helper.SequenceType], share_encoder: bool = True, continuous: bool = False, encoder_hidden_size_list: ding.utils.type_helper.SequenceType = [128, 128, 64], actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Optional[torch.nn.modules.module.Module] = ReLU(), norm_type: Optional[str] = None, sigma_type: Optional[str] = 'independent', bound_type: Optional[str] = None) None [source]
- Overview:
Init the VAC Model according to arguments.
- Arguments:
- obs_shape (Union[int, SequenceType]): Observation space shape.
- action_shape (Union[int, SequenceType]): Action space shape.
- share_encoder (bool): Whether the actor and critic share the encoder.
- continuous (bool): Whether the action space is continuous.
- encoder_hidden_size_list (SequenceType): Collection of hidden_size values to pass to the Encoder.
- actor_head_hidden_size (Optional[int]): The hidden_size to pass to the actor-nn's Head.
- actor_head_layer_num (int): The number of layers used in the actor-nn's head to compute the action output.
- critic_head_hidden_size (Optional[int]): The hidden_size to pass to the critic-nn's Head.
- critic_head_layer_num (int): The number of layers used in the critic-nn's head to compute the value output.
- activation (Optional[nn.Module]): The type of activation function to use in the MLP after each layer_fn; if None, defaults to nn.ReLU().
- norm_type (Optional[str]): The type of normalization to use; see ding.torch_utils.fc_block for more details.
- compute_actor(x: torch.Tensor) Dict [source]
- Overview:
Execute the 'compute_actor' forward mode: use the encoded embedding tensor to predict the actor output.
- Arguments:
- inputs (torch.Tensor): The encoded embedding tensor, whose shape is determined by the given hidden_size, i.e. (B, N=hidden_size), with hidden_size = actor_head_hidden_size.
- Returns:
- outputs (Dict): The output of running the encoder and actor head.
- ReturnsKeys:
- logit (torch.Tensor): Logit encoding tensor, with the same size as input x.
- Shapes:
- logit (torch.FloatTensor): \((B, N)\), where B is batch size and N is action_shape.
- Examples:
>>> model = VAC(64, 64)
>>> inputs = torch.randn(4, 64)
>>> actor_outputs = model(inputs, 'compute_actor')
>>> assert actor_outputs['logit'].shape == torch.Size([4, 64])
- compute_actor_critic(x: torch.Tensor) Dict [source]
- Overview:
Execute the 'compute_actor_critic' forward mode: use the encoded embedding tensor to predict both the actor and critic outputs.
- Arguments:
- inputs (torch.Tensor): The encoded embedding tensor.
- Returns:
- outputs (Dict): The output of running the encoder and both heads.
- ReturnsKeys:
- logit (torch.Tensor): Logit encoding tensor, with the same size as input x.
- value (torch.Tensor): Value tensor with the same size as the batch size.
- Shapes:
- logit (torch.FloatTensor): \((B, N)\), where B is batch size and N is action_shape.
- value (torch.FloatTensor): \((B, )\), where B is batch size.
- Examples:
>>> model = VAC(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = model(inputs, 'compute_actor_critic')
>>> outputs['value']
tensor([0.0252, 0.0235, 0.0201, 0.0072], grad_fn=<SqueezeBackward1>)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
Note
The compute_actor_critic interface aims to save computation when the encoder is shared, and returns the combined dictionary of actor and critic outputs.
- compute_critic(x: torch.Tensor) Dict [source]
- Overview:
Execute the 'compute_critic' forward mode: use the encoded embedding tensor to predict the value output.
- Arguments:
- inputs (torch.Tensor): The encoded embedding tensor, whose shape is determined by the given hidden_size, i.e. (B, N=hidden_size), with hidden_size = critic_head_hidden_size.
- Returns:
- outputs (Dict): The output of running the encoder and critic head.
- Necessary Keys:
- value (torch.Tensor): Value tensor with the same size as the batch size.
- Shapes:
- value (torch.FloatTensor): \((B, )\), where B is batch size.
- Examples:
>>> model = VAC(64, 64)
>>> inputs = torch.randn(4, 64)
>>> critic_outputs = model(inputs, 'compute_critic')
>>> critic_outputs['value']
tensor([0.0252, 0.0235, 0.0201, 0.0072], grad_fn=<SqueezeBackward1>)
- forward(inputs: Union[torch.Tensor, Dict], mode: str) Dict [source]
- Overview:
Use the encoded embedding tensor to predict the output; the computation is dispatched to the corresponding head according to mode.
- Arguments:
- Forward with 'compute_actor' or 'compute_critic':
- inputs (torch.Tensor): The encoded embedding tensor, whose shape is determined by the given hidden_size, i.e. (B, N=hidden_size). Whether hidden_size is actor_head_hidden_size or critic_head_hidden_size depends on mode.
- Returns:
- outputs (Dict): The output of running the encoder and head.
- Forward with 'compute_actor', Necessary Keys:
- logit (torch.Tensor): Logit encoding tensor, with the same size as input x.
- Forward with 'compute_critic', Necessary Keys:
- value (torch.Tensor): Value tensor with the same size as the batch size.
- Shapes:
- inputs (torch.Tensor): \((B, N)\), where B is batch size and N is the corresponding hidden_size.
- logit (torch.FloatTensor): \((B, N)\), where B is batch size and N is action_shape.
- value (torch.FloatTensor): \((B, )\), where B is batch size.
- Actor Examples:
>>> model = VAC(64, 128)
>>> inputs = torch.randn(4, 64)
>>> actor_outputs = model(inputs, 'compute_actor')
>>> assert actor_outputs['logit'].shape == torch.Size([4, 128])
- Critic Examples:
>>> model = VAC(64, 64)
>>> inputs = torch.randn(4, 64)
>>> critic_outputs = model(inputs, 'compute_critic')
>>> critic_outputs['value']
tensor([0.0252, 0.0235, 0.0201, 0.0072], grad_fn=<SqueezeBackward1>)
- Actor-Critic Examples:
>>> model = VAC(64, 64)
>>> inputs = torch.randn(4, 64)
>>> outputs = model(inputs, 'compute_actor_critic')
>>> outputs['value']
tensor([0.0252, 0.0235, 0.0201, 0.0072], grad_fn=<SqueezeBackward1>)
>>> assert outputs['logit'].shape == torch.Size([4, 64])
The policy gradient and value update of A2C are implemented as follows:
from collections import namedtuple
import torch
import torch.nn.functional as F

a2c_data = namedtuple('a2c_data', ['logit', 'action', 'value', 'adv', 'return_', 'weight'])
a2c_loss = namedtuple('a2c_loss', ['policy_loss', 'value_loss', 'entropy_loss'])

def a2c_error(data: namedtuple) -> namedtuple:
    logit, action, value, adv, return_, weight = data
    if weight is None:
        weight = torch.ones_like(value)
    dist = torch.distributions.categorical.Categorical(logits=logit)
    logp = dist.log_prob(action)
    entropy_loss = (dist.entropy() * weight).mean()
    # policy gradient weighted by the advantage, value regression toward the return
    policy_loss = -(logp * adv * weight).mean()
    value_loss = (F.mse_loss(return_, value, reduction='none') * weight).mean()
    return a2c_loss(policy_loss, value_loss, entropy_loss)
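A minimal usage sketch, assuming DI-engine is installed and using the VAC model documented above; the random observations and advantages stand in for a real rollout, where the advantages would normally come from the n-step or GAE estimators described earlier:
>>> from ding.model.template.vac import VAC
>>> model = VAC(obs_shape=8, action_shape=3)
>>> obs = torch.randn(4, 8)
>>> out = model(obs, 'compute_actor_critic')
>>> adv = torch.randn(4)
>>> data = a2c_data(out['logit'], torch.randint(0, 3, (4,)), out['value'], adv, adv + out['value'].detach(), None)
>>> policy_loss, value_loss, entropy_loss = a2c_error(data)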
The benchmark results of A2C implemented in DI-engine can be found in Benchmark.
References¶
Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu: “Asynchronous Methods for Deep Reinforcement Learning”, 2016, ICML 2016; arXiv:1602.01783. https://arxiv.org/abs/1602.01783