PPO

Overview

PPO (Proximal Policy Optimization) was proposed in Proximal Policy Optimization Algorithms. PPO follows the idea of TRPO, which restricts each policy update step with a KL-divergence constraint, but replaces the explicit KL-divergence restriction with clipped probability ratios between the new and old policies. This adaptation is simpler to implement and avoids the Hessian computation required by TRPO.

Quick Facts

  1. PPO is a model-free and policy-based RL algorithm.

  2. PPO supports both discrete and continuous action spaces.

  3. PPO supports off-policy mode and on-policy mode.

  4. PPO can be equipped with RNN.

  5. The on-policy implementation of PPO uses a double loop (an epoch loop and a minibatch loop), as sketched below.
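
The double loop from fact 5 can be sketched as follows. This is only an illustrative sketch: epoch_per_collect, batch_size, and update_fn are placeholder names, not exact DI-engine config keys or APIs.

import random

def ppo_on_policy_update(collected_transitions, update_fn, epoch_per_collect=10, batch_size=64):
    """Illustrative double loop: several epochs over the freshly collected
    on-policy data, split into minibatches for each gradient step.
    update_fn is a user-supplied callable that computes the PPO losses on one
    minibatch and takes one optimizer step (a placeholder, not a DI-engine API)."""
    for _ in range(epoch_per_collect):  # outer loop: epochs over the same collected batch
        random.shuffle(collected_transitions)
        for i in range(0, len(collected_transitions), batch_size):  # inner loop: minibatches
            minibatch = collected_transitions[i:i + batch_size]
            update_fn(minibatch)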

Key Equations or Key Graphs

PPO uses clipped probability ratios in the policy objective to prevent the policy from changing too rapidly:

\[L^{C L I P}(\theta)=\hat{\mathbb{E}}_{t}\left[\min \left(r_{t}(\theta) \hat{A}_{t}, \operatorname{clip}\left(r_{t}(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}_{t}\right)\right]\]

with the probability ratio \(r_t(\theta)\) defined as:

\[r_{t}(\theta)=\frac{\pi_{\theta}\left(a_{t} \mid s_{t}\right)}{\pi_{\theta_{\text {old }}}\left(a_{t} \mid s_{t}\right)}\]

When \(\hat{A}_t > 0\), \(r_t(\theta)\) is clipped from above at \(1 + \epsilon\); when \(\hat{A}_t < 0\), it is clipped from below at \(1 - \epsilon\). However, in the paper Mastering Complex Control in MOBA Games with Deep Reinforcement Learning, the authors point out that when \(\hat{A}_t < 0\), an excessively large \(r_t(\theta)\) should also be clipped, which introduces dual clip:

\[\max \left(\min \left(r_{t}(\theta) \hat{A}_{t}, \operatorname{clip}\left(r_{t}(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}_{t}\right), c \hat{A}_{t}\right)\]
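
A minimal PyTorch sketch of the clipped surrogate objective with optional dual clip (an illustration, not the exact DI-engine implementation, which lives in ding.rl_utils.ppo and is shown in the Implementation section):

from typing import Optional

import torch

def clipped_surrogate_loss(logp_new: torch.Tensor, logp_old: torch.Tensor, adv: torch.Tensor,
                           clip_ratio: float = 0.2, dual_clip: Optional[float] = None) -> torch.Tensor:
    # r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), computed in log space for stability
    ratio = torch.exp(logp_new - logp_old)
    surr1 = ratio * adv
    surr2 = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * adv
    objective = torch.min(surr1, surr2)
    if dual_clip is not None:
        # dual clip: when the advantage is negative, additionally bound the
        # objective from below by c * A_t (c = dual_clip > 1)
        objective = torch.where(adv < 0, torch.max(objective, dual_clip * adv), objective)
    # PPO maximizes the surrogate objective, so the loss is its negative mean
    return -objective.mean()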

Pseudo-code

[Image: PPO pseudo-code (../_images/PPO.png)]

Note

This is the on-policy version of PPO.

Extensions

PPO can be combined with:
  • Multi-step learning

  • RNN

  • GAE (Generalized Advantage Estimation), sketched below
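
A minimal sketch of GAE for a single trajectory without early termination (an illustration of the estimator PPO typically uses for \(\hat{A}_t\), not the DI-engine implementation):

import torch

def gae_advantages(rewards: torch.Tensor, values: torch.Tensor, next_value: torch.Tensor,
                   gamma: float = 0.99, lam: float = 0.95) -> torch.Tensor:
    """rewards, values: shape (T,); next_value: shape (1,), the bootstrap value of the final next-state."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    values_ext = torch.cat([values, next_value])  # append the bootstrap value
    last_gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values_ext[t + 1] - values_ext[t]  # TD residual
        last_gae = delta + gamma * lam * last_gae  # discounted, lambda-weighted sum of residuals
        advantages[t] = last_gae
    return advantages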

Note

Indeed, the standard implementation of PPO contains many additional optimizations that are not described in the original paper. Further details can be found in Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO.
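
One example of such an optimization is value clipping, which corresponds to the use_value_clip flag of the ppo_error function shown in the Implementation section. A minimal sketch of the idea (not the exact DI-engine code):

import torch

def clipped_value_loss(value_new: torch.Tensor, value_old: torch.Tensor, return_: torch.Tensor,
                       clip_ratio: float = 0.2) -> torch.Tensor:
    # Clip the new value prediction so it stays within clip_ratio of the old prediction,
    # then take the larger (more pessimistic) of the clipped and unclipped squared errors.
    value_clipped = value_old + (value_new - value_old).clamp(-clip_ratio, clip_ratio)
    loss_unclipped = (value_new - return_).pow(2)
    loss_clipped = (value_clipped - return_).pow(2)
    return 0.5 * torch.max(loss_unclipped, loss_clipped).mean()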

Implementation

The default config is defined as follows:

class ding.policy.ppo.PPOPolicy(cfg: dict, model: Optional[Union[type, torch.nn.modules.module.Module]] = None, enable_field: Optional[List[str]] = None)[source]
Overview:

Policy class of the on-policy version of the PPO algorithm.

class ding.model.template.vac.VAC(obs_shape: Union[int, ding.utils.type_helper.SequenceType], action_shape: Union[int, ding.utils.type_helper.SequenceType], share_encoder: bool = True, continuous: bool = False, encoder_hidden_size_list: ding.utils.type_helper.SequenceType = [128, 128, 64], actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Optional[torch.nn.modules.module.Module] = ReLU(), norm_type: Optional[str] = None, sigma_type: Optional[str] = 'independent', bound_type: Optional[str] = None)[source]
Overview:

The VAC model.

Interfaces:

__init__, forward, compute_actor, compute_critic

__init__(obs_shape: Union[int, ding.utils.type_helper.SequenceType], action_shape: Union[int, ding.utils.type_helper.SequenceType], share_encoder: bool = True, continuous: bool = False, encoder_hidden_size_list: ding.utils.type_helper.SequenceType = [128, 128, 64], actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Optional[torch.nn.modules.module.Module] = ReLU(), norm_type: Optional[str] = None, sigma_type: Optional[str] = 'independent', bound_type: Optional[str] = None) None[source]
Overview:

Init the VAC Model according to arguments.

Arguments:
  • obs_shape (Union[int, SequenceType]): Observation space shape.

  • action_shape (Union[int, SequenceType]): Action space shape.

  • share_encoder (bool): Whether the actor and critic share the encoder.

  • continuous (bool): Whether the action space is continuous.

  • encoder_hidden_size_list (SequenceType): The collection of hidden sizes to pass to the Encoder.

  • actor_head_hidden_size (Optional[int]): The hidden_size to pass to the actor head.

  • actor_head_layer_num (int):

    The number of layers used in the actor head network to compute the action output.

  • critic_head_hidden_size (Optional[int]): The hidden_size to pass to the critic head.

  • critic_head_layer_num (int):

    The number of layers used in the critic head network to compute the value output.

  • activation (Optional[nn.Module]):

    The type of activation function to use in the MLP after each layer_fn; if None, defaults to nn.ReLU().

  • norm_type (Optional[str]):

    The type of normalization to use; see ding.torch_utils.fc_block for more details.
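
An instantiation example in the same doctest style as the methods below (the concrete shapes and the import path are illustrative assumptions):

>>> import torch
>>> from ding.model.template.vac import VAC
>>> # discrete-action model with a shared encoder
>>> model = VAC(obs_shape=64, action_shape=6, share_encoder=True)
>>> obs = torch.randn(4, 64)
>>> outputs = model(obs, 'compute_actor_critic')
>>> assert outputs['logit'].shape == torch.Size([4, 6])
>>> assert outputs['value'].shape == torch.Size([4])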

compute_actor(x: torch.Tensor) Dict[source]
Overview:

Execute the forward computation in 'compute_actor' mode: use the encoded embedding tensor to predict the actor output.

Arguments:
  • inputs (torch.Tensor):

    The encoded embedding tensor, determined by the given hidden_size, i.e. (B, N=hidden_size), where hidden_size = actor_head_hidden_size.

Returns:
  • outputs (Dict):

    Run with encoder and head.

ReturnsKeys:
  • logit (torch.Tensor): Logit encoding tensor, with same size as input x.

Shapes:
  • logit (torch.FloatTensor): \((B, N)\), where B is batch size and N is action_shape

Examples:
>>> model = VAC(64,64)
>>> inputs = torch.randn(4, 64)
>>> actor_outputs = model(inputs,'compute_actor')
>>> assert actor_outputs['logit'].shape == torch.Size([4, 64])
compute_actor_critic(x: torch.Tensor) Dict[source]
Overview:

Execute the forward computation in 'compute_actor_critic' mode: use the encoded embedding tensor to predict both the actor and critic outputs.

Arguments:
  • inputs (torch.Tensor): The encoded embedding tensor.

Returns:
  • outputs (Dict):

    Run with encoder and head.

ReturnsKeys:
  • logit (torch.Tensor): Logit encoding tensor, with same size as input x.

  • value (torch.Tensor): State value tensor whose size equals the batch size.

Shapes:
  • logit (torch.FloatTensor): \((B, N)\), where B is batch size and N is action_shape

  • value (torch.FloatTensor): \((B, )\), where B is batch size.

Examples:
>>> model = VAC(64,64)
>>> inputs = torch.randn(4, 64)
>>> outputs = model(inputs,'compute_actor_critic')
>>> outputs['value']
tensor([0.0252, 0.0235, 0.0201, 0.0072], grad_fn=<SqueezeBackward1>)
>>> assert outputs['logit'].shape == torch.Size([4, 64])

Note

The compute_actor_critic interface aims to save computation when the encoder is shared, returning the combined dictionary from a single forward pass.

compute_critic(x: torch.Tensor) Dict[source]
Overview:

Execute the forward computation in 'compute_critic' mode: use the encoded embedding tensor to predict the value output.

Arguments:
  • inputs (torch.Tensor):

    The encoded embedding tensor, determined by the given hidden_size, i.e. (B, N=hidden_size), where hidden_size = critic_head_hidden_size.

Returns:
  • outputs (Dict):

    Run with encoder and head.

    Necessary Keys:
    • value (torch.Tensor): State value tensor whose size equals the batch size.

Shapes:
  • value (torch.FloatTensor): \((B, )\), where B is batch size.

Examples:
>>> model = VAC(64,64)
>>> inputs = torch.randn(4, 64)
>>> critic_outputs = model(inputs,'compute_critic')
>>> critic_outputs['value']
tensor([0.0252, 0.0235, 0.0201, 0.0072], grad_fn=<SqueezeBackward1>)
forward(inputs: Union[torch.Tensor, Dict], mode: str) Dict[source]
Overview:

Use the encoded embedding tensor to predict the output, following VAC's MLP forward setup; the computation is dispatched according to mode.

Arguments:
Forward with 'compute_actor' or 'compute_critic':
  • inputs (torch.Tensor):

    The encoded embedding tensor, determined by the given hidden_size, i.e. (B, N=hidden_size). Whether hidden_size is actor_head_hidden_size or critic_head_hidden_size depends on mode.

Returns:
  • outputs (Dict):

    Run with encoder and head.

    Forward with 'compute_actor', Necessary Keys:
    • logit (torch.Tensor): Logit encoding tensor, with same size as input x.

    Forward with 'compute_critic', Necessary Keys:
    • value (torch.Tensor): State value tensor whose size equals the batch size.

Shapes:
  • inputs (torch.Tensor): \((B, N)\), where B is batch size and N is the corresponding hidden_size

  • logit (torch.FloatTensor): \((B, N)\), where B is batch size and N is action_shape

  • value (torch.FloatTensor): \((B, )\), where B is batch size.

Actor Examples:
>>> model = VAC(64,128)
>>> inputs = torch.randn(4, 64)
>>> actor_outputs = model(inputs,'compute_actor')
>>> assert actor_outputs['logit'].shape == torch.Size([4, 128])
Critic Examples:
>>> model = VAC(64,64)
>>> inputs = torch.randn(4, 64)
>>> critic_outputs = model(inputs,'compute_critic')
>>> critic_outputs['value']
tensor([0.0252, 0.0235, 0.0201, 0.0072], grad_fn=<SqueezeBackward1>)
Actor-Critic Examples:
>>> model = VAC(64,64)
>>> inputs = torch.randn(4, 64)
>>> outputs = model(inputs,'compute_actor_critic')
>>> outputs['value']
tensor([0.0252, 0.0235, 0.0201, 0.0072], grad_fn=<SqueezeBackward1>)
>>> assert outputs['logit'].shape == torch.Size([4, 64])

The policy gradient and value updates of PPO are implemented as follows:

def ppo_error(
        data: namedtuple,
        clip_ratio: float = 0.2,
        use_value_clip: bool = True,
        dual_clip: Optional[float] = None
) -> Tuple[namedtuple, namedtuple]:

    assert dual_clip is None or dual_clip > 1.0, "dual_clip value must be greater than 1.0, but got value: {}".format(
        dual_clip
    )
    # unpack the training data collected from the learner
    logit_new, logit_old, action, value_new, value_old, adv, return_, weight = data
    # clipped (and optionally dual-clipped) policy surrogate loss, plus entropy
    policy_data = ppo_policy_data(logit_new, logit_old, action, adv, weight)
    policy_output, policy_info = ppo_policy_error(policy_data, clip_ratio, dual_clip)
    # value regression loss, optionally with value clipping
    value_data = ppo_value_data(value_new, value_old, return_, weight)
    value_loss = ppo_value_error(value_data, clip_ratio, use_value_clip)

    return ppo_loss(policy_output.policy_loss, value_loss, policy_output.entropy_loss), policy_info
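
A hedged usage sketch of ppo_error, assuming the ppo_data namedtuple exported alongside it from ding.rl_utils (the field order follows the unpacking in the function above, and the returned loss fields follow its return statement):

import torch
from ding.rl_utils import ppo_data, ppo_error  # assumed import path

B, N = 4, 6  # batch size and discrete action_shape
data = ppo_data(
    torch.randn(B, N),           # logit_new
    torch.randn(B, N),           # logit_old
    torch.randint(0, N, (B, )),  # action
    torch.randn(B),              # value_new
    torch.randn(B),              # value_old
    torch.randn(B),              # adv
    torch.randn(B),              # return_
    None,                        # weight (optional per-sample weight)
)
loss, info = ppo_error(data, clip_ratio=0.2, use_value_clip=True, dual_clip=None)
# combine the three loss terms with illustrative coefficients
total_loss = loss.policy_loss + 0.5 * loss.value_loss - 0.01 * loss.entropy_loss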

The benchmark results of PPO implemented in DI-engine are shown in the Benchmark section.

References

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov: “Proximal Policy Optimization Algorithms”, 2017; arXiv:1707.06347, http://arxiv.org/abs/1707.06347.

Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, Aleksander Madry: “Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO”, 2020; arXiv:2005.12729, http://arxiv.org/abs/2005.12729.