rl_utils.ppo

ppo

ppo_error

Overview:

Implementation of Proximal Policy Optimization (arXiv:1707.06347) with value_clip and dual_clip
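
The loss follows the clipped surrogate objective of the PPO paper; writing \(\epsilon\) for clip_ratio and \(r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{old}}(a_t \mid s_t)\), the policy term is

\[
L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\hat{A}_t\big)\right].
\]

When dual_clip is set to a value \(c\), the inner term is additionally lower-bounded by \(c\hat{A}_t\) for samples with \(\hat{A}_t < 0\) (arXiv:1912.09729 Eq. 5).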

Arguments:
  • data (namedtuple): the ppo input data with fields shown in ppo_data

  • clip_ratio (float): the ppo clip ratio constraining the policy update, defaults to 0.2

  • use_value_clip (bool): whether to clip the value loss with the same ratio as the policy

  • dual_clip (float): the parameter c from arXiv:1912.09729 Eq. 5, should be in [1, inf), defaults to 5.0; if you don’t want to use it, set this parameter to None (see the sketch after this list)
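
A minimal sketch of how the clipped policy term with dual clip could be computed (not the library's exact code); ratio is assumed to be the new/old probability ratio of the taken actions, and adv the advantage:

```python
import torch


def clipped_policy_loss(ratio, adv, clip_ratio=0.2, dual_clip=None):
    # standard PPO clipped surrogate objective (negated to form a loss)
    surr1 = ratio * adv
    surr2 = ratio.clamp(1 - clip_ratio, 1 + clip_ratio) * adv
    clip1 = torch.min(surr1, surr2)
    if dual_clip is not None:
        # dual clip (arXiv:1912.09729 Eq. 5): lower-bound the objective by c * adv,
        # but only where the advantage is negative
        clip2 = torch.max(clip1, dual_clip * adv)
        clip1 = torch.where(adv < 0, clip2, clip1)
    return -clip1.mean()
```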

Returns:
  • ppo_loss (namedtuple): the ppo loss items, each of which is a differentiable 0-dim tensor

  • ppo_info (namedtuple): the ppo optimization information for monitoring, each item of which is a Python scalar

Shapes:
  • logit_new (torch.FloatTensor): \((B, N)\), where B is batch size and N is action dim

  • logit_old (torch.FloatTensor): \((B, N)\)

  • action (torch.LongTensor): \((B, )\)

  • value_new (torch.FloatTensor): \((B, )\)

  • value_old (torch.FloatTensor): \((B, )\)

  • adv (torch.FloatTensor): \((B, )\)

  • return (torch.FloatTensor): \((B, )\)

  • weight (torch.FloatTensor or None): \((B, )\)

  • policy_loss (torch.FloatTensor): \(()\), 0-dim tensor

  • value_loss (torch.FloatTensor): \(()\)
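
A minimal usage sketch, assuming the import path ding.rl_utils and that ppo_data takes its fields positionally in the order of the Shapes list above; exact names and paths may differ in your version:

```python
import torch
from ding.rl_utils import ppo_data, ppo_error  # assumed import path

B, N = 4, 3  # batch size and action dim
# fields passed in the order of the Shapes list:
# logit_new, logit_old, action, value_new, value_old, adv, return, weight
data = ppo_data(
    torch.randn(B, N, requires_grad=True),  # logit_new
    torch.randn(B, N),                      # logit_old
    torch.randint(0, N, (B,)),              # action
    torch.randn(B, requires_grad=True),     # value_new
    torch.randn(B),                         # value_old
    torch.randn(B),                         # adv (already normalized, see Note below)
    torch.randn(B),                         # return
    None,                                   # weight
)
ppo_loss, ppo_info = ppo_error(data, clip_ratio=0.2, use_value_clip=True, dual_clip=5.0)
total_loss = ppo_loss.policy_loss + 0.5 * ppo_loss.value_loss  # combine as needed
total_loss.backward()
```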

Note:

adv is an already-normalized value, i.e. (adv - adv.mean()) / (adv.std() + 1e-8). There are many ways to compute this mean and std, e.g. over the whole data buffer or over each train batch, so this step is not coupled into ppo_error; you can refer to our examples for the different options.
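
For example, per-train-batch normalization (one of the options mentioned above) can be applied before the advantages are packed into ppo_data:

```python
import torch


def normalize_adv(adv: torch.Tensor) -> torch.Tensor:
    # per-batch normalization; the mean/std could instead be computed over a whole data buffer
    return (adv - adv.mean()) / (adv.std() + 1e-8)
```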