rl_utils.ppo

ppo

ppo_error

Overview:

Implementation of Proximal Policy Optimization (arXiv:1707.06347) with value_clip and dual_clip
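
The loss follows the clipped surrogate objective of the PPO paper; writing \(\epsilon\) for clip_ratio and \(r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{old}}(a_t \mid s_t)\), the policy term is

\[
L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\hat{A}_t\big)\right].
\]

When dual_clip is set to a value \(c\), the inner term is additionally lower-bounded by \(c\hat{A}_t\) for samples with \(\hat{A}_t < 0\) (arXiv:1912.09729 Eq. 5).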

Arguments:
  • data (namedtuple): the ppo input data with fields shown in ppo_data

  • clip_ratio (float): the ppo clip ratio constraining the policy update, defaults to 0.2

  • use_value_clip (bool): whether to clip the value loss with the same ratio as the policy

  • dual_clip (float): the parameter c from arXiv:1912.09729 Eq. 5, should be in [1, inf), defaults to 5.0; if you don’t want to use it, set this parameter to None (see the sketch after this list)
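
A minimal sketch of how the clipped policy term with dual clip could be computed (not the library's exact code); ratio is assumed to be the new/old probability ratio of the taken actions, and adv the advantage:

```python
import torch


def clipped_policy_loss(ratio, adv, clip_ratio=0.2, dual_clip=None):
    # standard PPO clipped surrogate objective (negated to form a loss)
    surr1 = ratio * adv
    surr2 = ratio.clamp(1 - clip_ratio, 1 + clip_ratio) * adv
    clip1 = torch.min(surr1, surr2)
    if dual_clip is not None:
        # dual clip (arXiv:1912.09729 Eq. 5): lower-bound the objective by c * adv,
        # but only where the advantage is negative
        clip2 = torch.max(clip1, dual_clip * adv)
        clip1 = torch.where(adv < 0, clip2, clip1)
    return -clip1.mean()
```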

Returns:
  • ppo_loss (namedtuple): the ppo loss items, each of which is a differentiable 0-dim tensor

  • ppo_info (namedtuple): the ppo optimization information for monitoring, each item of which is a Python scalar

Shapes:
  • logit_new (torch.FloatTensor): \((B, N)\), where B is batch size and N is action dim

  • logit_old (torch.FloatTensor): \((B, N)\)

  • action (torch.LongTensor): \((B, )\)

  • value_new (torch.FloatTensor): \((B, )\)

  • value_old (torch.FloatTensor): \((B, )\)

  • adv (torch.FloatTensor): \((B, )\)

  • return (torch.FloatTensor): \((B, )\)

  • weight (torch.FloatTensor or None): \((B, )\)

  • policy_loss (torch.FloatTensor): \(()\), 0-dim tensor

  • value_loss (torch.FloatTensor): \(()\)
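
A minimal usage sketch, assuming the import path ding.rl_utils and that ppo_data takes its fields positionally in the order of the Shapes list above; exact names and paths may differ in your version:

```python
import torch
from ding.rl_utils import ppo_data, ppo_error  # assumed import path

B, N = 4, 3  # batch size and action dim
# fields passed in the order of the Shapes list:
# logit_new, logit_old, action, value_new, value_old, adv, return, weight
data = ppo_data(
    torch.randn(B, N, requires_grad=True),  # logit_new
    torch.randn(B, N),                      # logit_old
    torch.randint(0, N, (B,)),              # action
    torch.randn(B, requires_grad=True),     # value_new
    torch.randn(B),                         # value_old
    torch.randn(B),                         # adv (already normalized, see Note below)
    torch.randn(B),                         # return
    None,                                   # weight
)
ppo_loss, ppo_info = ppo_error(data, clip_ratio=0.2, use_value_clip=True, dual_clip=5.0)
total_loss = ppo_loss.policy_loss + 0.5 * ppo_loss.value_loss  # combine as needed
total_loss.backward()
```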

Note:

adv is an already-normalized value, i.e. (adv - adv.mean()) / (adv.std() + 1e-8). There are many ways to compute this mean and std, e.g. over the whole data buffer or over each train batch, so this step is not coupled into ppo_error; you can refer to our examples for the different options.
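
For example, per-train-batch normalization (one of the options mentioned above) can be applied before the advantages are packed into ppo_data:

```python
import torch


def normalize_adv(adv: torch.Tensor) -> torch.Tensor:
    # per-batch normalization; the mean/std could instead be computed over a whole data buffer
    return (adv - adv.mean()) / (adv.std() + 1e-8)
```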