rl_utils.ppo
ppo
ppo_error
- Overview:
  - Implementation of Proximal Policy Optimization (arXiv:1707.06347) with value_clip and dual_clip
- Arguments:
  - data (namedtuple): the PPO input data, with fields shown in ppo_data
  - clip_ratio (float): the PPO clip ratio for the constraint of the policy update, defaults to 0.2
  - use_value_clip (bool): whether to clip the value loss with the same ratio as the policy
  - dual_clip (float): the parameter c mentioned in arXiv:1912.09729 Eq. 5 (sketched below), should be in [1, inf); defaults to 5.0; set it to None if you don't want to use it
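For orientation, a sketch of the per-sample objective these arguments control: the clipped surrogate of arXiv:1707.06347 combined with the dual-clip term of arXiv:1912.09729 Eq. 5 (the exact reduction over the batch and the sign convention of the returned loss are implementation details):

\[
r = \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}, \qquad
L^{\text{clip}} = \min\!\big(r A,\ \operatorname{clip}(r,\, 1 - \epsilon,\, 1 + \epsilon)\, A\big)
\]

\[
L^{\text{dual}} =
\begin{cases}
L^{\text{clip}}, & A \ge 0 \\
\max\!\big(L^{\text{clip}},\ c A\big), & A < 0
\end{cases}
\]

Here \(\epsilon\) is clip_ratio and \(c\) is dual_clip; the policy loss is typically the negative of this objective averaged over the batch (optionally weighted by weight). When use_value_clip is set, an analogous clip around value_old is typically applied to the value regression term.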
- Returns:
  - ppo_loss (namedtuple): the PPO loss items, all of them differentiable 0-dim tensors
  - ppo_info (namedtuple): the PPO optim information for monitoring, all of them Python scalars
- Shapes:
  - logit_new (torch.FloatTensor): \((B, N)\), where B is batch size and N is action dim
  - logit_old (torch.FloatTensor): \((B, N)\)
  - action (torch.LongTensor): \((B, )\)
  - value_new (torch.FloatTensor): \((B, )\)
  - value_old (torch.FloatTensor): \((B, )\)
  - adv (torch.FloatTensor): \((B, )\)
  - return (torch.FloatTensor): \((B, )\)
  - weight (torch.FloatTensor or None): \((B, )\)
  - policy_loss (torch.FloatTensor): \(()\), 0-dim tensor
  - value_loss (torch.FloatTensor): \(()\)
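Putting the shapes together, a minimal usage sketch. The import path ding.rl_utils and the exact namedtuple definitions are assumptions here (the namedtuple is filled positionally to sidestep the spelling of the return field); check them against your installed version:

```python
import torch
from ding.rl_utils import ppo_data, ppo_error  # import path assumed, verify locally

B, N = 32, 6  # batch size and action dim, matching the Shapes section
logit_new = torch.randn(B, N, requires_grad=True)
logit_old = logit_new.detach() + 0.01 * torch.randn(B, N)
action = torch.randint(0, N, (B, ))
value_new = torch.randn(B, requires_grad=True)
value_old = value_new.detach() + 0.01 * torch.randn(B)
return_ = torch.randn(B)
adv = return_ - value_old
adv = (adv - adv.mean()) / (adv.std() + 1e-8)  # pre-normalized, see the Note below
weight = None  # or a (B, ) tensor of per-sample weights

data = ppo_data(logit_new, logit_old, action, value_new, value_old, adv, return_, weight)
loss, info = ppo_error(data, clip_ratio=0.2, use_value_clip=True, dual_clip=5.0)

# loss fields are differentiable 0-dim tensors, info fields are plain Python scalars
total_loss = loss.policy_loss + 0.5 * loss.value_loss  # combine terms as your training loop requires
total_loss.backward()
```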
Note
adv is an already-normalized value, (adv - adv.mean()) / (adv.std() + 1e-8). There are many ways to compute this mean and std, e.g. over the whole data buffer or over the current train batch, so we don't couple this part into ppo_error; you can refer to our examples for the different ways (a small sketch is also given below).
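Since the statistics are left to the caller, a self-contained sketch of the two placements mentioned above (all tensors here are synthetic, for illustration only):

```python
import torch

def normalize(adv: torch.Tensor) -> torch.Tensor:
    # standardize to zero mean / unit std; the epsilon keeps near-constant advantages stable
    return (adv - adv.mean()) / (adv.std() + 1e-8)

buffer_adv = torch.randn(1024)                           # advantages of all collected data
batch_adv = buffer_adv[torch.randint(0, 1024, (64, ))]   # one sampled train batch

# Variant 1: per train batch, statistics come from the sampled minibatch only
adv_batch_norm = normalize(batch_adv)

# Variant 2: buffer-level statistics, shared by every minibatch sampled from it
mean, std = buffer_adv.mean(), buffer_adv.std()
adv_buffer_norm = (batch_adv - mean) / (std + 1e-8)
```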