rl_utils.upgo

UPGO

upgo_returns

Overview:

Compute UPGO return targets. Note that there is no special handling for the terminal state. A reference sketch of the recursion is given after the argument list below.

Arguments:
  • rewards (torch.Tensor): the rewards received at time steps 0 to T-1,

    of size [T_traj, batchsize]

  • bootstrap_values (torch.Tensor): the estimated state values at steps 0 to T,

    of size [T_traj+1, batchsize]

Returns:
  • ret (torch.Tensor): the computed UPGO return (a lambda-style return with trajectory-dependent cut-offs) for each state from 0 to T-1,

    of size [T_traj, batchsize]
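
The recursion behind these targets can be stated directly: starting from the last step, the UPGO return keeps propagating along the trajectory as long as the one-step target at the next step, rewards[t+1] + bootstrap_values[t+2], is at least the value estimate bootstrap_values[t+1]; otherwise it cuts the trace and bootstraps with bootstrap_values[t+1]. The sketch below is a minimal, loop-based reference under that reading of UPGO; the name upgo_returns_sketch is hypothetical, and the library's actual implementation may be vectorized differently.

```python
import torch

def upgo_returns_sketch(rewards: torch.Tensor, bootstrap_values: torch.Tensor) -> torch.Tensor:
    # rewards: [T_traj, batchsize]; bootstrap_values: [T_traj + 1, batchsize]
    T = rewards.shape[0]
    rets = [None] * T
    # Last step: no look-ahead is available, so bootstrap with V_T.
    # (No special terminal handling: pass V_T = 0 if the last state is terminal.)
    rets[T - 1] = rewards[T - 1] + bootstrap_values[T]
    for t in reversed(range(T - 1)):
        # Keep the UPGO trace running while the one-step target at t+1 is at
        # least as large as V_{t+1}; otherwise bootstrap with V_{t+1}.
        keep = (rewards[t + 1] + bootstrap_values[t + 2]) >= bootstrap_values[t + 1]
        rets[t] = rewards[t] + torch.where(keep, rets[t + 1], bootstrap_values[t + 1])
    return torch.stack(rets, dim=0)  # [T_traj, batchsize]
```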

upgo_loss

Overview:

Compute the UPGO loss. There is no special handling for the terminal state value; if the last state in the trajectory is terminal, pass 0 as the final entry of bootstrap_values. A hedged sketch of the computation is given after the argument list below.

Arguments:
  • target_output (torch.Tensor): the output computed by the target policy network,

    of size [T_traj, batchsize, n_output]

  • rhos (torch.Tensor): the importance sampling ratio, of size [T_traj, batchsize]

  • action (torch.Tensor): the action taken, of size [T_traj, batchsize]

  • rewards (torch.Tensor): the rewards received at time steps 0 to T-1, of size [T_traj, batchsize]

  • bootstrap_values (torch.Tensor): the estimated state values at steps 0 to T,

    of size [T_traj+1, batchsize]

Returns:
  • loss (torch.Tensor): the computed importance-sampled UPGO loss, averaged over the samples, of size [] (a scalar)
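
Read as a policy-gradient objective, the UPGO returns above serve as targets: the advantage is rhos * (return - bootstrap_values[:-1]), treated as a constant, and it weights the log-probability of the taken action under the target policy. The sketch below illustrates this under the assumptions that target_output holds unnormalized logits and that the hypothetical upgo_returns_sketch from the previous section is available; the exact reduction and any masking used by the library may differ.

```python
import torch
import torch.nn.functional as F

def upgo_loss_sketch(
    target_output: torch.Tensor,     # [T_traj, batchsize, n_output] policy logits (assumed)
    rhos: torch.Tensor,              # [T_traj, batchsize] importance sampling ratios
    action: torch.Tensor,            # [T_traj, batchsize] taken actions, dtype long
    rewards: torch.Tensor,           # [T_traj, batchsize]
    bootstrap_values: torch.Tensor,  # [T_traj + 1, batchsize]
) -> torch.Tensor:
    # Return targets and advantages are constants w.r.t. the policy parameters.
    with torch.no_grad():
        returns = upgo_returns_sketch(rewards, bootstrap_values)
        advantages = rhos * (returns - bootstrap_values[:-1])
    # Log-probability of the taken action under the target policy.
    log_prob = F.log_softmax(target_output, dim=-1)
    log_prob_taken = log_prob.gather(-1, action.unsqueeze(-1)).squeeze(-1)
    # Negative sign: ascend the advantage-weighted log-likelihood,
    # averaged over time and batch -> scalar loss of size [].
    return -(advantages * log_prob_taken).mean()
```

For example, with T_traj = 8, batchsize = 4, and n_output = 6, calling upgo_loss_sketch(torch.randn(8, 4, 6), torch.ones(8, 4), torch.randint(0, 6, (8, 4)), torch.randn(8, 4), torch.randn(9, 4)) returns a scalar tensor; for a trajectory ending in a terminal state, set the last row of bootstrap_values to zero as noted above.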