rl_utils.upgo¶
UPGO¶
upgo_returns¶
- Overview:
Compute UPGO return targets. Note that there is no special handling for the terminal state.
- Arguments:
  - rewards (torch.Tensor): the rewards received from time step 0 to T-1, of size [T_traj, batchsize]
  - bootstrap_values (torch.Tensor): estimation of the state value at steps 0 to T, of size [T_traj+1, batchsize]
- Returns:
  - ret (torch.Tensor): the computed lambda return value for each state from 0 to T-1, of size [T_traj, batchsize]
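For intuition, below is a minimal sketch of the recursion behind these targets, assuming the AlphaStar-style UPGO definition: the trajectory is followed while it keeps "going up" (r_{t+1} + V(s_{t+2}) >= V(s_{t+1})), and otherwise the trace is cut and bootstrapped with the value estimate. The name upgo_returns_reference is ours for illustration; this is not the library source.

```python
import torch

def upgo_returns_reference(rewards: torch.Tensor, bootstrap_values: torch.Tensor) -> torch.Tensor:
    # rewards: [T_traj, batchsize]; bootstrap_values: [T_traj+1, batchsize]
    T = rewards.shape[0]
    ret = torch.empty_like(rewards)
    # Final step: no look-ahead is available, so bootstrap with V(s_T).
    ret[T - 1] = rewards[T - 1] + bootstrap_values[T]
    for t in reversed(range(T - 1)):
        # Follow the trajectory while it keeps going "up", i.e. while
        # r_{t+1} + V(s_{t+2}) >= V(s_{t+1}); otherwise cut the trace
        # and bootstrap with the value estimate V(s_{t+1}).
        going_up = (rewards[t + 1] + bootstrap_values[t + 2]) >= bootstrap_values[t + 1]
        ret[t] = rewards[t] + torch.where(going_up, ret[t + 1], bootstrap_values[t + 1])
    return ret
```

Seen this way, UPGO is a lambda return with a per-step lambda of 1 while the condition holds and 0 where the trace is cut, which is why the return value above is described as a lambda return.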
upgo_loss¶
- Overview:
Compute the UPGO loss given constant gamma and lambda. There is no special handling for the terminal state value; if the last state in the trajectory is terminal, simply pass 0 as the bootstrap terminal value (the last entry of bootstrap_values).
- Arguments:
  - target_output (torch.Tensor): the output computed by the target policy network, of size [T_traj, batchsize, n_output]
  - rhos (torch.Tensor): the importance sampling ratio, of size [T_traj, batchsize]
  - action (torch.Tensor): the action taken at each step, of size [T_traj, batchsize]
  - rewards (torch.Tensor): the rewards received from time step 0 to T-1, of size [T_traj, batchsize]
  - bootstrap_values (torch.Tensor): estimation of the state value at steps 0 to T, of size [T_traj+1, batchsize]
- Returns:
  - loss (torch.Tensor): the computed importance-sampled UPGO loss, averaged over the samples, of size []
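A usage sketch with dummy trajectory data follows, assuming upgo_loss is importable from ding.rl_utils (the import path is an assumption; adjust it to your installation). The random tensors stand in for real rollout data.

```python
import torch
from ding.rl_utils import upgo_loss  # assumed import path

T_traj, batchsize, n_output = 8, 4, 6
target_output = torch.randn(T_traj, batchsize, n_output, requires_grad=True)  # target policy network output
rhos = torch.ones(T_traj, batchsize)                      # importance ratios; 1.0 when on-policy
action = torch.randint(0, n_output, (T_traj, batchsize))  # actions taken along the trajectory
rewards = torch.randn(T_traj, batchsize)                  # rewards for steps 0..T-1
bootstrap_values = torch.randn(T_traj + 1, batchsize)     # value estimates for states 0..T
bootstrap_values[-1] = 0.0  # last state is terminal: pass 0 as the bootstrap terminal value

loss = upgo_loss(target_output, rhos, action, rewards, bootstrap_values)
assert loss.shape == torch.Size([])  # scalar loss, averaged over the samples
loss.backward()                      # gradients flow into target_output
```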