rl_utils.upgo¶
UPGO¶
upgo_returns¶
- Overview:
Compute UPGO return targets. Note that there is no special handling for the terminal state.
- Arguments:
  - rewards (torch.Tensor): the rewards received from time step 0 to T-1, of size [T_traj, batchsize]
  - bootstrap_values (torch.Tensor): estimation of the state value at steps 0 to T, of size [T_traj+1, batchsize]
- Returns:
  - ret (torch.Tensor): the computed lambda return value for each state from 0 to T-1, of size [T_traj, batchsize]
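For intuition, below is a minimal sketch of the recursion behind these targets, assuming the AlphaStar-style UPGO definition: the trajectory is followed while it keeps "going up" (r_{t+1} + V(s_{t+2}) >= V(s_{t+1})), and otherwise the trace is cut and bootstrapped with the value estimate. The name upgo_returns_reference is ours for illustration; this is not the library source.

```python
import torch

def upgo_returns_reference(rewards: torch.Tensor, bootstrap_values: torch.Tensor) -> torch.Tensor:
    # rewards: [T_traj, batchsize]; bootstrap_values: [T_traj+1, batchsize]
    T = rewards.shape[0]
    ret = torch.empty_like(rewards)
    # Final step: no look-ahead is available, so bootstrap with V(s_T).
    ret[T - 1] = rewards[T - 1] + bootstrap_values[T]
    for t in reversed(range(T - 1)):
        # Follow the trajectory while it keeps going "up", i.e. while
        # r_{t+1} + V(s_{t+2}) >= V(s_{t+1}); otherwise cut the trace
        # and bootstrap with the value estimate V(s_{t+1}).
        going_up = (rewards[t + 1] + bootstrap_values[t + 2]) >= bootstrap_values[t + 1]
        ret[t] = rewards[t] + torch.where(going_up, ret[t + 1], bootstrap_values[t + 1])
    return ret
```

Seen this way, UPGO is a lambda return with a per-step lambda of 1 while the condition holds and 0 where the trace is cut, which is why the return value above is described as a lambda return.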
upgo_loss¶
- Overview:
Compute the UPGO loss given constant gamma and lambda. There is no special handling for the terminal state value; if the last state in the trajectory is terminal, simply pass 0 as the bootstrap terminal value (the last entry of bootstrap_values).
- Arguments:
  - target_output (torch.Tensor): the output computed by the target policy network, of size [T_traj, batchsize, n_output]
  - rhos (torch.Tensor): the importance sampling ratio, of size [T_traj, batchsize]
  - action (torch.Tensor): the action taken at each step, of size [T_traj, batchsize]
  - rewards (torch.Tensor): the rewards received from time step 0 to T-1, of size [T_traj, batchsize]
  - bootstrap_values (torch.Tensor): estimation of the state value at steps 0 to T, of size [T_traj+1, batchsize]
- Returns:
  - loss (torch.Tensor): the computed importance-sampled UPGO loss, averaged over the samples, of size []
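A usage sketch with dummy trajectory data follows, assuming upgo_loss is importable from ding.rl_utils (the import path is an assumption; adjust it to your installation). The random tensors stand in for real rollout data.

```python
import torch
from ding.rl_utils import upgo_loss  # assumed import path

T_traj, batchsize, n_output = 8, 4, 6
target_output = torch.randn(T_traj, batchsize, n_output, requires_grad=True)  # target policy network output
rhos = torch.ones(T_traj, batchsize)                      # importance ratios; 1.0 when on-policy
action = torch.randint(0, n_output, (T_traj, batchsize))  # actions taken along the trajectory
rewards = torch.randn(T_traj, batchsize)                  # rewards for steps 0..T-1
bootstrap_values = torch.randn(T_traj + 1, batchsize)     # value estimates for states 0..T
bootstrap_values[-1] = 0.0  # last state is terminal: pass 0 as the bootstrap terminal value

loss = upgo_loss(target_output, rhos, action, rewards, bootstrap_values)
assert loss.shape == torch.Size([])  # scalar loss, averaged over the samples
loss.backward()                      # gradients flow into target_output
```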