rl_utils.gae

gae
- Overview:
Implementation of Generalized Advantage Estimator (arXiv:1506.02438)
- Arguments:
  - data (namedtuple): GAE input data with fields ['value', 'reward'], containing one or more episodes or trajectories.
  - gamma (float): the future discount factor, should be in [0, 1], defaults to 0.99.
  - lambda (float): the GAE parameter lambda, should be in [0, 1], defaults to 0.97. As lambda -> 0 the estimator becomes more biased, while as lambda -> 1 it has higher variance due to the long sum of terms.
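The bias/variance trade-off controlled by lambda follows from the GAE definition (arXiv:1506.02438), which sums exponentially discounted TD residuals:

```latex
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad
\hat{A}^{\mathrm{GAE}(\gamma,\lambda)}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\,\delta_{t+l}
```

With lambda = 0 the estimator reduces to the one-step TD residual \(\delta_t\) (biased by errors in \(V\)); with lambda = 1 it becomes the full Monte Carlo return minus the baseline \(V(s_t)\) (high variance).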
- Returns:
  - adv (torch.FloatTensor): the calculated advantage.
- Shapes:
  - value (torch.FloatTensor): \((T+1, B)\), where T is trajectory length and B is batch size.
  - reward (torch.FloatTensor): \((T, B)\)
  - adv (torch.FloatTensor): \((T, B)\)
Note

value_{T+1} should be 0 if the trajectory reached a terminal state (done=True); otherwise the bootstrapped value-function estimate is used. This operation is implemented in the collector when packing trajectories.
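The computation described above can be sketched as follows. This is a minimal reference implementation under the documented shapes, not the library's actual code; it uses the standard backward recursion A_t = delta_t + gamma * lambda * A_{t+1}, which is equivalent to the infinite discounted sum of TD residuals truncated at the end of the trajectory.

```python
from collections import namedtuple

import torch

# Input container matching the documented fields ['value', 'reward'].
gae_data = namedtuple('gae_data', ['value', 'reward'])


def gae(data: 'gae_data', gamma: float = 0.99, lambda_: float = 0.97) -> torch.FloatTensor:
    """Generalized Advantage Estimation sketch.

    value:  (T+1, B) - the last entry is the bootstrap value, 0 for terminal states.
    reward: (T, B)
    returns adv: (T, B)
    """
    value, reward = data.value, data.reward
    # One-step TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), shape (T, B).
    delta = reward + gamma * value[1:] - value[:-1]
    factor = gamma * lambda_
    adv = torch.zeros_like(reward)
    running = torch.zeros_like(reward[0])
    # Backward accumulation: A_t = delta_t + gamma * lambda * A_{t+1}.
    for t in reversed(range(reward.shape[0])):
        running = delta[t] + factor * running
        adv[t] = running
    return adv
```

For example, with zero values, unit rewards, and gamma = lambda = 1, each column of the advantage is simply the reversed cumulative reward sum.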