rl_utils.gae

gae

Overview:

Implementation of Generalized Advantage Estimator (arXiv:1506.02438)
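GAE computes the advantage as an exponentially weighted sum of temporal-difference (TD) residuals. The defining recursion from the paper is:

\(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\)

\(\hat{A}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l} = \delta_t + \gamma \lambda \hat{A}_{t+1}\)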

Arguments:
  • data (namedtuple): GAE input data with fields ['value', 'reward'], containing the trajectory (episode) data

  • gamma (float): the discount factor for future rewards; should be in [0, 1]. Defaults to 0.99.

  • lambda (float): the GAE parameter lambda; should be in [0, 1]. Defaults to 0.97. As lambda -> 0, the estimate reduces to the one-step TD residual, which has low variance but is biased when the value function is inexact; as lambda -> 1, it approaches the Monte-Carlo advantage, which is less biased but has high variance due to the long sum of terms.

Returns:
  • adv (torch.FloatTensor): the calculated advantage

Shapes:
  • value (torch.FloatTensor): \((T+1, B)\), where T is the trajectory length and B is the batch size

  • reward (torch.FloatTensor): \((T, B)\)

  • adv (torch.FloatTensor): \((T, B)\)

Note

value_{T+1} should be 0 if the trajectory reached a terminal state (done=True); otherwise the learned value function's estimate is used as the bootstrap value. This operation is implemented in the collector when packing trajectories.
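For reference, below is a minimal PyTorch sketch of this computation under the shapes listed above. The namedtuple layout and the keyword name lambda_ (lambda is a reserved word in Python) are assumptions based on this page, not necessarily the library's actual implementation:

```python
from collections import namedtuple

import torch

gae_data = namedtuple('gae_data', ['value', 'reward'])


def gae(data: gae_data, gamma: float = 0.99, lambda_: float = 0.97) -> torch.FloatTensor:
    """Hypothetical reference implementation of GAE (arXiv:1506.02438)."""
    value, reward = data.value, data.reward  # (T+1, B), (T, B)
    # TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), shape (T, B)
    delta = reward + gamma * value[1:] - value[:-1]
    # Backward recursion: A_t = delta_t + gamma * lambda * A_{t+1}
    adv = torch.zeros_like(reward)
    gae_item = torch.zeros_like(reward[0])
    for t in reversed(range(reward.shape[0])):
        gae_item = delta[t] + gamma * lambda_ * gae_item
        adv[t] = gae_item
    return adv


# Usage: T=4 transitions, batch size B=2; value carries the extra bootstrap entry.
data = gae_data(value=torch.randn(5, 2), reward=torch.randn(4, 2))
adv = gae(data)  # shape (4, 2)
```

The backward loop is the standard way to evaluate the infinite sum in O(T): each step folds the future advantage into the current TD residual with weight gamma * lambda.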