rl_utils.gae

gae

Overview:

Implementation of Generalized Advantage Estimator (arXiv:1506.02438)
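GAE computes the advantage as an exponentially weighted sum of temporal-difference (TD) residuals. The defining recursion from the paper is:

\(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\)

\(\hat{A}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l} = \delta_t + \gamma \lambda \hat{A}_{t+1}\)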

Arguments:
  • data (namedtuple): GAE input data with fields ['value', 'reward'], containing the trajectory (episode) data

  • gamma (float): the discount factor for future rewards; should be in [0, 1]. Defaults to 0.99.

  • lambda (float): the GAE parameter lambda; should be in [0, 1]. Defaults to 0.97. As lambda -> 0, the estimate reduces to the one-step TD residual, which has low variance but is biased when the value function is inexact; as lambda -> 1, it approaches the Monte-Carlo advantage, which is less biased but has high variance due to the long sum of terms.

Returns:
  • adv (torch.FloatTensor): the calculated advantage

Shapes:
  • value (torch.FloatTensor): \((T+1, B)\), where T is the trajectory length and B is the batch size

  • reward (torch.FloatTensor): \((T, B)\)

  • adv (torch.FloatTensor): \((T, B)\)

Note

value_{T+1} should be 0 if the trajectory reached a terminal state (done=True); otherwise the learned value function's estimate is used as the bootstrap value. This operation is implemented in the collector when packing trajectories.
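For reference, below is a minimal PyTorch sketch of this computation under the shapes listed above. The namedtuple layout and the keyword name lambda_ (lambda is a reserved word in Python) are assumptions based on this page, not necessarily the library's actual implementation:

```python
from collections import namedtuple

import torch

gae_data = namedtuple('gae_data', ['value', 'reward'])


def gae(data: gae_data, gamma: float = 0.99, lambda_: float = 0.97) -> torch.FloatTensor:
    """Hypothetical reference implementation of GAE (arXiv:1506.02438)."""
    value, reward = data.value, data.reward  # (T+1, B), (T, B)
    # TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), shape (T, B)
    delta = reward + gamma * value[1:] - value[:-1]
    # Backward recursion: A_t = delta_t + gamma * lambda * A_{t+1}
    adv = torch.zeros_like(reward)
    gae_item = torch.zeros_like(reward[0])
    for t in reversed(range(reward.shape[0])):
        gae_item = delta[t] + gamma * lambda_ * gae_item
        adv[t] = gae_item
    return adv


# Usage: T=4 transitions, batch size B=2; value carries the extra bootstrap entry.
data = gae_data(value=torch.randn(5, 2), reward=torch.randn(4, 2))
adv = gae(data)  # shape (4, 2)
```

The backward loop is the standard way to evaluate the infinite sum in O(T): each step folds the future advantage into the current TD residual with weight gamma * lambda.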