R2D2

R2D2Policy

class ding.policy.r2d2.R2D2Policy(cfg: dict, model: Optional[Union[type, torch.nn.modules.module.Module]] = None, enable_field: Optional[List[str]] = None)[source]
Overview:

Policy class of R2D2, from the paper Recurrent Experience Replay in Distributed Reinforcement Learning. R2D2 improves upon DRQN with several recurrent experience replay tricks, such as burn-in and stored recurrent states.
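To make the burn-in trick concrete, here is a minimal, illustrative sketch (not the DI-engine implementation; the GRU, shapes, and variable names are assumptions): the first burnin_step frames of a sampled sequence only re-warm the recurrent hidden state under torch.no_grad(), and the loss is computed on the remaining frames.

```python
import torch
import torch.nn as nn

# Illustrative burn-in: the first `burnin_step` frames only warm up the
# recurrent state; the training loss is computed on the remaining frames.
burnin_step, unroll_len, batch, obs_dim, hidden_dim = 2, 10, 4, 8, 16
rnn = nn.GRU(obs_dim, hidden_dim)
seq = torch.randn(unroll_len, batch, obs_dim)  # (T, B, obs_dim)

with torch.no_grad():  # burn-in: no gradient, just recover the hidden state
    _, h = rnn(seq[:burnin_step])

out, _ = rnn(seq[burnin_step:], h)  # training segment uses the warmed-up state
# the Q loss (e.g. n-step TD with value rescaling) would be computed on `out`
```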

Config:

ID | Symbol                   | Type  | Default Value        | Description                                                                                   | Other (Shape)
---|--------------------------|-------|----------------------|-----------------------------------------------------------------------------------------------|----------------------------------------------
1  | type                     | str   | r2d2                 | RL policy register name, refer to registry POLICY_REGISTRY                                     | This arg is optional, a placeholder
2  | cuda                     | bool  | False                | Whether to use cuda for the network                                                            | This arg can differ between modes
3  | on_policy                | bool  | False                | Whether the RL algorithm is on-policy or off-policy                                            |
4  | priority                 | bool  | False                | Whether to use prioritized experience replay (PER)                                             | Priority sampling and priority update
5  | priority_IS_weight       | bool  | False                | Whether to use Importance Sampling weights to correct the biased update. If True, priority must be True |
6  | discount_factor          | float | 0.997, [0.95, 0.999] | Reward's future discount factor, aka. gamma                                                    | May be 1 for sparse-reward envs
7  | nstep                    | int   | 3, [3, 5]            | N-step reward discount sum for target q_value estimation                                       |
8  | burnin_step              | int   | 2                    | The number of burn-in timesteps, designed to reduce the RNN hidden-state mismatch caused by off-policy data |
9  | learn.update_per_collect | int   | 1                    | How many updates (iterations) to train after one collection by the collector. Only valid in serial training | This arg varies across envs; a bigger value means more off-policy
10 | learn.batch_size         | int   | 64                   | The number of samples in one training iteration                                                |
11 | learn.learning_rate      | float | 0.001                | Gradient step length of one iteration                                                          |
12 | learn.value_rescale      | bool  | True                 | Whether to use the value_rescale function for predicted values                                 |
13 | learn.target_update_freq | int   | 100                  | Frequency of target network updates                                                            | Hard (assign) update
14 | learn.ignore_done        | bool  | False                | Whether to ignore done flags in the target value calculation                                   | Enable it for envs with fake termination
15 | collect.n_sample         | int   | [8, 128]             | The number of training samples from one call of the collector                                  | Varies across envs
16 | collect.unroll_len       | int   | 1                    | Unroll length of one iteration                                                                 | With an RNN, unroll_len > 1
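As a convenience, the table above can be summarized as a nested config dict. This is an illustrative sketch only; the exact field names and nesting should be verified against the DI-engine source (e.g. R2D2Policy's default config).

```python
# Illustrative R2D2 policy config, mirroring the table above.
# Field names/nesting should be checked against the DI-engine source.
r2d2_policy_cfg = dict(
    cuda=False,
    priority=False,
    priority_IS_weight=False,
    discount_factor=0.997,
    nstep=3,
    burnin_step=2,
    learn=dict(
        update_per_collect=1,
        batch_size=64,
        learning_rate=0.001,
        value_rescale=True,
        target_update_freq=100,
        ignore_done=False,
    ),
    collect=dict(
        n_sample=32,   # typically in [8, 128], env-dependent
        unroll_len=1,  # > 1 when using an RNN
    ),
)
```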
_forward_collect(data: dict, eps: float) → dict[source]
Overview:

Forward function for collect mode with eps-greedy exploration.

Arguments:
  • data (Dict[str, Any]): Dict type data, stacked env data for predicting policy_output (action); values are torch.Tensor, np.ndarray, or dict/list combinations, and keys are integer env_ids.

  • eps (float): epsilon value for exploration, decayed according to the number of collected env steps.

Returns:
  • output (Dict[int, Any]): Dict type data, including at least inferred action according to input obs.

ReturnsKeys
  • necessary: action
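A hedged usage sketch of collect-mode inference; the policy.collect_mode.forward entry point and the observation shape are assumptions for illustration:

```python
import torch

# Hypothetical stacked env data keyed by env_id, as described above.
data = {0: torch.randn(4), 1: torch.randn(4)}  # obs per env; shape is illustrative
eps = 0.05  # exploration rate, decayed elsewhere by collected env steps

# output = policy.collect_mode.forward(data, eps=eps)   # assumed entry point
# for env_id, out in output.items():
#     action = out['action']               # guaranteed key per ReturnsKeys
#     prev_state = out.get('prev_state')   # RNN hidden state carried to the next step
```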

_forward_eval(data: dict) → dict[source]
Overview:

Forward function of eval mode, similar to self._forward_collect.

Arguments:
  • data (Dict[str, Any]): Dict type data, stacked env data for predicting policy_output (action); values are torch.Tensor, np.ndarray, or dict/list combinations, and keys are integer env_ids.

Returns:
  • output (Dict[int, Any]): The dict of predicting action for the interaction with env.

ReturnsKeys
  • necessary: action
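Usage mirrors collect mode minus the exploration argument; a minimal sketch, assuming the eval_mode entry point:

```python
import torch

data = {0: torch.randn(4)}  # obs keyed by env_id; shape is illustrative
# output = policy.eval_mode.forward(data)   # greedy (argmax) action selection, no eps
# action = output[0]['action']
```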

_forward_learn(data: dict) → Dict[str, Any][source]
Overview:

Forward and backward function of learn mode. Acquire the data, calculate the loss and optimize learner model.

Arguments:
  • data (dict): Dict type data, including at least ['main_obs', 'target_obs', 'burnin_obs', 'action', 'reward', 'done', 'weight'].

Returns:
  • info_dict (Dict[str, Any]): Including cur_lr and total_loss
    • cur_lr (float): Current learning rate

    • total_loss (float): The calculated loss
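For illustration, a hedged sketch of the kind of batch this method consumes; the key names follow the list above, while the shapes, action space, and reward layout are assumptions:

```python
import torch

T, B, obs_dim = 8, 64, 4   # unroll length, batch size, obs dim (all illustrative)
burnin_step = 2

batch = {
    'burnin_obs': torch.randn(burnin_step, B, obs_dim),  # only warms up the RNN state
    'main_obs': torch.randn(T, B, obs_dim),              # obs for the current Q estimate
    'target_obs': torch.randn(T, B, obs_dim),            # n-step-shifted obs for the target Q
    'action': torch.randint(0, 6, (T, B)),               # discrete actions (6 is arbitrary)
    'reward': torch.randn(T, B),                         # per-step reward; exact n-step layout may differ
    'done': torch.zeros(T, B),
    'weight': torch.ones(T, B),                          # IS weights when priority_IS_weight=True
}
# info = policy._forward_learn(batch)  # returns at least {'cur_lr': ..., 'total_loss': ...}
```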

_init_collect() → None[source]
Overview:

Collect mode init method. Called by self.__init__. Initialize the trajectory/unroll length and the collect model.

_init_eval() → None[source]
Overview:

Evaluate mode init method. Called by self.__init__. Init eval model with argmax strategy.

_init_learn() → None[source]
Overview:

Init the learner model of R2D2Policy

Arguments:

Note

The _init_learn method takes its arguments from self._cfg.learn in the config file

  • learning_rate (float): The learning rate of the optimizer

  • gamma (float): The discount factor

  • nstep (int): The number of steps for the n-step return

  • value_rescale (bool): Whether to use value rescaled loss in algorithm

  • burnin_step (int): The number of burn-in steps
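Since value_rescale is one of these options, here is the standard invertible value transform from the R2D2 paper, written as a self-contained sketch (the actual DI-engine helper may differ in name and location):

```python
import torch

def value_rescale(x: torch.Tensor, eps: float = 1e-2) -> torch.Tensor:
    # h(x) = sign(x) * (sqrt(|x| + 1) - 1) + eps * x
    return torch.sign(x) * (torch.sqrt(torch.abs(x) + 1.) - 1.) + eps * x

def value_inv_rescale(x: torch.Tensor, eps: float = 1e-2) -> torch.Tensor:
    # Inverse of h, applied to the target network output before the Bellman backup.
    return torch.sign(x) * (
        ((torch.sqrt(1. + 4. * eps * (torch.abs(x) + 1. + eps)) - 1.) / (2. * eps)) ** 2 - 1.
    )
```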

_process_transition(obs: Any, model_output: dict, timestep: collections.namedtuple) → dict[source]
Overview:

Generate dict type transition data from inputs.

Arguments:
  • obs (Any): Env observation

  • model_output (dict): Output of collect model, including at least [‘action’, ‘prev_state’]

  • timestep (namedtuple): Output after env step, including at least ['reward', 'done'] (here 'obs' indicates the obs after the env step).

Returns:
  • transition (dict): Dict type transition data.
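A hedged sketch of what such a transition might look like, using only the keys named above (the function name here is hypothetical, not the DI-engine implementation):

```python
import collections

# Hypothetical env step output, matching the arguments described above.
Timestep = collections.namedtuple('Timestep', ['obs', 'reward', 'done', 'info'])

def process_transition_sketch(obs, model_output: dict, timestep) -> dict:
    # Pair the pre-step obs with the chosen action, the RNN state used to
    # choose it, and the reward/done flag returned by the env step.
    return {
        'obs': obs,
        'action': model_output['action'],
        'prev_state': model_output['prev_state'],
        'reward': timestep.reward,
        'done': timestep.done,
    }
```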