R2D2

R2D2Policy

class ding.policy.r2d2.R2D2Policy(cfg: dict, model: Optional[Union[type, torch.nn.modules.module.Module]] = None, enable_field: Optional[List[str]] = None)[source]
Overview:

Policy class of R2D2, from the paper Recurrent Experience Replay in Distributed Reinforcement Learning. R2D2 improves upon DRQN with several recurrent experience replay tricks, such as burn-in and stored recurrent states.
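To make the burn-in trick concrete, here is a minimal, illustrative sketch (not the DI-engine implementation; the GRU, shapes, and variable names are assumptions): the first burnin_step frames of a sampled sequence only re-warm the recurrent hidden state under torch.no_grad(), and the loss is computed on the remaining frames.

```python
import torch
import torch.nn as nn

# Illustrative burn-in: the first `burnin_step` frames only warm up the
# recurrent state; the training loss is computed on the remaining frames.
burnin_step, unroll_len, batch, obs_dim, hidden_dim = 2, 10, 4, 8, 16
rnn = nn.GRU(obs_dim, hidden_dim)
seq = torch.randn(unroll_len, batch, obs_dim)  # (T, B, obs_dim)

with torch.no_grad():  # burn-in: no gradient, just recover the hidden state
    _, h = rnn(seq[:burnin_step])

out, _ = rnn(seq[burnin_step:], h)  # training segment uses the warmed-up state
# the Q loss (e.g. n-step TD with value rescaling) would be computed on `out`
```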

Config:

ID | Symbol                   | Type  | Default Value        | Description                                                                                   | Other (Shape)
---|--------------------------|-------|----------------------|-----------------------------------------------------------------------------------------------|----------------------------------------------
1  | type                     | str   | r2d2                 | RL policy register name, refer to registry POLICY_REGISTRY                                     | This arg is optional, a placeholder
2  | cuda                     | bool  | False                | Whether to use cuda for the network                                                            | This arg can differ between modes
3  | on_policy                | bool  | False                | Whether the RL algorithm is on-policy or off-policy                                            |
4  | priority                 | bool  | False                | Whether to use prioritized experience replay (PER)                                             | Priority sampling and priority update
5  | priority_IS_weight       | bool  | False                | Whether to use Importance Sampling weights to correct the biased update. If True, priority must be True |
6  | discount_factor          | float | 0.997, [0.95, 0.999] | Reward's future discount factor, aka. gamma                                                    | May be 1 for sparse-reward envs
7  | nstep                    | int   | 3, [3, 5]            | N-step reward discount sum for target q_value estimation                                       |
8  | burnin_step              | int   | 2                    | The number of burn-in timesteps, designed to reduce the RNN hidden-state mismatch caused by off-policy data |
9  | learn.update_per_collect | int   | 1                    | How many updates (iterations) to train after one collection by the collector. Only valid in serial training | This arg varies across envs; a bigger value means more off-policy
10 | learn.batch_size         | int   | 64                   | The number of samples in one training iteration                                                |
11 | learn.learning_rate      | float | 0.001                | Gradient step length of one iteration                                                          |
12 | learn.value_rescale      | bool  | True                 | Whether to use the value_rescale function for predicted values                                 |
13 | learn.target_update_freq | int   | 100                  | Frequency of target network updates                                                            | Hard (assign) update
14 | learn.ignore_done        | bool  | False                | Whether to ignore done flags in the target value calculation                                   | Enable it for envs with fake termination
15 | collect.n_sample         | int   | [8, 128]             | The number of training samples from one call of the collector                                  | Varies across envs
16 | collect.unroll_len       | int   | 1                    | Unroll length of one iteration                                                                 | With an RNN, unroll_len > 1
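As a convenience, the table above can be summarized as a nested config dict. This is an illustrative sketch only; the exact field names and nesting should be verified against the DI-engine source (e.g. R2D2Policy's default config).

```python
# Illustrative R2D2 policy config, mirroring the table above.
# Field names/nesting should be checked against the DI-engine source.
r2d2_policy_cfg = dict(
    cuda=False,
    priority=False,
    priority_IS_weight=False,
    discount_factor=0.997,
    nstep=3,
    burnin_step=2,
    learn=dict(
        update_per_collect=1,
        batch_size=64,
        learning_rate=0.001,
        value_rescale=True,
        target_update_freq=100,
        ignore_done=False,
    ),
    collect=dict(
        n_sample=32,   # typically in [8, 128], env-dependent
        unroll_len=1,  # > 1 when using an RNN
    ),
)
```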
_forward_collect(data: dict, eps: float) → dict[source]
Overview:

Forward function for collect mode with eps-greedy exploration.

Arguments:
  • data (Dict[str, Any]): Dict type data, stacked env data for predicting policy_output (action); values are torch.Tensor, np.ndarray, or dict/list combinations, and keys are integer env_ids.

  • eps (float): epsilon value for exploration, decayed according to the number of collected env steps.

Returns:
  • output (Dict[int, Any]): Dict type data, including at least inferred action according to input obs.

ReturnsKeys
  • necessary: action
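A hedged usage sketch of collect-mode inference; the policy.collect_mode.forward entry point and the observation shape are assumptions for illustration:

```python
import torch

# Hypothetical stacked env data keyed by env_id, as described above.
data = {0: torch.randn(4), 1: torch.randn(4)}  # obs per env; shape is illustrative
eps = 0.05  # exploration rate, decayed elsewhere by collected env steps

# output = policy.collect_mode.forward(data, eps=eps)   # assumed entry point
# for env_id, out in output.items():
#     action = out['action']               # guaranteed key per ReturnsKeys
#     prev_state = out.get('prev_state')   # RNN hidden state carried to the next step
```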

_forward_eval(data: dict) → dict[source]
Overview:

Forward function of eval mode, similar to self._forward_collect.

Arguments:
  • data (Dict[str, Any]): Dict type data, stacked env data for predicting policy_output (action); values are torch.Tensor, np.ndarray, or dict/list combinations, and keys are integer env_ids.

Returns:
  • output (Dict[int, Any]): The dict of predicting action for the interaction with env.

ReturnsKeys
  • necessary: action
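Usage mirrors collect mode minus the exploration argument; a minimal sketch, assuming the eval_mode entry point:

```python
import torch

data = {0: torch.randn(4)}  # obs keyed by env_id; shape is illustrative
# output = policy.eval_mode.forward(data)   # greedy (argmax) action selection, no eps
# action = output[0]['action']
```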

_forward_learn(data: dict) → Dict[str, Any][source]
Overview:

Forward and backward function of learn mode. Acquire the data, calculate the loss and optimize learner model.

Arguments:
  • data (dict): Dict type data, including at least ['main_obs', 'target_obs', 'burnin_obs', 'action', 'reward', 'done', 'weight'].

Returns:
  • info_dict (Dict[str, Any]): Including cur_lr and total_loss
    • cur_lr (float): Current learning rate

    • total_loss (float): The calculated loss
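For illustration, a hedged sketch of the kind of batch this method consumes; the key names follow the list above, while the shapes, action space, and reward layout are assumptions:

```python
import torch

T, B, obs_dim = 8, 64, 4   # unroll length, batch size, obs dim (all illustrative)
burnin_step = 2

batch = {
    'burnin_obs': torch.randn(burnin_step, B, obs_dim),  # only warms up the RNN state
    'main_obs': torch.randn(T, B, obs_dim),              # obs for the current Q estimate
    'target_obs': torch.randn(T, B, obs_dim),            # n-step-shifted obs for the target Q
    'action': torch.randint(0, 6, (T, B)),               # discrete actions (6 is arbitrary)
    'reward': torch.randn(T, B),                         # per-step reward; exact n-step layout may differ
    'done': torch.zeros(T, B),
    'weight': torch.ones(T, B),                          # IS weights when priority_IS_weight=True
}
# info = policy._forward_learn(batch)  # returns at least {'cur_lr': ..., 'total_loss': ...}
```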

_init_collect() → None[source]
Overview:

Collect mode init method. Called by self.__init__. Initialize the trajectory/unroll length and the collect model.

_init_eval() → None[source]
Overview:

Evaluate mode init method. Called by self.__init__. Init eval model with argmax strategy.

_init_learn() → None[source]
Overview:

Init the learner model of R2D2Policy

Arguments:

Note

The _init_learn method takes its arguments from self._cfg.learn in the config file

  • learning_rate (float): The learning rate of the optimizer

  • gamma (float): The discount factor

  • nstep (int): The number of steps for the n-step return

  • value_rescale (bool): Whether to use value rescaled loss in algorithm

  • burnin_step (int): The number of burn-in steps
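Since value_rescale is one of these options, here is the standard invertible value transform from the R2D2 paper, written as a self-contained sketch (the actual DI-engine helper may differ in name and location):

```python
import torch

def value_rescale(x: torch.Tensor, eps: float = 1e-2) -> torch.Tensor:
    # h(x) = sign(x) * (sqrt(|x| + 1) - 1) + eps * x
    return torch.sign(x) * (torch.sqrt(torch.abs(x) + 1.) - 1.) + eps * x

def value_inv_rescale(x: torch.Tensor, eps: float = 1e-2) -> torch.Tensor:
    # Inverse of h, applied to the target network output before the Bellman backup.
    return torch.sign(x) * (
        ((torch.sqrt(1. + 4. * eps * (torch.abs(x) + 1. + eps)) - 1.) / (2. * eps)) ** 2 - 1.
    )
```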

_process_transition(obs: Any, model_output: dict, timestep: collections.namedtuple) → dict[source]
Overview:

Generate dict type transition data from inputs.

Arguments:
  • obs (Any): Env observation

  • model_output (dict): Output of collect model, including at least [‘action’, ‘prev_state’]

  • timestep (namedtuple): Output after env step, including at least ['reward', 'done'] (here 'obs' indicates the obs after the env step).

Returns:
  • transition (dict): Dict type transition data.
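A hedged sketch of what such a transition might look like, using only the keys named above (the function name here is hypothetical, not the DI-engine implementation):

```python
import collections

# Hypothetical env step output, matching the arguments described above.
Timestep = collections.namedtuple('Timestep', ['obs', 'reward', 'done', 'info'])

def process_transition_sketch(obs, model_output: dict, timestep) -> dict:
    # Pair the pre-step obs with the chosen action, the RNN state used to
    # choose it, and the reward/done flag returned by the env step.
    return {
        'obs': obs,
        'action': model_output['action'],
        'prev_state': model_output['prev_state'],
        'reward': timestep.reward,
        'done': timestep.done,
    }
```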