COMA

COMAPolicy

class ding.policy.coma.COMAPolicy(cfg: dict, model: Optional[Union[type, torch.nn.modules.module.Module]] = None, enable_field: Optional[List[str]] = None)[source]
Overview:

Policy class of the COMA algorithm. COMA (Counterfactual Multi-Agent Policy Gradients) is a multi-agent reinforcement learning algorithm.

Interface:
_init_learn, _data_preprocess_learn, _forward_learn, _reset_learn, _state_dict_learn, _load_state_dict_learn

_init_collect, _forward_collect, _reset_collect, _process_transition, _init_eval, _forward_eval, _reset_eval, _get_train_sample, default_model, _monitor_vars_learn

Config:

ID | Symbol                    | Type  | Default Value | Description                                                                                                  | Other(Shape)
---|---------------------------|-------|---------------|--------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------
1  | type                      | str   | coma          | RL policy register name, refer to registry POLICY_REGISTRY                                                    | this arg is optional, a placeholder
2  | cuda                      | bool  | False         | Whether to use cuda for network                                                                                | this arg can differ between modes
3  | on_policy                 | bool  | True          | Whether the RL algorithm is on-policy or off-policy                                                            |
4  | priority                  | bool  | False         | Whether to use priority (PER)                                                                                  | priority sample, update priority
5  | priority_IS_weight        | bool  | False         | Whether to use Importance Sampling weight to correct the biased update                                         | IS weight
6  | learn.update_per_collect  | int   | 1             | How many updates (iterations) to train after one collection by the collector. Only valid in serial training    | this arg can vary across envs; a bigger value means more off-policy
7  | learn.target_update_theta | float | 0.001         | Target network update momentum parameter                                                                       | between [0, 1]
8  | learn.discount_factor     | float | 0.99          | Reward's future discount factor, aka. gamma                                                                    | may be 1 in sparse reward envs
9  | learn.td_lambda           | float | 0.8           | The trade-off factor of td-lambda, which balances 1-step td and mc                                             |
10 | learn.value_weight        | float | 1.0           | The loss weight of the value network                                                                           | policy network weight is set to 1
11 | learn.entropy_weight      | float | 0.01          | The loss weight of entropy regularization                                                                      | policy network weight is set to 1
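
For orientation, the fields above can be collected into one nested dict in the usual DI-engine config style; the sketch below is illustrative only, uses the documented defaults, and omits the model/collect/eval sections a complete experiment config would also need.

    # A minimal sketch of the COMA policy config fields documented in the table.
    # Only the tabled keys and their defaults are shown; this is not a complete
    # experiment config.
    coma_policy_config = dict(
        type='coma',
        cuda=False,
        on_policy=True,
        priority=False,
        priority_IS_weight=False,
        learn=dict(
            update_per_collect=1,
            target_update_theta=0.001,
            discount_factor=0.99,
            td_lambda=0.8,
            value_weight=1.0,
            entropy_weight=0.01,
        ),
    )
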
_data_preprocess_learn(data: List[Any]) → dict[source]
Overview:

Preprocess the data to fit the required data format for learning

Arguments:
  • data (List[Dict[str, Any]]): the data collected from collect function, the Dict

    in data should contain keys including at least [‘obs’, ‘action’, ‘reward’]

Returns:
  • data (Dict[str, Any]): the processed data, including at least

    [‘obs’, ‘action’, ‘reward’, ‘done’, ‘weight’]
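
A rough, non-authoritative sketch of this contract follows; the tensor shapes and the flat observation layout are placeholders rather than the real multi-agent observation format, and the actual method call is shown only as a comment because it needs a constructed COMAPolicy instance.

    import torch

    # Placeholder transitions carrying the documented required keys
    # ('obs', 'action', 'reward'); shapes are illustrative only.
    fake_data = [
        {
            'obs': torch.randn(3, 8),
            'action': torch.zeros(3, dtype=torch.long),
            'reward': torch.zeros(1),
        }
        for _ in range(4)
    ]
    # processed = policy._data_preprocess_learn(fake_data)
    # 'processed' then contains at least
    # ['obs', 'action', 'reward', 'done', 'weight'].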

_forward_collect(data: dict, eps: float) → dict[source]
Overview:

Collect output according to eps_greedy plugin

Arguments:
  • data (Dict[str, Any]): Dict type data, stacked env data for predicting policy_output(action),

    values are torch.Tensor or np.ndarray or dict/list combinations, keys are env_id indicated by integer.

  • eps (float): epsilon value for exploration, which is decayed by collected env step.

Returns:
  • output (Dict[int, Any]): Dict type data, including at least inferred action according to input obs.

ReturnsKeys
  • necessary: action
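
A hedged sketch of the call shape (the `policy` variable and the per-env observations are stand-ins; the real observation structure depends on the wrapped environment):

    # Stacked env data keyed by integer env_id, as described above.
    data = {0: 'obs_for_env_0', 1: 'obs_for_env_1'}   # stand-in observations
    eps = 0.05   # exploration rate, decayed by collected env steps
    # outputs = policy._forward_collect(data, eps)
    # outputs[0]['action']   # inferred action for env 0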

_forward_eval(data: dict) → dict[source]
Overview:

Forward function of eval mode, similar to self._forward_collect.

Arguments:
  • data (Dict[str, Any]): Dict type data, stacked env data for predicting policy_output(action),

    values are torch.Tensor or np.ndarray or dict/list combinations, keys are env_id indicated by integer.

Returns:
  • output (Dict[int, Any]): The dict of predicting action for the interaction with env.

ReturnsKeys
  • necessary: action
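
The eval-mode call has the same shape, without the exploration parameter (again a stand-in sketch):

    data = {0: 'obs_for_env_0'}   # stacked env data keyed by env_id
    # outputs = policy._forward_eval(data)   # greedy action selection, no eps
    # outputs[0]['action']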

_forward_learn(data: dict) → Dict[str, Any][source]
Overview:

Forward and backward function of learn mode: acquire the data, calculate the loss, and optimize the learner model.

Arguments:
  • data (Dict[str, Any]): Dict type data, a batch of data for training, values are torch.Tensor or

    np.ndarray or dict/list combinations.

Returns:
  • info_dict (Dict[str, Any]): Dict type data, a info dict indicated training result, which will be

    recorded in text log and tensorboard, values are python scalar or a list of scalars.

ArgumentsKeys:
  • necessary: obs, action, reward, done, weight

ReturnsKeys:
  • necessary: cur_lr, total_loss, policy_loss, value_loss, entropy_loss
    • cur_lr (float): Current learning rate

    • total_loss (float): The calculated loss

    • policy_loss (float): The policy(actor) loss of coma

    • value_loss (float): The value(critic) loss of coma

    • entropy_loss (float): The entropy loss
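
Schematically, one learner iteration consumes a training batch and produces the scalar dict above; the snippet below uses a stand-in dict to show how those keys are typically consumed (the real values come from `policy._forward_learn(train_batch)`):

    # Stand-in for the info dict returned by _forward_learn; real values come
    # from info = policy._forward_learn(train_batch).
    info = {'cur_lr': 5e-4, 'total_loss': 0.0, 'policy_loss': 0.0,
            'value_loss': 0.0, 'entropy_loss': 0.0}
    # Each value is a python scalar (or list of scalars), ready for the text
    # log and tensorboard.
    monitored = {k: info[k] for k in
                 ['cur_lr', 'total_loss', 'policy_loss', 'value_loss', 'entropy_loss']}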

_get_train_sample(data: list) → Union[None, List[Any]][source]
Overview:

Get the train sample from trajectory

Arguments:
  • data (list): The trajectory’s cache

Returns:
  • samples (dict): The training samples generated

_init_collect() → None[source]
Overview:

Collect mode init method. Called by self.__init__. Init traj and unroll length and the collect model. The model is wrapped with the eps_greedy_sample wrapper and the hidden state wrapper.

_init_eval() → None[source]
Overview:

Evaluate mode init method. Called by self.__init__. Init eval model with argmax strategy and hidden_state plugin.

_init_learn() → None[source]
Overview:

Init the learner model of COMAPolicy

Arguments:

Note

The _init_learn method takes its arguments from self._cfg.learn in the config file

  • learning_rate (float): The learning rate of the optimizer

  • gamma (float): The discount factor

  • lambda (float): The lambda factor, determining the mix of bootstrapping vs further accumulation of multi-step returns at each timestep.

  • value_weight (float): The weight of value loss in total loss

  • entropy_weight(float): The weight of entropy loss in total loss

  • agent_num (int): Since this is a multi-agent algorithm, we need to input the agent num.

  • batch_size (int): Need batch size info to init hidden_state plugins
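
Put together as a config fragment, and using the key names from the table above rather than the conceptual names in this list, a hypothetical `cfg.learn` consumed by _init_learn might look as follows (values without a documented default are purely illustrative, and the exact placement of agent_num and batch_size in the config tree may differ in shipped configs):

    # Hypothetical cfg.learn fragment for _init_learn; defaults from the table
    # where documented, otherwise illustrative values.
    learn_cfg = dict(
        learning_rate=0.0005,   # illustrative
        discount_factor=0.99,   # 'gamma' above
        td_lambda=0.8,          # 'lambda' above
        value_weight=1.0,
        entropy_weight=0.01,
        agent_num=3,            # illustrative agent count
        batch_size=32,          # illustrative, used to init hidden_state plugins
    )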

_monitor_vars_learn() → List[str][source]
Overview:

Return the names of the variables to be monitored.

Returns:
  • vars (List[str]): Variables’ name list.

_process_transition(obs: Any, model_output: dict, timestep: collections.namedtuple) → dict[source]
Overview:

Generate dict type transition data from inputs.

Arguments:
  • obs (Any): Env observation

  • model_output (dict): Output of collect model, including at least [‘action’, ‘prev_state’]

  • timestep (namedtuple): Output after env step, including at least [‘obs’, ‘reward’, ‘done’]

    (here ‘obs’ indicates obs after env step).

Returns:
  • transition (dict): Dict type transition data.
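
A non-authoritative sketch of the resulting transition's shape, built from stand-in inputs; the stored keys follow from the inputs listed above (the collect model's 'action' and 'prev_state', plus the env timestep's 'reward' and 'done'):

    # Stand-in inputs; in practice obs comes from the env and model_output
    # from the collect model's forward pass.
    obs = 'obs_before_step'
    model_output = {'action': 0, 'prev_state': None}
    timestep_reward, timestep_done = 0.0, False

    # Roughly, the transition pairs the pre-step obs with the model output and
    # the post-step reward/done signal.
    transition = {
        'obs': obs,
        'action': model_output['action'],
        'prev_state': model_output['prev_state'],
        'reward': timestep_reward,
        'done': timestep_done,
    }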

default_model() → Tuple[str, List[str]][source]
Overview:

Return this algorithm's default model setting for demonstration.

Returns:
  • model_info (Tuple[str, List[str]]): model name and model import_names

Note

The user can define and use a customized network model, but it must obey the same interface definition indicated by the import_names path. For COMA, that is ding.model.coma.coma.
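
As a hedged illustration of how this (model name, import_names) pair is typically consumed, the helper below only performs the imports that register the model class; the registry lookup itself goes through DI-engine's model registry and is left as a comment:

    import importlib

    def resolve_default_model(model_name, import_names):
        """Import the listed modules so the named model gets registered,
        then let the caller look it up by name in the model registry."""
        for path in import_names:
            importlib.import_module(path)
        # model_cls = MODEL_REGISTRY.get(model_name)  # registry lookup (comment only)
        return model_name

    # Usage (needs a COMAPolicy instance):
    # resolve_default_model(*policy.default_model())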