utils.pytorch_ddp_dist_helper

pytorch_ddp_dist_helper

Please refer to ding/ding/utils/pytorch_ddp_dist_helper.py for usage.

error_wrapper

Overview:

Wrap the function so that any exception raised inside it is caught and default_ret is returned instead.

Arguments:
  • fn (Callable): the function to be wrapped

  • default_ret (obj): the default return value when an exception occurs in the function

Returns:
  • wrapper (Callable): the wrapped function

Examples:
>>> # Used to check for FakeLink (refer to utils.linklink_dist_helper.py)
>>> def get_rank():  # Get the rank of the linklink model; return 0 if FakeLink is used.
>>>    if is_fake_link:
>>>        return 0
>>>    return error_wrapper(link.get_rank, 0)()
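
A minimal implementation sketch of the documented behavior (an illustration only, not necessarily the exact library code):

>>> def error_wrapper(fn, default_ret):
>>>     # Return a wrapper that calls fn and falls back to default_ret
>>>     # whenever fn raises any exception.
>>>     def wrapper(*args, **kwargs):
>>>         try:
>>>             return fn(*args, **kwargs)
>>>         except Exception:
>>>             return default_ret
>>>     return wrapper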

get_rank

Overview:

Get the rank of the current process within the total world_size
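
Example (a hedged usage sketch; model and the checkpoint path are placeholders, and the process group is assumed to be initialized already, e.g. via dist_init):

>>> import torch
>>> # Only the rank-0 process writes the checkpoint in data parallel training.
>>> if get_rank() == 0:
>>>     torch.save(model.state_dict(), 'ckpt.pth')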

get_world_size

Overview:

Get the world_size (the total number of processes in data parallel training)
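
Example (a hedged usage sketch, assuming the process group is already initialized):

>>> # Derive the global batch size from the per-process batch size.
>>> batch_size_per_proc = 32
>>> global_batch_size = batch_size_per_proc * get_world_size()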

broadcast

Overview:

Broadcast the tensor to the whole group. The tensor must have the same number of elements in all processes participating in the collective.

Arguments:
  • tensor (Tensor): data to be sent if src is the rank of the current process, and the tensor used to save received data otherwise

  • src (int): source rank

  • group (ProcessGroup, optional): the process group to work on; if None, the default process group will be used

  • async_op (bool, optional): whether this op should be an async op

Returns:
  • An async work handle if async_op is set to True; None if async_op is False or the caller is not part of the group
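
Example (a hedged usage sketch, assuming this helper delegates to torch.distributed.broadcast and the default process group is already initialized):

>>> import torch
>>> # Rank 0 fills the tensor; every other rank receives a copy in place.
>>> t = torch.zeros(4)
>>> if get_rank() == 0:
>>>     t = torch.arange(4, dtype=torch.float32)
>>> broadcast(t, src=0)
>>> # After the call, t is tensor([0., 1., 2., 3.]) on every rank.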

allreduce
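
allreduce is documented only by name here. As a hedged sketch of the primitive it wraps, the snippet below sums a tensor across all processes and averages it by the world size, e.g. to synchronize gradients manually; whether the helper itself averages internally is an assumption and should be confirmed in the source.

>>> import torch
>>> import torch.distributed as dist
>>> grad = torch.ones(3)
>>> dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # element-wise sum over all processes
>>> grad.div_(get_world_size())                  # average, as data parallel training usually wants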

get_group

Overview:

Get the group segmentation, with group_size processes in each group

Arguments:
  • group_size (int): the number of processes in each group
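
Example (an illustration of the segmentation, assuming consecutive ranks are grouped together; the actual grouping scheme should be confirmed in the source):

>>> # With world_size == 8 and group_size == 4, the ranks split into
>>> # [[0, 1, 2, 3], [4, 5, 6, 7]].
>>> world_size, group_size = 8, 4
>>> groups = [list(range(i, i + group_size)) for i in range(0, world_size, group_size)]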

dist_mode

Overview:

Wrap the function so that the distributed setting is initialized automatically before each call and finalized automatically afterwards
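
A minimal sketch of such a decorator, assuming it simply brackets the wrapped call with dist_init and dist_finalize:

>>> import functools
>>> def dist_mode(fn):
>>>     @functools.wraps(fn)
>>>     def wrapper(*args, **kwargs):
>>>         dist_init()          # set up the distributed environment
>>>         try:
>>>             return fn(*args, **kwargs)
>>>         finally:
>>>             dist_finalize()  # always release resources, even on error
>>>     return wrapper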

dist_init

Overview:

Initialize the distributed training setting
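
A hedged sketch of what such an initializer typically does for PyTorch DDP (the backend name, the environment-variable rendezvous, and the return value are assumptions, not a confirmed signature):

>>> import torch.distributed as dist
>>> def dist_init(backend='nccl'):
>>>     # Rendezvous via MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE in the
>>>     # environment and create the default process group.
>>>     dist.init_process_group(backend=backend)
>>>     return dist.get_rank(), dist.get_world_size()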

dist_finalize

Overview:

Finalize distributed training resources
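
A hedged sketch, assuming finalization amounts to tearing down the default process group:

>>> import torch.distributed as dist
>>> def dist_finalize():
>>>     # Destroying the default process group releases the DDP resources.
>>>     if dist.is_initialized():
>>>         dist.destroy_process_group()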