utils.pytorch_ddp_dist_helper

pytorch_ddp_dist_helper

Please refer to ding/ding/utils/pytorch_ddp_dist_helper.py for usage.

error_wrapper

Overview:

Wrap the function so that any exception raised inside it is caught and default_ret is returned instead.

Arguments:
  • fn (Callable): the function to be wrapped

  • default_ret (obj): the default return value when an exception occurs in the function

Returns:
  • wrapper (Callable): the wrapped function

Examples:
>>> # Used to check for FakeLink (refer to utils.linklink_dist_helper.py)
>>> def get_rank():  # Get the rank of the linklink model; return 0 if FakeLink is used.
>>>    if is_fake_link:
>>>        return 0
>>>    return error_wrapper(link.get_rank, 0)()
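
A minimal implementation sketch of the documented behavior (an illustration only, not necessarily the exact library code):

>>> def error_wrapper(fn, default_ret):
>>>     # Return a wrapper that calls fn and falls back to default_ret
>>>     # whenever fn raises any exception.
>>>     def wrapper(*args, **kwargs):
>>>         try:
>>>             return fn(*args, **kwargs)
>>>         except Exception:
>>>             return default_ret
>>>     return wrapper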

get_rank

Overview:

Get the rank of the current process within the total world_size
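
Example (a hedged usage sketch; model and the checkpoint path are placeholders, and the process group is assumed to be initialized already, e.g. via dist_init):

>>> import torch
>>> # Only the rank-0 process writes the checkpoint in data parallel training.
>>> if get_rank() == 0:
>>>     torch.save(model.state_dict(), 'ckpt.pth')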

get_world_size

Overview:

Get the world_size (the total number of processes in data parallel training)
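
Example (a hedged usage sketch, assuming the process group is already initialized):

>>> # Derive the global batch size from the per-process batch size.
>>> batch_size_per_proc = 32
>>> global_batch_size = batch_size_per_proc * get_world_size()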

broadcast

Overview:

Broadcast the tensor to the whole group. The tensor must have the same number of elements in all processes participating in the collective.

Arguments:
  • tensor (Tensor): data to be sent if src is the rank of the current process, and the tensor used to save received data otherwise

  • src (int): source rank

  • group (ProcessGroup, optional): the process group to work on; if None, the default process group will be used

  • async_op (bool, optional): whether this op should be an async op

Returns:
  • An async work handle if async_op is set to True; None if async_op is False or the caller is not part of the group
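
Example (a hedged usage sketch, assuming this helper delegates to torch.distributed.broadcast and the default process group is already initialized):

>>> import torch
>>> # Rank 0 fills the tensor; every other rank receives a copy in place.
>>> t = torch.zeros(4)
>>> if get_rank() == 0:
>>>     t = torch.arange(4, dtype=torch.float32)
>>> broadcast(t, src=0)
>>> # After the call, t is tensor([0., 1., 2., 3.]) on every rank.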

allreduce
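
allreduce is documented only by name here. As a hedged sketch of the primitive it wraps, the snippet below sums a tensor across all processes and averages it by the world size, e.g. to synchronize gradients manually; whether the helper itself averages internally is an assumption and should be confirmed in the source.

>>> import torch
>>> import torch.distributed as dist
>>> grad = torch.ones(3)
>>> dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # element-wise sum over all processes
>>> grad.div_(get_world_size())                  # average, as data parallel training usually wants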

get_group

Overview:

Get the group segmentation, with group_size processes in each group

Arguments:
  • group_size (int): the number of processes in each group
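
Example (an illustration of the segmentation, assuming consecutive ranks are grouped together; the actual grouping scheme should be confirmed in the source):

>>> # With world_size == 8 and group_size == 4, the ranks split into
>>> # [[0, 1, 2, 3], [4, 5, 6, 7]].
>>> world_size, group_size = 8, 4
>>> groups = [list(range(i, i + group_size)) for i in range(0, world_size, group_size)]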

dist_mode

Overview:

Wrap the function so that the distributed setting is initialized automatically before each call and finalized automatically afterwards
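
A minimal sketch of such a decorator, assuming it simply brackets the wrapped call with dist_init and dist_finalize:

>>> import functools
>>> def dist_mode(fn):
>>>     @functools.wraps(fn)
>>>     def wrapper(*args, **kwargs):
>>>         dist_init()          # set up the distributed environment
>>>         try:
>>>             return fn(*args, **kwargs)
>>>         finally:
>>>             dist_finalize()  # always release resources, even on error
>>>     return wrapper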

dist_init

Overview:

Initialize the distributed training setting
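
A hedged sketch of what such an initializer typically does for PyTorch DDP (the backend name, the environment-variable rendezvous, and the return value are assumptions, not a confirmed signature):

>>> import torch.distributed as dist
>>> def dist_init(backend='nccl'):
>>>     # Rendezvous via MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE in the
>>>     # environment and create the default process group.
>>>     dist.init_process_group(backend=backend)
>>>     return dist.get_rank(), dist.get_world_size()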

dist_finalize

Overview:

Finalize distributed training resources
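
A hedged sketch, assuming finalization amounts to tearing down the default process group:

>>> import torch.distributed as dist
>>> def dist_finalize():
>>>     # Destroying the default process group releases the DDP resources.
>>>     if dist.is_initialized():
>>>         dist.destroy_process_group()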