utils.pytorch_ddp_dist_helper¶
pytorch_ddp_dist_helper¶
Please refer to ding/ding/utils/pytorch_ddp_dist_helper.py for usage.
error_wrapper¶
- Overview:
  Wrap the function so that any exception raised inside it is caught and default_ret is returned instead.
- Arguments:
  - fn (Callable): the function to be wrapped
  - default_ret (obj): the default return value used when an exception occurs in the function
- Returns:
  - wrapper (Callable): the wrapped function
- Examples:
>>> # Used to check for FakeLink (refer to utils.linklink_dist_helper.py)
>>> def get_rank():
>>>     # Get the rank of the linklink model; return 0 if FakeLink is used.
>>>     if is_fake_link:
>>>         return 0
>>>     return error_wrapper(link.get_rank, 0)()
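As an illustration, a minimal sketch of such a wrapper (an assumption, not necessarily the exact implementation) could look like:
>>> from typing import Any, Callable
>>> def error_wrapper(fn: Callable, default_ret: Any) -> Callable:
>>>     # Return a callable that falls back to default_ret
>>>     # whenever fn raises any exception.
>>>     def wrapper(*args, **kwargs):
>>>         try:
>>>             return fn(*args, **kwargs)
>>>         except Exception:
>>>             return default_ret
>>>     return wrapper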
get_rank¶
- Overview:
Get the rank of the current process among the total world_size processes.
get_world_size¶
- Overview:
Get the world_size (the total number of processes in data-parallel training).
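Both helpers can be thought of as thin, fault-tolerant wrappers around torch.distributed; the fallback values 0 and 1 below are assumptions for the non-initialized (single-process) case:
>>> import torch.distributed as dist
>>> # dist.get_rank/dist.get_world_size raise if no process group is
>>> # initialized, so fall back to a single-process view (rank 0, size 1).
>>> get_rank = error_wrapper(dist.get_rank, 0)
>>> get_world_size = error_wrapper(dist.get_world_size, 1)
>>> get_rank(), get_world_size()  # e.g. (0, 1) before any init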
broadcast¶
- Overview:
  Broadcast the tensor to the whole group. The tensor must have the same number of elements in all processes participating in the collective.
- Arguments:
  - tensor (Tensor): data to be sent if src is the rank of the current process, and the tensor in which to save received data otherwise
  - src (int): source rank
  - group (ProcessGroup, optional): the process group to work on; if None, the default process group is used
  - async_op (bool, optional): whether this op should be an async op
- Returns:
  An async work handle if async_op is set to True; None if async_op is False or if this process is not part of the group.
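A usage sketch, assuming this helper forwards to torch.distributed.broadcast with the signature above (the tensor shape and contents here are illustrative):
>>> import torch
>>> import torch.distributed as dist
>>> # Every rank must allocate a tensor with the same number of elements.
>>> t = torch.zeros(4)
>>> if dist.get_rank() == 0:
>>>     t = torch.arange(4, dtype=torch.float32)  # data to send from src
>>> broadcast(t, src=0)  # overwrites t in place on non-src ranks
>>> # Afterwards every rank holds tensor([0., 1., 2., 3.])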
allreduce¶
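No overview is given for this helper; assuming it forwards to torch.distributed.all_reduce, a typical (sum) call looks like:
>>> import torch
>>> import torch.distributed as dist
>>> x = torch.ones(3) * dist.get_rank()
>>> dist.all_reduce(x, op=dist.ReduceOp.SUM)  # sums x in place across all ranks
>>> x /= dist.get_world_size()  # common pattern: average, e.g. of gradients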
get_group¶
- Overview:
  Get the group segmentation, splitting all processes into groups of group_size each; a sketch follows this list.
- Arguments:
  - group_size (int): the number of processes in each group
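A hypothetical sketch of such a segmentation, assuming consecutive ranks are grouped and world_size is divisible by group_size:
>>> import torch.distributed as dist
>>> def get_group(group_size: int):
>>>     world_size = dist.get_world_size()
>>>     assert world_size % group_size == 0
>>>     # Split ranks [0, world_size) into consecutive blocks of group_size.
>>>     rank_blocks = [
>>>         list(range(i, i + group_size))
>>>         for i in range(0, world_size, group_size)
>>>     ]
>>>     # dist.new_group must be called by every process with the same arguments.
>>>     return [dist.new_group(ranks) for ranks in rank_blocks]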
dist_mode¶
- Overview:
Wrap the function so that the distributed environment is initialized automatically before each call and finalized after it (see the sketch below).
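A minimal sketch of such a decorator, assuming it pairs the dist_init and dist_finalize helpers documented below:
>>> from functools import wraps
>>> def dist_mode(fn):
>>>     @wraps(fn)
>>>     def wrapper(*args, **kwargs):
>>>         dist_init()   # set up the distributed environment
>>>         try:
>>>             return fn(*args, **kwargs)
>>>         finally:
>>>             dist_finalize()  # always release resources afterwards
>>>     return wrapper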
dist_init¶
- Overview:
Initialize the distributed training setting.
dist_finalize¶
- Overview:
Finalize distributed training resources
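A sketch of a typical init/finalize pair built on torch.distributed; the NCCL backend and environment-variable rendezvous (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) are assumptions, not necessarily what this module does:
>>> import torch
>>> import torch.distributed as dist
>>> def dist_init(backend: str = 'nccl'):
>>>     # Reads RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT from the environment.
>>>     dist.init_process_group(backend=backend)
>>>     rank, world_size = dist.get_rank(), dist.get_world_size()
>>>     if torch.cuda.is_available():
>>>         # Pin each process to one GPU, wrapping around if oversubscribed.
>>>         torch.cuda.set_device(rank % torch.cuda.device_count())
>>>     return rank, world_size
>>> def dist_finalize():
>>>     dist.destroy_process_group()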