do allgather only in shared optimizer states groups (#4167)

* skip all-gather
* add notes

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

7f3e82fe · mzl · GitHub · 7711bdbb · 7f3e82fe

Showing 1 changed file with 4 additions and 0 deletions

deepspeed/runtime/utils.py +4 -0

--- a/deepspeed/runtime/utils.py
+++ b/deepspeed/runtime/utils.py
@@ -944,6 +944,10 @@ def all_gather_dp_groups(partitioned_param_groups, dp_process_group, start_align
        partition_id = dist.get_rank(group=dp_process_group[group_id])
        dp_world_size = dist.get_world_size(group=dp_process_group[group_id])

+        if dp_world_size == 1:
+            # No other rank shares optimizer states in this group, so there is nothing to gather.
+            # Pipeline parallelism with bf16 calls this by default even when the DP size is 1.
+            continue
        num_shards = max(1, partitioned_params[partition_id].numel() * dp_world_size // allgather_bucket_size)

        shard_size = partitioned_params[partition_id].numel() // num_shards
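
The effect of the new early `continue` can be checked without a distributed backend by reproducing only the arithmetic visible in this hunk. The sketch below is a hypothetical standalone helper: `plan_allgather_shards` and its `partition_numel` parameter are illustrative names, not part of DeepSpeed, and the real `all_gather_dp_groups` goes on to issue collective calls that are not modeled here.

```python
def plan_allgather_shards(partition_numel, dp_world_size, allgather_bucket_size):
    """Mirror the shard arithmetic from the hunk above (illustrative only).

    Returns None when the group would be skipped, otherwise a
    (num_shards, shard_size) tuple.
    """
    if dp_world_size == 1:
        # A single-rank data-parallel group has no peers to gather from,
        # which corresponds to the `continue` added by this commit.
        return None

    # Same expressions as in the diff, with the tensor's numel() passed in
    # as a plain integer (an assumption made to keep the sketch torch-free).
    num_shards = max(1, partition_numel * dp_world_size // allgather_bucket_size)
    shard_size = partition_numel // num_shards
    return num_shards, shard_size


if __name__ == "__main__":
    print(plan_allgather_shards(1000, 1, 500))  # None: group is skipped
    print(plan_allgather_shards(1000, 4, 500))  # (8, 125)
```

With a data-parallel world size of 1 the helper returns before any shard sizes are computed, which is the work the commit avoids for pipeline-parallel bf16 runs.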