Unverified commit 9f4a8763 authored by Yizhou Wang, committed by GitHub

Fix broadcast error on multi-node training with ZeroStage3 and TensorParallel=2 (#2999)

* * try to fix broadcast error on multi-node training with ZeroStage3 and TensorParallel=2

* * fix format error

* * fix format issue

* * add TODO for integrated testing of TP and ZeRO 1/2/3

* fix default pg error

---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Parent 5c6da1f0
@@ -813,7 +813,11 @@ class Init(InsertPostInitMethodToModuleSubClasses):
                 f"Partitioning param {debug_param2name_id_shape(param)} module={debug_module2name(module)}")
             if get_accelerator().on_accelerator(param):
-                dist.broadcast(param, 0, self.get_dp_process_group())
+                if dist.get_world_group() == self.get_dp_process_group():
+                    dist.broadcast(param, 0, self.get_dp_process_group())
+                else:
+                    dist.broadcast(param, dist.get_global_rank(self.get_dp_process_group(), 0),
+                                   self.get_dp_process_group())
             else:
                 if dist.get_rank() == 0:
                     logger.warn(f"param `{name}` in {module.__class__.__name__} "
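Why the original one-liner was wrong: in torch.distributed (which DeepSpeed's `deepspeed.comm` wraps), the `src` argument of `broadcast()` is a global rank even when a subgroup is passed via `group`. With TensorParallel=2, the data-parallel group is a subgroup of the world group, so `src=0` may name a process that is not even a member. A minimal sketch of the failure mode and fix in plain torch.distributed (not DeepSpeed code), assuming a hypothetical 4-rank job whose dp groups are [0, 2] and [1, 3]:

```python
# Minimal sketch of the fix above, using plain torch.distributed.
# Assumes torch >= 2.0, where dist.get_global_rank() is public API.
import torch
import torch.distributed as dist

def broadcast_from_group_rank0(param: torch.Tensor, dp_group) -> None:
    # broadcast() treats `src` as a GLOBAL rank even when `group` is given.
    # With TP=2 the dp group may be [1, 3]; global rank 0 is not a member,
    # so src=0 broadcasts from the wrong process or hangs the collective.
    if dp_group == dist.group.WORLD:
        dist.broadcast(param, src=0, group=dp_group)
    else:
        # Translate "rank 0 within dp_group" to its global rank (1 for [1, 3]).
        src = dist.get_global_rank(dp_group, 0)
        dist.broadcast(param, src=src, group=dp_group)
```

This mirrors the patched code path: the world-group case keeps the old behavior, and the subgroup case maps group rank 0 back to its global rank before broadcasting.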
@@ -21,6 +21,7 @@ pytestmark = pytest.mark.skipif(not required_maximum_torch_version(major_version
                                 reason='Megatron-LM package requires Pytorch version 1.13 or below')
 
+# TODO: integrated testing of TP and ZeRO 1/2/3
 def get_deepspeed_model(model):
     ds_config_dict = {
         "train_micro_batch_size_per_gpu": 1,
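For context on the new TODO: an integrated TP + ZeRO test would run the model above under a ZeRO-enabled config along these lines. This is a hypothetical sketch; the test's actual `ds_config_dict` is truncated above, and only the key shown there is confirmed:

```python
# Hypothetical ZeRO-3 variant of the test config. Only
# "train_micro_batch_size_per_gpu" appears in the diff above;
# "zero_optimization" is added here for illustration.
ds_config_dict = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,  # the stage under which the broadcast bug surfaced
    },
}
```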