Unverified commit 01d17492, authored by hablb, committed by GitHub

Fix memory leak in zero2 contiguous gradients (#3306)

extra_large_param_to_reduce is never consumed when contiguous_gradients is False, yet it was still being assigned, which kept a reference to the param alive for the lifetime of the application.
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Parent 0e357666
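To make the leak described in the commit message concrete, here is a minimal sketch of the pattern (hypothetical class and method names, not the actual DeepSpeed code): before the fix, the attribute was assigned whenever the param exceeded the reduce bucket size, but nothing on the non-contiguous path ever reads or clears it, so the optimizer object held the last oversized param alive indefinitely.

```python
import torch


class OptimizerSketch:
    """Hypothetical stand-in for the ZeRO stage-2 optimizer (illustrative only)."""

    def __init__(self, reduce_bucket_size, contiguous_gradients):
        self.reduce_bucket_size = reduce_bucket_size
        self.contiguous_gradients = contiguous_gradients
        self.extra_large_param_to_reduce = None

    def grad_ready_pre_fix(self, param):
        # Pre-fix: assigned whenever the param exceeds the bucket size, even
        # though only the contiguous path ever consumes and clears it.
        if param.numel() > self.reduce_bucket_size:
            self.extra_large_param_to_reduce = param

    def grad_ready_post_fix(self, param):
        # Post-fix: only assigned on the path that actually uses it.
        if self.contiguous_gradients and param.numel() > self.reduce_bucket_size:
            self.extra_large_param_to_reduce = param


opt = OptimizerSketch(reduce_bucket_size=8, contiguous_gradients=False)
big = torch.zeros(16, requires_grad=True)
opt.grad_ready_pre_fix(big)
print(opt.extra_large_param_to_reduce is big)    # True: param kept alive

opt.extra_large_param_to_reduce = None
opt.grad_ready_post_fix(big)
print(opt.extra_large_param_to_reduce is None)   # True: no dangling reference
```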
@@ -839,14 +839,14 @@ class DeepSpeedZeroOptimizer(ZeROOptimizer):
             Gradient computed twice for this partition. \
             Multiple gradient reduction is currently not supported"
-        if param.numel() > self.reduce_bucket_size:
-            self.extra_large_param_to_reduce = param
-        elif self.contiguous_gradients:
-            # keeping the gradients contiguous to prevent memory fragmentation, and avoid flattening
-            new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(0, self.elements_in_ipg_bucket, param.numel())
-            new_grad_tensor.copy_(param.grad.view(-1))
-            param.grad.data = new_grad_tensor.data.view_as(param.grad)
+        if self.contiguous_gradients:
+            if param.numel() > self.reduce_bucket_size:
+                self.extra_large_param_to_reduce = param
+            else:
+                # keeping the gradients contiguous to prevent memory fragmentation, and avoid flattening
+                new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(0, self.elements_in_ipg_bucket, param.numel())
+                new_grad_tensor.copy_(param.grad.view(-1))
+                param.grad.data = new_grad_tensor.data.view_as(param.grad)
         self.elements_in_ipg_bucket += param.numel()
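Why guarding the assignment with contiguous_gradients is safe: the attribute is only ever read on the contiguous-gradients reduction path, which also resets it after reducing. Below is a simplified, self-contained sketch of that consumption step; the names and bodies are illustrative (average_tensor here is a stand-in, not DeepSpeed's implementation).

```python
import torch


class ReductionSketch:
    """Illustrative consumption side; not the actual DeepSpeed code."""

    def __init__(self, bucket_numel):
        self.contiguous_gradients = True
        self.extra_large_param_to_reduce = None
        self.ipg_buffer = [torch.zeros(bucket_numel)]
        self.ipg_index = 0

    def average_tensor(self, flat_grad):
        # Stand-in for the real averaging/all-reduce of the flat gradient.
        flat_grad.div_(2.0)

    def reduce_ipg_grads(self):
        # Only this contiguous path reads extra_large_param_to_reduce, and it
        # drops the reference once the oversized gradient has been reduced.
        # The non-contiguous path never touches it, so assigning it there
        # (as before the fix) left the reference in place indefinitely.
        if self.contiguous_gradients:
            if self.extra_large_param_to_reduce is not None:
                self.average_tensor(self.extra_large_param_to_reduce.grad.view(-1))
                self.extra_large_param_to_reduce = None
            else:
                self.average_tensor(self.ipg_buffer[self.ipg_index])
```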