Unverified commit 11279ae4 authored by Shaden Smith, committed by GitHub

ZeRO-Infinity docs (#979)

* zinf tutorial

* more megatron integration docs

* ZInf + tiling docs
Parent 598e50f9
@@ -216,9 +216,10 @@ class TiledLinear(torch.nn.Module):
        self.bias.copy_(other.bias)

    .. note::
        If ZeRO-3 is enabled, this is a collective operation and the
        updated parameters of data-parallel rank 0 will be visible on all
        ranks. See :class:`deepspeed.zero.GatheredParameters` for more
        information.

    Args:
...
ZeRO
####

The Zero Redundancy Optimizer (ZeRO) removes the memory redundancies across
data-parallel processes by partitioning the three model states (optimizer
states, gradients, and parameters) across data-parallel processes instead of
@@ -8,13 +8,31 @@ replicating them. By doing this, it boosts memory efficiency compared to
classic data-parallelism while retaining its computational granularity and
communication efficiency.

#. **ZeRO Stage 1**: The optimizer states (e.g., for the `Adam optimizer
   <https://arxiv.org/abs/1412.6980>`_, the 32-bit weights and the first and
   second moment estimates) are partitioned across the processes, so that each
   process updates only its partition.

#. **ZeRO Stage 2**: The reduced 32-bit gradients for updating the model weights
   are also partitioned such that each process retains only the gradients
   corresponding to its portion of the optimizer states.

#. **ZeRO Stage 3**: The 16-bit model parameters are partitioned across the
   processes. ZeRO-3 will automatically collect and partition them during the
   forward and backward passes.

In addition, ZeRO-3 includes the *infinity offload engine* to form
ZeRO-Infinity (`paper <https://arxiv.org/abs/2104.07857>`_), which can offload
all model states to both CPU and NVMe memory for huge memory savings.
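
Each stage is selected through the ``stage`` field of the ``zero_optimization``
section of the DeepSpeed configuration (see the examples below). As a minimal
sketch, enabling only optimizer state partitioning would look like:

.. code-block:: python

    {
        "zero_optimization": {
            "stage": 1
        }
    }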
For a deep dive into our algorithms, please see our `papers <https://www.deepspeed.ai/#publications>`_ on `ZeRO
<https://arxiv.org/abs/1910.02054>`_, `ZeRO-Offload
<https://arxiv.org/abs/2101.06840>`_,
and `ZeRO-Infinity <https://arxiv.org/abs/2104.07857>`_.

.. note::
    DeepSpeed first included offloading capabilities with **ZeRO-Offload**, a
    system for offloading optimizer and gradient states to CPU memory within
    ZeRO-2. **ZeRO-Infinity** is the next generation of offloading
    capabilities, accessible to ZeRO-3. ZeRO-Infinity has all of the savings
    of ZeRO-Offload, is additionally able to offload more of the model weights,
    and has more effective bandwidth utilization and overlapping of computation
    and communication.
Getting Started
===============
@@ -28,14 +46,15 @@ our `config guide <https://www.deepspeed.ai/docs/config-json/#zero-optimizations
for a complete list of options for configuration and performance tuning.

.. note::
    ZeRO-Infinity and ZeRO-Offload work best with our heavily optimized
    :class:`deepspeed.ops.adam.DeepSpeedCPUAdam` optimizer. We recommend using
    our `optimizer config <https://www.deepspeed.ai/docs/config-json/#optimizer-parameters>`_
    to instruct :meth:`deepspeed.initialize` to build the optimizer for you.
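
For example, an ``optimizer`` block along the following lines (the
hyperparameter values are placeholders) is enough for
:meth:`deepspeed.initialize` to construct the optimizer; DeepSpeed can then
substitute its CPU-optimized Adam implementation when optimizer offloading is
enabled:

.. code-block:: python

    {
        "optimizer": {
            "type": "Adam",
            "params": {
                "lr": 0.001,
                "betas": [0.9, 0.999],
                "eps": 1e-8,
                "weight_decay": 0.01
            }
        }
    }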
Example ZeRO-3 Configurations
=============================

#. Use ZeRO to partition the optimizer states (stage 1), gradients (stage 2),
   and parameters (stage 3).
@@ -46,8 +65,6 @@ Example ZeRO-3 Offload Configurations

       {
           "zero_optimization": {
               "stage": 3,
           },
           "fp16": {
               "enabled": true
@@ -68,14 +85,13 @@ Example ZeRO-3 Offload Configurations
   }

#. Additionally offload the optimizer states and computations to the CPU with ZeRO-Infinity.

   .. code-block:: python

       {
           "zero_optimization": {
               "stage": 3,
               "offload_optimizer": {
                   "device": "cpu"
               }
@@ -91,7 +107,6 @@ Example ZeRO-3 Offload Configurations

       {
           "zero_optimization": {
               "stage": 3,
               "offload_optimizer": {
                   "device": "cpu"
               }
@@ -103,14 +118,13 @@ Example ZeRO-3 Offload Configurations
   }

#. Save even MORE memory by offloading to NVMe (if available on your system):

   .. code-block:: python

       {
           "zero_optimization": {
               "stage": 3,
               "offload_optimizer": {
                   "device": "nvme",
                   "nvme_path": "/nvme_data"
@@ -134,6 +148,9 @@ granularity of (sub)module ``forward()`` methods. The backward pass is
handled similarly. This strategy has two underlying assumptions:

#. The forward and backward passes of submodules must individually fit in device memory.
   If this is not the case, :class:`deepspeed.zero.TiledLinear` implements
   **memory-centric tiling** and works with ZeRO-3 to break linear layers
   into a sequence of smaller submodules that can fit in memory.

#. A module's parameters are only accessed within its own ``__init__`` and ``forward()`` methods.
   Otherwise, DeepSpeed must be instructed to collect and re-partition the parameter,
   as sketched just after this list.
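
As a sketch of the second point, a parameter that must be touched outside of
``forward()`` (for example, to re-initialize it) can be gathered and then
re-partitioned with :class:`deepspeed.zero.GatheredParameters`. The model and
attribute names below are illustrative:

.. code-block:: python

    import torch
    import deepspeed

    # Gather the partitioned weight on all ranks; only rank 0 modifies it,
    # and the update is propagated to all ranks when the context exits.
    with deepspeed.zero.GatheredParameters(model.linear.weight, modifier_rank=0):
        if torch.distributed.get_rank() == 0:
            torch.nn.init.zeros_(model.linear.weight)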
@@ -153,6 +170,7 @@ you can simply allocate your model in our context:

       model = MyLargeModel()

.. autoclass:: deepspeed.zero.Init
   :members:
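
As a further sketch, ``deepspeed.zero.Init`` accepts keyword arguments such as
``remote_device`` that control where the partitioned parameters initially live;
``MyLargeModel`` is a placeholder and only the most common arguments are shown:

.. code-block:: python

    import deepspeed

    # Partition parameters as they are created and keep them in CPU memory
    # so that a very large model can be constructed without GPU OOM.
    with deepspeed.zero.Init(remote_device="cpu", pin_memory=True):
        model = MyLargeModel()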
@@ -185,46 +203,56 @@ parameters are accessed outside of the module that created them. To do so, use

Registering External Parameters
===============================

ZeRO-3 will automatically collect and partition the model parameters as they
are needed during the forward and backward passes. However, in some cases a
parameter may be used outside of its module's forward pass. We call these
*external* parameters. ZeRO-3 can coordinate these parameters if they are
registered either automatically or manually.

.. note::
    DeepSpeed version ``0.3.15`` includes automatic external parameter
    discovery and registration to support the most common cases. Parameters
    can still be manually registered if they cannot be automatically
    detected.

DeepSpeed can automatically detect the following external parameter scenarios:

#. Parameter access: consider the following pattern common in language models
   such as GPT:

   .. code-block:: python

       class LanguageModel(torch.nn.Module):
           ...
           def forward(self, inputs):
               embeds = self.embeddings(inputs)
               ...
               logits = compute_logits(output, self.embeddings.weight)
               ...

   The tensor ``embeddings.weight`` is used in both ``embeddings.forward()`` and
   ``compute_logits()``. We call ``embeddings.weight`` an *external* parameter
   because it is used in the training loop outside of its owning module's
   forward pass.

#. Returning a parameter:

   ``CustomLinear`` returns both an output and its own ``bias`` parameter. DeepSpeed
   will detect the external ``bias`` parameter and register it with submodules that
   use ``CustomLinear``.

   .. code-block:: python

       class CustomLinear(torch.nn.Linear):
           def forward(self, *input):
               output = super().forward(*input)
               return output, self.bias

.. note::
    Most models should not need to manually register parameters.
.. autofunction:: deepspeed.zero.register_external_parameter
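
If a parameter cannot be detected automatically, it can be registered manually
in the consuming module's ``__init__``. A sketch, with a hypothetical module
that reuses an embedding weight for its output projection:

.. code-block:: python

    import torch
    import deepspeed

    class Decoder(torch.nn.Module):
        def __init__(self, embeddings):
            super().__init__()
            self.embeddings = embeddings
            # embeddings.weight is used here, outside of embeddings.forward(),
            # so tell ZeRO-3 to also gather it for this module's forward pass.
            deepspeed.zero.register_external_parameter(self, embeddings.weight)

        def forward(self, hidden):
            # Weight tying: project hidden states back onto the vocabulary.
            return hidden @ self.embeddings.weight.t()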
@@ -234,5 +262,16 @@ registered prior to the first forward pass.

Memory-Centric Tiling
---------------------

To reduce the working memory requirements of DL training for large models,
ZeRO-Infinity includes a technique called *memory-centric tiling* that exploits
the data fetch and release pattern of ZeRO-3 to reduce working memory
requirements by breaking down a large operator into smaller tiles that can be
executed sequentially. When combined with ZeRO-3, the parameters and gradients
of each tile can be fetched and released one at a time, reducing the working
memory in proportion to the number of tiles. ZeRO-Infinity can therefore
support operators of arbitrary size without refactoring for model parallelism
to fit them in limited GPU memory.
.. autoclass:: deepspeed.zero.TiledLinear .. autoclass:: deepspeed.zero.TiledLinear
   :members:
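
As a usage sketch, a very large projection layer can be declared tiled so that
ZeRO-3 fetches one tile's parameters at a time; the layer size and split counts
below are illustrative only:

.. code-block:: python

    import deepspeed

    # Break a 32768 x 32768 linear layer into a 4 x 4 grid of tiles so that
    # only a fraction of its parameters must be resident in GPU memory at once.
    proj = deepspeed.zero.TiledLinear(in_features=32768,
                                      out_features=32768,
                                      in_splits=4,
                                      out_splits=4)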