Unverified commit 11279ae4 authored by Shaden Smith, committed by GitHub

ZeRO-Infinity docs (#979)

* zinf tutorial

* more megatron integration docs

* ZInf + tiling docs
Parent 598e50f9
@@ -216,9 +216,10 @@ class TiledLinear(torch.nn.Module):
        self.bias.copy_(other.bias)

    .. note::
        If ZeRO-3 is enabled, this is a collective operation and the
        updated parameters of data-parallel rank 0 will be visible on all
        ranks. See :class:`deepspeed.zero.GatheredParameters` for more
        information.

    Args:
...
ZeRO
####

The Zero Redundancy Optimizer (ZeRO) removes the memory redundancies across
data-parallel processes by partitioning the three model states (optimizer
states, gradients, and parameters) across data-parallel processes instead of
@@ -8,13 +8,31 @@ replicating them. By doing this, it boosts memory efficiency compared to
classic data-parallelism while retaining its computational granularity and
communication efficiency.

#. **ZeRO Stage 1**: The optimizer states (e.g., for the `Adam optimizer
   <https://arxiv.org/abs/1412.6980>`_, the 32-bit weights and the first and
   second moment estimates) are partitioned across the processes, so that each
   process updates only its partition.

#. **ZeRO Stage 2**: The reduced 32-bit gradients for updating the model weights
   are also partitioned such that each process retains only the gradients
   corresponding to its portion of the optimizer states.

#. **ZeRO Stage 3**: The 16-bit model parameters are partitioned across the
   processes. ZeRO-3 will automatically collect and partition them during the
   forward and backward passes.

In addition, ZeRO-3 includes the *infinity offload engine* to form
ZeRO-Infinity (`paper <https://arxiv.org/abs/2104.07857>`_), which can offload
all model states to both CPU and NVMe memory for huge memory savings.
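
Each stage is selected through the ``stage`` field of the ``zero_optimization``
section of the DeepSpeed configuration (see the examples below). As a minimal
sketch, enabling only optimizer state partitioning would look like:

.. code-block:: python

    {
        "zero_optimization": {
            "stage": 1
        }
    }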
For a deep dive into our algorithms, please see our `papers <https://www.deepspeed.ai/#publications>`_ on `ZeRO
<https://arxiv.org/abs/1910.02054>`_, `ZeRO-Offload
<https://arxiv.org/abs/2101.06840>`_,
and `ZeRO-Infinity <https://arxiv.org/abs/2104.07857>`_.

.. note::
    DeepSpeed first included offloading capabilities with **ZeRO-Offload**, a
    system for offloading optimizer and gradient states to CPU memory within
    ZeRO-2. **ZeRO-Infinity** is the next generation of offloading
    capabilities, accessible to ZeRO-3. ZeRO-Infinity has all of the savings
    of ZeRO-Offload, is additionally able to offload more of the model weights,
    and has more effective bandwidth utilization and overlapping of computation
    and communication.
Getting Started
===============
@@ -28,14 +46,15 @@ our `config guide <https://www.deepspeed.ai/docs/config-json/#zero-optimizations
for a complete list of options for configuration and performance tuning.

.. note::
    ZeRO-Infinity and ZeRO-Offload work best with our heavily optimized
    :class:`deepspeed.ops.adam.DeepSpeedCPUAdam` optimizer. We recommend using
    our `optimizer config <https://www.deepspeed.ai/docs/config-json/#optimizer-parameters>`_
    to instruct :meth:`deepspeed.initialize` to build the optimizer for you.
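
For example, an ``optimizer`` block along the following lines (the
hyperparameter values are placeholders) is enough for
:meth:`deepspeed.initialize` to construct the optimizer; DeepSpeed can then
substitute its CPU-optimized Adam implementation when optimizer offloading is
enabled:

.. code-block:: python

    {
        "optimizer": {
            "type": "Adam",
            "params": {
                "lr": 0.001,
                "betas": [0.9, 0.999],
                "eps": 1e-8,
                "weight_decay": 0.01
            }
        }
    }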
Example ZeRO-3 Configurations
=============================

#. Use ZeRO to partition the optimizer states (stage 1), gradients (stage 2),
   and parameters (stage 3).
@@ -46,8 +65,6 @@ Example ZeRO-3 Offload Configurations

       {
           "zero_optimization": {
               "stage": 3,
           },
           "fp16": {
               "enabled": true
@@ -68,14 +85,13 @@ Example ZeRO-3 Offload Configurations
   }

#. Additionally offload the optimizer states and computations to the CPU with ZeRO-Infinity.

   .. code-block:: python

       {
           "zero_optimization": {
               "stage": 3,
               "offload_optimizer": {
                   "device": "cpu"
               }
@@ -91,7 +107,6 @@ Example ZeRO-3 Offload Configurations

       {
           "zero_optimization": {
               "stage": 3,
               "offload_optimizer": {
                   "device": "cpu"
               }
@@ -103,14 +118,13 @@ Example ZeRO-3 Offload Configurations
   }

#. Save even MORE memory by offloading to NVMe (if available on your system):

   .. code-block:: python

       {
           "zero_optimization": {
               "stage": 3,
               "offload_optimizer": {
                   "device": "nvme",
                   "nvme_path": "/nvme_data"
@@ -134,6 +148,9 @@ granularity of (sub)module ``forward()`` methods. The backward pass is
handled similarly. This strategy has two underlying assumptions:

#. The forward and backward passes of submodules must individually fit in device memory.
   If this is not the case, :class:`deepspeed.zero.TiledLinear` implements
   **memory-centric tiling** and works with ZeRO-3 to break linear layers
   into a sequence of smaller submodules that can fit in memory.

#. A module's parameters are only accessed within its own ``__init__`` and ``forward()`` methods.
   Otherwise, DeepSpeed must be instructed to collect and re-partition the parameter,
   as sketched just after this list.
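
As a sketch of the second point, a parameter that must be touched outside of
``forward()`` (for example, to re-initialize it) can be gathered and then
re-partitioned with :class:`deepspeed.zero.GatheredParameters`. The model and
attribute names below are illustrative:

.. code-block:: python

    import torch
    import deepspeed

    # Gather the partitioned weight on all ranks; only rank 0 modifies it,
    # and the update is propagated to all ranks when the context exits.
    with deepspeed.zero.GatheredParameters(model.linear.weight, modifier_rank=0):
        if torch.distributed.get_rank() == 0:
            torch.nn.init.zeros_(model.linear.weight)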
@@ -153,6 +170,7 @@ you can simply allocate your model in our context:

       model = MyLargeModel()

.. autoclass:: deepspeed.zero.Init
   :members:
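
As a further sketch, ``deepspeed.zero.Init`` accepts keyword arguments such as
``remote_device`` that control where the partitioned parameters initially live;
``MyLargeModel`` is a placeholder and only the most common arguments are shown:

.. code-block:: python

    import deepspeed

    # Partition parameters as they are created and keep them in CPU memory
    # so that a very large model can be constructed without GPU OOM.
    with deepspeed.zero.Init(remote_device="cpu", pin_memory=True):
        model = MyLargeModel()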
@@ -185,46 +203,56 @@ parameters are accessed outside of the module that created them. To do so, use

Registering External Parameters
===============================

ZeRO-3 will automatically collect and partition the model parameters as they
are needed during the forward and backward passes. However, in some cases a
parameter may be used outside of its module's forward pass. We call these
*external* parameters. ZeRO-3 can coordinate these parameters if they are
registered either automatically or manually.

.. note::
    DeepSpeed version ``0.3.15`` includes automatic external parameter
    discovery and registration to support the most common cases. Parameters
    can still be manually registered if they cannot be automatically
    detected.

DeepSpeed can automatically detect the following external parameter scenarios:

#. Parameter access: consider the following pattern common in language models
   such as GPT:

   .. code-block:: python

       class LanguageModel(torch.nn.Module):
           ...
           def forward(self, inputs):
               embeds = self.embeddings(inputs)
               ...
               logits = compute_logits(output, self.embeddings.weight)
               ...

   The tensor ``embeddings.weight`` is used in both ``embeddings.forward()`` and
   ``compute_logits()``. We call ``embeddings.weight`` an *external* parameter
   because it is used in the training loop outside of its owning module's
   forward pass.

#. Returning a parameter:

   ``CustomLinear`` returns both an output and its own ``bias`` parameter. DeepSpeed
   will detect the external ``bias`` parameter and register it with submodules that
   use ``CustomLinear``.

   .. code-block:: python

       class CustomLinear(torch.nn.Linear):
           def forward(self, *input):
               output = super().forward(*input)
               return output, self.bias

.. note::
    Most models should not need to manually register parameters.
.. autofunction:: deepspeed.zero.register_external_parameter
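
If a parameter cannot be detected automatically, it can be registered manually
in the consuming module's ``__init__``. A sketch, with a hypothetical module
that reuses an embedding weight for its output projection:

.. code-block:: python

    import torch
    import deepspeed

    class Decoder(torch.nn.Module):
        def __init__(self, embeddings):
            super().__init__()
            self.embeddings = embeddings
            # embeddings.weight is used here, outside of embeddings.forward(),
            # so tell ZeRO-3 to also gather it for this module's forward pass.
            deepspeed.zero.register_external_parameter(self, embeddings.weight)

        def forward(self, hidden):
            # Weight tying: project hidden states back onto the vocabulary.
            return hidden @ self.embeddings.weight.t()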
@@ -234,5 +262,16 @@ registered prior to the first forward pass.

Memory-Centric Tiling
---------------------

To reduce the working memory requirements of DL training for large models,
ZeRO-Infinity includes a technique called *memory-centric tiling* that exploits
the data fetch and release pattern of ZeRO-3 to reduce working memory
requirements by breaking down a large operator into smaller tiles that can be
executed sequentially. When combined with ZeRO-3, the parameters and gradients
of each tile can be fetched and released one at a time, reducing the working
memory in proportion to the number of tiles. ZeRO-Infinity can therefore
support operators of arbitrary size without refactoring for model parallelism
to fit them in limited GPU memory.
.. autoclass:: deepspeed.zero.TiledLinear .. autoclass:: deepspeed.zero.TiledLinear
   :members:
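
As a usage sketch, a very large projection layer can be declared tiled so that
ZeRO-3 fetches one tile's parameters at a time; the layer size and split counts
below are illustrative only:

.. code-block:: python

    import deepspeed

    # Break a 32768 x 32768 linear layer into a 4 x 4 grid of tiles so that
    # only a fraction of its parameters must be resident in GPU memory at once.
    proj = deepspeed.zero.TiledLinear(in_features=32768,
                                      out_features=32768,
                                      in_splits=4,
                                      out_splits=4)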