<div align="center">
# DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
</div>
<div align="center">
<img src="./media/hero1.png" width="780px" />
<img src="./media/hero2.png" width="820px" />
</div>
## Introduction
Training large models with long sequences is becoming very important
across the board from generative AI to models for scientific discovery.
On the generative AI side, conversational AI, long document summarization
and video generation require reasoning over long contexts in spatial and
temporal domains. For example, multimodal foundation models, such as those
that process speech, images and waveforms concurrently, require long
context reasoning over high dimensional inputs with extremely large
sequences. Similarly, chapter- and book-level summarization (estimated at
tens and hundreds of thousands of words) is of great importance in
conversational AI and abstractive summarization tasks.
Long sequence length is equally critical for AI for science, opening
doors to a better understanding of structural biology, health care,
climate and weather forecasting, and large molecular simulation. For
instance, by adapting large language models to gene sequences, we can
create language models that learn the evolutionary patterns of genomes
using simple alphabets and extremely long sequences (the human genome has
6.4 billion letters). In health care, a diagnostic predictive model
conditioned on an entire patient care record requires the context of an
extremely long sequence.
Despite the emerging importance of long sequence length for both
generative AI and AI for science, existing large model training systems
and the underlying parallelism technologies (data, tensor, pipeline,
sequence parallelism) are limited in their ability to support efficient
long-sequence training. Two challenges with existing parallelism
approaches come to the fore. First, existing parallelism approaches such
as data, tensor and pipeline parallelism cannot address scaling along the
sequence dimension. Second, existing sequence parallelism approaches are
not effective because of memory-communication inefficiencies.
Furthermore, existing approaches have limited usability, requiring
intrusive and error-prone code refactoring.
In this release, we are proud to introduce *DeepSpeed-Ulysses (or
Ulysses, a very long novel)*, a simple, portable, and effective
methodology for enabling highly efficient and scalable LLM training with
extremely long sequence lengths.
DeepSpeed-Ulysses partitions individual samples along the sequence
dimension among the participating GPUs. Then, right before the attention
computation, it employs an *all-to-all communication* collective on the
partitioned queries, keys and values such that each GPU receives the
full sequence but only for a non-overlapping subset of the attention
heads. This allows the participating GPUs to compute attention for
different attention heads in parallel. Finally, DeepSpeed-Ulysses
employs another all-to-all to gather the results along the attention
heads while re-partitioning along the sequence dimension.
The key properties of DeepSpeed-Ulysses and its implementation released
with this blog are as follows:
* ***4x larger sequence lengths*** than existing systems, while
enabling training with sequences of ***over a million tokens***.
* Communication reduction of ***over 10x*** compared to existing
systems, resulting in throughput improvements of ***up to 2.5x***, and
sustained throughput of over 175 TFlops/GPU (over 54% of hardware peak).
* Fully general and implementation agnostic attention: DeepSpeed
sequence parallelism supports dense as well as sparse
attention, and it works with efficient attention implementations such as
FlashAttention v2.
* Support for massive model training: DeepSpeed sequence parallelism
works together with ZeRO-3 to not only support large sequence lengths
but also massive model sizes.
* Easy-to-use and portable, requiring minimal code changes to the
existing training frameworks.
In subsequent sections, we provide a detailed discussion of the
DeepSpeed-Ulysses core design, a communication complexity analysis, an
experimental evaluation with comparison to existing work, and a highlight
of usability together with a guide on usage.
## Core Design of DeepSpeed-Ulysses
<div align="center">
<img src="./media/image3.png" style="width:6.3479in;height:2.89442in"
alt="A diagram of a computer Description automatically generated" />
*Figure 1: DeepSpeed sequence parallelism (DeepSpeed-Ulysses) design*
</div>
Figure 1 shows the core design of DeepSpeed-Ulysses. As with the familiar
transformer architecture, the design consists of input sequences *N*
partitioned across *P* available devices. Each local *N/P* partition is
projected into queries (Q), keys (K) and values (V) embeddings. Next, the
(QKV) embeddings are gathered into global QKV through highly optimized
all-to-all collectives between the participating compute devices.
Following the all-to-all collective, the attention computation is
performed per head in the form:
$$\text{Output context} = \text{Softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$
After the attention computation, another all-to-all collective transforms
the *output context* tensor of the attention computation back to sequence
(*N/P*) parallel form for the subsequent operators (MLP MatMul, layer
norm, etc.) in the remaining modules of the transformer layer block.
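To make this sequence-to-head repartitioning concrete, below is a minimal sketch, not DeepSpeed's actual implementation, of how the first all-to-all can be expressed with `torch.distributed`; the `[sequence, batch, heads, head_dim]` layout and the function name are illustrative assumptions.
```python
import torch
import torch.distributed as dist

def seq_to_head_all_to_all(x, group):
    """Repartition x from sequence-parallel to head-parallel layout (a sketch).

    Input : [N/P, batch, H, d]   -- local sequence chunk, all H heads
    Output: [N, batch, H/P, d]   -- full sequence, local subset of heads
    """
    P = dist.get_world_size(group=group)
    n_local, batch, heads, dim = x.shape
    # Split the head dimension into P groups so each rank keeps exactly one group.
    x = x.reshape(n_local, batch, P, heads // P, dim)
    x = x.permute(2, 0, 1, 3, 4).contiguous()       # [P, N/P, batch, H/P, d]
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=group)     # exchange chunks across ranks
    # out[i] holds rank i's sequence chunk for this rank's head subset;
    # stacking the chunks in rank order restores the full sequence.
    return out.reshape(P * n_local, batch, heads // P, dim)
```
The second all-to-all after the attention computation is simply the inverse of this transformation, returning the *output context* to the sequence-parallel layout.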
### Significant Communication Volume Reduction
What distinguishes DeepSpeed-Ulysses from other existing long-sequence
approaches is its much smaller aggregate communication volume and overall
better scalability with increasing degree of sequence parallelism, as
demonstrated by the communication volume analysis below:
On modern clusters with intra-node NVSwitch interconnect and inter-node
fat tree IB topology, the communication volume transmitted per link for
an all-to-all for aggregate message of size *M* over *P* GPUs is *M/P*.
For a transformer model with hidden size h, sequence length of N, and
parallelism degree of P, DeepSpeed sequence parallelism performs all-to-all for the QKV
projections with an aggregate message size of *3Nh* before the attention
computation, and another all-to-all for output context projection with a
size *Nh* for each transformer layer. Therefore, DeepSpeed sequence
parallelism incurs an aggregate communication volume per link of
***4Nh/P***, i.e., a communication complexity of ***O(N/P)***. Note that
this communication volume is constant when both N and P are increased
proportionally.
In contrast, existing approaches like Megatron-LM incur a communication
volume that increases linearly with N regardless of P, resulting in a
***communication complexity of O(N)***. For instance, Megatron-LM performs
two *all-gather* operations with a message volume of *Nh* and two
*reduce-scatter* operations with a volume of *Nh* for each transformer
layer. However, the cost of each all-gather and reduce-scatter of size M
remains M when *P \>\> 1*, instead of *M/P*. Therefore, Megatron-LM
sequence parallelism incurs a communication volume per link of ***4Nh***,
which is P times larger than that of DeepSpeed sequence parallelism.
This allows DeepSpeed sequence parallelism to enable training with
extremely long sequences while achieving significantly higher training
efficiency compared to the existing approaches. Our evaluation results
match this analysis.
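As an illustrative restatement of this analysis (not a measurement), the per-link volumes per transformer layer follow directly from the formulas above; the example numbers in the comment are placeholders.
```python
def per_link_comm_volume(N, h, P):
    """Per-link communication volume per transformer layer, in elements (a sketch)."""
    # DeepSpeed-Ulysses: one all-to-all of aggregate size 3Nh (QKV) plus one of Nh
    # (output context); each all-to-all of aggregate size M costs M/P per link.
    ulysses = 4 * N * h / P
    # Megatron-LM sequence parallelism: two all-gathers and two reduce-scatters of
    # size Nh each; each costs roughly M per link when P >> 1.
    megatron = 4 * N * h
    return ulysses, megatron

# Example (placeholder values): N = 256 * 1024 tokens, h = 8192, P = 64 GPUs
# -> DeepSpeed-Ulysses moves P (= 64) times less data per link than Megatron-LM.
```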
### Additional Highlights of DeepSpeed-Ulysses
1) An Attention Agnostic Solution
The DeepSpeed implementation of the distributed attention module is
general enough to support any attention: e.g., self-attention,
cross-attention and causal attention, in both their dense and sparse
forms, as well as various optimized kernels that support long sequences
at the local attention level, such as different versions of FlashAttention.
The generality of DeepSpeed-Ulysses stems from the modular nature of its
core design: an attention-centric sequence parallelism design. Prior to
the attention computation, parallelism is along the sequence dimension
(the N/P partition); during the attention computation, parallelism is
along the head dimension, with full attention per head but fewer heads per
device. The attention computation can therefore be replaced with any type
of attention mechanism, e.g., dense attention or various forms of sparse
attention, as sketched below.
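For illustration, the following is a minimal dense local attention of the kind that could fill this slot; the `[sequence, batch, local_heads, head_dim]` layout is an assumption, and in practice this slot would typically be filled by an optimized kernel such as FlashAttention.
```python
import torch

def dense_local_attention(q, k, v):
    # q, k, v: [seq, batch, local_heads, head_dim] -- after the first all-to-all,
    # each rank sees the full sequence but only its own subset of attention heads.
    s, b, h, d = q.shape
    q = q.permute(1, 2, 0, 3)                                 # [batch, heads, seq, dim]
    k = k.permute(1, 2, 0, 3)
    v = v.permute(1, 2, 0, 3)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5  # scaled dot-product
    probs = torch.softmax(scores, dim=-1)
    out = torch.matmul(probs, v)                              # [batch, heads, seq, dim]
    return out.permute(2, 0, 1, 3)                            # [seq, batch, heads, dim]
```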
2) Training Bigger Models with Longer Sequences through ZeRO-3
Integration
While DeepSpeed sequence parallelism reduces the activation memory when
training with longer sequences, it does not impact the memory consumed
by the model states. Therefore, to support large sequence length
training with large language models, DeepSpeed sequence parallelism is
integrated with ZeRO-3.
[Zero Redundancy Optimizer Stage 3 (ZeRO-3)](https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/) is a memory optimization technique for training large
models. Unlike the classic data parallel training of neural networks
where model states are replicated across data parallel ranks, ZeRO-3
optimizes memory usage by partitioning model states across data parallel
ranks. However, with sequence parallelism, training data can be
considered in both the batch (sample) and sequence dimensions, and the
associated parallel groups can be combined to form a larger group for
ZeRO parallelism.
Therefore, we extend ZeRO-3 partitioning to the combination of data
parallel and sequence parallel ranks. In other words, in DeepSpeed
sequence parallelism, ZeRO partitions model states across both the
sequence and data parallel groups and collects the per-rank partitions
(via allgather) when they are needed. Similarly, gradients are reduced
across both data and sequence parallel ranks for the parameter update.
This allows for huge memory savings in both the sequence and data
dimensions and enables scaling not just to large sequence lengths but
also to large models.
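As a minimal sketch, enabling ZeRO-3 alongside DeepSpeed sequence parallelism uses the usual ZeRO-3 entry in the DeepSpeed configuration; the batch size and precision values below are placeholders, and the sequence parallel degree itself is set up by the training script (e.g., in Megatron-DeepSpeed).
```python
# Minimal DeepSpeed configuration sketch with ZeRO-3 enabled; values are
# illustrative placeholders, not recommended settings.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3   # partition parameters, gradients and optimizer states
    },
    "bf16": {
        "enabled": True
    },
}
```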
## Evaluation
We evaluate DeepSpeed-Ulysses on GPT, a foundation model for many NLP
tasks, on up to 64 A100 GPUs with 40GB of memory. Our evaluation is
four-fold: i) sequence length scalability, ii) throughput for dense
attention and comparison with existing systems, iii) throughput with
sparse attention and comparison with existing systems, and iv) a
convergence study of DeepSpeed sequence parallelism. We discuss and
present evaluations from each of these categories next.
### Sequence Length Scalability
The first set of experiments is strong scaling of sequence length up to
1 million tokens on a 1.2 billion parameter GPT model. Results of this
evaluation are shown in Figure 2. DeepSpeed sequence parallelism allows
the sequence length to increase linearly with the number of GPUs, and it
maintains similar computation throughput across different sequence
lengths at the appropriate GPU counts.
<div align="center">
<img src="./media/fig2Ulysses.png" style="width:5in;height:4in" />
*Figure 2: DeepSpeed sequence parallelism strong scalability evaluation
at different sequence lengths and GPU counts.*
</div>
### Dense Attention Evaluation
Next, we evaluate DeepSpeed sequence parallelism on a 30 billion
parameter dense attention model and benchmark it against Megatron-LM
sequence parallelism on 64 A100 GPUs. The results of these evaluations
are shown in Figure 3.
We compare DeepSpeed sequence parallelism with Megatron-LM for a 30B
model running various sequence lengths. For our evaluation we chose the
sequence parallelism degree and global batch size that produced the best
performance (measured as throughput or TFLOPs) for both DeepSpeed
sequence parallelism and Megatron-LM; we call these the optimal (batch
size, sequence length) configurations. For DeepSpeed sequence
parallelism, we always use a ZeRO parallelism degree of 64.
Figure 3 shows that DeepSpeed sequence parallelism consistently
outperforms Megatron-LM for the sequence lengths that can be run with
both. In addition, DeepSpeed sequence parallelism can run longer
sequences than Megatron-LM. The performance advantages of DeepSpeed
sequence parallelism are twofold: (1) DeepSpeed sequence parallelism in
combination with ZeRO-3 fits more samples than Megatron-LM because of its
memory optimizations, leading to higher throughput; (2) DeepSpeed
sequence parallelism benefits from efficient all-to-all communication
relative to the *all-gather* communication used in Megatron-LM sequence
parallelism.
<div align="center">
<img src="./media/fig3Ulysses.png" style="width:5in;height:4in" />
*Figure 3: Evaluation of DeepSpeed and Megatron-LM sequence parallelism on a
30B parameter model with dense attention.*
</div>
### Sparse Attention Evaluation
Similarly, we evaluate DeepSpeed sequence parallelism on a 30 billion
parameter sparse attention model and benchmark it against Megatron-LM
sequence parallelism. Results of our evaluation are shown in Figure 4. We
observe trends with sparse attention similar to those in the dense
attention experiments. We observe more than 2x the throughput of
Megatron-LM with DeepSpeed sequence parallelism. For memory saving,
DeepSpeed sequence parallelism, leveraging ZeRO-3, scales to 4x longer
sequence lengths than Megatron-LM.
DeepSpeed sequence parallelism outperforms Megatron-LM for the sequence
lengths that can be run with both. In fact, the current DeepSpeed
throughput is bottlenecked by the local sparse attention implementation,
and as a result DeepSpeed throughput decreases as the sequence length
increases. We expect this performance gap between DeepSpeed and
Megatron-LM to widen further for larger sequence lengths as we improve
the performance of the local sparse attention implementation in the future.
<div align="center">
<img src="./media/fig4Ulysses.png" style="width:5in;height:4in" />
*Figure 4: Evaluation of DeepSpeed and Megatron-LM sequence parallelism on a
30B parameter model with block sparse attention.*
</div>
### Convergence Study
Lastly, Figure 5 shows convergence of a 1.3 billion parameter GPT model
at 32K sequence length on 8 A100 GPUs, with the sequence parallelism
degree set to 4 for both DeepSpeed and Megatron-LM sequence parallelism.
For DeepSpeed sequence parallelism, we evaluate convergence with
different ZeRO stages. DeepSpeed sequence parallelism is a purely
system-level optimization technique that enables training of long
sequence Transformer models, and thus there is no (negative) impact on
the quality of the trained models; this assertion is validated through
experiments and is shown in Figure 5.
<div align="center">
<img src="./media/convg.png" width="500px" />
*Figure 5: Convergence evaluation of DeepSpeed sequence parallelism with different
ZeRO memory optimization stages.*
</div>
## DeepSpeed-Ulysses Software Accessibility
DeepSpeed-Ulysses can be easily integrated into your code with just a
few lines of simple code changes. Here is an example of how to enable
it:
```python
from deepspeed.sequence.layer import DistributedAttention
# Replace the original self-attention (attn) with DeepSpeed-Ulysses’s self-attention
dist_attn = DistributedAttention(attn, get_sequence_parallel_group())
```
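Once wrapped, `dist_attn` is called in place of the original attention in the forward pass. The sketch below is illustrative only; the tensor layout is an assumption, so see the DeepSpeed-Ulysses tutorial linked below for the exact expected shapes.
```python
# Illustrative usage sketch: shapes are assumptions, not DeepSpeed API guarantees.
# dist_attn is the DistributedAttention instance created above.
def attention_block(dist_attn, q, k, v):
    # q, k, v each hold this rank's local N/P slice of the sequence,
    # e.g. shaped [seq_local, batch, num_heads, head_dim].
    # DistributedAttention performs the two all-to-alls around the wrapped local
    # attention: sequence-parallel -> head-parallel before the computation, and
    # head-parallel -> sequence-parallel afterwards.
    context = dist_attn(q, k, v)   # output is again sequence-parallel
    return context
```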
Compared to other libraries that support sequence parallelism, such as
Megatron-LM, DeepSpeed-Ulysses does not require model refactoring.
DeepSpeed-Ulysses has been fully integrated and tested with the
Megatron-DeepSpeed code repository. This means that if you are already
using this repository for training large language models, you can
seamlessly benefit from DeepSpeed-Ulysses to train models with massive
sequence length.
## Release: Try DeepSpeed-Ulysses Today
We are excited to release DeepSpeed-Ulysses, accessible through the
DeepSpeed GitHub repository. A detailed tutorial on usage is available on
the [DeepSpeed tutorial page](https://www.deepspeed.ai/tutorials/ds-sequence/).
We welcome contributions and collaboration as we push forward together on
what is possible when a long context window is no longer a limitation.
DeepSpeed-Ulysses is part of the bigger DeepSpeed ecosystem of
large-scale AI training and inference. For more details on all DeepSpeed
technologies and innovations, please visit our [website](https://www.deepspeed.ai/) and follow us
on X, formerly Twitter, ([English](https://twitter.com/MSFTDeepSpeed), [Japanese](https://twitter.com/MSFTDeepSpeedJP)) and [Chinese Zhihu](https://www.zhihu.com/people/deepspeed).
We are open to collaborations with universities, research labs, and
companies. For such requests (and other requests unsuitable for GitHub),
please email us directly at <deepspeed-info@microsoft.com>.