After the attention computation, another all-to-all collective
transforms the *output context* tensor of the attention computation back to
sequence (*N/P*) parallelism for the subsequent operators (MLP MatMul, layer
norm, etc.) in the remaining modules of the transformer layer block.
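For illustration, here is a minimal sketch of this second all-to-all using `torch.distributed`; the tensor names and shapes are assumptions for exposition, not the actual DeepSpeed implementation:

```python
import torch
import torch.distributed as dist

def heads_to_sequence_parallel(ctx: torch.Tensor, seq_group) -> torch.Tensor:
    """Convert attention output from head-parallel back to sequence-parallel.

    Input per rank:  [N, num_heads/P, head_dim]  (full sequence, local heads)
    Output per rank: [N/P, num_heads, head_dim]  (local sequence chunk, all heads)
    """
    P = dist.get_world_size(group=seq_group)
    N, h_local, d = ctx.shape
    # Split the sequence dimension into P chunks; chunk i is sent to rank i.
    send = ctx.reshape(P, N // P, h_local, d).contiguous()
    recv = torch.empty_like(send)
    dist.all_to_all_single(recv, send, group=seq_group)
    # recv[i] holds this rank's N/P sequence chunk with the heads owned by rank i;
    # stacking along the head dimension recovers all heads for the local chunk.
    return recv.permute(1, 0, 2, 3).reshape(N // P, P * h_local, d)
```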
### Significant Communication Volume Reduction
What distinguishes DeepSpeed-Ulysses from other existing
long-sequence approaches is its much smaller aggregate communication
volume and its overall better scalability with increasing degree of sequence
parallelism, as demonstrated by the communication volume analysis below:
On modern clusters with intra-node NVSwitch interconnect and inter-node
fat-tree IB topology, the communication volume transmitted per link for
an all-to-all of an aggregate message of size *M* over *P* GPUs is *M/P*.
For a transformer model with hidden size *h*, sequence length *N*, and
parallelism degree *P*, DeepSpeed sequence parallelism performs an all-to-all for the QKV
projections with an aggregate message size of *3Nh* before the attention
computation, and another all-to-all for the output context projection with a
size of *Nh*, for each transformer layer. Therefore, DeepSpeed sequence
parallelism incurs an aggregate communication volume per link of
***4Nh/P*** (i.e., a communication complexity of ***O(N/P)***). Note that this
communication volume is constant when both *N* and *P* are increased
proportionally.
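As a quick sanity check of this constancy, here is a toy calculation (element counts per layer, not bytes; the concrete values of *N*, *h*, and *P* are arbitrary):

```python
# Per-link all-to-all volume for DeepSpeed-Ulysses: one 3Nh all-to-all (QKV)
# plus one Nh all-to-all (output context) per layer, each costing M/P per link.
def ulysses_volume_per_link(N, h, P):
    return (3 * N * h + N * h) / P   # = 4Nh/P

h = 8192
print(ulysses_volume_per_link(N=262_144, h=h, P=16))  # 536,870,912 elements
print(ulysses_volume_per_link(N=524_288, h=h, P=32))  # same: volume stays constant
```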
In contrast, existing approaches such as Megatron-LM incur a
communication volume that increases linearly with *N* regardless of *P*,
resulting in a ***communication complexity of O(N)***. For instance,
Megatron-LM performs two *all-gather* operations with a message volume of *Nh*
and two *reduce-scatter* operations with a volume of *Nh* for each transformer
layer. However, the cost of each all-gather and reduce-scatter of size *M*
remains *M* when *P \>\> 1*, instead of *M/P*. Therefore, Megatron-LM
sequence parallelism incurs a communication volume per link of ***4Nh***,
which is *P* times larger than that of DeepSpeed sequence parallelism.
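Putting the two analyses side by side (again element counts per layer, with illustrative values):

```python
# Megatron-LM sequence parallelism: two all-gathers and two reduce-scatters of
# Nh each per layer; for P >> 1 each costs ~M per link, i.e. 4Nh independent of P.
N, h, P = 262_144, 8192, 16
megatron_volume = 4 * N * h       # grows linearly with N, regardless of P
ulysses_volume = 4 * N * h / P    # from the analysis above
print(megatron_volume / ulysses_volume)   # -> 16.0, i.e. exactly P times larger
```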
This enables DeepSpeed sequence parallelism to train with
extremely long sequences while achieving significantly higher training
efficiency than existing approaches. Our evaluation results
match this analysis.
### Additional Highlights of DeepSpeed-Ulysses
1) An Attention Agnostic Solution
The DeepSpeed implementation of the distributed attention module is general
enough to support any attention type: e.g., self-attention, cross-attention,
and causal attention, in both their dense and sparse variants, as well as
various optimized kernels that support long sequences at the local attention
level, such as different versions of FlashAttention.
The generality of DeepSpeed-Ulysses stems from the modular
nature of its core design: an attention-centric sequence parallelism
design. Before the attention computation, sequence parallelism applies an *N/P*
partition of the sequence; the attention computation itself is head parallelism,
with full attention per head but fewer heads per device. The attention
computation can therefore be replaced with any type of attention mechanism,
e.g., dense attention and various forms of sparse attention.
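The following sketch illustrates this attention-centric design (assumed class and function names, simplified single-tensor layouts; it mirrors the design rather than reproducing the DeepSpeed module): the wrapper only performs the two all-to-all layout changes, so any local attention callable can be dropped in.

```python
import torch
import torch.distributed as dist

class SeqParallelAttention(torch.nn.Module):
    """Attention-agnostic sequence-parallel wrapper (sketch, not the DeepSpeed API).

    `local_attention` is any callable computing attention over a full sequence
    with a subset of heads, e.g. dense attention, sparse attention, or a
    FlashAttention kernel.
    """
    def __init__(self, local_attention, seq_group):
        super().__init__()
        self.local_attention = local_attention
        self.seq_group = seq_group

    def _seq_to_heads(self, x):
        # [N/P, H, d] per rank -> [N, H/P, d] per rank (full sequence, local heads)
        P = dist.get_world_size(group=self.seq_group)
        n, H, d = x.shape
        send = x.reshape(n, P, H // P, d).transpose(0, 1).contiguous()
        recv = torch.empty_like(send)
        dist.all_to_all_single(recv, send, group=self.seq_group)
        return recv.reshape(P * n, H // P, d)

    def _heads_to_seq(self, x):
        # [N, H/P, d] per rank -> [N/P, H, d] per rank (local sequence, all heads)
        P = dist.get_world_size(group=self.seq_group)
        N, h, d = x.shape
        send = x.reshape(P, N // P, h, d).contiguous()
        recv = torch.empty_like(send)
        dist.all_to_all_single(recv, send, group=self.seq_group)
        return recv.permute(1, 0, 2, 3).reshape(N // P, P * h, d)

    def forward(self, q, k, v):
        q, k, v = (self._seq_to_heads(t) for t in (q, k, v))
        ctx = self.local_attention(q, k, v)   # full attention, fewer heads
        return self._heads_to_seq(ctx)
```

Because the wrapper never looks inside `local_attention`, swapping dense attention for a sparse or FlashAttention kernel is a one-line change at construction time.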
2) Training Bigger Models with Longer Sequences through ZeRO-3
Integration
While DeepSpeed sequence parallelism reduces the activation memory when
training with longer sequences, it does not impact the memory consumed
by the model states. Therefore, to support long sequence
training with large language models, DeepSpeed sequence parallelism is
integrated with ZeRO-3.
[Zero Redundancy Optimizer Stage 3 (ZeRO-3)](https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/) is a memory optimization technique for training large
models. Unlike classic data parallel training of neural networks,
where model states are replicated across data parallel ranks, ZeRO-3
optimizes memory usage by partitioning model states across data parallel
ranks. However, with sequence parallelism, training data can be
considered along both the batch (sample) and sequence dimensions, and the
associated parallel groups can be combined to form a larger group for ZeRO
parallelism.
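A back-of-the-envelope example of why the combined group matters for model-state memory (the 16-bytes-per-parameter figure assumes mixed-precision Adam as in the ZeRO paper; the parallel degrees are arbitrary):

```python
# Approximate model-state memory per GPU under ZeRO-3 when partitioning across
# the combined data-parallel x sequence-parallel group (fp16 params + fp16 grads
# + fp32 optimizer states ~= 16 bytes per parameter).
def model_state_gb_per_gpu(params_in_billions, dp, sp):
    return params_in_billions * 16 / (dp * sp)

print(model_state_gb_per_gpu(7, dp=4, sp=1))   # ~28 GB: data-parallel ranks only
print(model_state_gb_per_gpu(7, dp=4, sp=8))   # ~3.5 GB: combined dp x sp group
```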
Therefore, we extend ZeRO-3 partitioning to the combination of data parallel
and sequence parallel ranks. In other words, in DeepSpeed sequence
parallelism, ZeRO partitions model states across both the sequence and data
parallel groups and collects the per-rank partitions (allgather) when they
are needed. Similarly, gradients are reduced across both data and
sequence parallel ranks for parameter updates. ZeRO thus enables huge memory
savings in both the sequence and data dimensions, and enables
scaling not just to long sequence lengths but also to large models.
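A minimal sketch of constructing such a combined group with `torch.distributed` (the rank layout, with sequence-parallel ranks contiguous inside each data-parallel replica, is an assumption; the real integration lives inside DeepSpeed). If data and sequence parallelism are the only parallel dimensions, the combined group is simply the whole world; the loop form accommodates additional parallel dimensions:

```python
import torch.distributed as dist

def build_zero_group(dp_size: int, sp_size: int):
    """Return the process group over which ZeRO-3 partitions model states:
    the dp_size * sp_size ranks spanning both data and sequence parallelism."""
    world = dist.get_world_size()
    combined = dp_size * sp_size
    assert world % combined == 0
    # Every process must create all groups in the same order.
    groups = [dist.new_group(list(range(start, start + combined)))
              for start in range(0, world, combined)]
    return groups[dist.get_rank() // combined]
```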
## Evaluation
We evaluate DeepSpeed-Ulysses on GPT,
a foundation model for many NLP tasks, on up to 64 A100 GPUs with 40GB of memory each. Our
evaluation is four-fold: i) sequence length scalability, ii)
throughput for dense attention and comparison with existing systems,
iii) throughput with sparse attention and comparison with existing
systems, and iv) convergence study of DeepSpeed sequence parallelism. We discuss
and present evaluations from each of these categories next.
### Sequence Length Scalability
The first set of experiments is strong scaling of the sequence length up to
1 million tokens on a 1.2 billion parameter GPT model. Results of this
evaluation are shown in Figure 2. DeepSpeed sequence parallelism
allows increasing the sequence length linearly with the number of GPUs
while maintaining similar computation throughput across different sequence
lengths.
We welcome contributions and collaboration as we together push forward
on what is possible when long context windows are no longer a limitation.
DeepSpeed-Ulysses is part of the bigger DeepSpeed ecosystem of
large-scale AI training and inference. For more details on all DeepSpeed
technologies and innovations, please visit our [website](https://www.deepspeed.ai/) and follow us
on X, formerly Twitter, ([English](https://twitter.com/MSFTDeepSpeed), [Japanese](https://twitter.com/MSFTDeepSpeedJP)) and [Chinese Zhihu](https://www.zhihu.com/people/deepspeed).
We are open to collaborations with universities, research labs, and
companies. For such requests (and other requests unsuitable for GitHub),
please directly email <deepspeed-info@microsoft.com>.