Unverified · Commit 961827be · Authored by: Sam Ade Jacobs · Committed by: GitHub

DS-Ulysses formatting (#4204)

* fix indentation

* fix formatting

---------
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Parent 3e82cb64
[![License Apache 2.0](https://badgen.net/badge/license/apache2.0/blue)](https://github.com/Microsoft/DeepSpeed/blob/master/LICENSE)
[![PyPI version](https://badge.fury.io/py/deepspeed.svg)](https://pypi.org/project/deepspeed/)
[![Downloads](https://static.pepy.tech/badge/deepspeed)](https://pepy.tech/project/deepspeed)
[![Build](https://badgen.net/badge/build/check-status/blue)](#build-pipeline-status)
[![Twitter](https://img.shields.io/twitter/follow/MSFTDeepSpeed)](https://twitter.com/intent/follow?screen_name=MSFTDeepSpeed)
[![Japanese Twitter](https://img.shields.io/badge/%E6%97%A5%E6%9C%AC%E8%AA%9ETwitter-%40MSFTDeepSpeedJP-blue)](https://twitter.com/MSFTDeepSpeedJP)
@@ -149,7 +149,7 @@ match this analysis.
### Additional Highlights of DeepSpeed-Ulysses
-1) An Attention Agnostic Solution
+***An Attention Agnostic Solution***
DeepSpeed implementation of distributed attention module is general
enough to support any attention: e.g., self-attention, cross-attention,
@@ -165,8 +165,7 @@ per head but just with fewer heads, thus attention computation can be
replaced with any type of attention mechanisms, e.g., dense attention
and various forms of sparse attention.
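
Because each rank ends up holding the full sequence for a subset of heads, the local attention call is an ordinary single-device computation and can be swapped freely. The following is a minimal, illustrative sketch of that all-to-all pattern using plain `torch.distributed` collectives; it is not DeepSpeed's actual implementation, and the function names, tensor layouts, and the toy `dense_attention` are assumptions made for this example (run under `torchrun`, with the sequence length and head count divisible by the world size).

```python
# Illustrative sketch (NOT DeepSpeed's implementation) of attention-agnostic
# sequence parallelism: an all-to-all trades a sequence shard of all heads for
# the full sequence of a head shard, any local attention runs unchanged on
# that full sequence, and a second all-to-all restores the sequence sharding.
# Assumes torch.distributed is initialized (e.g. via torchrun) and that both
# the sequence length and the head count are divisible by the world size P.
import torch
import torch.distributed as dist
import torch.nn.functional as F


def seq_to_head_shard(x, group=None):
    """[B, S/P, H, D] on each rank -> [B, S, H/P, D] on each rank."""
    P = dist.get_world_size(group)
    B, s, H, D = x.shape
    # Chunk the head dimension into P groups; chunk p is sent to rank p.
    x = x.reshape(B, s, P, H // P, D).permute(2, 0, 1, 3, 4).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=group)
    # out[p] is rank p's sequence shard of our head group; stitch the sequence.
    return out.permute(1, 0, 2, 3, 4).reshape(B, P * s, H // P, D)


def head_to_seq_shard(x, group=None):
    """[B, S, H/P, D] on each rank -> [B, S/P, H, D] on each rank (inverse)."""
    P = dist.get_world_size(group)
    B, S, h, D = x.shape
    x = x.reshape(B, P, S // P, h, D).permute(1, 0, 2, 3, 4).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=group)
    # out[p] is our sequence shard of rank p's head group; stitch the heads.
    return out.permute(1, 2, 0, 3, 4).reshape(B, S // P, P * h, D)


def dense_attention(q, k, v):
    """A stand-in local attention; any dense or sparse variant works here."""
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))   # [B, h, S, D]
    return F.scaled_dot_product_attention(q, k, v).transpose(1, 2)


def distributed_attention(q, k, v, local_attn=dense_attention, group=None):
    """q, k, v: [B, S/P, H, D] sequence shards; returns the same sharding."""
    q, k, v = (seq_to_head_shard(t, group) for t in (q, k, v))
    out = local_attn(q, k, v)          # full sequence, H/P heads per rank
    return head_to_seq_shard(out, group)
```

Since `local_attn` only ever sees full-sequence tensors with a reduced head count, swapping in a FlashAttention kernel, cross-attention, or a sparse variant changes nothing about the communication above.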
-2) Training Bigger Models with Longer Sequences through ZeRO-3
-Integration
+***Training Bigger Models with Longer Sequences through ZeRO-3 Integration***
While DeepSpeed sequence parallelism reduces the activation memory when
training with longer sequences, it does not impact the memory consumed
......
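
As a rough illustration of how the two techniques compose, here is a hedged sketch of a ZeRO-3 setup: the `zero_optimization` keys are standard DeepSpeed config options, while the tiny `torch.nn.Linear` model, the hyperparameter values, and the batch size are placeholders standing in for a real long-sequence transformer whose attention layers are sharded as sketched above.

```python
# Hedged sketch: ZeRO-3 partitions model states (parameters, gradients,
# optimizer states) across data-parallel ranks, complementing sequence
# parallelism, which only reduces activation memory. Values below are
# placeholders; launch with the `deepspeed` launcher.
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 3,                  # partition params, grads, and optimizer states
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

# Placeholder model; in practice this would be the long-sequence transformer
# whose attention layers use the distributed attention pattern above.
model = torch.nn.Linear(1024, 1024)

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```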