Unverified commit 9dadf38d authored by Arash Ashari, committed by GitHub

Update Sparse Attention Tutorial (#357)

* adding BingSquad e2e test

* updating the draft test; bring final step under try section

* finalizing test for base deepspeed and deepspeed with ZeRO

* applying the comment (thanks Jeff); fixed formatting

* update Sparse Attention Tutorial

* fixed few issues and applied comments for better organization and readability

* updated sparse attention tutorial, making the how-to-use section incremental; applied more comments
Co-authored-by: arashashari <arashashari@ArashMSLaptop.redmond.corp.microsoft.com>
Parent 9e83ef21
@@ -65,6 +65,6 @@ lnav:
- title: "Learning Rate Range Test"
url: /tutorials/lrrt/
- title: "DeepSpeed Sparse Attention"
url: /tutorials/sparse_attention/
url: /tutorials/sparse-attention/
- title: "Contributing"
url: /contributing/
@@ -7,7 +7,7 @@ new_post: true
date: 2020-09-09 01:00:00
---
Attention-based deep learning models such as the transformers are highly effective in capturing relationships between tokens in an input sequence, even across long distances. As a result, they are used with text, image, and sound-based inputs, where the sequence length can be in the thousands of tokens. However, despite the effectiveness of attention modules in capturing long-term dependencies, in practice their application to long-sequence input is limited by the compute and memory requirements of the attention computation, which grow quadratically, `O(n^2)`, with the sequence length n.
Attention-based deep learning models such as the transformers are highly effective in capturing relationships between tokens in an input sequence, even across long distances. As a result, they are used with text, image, and sound-based inputs, where the sequence length can be in the thousands of tokens. However, despite the effectiveness of attention modules in capturing long-term dependencies, in practice their application to long-sequence input is limited by the compute and memory requirements of the attention computation, which grow quadratically, `O(n^2)`, with the sequence length `n`.
To address this limitation, DeepSpeed offers a suite of sparse attention kernels -- an instrumental technology that can reduce the compute and memory requirements of attention computation by orders of magnitude via block-sparse computation. The suite not only alleviates the memory bottleneck of attention calculation, but also performs sparse computation efficiently. Its APIs allow convenient integration with any transformer-based model. Along with providing a wide spectrum of sparsity structures, it has the flexibility of handling any user-defined block-sparse structure. More specifically, sparse attention (SA) can be designed to compute local attention between nearby tokens, or global attention via summary tokens computed with local attention. Moreover, SA can also allow random attention, or any combination of local, global, and random attention, as shown in the following figure with blue, orange, and green blocks, respectively. As a result, SA decreases the memory footprint to `O(wn)`, in which `1 < w < n` is a parameter whose value depends on the attention structure.
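To make the block-sparse idea concrete, here is a small illustrative sketch (not the DeepSpeed API; the function and parameter names are ours) that builds a block-level layout combining local, global, and random attention. The cost of SA is driven by the number of nonzero blocks in such a layout rather than by all `n^2` token pairs.

```python
import torch

# Illustrative only: build a block-level attention layout that mixes
# local, global, and random blocks. DeepSpeed's sparsity configs produce
# layouts of this general form for its block-sparse kernels.
def make_block_sparse_layout(num_blocks, local_window=4, num_global=1,
                             num_random=2, seed=0):
    g = torch.Generator().manual_seed(seed)
    layout = torch.zeros(num_blocks, num_blocks, dtype=torch.int64)

    # Local attention: each block attends to a window of nearby preceding blocks.
    for i in range(num_blocks):
        start = max(0, i - local_window + 1)
        layout[i, start:i + 1] = 1

    # Global attention: the first `num_global` blocks attend everywhere
    # and are attended to by every block.
    layout[:num_global, :] = 1
    layout[:, :num_global] = 1

    # Random attention: each block additionally attends to a few random blocks.
    for i in range(num_blocks):
        rand_cols = torch.randint(0, num_blocks, (num_random,), generator=g)
        layout[i, rand_cols] = 1

    return layout

print(make_block_sparse_layout(num_blocks=8))
```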
@@ -20,7 +20,7 @@ Block-sparse computations handled by DeepSpeed Sparse Attention kernels are illu
![Sparse attention backward pass](/assets/images/sa_backward_pass.png){: .align-center}
To learn more about the sparsity config and how to use this library, please check our [tutorial](https://github.com/microsoft/DeepSpeed-internal/tree/master/docs/_tutorials/sparse_attention.md), which provides detailed information about it.
To learn more about the sparsity config and how to use this library, please check our [tutorial](/tutorials/sparse_attention/), which provides detailed information about it.
## Performance Results
@@ -44,17 +44,10 @@ Related works along the line of sparse attention ([Sparse Transformer](https://a
![Accuracy of long document comprehension application](/assets/images/sa_long_document_comprehension_result.png){: .align-center}
* **flexibility to handle any block-sparse structure**
The DeepSpeed Sparse Attention suite does not target any specific sparse structure; rather, it enables model scientists to explore any block-sparse structure with efficient system support. Currently, we have added popular sparse structures such as:
* [Fixed](https://arxiv.org/pdf/1904.10509.pdf) (from OpenAI Sparse Transformer)
* [BigBird](https://arxiv.org/pdf/2007.14062.pdf) (from Google)
* BSLongformer (Block-Sparse implementation of [Longformer](https://arxiv.org/pdf/2004.05150.pdf) from AI2)
We also define a template for a `variable` structure (top figure), which can be used to easily customize any block-sparse random/local/global attention pattern. In addition to this list, users can add any other sparsity structure as described in the [tutorial](https://github.com/microsoft/DeepSpeed-internal/tree/master/docs/_tutorials/sparse_transformer.md) section.
* **comparison with state of the art, Longformer**
We compared SA with Longformer, a state-of-the-art sparse structure and implementation. In our experiment, SA uses `Fixed` sparsity, and the two implementations have comparable accuracy. On system performance, SA outperforms Longformer both in training and inference:
* 1.47x faster execution pre-training MLM on Wikitext103
* **1.47x** faster execution pre-training MLM on Wikitext103
We ran an experiment following the [notebook](https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb) offered by Longformer. In this experiment, we pre-train an MLM model using the RoBERTa-base checkpoint. This is done on 8 V100-SXM2 GPUs. The following table shows the details of the result, in which using DeepSpeed Sparse Attention gives a 1.47x speedup.
|Model |Local Window Size |BPC |Train Step |Time Per Iteration |Time Improvement |Accuracy improvement |
@@ -70,7 +63,7 @@ We ran an experiment following the [notebook](https://github.com/allenai/longfor
|Sparse Attention | |1.8693 | |1.1372 | | |
* 3.13x faster execution inference on BERT-Base
* **3.13x** faster execution inference on BERT-Base
Through the Long Document Comprehension application described above, we also checked the inference time for different window sizes, testing a BERT model with a sequence length of `2,048` and a batch size of `1`. In this experiment, we noticed up to `3.13x` speedup when replacing BERT attention with DeepSpeed Sparse Attention instead of Longformer attention. The following table shows the complete result.
|Local Window Size |Time Improvement|
@@ -81,3 +74,11 @@ Through our Long Document Comprehension application we described above, we also
|64 |1.5 |
|32 |1.24 |
|16 |1.23 |
* **flexibility to handle any block-sparse structure**
The DeepSpeed Sparse Attention suite does not target any specific sparse structure; rather, it enables model scientists to explore any block-sparse structure with efficient system support. Currently, we have added popular sparse structures such as:
* [Fixed](https://arxiv.org/pdf/1904.10509.pdf) (from OpenAI Sparse Transformer)
* [BigBird](https://arxiv.org/pdf/2007.14062.pdf) (from Google)
* BSLongformer (Block-Sparse implementation of [Longformer](https://arxiv.org/pdf/2004.05150.pdf) from AI2)
We also define a template for a `variable` structure (top figure), which can be used to easily customize any block-sparse random/local/global attention pattern. In addition to this list, users can add any other sparsity structure as described in the [tutorial](https://www.deepspeed.ai/tutorials/sparse_attention/) section.
@@ -2,24 +2,19 @@
title: "DeepSpeed Sparse Attention"
---
In this tutorial, we describe how to use DeepSpeed Sparse Attention and its building-block kernels, either through the DeepSpeed launcher or by integrating the individual kernels into your code.
In this tutorial, we describe how to use DeepSpeed Sparse Attention (SA) and its building-block kernels. The easiest way to use SA is through the DeepSpeed launcher. We will describe this through an example in the [How to use sparse attention with DeepSpeed launcher](/tutorials/sparse-attention/#how-to-use-sparse-attention-with-deepspeed-launcher) section. But before that, we introduce the modules provided by DeepSpeed SA in the [next](/tutorials/sparse-attention/#sparse-attention-modules) section.
**Note:** Currently DeepSpeed Sparse Attention can be used only on NVIDIA V100 GPUs with CUDA 10.1 or 10.2.
**Note:** Currently DeepSpeed Sparse Attention can be used only on NVIDIA V100 GPUs with Torch >= 1.5 and CUDA 10.1 or 10.2.
{: .notice--warning}
## How to use
DeepSpeed Sparse Attention can be used as a feature through DeepSpeed, or simply integrated with any Transformer model as a standalone self-attention module. Further, the building-block kernels, matrix multiplication and softmax, can be used separately. To use sparse attention alone, you can simply install DeepSpeed and import any of the following modules from it; for example:
```python
from deepspeed.ops.sparse_attention import SparseSelfAttention
```
In the following, we describe the Sparse Attention modules:
* **MatMul**: This module handles block-sparse matrix-matrix multiplication. Currently it supports SDD, DSD, and DDS as described in the [DeepSpeed Sparse Attention](https://github.com/microsoft/DeepSpeed-internal/tree/master/docs/_posts/2020-09-09-sparse-attention.md) section.
## Sparse attention modules
* **MatMul**: This module handles block-sparse matrix-matrix multiplication. Currently it supports SDD, DSD, and DDS as described in the [DeepSpeed Sparse Attention](https://www.deepspeed.ai/news/2020/09/08/sparse-attention.html) section.
* **Softmax**: This module applies block sparse softmax. It handles both forward and backward pass.
* **SparseSelfAttention**: This module uses MatMul and Softmax kernels and generates Context Layer output given Query, Keys and Values. It is a simplified version of common operations in any self-attention layer. It can also apply:
* `Relative position embedding`
* `Attention mask`
* `Key padding mask`
on the intermediate attention scores. For more details about SelfAttention, please check [MultiHeadAttention](https://pytorch.org/docs/master/generated/torch.nn.MultiheadAttention.html#multiheadattention).
on the intermediate attention scores. For more details about self-attention, please check [MultiHeadAttention](https://pytorch.org/docs/master/generated/torch.nn.MultiheadAttention.html#multiheadattention).
* **BertSparseSelfAttention**: This module contains a simplified BertSelfAttention layer that can be used instead of original dense Bert Self-Attention layer. Our implementation is based on [DeepSpeedExample](https://github.com/microsoft/DeepSpeedExamples/blob/master/bing_bert/nvidia/modelingpreln.py#L373-#L434).
* **SparseAttentionUtils**: This module provides a few utility functions for adapting a pre-trained model to use sparse attention:
  * `replace_model_self_attention_with_sparse_self_attention`: If you have currently loaded a model and want to replace its self-attention module with sparse self-attention, you can simply use this function to handle it for you. It currently handles BERT- and RoBERTa-based pre-trained models, but you can extend it based on your model type if it is different from these two. You also need to extend the position embedding to handle the new sequence length; this can be done using the `extend_position_embedding` function.
@@ -27,20 +22,132 @@ on the intermediate attention scores. For more details about SelfAttantion, plea
  * `extend_position_embedding`: This function extends the position embedding based on the current values. For example, if you have a model with a max sequence length of 128 and you extend it to a 1k sequence length, it replicates the current embeddings 8 times to initialize the new embeddings. Experimentally, we have seen that such initialization works much better than initializing from scratch and leads to faster convergence.
  * `pad_to_block_size`: This function pads the input tokens and attention mask along the sequence-length dimension to be a multiple of the block size; this is a requirement for SA.
  * `unpad_sequence_output`: This function unpads the sequence output if the inputs of the model were padded.
* **SparsityConfig**: this is an abstract class for sparsity structure. Any sparsity structure extends this class and writes its own `make_layout` function. DeepSpeed currently provides the following structures, which will be described in the next section:
* **SparsityConfig**: this is an abstract class for sparsity structure. Any sparsity structure needs to extend this class and write its own sparsity pattern construction in the `make_layout` function. DeepSpeed currently provides the following structures, which will be described in the [How to config sparsity structures](/tutorials/sparse-attention/#how-to-config-sparsity-structures) section:
* `FixedSparsityConfig`
* `BSLongformerSparsityConfig`
* `BigBirdSparsityConfig`
* `VariableSparsityConfig`
### BertSparseSelfAttention Example
We have currently integrated Sparse Attention with our [bing_bert](https://github.com/microsoft/DeepSpeedExamples/blob/master/bing_bert/nvidia/modelingpreln.py) code, which can be used as an example for integration. In this example, we replace the BertSelfAttention module with BertSparseSelfAttention. Using the DeepSpeed launcher, you can enable sparse attention with the `deepspeed_sparse_attention` argument ([example](https://github.com/microsoft/DeepSpeedExamples/blob/master/bing_bert/ds_sa_train_bert_bsz64k_seq128.sh)) and add your desired sparsity config to the [DeepSpeed config file](https://github.com/microsoft/DeepSpeedExamples/blob/master/bing_bert/deepspeed_bsz64k_lamb_config_seq128.json). In this example, we have used the `fixed` sparsity mode. Further, you need to pad the sequence dimension of `input_ids` and `attention_mask` to be a multiple of the sparse block size. As mentioned above, DeepSpeed provides utility functions for padding and unpadding, and you can check our [example](https://github.com/microsoft/DeepSpeedExamples/blob/master/bing_bert/nvidia/modelingpreln.py) to see where and how to pad and unpad the inputs or outputs of the model.
* `DenseSparsityConfig`
**Note:** Currently DeepSpeed Transformer Kernels do not support Sparse Attention. To use Sparse Attention, you need to disable Transformer Kernels!
{: .notice--warning}
### Sparsity structures
In the following, we describe the supported sparsity structures, their parameter sets, and the flexibility of adding an arbitrary sparsity pattern on the self-attention layer.
## How to use sparse attention with DeepSpeed launcher
In this section we describe how to use DeepSpeed Sparse Attention through our [bing_bert](https://github.com/microsoft/DeepSpeedExamples/blob/master/bing_bert/nvidia/modelingpreln.py) code.
* **Update attention module**: First, you need to update your attention module to use sparse computation. Here, we use [BertSparseSelfAttention](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/ops/sparse_attention/bert_sparse_self_attention.py), which is the sparse version of `BertSelfAttention` from our [bing_bert](https://github.com/microsoft/DeepSpeedExamples/blob/master/bing_bert/nvidia/modelingpreln.py) code. It rewrites `BertSelfAttention`, replacing:
```python
attention_scores = torch.matmul(query_layer, key_layer)
attention_scores = attention_scores / math.sqrt(
    self.attention_head_size)
# Apply the attention mask is (precomputed for all layers in BertModel forward() function)
attention_scores = attention_scores + attention_mask
pdtype = attention_scores.dtype
# Normalize the attention scores to probabilities.
attention_probs = self.softmax(attention_scores)
# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
attention_probs = self.dropout(attention_probs)
context_layer = torch.matmul(attention_probs, value_layer)
```
with:
```python
context_layer = self.sparse_self_attention(
    query_layer,
    key_layer,
    value_layer,
    key_padding_mask=attention_mask)
```
in which `sparse_self_attention` is an instance of [SparseSelfAttention](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/ops/sparse_attention/sparse_self_attention.py). This module computes the attention context through sparse attention, replacing the underlying matrix multiplications and softmax with their equivalent sparse versions. You can update any other attention module similarly.
* **Setup sparse attention config in the model**: You need to set up the sparse attention config. In our example, this is done in `BertModel`.
```python
self.pad_token_id = config.pad_token_id if hasattr(
    config, 'pad_token_id') and config.pad_token_id is not None else 0
# set sparse_attention_config if it has been selected
self.sparse_attention_config = get_sparse_attention_config(
    args, config.num_attention_heads)
self.encoder = BertEncoder(
    config, args, sparse_attention_config=self.sparse_attention_config)
```
* **Update encoder model**: Further, you need to update your encoder model to use SA for the attention layer when SA is enabled. Please check our [bing_bert example](https://github.com/microsoft/DeepSpeedExamples/blob/master/bing_bert/nvidia/modelingpreln.py) in which we use `BertSparseSelfAttention` instead of `BertSelfAttention` when SA is enabled.
```python
if sparse_attention_config is not None:
    from deepspeed.ops.sparse_attention import BertSparseSelfAttention
    layer.attention.self = BertSparseSelfAttention(
        config, sparsity_config=sparse_attention_config)
```
* **Pad and unpad input data**: Also, you may need to pad the sequence dimension of `input_ids` and `attention_mask` to be a multiple of the sparse block size. As mentioned in the [modules](/tutorials/sparse-attention/#sparse-attention-modules) section above, DeepSpeed provides utility functions for padding and unpadding. Please check our [bing_bert example](https://github.com/microsoft/DeepSpeedExamples/blob/master/bing_bert/nvidia/modelingpreln.py) to see where and how to pad and unpad the inputs or outputs of the model.
```python
if self.sparse_attention_config is not None:
    pad_len, input_ids, attention_mask, token_type_ids, position_ids, inputs_embeds = SparseAttentionUtils.pad_to_block_size(
        block_size=self.sparse_attention_config.block,
        input_ids=input_ids,
        attention_mask=extended_attention_mask,
        token_type_ids=token_type_ids,
        position_ids=None,
        inputs_embeds=None,
        pad_token_id=self.pad_token_id,
        model_embeddings=self.embeddings)
.
.
.
# If BertEncoder uses sparse attention, and input_ids were padded, the sequence output needs to be unpadded to the original length
if self.sparse_attention_config is not None and pad_len > 0:
    encoded_layers[-1] = SparseAttentionUtils.unpad_sequence_output(
        pad_len, encoded_layers[-1])
```
* **Enable sparse attention**: To use DeepSpeed Sparse Attention, you need to enable it in the launcher script through the `deepspeed_sparse_attention` argument:
```bash
--deepspeed_sparse_attention
```
Please check [our bing_bert runner script](https://github.com/microsoft/DeepSpeedExamples/blob/master/bing_bert/ds_sa_train_bert_bsz64k_seq128.sh) as an example of how to enable SA with DeepSpeed launcher.
* **Add sparsity config**: The sparsity config can be set through the [DeepSpeed JSON config file](https://github.com/microsoft/DeepSpeedExamples/blob/master/bing_bert/deepspeed_bsz64k_lamb_config_seq128.json). In this example, we have used the `fixed` sparsity mode, which will be described in the [How to config sparsity structures](/tutorials/sparse-attention/#how-to-config-sparsity-structures) section.
```json
"sparse_attention": {
"mode": "fixed",
"block": 16,
"different_layout_per_head": true,
"num_local_blocks": 4,
"num_global_blocks": 1,
"attention": "bidirectional",
"horizontal_global_attention": false,
"num_different_global_patterns": 4
}
```
## How to use individual kernels
DeepSpeed Sparse Attention can be used as a feature through DeepSpeed, as described above, or simply integrated with any Transformer model as a standalone self-attention module. Further, the building-block kernels, matrix multiplication and softmax, can be used separately. To use sparse attention alone, you can simply install DeepSpeed and import any of the modules described in the [modules](/tutorials/sparse-attention/#sparse-attention-modules) section; for example:
```python
from deepspeed.ops.sparse_attention import SparseSelfAttention
```
Please refer to the docstrings for details on how to use each module separately.
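As a rough sketch of standalone usage (assuming a supported GPU and software stack as noted above, and that the `FixedSparsityConfig` constructor mirrors the JSON keys shown earlier; please verify the exact signatures against the docstrings), the pieces fit together roughly as follows:

```python
import torch
from deepspeed.ops.sparse_attention import SparseSelfAttention, FixedSparsityConfig

# Hypothetical shapes for illustration; the block size must divide the sequence length.
batch, heads, seq_len, head_dim = 1, 4, 1024, 64

# Sparsity structure; the arguments mirror the "sparse_attention" JSON config above.
sparsity_config = FixedSparsityConfig(
    num_heads=heads,
    block=16,
    different_layout_per_head=True,
    num_local_blocks=4,
    num_global_blocks=1,
    attention='bidirectional',
    horizontal_global_attention=False,
    num_different_global_patterns=4)

# SparseSelfAttention replaces the dense matmul/softmax/matmul pipeline with
# block-sparse kernels (requires a supported NVIDIA GPU, see the note above).
sparse_attn = SparseSelfAttention(sparsity_config=sparsity_config).cuda()

q = torch.randn(batch, heads, seq_len, head_dim, dtype=torch.half, device='cuda')
k = torch.randn_like(q)
v = torch.randn_like(q)

context = sparse_attn(q, k, v)  # context layer, same shape as q
print(context.shape)
```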
## How to config sparsity structures
In the following, we describe the supported sparsity structures, their parameter sets, and the flexibility of adding an arbitrary sparsity pattern on the self-attention layer. You can update the DeepSpeed config file using any of the supported sparsity structures and set the parameters accordingly.
* **SparsityConfig**:
This module is the parent class for all sparsity structures and contains the features shared by all of them. It takes the following parameters:
  * `num_heads`: an integer determining the number of attention heads of the layer.
@@ -84,5 +191,5 @@ Figure bellow illustrates an example of `variable` sparsity, in which blue, oran
Further, we provide a `dense` pattern (`DenseSparsityConfig`), which can be used for the sake of testing; it represents full attention.
### How to expand block-based sparsity patterns
Our building block kernels, block-based `MatMul` & `Softmax`, can accept any block-based sparsity. This provides the flexibility to apply any block-based sparsity pattern to the attention scores. To define and apply a new sparsity pattern, you can simply follow any of the above sparsity structures. You need to add a new class that extends `SparsityConfig` and defines its own `make_layout` function based on how your sparsity is structured. You can add any extra parameters you may need or just use the default parameters of the parent class.
## How to support new user-defined sparsity structures
Our building block kernels, block-based `MatMul` and `Softmax`, can accept any block-based sparsity. This provides the flexibility to apply any block-based sparsity pattern to the attention scores. To define and apply a new sparsity pattern, you can simply follow any of the above sparsity structures. You need to add a new class that extends `SparsityConfig` and defines its own `make_layout` function based on how your sparsity is structured. You can add any extra parameters you may need or just use the default parameters of the parent class.
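As an illustration, below is a minimal sketch of a hypothetical user-defined structure, a simple block-diagonal pattern. The class and its `band_size` parameter are our own, and the expected `[num_heads, num_blocks, num_blocks]` layout shape should be checked against the `SparsityConfig` docstring.

```python
import torch
from deepspeed.ops.sparse_attention import SparsityConfig

class BlockDiagonalSparsityConfig(SparsityConfig):
    """Hypothetical example: each block attends only to blocks inside
    its own band of `band_size` consecutive blocks."""
    def __init__(self, num_heads, block=16, band_size=4):
        super().__init__(num_heads, block)
        self.band_size = band_size

    def make_layout(self, seq_len):
        # The layout is defined at block granularity: a
        # [num_heads, num_blocks, num_blocks] tensor of 0/1 entries.
        num_blocks = seq_len // self.block
        layout = torch.zeros((self.num_heads, num_blocks, num_blocks),
                             dtype=torch.int64)
        for start in range(0, num_blocks, self.band_size):
            end = min(start + self.band_size, num_blocks)
            layout[:, start:end, start:end] = 1
        return layout
```

Such a config can then be passed to `SparseSelfAttention` (or wired into the DeepSpeed config handling) in the same way as the built-in structures above.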