From f1e4fb0b878e48d6fe33dbb2e0969d791d698e86 Mon Sep 17 00:00:00 2001
From: Satpal Singh Rathore
Date: Sat, 18 Mar 2023 00:00:24 +0530
Subject: [PATCH] Fix Broken Links (#3048)

---
 docs/_pages/training.md                         | 2 +-
 docs/_posts/2020-05-19-bert-record.md           | 2 +-
 docs/_posts/2020-09-08-sparse-attention-news.md | 2 +-
 docs/_posts/2020-09-09-onebit-adam-news.md      | 2 +-
 docs/_tutorials/bert-finetuning.md              | 4 ++--
 docs/_tutorials/bert-pretraining.md             | 4 ++--
 docs/_tutorials/inference-tutorial.md           | 2 +-
 docs/_tutorials/mixture-of-experts.md           | 2 +-
 docs/_tutorials/onebit-adam.md                  | 8 ++++----
 docs/_tutorials/progressive_layer_dropping.md   | 2 +-
 docs/_tutorials/sparse-attention.md             | 2 +-
 docs/_tutorials/transformer_kernel.md           | 4 ++--
 12 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/docs/_pages/training.md b/docs/_pages/training.md
index 466800a3..e5eee865 100644
--- a/docs/_pages/training.md
+++ b/docs/_pages/training.md
@@ -364,7 +364,7 @@ They offer the same convergence as Adam/LAMB, incur up to 26x less communication
 up to 6.6x higher throughput for BERT-Large pretraining and up to 2.7x higher
 throughput for SQuAD fine-tuning on bandwidth-limited clusters. For more details
 on usage and performance, please refer to the [1-bit Adam tutorial](https://www.deepspeed.ai/tutorials/onebit-adam),
-[1-bit Adam blog post](https://www.deepspeed.ai/news/2020/09/09/onebit-adam-blog-post.md),
+[1-bit Adam blog post](https://www.deepspeed.ai/2020/09/08/onebit-adam-blog-post.html),
 [0/1 Adam tutorial](https://www.deepspeed.ai/tutorials/zero-one-adam) and [1-bit LAMB tutorial](https://www.deepspeed.ai/tutorials/onebit-lamb/).
 For technical details, please refer to the [1-bit Adam paper](https://arxiv.org/abs/2102.02888),
 [0/1 Adam paper](https://arxiv.org/abs/2202.06009) and
diff --git a/docs/_posts/2020-05-19-bert-record.md b/docs/_posts/2020-05-19-bert-record.md
index 93d0c9ce..4c2a93e5 100644
--- a/docs/_posts/2020-05-19-bert-record.md
+++ b/docs/_posts/2020-05-19-bert-record.md
@@ -18,6 +18,6 @@ NVIDIA V100 GPUs**, compared with the best published result of 67 minutes on
 the same number and generation of GPUs.

 * Brief overview, see our [press release](https://www.microsoft.com/en-us/research/blog/ZeRO-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/).
-* Detailed technology deep dive, see our [blog post](https://www.deepspeed.ai/news/2020/05/27/fastest-bert-training.html).
+* Detailed technology deep dive, see our [blog post](https://www.deepspeed.ai/2020/05/27/fastest-bert-training.html).
 * Tutorial on how to reproduce our results, see our [BERT pre-training tutorial](https://www.deepspeed.ai/tutorials/bert-pretraining/).
 * The source code for our transformer kernels can be found in the [DeepSpeed repo](https://github.com/microsoft/deepspeed) and BERT pre-training code can be found in the [DeepSpeedExamples repo](https://github.com/microsoft/deepspeedexamples).
diff --git a/docs/_posts/2020-09-08-sparse-attention-news.md b/docs/_posts/2020-09-08-sparse-attention-news.md
index 4c37054a..c5f2a104 100644
--- a/docs/_posts/2020-09-08-sparse-attention-news.md
+++ b/docs/_posts/2020-09-08-sparse-attention-news.md
@@ -9,6 +9,6 @@ toc: false
 DeepSpeed offers sparse attention kernels, an instrumental technology to support long sequences of model inputs, whether for text, image, or sound. Compared with the classic dense Transformers, it powers an order-of-magnitude longer input sequence and obtains up to 6x faster execution with comparable accuracy. It also outperforms state-of-the-art sparse implementations with 1.5-3x faster execution. Furthermore, our sparse kernels support efficient execution of flexible sparse format and empower users to innovate on their custom sparse structures.

 * Brief overview, see our [press release]({{ site.press_release_v3 }}).
-* Detailed technology deep dive, see our [blog post](https://www.deepspeed.ai/news/2020/09/08/sparse-attention.html).
+* Detailed technology deep dive, see our [blog post](https://www.deepspeed.ai/2020/09/08/sparse-attention.html).
 * Tutorial on how to use sparse attention, see our [Sparse attention tutorial](https://www.deepspeed.ai/tutorials/sparse-attention/).
 * The source code for our sparse attention kernels can be found in the [DeepSpeed repo](https://github.com/microsoft/deepspeed) and BERT pre-training code using sparse attention can be found in the [DeepSpeedExamples repo](https://github.com/microsoft/deepspeedexamples).
diff --git a/docs/_posts/2020-09-09-onebit-adam-news.md b/docs/_posts/2020-09-09-onebit-adam-news.md
index 8873f58c..4ec3c3c8 100644
--- a/docs/_posts/2020-09-09-onebit-adam-news.md
+++ b/docs/_posts/2020-09-09-onebit-adam-news.md
@@ -15,6 +15,6 @@ across distributed devices. We introduce a new algorithm - 1-bit Adam - and its
 efficient implementation in DeepSpeed. 1-bit Adam offers the ***same convergence*** as Adam, incurs up to ***5x less communication*** that enables up to ***3.5x higher throughput for BERT-Large pretraining*** and up to ***2.7x higher throughput for SQuAD fine-tuning*** on bandwidth-limited clusters.

 * Brief overview, see our [press release]({{ site.press_release_v3 }}).
-* Detailed technology deep dive, see our [blog post](https://www.deepspeed.ai/news/2020/09/08/onebit-adam-blog-post.html).
+* Detailed technology deep dive, see our [blog post](https://www.deepspeed.ai/2020/09/08/onebit-adam-blog-post.html).
 * Tutorial on how to reproduce our results, see our [1-bit Adam tutorial](/tutorials/onebit-adam/).
 * The source code for 1-bit Adam can be found in the [DeepSpeed repo](https://github.com/microsoft/deepspeed). The implementation of 1-bit Adam is in [onebit_adam.py](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/fp16/onebit_adam.py) and CUDA-Aware communication for 1-bit Adam is in [custom_collectives.py](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/custom_collectives.py). Example codes to try this feature can be found in the [DeepSpeedExamples repo](https://github.com/microsoft/deepspeedexamples) as shown in the [tutorial](/tutorials/onebit-adam/).
diff --git a/docs/_tutorials/bert-finetuning.md b/docs/_tutorials/bert-finetuning.md
index f7ea8226..3014be18 100755
--- a/docs/_tutorials/bert-finetuning.md
+++ b/docs/_tutorials/bert-finetuning.md
@@ -201,7 +201,7 @@ the `--predict_batch_size` should also be 8.

 For further details about the transformer kernel, please see our [usage
 tutorial](/tutorials/transformer_kernel/) and [technical deep
-dive](https://www.deepspeed.ai/news/2020/05/27/fastest-bert-training.html) on
+dive](https://www.deepspeed.ai/2020/05/27/fastest-bert-training.html) on
 the fastest BERT training.


@@ -302,7 +302,7 @@ Table 4. The setting of memory-optimization flags for a range of micro-batch siz

 ### FineTuning model pre-trained with DeepSpeed Transformer Kernels

-Fine-tuning the model pre-trained using DeepSpeed Transformer and the recipe in [DeepSpeed Fast-Bert Training](https://www.deepspeed.ai/news/2020/05/27/fastest-bert-training.html) should yield F1 score of 90.5 and is expected to increase if you let the pre-training longer than suggested in the tutorial.
+Fine-tuning the model pre-trained using DeepSpeed Transformer and the recipe in [DeepSpeed Fast-Bert Training](https://www.deepspeed.ai/2020/05/27/fastest-bert-training.html) should yield F1 score of 90.5 and is expected to increase if you let the pre-training longer than suggested in the tutorial.

 To get these results, we do require some tuning of the dropout settings as described below:

diff --git a/docs/_tutorials/bert-pretraining.md b/docs/_tutorials/bert-pretraining.md
index a0943949..87d4e871 100755
--- a/docs/_tutorials/bert-pretraining.md
+++ b/docs/_tutorials/bert-pretraining.md
@@ -308,7 +308,7 @@ Note:

 For more details about the transformer kernel, please see [DeepSpeed
 Transformer Kernel](/tutorials/transformer_kernel/) and [DeepSpeed Fast-Bert
-Training](https://www.deepspeed.ai/news/2020/05/27/fastest-bert-training.html).
+Training](https://www.deepspeed.ai/2020/05/27/fastest-bert-training.html).

 ### Start Training

@@ -391,4 +391,4 @@ for more details in

 Compared to SOTA, DeepSpeed significantly improves single GPU performance for transformer-based model like BERT. Figure above shows the single GPU throughput of training BertBERT-Large optimized through DeepSpeed, compared with two well-known Pytorch implementations, NVIDIA BERT and HuggingFace BERT. DeepSpeed reaches as high as 64 and 53 teraflops throughputs (corresponding to 272 and 52 samples/second) for sequence lengths of 128 and 512, respectively, exhibiting up to 28% throughput improvements over NVIDIA BERT and up to 62% over HuggingFace BERT. We also support up to 1.8x larger batch size without running out of memory.

-For more details on how we achieve the record breaking BERT training time please check out deep dive into DeepSpeed BERT [Fastest BERT Training](https://www.deepspeed.ai/news/2020/05/18/bert-record.html)
+For more details on how we achieve the record breaking BERT training time please check out deep dive into DeepSpeed BERT [Fastest BERT Training](https://www.deepspeed.ai/2020/05/18/bert-record.html)
diff --git a/docs/_tutorials/inference-tutorial.md b/docs/_tutorials/inference-tutorial.md
index 17666229..411d5e75 100644
--- a/docs/_tutorials/inference-tutorial.md
+++ b/docs/_tutorials/inference-tutorial.md
@@ -132,7 +132,7 @@ Below is an output of the generated text. You can try other prompt and see how

 ## Datatypes and Quantized Models

-DeepSpeed inference supports fp32, fp16 and int8 parameters. The appropriate datatype can be set using dtype in `init_inference`, and DeepSpeed will choose the kernels optimized for that datatype. For quantized int8 models, if the model was quantized using DeepSpeed's quantization approach ([MoQ](https://www.deepspeed.ai/news/2020/05/27/MoQ.html)), the setting by which the quantization is applied needs to be passed to `init_inference`. This setting includes the number of groups used for quantization and whether the MLP part of transformer is quantized with extra grouping. For more information on these parameters, please visit our [quantization tutorial](https://www.deepspeed.ai/tutorials/MoQ-tutorial/).
+DeepSpeed inference supports fp32, fp16 and int8 parameters. The appropriate datatype can be set using dtype in `init_inference`, and DeepSpeed will choose the kernels optimized for that datatype. For quantized int8 models, if the model was quantized using DeepSpeed's quantization approach ([MoQ](https://www.deepspeed.ai/2021/05/04/MoQ.html)), the setting by which the quantization is applied needs to be passed to `init_inference`. This setting includes the number of groups used for quantization and whether the MLP part of transformer is quantized with extra grouping. For more information on these parameters, please visit our [quantization tutorial](https://www.deepspeed.ai/tutorials/MoQ-tutorial/).

 ```python
 import deepspeed
diff --git a/docs/_tutorials/mixture-of-experts.md b/docs/_tutorials/mixture-of-experts.md
index 23d807ab..55859364 100644
--- a/docs/_tutorials/mixture-of-experts.md
+++ b/docs/_tutorials/mixture-of-experts.md
@@ -165,4 +165,4 @@ We have devised a new technique called “Random Token Selection” that greatly

 ## Advanced MoE usage

-We have added an example of applying MoE to NLG models. Please read more in this [newsletter](https://www.deepspeed.ai/news/2021/12/09/deepspeed-moe-nlg.html) and [tutorial](/tutorials/mixture-of-experts-nlg/).
+We have added an example of applying MoE to NLG models. Please read more in this [newsletter](https://www.deepspeed.ai/2021/12/09/deepspeed-moe-nlg.html) and [tutorial](/tutorials/mixture-of-experts-nlg/).
diff --git a/docs/_tutorials/onebit-adam.md b/docs/_tutorials/onebit-adam.md
index 5166869e..a6443901 100644
--- a/docs/_tutorials/onebit-adam.md
+++ b/docs/_tutorials/onebit-adam.md
@@ -16,7 +16,7 @@ This tutorial is updated on 03/04/2021 to reflect the 1-bit Adam v2. Changes inc
 1) The NCCL-based implementation requires PyTorch >= 1.8 (and NCCL >= 2.8.3 when you have 64 or more GPUs). See details below. 2) Although 1-bit Adam is compatible with both FP16 and FP32, currently we only verified the convergence under mixed precision/FP16 training. 3) Currently the MPI-based implementation is not compatible with pipeline parallelism. 4) Frequent checkpoint loading could hurt 1-bit Adam's convergence. See details below.
 {: .notice--warning}

-In this tutorial, we are going to introduce the 1-bit Adam optimizer in DeepSpeed. 1-bit Adam can improve model training speed on communication-constrained clusters, especially for communication-intensive large models by reducing the overall communication volume by up to 5x. Detailed description of the 1-bit Adam algorithm, its implementation in DeepSpeed, and performance evaluation is available from our [blog post](https://www.deepspeed.ai/news/2020/09/08/onebit-adam-blog-post.html). We also have a [paper](https://arxiv.org/abs/2102.02888) which provides the most complete details including algorithm, system implementation, theoretical analysis, and more evaluations.
+In this tutorial, we are going to introduce the 1-bit Adam optimizer in DeepSpeed. 1-bit Adam can improve model training speed on communication-constrained clusters, especially for communication-intensive large models by reducing the overall communication volume by up to 5x. Detailed description of the 1-bit Adam algorithm, its implementation in DeepSpeed, and performance evaluation is available from our [blog post](https://www.deepspeed.ai/2020/09/08/onebit-adam-blog-post.html). We also have a [paper](https://arxiv.org/abs/2102.02888) which provides the most complete details including algorithm, system implementation, theoretical analysis, and more evaluations.

 To illustrate the benefits and usage of 1-bit Adam optimizer in DeepSpeed, we use the following two training tasks as examples:

@@ -77,7 +77,7 @@ mpirun -np [#processes] -ppn [#GPUs on each node] -hostfile [hostfile] [MPI flag

 ### 1.3 1-bit Algorithm

-The detailed description of the 1-bit Algorithm can be seen from our [blog post](https://www.deepspeed.ai/news/2020/09/08/onebit-adam-blog-post.html) and our [paper](https://arxiv.org/abs/2102.02888).
+The detailed description of the 1-bit Algorithm can be seen from our [blog post](https://www.deepspeed.ai/2020/09/08/onebit-adam-blog-post.html) and our [paper](https://arxiv.org/abs/2102.02888).

 ### 1.4 Configuration of 1-bit Adam
 The 1-bit Adam feature can be used by setting the optimizer configuration options as follows. An example json config file is shown below.
@@ -215,7 +215,7 @@ We fixed the learning rate to 3e-5. The table below shows the F1 and the EM scor
 Figure 1: Scalability of 1-bit Adam for SQuAD Finetuning on V100 GPUs with batch size of 3/GPU.
 -->

-Performance results of SQuAD Fine-tuning can be seen from our [blog post](https://www.deepspeed.ai/news/2020/09/08/onebit-adam-blog-post.html) and our [paper](https://arxiv.org/abs/2102.02888).
+Performance results of SQuAD Fine-tuning can be seen from our [blog post](https://www.deepspeed.ai/2020/09/08/onebit-adam-blog-post.html) and our [paper](https://arxiv.org/abs/2102.02888).



@@ -295,4 +295,4 @@ The above file is for BERT-large. For BERT-base training (sequence length 128),

 ### 3.3 Performance Results for BERT Pre-training

-Performance results of BERT Pre-training can be seen from our [blog post](https://www.deepspeed.ai/news/2020/09/08/onebit-adam-blog-post.html) and our [paper](https://arxiv.org/abs/2102.02888).
+Performance results of BERT Pre-training can be seen from our [blog post](https://www.deepspeed.ai/2020/09/08/onebit-adam-blog-post.html) and our [paper](https://arxiv.org/abs/2102.02888).
diff --git a/docs/_tutorials/progressive_layer_dropping.md b/docs/_tutorials/progressive_layer_dropping.md
index b7b868bf..b3da9ba8 100755
--- a/docs/_tutorials/progressive_layer_dropping.md
+++ b/docs/_tutorials/progressive_layer_dropping.md
@@ -95,7 +95,7 @@ Note that the above configuration assumes training on 64 X 32GB V100 GPUs. Each

 Table 1. Pre-training hyperparameters

-**Note:** DeepSpeed now supports PreLayerNorm as the default way for training BERT, because of its ability to avoid vanishing gradient, stabilize optimization, and performance gains, as described in our fastest BERT training [blog post](https://www.deepspeed.ai/news/2020/05/27/fastest-bert-training.html). We therefore support the switchable Transformer block directly on the the BERT with PreLayerNorm. The implementation can be found at "example\bing_bert\nvidia\modelingpreln_layerdrop.py".
+**Note:** DeepSpeed now supports PreLayerNorm as the default way for training BERT, because of its ability to avoid vanishing gradient, stabilize optimization, and performance gains, as described in our fastest BERT training [blog post](https://www.deepspeed.ai/2020/05/27/fastest-bert-training.html). We therefore support the switchable Transformer block directly on the the BERT with PreLayerNorm. The implementation can be found at "example\bing_bert\nvidia\modelingpreln_layerdrop.py".

 ## Fine-tuning with DeepSpeed on GLUE Tasks

diff --git a/docs/_tutorials/sparse-attention.md b/docs/_tutorials/sparse-attention.md
index d28b2d1f..bbcf0c51 100644
--- a/docs/_tutorials/sparse-attention.md
+++ b/docs/_tutorials/sparse-attention.md
@@ -9,7 +9,7 @@ In this tutorial we describe how to use DeepSpeed Sparse Attention (SA) and its
 {: .notice--warning}

 ## Sparse attention modules
-* **MatMul**: This module handles block-sparse matrix-matrix multiplication. Currently it supports SDD, DSD, and DDS as described in [DeepSpeed Sparse Attention](https://www.deepspeed.ai/news/2020/09/08/sparse-attention.html) section.
+* **MatMul**: This module handles block-sparse matrix-matrix multiplication. Currently it supports SDD, DSD, and DDS as described in [DeepSpeed Sparse Attention](https://www.deepspeed.ai/2020/09/08/sparse-attention.html) section.
 * **Softmax**: This module applies block sparse softmax. It handles both forward and backward pass.
 * **SparseSelfAttention**: This module uses MatMul and Softmax kernels and generates Context Layer output given Query, Keys and Values. It is a simplified version of common operations in any self-attention layer. It can also apply:
   * `Relative position embedding`
diff --git a/docs/_tutorials/transformer_kernel.md b/docs/_tutorials/transformer_kernel.md
index 915117fc..959f5b7d 100755
--- a/docs/_tutorials/transformer_kernel.md
+++ b/docs/_tutorials/transformer_kernel.md
@@ -14,7 +14,7 @@ To this end, we have developed a new kernel for transformer networks which inclu
 optimizations specific to these layers, which boost the training throughput on single GPU
 and scales well as we increase the number of GPUs. For more information on the details of
 transformer kernel, please visit our recent blog post on the [fastest BERT
-training](https://www.deepspeed.ai/news/2020/05/27/fastest-bert-training.html).
+training](https://www.deepspeed.ai/2020/05/27/fastest-bert-training.html).

 ## Prerequisites

@@ -96,7 +96,7 @@ By setting the `normalize_invertible` flag, we force the kernel to drop the inpu

 The `attn_dropout_checkpoint` and `gelu_checkpoint` flags refer to the checkpointing approach, in which we drop the inputs to some parts of the transformer layer, attention dropout and Gelu, in order to save an important part of the activation memory. Based on our performance profiling, the performance cost of rematerializing these two are negligible and finally the performance benefit that we gain from running larger batch size compensate for that.

-The following table shows which memory optimization flags need to be turned on when running BERT-Large on NVIDIA V100 GPU with 32GB of memory, considering different micro-batch sizes and sequence lengths. For the two sequence lengths, 128 and 512, used in our experiments, we have seen that larger batch size improves the overall training performance for both. Please see our [blog post](https://www.deepspeed.ai/news/2020/05/27/fastest-bert-training.html) for more information regarding the performance evaluation of these configurations.
+The following table shows which memory optimization flags need to be turned on when running BERT-Large on NVIDIA V100 GPU with 32GB of memory, considering different micro-batch sizes and sequence lengths. For the two sequence lengths, 128 and 512, used in our experiments, we have seen that larger batch size improves the overall training performance for both. Please see our [blog post](https://www.deepspeed.ai/2020/05/27/fastest-bert-training.html) for more information regarding the performance evaluation of these configurations.

 | Micro-batch size |    128 sequence-length    |           512 sequence-length            |
 | :--------------: | :-----------------------: | :--------------------------------------: |
-- 
GitLab