Unverified commit c7d0b0ca authored by Shaden Smith, committed by GitHub

Image links (#243)

Parent 6622de16
@@ -18,6 +18,6 @@ NVIDIA V100 GPUs**, compared with the best published result of 67 minutes on
the same number and generation of GPUs.
* Brief overview, see our [press release](https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/).
-* Detailed technology deep dive, see our [blog post](https://www.deepspeed.ai/news/2020/05/28/bert-record.html).
+* Detailed technology deep dive, see our [blog post](https://www.deepspeed.ai/news/2020/05/27/fastest-bert-training.html).
* Tutorial on how to reproduce our results, see our [BERT pre-training tutorial](https://www.deepspeed.ai/tutorials/bert-pretraining/).
* The source code for our transformer kernels can be found in the [DeepSpeed repo](https://github.com/microsoft/deepspeed) and BERT pre-training code can be found in the [DeepSpeedExamples repo](https://github.com/microsoft/deepspeedexamples).
@@ -55,9 +55,9 @@ sizes and configurations, since on average an overall batch size used in
practical scenarios ranges from a few hundred to a few thousand.
-![Transformer-Kernel-Throughput-128](../assets/images/transformer_kernel_perf_seq128.PNG)
+![Transformer-Kernel-Throughput-128](/assets/images/transformer_kernel_perf_seq128.PNG){: .align-center}
-![Transformer-Kernel-Throughput-512](../assets/images/transformer_kernel_perf_seq512.PNG)
+![Transformer-Kernel-Throughput-512](/assets/images/transformer_kernel_perf_seq512.PNG) {: .align-center}
Figure 1: Performance evaluation of BERT-Large on a single V100 GPU, comparing
DeepSpeed with NVIDIA and HuggingFace versions of BERT in mixed-sequence length
@@ -102,7 +102,7 @@ approach the GPU peak performance, we employ two lines of optimizations in our
own Transformer kernel implementation: advanced fusion, and invertible
operators.
-![Transformer-PreLN-Arch](../assets/images/transformer_preln_arch.png)
+![Transformer-PreLN-Arch](/assets/images/transformer_preln_arch.png) {: .align-center}
Figure 2: Transformer Layer with Pre-LayerNorm Architecture
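For readers skimming this diff out of context, the following is a minimal PyTorch sketch of the Pre-LayerNorm ordering depicted in Figure 2: LayerNorm is applied *before* the attention and feed-forward sub-blocks, with a residual connection around each. The class and parameter names are illustrative only and do not correspond to DeepSpeed's fused transformer kernel API.

```python
import torch
import torch.nn as nn

class PreLNTransformerLayer(nn.Module):
    """Illustrative Pre-LayerNorm transformer layer (not the fused DeepSpeed kernel)."""

    def __init__(self, hidden_size: int, num_heads: int, ffn_size: int, dropout: float = 0.1):
        super().__init__()
        self.ln_attn = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, dropout=dropout, batch_first=True)
        self.ln_ffn = nn.LayerNorm(hidden_size)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, ffn_size),
            nn.GELU(),
            nn.Linear(ffn_size, hidden_size),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        # Pre-LN: normalize before each sub-block, then add the residual.
        h = self.ln_attn(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + self.dropout(attn_out)
        h = self.ln_ffn(x)
        x = x + self.dropout(self.ffn(h))
        return x
```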
@@ -133,7 +133,7 @@ shared memory, we reduce the cost of uncoalesced access to main memory to
better exploit memory bandwidth, resulting in 3% to 5% performance improvement
in the end-to-end training.
-![QKV-Fusion](../assets/images/qkv_fusion.png)
+![QKV-Fusion](/assets/images/qkv_fusion.png) {: .align-center}
Figure 3: QKV’s GEMM and transform Kernel-Fusion
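As a rough sketch of the idea behind Figure 3, assuming a standard self-attention layout: rather than launching three separate GEMMs for the query, key, and value projections, a single concatenated weight produces all three in one GEMM, after which the result is split and reshaped per head. This is only a Python-level illustration of the fusion concept, not the CUDA kernel itself, and the names below are made up for the example.

```python
import torch
import torch.nn as nn

class FusedQKVProjection(nn.Module):
    """Single GEMM for Q, K, and V instead of three separate ones (concept sketch)."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        # One [hidden, 3*hidden] weight replaces three [hidden, hidden] weights.
        self.qkv = nn.Linear(hidden_size, 3 * hidden_size)

    def forward(self, x):
        # x: [batch, seq, hidden] -> one GEMM producing Q, K, V together.
        b, s, _ = x.shape
        qkv = self.qkv(x)                                    # [b, s, 3*hidden]
        qkv = qkv.view(b, s, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.unbind(dim=2)                          # each [b, s, heads, head_dim]
        # Transpose to [b, heads, s, head_dim] for the attention GEMMs.
        return tuple(t.transpose(1, 2) for t in (q, k, v))
```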
@@ -198,15 +198,15 @@ optimization, we are able to reduce the activation memory of the operator by
half, and the reduced memory allows us to train with larger batch sizes, which
once again improves GPU efficiency.
-![Softmax-torch](../assets/images/softmax_pytorch.gif)
+![Softmax-torch](/assets/images/softmax_pytorch.gif) {: .align-center}
-![Softmax-DS](../assets/images/softmax_deepspeed.gif)
+![Softmax-DS](/assets/images/softmax_deepspeed.gif) {: .align-center}
Figure 4: DeepSpeed invertible SoftMax operation versus Default PyTorch SoftMax operation
-![LayerNorm-DS](../assets/images/layernorm_pytorch.gif)
+![LayerNorm-DS](/assets/images/layernorm_pytorch.gif) {: .align-center}
-![LayerNorm-DS](../assets/images/layernorm_deepspeed.gif)
+![LayerNorm-DS](/assets/images/layernorm_deepspeed.gif) {: .align-center}
Figure 5: DeepSpeed invertible LayerNorm operation versus Default PyTorch LayerNorm operation
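To make the memory saving concrete, here is one way the SoftMax case can be expressed with a custom autograd function, relying on the well-known fact that the SoftMax gradient depends only on the SoftMax output: the backward pass saves the output instead of the input, so the input activation can be freed. This is a schematic Python sketch under that assumption, not DeepSpeed's CUDA implementation, and the class name is invented for the example.

```python
import torch

class MemoryLeanSoftmax(torch.autograd.Function):
    """Softmax that keeps only its *output* for backward (illustrative sketch).

    Because d(softmax)/dx can be written purely in terms of the output y,
    the input activation does not need to be stored, roughly halving the
    activation memory of this operator.
    """

    @staticmethod
    def forward(ctx, x, dim=-1):
        y = torch.softmax(x, dim=dim)
        ctx.save_for_backward(y)   # save the output, not the input
        ctx.dim = dim
        return y

    @staticmethod
    def backward(ctx, grad_y):
        (y,) = ctx.saved_tensors
        # dL/dx = y * (dL/dy - sum_j(dL/dy_j * y_j)), computed from y alone.
        grad_x = y * (grad_y - (grad_y * y).sum(dim=ctx.dim, keepdim=True))
        return grad_x, None  # no gradient for `dim`

x = torch.randn(2, 4, requires_grad=True)
y = MemoryLeanSoftmax.apply(x, -1)
y.sum().backward()  # backward runs using only the saved output
```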
@@ -387,7 +387,7 @@ of increasing the computational granularity as well as reducing communication, a
resulting in better performance. Therefore, with DeepSpeed and ZeRO-2 integration into Megatron,
we elevate the model scale and speed to an entirely new level compared to Megatron alone.
-![DeepSpeed-vs-Megatron](../assets/images/zero-full.png)
+![DeepSpeed-vs-Megatron](/assets/images/zero-full.png)
<p align="center">
<em>Figure 2: ZeRO-2 scales to 170 billion parameters, has up to 10x higher throughput, obtains super linear speedup, and improves usability by avoiding the need for code refactoring for models up to 13 billion parameters.</em>
</p>
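For readers who want to try the ZeRO-2 behavior described above, a minimal configuration sketch follows. Only `zero_optimization.stage: 2` is essential to the point being made; the batch size, optimizer, fp16, and communication settings shown are illustrative values, so consult the DeepSpeed configuration documentation for the options your model actually needs.

```python
import json

# Illustrative values only -- tune batch size, optimizer, and fp16 settings for your model.
ds_config = {
    "train_batch_size": 256,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                    # partition optimizer states and gradients
        "reduce_scatter": True,        # use reduce-scatter for gradient averaging
        "overlap_comm": True,          # overlap communication with backward compute
        "contiguous_gradients": True   # reduce memory fragmentation
    },
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}}
}

# Saved as a JSON file, this would typically be passed to the training script
# through DeepSpeed's --deepspeed_config command-line argument.
with open("ds_zero2_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```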