Unverified commit c7d0b0ca authored by Shaden Smith, committed by GitHub

Image links (#243)

Parent 6622de16
@@ -18,6 +18,6 @@ NVIDIA V100 GPUs**, compared with the best published result of 67 minutes on
the same number and generation of GPUs.
* Brief overview, see our [press release](https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/).
-* Detailed technology deep dive, see our [blog post](https://www.deepspeed.ai/news/2020/05/28/bert-record.html).
+* Detailed technology deep dive, see our [blog post](https://www.deepspeed.ai/news/2020/05/27/fastest-bert-training.html).
* Tutorial on how to reproduce our results, see our [BERT pre-training tutorial](https://www.deepspeed.ai/tutorials/bert-pretraining/).
* The source code for our transformer kernels can be found in the [DeepSpeed repo](https://github.com/microsoft/deepspeed) and BERT pre-training code can be found in the [DeepSpeedExamples repo](https://github.com/microsoft/deepspeedexamples).
@@ -55,9 +55,9 @@ sizes and configurations, since on average an overall batch size used in
practical scenarios ranges from a few hundred to a few thousand.
-![Transformer-Kernel-Throughput-128](../assets/images/transformer_kernel_perf_seq128.PNG)
+![Transformer-Kernel-Throughput-128](/assets/images/transformer_kernel_perf_seq128.PNG){: .align-center}
-![Transformer-Kernel-Throughput-512](../assets/images/transformer_kernel_perf_seq512.PNG)
+![Transformer-Kernel-Throughput-512](/assets/images/transformer_kernel_perf_seq512.PNG) {: .align-center}
Figure 1: Performance evaluation of BERT-Large on a single V100 GPU, comparing
DeepSpeed with NVIDIA and HuggingFace versions of BERT in mixed-sequence length
@@ -102,7 +102,7 @@ approach the GPU peak performance, we employ two lines of optimizations in our
own Transformer kernel implementation: advanced fusion, and invertible
operators.
-![Transformer-PreLN-Arch](../assets/images/transformer_preln_arch.png)
+![Transformer-PreLN-Arch](/assets/images/transformer_preln_arch.png) {: .align-center}
Figure 2: Transformer Layer with Pre-LayerNorm Architecture
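For readers skimming this diff out of context, the following is a minimal PyTorch sketch of the Pre-LayerNorm ordering depicted in Figure 2: LayerNorm is applied *before* the attention and feed-forward sub-blocks, with a residual connection around each. The class and parameter names are illustrative only and do not correspond to DeepSpeed's fused transformer kernel API.

```python
import torch
import torch.nn as nn

class PreLNTransformerLayer(nn.Module):
    """Illustrative Pre-LayerNorm transformer layer (not the fused DeepSpeed kernel)."""

    def __init__(self, hidden_size: int, num_heads: int, ffn_size: int, dropout: float = 0.1):
        super().__init__()
        self.ln_attn = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, dropout=dropout, batch_first=True)
        self.ln_ffn = nn.LayerNorm(hidden_size)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, ffn_size),
            nn.GELU(),
            nn.Linear(ffn_size, hidden_size),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        # Pre-LN: normalize before each sub-block, then add the residual.
        h = self.ln_attn(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + self.dropout(attn_out)
        h = self.ln_ffn(x)
        x = x + self.dropout(self.ffn(h))
        return x
```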
@@ -133,7 +133,7 @@ shared memory, we reduce the cost of uncoalesced access to main memory to
better exploit memory bandwidth, resulting in 3% to 5% performance improvement
in the end-to-end training.
-![QKV-Fusion](../assets/images/qkv_fusion.png)
+![QKV-Fusion](/assets/images/qkv_fusion.png) {: .align-center}
Figure 3: QKV’s GEMM and transform Kernel-Fusion
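As a rough sketch of the idea behind Figure 3, assuming a standard self-attention layout: rather than launching three separate GEMMs for the query, key, and value projections, a single concatenated weight produces all three in one GEMM, after which the result is split and reshaped per head. This is only a Python-level illustration of the fusion concept, not the CUDA kernel itself, and the names below are made up for the example.

```python
import torch
import torch.nn as nn

class FusedQKVProjection(nn.Module):
    """Single GEMM for Q, K, and V instead of three separate ones (concept sketch)."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        # One [hidden, 3*hidden] weight replaces three [hidden, hidden] weights.
        self.qkv = nn.Linear(hidden_size, 3 * hidden_size)

    def forward(self, x):
        # x: [batch, seq, hidden] -> one GEMM producing Q, K, V together.
        b, s, _ = x.shape
        qkv = self.qkv(x)                                    # [b, s, 3*hidden]
        qkv = qkv.view(b, s, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.unbind(dim=2)                          # each [b, s, heads, head_dim]
        # Transpose to [b, heads, s, head_dim] for the attention GEMMs.
        return tuple(t.transpose(1, 2) for t in (q, k, v))
```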
@@ -198,15 +198,15 @@ optimization, we are able to reduce the activation memory of the operator by
half, and the reduced memory allows us to train with larger batch sizes, which
once again improves GPU efficiency.
-![Softmax-torch](../assets/images/softmax_pytorch.gif)
+![Softmax-torch](/assets/images/softmax_pytorch.gif) {: .align-center}
-![Softmax-DS](../assets/images/softmax_deepspeed.gif)
+![Softmax-DS](/assets/images/softmax_deepspeed.gif) {: .align-center}
Figure 4: DeepSpeed invertible SoftMax operation versus Default PyTorch SoftMax operation
-![LayerNorm-DS](../assets/images/layernorm_pytorch.gif)
+![LayerNorm-DS](/assets/images/layernorm_pytorch.gif) {: .align-center}
-![LayerNorm-DS](../assets/images/layernorm_deepspeed.gif)
+![LayerNorm-DS](/assets/images/layernorm_deepspeed.gif) {: .align-center}
Figure 5: DeepSpeed invertible LayerNorm operation versus Default PyTorch LayerNorm operation
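To make the memory saving concrete, here is one way the SoftMax case can be expressed with a custom autograd function, relying on the well-known fact that the SoftMax gradient depends only on the SoftMax output: the backward pass saves the output instead of the input, so the input activation can be freed. This is a schematic Python sketch under that assumption, not DeepSpeed's CUDA implementation, and the class name is invented for the example.

```python
import torch

class MemoryLeanSoftmax(torch.autograd.Function):
    """Softmax that keeps only its *output* for backward (illustrative sketch).

    Because d(softmax)/dx can be written purely in terms of the output y,
    the input activation does not need to be stored, roughly halving the
    activation memory of this operator.
    """

    @staticmethod
    def forward(ctx, x, dim=-1):
        y = torch.softmax(x, dim=dim)
        ctx.save_for_backward(y)   # save the output, not the input
        ctx.dim = dim
        return y

    @staticmethod
    def backward(ctx, grad_y):
        (y,) = ctx.saved_tensors
        # dL/dx = y * (dL/dy - sum_j(dL/dy_j * y_j)), computed from y alone.
        grad_x = y * (grad_y - (grad_y * y).sum(dim=ctx.dim, keepdim=True))
        return grad_x, None  # no gradient for `dim`

x = torch.randn(2, 4, requires_grad=True)
y = MemoryLeanSoftmax.apply(x, -1)
y.sum().backward()  # backward runs using only the saved output
```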
@@ -387,7 +387,7 @@ of increasing the computational granularity as well as reducing communication, a
resulting in better performance. Therefore, with DeepSpeed and ZeRO-2 integration into Megatron,
we elevate the model scale and speed to an entirely new level compared to Megatron alone.
-![DeepSpeed-vs-Megatron](../assets/images/zero-full.png)
+![DeepSpeed-vs-Megatron](/assets/images/zero-full.png)
<p align="center">
<em>Figure 2: ZeRO-2 scales to 170 billion parameters, has up to 10x higher throughput, obtains super linear speedup, and improves usability by avoiding the need for code refactoring for models up to 13 billion parameters.</em>
</p>
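For readers who want to try the ZeRO-2 behavior described above, a minimal configuration sketch follows. Only `zero_optimization.stage: 2` is essential to the point being made; the batch size, optimizer, fp16, and communication settings shown are illustrative values, so consult the DeepSpeed configuration documentation for the options your model actually needs.

```python
import json

# Illustrative values only -- tune batch size, optimizer, and fp16 settings for your model.
ds_config = {
    "train_batch_size": 256,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                    # partition optimizer states and gradients
        "reduce_scatter": True,        # use reduce-scatter for gradient averaging
        "overlap_comm": True,          # overlap communication with backward compute
        "contiguous_gradients": True   # reduce memory fragmentation
    },
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}}
}

# Saved as a JSON file, this would typically be passed to the training script
# through DeepSpeed's --deepspeed_config command-line argument.
with open("ds_zero2_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```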