[docs] add MCR-DL paper to readme/docs (#3066)

b38b3036 · Quentin Anthony · GitHub · f1e4fb0b · b38b3036 · b38b3036

Showing 3 changed files with 4 additions and 2 deletions

README.md +1 -0

csrc/transformer/general_kernels.cu +2 -2

docs/index.md +1 -0

--- a/README.md
+++ b/README.md
@@ -213,6 +213,7 @@ Conduct](https://opensource.microsoft.com/codeofconduct/). For more information
 17. Xiaoxia Wu, Cheng Li, Reza Yazdani Aminabadi, Zhewei Yao, Yuxiong He. (2023) Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases. [arXiv:2301.12017](https://arxiv.org/abs/2301.12017).
 18. Syed Zawad, Cheng Li, Zhewei Yao, Elton Zheng, Yuxiong He, Feng Yan. (2023) DySR: Adaptive Super-Resolution via Algorithm and System Co-design. [ICLR:2023](https://openreview.net/forum?id=Pgtn4l6eKjv).
 19. Sheng Shen, Zhewei Yao, Chunyuan Li, Trevor Darrell, Kurt Keutzer, Yuxiong He. (2023) Scaling Vision-Language Models with Sparse Mixture of Experts. [arXiv:2303.07226](https://arxiv.org/abs/2303.07226).
+20. Quentin Anthony, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He, Aamir Shafi, Mustafa Abduljabbar, Hari Subramoni, Dhabaleswar Panda. (2023) MCR-DL: Mix-and-Match Communication Runtime for Deep Learning [arXiv:2303.08374](https://arxiv.org/abs/2303.08374) and will appear at IPDPS 2023.


 # Videos

--- a/csrc/transformer/general_kernels.cu
+++ b/csrc/transformer/general_kernels.cu
@@ -161,7 +161,7 @@ void launch_fused_add2<float>(float* out,
    int total_count = batch_size * seq_length * hidden_dim / 4;
    dim3 grid_dim = DS_GET_BLOCKS(total_count);  //(batch_size * seq_length);

-    dim3 block_dim = DS_CUDA_NUM_THREADS;  //(hidden_dim / 4);
+    dim3 block_dim = DS_CUDA_NUM_THREADS;        //(hidden_dim / 4);

    fused_add2_kernel<<<grid_dim, block_dim, 0, stream>>>(total_count, out, inp1, inp2);
 }
@@ -178,7 +178,7 @@ void launch_fused_add2<__half>(__half* out,
    int total_count = batch_size * seq_length * hidden_dim / 4;
    dim3 grid_dim = DS_GET_BLOCKS(total_count);  //(batch_size * seq_length);

-    dim3 block_dim = DS_CUDA_NUM_THREADS;  //(hidden_dim / 4);
+    dim3 block_dim = DS_CUDA_NUM_THREADS;        //(hidden_dim / 4);

    fused_add2_kernel<<<grid_dim, block_dim, 0, stream>>>(total_count, out, inp1, inp2);
 }

--- a/docs/index.md
+++ b/docs/index.md
@@ -126,6 +126,7 @@ comments.
 17. Xiaoxia Wu, Cheng Li, Reza Yazdani Aminabadi, Zhewei Yao, Yuxiong He. (2023) Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases. [arXiv:2301.12017](https://arxiv.org/abs/2301.12017).
 18. Syed Zawad, Cheng Li, Zhewei Yao, Elton Zheng, Yuxiong He, Feng Yan. (2023) DySR: Adaptive Super-Resolution via Algorithm and System Co-design. [ICLR:2023](https://openreview.net/forum?id=Pgtn4l6eKjv).
 19. Sheng Shen, Zhewei Yao, Chunyuan Li, Trevor Darrell, Kurt Keutzer, Yuxiong He. (2023) Scaling Vision-Language Models with Sparse Mixture of Experts. [arXiv:2303.07226](https://arxiv.org/abs/2303.07226).
+20. Quentin Anthony, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He, Aamir Shafi, Mustafa Abduljabbar, Hari Subramoni, Dhabaleswar Panda. (2023) MCR-DL: Mix-and-Match Communication Runtime for Deep Learning [arXiv:2303.08374](https://arxiv.org/abs/2303.08374) and will appear at IPDPS 2023.

 # Videos
 1. DeepSpeed KDD 2020 Tutorial