Unverified commit ef869377, authored by Conglong Li, committed by GitHub

DeepSpeed Data Efficiency Library (#2585)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Parent: 2600db54
@@ -12,11 +12,11 @@
## Latest News
<b> DeepSpeed trained the world's most powerful language models ([MT-530B](https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/), [BLOOM](https://huggingface.co/blog/bloom-megatron-deepspeed)); [learn how](https://www.deepspeed.ai/tutorials/large-models-w-deepspeed/).</b>
* [2022/12] [DeepSpeed Data Efficiency: A composable library that makes better use of data, increases training efficiency, and improves model quality](https://www.deepspeed.ai/2022/12/12/data-efficiency.html)
* [2022/11] [Stable Diffusion Image Generation under 1 second w. DeepSpeed MII](https://github.com/microsoft/DeepSpeed-MII/tree/main/examples/benchmark/txt2img)
* [2022/10] [DeepSpeed-MII: instant speedup on 24,000+ open-source DL models with up to 40x cheaper inference](https://www.deepspeed.ai/2022/10/10/mii.html)
* [2022/09] [ZeRO-Inference: Democratizing massive model inference](https://www.deepspeed.ai/2022/09/09/zero-inference.html)
* [2022/07] [Azure and DeepSpeed empower easy-to-use and high-performance model training](https://azure.microsoft.com/en-us/blog/azure-empowers-easytouse-highperformance-and-hyperscale-model-training-using-deepspeed/)
* [2022/07] [DeepSpeed Compression: A composable library for extreme compression](https://www.microsoft.com/en-us/research/blog/deepspeed-compression-a-composable-library-for-extreme-compression-and-zero-cost-quantization/)
---
@@ -196,17 +196,19 @@ Conduct](https://opensource.microsoft.com/codeofconduct/). For more information
1. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. (2019) ZeRO: memory optimizations toward training trillion parameter models. [arXiv:1910.02054](https://arxiv.org/abs/1910.02054) and [In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20)](https://dl.acm.org/doi/10.5555/3433701.3433727).
2. Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. (2020) DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. [In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '20, Tutorial)](https://dl.acm.org/doi/10.1145/3394486.3406703).
3. Minjia Zhang, Yuxiong He. (2020) Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping. [arXiv:2010.13369](https://arxiv.org/abs/2010.13369) and [NeurIPS 2020](https://proceedings.neurips.cc/paper/2020/hash/a1140a3d0df1c81e24ae954d935e8926-Abstract.html).
4. Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, Yuxiong He. (2021) ZeRO-Offload: Democratizing Billion-Scale Model Training. [arXiv:2101.06840](https://arxiv.org/abs/2101.06840) and [USENIX ATC 2021](https://www.usenix.org/conference/atc21/presentation/ren-jie).
5. Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He. (2021) 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed. [arXiv:2102.02888](https://arxiv.org/abs/2102.02888) and [ICML 2021](http://proceedings.mlr.press/v139/tang21a.html).
6. Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He. (2021) ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning. [arXiv:2104.07857](https://arxiv.org/abs/2104.07857) and [SC 2021](https://dl.acm.org/doi/abs/10.1145/3458817.3476205).
7. Conglong Li, Ammar Ahmad Awan, Hanlin Tang, Samyam Rajbhandari, Yuxiong He. (2021) 1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed. [arXiv:2104.06069](https://arxiv.org/abs/2104.06069) and [HiPC 2022](https://hipc.org/advance-program/).
8. Conglong Li, Minjia Zhang, Yuxiong He. (2021) The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models. [arXiv:2108.06084](https://arxiv.org/abs/2108.06084) and [NeurIPS 2022](https://openreview.net/forum?id=JpZ5du_Kdh).
9. Yucheng Lu, Conglong Li, Minjia Zhang, Christopher De Sa, Yuxiong He. (2022) Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam. [arXiv:2202.06009](https://arxiv.org/abs/2202.06009).
10. Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He. (2022) DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale [arXiv:2201.05596](https://arxiv.org/abs/2201.05596) and [ICML 2022](https://proceedings.mlr.press/v162/rajbhandari22a.html).
11. Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, Bryan Catanzaro. (2022) Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model [arXiv:2201.11990](https://arxiv.org/abs/2201.11990).
12. Xiaoxia Wu, Zhewei Yao, Minjia Zhang, Conglong Li, Yuxiong He. (2022) Extreme Compression for Pre-trained Transformers Made Simple and Efficient. [arXiv:2206.01859](https://arxiv.org/abs/2206.01859) and [NeurIPS 2022](https://openreview.net/forum?id=xNeAhc2CNAl).
13. Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, Yuxiong He. (2022) ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers. [arXiv:2206.01861](https://arxiv.org/abs/2206.01861) and [NeurIPS 2022](https://openreview.net/forum?id=f-fVCElZ-G1).
14. Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, Yuxiong He. (2022) DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. [arXiv:2207.00032](https://arxiv.org/abs/2207.00032) and [SC 2022](https://dl.acm.org/doi/abs/10.5555/3571885.3571946).
15. Zhewei Yao, Xiaoxia Wu, Conglong Li, Connor Holmes, Minjia Zhang, Cheng Li, Yuxiong He. (2022) Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers. [arXiv:2211.11586](https://arxiv.org/abs/2211.11586).
16. Conglong Li, Zhewei Yao, Xiaoxia Wu, Minjia Zhang, Yuxiong He. (2022) DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing. [arXiv:2212.03597](https://arxiv.org/abs/2212.03597).
# Videos
@@ -5,7 +5,6 @@ Copyright 2022 The Microsoft DeepSpeed Team
#pragma once
#include "ds_kernel_utils.h"
#include "quantization.h"
#include <cuda.h>
#include <cuda_fp16.h>
@@ -274,3 +273,54 @@ void launch_fuse_transpose_bias_kernel(const T* inp,
void launch_param_update(const float* input, __half* output, int size, cudaStream_t stream);
void launch_param_update_half(const float* input, __half* output, int size, cudaStream_t stream);
void launch_token_sort(int32_t* indices,
int layers,
int batch_size,
int reserved_size,
int original_tokens,
cudaStream_t stream);
template <typename T>
void launch_gather_tokens(T* retained_tokens,
T* activations,
int32_t* gather_indices,
int32_t batch_size,
int32_t sampled_tokens,
int32_t channels,
int32_t read_batch_stride,
int32_t read_seq_stride,
int32_t write_batch_stride,
int32_t write_seq_stride,
cudaStream_t stream);
template <typename T>
void launch_scatter_tokens(T* all_activations,
T* layer_activations,
int32_t* gather_indices,
int32_t batch_size,
int32_t sampled_tokens,
int32_t channels,
int32_t read_batch_stride,
int32_t read_seq_stride,
int32_t write_batch_stride,
int32_t write_seq_stride,
cudaStream_t stream);
template <typename T>
void launch_slice_gpt_mask(T* output_mask,
const T* input_mask,
int batch_size,
int truncated_seq_len,
int orig_seq_len,
cudaStream_t stream);
template <typename T>
void launch_slice_bert_mask(T* output_mask,
const T* input_mask,
const int32_t* retained_indices,
int32_t layers,
int32_t batch_size,
int32_t truncated_seq_len,
int32_t orig_seq_len,
cudaStream_t stream);
/*
Copyright 2022 The Microsoft DeepSpeed Team
*/
#include "custom_cuda_layers.h"
#include "memory_access_utils.h"
namespace cg = cooperative_groups;
namespace td_data {
constexpr int granularity = 16;
}
template <typename T>
__global__ void gather_tokens_impl(T* retained_tokens,
const T* activations,
int32_t* gather_indices,
int32_t sampled_tokens,
int32_t channels,
int32_t read_batch_stride,
int32_t read_seq_stride,
int32_t write_batch_stride,
int32_t write_seq_stride)
{
constexpr int mem_vals_t = td_data::granularity / sizeof(T);
cg::thread_block tb = cg::this_thread_block();
const int gather_idx = gather_indices[tb.group_index().x * sampled_tokens + tb.group_index().y];
const int read_offset = read_batch_stride * tb.group_index().x + read_seq_stride * gather_idx;
const int write_offset =
write_batch_stride * tb.group_index().x + write_seq_stride * tb.group_index().y;
for (int i = tb.thread_index().x * mem_vals_t; i < channels; i += blockDim.x * mem_vals_t) {
T local_data[mem_vals_t];
mem_access::load_global<td_data::granularity>(local_data, activations + read_offset + i);
mem_access::store_global<td_data::granularity>(retained_tokens + write_offset + i,
local_data);
}
}
template <typename T>
void launch_gather_tokens(T* retained_tokens,
T* activations,
int32_t* gather_indices,
int32_t batch_size,
int32_t sampled_tokens,
int32_t channels,
int32_t read_batch_stride,
int32_t read_seq_stride,
int32_t write_batch_stride,
int32_t write_seq_stride,
cudaStream_t stream)
{
constexpr int mem_vals_t = td_data::granularity / sizeof(T);
const int load_steps = (channels + mem_vals_t - 1) / mem_vals_t;
const int threads = (load_steps >= 1024) ? 1024 : load_steps;
dim3 block(threads);
dim3 grid(batch_size, sampled_tokens);
gather_tokens_impl<T><<<grid, block, 0, stream>>>(retained_tokens,
activations,
gather_indices,
sampled_tokens,
channels,
read_batch_stride,
read_seq_stride,
write_batch_stride,
write_seq_stride);
}
template void launch_gather_tokens<float>(float*,
float*,
int32_t*,
int32_t,
int32_t,
int32_t,
int32_t,
int32_t,
int32_t,
int32_t,
cudaStream_t);
template void launch_gather_tokens<__half>(__half*,
__half*,
int32_t*,
int32_t,
int32_t,
int32_t,
int32_t,
int32_t,
int32_t,
int32_t,
cudaStream_t);
template <typename T>
__global__ void scatter_tokens_impl(T* all_activations,
const T* layer_activations,
int32_t* gather_indices,
int32_t retained_tokens,
int32_t channels,
int32_t read_batch_stride,
int32_t read_seq_stride,
int32_t write_batch_stride,
int32_t write_seq_stride)
{
constexpr int mem_vals_t = td_data::granularity / sizeof(T);
cg::thread_block tb = cg::this_thread_block();
const int gather_idx =
gather_indices[tb.group_index().x * retained_tokens + tb.group_index().y];
const int read_offset =
read_batch_stride * tb.group_index().x + read_seq_stride * tb.group_index().y;
const int write_offset =
write_batch_stride * tb.group_index().x + write_seq_stride * gather_idx;
for (int i = tb.thread_index().x * mem_vals_t; i < channels; i += mem_vals_t * blockDim.x) {
T local_data[mem_vals_t];
mem_access::load_global<td_data::granularity>(local_data,
layer_activations + read_offset + i);
mem_access::store_global<td_data::granularity>(all_activations + write_offset + i,
local_data);
}
}
template <typename T>
void launch_scatter_tokens(T* all_activations,
T* layer_activations,
int32_t* gather_indices,
int32_t batch_size,
int32_t sampled_tokens,
int32_t channels,
int32_t read_batch_stride,
int32_t read_seq_stride,
int32_t write_batch_stride,
int32_t write_seq_stride,
cudaStream_t stream)
{
constexpr int mem_vals_t = td_data::granularity / sizeof(T);
const int load_steps = (channels + mem_vals_t - 1) / mem_vals_t;
const int threads = (load_steps >= 1024) ? 1024 : load_steps;
dim3 block(threads);
dim3 grid(batch_size, sampled_tokens);
scatter_tokens_impl<T><<<grid, block, 0, stream>>>(all_activations,
layer_activations,
gather_indices,
sampled_tokens,
channels,
read_batch_stride,
read_seq_stride,
write_batch_stride,
write_seq_stride);
}
template void launch_scatter_tokens<float>(float*,
float*,
int32_t*,
int32_t,
int32_t,
int32_t,
int32_t,
int32_t,
int32_t,
int32_t,
cudaStream_t);
template void launch_scatter_tokens<__half>(__half*,
__half*,
int32_t*,
int32_t,
int32_t,
int32_t,
int32_t,
int32_t,
int32_t,
int32_t,
cudaStream_t);
/*
Copyright 2022 The Microsoft DeepSpeed Team
*/
#include <torch/extension.h>
#include <vector>
#include "custom_cuda_layers.h"
torch::Tensor token_sort_(torch::Tensor& unsorted_token_ids, int64_t original_tokens)
{
const int layers = unsorted_token_ids.size(0);
const int batch_size = unsorted_token_ids.size(1);
const int reserved_tokens = unsorted_token_ids.size(2);
launch_token_sort(unsorted_token_ids.data_ptr<int32_t>(),
layers,
batch_size,
reserved_tokens,
original_tokens,
c10::cuda::getCurrentCUDAStream());
return unsorted_token_ids;
}
torch::Tensor token_gather(torch::Tensor& activations,
torch::Tensor& sorted_indices,
bool batch_first)
{
// Activations may be in either [N, S, C] or [S, N, C] while sorted_indices is
// always in [N, retained]
/*
    TORCH_CHECK(sorted_indices.size(0) == activations.size(0) ||
                    sorted_indices.size(0) == activations.size(1),
                "Unable to match the batch size of the sorted indices to the activation shape.");
    TORCH_CHECK(activations.size(2) % 8 == 0,
                "Channels must be divisible by 8 to align with vectorized loads.");
*/
// bool batch_first = sorted_indices.size(0) == activations.size(0);
const int64_t dim_0 = (batch_first) ? sorted_indices.size(0) : sorted_indices.size(1);
const int64_t dim_1 = (batch_first) ? sorted_indices.size(1) : sorted_indices.size(0);
const int64_t dim_2 = activations.size(2);
auto output = torch::empty({dim_0, dim_1, dim_2}, activations.options());
const int batch_size = sorted_indices.size(0);
const int channels = dim_2;
const int retained_tokens = sorted_indices.size(1);
const int read_batch_stride = (batch_first) ? activations.stride(0) : activations.stride(1);
const int read_seq_stride = (batch_first) ? activations.stride(1) : activations.stride(0);
const int write_batch_stride = (batch_first) ? output.stride(0) : output.stride(1);
const int write_seq_stride = (batch_first) ? output.stride(1) : output.stride(0);
if (activations.options().dtype() == torch::kFloat) {
launch_gather_tokens((float*)output.data_ptr(),
(float*)activations.data_ptr(),
(int32_t*)sorted_indices.data_ptr(),
batch_size,
retained_tokens,
channels,
read_batch_stride,
read_seq_stride,
write_batch_stride,
write_seq_stride,
c10::cuda::getCurrentCUDAStream());
} else {
launch_gather_tokens((__half*)output.data_ptr(),
(__half*)activations.data_ptr(),
(int32_t*)sorted_indices.data_ptr(),
batch_size,
retained_tokens,
channels,
read_batch_stride,
read_seq_stride,
write_batch_stride,
write_seq_stride,
c10::cuda::getCurrentCUDAStream());
}
return output;
}
torch::Tensor token_scatter_(torch::Tensor& all_activations,
torch::Tensor& layer_activations,
torch::Tensor& sorted_indices,
bool batch_first)
{
// Activations may be in either [N, S, C] or [S, N, C] while sorted_indices is
// always in [N, retained]
/*
    TORCH_CHECK(sorted_indices.size(0) == all_activations.size(0) ||
                    sorted_indices.size(0) == all_activations.size(1),
                "Unable to match the batch size of the sorted indices to the activation shape.");
    TORCH_CHECK(all_activations.size(2) % 8 == 0,
                "Channels must be divisible by 8 to align with vectorized loads.");
*/
// bool batch_first = sorted_indices.size(0) == all_activations.size(0);
const int batch_size = sorted_indices.size(0);
const int channels = all_activations.size(2);
const int retained_tokens = sorted_indices.size(1);
const int read_batch_stride = (batch_first) ? layer_activations.stride(0)
: layer_activations.stride(1);
const int read_seq_stride = (batch_first) ? layer_activations.stride(1)
: layer_activations.stride(0);
const int write_batch_stride = (batch_first) ? all_activations.stride(0)
: all_activations.stride(1);
const int write_seq_stride = (batch_first) ? all_activations.stride(1)
: all_activations.stride(0);
if (all_activations.options().dtype() == torch::kFloat) {
launch_scatter_tokens((float*)all_activations.data_ptr(),
(float*)layer_activations.data_ptr(),
(int32_t*)sorted_indices.data_ptr(),
batch_size,
retained_tokens,
channels,
read_batch_stride,
read_seq_stride,
write_batch_stride,
write_seq_stride,
c10::cuda::getCurrentCUDAStream());
} else {
launch_scatter_tokens((__half*)all_activations.data_ptr(),
(__half*)layer_activations.data_ptr(),
(int32_t*)sorted_indices.data_ptr(),
batch_size,
retained_tokens,
channels,
read_batch_stride,
read_seq_stride,
write_batch_stride,
write_seq_stride,
c10::cuda::getCurrentCUDAStream());
}
return all_activations;
}
torch::Tensor mask_gather_bert(torch::Tensor& dense_mask, torch::Tensor& sorted_indices)
{
// TORCH_CHECK(dense_mask.dim() == 4)
const int batch_size = dense_mask.size(0);
const int layers = sorted_indices.size(0);
/*
TORCH_CHECK(layers * batch_size == sorted_indices.size(0),
"Mismatch between the indices and the mask");
*/
const int orig_seq_len = dense_mask.size(3);
const int truncated_seq_len = sorted_indices.size(2);
auto output = torch::empty({layers, batch_size, 1, truncated_seq_len, truncated_seq_len},
dense_mask.options());
if (dense_mask.options().dtype() == torch::kFloat) {
launch_slice_bert_mask((float*)output.data_ptr(),
(const float*)dense_mask.data_ptr(),
(const int32_t*)sorted_indices.data_ptr(),
layers,
batch_size,
truncated_seq_len,
orig_seq_len,
c10::cuda::getCurrentCUDAStream());
} else {
launch_slice_bert_mask((__half*)output.data_ptr(),
(const __half*)dense_mask.data_ptr(),
(const int32_t*)sorted_indices.data_ptr(),
layers,
batch_size,
truncated_seq_len,
orig_seq_len,
c10::cuda::getCurrentCUDAStream());
}
return output;
}
torch::Tensor mask_gather_gpt(torch::Tensor dense_mask, int truncated_seq_len)
{
// TORCH_CHECK(dense_mask.dim() == 4)
const int batch_size = dense_mask.size(0);
const int orig_seq_len = dense_mask.size(3);
auto output =
torch::empty({batch_size, 1, truncated_seq_len, truncated_seq_len}, dense_mask.options());
if (dense_mask.options().dtype() == torch::kFloat) {
launch_slice_gpt_mask((float*)output.data_ptr(),
(const float*)dense_mask.data_ptr(),
batch_size,
truncated_seq_len,
orig_seq_len,
c10::cuda::getCurrentCUDAStream());
} else {
launch_slice_gpt_mask((__half*)output.data_ptr(),
(const __half*)dense_mask.data_ptr(),
batch_size,
truncated_seq_len,
orig_seq_len,
c10::cuda::getCurrentCUDAStream());
}
return output;
}
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m)
{
m.def("token_sort_", &token_sort_, "Comparison free sorting algorithm (CUDA)");
m.def("token_gather", &token_gather, "Parallel gather of tokens (CUDA)");
m.def("token_scatter_", &token_scatter_, "Parallel scatter of tokens (CUDA)");
m.def("mask_gather_bert", &mask_gather_bert, "Token-based mask gather for BERT masking (CUDA)");
m.def("mask_gather_gpt", &mask_gather_gpt, "Token-based mask gather for GPT masking (CUDA)");
}
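For orientation, here is a minimal sketch of driving these bindings from Python. It assumes a CUDA device and that the extension has been built and loaded through the RandomLTDBuilder used by the Python sources later in this PR; the tensor sizes are illustrative.

```python
import torch
from deepspeed.ops.op_builder import RandomLTDBuilder  # builder path assumed from the Python code in this PR

random_ltd_module = RandomLTDBuilder().load()

layers, batch, seq, reserved, channels = 2, 4, 128, 64, 1024
activations = torch.randn(batch, seq, channels, dtype=torch.half, device='cuda')  # [N, S, C]

# Sample `reserved` distinct token positions per (layer, sample) and sort them in place.
indices = torch.multinomial(torch.ones(layers * batch, seq, device='cuda'), reserved)
indices = random_ltd_module.token_sort_(indices.reshape(layers, batch, reserved).to(torch.int32), seq)

# Gather the retained tokens for layer 0, then scatter them back into a full-length copy.
retained = random_ltd_module.token_gather(activations, indices[0], True)                     # [N, reserved, C]
restored = random_ltd_module.token_scatter_(activations.clone(), retained, indices[0], True)
```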
/*
Copyright 2022 The Microsoft DeepSpeed Team
*/
#include "custom_cuda_layers.h"
#include "memory_access_utils.h"
namespace cg = cooperative_groups;
template <typename T>
__global__ void slice_gpt_mask_impl(T* output_mask,
const T* input_mask,
int truncated_seq_len,
int orig_seq_len)
{
const int in_batch_stride = orig_seq_len * orig_seq_len;
const int out_batch_stride = truncated_seq_len * truncated_seq_len;
cg::thread_block tb = cg::this_thread_block();
const T* input_mask_block =
input_mask + blockIdx.x * in_batch_stride + blockIdx.y * orig_seq_len;
T* output_mask_block =
output_mask + blockIdx.x * out_batch_stride + blockIdx.y * truncated_seq_len;
for (int i = tb.thread_index().x; i < truncated_seq_len; i += blockDim.x) {
output_mask_block[i] = input_mask_block[i];
}
}
template <typename T>
void launch_slice_gpt_mask(T* output_mask,
const T* input_mask,
int batch_size,
int truncated_seq_len,
int orig_seq_len,
cudaStream_t stream)
{
const int threads = (truncated_seq_len >= 1024) ? 1024 : truncated_seq_len;
dim3 block(threads);
dim3 grid(batch_size, truncated_seq_len);
slice_gpt_mask_impl<T>
<<<grid, block, 0, stream>>>(output_mask, input_mask, truncated_seq_len, orig_seq_len);
}
template void launch_slice_gpt_mask<float>(float*, const float*, int, int, int, cudaStream_t);
template void launch_slice_gpt_mask<__half>(__half*, const __half*, int, int, int, cudaStream_t);
template <typename T>
__global__ void slice_bert_mask_impl(T* output_mask,
const T* input_mask,
const int32_t* retained_indices,
int32_t truncated_seq_len,
int32_t orig_seq_len)
{
const int in_batch_stride = orig_seq_len * orig_seq_len;
const int out_batch_stride = truncated_seq_len * truncated_seq_len;
const int out_layer_stride = out_batch_stride * gridDim.y;
cg::thread_block tb = cg::this_thread_block();
const int out_layer_offset = tb.group_index().x * out_layer_stride;
const int in_batch_offset = tb.group_index().y * in_batch_stride;
const int out_batch_offset = tb.group_index().y * out_batch_stride;
const int32_t gather_row =
retained_indices[tb.group_index().y * truncated_seq_len + tb.group_index().z];
const int in_seq_offset = gather_row * orig_seq_len;
const int out_seq_offset = tb.group_index().z * truncated_seq_len;
const T* in_sequence = input_mask + in_batch_offset + in_seq_offset;
T* out_sequence = output_mask + out_layer_offset + out_batch_offset + out_seq_offset;
const int32_t* gather_data = retained_indices + tb.group_index().y * truncated_seq_len;
for (int i = tb.thread_index().x; i < truncated_seq_len; i += blockDim.x) {
out_sequence[i] = in_sequence[gather_data[i]];
}
}
/*
Since the Bert mask is not causal like GPT, we can't just generate a set of
masks for the entire model based off a single layer sample.
We map the kernel as follows:
z-dimension: layer
y-dimension: batch
x-dimension: sequence_offset
*/
template <typename T>
void launch_slice_bert_mask(T* output_mask,
const T* input_mask,
const int32_t* retained_indices,
int32_t layers,
int32_t batch_size,
int32_t truncated_seq_len,
int32_t orig_seq_len,
cudaStream_t stream)
{
const int threads = (truncated_seq_len >= 1024) ? 1024 : truncated_seq_len;
dim3 block(threads);
dim3 grid(layers, batch_size, truncated_seq_len);
slice_bert_mask_impl<T><<<grid, block, 0, stream>>>(
output_mask, input_mask, retained_indices, truncated_seq_len, orig_seq_len);
}
template void launch_slice_bert_mask<float>(float*,
const float*,
const int32_t*,
int32_t,
int32_t,
int32_t,
int32_t,
cudaStream_t);
template void launch_slice_bert_mask<__half>(__half*,
const __half*,
const int32_t*,
int32_t,
int32_t,
int32_t,
int32_t,
cudaStream_t);
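Following the layer/batch/sequence grid mapping described above, a hedged sketch of exercising the corresponding mask_gather_bert binding from Python (CUDA device and built op assumed; shapes illustrative): the dense [batch, 1, seq, seq] mask is sliced along both sequence axes with each layer's retained indices.

```python
import torch
from deepspeed.ops.op_builder import RandomLTDBuilder  # builder path assumed

ltd = RandomLTDBuilder().load()
layers, batch, seq, kept = 12, 4, 512, 64
dense_mask = torch.ones(batch, 1, seq, seq, dtype=torch.half, device='cuda')
retained = torch.multinomial(torch.ones(layers * batch, seq, device='cuda'), kept)
retained = ltd.token_sort_(retained.reshape(layers, batch, kept).to(torch.int32), seq)

sliced = ltd.mask_gather_bert(dense_mask, retained)  # one [batch, 1, kept, kept] mask per layer
assert sliced.shape == (layers, batch, 1, kept, kept)
```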
/*
Copyright 2022 The Microsoft DeepSpeed Team
*/
#include <cassert>
#include "custom_cuda_layers.h"
#include "memory_access_utils.h"
namespace cg = cooperative_groups;
namespace td_sort {
constexpr int threads = 512;
constexpr int granularity = 16;
constexpr int mem_vals = granularity / sizeof(int32_t);
constexpr int max_buffer_size = (threads + 1) * mem_vals;
#ifdef __HIP_PLATFORM_HCC__
constexpr int warp_size = 64;
#else
constexpr int warp_size = 32;
#endif
constexpr int max_warps = threads / warp_size;
} // namespace td_sort
template <int VALS_PER_THREAD>
__global__ void scan_sort(int32_t* data, int reserved_tokens, int original_tokens)
{
cg::thread_block tb = cg::this_thread_block();
cg::thread_block_tile<td_sort::warp_size> warp = cg::tiled_partition<td_sort::warp_size>(tb);
__shared__ int32_t indices_buffer[td_sort::max_buffer_size];
__shared__ int32_t intermediate_buffer[td_sort::max_warps];
__shared__ int32_t sorted_indices_buffer[td_sort::max_buffer_size];
for (int i = tb.thread_index().x * td_sort::mem_vals; i < original_tokens + 1;
i += tb.group_dim().x * td_sort::mem_vals) {
uint32_t zeros[td_sort::mem_vals] = {0, 0, 0, 0};
mem_access::store_shared<td_sort::granularity>(indices_buffer + i, zeros);
}
int32_t local_vals[VALS_PER_THREAD];
// We flatten layers/batch into a single indexing dimension
int32_t* data_block = data + tb.group_index().x * reserved_tokens;
// The next two loops really could be fused for a more logical code layout, but don't want to
// move the barrier forward
#pragma unroll
for (int i = 0; i < VALS_PER_THREAD; i++) {
const int iter_idx = i * td_sort::threads + tb.thread_index().x;
if (iter_idx < reserved_tokens) {
mem_access::load_global<sizeof(int32_t)>(local_vals + i, data_block + iter_idx);
} else {
local_vals[i] = 0;
}
}
tb.sync();
#pragma unroll
for (int i = 0; i < VALS_PER_THREAD; i++) {
const int iter_idx = i * td_sort::threads + tb.thread_index().x;
if (iter_idx < reserved_tokens) {
const int32_t one = 1;
mem_access::store_shared<sizeof(int32_t)>(indices_buffer + local_vals[i], &one);
}
}
tb.sync();
int32_t local_input[td_sort::mem_vals];
mem_access::load_shared<td_sort::granularity>(
local_input, indices_buffer + tb.thread_index().x * td_sort::mem_vals);
int32_t reduce_vals[td_sort::mem_vals];
reduce_vals[0] = local_input[0];
#pragma unroll
for (int i = 1; i < td_sort::mem_vals; i++) {
reduce_vals[i] = local_input[i] + reduce_vals[i - 1];
}
int32_t step_1_val = reduce_vals[td_sort::mem_vals - 1];
// Short span exclusive scan algorithm (less work efficient)
#pragma unroll
for (int i = 1; i < td_sort::warp_size; i *= 2) {
int32_t step_val = warp.shfl_up(step_1_val, i);
step_1_val = (warp.thread_rank() < i) ? step_1_val : step_1_val + step_val;
}
if (warp.thread_rank() == td_sort::warp_size - 1) {
mem_access::store_shared<sizeof(int32_t)>(intermediate_buffer + warp.meta_group_rank(),
&step_1_val);
}
tb.sync();
if (warp.meta_group_rank() == 0) {
int32_t step_2_val = 0;
if (warp.thread_rank() < td_sort::max_warps) {
mem_access::load_shared<sizeof(int32_t)>(&step_2_val,
intermediate_buffer + warp.thread_rank());
}
#pragma unroll
for (int i = 1; i < td_sort::warp_size; i *= 2) {
int32_t step_val = warp.shfl_up(step_2_val, i);
step_2_val = (warp.thread_rank() < i) ? step_2_val : step_2_val + step_val;
}
if (warp.thread_rank() < td_sort::max_warps) {
mem_access::store_shared<sizeof(int32_t)>(intermediate_buffer + warp.thread_rank(),
&step_2_val);
}
}
tb.sync();
int step_2_val = 0;
if (warp.meta_group_rank() > 0) {
mem_access::load_shared<sizeof(int32_t)>(&step_2_val,
intermediate_buffer + warp.meta_group_rank() - 1);
}
const int thread_offset = reduce_vals[td_sort::mem_vals - 1];
#pragma unroll
for (int i = 0; i < td_sort::mem_vals; i++) {
reduce_vals[i] += step_1_val + step_2_val - thread_offset;
}
mem_access::store_shared<td_sort::granularity>(
indices_buffer + tb.thread_index().x * td_sort::mem_vals, reduce_vals);
if (tb.thread_index().x == 0) {
indices_buffer[original_tokens] = original_tokens - indices_buffer[original_tokens];
}
tb.sync();
for (int i = 0; i < VALS_PER_THREAD; i++) {
const int iter_idx = i * td_sort::threads + tb.thread_index().x;
if (iter_idx < reserved_tokens) {
if (local_vals[i] == 0) {
int zero = 0;
mem_access::store_shared<sizeof(int32_t)>(sorted_indices_buffer, &zero);
} else {
int sorted_idx;
mem_access::load_shared<sizeof(int32_t)>(&sorted_idx,
indices_buffer + local_vals[i] - 1);
mem_access::store_shared<sizeof(int32_t)>(sorted_indices_buffer + sorted_idx,
local_vals + i);
}
}
}
tb.sync();
#pragma unroll
for (int i = 0; i < VALS_PER_THREAD; i++) {
const int iter_idx = i * td_sort::threads + tb.thread_index().x;
if (iter_idx < reserved_tokens) {
int32_t store_val;
mem_access::load_shared<sizeof(int32_t)>(&store_val, sorted_indices_buffer + iter_idx);
mem_access::store_global<sizeof(int32_t)>(data_block + iter_idx, &store_val);
}
}
}
void launch_token_sort(int32_t* indices,
int layers,
int batch_size,
int reserved_size,
int original_tokens,
cudaStream_t stream)
{
// Each sort is completely independent, can flatten this dimension
dim3 grid(layers * batch_size);
dim3 block(td_sort::threads);
const int vals_per_thread = (reserved_size + td_sort::threads - 1) / td_sort::threads;
if (vals_per_thread == 1) {
scan_sort<1><<<grid, block, 0, stream>>>(indices, reserved_size, original_tokens);
} else if (vals_per_thread == 2) {
scan_sort<2><<<grid, block, 0, stream>>>(indices, reserved_size, original_tokens);
} else if (vals_per_thread == 3) {
scan_sort<3><<<grid, block, 0, stream>>>(indices, reserved_size, original_tokens);
} else if (vals_per_thread == 4) {
scan_sort<4><<<grid, block, 0, stream>>>(indices, reserved_size, original_tokens);
} else {
assert(false);
}
}
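To make the intent of scan_sort explicit: each (layer, batch) row holds distinct token ids sampled from [0, original_tokens), and the kernel orders them ascending by setting presence flags and prefix-summing them, i.e. a comparison-free counting sort. A rough CPU reference under that assumption (hypothetical helper, not part of this PR):

```python
import numpy as np

def scan_sort_reference(indices: np.ndarray, original_tokens: int) -> np.ndarray:
    """CPU sketch of scan_sort: counting-sort each row of distinct token ids."""
    rows = indices.reshape(-1, indices.shape[-1])
    out = np.empty_like(rows)
    for r, row in enumerate(rows):
        flags = np.zeros(original_tokens, dtype=np.int64)
        flags[row] = 1                    # mark which token ids are present in this row
        positions = np.cumsum(flags)      # inclusive prefix sum -> rank + 1 for each id
        out[r, positions[row] - 1] = row  # place each id at its rank
    return out.reshape(indices.shape)
```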
from .dropping_utils import gpt_sample_tokens, bert_sample_tokens, GatherTokens, ScatterTokens
"""
Copyright 2022 The Microsoft DeepSpeed Team
"""
import torch
from ..op_builder import RandomLTDBuilder
"""
Returns:
sampled_indices: [layers, batch_size, reserved_length]
new_mask: [batch_size, 1, reserved_length, reserved_length]
"""
random_ltd_module = None
def gpt_sample_tokens(reserved_length: int,
seq_length: int,
batch_size: int,
layers: int = 1,
device: str = 'cpu',
attn_mask: torch.Tensor = None):
prob_dist = torch.ones((layers * batch_size, seq_length), device=device)
sampled_indices = torch.multinomial(prob_dist, reserved_length)
sampled_indices = sampled_indices.reshape(layers,
batch_size,
reserved_length).to(torch.int32)
global random_ltd_module
if random_ltd_module is None:
random_ltd_module = RandomLTDBuilder().load()
sampled_indices = random_ltd_module.token_sort_(sampled_indices, seq_length)
    # Not certain the optimized kernel is actually better here, since it can break
    # alignment if the sequence length is not divisible by 16
# new_mask = random_ltd_module.mask_gather_gpt(attn_mask, reserved_length)
if attn_mask is not None:
new_mask = attn_mask[:, :, :reserved_length, :reserved_length]
else:
new_mask = None
return sampled_indices, new_mask
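A hedged usage sketch for gpt_sample_tokens, using the import path this PR wires up elsewhere; the shapes follow the docstring above, and the attention-mask layout [batch, 1, seq, seq] plus all sizes are illustrative assumptions:

```python
import torch
from deepspeed.ops.random_ltd.dropping_utils import gpt_sample_tokens

attn_mask = torch.ones(4, 1, 512, 512, dtype=torch.half, device='cuda')  # [batch, 1, seq, seq]
indices, mask = gpt_sample_tokens(reserved_length=64, seq_length=512, batch_size=4,
                                  layers=12, device='cuda', attn_mask=attn_mask)
# indices: [layers, batch, reserved] == (12, 4, 64); mask: (4, 1, 64, 64)
```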
"""
Returns:
sampled_indices: [layers, batch_size, reserved_length]
new_mask: [layers, batch_size, 1, reserved_length, reserved_length]
"""
def bert_sample_tokens(reserved_length: int,
seq_length: int,
batch_size: int,
layers: int = 1,
device: str = 'cpu',
attn_mask: torch.Tensor = None):
assert attn_mask is not None
prob_dist = torch.ones((layers * batch_size, seq_length), device=device)
sampled_indices = torch.multinomial(prob_dist, reserved_length)
sampled_indices = sampled_indices.reshape(layers,
batch_size,
reserved_length).to(torch.int32)
global random_ltd_module
if random_ltd_module is None:
random_ltd_module = RandomLTDBuilder().load()
sampled_indices = random_ltd_module.token_sort_(sampled_indices, seq_length)
dtype = sampled_indices.dtype
sampled_indices = sampled_indices.to(torch.long)
new_mask = []
for l in range(layers):
tmp_mask_list = []
for i in range(batch_size):
mask_tmp = attn_mask[i:i + 1, :, sampled_indices[l][i], :]
tmp_mask_list.append(mask_tmp[:, :, :, sampled_indices[l][i]])
new_mask.append(torch.cat(tmp_mask_list, dim=0))
return sampled_indices.to(dtype), new_mask
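A matching sketch for bert_sample_tokens, which requires the dense mask and returns a per-layer list of sliced masks (same hedged assumptions and illustrative sizes as above):

```python
import torch
from deepspeed.ops.random_ltd.dropping_utils import bert_sample_tokens

attn_mask = torch.ones(4, 1, 512, 512, dtype=torch.half, device='cuda')  # [batch, 1, seq, seq]
indices, masks = bert_sample_tokens(reserved_length=64, seq_length=512, batch_size=4,
                                    layers=12, device='cuda', attn_mask=attn_mask)
# indices: (12, 4, 64); masks: list of 12 tensors, each [batch, 1, reserved, reserved]
```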
class GatherTokens(torch.autograd.Function):
@staticmethod
def forward(ctx,
activations: torch.Tensor,
sorted_indices: torch.Tensor,
batch_first: bool):
global random_ltd_module
if random_ltd_module is None:
random_ltd_module = RandomLTDBuilder().load()
ctx.save_for_backward(activations, sorted_indices)
ctx.batch_first = batch_first
return activations, random_ltd_module.token_gather(activations, sorted_indices, batch_first)
@staticmethod
def backward(ctx, a_gradients: torch.Tensor, g_gradients: torch.Tensor):
g_gradients = g_gradients.contiguous()
global random_ltd_module
if random_ltd_module is None:
random_ltd_module = RandomLTDBuilder().load()
activations, sorted_indices = ctx.saved_tensors
batch_first = ctx.batch_first
return random_ltd_module.token_scatter_(a_gradients,
g_gradients,
sorted_indices,
batch_first), None, None
class ScatterTokens(torch.autograd.Function):
@staticmethod
def forward(ctx,
all_activations: torch.Tensor,
layer_activations: torch.Tensor,
sorted_indices: torch.Tensor,
batch_first: bool):
global random_ltd_module
if random_ltd_module is None:
random_ltd_module = RandomLTDBuilder().load()
scatter_results = random_ltd_module.token_scatter_(all_activations.clone(),
layer_activations,
sorted_indices,
batch_first)
ctx.save_for_backward(sorted_indices)
ctx.batch_first = batch_first
return scatter_results
@staticmethod
def backward(ctx, out_gradients: torch.Tensor):
out_gradients = out_gradients.contiguous()
global random_ltd_module
if random_ltd_module is None:
random_ltd_module = RandomLTDBuilder().load()
sorted_indices, = ctx.saved_tensors
batch_first = ctx.batch_first
ret_val = random_ltd_module.token_gather(out_gradients,
sorted_indices,
batch_first)
return out_gradients, ret_val, None, None
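As a rough end-to-end illustration of how GatherTokens and ScatterTokens compose around a layer (a sketch only: the identity layer stands in for a transformer block, and all sizes are placeholders):

```python
import torch
from deepspeed.ops.random_ltd.dropping_utils import gpt_sample_tokens, GatherTokens, ScatterTokens

hidden_states = torch.randn(4, 512, 1024, dtype=torch.half, device='cuda', requires_grad=True)
layer = torch.nn.Identity()  # placeholder for a transformer layer
indices, _ = gpt_sample_tokens(reserved_length=64, seq_length=512, batch_size=4, layers=1, device='cuda')

full, dropped = GatherTokens.apply(hidden_states, indices[0], True)  # keep 64 of 512 tokens
dropped = layer(dropped)                                             # run the layer on fewer tokens
merged = ScatterTokens.apply(full, dropped, indices[0], True)        # write results back in place
merged.sum().backward()                                              # gradients reach hidden_states
```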
@@ -56,6 +56,9 @@ from ..compression.config import get_compression_config, get_quantize_enabled
from ..compression.constants import *
from .swap_tensor.aio_config import get_aio_config
from .data_pipeline.config import get_data_efficiency_enabled, get_data_efficiency_config, get_curriculum_enabled_legacy, get_curriculum_params_legacy
from .data_pipeline.constants import *
TENSOR_CORE_ALIGN_SIZE = 8
ADAGRAD_OPTIMIZER = 'adagrad'
@@ -87,24 +90,6 @@ class DeepSpeedConfigError(Exception):
pass
def get_curriculum_enabled(param_dict):
if CURRICULUM_LEARNING in param_dict.keys():
return get_scalar_param(param_dict[CURRICULUM_LEARNING],
CURRICULUM_ENABLED,
CURRICULUM_ENABLED_DEFAULT)
else:
return False
def get_curriculum_params(param_dict):
if CURRICULUM_LEARNING in param_dict.keys():
curriculum_params = copy.copy(param_dict[CURRICULUM_LEARNING])
curriculum_params.pop(CURRICULUM_ENABLED)
return curriculum_params
else:
return False
def get_pld_enabled(param_dict):
if PROGRESSIVE_LAYER_DROP in param_dict.keys():
return get_scalar_param(param_dict[PROGRESSIVE_LAYER_DROP],
@@ -898,8 +883,11 @@ class DeepSpeedConfig(object):
self.pld_enabled = get_pld_enabled(param_dict)
self.pld_params = get_pld_params(param_dict)
self.curriculum_enabled = get_curriculum_enabled(param_dict)
self.curriculum_params = get_curriculum_params(param_dict)
self.curriculum_enabled_legacy = get_curriculum_enabled_legacy(param_dict)
self.curriculum_params_legacy = get_curriculum_params_legacy(param_dict)
self.data_efficiency_enabled = get_data_efficiency_enabled(param_dict)
self.data_efficiency_config = get_data_efficiency_config(param_dict)
checkpoint_params = get_checkpoint_params(param_dict)
validation_mode = get_checkpoint_tag_validation_mode(checkpoint_params)
@@ -343,14 +343,6 @@ PLD_THETA_DEFAULT = 1.0
PLD_GAMMA = "gamma"
PLD_GAMMA_DEFAULT = 0.001
#########################################
# Curriculum Learning
#########################################
CURRICULUM_LEARNING = "curriculum_learning"
CURRICULUM_ENABLED = "enabled"
CURRICULUM_ENABLED_DEFAULT = False
#########################################
# Validation modes
@@ -419,3 +411,9 @@ DATALOADER_DROP_LAST_DEFAULT = False
# PIPELINE PARALLELISM
#########################################
PIPE_REPLICATED = 'ds_pipe_replicated'
#########################################
# DATA PARALLELISM
#########################################
DATA_PARALLEL_GROUP = "data_parallel_group"
GLOBAL_RANK = "global_rank"
'''
Copyright 2022 The Microsoft DeepSpeed Team
'''
from .constants import *
import copy
from ..config_utils import get_scalar_param
# TODO: Reducing config verbosity by returning None or {} when disabled.
# One challenge is that we still need to somehow include the default values,
# for example the *_ENABLED has default of false.
def get_data_efficiency_config(param_dict):
output = {}
output[DATA_EFFICIENCY_ENABLED] = get_data_efficiency_enabled(param_dict)
output[DATA_EFFICIENCY_SEED] = get_data_efficiency_seed(param_dict)
if DATA_EFFICIENCY not in param_dict.keys():
param_dict[DATA_EFFICIENCY] = {}
sub_param_dict = param_dict[DATA_EFFICIENCY]
output[DATA_SAMPLING] = get_data_sampling(sub_param_dict)
output[DATA_ROUTING] = get_data_routing(sub_param_dict)
return output
def get_data_efficiency_enabled(param_dict):
if DATA_EFFICIENCY in param_dict.keys():
return get_scalar_param(param_dict[DATA_EFFICIENCY],
DATA_EFFICIENCY_ENABLED,
DATA_EFFICIENCY_ENABLED_DEFAULT)
else:
return False
def get_data_efficiency_seed(param_dict):
if DATA_EFFICIENCY in param_dict.keys():
return get_scalar_param(param_dict[DATA_EFFICIENCY],
DATA_EFFICIENCY_SEED,
DATA_EFFICIENCY_SEED_DEFAULT)
else:
return DATA_EFFICIENCY_SEED_DEFAULT
def get_data_sampling(param_dict):
output = {}
output[DATA_SAMPLING_ENABLED] = get_data_sampling_enabled(param_dict)
output[DATA_SAMPLING_NUM_EPOCHS] = get_data_sampling_num_epochs(param_dict)
output[DATA_SAMPLING_NUM_WORKERS] = get_data_sampling_num_workers(param_dict)
if DATA_SAMPLING not in param_dict.keys():
param_dict[DATA_SAMPLING] = {}
sub_param_dict = param_dict[DATA_SAMPLING]
output[CURRICULUM_LEARNING] = get_curriculum_learning(sub_param_dict)
return output
def get_data_sampling_enabled(param_dict):
if DATA_SAMPLING in param_dict.keys():
return get_scalar_param(param_dict[DATA_SAMPLING],
DATA_SAMPLING_ENABLED,
DATA_SAMPLING_ENABLED_DEFAULT)
else:
return False
def get_data_sampling_num_epochs(param_dict):
if DATA_SAMPLING in param_dict.keys():
return get_scalar_param(param_dict[DATA_SAMPLING],
DATA_SAMPLING_NUM_EPOCHS,
DATA_SAMPLING_NUM_EPOCHS_DEFAULT)
else:
return DATA_SAMPLING_NUM_EPOCHS_DEFAULT
def get_data_sampling_num_workers(param_dict):
if DATA_SAMPLING in param_dict.keys():
return get_scalar_param(param_dict[DATA_SAMPLING],
DATA_SAMPLING_NUM_WORKERS,
DATA_SAMPLING_NUM_WORKERS_DEFAULT)
else:
return DATA_SAMPLING_NUM_WORKERS_DEFAULT
def get_curriculum_learning(param_dict):
output = {}
output[CURRICULUM_LEARNING_ENABLED] = get_curriculum_learning_enabled(param_dict)
if CURRICULUM_LEARNING not in param_dict.keys():
param_dict[CURRICULUM_LEARNING] = {}
sub_param_dict = param_dict[CURRICULUM_LEARNING]
if output[CURRICULUM_LEARNING_ENABLED]:
assert CURRICULUM_LEARNING_METRICS in sub_param_dict.keys(), f"Curriculum learning is enabled, {CURRICULUM_LEARNING_METRICS} must be specified"
for key, val in get_curriculum_learning_params(param_dict).items():
output[key] = val
return output
def get_curriculum_learning_enabled(param_dict):
if CURRICULUM_LEARNING in param_dict.keys():
return get_scalar_param(param_dict[CURRICULUM_LEARNING],
CURRICULUM_LEARNING_ENABLED,
CURRICULUM_LEARNING_ENABLED_DEFAULT)
else:
return False
def get_curriculum_learning_params(param_dict):
if CURRICULUM_LEARNING in param_dict.keys():
curriculum_learning_params = copy.copy(param_dict[CURRICULUM_LEARNING])
curriculum_learning_params.pop(CURRICULUM_LEARNING_ENABLED)
return curriculum_learning_params
else:
return {}
def get_curriculum_enabled_legacy(param_dict):
if CURRICULUM_LEARNING_LEGACY in param_dict.keys():
return get_scalar_param(param_dict[CURRICULUM_LEARNING_LEGACY],
CURRICULUM_ENABLED_LEGACY,
CURRICULUM_ENABLED_DEFAULT_LEGACY)
else:
return False
def get_curriculum_params_legacy(param_dict):
if CURRICULUM_LEARNING_LEGACY in param_dict.keys():
curriculum_params = copy.copy(param_dict[CURRICULUM_LEARNING_LEGACY])
curriculum_params.pop(CURRICULUM_ENABLED_LEGACY)
return curriculum_params
else:
return False
def get_data_routing(param_dict):
output = {}
output[DATA_ROUTING_ENABLED] = get_data_routing_enabled(param_dict)
if DATA_ROUTING not in param_dict.keys():
param_dict[DATA_ROUTING] = {}
sub_param_dict = param_dict[DATA_ROUTING]
output[RANDOM_LTD] = get_random_ltd(sub_param_dict)
return output
def get_data_routing_enabled(param_dict):
if DATA_ROUTING in param_dict.keys():
return get_scalar_param(param_dict[DATA_ROUTING],
DATA_ROUTING_ENABLED,
DATA_ROUTING_ENABLED_DEFAULT)
else:
return False
def get_random_ltd(param_dict):
output = {}
output[RANDOM_LTD_ENABLED] = RANDOM_LTD_ENABLED_DEFAULT
output[RANDOM_LTD_LAYER_TOKEN_LR_SCHEDULE] = {}
output[RANDOM_LTD_LAYER_TOKEN_LR_SCHEDULE][
RANDOM_LTD_LAYER_TOKEN_LR_ENABLED] = RANDOM_LTD_LAYER_TOKEN_LR_ENABLED_DEFAULT
if get_random_ltd_enabled(param_dict):
output[RANDOM_LTD_ENABLED] = get_random_ltd_enabled(param_dict)
for key, val in get_random_ltd_params(param_dict).items():
output[key] = val
return output
def get_random_ltd_enabled(param_dict):
if RANDOM_LTD in param_dict.keys():
return get_scalar_param(param_dict[RANDOM_LTD],
RANDOM_LTD_ENABLED,
RANDOM_LTD_ENABLED_DEFAULT)
else:
return False
def get_random_ltd_params(param_dict):
if RANDOM_LTD in param_dict.keys():
random_ltd_params = copy.copy(param_dict[RANDOM_LTD])
random_ltd_params.pop(RANDOM_LTD_ENABLED)
return random_ltd_params
else:
return {}
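For reference, a hedged sketch of the `data_efficiency` block these helpers parse. The key names come from the constants file that follows; every value, path, and metric name here is a placeholder rather than something taken from this PR.

```python
ds_config = {
    "data_efficiency": {
        "enabled": True,
        "seed": 1234,
        "data_sampling": {
            "enabled": True,
            "num_epochs": 1000,
            "num_workers": 0,
            "curriculum_learning": {
                "enabled": True,
                "data_cluster_path": "/tmp/data_clusters",  # placeholder path
                "curriculum_metrics": {
                    "seqlen": {  # placeholder metric name
                        "index_to_sample_path": "/tmp/seqlen/index_to_sample",
                        "index_to_metric_path": "/tmp/seqlen/index_to_metric",
                        "difficulty_type": "value",
                        "clustering_type": "single_cluster",
                        "min_difficulty": 80,
                        "max_difficulty": 512,
                        "schedule_type": "fixed_linear",
                        "schedule_config": {
                            "total_curriculum_step": 30000,
                            "difficulty_step": 8
                        }
                    }
                }
            }
        },
        "data_routing": {
            "enabled": True,
            "random_ltd": {
                "enabled": True,
                "model_mask_name": "attention_mask",
                "hidden_state_order": "seq_first",  # placeholder value
                "random_ltd_layer_num": 22,
                "random_ltd_layer_id": list(range(1, 23)),
                "total_layer_num": 24,
                "random_ltd_schedule": {
                    "min_value": 128,
                    "max_value": 2048,
                    "schedule_type": "fixed_linear",
                    "schedule_config": {
                        "seq_per_step": 16,
                        "require_steps": 200000
                    }
                }
            }
        }
    }
}
```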
'''
Copyright 2022 The Microsoft DeepSpeed Team
'''
#########################################
# Data efficiency library
# See sample config at https://www.deepspeed.ai/docs/config-json/#data-efficiency
#########################################
DATA_EFFICIENCY = "data_efficiency"
DATA_EFFICIENCY_ENABLED = "enabled"
DATA_EFFICIENCY_ENABLED_DEFAULT = False
DATA_EFFICIENCY_SEED = "seed"
DATA_EFFICIENCY_SEED_DEFAULT = 1234
#########################################
# Data efficiency - Data Sampling
#########################################
DATA_SAMPLING = "data_sampling"
DATA_SAMPLING_ENABLED = "enabled"
DATA_SAMPLING_ENABLED_DEFAULT = False
DATA_SAMPLING_NUM_EPOCHS = "num_epochs"
DATA_SAMPLING_NUM_EPOCHS_DEFAULT = 1000
DATA_SAMPLING_NUM_WORKERS = "num_workers"
DATA_SAMPLING_NUM_WORKERS_DEFAULT = 0
#########################################
# Data efficiency - Data Sampling - Curriculum Learning
#########################################
CURRICULUM_LEARNING = "curriculum_learning"
CURRICULUM_LEARNING_ENABLED = "enabled"
CURRICULUM_LEARNING_ENABLED_DEFAULT = False
CURRICULUM_LEARNING_CLUSTER_PATH = "data_cluster_path"
CURRICULUM_LEARNING_METRICS = "curriculum_metrics"
CURRICULUM_LEARNING_SAMPLE_PATH = "index_to_sample_path"
CURRICULUM_LEARNING_METRIC_PATH = "index_to_metric_path"
CURRICULUM_LEARNING_CLUSTERING_TYPE = "clustering_type"
CURRICULUM_LEARNING_SINGLE_CLUSTER = "single_cluster"
CURRICULUM_LEARNING_CLUSTER_PREFIX = "cluster"
CURRICULUM_LEARNING_DIFFICULTY_TYPE = "difficulty_type"
CURRICULUM_LEARNING_VALUE_BASED = "value"
CURRICULUM_LEARNING_PERCENTILE_BASED = "percentile"
CURRICULUM_LEARNING_MIN_DIFFICULTY = "min_difficulty"
CURRICULUM_LEARNING_MAX_DIFFICULTY = "max_difficulty"
CURRICULUM_LEARNING_SCHEDULE_TYPE = "schedule_type"
CURRICULUM_LEARNING_SCHEDULE_CONFIG = "schedule_config"
CURRICULUM_LEARNING_SCHEDULE_DIFFICULTY = "difficulty"
CURRICULUM_LEARNING_SCHEDULE_MAX_STEP = "max_step"
CURRICULUM_LEARNING_SCHEDULE_TOTAL_STEP = "total_curriculum_step"
CURRICULUM_LEARNING_SCHEDULE_DIFFICULTY_STEP = "difficulty_step"
CURRICULUM_LEARNING_SCHEDULE_ROOT_DEGREE = "root_degree"
CURRICULUM_LEARNING_SCHEDULE_FIXED_DISCRETE = "fixed_discrete"
CURRICULUM_LEARNING_SCHEDULE_FIXED_ROOT = "fixed_root"
CURRICULUM_LEARNING_SCHEDULE_FIXED_LINEAR = "fixed_linear"
CURRICULUM_LEARNING_SCHEDULE_CUSTOM = "custom"
CURRICULUM_LEARNING_CURRENT_DIFFICULTY = "current_difficulty"
CURRICULUM_LEARNING_BATCH = "batch"
CURRICULUM_LEARNING_CONSUMED_SAMPLES = "consumed_samples"
CURRICULUM_LEARNING_STEP = "curriculum_step"
CURRICULUM_LEARNING_CURRENT_DIFFICULTIES = "current_difficulties"
CURRICULUM_LEARNING_DATA_CLUSTER_PATHS = "data_cluster_paths"
CURRICULUM_LEARNING_DATA_CLUSTER_CURRENT_POSITION = "data_cluster_current_position"
CURRICULUM_LEARNING_NP_RNG_STATE = "np_rng_state"
#########################################
# Curriculum Learning legacy implementation
#########################################
CURRICULUM_LEARNING_LEGACY = "curriculum_learning"
CURRICULUM_ENABLED_LEGACY = "enabled"
CURRICULUM_ENABLED_DEFAULT_LEGACY = False
#########################################
# Data efficiency - Data Routing
#########################################
DATA_ROUTING = "data_routing"
DATA_ROUTING_ENABLED = "enabled"
DATA_ROUTING_ENABLED_DEFAULT = False
#########################################
# Data efficiency - Data Routing - Random LTD
#########################################
RANDOM_LTD = "random_ltd"
RANDOM_LTD_ENABLED = "enabled"
RANDOM_LTD_ENABLED_DEFAULT = False
RANDOM_LTD_MODEL_MASK_NAME = "model_mask_name"
RANDOM_LTD_MODEL_TYPE = "model_type"
RANDOM_LTD_MICRO_BATCH_SIZE = "micro_batch_size"
RANDOM_LTD_GLOBAL_BATCH_SIZE = "global_batch_size"
RANDOM_LTD_SAMPLE_INDEX = "sample_idx"
RANDOM_LTD_ATTENTION_MASK = "attention_mask"
RANDOM_LTD_HIDDEN_STATE_ORDER = "hidden_state_order"
RANDOM_LTD_LAYER_NUM = "random_ltd_layer_num"
RANDOM_LTD_LAYER_ID = "random_ltd_layer_id"
RANDOM_LTD_TOTAL_LAYER_NUM = "total_layer_num"
RANDOM_LTD_CONSUMED_LAYER_TOKENS = "consumed_layer_tokens"
# scheduler
RANDOM_LTD_SCHEDULER = "random_ltd_schedule"
RANDOM_LTD_MAX_VALUE = "max_value"
RANDOM_LTD_MIN_VALUE = "min_value"
RANDOM_LTD_CURRENT_VALUE = "current_value"
RANDOM_LTD_SCHEDULE_CONFIG = "schedule_config"
RANDOM_LTD_INCREASE_STEP = "seq_per_step"
RANDOM_LTD_REQUIRE_STEP = "require_steps"
RANDOM_LTD_SCHEDULER_TYPE = "schedule_type"
RANDOM_LTD_CURR_STEP = "current_steps"
# learning rate schedulers
RANDOM_LTD_LAYER_TOKEN_LR_SCHEDULE = "layer_token_lr_schedule"
RANDOM_LTD_LAYER_TOKEN_LR_ENABLED = "enabled"
RANDOM_LTD_LAYER_TOKEN_LR_ENABLED_DEFAULT = False
RANDOM_LTD_TOTAL_LAYER_TOKENS = "total_layer_tokens"
RANDOM_LTD_WARMUP_TYPE = "warmup_type"
RANDOM_LTD_WARMUP_LAYER_TOKENS = "warmup_layer_tokens"
@@ -3,22 +3,30 @@ Copyright 2021 The Microsoft DeepSpeed Team
'''
import math
from deepspeed.utils import logger
from .constants import *
class CurriculumScheduler(object):
def __init__(self, config):
super().__init__()
self.state = {}
assert "curriculum_type" in config, "Curriculum learning requires the config 'curriculum_type'"
assert "min_difficulty" in config, "Curriculum learning requires the config 'min_difficulty'"
assert "max_difficulty" in config, "Curriculum learning requires the config 'max_difficulty'"
assert "schedule_type" in config, "Curriculum learning requires the config 'schedule_type'"
self.state['min_difficulty'] = config['min_difficulty']
self.state['max_difficulty'] = config['max_difficulty']
self.state['current_difficulty'] = config['min_difficulty']
self.state['schedule_type'] = config['schedule_type']
assert CURRICULUM_LEARNING_MIN_DIFFICULTY in config, \
f"Curriculum learning requires the config '{CURRICULUM_LEARNING_MIN_DIFFICULTY}'"
assert CURRICULUM_LEARNING_MAX_DIFFICULTY in config, \
f"Curriculum learning requires the config '{CURRICULUM_LEARNING_MAX_DIFFICULTY}'"
assert CURRICULUM_LEARNING_SCHEDULE_TYPE in config, \
f"Curriculum learning requires the config '{CURRICULUM_LEARNING_SCHEDULE_TYPE}'"
self.state[CURRICULUM_LEARNING_MIN_DIFFICULTY] = config[
CURRICULUM_LEARNING_MIN_DIFFICULTY]
self.state[CURRICULUM_LEARNING_MAX_DIFFICULTY] = config[
CURRICULUM_LEARNING_MAX_DIFFICULTY]
self.state[CURRICULUM_LEARNING_CURRENT_DIFFICULTY] = config[
CURRICULUM_LEARNING_MIN_DIFFICULTY]
self.state[CURRICULUM_LEARNING_SCHEDULE_TYPE] = config[
CURRICULUM_LEARNING_SCHEDULE_TYPE]
self.first_step = True
if config[
CURRICULUM_LEARNING_SCHEDULE_TYPE] == CURRICULUM_LEARNING_SCHEDULE_FIXED_DISCRETE:
"""
The schedule_config is a list of difficulty and a list of max
step belonging to each difficulty. Example json config:
@@ -28,17 +36,25 @@ class CurriculumScheduler(object):
}
The "max_step" has one less element than "difficulty", because
the last difficulty will be used for all following steps.
The self.state[CURRICULUM_LEARNING_SCHEDULE_CONFIG] is a dictionary of
difficulty : [max step for this difficulty, next difficulty].
"""
assert "difficulty" in config['schedule_config'], "Curriculum learning with fixed_discrete schedule requires the schedule_config 'difficulty'"
assert "max_step" in config['schedule_config'], "Curriculum learning with fixed_discrete schedule requires the schedule_config 'max_step'"
assert len(config['schedule_config']['max_step']) > 0
assert len(config['schedule_config']['difficulty']) > 0
assert len(config['schedule_config']['difficulty']) == len(
config['schedule_config']['max_step']) + 1
self.state['schedule'] = config['schedule_config']
elif config['schedule_type'] == 'fixed_root':
assert CURRICULUM_LEARNING_SCHEDULE_DIFFICULTY in config[CURRICULUM_LEARNING_SCHEDULE_CONFIG], \
f"Curriculum learning with fixed_discrete schedule requires the schedule_config '{CURRICULUM_LEARNING_SCHEDULE_DIFFICULTY}'"
assert CURRICULUM_LEARNING_SCHEDULE_MAX_STEP in config[CURRICULUM_LEARNING_SCHEDULE_CONFIG], \
f"Curriculum learning with fixed_discrete schedule requires the schedule_config '{CURRICULUM_LEARNING_SCHEDULE_MAX_STEP}'"
assert len(config[CURRICULUM_LEARNING_SCHEDULE_CONFIG]
[CURRICULUM_LEARNING_SCHEDULE_MAX_STEP]) > 0
assert len(config[CURRICULUM_LEARNING_SCHEDULE_CONFIG]
[CURRICULUM_LEARNING_SCHEDULE_DIFFICULTY]) > 0
assert len(config[CURRICULUM_LEARNING_SCHEDULE_CONFIG]
[CURRICULUM_LEARNING_SCHEDULE_DIFFICULTY]) == len(
config[CURRICULUM_LEARNING_SCHEDULE_CONFIG]
[CURRICULUM_LEARNING_SCHEDULE_MAX_STEP]) + 1
self.state[CURRICULUM_LEARNING_SCHEDULE_CONFIG] = config[
CURRICULUM_LEARNING_SCHEDULE_CONFIG]
elif config[
CURRICULUM_LEARNING_SCHEDULE_TYPE] == CURRICULUM_LEARNING_SCHEDULE_FIXED_ROOT:
"""
The schedule_config includes:
total_curriculum_step: how many steps the curriculum learning takes to go
@@ -57,38 +73,59 @@ class CurriculumScheduler(object):
"root_degree": 2
}
"""
assert "total_curriculum_step" in config['schedule_config'], "Curriculum learning with fixed_root schedule requires the schedule_config 'total_curriculum_step'"
assert "difficulty_step" in config['schedule_config'], "Curriculum learning with fixed_root schedule requires the schedule_config 'difficulty_step'"
assert "root_degree" in config['schedule_config'], "Curriculum learning with fixed_root schedule requires the schedule_config 'root_degree'"
if config['schedule_config']['difficulty_step'] % 8 != 0:
assert CURRICULUM_LEARNING_SCHEDULE_TOTAL_STEP in config[CURRICULUM_LEARNING_SCHEDULE_CONFIG], \
f"Curriculum learning with fixed_root schedule requires the schedule_config '{CURRICULUM_LEARNING_SCHEDULE_TOTAL_STEP}'"
assert CURRICULUM_LEARNING_SCHEDULE_DIFFICULTY_STEP in config[CURRICULUM_LEARNING_SCHEDULE_CONFIG], \
f"Curriculum learning with fixed_root schedule requires the schedule_config '{CURRICULUM_LEARNING_SCHEDULE_DIFFICULTY_STEP}'"
assert CURRICULUM_LEARNING_SCHEDULE_ROOT_DEGREE in config[CURRICULUM_LEARNING_SCHEDULE_CONFIG], \
f"Curriculum learning with fixed_root schedule requires the schedule_config '{CURRICULUM_LEARNING_SCHEDULE_ROOT_DEGREE}'"
if config[CURRICULUM_LEARNING_SCHEDULE_CONFIG][
CURRICULUM_LEARNING_SCHEDULE_DIFFICULTY_STEP] % 8 != 0:
logger.warning(
f'When using seqlen metric, the difficulty_step for curriculum learning has to be multiple of 8 (for FP16 data) or 16 (for INT8 data) to enable NVIDIA Tensor Core acceleration. Disregard this warning if this is unrelated to your metric/hardware.'
)
self.state[CURRICULUM_LEARNING_SCHEDULE_CONFIG] = config[
CURRICULUM_LEARNING_SCHEDULE_CONFIG]
elif config[
CURRICULUM_LEARNING_SCHEDULE_TYPE] == CURRICULUM_LEARNING_SCHEDULE_FIXED_LINEAR:
"""
The schedule_config is the same as CURRICULUM_LEARNING_SCHEDULE_FIXED_ROOT but without the
root_degree.
"schedule_config": {
"total_curriculum_step": 30000,
"difficulty_step": 8
}
"""
assert "total_curriculum_step" in config['schedule_config'], "Curriculum learning with fixed_linear schedule requires the schedule_config 'total_curriculum_step'"
assert "difficulty_step" in config['schedule_config'], "Curriculum learning with fixed_linear schedule requires the schedule_config 'difficulty_step'"
if config['schedule_config']['difficulty_step'] % 8 != 0:
assert CURRICULUM_LEARNING_SCHEDULE_TOTAL_STEP in config[CURRICULUM_LEARNING_SCHEDULE_CONFIG], \
f"Curriculum learning with fixed_linear schedule requires the schedule_config '{CURRICULUM_LEARNING_SCHEDULE_TOTAL_STEP}'"
assert CURRICULUM_LEARNING_SCHEDULE_DIFFICULTY_STEP in config[CURRICULUM_LEARNING_SCHEDULE_CONFIG], \
f"Curriculum learning with fixed_linear schedule requires the schedule_config '{CURRICULUM_LEARNING_SCHEDULE_DIFFICULTY_STEP}'"
if config[CURRICULUM_LEARNING_SCHEDULE_CONFIG][
CURRICULUM_LEARNING_SCHEDULE_DIFFICULTY_STEP] % 8 != 0:
logger.warning(
f'When using seqlen metric, the difficulty_step for curriculum learning has to be multiple of 8 (for FP16 data) or 16 (for INT8 data) to enable NVIDIA Tensor Core acceleration. Disregard this warning if this is unrelated to your metric/hardware.'
)
self.state[CURRICULUM_LEARNING_SCHEDULE_CONFIG] = config[
CURRICULUM_LEARNING_SCHEDULE_CONFIG]
elif config[
CURRICULUM_LEARNING_SCHEDULE_TYPE] == CURRICULUM_LEARNING_SCHEDULE_CUSTOM:
"""
Fully customized schedule. Users need to provide a custom schedule
function by using the set_custom_curriculum_learning_schedule API
in deepspeed/runtime/engine.py
"""
self.custom_get_difficulty = None
else:
raise RuntimeError('Unsupported curriculum schedule type')
def get_current_difficulty(self):
return self.state[CURRICULUM_LEARNING_CURRENT_DIFFICULTY]
def set_current_difficulty(self, difficulty):
self.state[CURRICULUM_LEARNING_CURRENT_DIFFICULTY] = difficulty
def set_custom_get_difficulty(self, schedule_function):
self.custom_get_difficulty = schedule_function
def get_state(self):
return self.state
@@ -97,38 +134,49 @@
self.state = state
def __fixed_discrete_get_difficulty(self, global_steps):
s_state = self.state[CURRICULUM_LEARNING_SCHEDULE_CONFIG]
if global_steps > s_state[CURRICULUM_LEARNING_SCHEDULE_MAX_STEP][-1]:
return s_state[CURRICULUM_LEARNING_SCHEDULE_DIFFICULTY][-1]
for i in range(len(s_state[CURRICULUM_LEARNING_SCHEDULE_MAX_STEP])):
if global_steps <= s_state[CURRICULUM_LEARNING_SCHEDULE_MAX_STEP][i]:
return s_state[CURRICULUM_LEARNING_SCHEDULE_DIFFICULTY][i]
def __fixed_root_get_difficulty(self, global_steps, root_degree=None):
s_state = self.state['schedule']
s_state = self.state[CURRICULUM_LEARNING_SCHEDULE_CONFIG]
if root_degree is None:
root_degree = s_state['root_degree']
root_degree = s_state[CURRICULUM_LEARNING_SCHEDULE_ROOT_DEGREE]
next_difficulty = (float(global_steps) /
s_state['total_curriculum_step'])**(1.0 / root_degree)
next_difficulty = math.floor(
next_difficulty *
(self.state['max_difficulty'] - self.state['min_difficulty']) +
self.state['min_difficulty'])
next_difficulty -= (next_difficulty % s_state['difficulty_step'])
next_difficulty = min(next_difficulty, self.state['max_difficulty'])
s_state[CURRICULUM_LEARNING_SCHEDULE_TOTAL_STEP])**(
1.0 / root_degree)
next_difficulty = math.floor(next_difficulty *
(self.state[CURRICULUM_LEARNING_MAX_DIFFICULTY] -
self.state[CURRICULUM_LEARNING_MIN_DIFFICULTY]) +
self.state[CURRICULUM_LEARNING_MIN_DIFFICULTY])
next_difficulty -= (next_difficulty %
s_state[CURRICULUM_LEARNING_SCHEDULE_DIFFICULTY_STEP])
next_difficulty = min(next_difficulty,
self.state[CURRICULUM_LEARNING_MAX_DIFFICULTY])
return next_difficulty
def get_difficulty(self, global_steps):
if self.state['schedule_type'] == 'fixed_discrete':
if self.state[
CURRICULUM_LEARNING_SCHEDULE_TYPE] == CURRICULUM_LEARNING_SCHEDULE_FIXED_DISCRETE:
return self.__fixed_discrete_get_difficulty(global_steps)
elif self.state['schedule_type'] == 'fixed_linear':
elif self.state[
CURRICULUM_LEARNING_SCHEDULE_TYPE] == CURRICULUM_LEARNING_SCHEDULE_FIXED_LINEAR:
return self.__fixed_root_get_difficulty(global_steps, 1)
elif self.state['schedule_type'] == 'fixed_root':
elif self.state[
CURRICULUM_LEARNING_SCHEDULE_TYPE] == CURRICULUM_LEARNING_SCHEDULE_FIXED_ROOT:
return self.__fixed_root_get_difficulty(global_steps)
elif self.state[
CURRICULUM_LEARNING_SCHEDULE_TYPE] == CURRICULUM_LEARNING_SCHEDULE_CUSTOM:
return self.custom_get_difficulty(global_steps)
else:
raise RuntimeError('Unsupported curriculum schedule type')
def update_difficulty(self, global_steps):
if self.state['current_difficulty'] < self.state['max_difficulty']:
self.state['current_difficulty'] = self.get_difficulty(global_steps)
return self.state['current_difficulty']
if self.state[CURRICULUM_LEARNING_CURRENT_DIFFICULTY] < self.state[
CURRICULUM_LEARNING_MAX_DIFFICULTY]:
self.state[CURRICULUM_LEARNING_CURRENT_DIFFICULTY] = self.get_difficulty(
global_steps)
return self.state[CURRICULUM_LEARNING_CURRENT_DIFFICULTY]
'''
Copyright 2022 The Microsoft DeepSpeed Team
'''
from deepspeed.utils import logger
from torch import Tensor
from torch.nn import Module
from ..constants import *
from deepspeed.ops.random_ltd.dropping_utils import gpt_sample_tokens, bert_sample_tokens, GatherTokens, ScatterTokens
#####based on the paper random-ltd: https://arxiv.org/abs/2211.11586
class RandomLayerTokenDrop(Module):
"""
A layer wrapper for random LTD
"""
def __init__(self, layer: Module):
super(RandomLayerTokenDrop, self).__init__()
self.random_ltd_layer = layer
self.reserved_length = None #config['max_value']
self.random_ltd_scheduler = None
self.max_length = None
self.reserved_length = -1
self.curr_seq = -1
self.batch_first = False
def init_config(self, config, scheduler, random_ltd_layer_id):
self.random_ltd_scheduler = scheduler
self.random_ltd_layer_id = random_ltd_layer_id
self.max_length = self.random_ltd_scheduler.state[RANDOM_LTD_MAX_VALUE]
self.mask_name = config[RANDOM_LTD_MODEL_MASK_NAME]
self.micro_bs = config[RANDOM_LTD_MICRO_BATCH_SIZE]
self.random_ltd_num_layer = self.random_ltd_scheduler.random_ltd_layer_num
hs_order = config[RANDOM_LTD_HIDDEN_STATE_ORDER]
self.model_type = config[RANDOM_LTD_MODEL_TYPE]
if hs_order == 'batch_seq_dim':
self.get_hidden_tensor_shape = self.get_bsh
self.batch_first = True
elif hs_order == 'seq_batch_dim':
self.get_hidden_tensor_shape = self.get_sbh
self.batch_first = False
else:
logger.warning(
"************For now, we only support batch_seq_dim or seq_batch_dim inputs. You can easily \
add your own input dimension orders************")
raise NotImplementedError
if self.model_type == 'encoder':
self.index_generator = bert_sample_tokens
elif self.model_type == 'decoder':
self.index_generator = gpt_sample_tokens
else:
logger.warning(
"************For now, we only support encoder-only or decoder-only models************"
)
raise NotImplementedError
def get_bsh(self, hidden_stats):
self.curr_seq, self.curr_micro_batch = hidden_stats.size()[1], hidden_stats.size()[0]
def get_sbh(self, hidden_stats):
self.curr_seq, self.curr_micro_batch = hidden_stats.size()[0], hidden_stats.size()[1]
def forward(self, hidden_states, **kwargs) -> Tensor:
if self.random_ltd_scheduler is not None:
self.reserved_length = self.random_ltd_scheduler.get_current_seq()
self.get_hidden_tensor_shape(hidden_states)
if self.training and self.random_ltd_scheduler is not None and self.reserved_length < self.curr_seq:
if self.mask_name is not None:
mask = kwargs[self.mask_name]
else:
mask = None
if self.random_ltd_layer_id == 0:
sampled_indices, part_attention_mask = self.index_generator(self.reserved_length,\
self.curr_seq, \
self.curr_micro_batch, \
self.random_ltd_num_layer, \
hidden_states.device, mask)
self.random_ltd_scheduler.state[
RANDOM_LTD_SAMPLE_INDEX] = sampled_indices
self.random_ltd_scheduler.state[
RANDOM_LTD_ATTENTION_MASK] = part_attention_mask
else:
sampled_indices = self.random_ltd_scheduler.state[
RANDOM_LTD_SAMPLE_INDEX]
part_attention_mask = self.random_ltd_scheduler.state[
RANDOM_LTD_ATTENTION_MASK]
hidden_states, part_hidden_states = GatherTokens.apply(hidden_states, sampled_indices[self.random_ltd_layer_id,:,:], self.batch_first)
if self.mask_name is not None:
if self.model_type == 'encoder':
kwargs[self.mask_name] = part_attention_mask[
self.random_ltd_layer_id]
else:
kwargs[self.mask_name] = part_attention_mask
outputs = self.random_ltd_layer(part_hidden_states, **kwargs)
if isinstance(outputs, tuple):
hidden_states = ScatterTokens.apply(hidden_states, outputs[0], sampled_indices[self.random_ltd_layer_id,:,:], self.batch_first)
my_list = list(outputs)
my_list[0] = hidden_states
return tuple(my_list)
elif isinstance(outputs, Tensor):
hidden_states = ScatterTokens.apply(hidden_states, outputs, sampled_indices[self.random_ltd_layer_id,:,:], self.batch_first)
return hidden_states
else:
logger.warning(
"************For now, we only support tuple and tensor output. \
You need to adjust the output according to the layer in your model************"
)
raise NotImplementedError
else:
return self.random_ltd_layer(hidden_states, **kwargs)
'''
Copyright 2022 The Microsoft DeepSpeed Team
'''
from .basic_layer import RandomLayerTokenDrop
from collections import OrderedDict
from deepspeed.compression.helper import recursive_getattr, recursive_setattr
def convert_to_random_ltd(model, convert_type):
if hasattr(model, 'module'):
c_model = model.module
else:
c_model = model
for name, module in c_model.named_modules():
if isinstance(module, convert_type):
old_module = recursive_getattr(c_model, name)
new_module = RandomLayerTokenDrop(old_module)
recursive_setattr(c_model, name, new_module)
model.random_ltd_initialize()
return model
def save_without_random_ltd(model):
if hasattr(model, 'module'):
c_model = model.module
else:
c_model = model
model_dic = c_model.state_dict()
return remove_random_ltd_state_dict(model_dic)
def remove_random_ltd_state_dict(state_dict):
new_state_dict = OrderedDict()
for key, value in state_dict.items():
if '.random_ltd_layer' in key:
new_key = ''.join(key.split('.random_ltd_layer'))
else:
new_key = key
new_state_dict[new_key] = value
return new_state_dict
'''
Copyright 2022 The Microsoft DeepSpeed Team
'''
import math
from deepspeed.utils import logger
# from deepspeed.runtime.lr_schedules import WarmupLR
from ..constants import *
#####based on the paper random-ltd: https://arxiv.org/abs/2211.11586
class BaseScheduler(object):
def __init__(self):
self.state = {}
def __fixed_root_get_value(self, global_steps, root_degree=None):
s_state = self.state[RANDOM_LTD_SCHEDULE_CONFIG]
if root_degree is None:
root_degree = s_state['root_degree']
next_seq = (float(global_steps) /
s_state[RANDOM_LTD_REQUIRE_STEP])**(1.0 / root_degree)
next_seq = math.floor(
next_seq *
(self.state[RANDOM_LTD_MAX_VALUE] - self.state[RANDOM_LTD_MIN_VALUE]) +
self.state[RANDOM_LTD_MIN_VALUE])
next_seq -= (next_seq % s_state[RANDOM_LTD_INCREASE_STEP])
next_seq = min(next_seq, self.state[RANDOM_LTD_MAX_VALUE])
return next_seq
def get_value(self, global_steps):
if self.state[RANDOM_LTD_SCHEDULER_TYPE] == 'fixed_linear':
return self.__fixed_root_get_value(global_steps, 1)
else:
raise RuntimeError('Unsupported random LTD schedule type')
class RandomLTDScheduler(BaseScheduler):
def __init__(self, config):
super().__init__()
self.model_layer_num = config[RANDOM_LTD_TOTAL_LAYER_NUM]
self.random_ltd_layer_num = config[RANDOM_LTD_LAYER_NUM]
self.config_schedule = config[RANDOM_LTD_SCHEDULER]
self.global_batch_size = config[RANDOM_LTD_GLOBAL_BATCH_SIZE]
self.reset_to_init()
if config[RANDOM_LTD_LAYER_TOKEN_LR_SCHEDULE][RANDOM_LTD_LAYER_TOKEN_LR_ENABLED]:
logger.warning("**********Work In Progress************")
raise NotImplementedError
self.state[RANDOM_LTD_CONSUMED_LAYER_TOKENS] = 0
# self.first_step = True
def get_total_layer_tokens(self, train_iters):
for step in range(train_iters):
self.update_seq(step)
return self.state[RANDOM_LTD_CONSUMED_LAYER_TOKENS]
def reset_to_init(self):
if self.config_schedule is not None:
self.state[RANDOM_LTD_MIN_VALUE] = self.config_schedule[RANDOM_LTD_MIN_VALUE]
self.state[RANDOM_LTD_MAX_VALUE] = self.config_schedule[RANDOM_LTD_MAX_VALUE]
self.state[RANDOM_LTD_CURRENT_VALUE] = self.config_schedule[
RANDOM_LTD_MIN_VALUE]
self.state[RANDOM_LTD_SCHEDULE_CONFIG] = self.config_schedule[
RANDOM_LTD_SCHEDULE_CONFIG]
self.state[RANDOM_LTD_SCHEDULER_TYPE] = self.config_schedule[
RANDOM_LTD_SCHEDULER_TYPE]
self.state[RANDOM_LTD_CONSUMED_LAYER_TOKENS] = 0
self.state[RANDOM_LTD_CURR_STEP] = -1
def get_current_seq(self):
return self.state[RANDOM_LTD_CURRENT_VALUE]
def set_current_seq(self, seq_length):
self.state[RANDOM_LTD_CURRENT_VALUE] = seq_length
def get_random_ltd_layer_num(self):
return self.random_ltd_layer_num
def get_state(self):
return self.state
def set_state(self, state):
self.state = state
def update_seq(self, global_steps):
if self.state[RANDOM_LTD_CURRENT_VALUE] < self.state[RANDOM_LTD_MAX_VALUE]:
self.state[RANDOM_LTD_CURRENT_VALUE] = self.get_value(global_steps)
if global_steps != self.state[RANDOM_LTD_CURR_STEP]:
self.state[RANDOM_LTD_CONSUMED_LAYER_TOKENS] += self.global_batch_size*(self.state[RANDOM_LTD_CURRENT_VALUE] * self.random_ltd_layer_num \
+ self.state[RANDOM_LTD_MAX_VALUE] * (self.model_layer_num - self.random_ltd_layer_num))
self.state[RANDOM_LTD_CURR_STEP] = global_steps
def state_dict(self):
return {
RANDOM_LTD_CONSUMED_LAYER_TOKENS:
self.state[RANDOM_LTD_CONSUMED_LAYER_TOKENS],
RANDOM_LTD_CURR_STEP: self.state[RANDOM_LTD_CURR_STEP],
RANDOM_LTD_CURRENT_VALUE: self.state[RANDOM_LTD_CURRENT_VALUE],
RANDOM_LTD_MIN_VALUE: self.state[RANDOM_LTD_MIN_VALUE],
RANDOM_LTD_MAX_VALUE: self.state[RANDOM_LTD_MAX_VALUE],
}
def load_state_dict(self, state_dict):
self.state[RANDOM_LTD_CONSUMED_LAYER_TOKENS] = state_dict[
RANDOM_LTD_CONSUMED_LAYER_TOKENS]
self.state[RANDOM_LTD_CURR_STEP] = state_dict[RANDOM_LTD_CURR_STEP]
self.state[RANDOM_LTD_CURRENT_VALUE] = state_dict[RANDOM_LTD_CURRENT_VALUE]
self.state[RANDOM_LTD_MIN_VALUE] = state_dict[RANDOM_LTD_MIN_VALUE]
self.state[RANDOM_LTD_MAX_VALUE] = state_dict[RANDOM_LTD_MAX_VALUE]
'''
Copyright 2022 The Microsoft DeepSpeed Team
'''
import torch
def bsh_decoder_gather(reserved_length, hidden_states, mask):
# random-layer-token-drop
rand_list = []
part_hidden_states = [] # batch, seq, hidden ## different from megatron
for k in range(hidden_states.size(0)):
B_tmp = torch.randperm(hidden_states.size(1),
device=hidden_states.device)[:reserved_length]
B = B_tmp.sort()[0]
rand_list.append(B)
part_hidden_states.append(hidden_states[k:k + 1, B, :])
part_hidden_states = torch.cat(part_hidden_states, dim=0)
part_mask = mask[:, :, :reserved_length, :reserved_length]
return part_hidden_states, rand_list, part_mask
def bsh_decoder_scatter(hidden_states, part_hidden_states, rand_list):
for k in range(hidden_states.size(0)):
hidden_states[k, rand_list[k], :] = part_hidden_states[k, :, :]
return hidden_states
'''
Copyright 2022 The Microsoft DeepSpeed Team
'''
import math
import numpy as np
from deepspeed.utils import logger
from .indexed_dataset import MMapIndexedDatasetBuilder
def find_fit_int_dtype(min_value, max_value):
if min_value >= 0:
if max_value <= 255:
return np.uint8
elif max_value <= 65535:
return np.uint16
elif max_value <= 4294967295:
return np.uint32
else:
return np.uint64
else:
if max_value <= 127 and min_value >= -128:
return np.int8
elif max_value <= 32767 and min_value >= -32768:
return np.int16
elif max_value <= 2147483647 and min_value >= -2147483648:
return np.int32
else:
return np.int64
def split_index(start_idx, end_idx, num_partitions):
partition_size = math.ceil((end_idx - start_idx) / num_partitions)
partitions = [[
start_idx + x * partition_size,
min(end_idx,
start_idx + (x + 1) * partition_size)
] for x in range(num_partitions)]
return partitions
def split_dataset(dataset, num_workers, worker_id, num_threads):
worker_splits = split_index(0, len(dataset), num_workers)
thread_splits = split_index(worker_splits[worker_id][0],
worker_splits[worker_id][1],
num_threads)
return worker_splits, thread_splits
def create_mmap_dataset_builder(fname, dtype):
logger.info(f"Creating mmap dataset builder at {fname}.")
return MMapIndexedDatasetBuilder(f"{fname}.bin", dtype=dtype)
def close_mmap_dataset_builder(builder, fname):
builder.end_document()
builder.finalize(f"{fname}.idx")
logger.info(f"Finalized mmap dataset builder at {fname}.")
......@@ -6,6 +6,12 @@ import torch
from torch.utils.data import DataLoader, RandomSampler
from torch.utils.data.distributed import DistributedSampler
from deepspeed.runtime.data_pipeline.data_sampling.data_sampler import DeepSpeedDataSampler
from deepspeed.runtime.data_pipeline.constants import CURRICULUM_LEARNING, \
DATA_EFFICIENCY, DATA_SAMPLING_NUM_WORKERS
from deepspeed.runtime.constants import GRADIENT_ACCUMULATION_STEPS, \
DATA_PARALLEL_GROUP, GLOBAL_RANK
class RepeatingLoader:
def __init__(self, loader):
......@@ -42,24 +48,46 @@ class DeepSpeedDataLoader(object):
data_sampler=None,
data_parallel_world_size=None,
data_parallel_rank=None,
dataloader_drop_last=False):
dataloader_drop_last=False,
deepspeed_dataloader_config={}):
self.deepspeed_dataloader_config = deepspeed_dataloader_config
self.tput_timer = tput_timer
self.batch_size = batch_size
if local_rank >= 0:
if data_sampler is None:
data_sampler = DistributedSampler(dataset=dataset,
num_replicas=data_parallel_world_size,
rank=data_parallel_rank)
device_count = 1
else:
if data_sampler is None:
data_sampler = RandomSampler(dataset)
self.curriculum_learning_enabled = False
if CURRICULUM_LEARNING in deepspeed_dataloader_config:
self.curriculum_learning_enabled = deepspeed_dataloader_config[
CURRICULUM_LEARNING]
if self.curriculum_learning_enabled:
data_sampler = DeepSpeedDataSampler(
self.deepspeed_dataloader_config[DATA_EFFICIENCY],
len(dataset),
self.batch_size,
data_parallel_rank,
data_parallel_world_size,
self.deepspeed_dataloader_config[DATA_PARALLEL_GROUP],
self.deepspeed_dataloader_config[GRADIENT_ACCUMULATION_STEPS],
self.deepspeed_dataloader_config[GLOBAL_RANK],
drop_last=dataloader_drop_last)
device_count = torch.cuda.device_count()
batch_size *= device_count
if num_local_io_workers is None:
num_local_io_workers = 2 * device_count
num_local_io_workers = self.deepspeed_dataloader_config[
DATA_SAMPLING_NUM_WORKERS]
else:
if local_rank >= 0:
if data_sampler is None:
data_sampler = DistributedSampler(
dataset=dataset,
num_replicas=data_parallel_world_size,
rank=data_parallel_rank)
device_count = 1
else:
if data_sampler is None:
data_sampler = RandomSampler(dataset)
device_count = torch.cuda.device_count()
batch_size *= device_count
if num_local_io_workers is None:
num_local_io_workers = 2 * device_count
self.num_local_io_workers = num_local_io_workers
self.data_sampler = data_sampler
......@@ -70,6 +98,7 @@ class DeepSpeedDataLoader(object):
self.pin_memory = pin_memory
self.data = None
self.dataloader_drop_last = dataloader_drop_last
self.post_process_func = None
if self.dataloader_drop_last:
self.len = len(self.data_sampler) // self.batch_size
......@@ -87,27 +116,41 @@ class DeepSpeedDataLoader(object):
def __next__(self):
if self.tput_timer:
self.tput_timer.start()
return next(self.data)
if self.curriculum_learning_enabled:
data = next(self.data_iterator)
if self.post_process_func is not None:
data = self.post_process_func(data, self.data_sampler.state_dict())
return data
else:
return next(self.data)
def _create_dataloader(self):
if self.collate_fn is None:
if self.curriculum_learning_enabled:
self.dataloader = DataLoader(self.dataset,
batch_size=self.batch_size,
pin_memory=self.pin_memory,
sampler=self.data_sampler,
num_workers=self.num_local_io_workers,
drop_last=self.dataloader_drop_last)
batch_sampler=self.data_sampler,
num_workers=self.num_local_io_workers)
self.data_iterator = iter(self.dataloader)
return self.dataloader
else:
self.dataloader = DataLoader(self.dataset,
batch_size=self.batch_size,
pin_memory=self.pin_memory,
sampler=self.data_sampler,
collate_fn=self.collate_fn,
num_workers=self.num_local_io_workers,
drop_last=self.dataloader_drop_last)
self.data = (x for x in self.dataloader)
return self.dataloader
if self.collate_fn is None:
self.dataloader = DataLoader(self.dataset,
batch_size=self.batch_size,
pin_memory=self.pin_memory,
sampler=self.data_sampler,
num_workers=self.num_local_io_workers,
drop_last=self.dataloader_drop_last)
else:
self.dataloader = DataLoader(self.dataset,
batch_size=self.batch_size,
pin_memory=self.pin_memory,
sampler=self.data_sampler,
collate_fn=self.collate_fn,
num_workers=self.num_local_io_workers,
drop_last=self.dataloader_drop_last)
self.data = (x for x in self.dataloader)
return self.dataloader
# DataLoader([(torch.randn(3, 3), torch.tensor(i % 2)) for i in range(10)], batch_size=2))
......@@ -7,7 +7,7 @@ import re
import stat
import torch
import hashlib
from collections import defaultdict, OrderedDict
from collections import defaultdict, OrderedDict, deque
from shutil import copyfile
from torch.nn.modules import Module
......@@ -38,7 +38,8 @@ from deepspeed.runtime.config import DeepSpeedConfig, DEEPSPEED_OPTIMIZERS, \
from deepspeed.runtime.dataloader import DeepSpeedDataLoader
from deepspeed.runtime.constants import \
ROUTE_TRAIN, ROUTE_PREDICT, ROUTE_EVAL, \
PLD_THETA, PLD_GAMMA, BFLOAT16, FP16, AMP
PLD_THETA, PLD_GAMMA, BFLOAT16, FP16, AMP, GRADIENT_ACCUMULATION_STEPS, \
DATA_PARALLEL_GROUP, GLOBAL_RANK
from deepspeed.runtime.zero.config import ZeroStageEnum
from deepspeed.compression import compression_scheduler
from deepspeed.compression.constants import \
......@@ -64,7 +65,17 @@ from deepspeed.monitor.monitor import MonitorMaster
from deepspeed.runtime.progressive_layer_drop import ProgressiveLayerDrop
from deepspeed.runtime.utils import clip_grad_norm_
from deepspeed.runtime.eigenvalue import Eigenvalue
from deepspeed.runtime.data_pipeline.constants import DATA_SAMPLING, \
DATA_ROUTING, DATA_SAMPLING_ENABLED, CURRICULUM_LEARNING, \
CURRICULUM_LEARNING_ENABLED, DATA_SAMPLING_NUM_WORKERS, RANDOM_LTD, \
RANDOM_LTD_ENABLED, RANDOM_LTD_LAYER_ID, RANDOM_LTD_LAYER_NUM, \
RANDOM_LTD_LAYER_TOKEN_LR_SCHEDULE, RANDOM_LTD_LAYER_TOKEN_LR_ENABLED, \
RANDOM_LTD_GLOBAL_BATCH_SIZE, RANDOM_LTD_MICRO_BATCH_SIZE, DATA_EFFICIENCY
from deepspeed.runtime.data_pipeline.curriculum_scheduler import CurriculumScheduler
from deepspeed.runtime.data_pipeline.data_routing.scheduler import RandomLTDScheduler
from deepspeed.runtime.data_pipeline.data_routing.helper import remove_random_ltd_state_dict
from deepspeed.runtime.data_pipeline.data_routing.basic_layer import RandomLayerTokenDrop
from deepspeed.runtime.checkpoint_engine.torch_checkpoint_engine import TorchCheckpointEngine
from .pipe.module import PipelineModule
......@@ -358,8 +369,17 @@ class DeepSpeedEngine(Module):
if self.pld_enabled():
self.progressive_layer_drop = self._configure_progressive_layer_drop()
if self.curriculum_enabled():
self.curriculum_scheduler = self._configure_curriculum_scheduler()
if self.curriculum_enabled_legacy():
self.curriculum_scheduler_legacy = self._configure_curriculum_scheduler_legacy(
)
if self.random_ltd_enabled():
random_ltd_config = self.random_ltd_config()
random_ltd_config[RANDOM_LTD_GLOBAL_BATCH_SIZE] = self.train_batch_size()
random_ltd_config[
RANDOM_LTD_MICRO_BATCH_SIZE] = self.train_micro_batch_size_per_gpu()
self.random_ltd_scheduler = self._configure_random_ltd_scheduler(
random_ltd_config)
# Engine timers
......@@ -443,6 +463,15 @@ class DeepSpeedEngine(Module):
self._config.train_batch_size = train_batch_size
self._config.gradient_accumulation_steps = new_gas
def set_data_post_process_func(self, post_process_func):
if self.training_dataloader is not None:
self.training_dataloader.post_process_func = post_process_func
def set_custom_curriculum_learning_schedule(self, schedule_func_dict):
if self.training_dataloader is not None and self.curriculum_learning_enabled():
self.training_dataloader.data_sampler.set_custom_curriculum_learning_schedule(
schedule_func_dict)
def get_global_grad_norm(self) -> float:
"""Return the 2-norm of all gradients. If there is model parallelism,
the norm will be global.
......@@ -524,11 +553,64 @@ class DeepSpeedEngine(Module):
def eigenvalue_layer_num(self):
return self._config.eigenvalue_layer_num
def curriculum_enabled(self):
return self._config.curriculum_enabled
def curriculum_enabled_legacy(self):
return self._config.curriculum_enabled_legacy
def curriculum_params_legacy(self):
return self._config.curriculum_params_legacy
def data_efficiency_enabled(self):
return self._config.data_efficiency_enabled
def data_efficiency_config(self):
return self._config.data_efficiency_config
def data_sampling_enabled(self):
return self._config.data_efficiency_config[DATA_SAMPLING][DATA_SAMPLING_ENABLED]
def data_sampling_config(self):
return self._config.data_efficiency_config[DATA_SAMPLING]
def curriculum_learning_enabled(self):
return self._config.data_efficiency_config[DATA_SAMPLING][CURRICULUM_LEARNING][
CURRICULUM_LEARNING_ENABLED]
def curriculum_learning_config(self):
return self._config.data_efficiency_config[DATA_SAMPLING][CURRICULUM_LEARNING]
def curriculum_params(self):
return self._config.curriculum_params
def random_ltd_enabled(self):
return self._config.data_efficiency_config[DATA_ROUTING][RANDOM_LTD][
RANDOM_LTD_ENABLED]
def random_ltd_config(self):
return self._config.data_efficiency_config[DATA_ROUTING][RANDOM_LTD]
def random_ltd_initialize(self):
assert self.random_ltd_enabled()
random_ltd_config = self.random_ltd_config()
random_ltd_queue = deque(
[x for x in sorted(random_ltd_config[RANDOM_LTD_LAYER_ID])])
count = 0
for name, layer in self.module.named_modules():
if isinstance(layer, RandomLayerTokenDrop):
if len(random_ltd_queue) != 0 and str(
random_ltd_queue[0]) in name: ###[1,2,3]
layer.init_config(random_ltd_config,
self.random_ltd_scheduler,
count)
random_ltd_queue.popleft()
count += 1
if random_ltd_config[RANDOM_LTD_LAYER_NUM] != count:
raise ValueError(
f'random_ltd_layer_num {random_ltd_config[RANDOM_LTD_LAYER_NUM]} must be \
equal to the length of random_ltd_layer_id {count}')
if random_ltd_config[RANDOM_LTD_LAYER_TOKEN_LR_SCHEDULE][
RANDOM_LTD_LAYER_TOKEN_LR_ENABLED]:
assert self.client_lr_scheduler is None
raise ValueError('the layer-token LR schedule is not yet supported')
#self.lr_scheduler = lr_schedules.WarmupLayerTokenDecayLR(self.optimizer, self.random_ltd_scheduler)
def wall_clock_breakdown(self):
return self._config.wall_clock_breakdown
......@@ -1313,6 +1395,9 @@ class DeepSpeedEngine(Module):
def _configure_compression_scheduler(self):
return compression_scheduler(self.module, self._config.compression_config)
def _configure_random_ltd_scheduler(self, configs):
return RandomLTDScheduler(configs)
def _configure_quantization(self):
(
quantize_weight_in_forward,
......@@ -1556,8 +1641,8 @@ class DeepSpeedEngine(Module):
return pld
def _configure_curriculum_scheduler(self):
scheduler = CurriculumScheduler(self.curriculum_params())
def _configure_curriculum_scheduler_legacy(self):
scheduler = CurriculumScheduler(self.curriculum_params_legacy())
return scheduler
@staticmethod
......@@ -1616,17 +1701,36 @@ class DeepSpeedEngine(Module):
data_parallel_world_size = self.mpu.get_data_parallel_world_size()
data_parallel_rank = self.mpu.get_data_parallel_rank()
return DeepSpeedDataLoader(dataset=dataset,
batch_size=batch_size,
pin_memory=pin_memory,
collate_fn=collate_fn,
local_rank=self.local_rank,
tput_timer=deepspeed_io_timer,
num_local_io_workers=num_local_io_workers,
data_sampler=data_sampler,
data_parallel_world_size=data_parallel_world_size,
data_parallel_rank=data_parallel_rank,
dataloader_drop_last=self.dataloader_drop_last())
deepspeed_dataloader_config = {}
if self.curriculum_learning_enabled():
deepspeed_dataloader_config = {
CURRICULUM_LEARNING:
self.curriculum_learning_enabled(),
DATA_EFFICIENCY:
self.data_efficiency_config(),
DATA_PARALLEL_GROUP:
self.data_parallel_group,
GRADIENT_ACCUMULATION_STEPS:
self.gradient_accumulation_steps(),
GLOBAL_RANK:
self.global_rank,
DATA_SAMPLING_NUM_WORKERS:
self.data_sampling_config()[DATA_SAMPLING_NUM_WORKERS]
}
return DeepSpeedDataLoader(
dataset=dataset,
batch_size=batch_size,
pin_memory=pin_memory,
collate_fn=collate_fn,
local_rank=self.local_rank,
tput_timer=deepspeed_io_timer,
num_local_io_workers=num_local_io_workers,
data_sampler=data_sampler,
data_parallel_world_size=data_parallel_world_size,
data_parallel_rank=data_parallel_rank,
dataloader_drop_last=self.dataloader_drop_last(),
deepspeed_dataloader_config=deepspeed_dataloader_config)
def train(self, mode=True):
r""""""
......@@ -1701,14 +1805,17 @@ class DeepSpeedEngine(Module):
if self.__class__.__name__ != "PipelineEngine":
# TODO: The above if condition is a HACK since for PipelineEngine
# it's difficult to inject argument in forward pass.
if self.module.training and self.curriculum_enabled():
self.curriculum_scheduler.update_difficulty(self.global_steps + 1)
if self.curriculum_params()["curriculum_type"] == "seqlen":
if self.module.training and self.curriculum_enabled_legacy():
self.curriculum_scheduler_legacy.update_difficulty(self.global_steps + 1)
if self.curriculum_params_legacy()["curriculum_type"] == "seqlen":
kwargs.update({
"curriculum_seqlen":
self.curriculum_scheduler.get_current_difficulty()
self.curriculum_scheduler_legacy.get_current_difficulty()
})
if self.module.training and self.random_ltd_enabled():
self.random_ltd_scheduler.update_seq(self.global_steps)
if self.zero_optimization_partition_weights():
# Enable automated discovery of external parameters by indicating that
# we are in a forward pass.
......@@ -2435,6 +2542,8 @@ class DeepSpeedEngine(Module):
def module_state_dict(self, destination=None, prefix="", keep_vars=False):
sd = self.module.state_dict(destination, prefix, keep_vars)
if self.random_ltd_enabled():
sd = remove_random_ltd_state_dict(sd)
return sd
@staticmethod
......@@ -2744,6 +2853,15 @@ class DeepSpeedEngine(Module):
if load_lr_scheduler_states and self.lr_scheduler is not None:
self.lr_scheduler.load_state_dict(checkpoint['lr_scheduler'])
if self.random_ltd_enabled(
) and self.random_ltd_scheduler is not None and 'random_ltd' in checkpoint:
self.random_ltd_scheduler.load_state_dict(checkpoint['random_ltd'])
if self.training_dataloader is not None and self.curriculum_learning_enabled(
) and 'data_sampler' in checkpoint:
self.training_dataloader.data_sampler.load_state_dict(
checkpoint['data_sampler'])
def get_sparse_tensor_module_names(original_set,
loaded_set,
original_parameters,
......@@ -2790,7 +2908,9 @@ class DeepSpeedEngine(Module):
'skipped_steps',
'global_steps',
'dp_world_size',
'mp_world_size'
'mp_world_size',
'data_sampler',
'random_ltd'
]
client_state = {}
......@@ -3169,6 +3289,11 @@ class DeepSpeedEngine(Module):
if self.optimizer and zero_optimizer_state else None,
lr_scheduler=self.lr_scheduler.state_dict()
if self.lr_scheduler is not None else None,
data_sampler=self.training_dataloader.data_sampler.state_dict() if
(self.training_dataloader is not None
and self.curriculum_learning_enabled()) else None,
random_ltd=self.random_ltd_scheduler.state_dict()
if self.random_ltd_enabled() else None,
sparse_tensor_module_names=self.sparse_tensor_module_names,
skipped_steps=self.skipped_steps,
global_steps=self.global_steps,
......
......@@ -322,13 +322,13 @@ class PipelineEngine(DeepSpeedEngine):
f'train_batch() requires gradients enabled. Use eval_batch() instead.')
# Curriculum learning could change activation shape
if self.curriculum_enabled():
new_difficulty = self.curriculum_scheduler.update_difficulty( \
if self.curriculum_enabled_legacy():
new_difficulty = self.curriculum_scheduler_legacy.update_difficulty( \
self.global_steps + 1)
if self.global_steps == 0 or self.curriculum_scheduler.first_step:
if self.global_steps == 0 or self.curriculum_scheduler_legacy.first_step:
self.reset_activation_shape()
self.curriculum_scheduler.first_step = False
elif new_difficulty != self.curriculum_scheduler.get_difficulty( \
self.curriculum_scheduler_legacy.first_step = False
elif new_difficulty != self.curriculum_scheduler_legacy.get_difficulty( \
self.global_steps):
self.reset_activation_shape()
......@@ -413,13 +413,13 @@ class PipelineEngine(DeepSpeedEngine):
self.module.eval()
# Curriculum learning could change activation shape
if self.curriculum_enabled():
new_difficulty = self.curriculum_scheduler.update_difficulty( \
if self.curriculum_enabled_legacy():
new_difficulty = self.curriculum_scheduler_legacy.update_difficulty( \
self.global_steps + 1)
if self.global_steps == 0 or self.curriculum_scheduler.first_step:
if self.global_steps == 0 or self.curriculum_scheduler_legacy.first_step:
self.reset_activation_shape()
self.curriculum_scheduler.first_step = False
elif new_difficulty != self.curriculum_scheduler.get_difficulty( \
self.curriculum_scheduler_legacy.first_step = False
elif new_difficulty != self.curriculum_scheduler_legacy.get_difficulty( \
self.global_steps):
self.reset_activation_shape()
......
......@@ -39,6 +39,7 @@ collections:
- bert-pretraining.md
- cifar-10.md
- curriculum-learning.md
- data-efficiency.md
- flops-profiler.md
- pytorch-profiler.md
- autotuning.md
......
......@@ -44,6 +44,8 @@ lnav:
url: /docs/config-json/#communication-logging
- title: 'Model Compression'
url: /docs/config-json/#compression
- title: 'Data Efficiency'
url: /docs/config-json/#data-efficiency
- title: 'Tutorials'
url: /tutorials/
children:
......@@ -61,6 +63,8 @@ lnav:
url: /tutorials/cifar-10/
- title: 'Curriculum Learning'
url: /tutorials/curriculum-learning/
- title: 'Data Efficiency'
url: /tutorials/data-efficiency/
- title: 'Flops Profiler'
url: /tutorials/flops-profiler/
- title: 'PyTorch Profiler'
......
......@@ -898,7 +898,132 @@ Configuring the asynchronous I/O module for offloading parameter and optimizer s
}
```
### Data Efficiency
DeepSpeed Data Efficiency Library includes two techniques: curriculum learning and random layerwise token dropping (random-LTD). Read more about how to use the DeepSpeed Data Efficiency Library in our [tutorial](/tutorials/data-efficiency/).
```json
"data_efficiency": {
"enabled": true,
"seed": 1234,
"data_routing": {
"enabled": true,
"random_ltd":{
"enabled": true,
"total_layer_num": 24,
"random_ltd_layer_num": 22,
"random_ltd_layer_id": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22],
"model_mask_name": "attention_mask",
"model_type": "decoder",
"hidden_state_order": "seq_batch_dim",
"random_ltd_schedule": {
"min_value": 128,
"max_value": 2048,
"schedule_type":"fixed_linear",
"schedule_config": {
"require_steps": 200000,
"seq_per_step": 16
}
}
}
},
"data_sampling": {
"enabled": true,
"num_epochs": 1,
"num_workers": 0,
"curriculum_learning": {
"enabled": true,
"data_cluster_path": "/path/to/data_clusters",
"curriculum_metrics": {
"vocabularyrarity": {
"index_to_sample_path": "/path/to/index_to_sample",
"index_to_metric_path": "/path/to/index_to_metric",
"difficulty_type": "percentile",
"clustering_type": "schedule_based",
"min_difficulty": 1,
"max_difficulty": 100,
"schedule_type": "fixed_root",
"schedule_config": {
"total_curriculum_step": 110000,
"difficulty_step": 1,
"root_degree": 2
}
}
}
}
}
}
```
<i>**data_efficiency**</i>: [dictionary]
| Fields | Value | Default |
| ----- | ----- | ----- |
| <i>**enabled**</i>: [boolean] | Enable data efficiency or not. | `false` |
| <i>**seed**</i>: [integer] | Random seed for data sampling. | 1234 |
| <i>**data_routing**</i>: [dictionary] | Configs for data routing techniques. | N/A |
| <i>**data_sampling**</i>: [dictionary] | Configs for data sampling techniques. | N/A |
<i>**data_routing**</i>: [dictionary]
| Fields | Value | Default |
| ----- | ----- | ----- |
| <i>**enabled**</i>: [boolean] | Enable data routing techniques or not. | `false` |
| <i>**random_ltd**</i>: [dictionary] | Configs for random-LTD technique. | N/A |
<i>**data_sampling**</i>: [dictionary]
| Fields | Value | Default |
| ----- | ----- | ----- |
| <i>**enabled**</i>: [boolean] | Enable data sampling techniques or not. | `false` |
| <i>**num_epochs**</i>: [integer] | The maximum number of epochs of the original dataset that will be iterated. | 1000 |
| <i>**num_workers**</i>: [integer] | Data loader number of workers. | 0 |
| <i>**curriculum_learning**</i>: [dictionary] | Configs for the curriculum learning technique. | N/A |
<i>**random_ltd**</i>: [dictionary]
| Fields | Value | Default |
| ----- | ----- | ----- |
| <i>**enabled**</i>: [boolean] | Enable random-LTD technique or not. | `false` |
| <i>**total_layer_num**</i>: [integer] | The number of layers (i.e., the depth) of the pretraining/fine-tuning model. | N/A |
| <i>**random_ltd_layer_num**</i>: [integer] | The number of layers to which random-LTD will be applied. | N/A |
| <i>**random_ltd_layer_id**</i>: [list] | The exact layer IDs to which random-LTD will be applied. The length of this list must be the same as `random_ltd_layer_num`. | N/A |
| <i>**model_mask_name**</i>: [str] | The variable name of the attention mask. Different libraries use different names, such as att_mask; for Hugging Face models it is named "attention_mask". Users need to check the forward function in the original model files. If the attention mask input in the original model's forward function is not a keyword/named argument (e.g., attention_mask=None), the user needs to change it to a keyword/named argument and provide that keyword as `model_mask_name`. | N/A |
| <i>**model_type**</i>: [str] | Users need to identify whether the model is `decoder` or `encoder`. Currently we only support these two. | N/A |
| <i>**hidden_state_order**</i>: [str] | Users need to know the input order of the hidden state tensor. Normally it is batch, sequence, and then the hidden dimension, i.e., `batch_seq_dim`. Sometimes the batch and sequence dimensions are switched, as in `seq_batch_dim`. Currently we support these two orders. | N/A |
| <i>**random_ltd_schedule**</i>: [dictionary] | The schedule of the effective sequence length after token dropping. It is a linear function where random-LTD gradually drops fewer tokens and increases the effective sequence length (a standalone sketch of this schedule appears right after this table). | N/A |
| <i>&emsp;&emsp;**min_value**</i>: [integer] | The initial effective sequence length (after token dropping) at step/iteration 0. | N/A |
| <i>&emsp;&emsp;**max_value**</i>: [integer] | The max effective sequence length (usually the case without any token dropping). Usually this is set to the baseline's sequence length. | N/A |
| <i>&emsp;&emsp;**schedule_type**</i>: [str] | The sequence length follows a linearly increasing function starting from `min_value` and reaching `max_value`. We currently only support this type (`fixed_linear`). | N/A |
| <i>&emsp;&emsp;**schedule_config**</i>: [dictionary] | Configs for the linear increasing function. | N/A |
| <i>&emsp;&emsp;&emsp;&emsp;**require_steps**</i>: [integer] | How many iterations will be needed to reach max_value from min_value. | N/A |
| <i>&emsp;&emsp;&emsp;&emsp;**seq_per_step**</i>: [integer] | At any time, the effective sequence length will be a multiple of this `seq_per_step`. Set this to a multiple of 8 (for FP16 data) or 16 (for INT8 data) to enable NVIDIA Tensor Core acceleration. | N/A |
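
For intuition, here is a minimal standalone sketch (not the library API) of how the `fixed_linear` schedule above maps a training step to an effective sequence length. It mirrors the scheduler logic added in this change and uses the values from the JSON example at the top of this section.

```python
import math

def random_ltd_effective_seqlen(global_step,
                                min_value=128,        # values from the JSON example above
                                max_value=2048,
                                require_steps=200000,
                                seq_per_step=16):
    """Effective (kept) sequence length at a given training step under fixed_linear."""
    seq = float(global_step) / require_steps * (max_value - min_value) + min_value
    seq = math.floor(seq)
    seq -= seq % seq_per_step          # keep it a multiple of seq_per_step
    return min(seq, max_value)

# step 0 -> 128, step 100000 -> 1088, step >= 200000 -> 2048
```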
<i>**curriculum_learning**</i>: [dictionary]
| Fields | Value | Default |
| ----- | ----- | ----- |
| <i>**enabled**</i>: [boolean] | Enable the curriculum learning technique or not. | `false` |
| <i>**data_cluster_path**</i>: [str] | Path to directory where curriculum learning will store the indexes of data samples within the same difficulty ranges. | N/A |
| <i>**curriculum_metrics**</i>: [dictionary] | This dictionary includes all desired curriculum metrics and their configs. Each metric will be a separate sub-dictionary, where the key is the metric name and the values are configs below. | N/A |
| <i>&emsp;&emsp;**index_to_sample_path**</i>: [str] | Path to the index_to_sample file generated during offline data analysis. Note that data analysis will generate two kinds of index_to_sample files: The metric_name_index_to_sample_percentile_merged file is a concatenated index for perf improvement, but it only works when you set difficulty_type=`percentile`. If you use difficulty_type=`value`, you need to change this to use the metric_name_index_to_sample file. | N/A |
| <i>&emsp;&emsp;**index_to_metric_path**</i>: [str] | Path to the index_to_metric_path file generated during offline data analysis. | N/A |
| <i>&emsp;&emsp;**difficulty_type**</i>: [str] | During training, how to increase the max accepted difficulty. Currently supports `value` (increase by absolute value) and `percentile` (increase by difficulty percentile). | N/A |
| <i>&emsp;&emsp;**clustering_type**</i>: [str] | Currently supports `schedule_based` (cluster data based on the difficulty schedule, i.e., the pacing function, below) and `single_cluster` (no clustering required; curriculum learning is instead achieved by data postprocessing, such as sequence length truncation). | N/A |
| <i>&emsp;&emsp;**min_difficulty**</i>: [integer] | Starting difficulty at first step. When difficulty_type=`value` the `min_difficulty` is an absolute difficulty value. When difficulty_type=`percentile` the `min_difficulty` is a difficulty percentile value. | N/A |
| <i>&emsp;&emsp;**max_difficulty**</i>: [integer] | Final max difficulty. When difficulty_type=`value` the `max_difficulty` is an absolute difficulty value. When difficulty_type=`percentile` the `max_difficulty` is a difficulty percentile value. | N/A |
| <i>&emsp;&emsp;**schedule_type**</i>: [str] | The difficulty schedule (pacing function) that defines how the max accepted difficulty increases from `min_difficulty` to `max_difficulty` during training. Currently supports `fixed_linear`, `fixed_root`, `fixed_discrete`, and `custom`. | N/A |
| <i>&emsp;&emsp;**schedule_config**</i>: [dictionary] | Configs for the pacing function. When schedule_type=`custom` this dictionary is not necessary; instead the user needs to provide a callback function (via the `set_custom_curriculum_learning_schedule` API in deepspeed/runtime/engine.py) which will update the max accepted difficulty during training. The configs below all belong to `schedule_config` (a standalone sketch of the `fixed_linear`/`fixed_root` pacing function appears right after this table). | N/A |
| <i>&emsp;&emsp;&emsp;&emsp;**total_curriculum_step**</i>: [integer] | How many steps the curriculum learning takes to go from min difficulty to max difficulty. Used by `fixed_linear` and `fixed_root` schedule. | N/A |
| <i>&emsp;&emsp;&emsp;&emsp;**difficulty_step**</i>: [integer] | The max accepted difficulty level determined at every step must be a multiple of this `difficulty_step`. This is used to ensure the use of NVIDIA Tensor Core acceleration (requires a multiple of 8 (FP16) or 16 (INT8)). Used by the `fixed_linear` and `fixed_root` schedules. | N/A |
| <i>&emsp;&emsp;&emsp;&emsp;**root_degree**</i>: [integer] | The degree of the root function. Degree of 2 means square root and degree of 3 means cube root. Degree of 1 is equivalent to linear. Used by `fixed_root` schedule. | N/A |
| <i>&emsp;&emsp;&emsp;&emsp;**difficulty**</i>: [list] | List of max accepted difficulty levels to be used during the schedule. Used by the `fixed_discrete` schedule. | N/A |
| <i>&emsp;&emsp;&emsp;&emsp;**max_step**</i>: [list] | List of steps at which to change the max accepted difficulty level. Used by the `fixed_discrete` schedule. | N/A |
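
For reference, here is a minimal standalone sketch (not the library API) of the `fixed_root`/`fixed_linear` pacing function, mirroring the CurriculumScheduler logic added in this change and using the values from the JSON example above (`root_degree=1` reduces to `fixed_linear`).

```python
import math

def max_accepted_difficulty(global_step,
                            min_difficulty=1,            # values from the JSON example above
                            max_difficulty=100,
                            total_curriculum_step=110000,
                            difficulty_step=1,
                            root_degree=2):
    """Pacing function for the fixed_root schedule (root_degree=1 gives fixed_linear)."""
    frac = (float(global_step) / total_curriculum_step)**(1.0 / root_degree)
    difficulty = math.floor(frac * (max_difficulty - min_difficulty) + min_difficulty)
    difficulty -= difficulty % difficulty_step   # round down to a multiple of difficulty_step
    return min(difficulty, max_difficulty)

# With percentile difficulty: step 0 -> 1, step 27500 -> 50, step >= 110000 -> 100
```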
### Curriculum Learning
**Note:** On 12/12/2022, we released the [DeepSpeed Data Efficiency Library](/tutorials/data-efficiency/), which provides more general curriculum learning support. The legacy curriculum learning feature below is still supported, but we recommend using the Data Efficiency Library.
```json
"curriculum_learning": {
"enabled": true,
......
......@@ -101,6 +101,9 @@ Pipeline parallelism of DeepSpeed reduce communication volume during distributed
1-bit Adam, 0/1 Adam and 1-bit LAMB reduce communication volume by up to 26x while achieving similar convergence efficiency to Adam, allowing for scaling to different types of GPU clusters and networks. [1-bit Adam blog post](https://www.deepspeed.ai/2020/09/08/onebit-adam-blog-post.html), [1-bit Adam tutorial](https://www.deepspeed.ai/tutorials/onebit-adam/), [0/1 Adam tutorial](https://www.deepspeed.ai/tutorials/zero-one-adam/), [1-bit LAMB tutorial](https://www.deepspeed.ai/tutorials/onebit-lamb/).
## Data efficiency
DeepSpeed Data Efficiency Library provides efficient data sampling via curriculum learning and efficient data routing via random layerwise token dropping. The composed solution enables up to 2x data and 2x time savings during GPT-3/BERT pretraining and GPT/ViT finetuning, or further improves model quality under the same data/time budget. See more in [the tutorial](/tutorials/data-efficiency).
## Supporting long sequence length
DeepSpeed offers sparse attention kernels—an instrumental technology to support long sequences of model inputs, whether for text, image, or sound. Compared with the classic dense Transformers, it powers **an order-of-magnitude longer input sequence** and obtains up to 6x faster execution with comparable accuracy. It also outperforms state-of-the-art sparse implementations with 1.5–3x faster execution. Furthermore, our sparse kernels support efficient execution of flexible sparse format and empower users to innovate on their custom sparse structures. [Read more here](https://www.deepspeed.ai/2020/09/08/sparse-attention.html).
......@@ -164,10 +167,15 @@ Below we provide a brief feature list, see our detailed [feature overview](https
* Learning Rate Range Test
* 1Cycle Learning Rate Schedule
* [Simplified Data Loader](https://www.deepspeed.ai/features/#simplified-data-loader)
* [Data Efficiency](https://www.deepspeed.ai/tutorials/data-efficiency/)
* Efficient data sampling via curriculum learning and efficient data routing via random layerwise token dropping
* Up to 2x data and 2x time saving during GPT-3/BERT pretraining and GPT/ViT finetuning
* Or further improve model quality under the same data/time
* [Curriculum Learning](https://www.deepspeed.ai/tutorials/curriculum-learning/)
* A curriculum learning-based data pipeline that presents easier or simpler examples earlier during training
* Stable and 3.3x faster GPT-2 pre-training with 8x/4x larger batch size/learning rate while maintaining token-wise convergence speed
* Complementary to many other DeepSpeed features
* Note that the Data Efficiency Library above provides more general curriculum learning support. This legacy curriculum learning feature is still supported, but we recommend using the Data Efficiency Library.
* [Progressive Layer Dropping](https://www.deepspeed.ai/2020/10/28/progressive-layer-dropping-news.html)
* Efficient and robust compressed training
* Up to 2.5x convergence speedup for pre-training
......@@ -419,8 +427,11 @@ DeepSpeed abstracts away data parallelism and model parallelism from the user wh
comes to data loading. Users simply provide a PyTorch dataset, and DeepSpeed data loader
can automatically handle batch creation appropriately.
## Data Efficiency
Please refer to the [Data Efficiency](/tutorials/data-efficiency/) tutorial.
## Curriculum Learning
Please refer to the [Curriculum Learning](/tutorials/curriculum-learning/) tutorial.
Please refer to the [Curriculum Learning](/tutorials/curriculum-learning/) tutorial. Note that the Data Efficiency Library above provides more general curriculum learning support. This legacy curriculum learning feature is still supported, but we recommend using the Data Efficiency Library.
## Performance Analysis and Debugging
......
---
title: "DeepSpeed Data Efficiency: A composable library that makes better use of data, increases training efficiency, and improves model quality"
excerpt: ""
date: 2022-12-12 00:09:00
tags: training
---
[ ![DeepSpeed Data Efficiency](/assets/images/data_efficiency/data_efficiecy_fig0.png) ](/assets/images/data_efficiency/data_efficiecy_fig0.png){: .align-center}
Recently, large-scale deep learning models are empowering us to achieve more in many ways, such as [improving programming efficiency by code generation](https://github.com/features/copilot) and [providing art inspiration by text-to-image generation](https://www.microsoft.com/en-us/microsoft-365/blog/2022/10/12/new-tools-from-microsoft-365-edge-and-bing-bridge-the-gap-between-productivity-and-creativity/). To enable these services and keep improving the quality, deep learning model architecture evolves rapidly, and the model size is also growing at a tremendous speed. For example, from GPT to GPT-3 the model size increased 1500x in 2 years. The increasing model size leads to unprecedented training cost, making it challenging for many AI practitioners to train their own models. On the other hand, a less-emphasized perspective is that **data scale is actually increasing at a similar speed as model scale, and the training cost is proportional to both of them.** In Figure 1 below we plot the model and data scales of several representative language models in the last 5 years. From the oldest model on the left to the newest models on the right, both the model and data scales increase at similar speed. This demonstrates the importance of improving data efficiency: achieve same model quality with less data and reduced training cost, or achieve better model quality with the same amount of data and similar training cost.
[ ![Model and data scales](/assets/images/data_efficiency/data_efficiecy_fig1.png) ](/assets/images/data_efficiency/data_efficiecy_fig1.png){: .align-center}
*Figure 1: Model scale (number of parameters) and data scale (number of tokens consumed during training) of representative language models in the last 5 years.*
There are two popular research directions among existing data efficiency techniques: Data sampling techniques aim to improve the convergence speed by sampling the most suitable next data batch from the whole data pool; Data routing techniques aim to reduce the computation by routing each data to only a subset of the model components. These techniques improve data and training efficiency, but existing solutions on them have limitations on **extensibility, flexibility, and composability.** They are commonly designed for specific training tasks, making them hard to be extended with customized strategies and making them less flexible to be applied on diverse workloads from different users. Furthermore, different techniques are implemented separately, making it challenging to compose multiple solutions to further improve data and training efficiency.
To address these challenges, we, the DeepSpeed team as part of Microsoft’s [AI at Scale](https://www.microsoft.com/en-us/research/project/ai-at-scale/) initiative, are proud to announce **DeepSpeed Data Efficiency Library** – a composable framework that makes better use of data, increases training efficiency, and improves model quality. DeepSpeed Data Efficiency takes extensibility, flexibility, and composability into consideration, and it specifically demonstrates the following innovations:
**Efficient data sampling via curriculum learning.** Curriculum learning (CL) improves data efficiency by sampling from easier data. We present a general curriculum learning library which enables users to employ curriculum learning to their models at **maximum extensibility**: users can easily analyze, index, and sample their training data based on various customizable strategies. Using this library, we were able to explore different CL strategies for GPT-3 and BERT pretraining and identify the best solution that provides up to **1.5x data saving** while still maintaining similar model quality.
**Efficient data routing via random layerwise token dropping.** We present a novel data routing technique called random layerwise token dropping (random-LTD) to skip the computation of a subset of the input tokens at all middle layers. Random-LTD employs a simple yet effective routing strategy and requires **minimal model architecture change.** It is **flexible** to apply random-LTD to various tasks (GPT-3/BERT pretraining and GPT/ViT finetuning), and we achieve great data efficiency improvement (up to **1.5x data saving** while still maintaining the model quality).
**Seamlessly composing multiple methods.** The proposed DeepSpeed Data Efficiency framework seamlessly composes the curriculum learning and random-LTD techniques, and only requires minimal changes on the user code side. Furthermore, by composing both methods we can achieve even better data and training efficiency: for GPT-3 1.3B pretraining, we achieve **2x data and 2x time savings** together with better or similar model quality compared to the baseline training. When using the same amount of data, our approach further improves the model quality over the baseline. Users can also extend and contribute to the library by adding additional data efficiency techniques to compose together.
Each of these advances is explored further in the blog post below. For more about the technical details, please read our papers, “[Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers](https://arxiv.org/abs/2211.11586)” which describes the random-LTD technique, and “[DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing](https://arxiv.org/abs/2212.03597)” which describes the curriculum learning technique and overall DeepSpeed Data Efficiency framework.
# Efficient Data Sampling via Curriculum Learning
## Motivation
Curriculum learning aims to improve training convergence speed by presenting relatively easier or simpler examples earlier during training. Building a curriculum learning solution usually requires two components: the difficulty metric (i.e., how to quantify the difficulty of each data sample) and the pacing function (i.e., how to decide the curriculum difficulty range when sampling the next training data batch). Curriculum learning has been successfully applied to various training tasks, and last year we also released a specific curriculum learning technique (sequence length warmup) for GPT-style model pretraining (see technical details in our paper “[The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models](https://openreview.net/forum?id=JpZ5du_Kdh)” published in NeurIPS 2022). However, one common limitation among existing works is that there does not exist a generalized and extensible curriculum learning library that allows practitioners to easily apply custom curriculum difficulty metrics, combinations of metrics, and pacing functions.
## Design
To solve the limitation of existing solutions, we design and implement a general curriculum learning library emphasizing the extensibility. It consists of three components as shown in Figure 2 below (top part). First, we use a data analyzer to perform the offline CPU-only data analysis which indexes the whole data pool based on any difficulty metric such as the sequence length, the vocabulary rarity, or anything defined by user. Next, during training, the curriculum scheduler determines the difficulty threshold for the current step based on a pacing function such as linear, rooted, or any strategy provided by users. Then the data sampler will sample the data with desired difficulty from the indexed data pool. Overall, this general implementation would enable users to explore curriculum learning on their workloads with maximum customizability (more technical details in [our DeepSpeed Data Efficiency paper](https://arxiv.org/abs/2212.03597)).
[ ![DeepSpeed Data Efficiency framework](/assets/images/data_efficiency/data_efficiecy_fig2.png) ](/assets/images/data_efficiency/data_efficiecy_fig2.png){: .align-center}
*Figure 2: Design of the DeepSpeed Data Efficiency framework.*
## Evaluation Results
Using this general and extensible curriculum learning solution for GPT-3 and BERT-Large model pretraining, we are able to easily analyze and index the huge training data based on up to 7 difficulty metrics and enable better data and training efficiency. For GPT-3 pretraining, our solution with the best difficulty metric (combination of truncation-based sequence length and vocabulary rarity) achieves 1.5x data and training cost saving while still maintaining model quality as baseline (Table 1 Case (8) vs. (1)). For BERT-Large pretraining, our solution with the best difficulty metric (vocabulary rarity) achieves 1.4x saving while still maintaining model quality (Table 2 Case (7) vs. (1)). On the other hand, our solutions can further improve model quality when using the same amount of data as baseline (Table 1 Case (2) to (6), Table 2 Case (2) to (6)).
| **Case** | **Pretrain data** | **Avg 0-shot accuracy** | **Avg 10-shot accuracy** |
| ---------- |---------- |---------- |---------- |
| (1) Baseline | 300B | 42.5 | 44.0 |
| (2) CL truncation-based sequence length | 300B | 43.4 | 44.8 |
| (3) CL reshape-based sequence length | 300B | 43.0 | 44.5 |
| (4) CL vocabulary rarity | 300B | 42.3 | 44.5 |
| (5) CL combining (2) and (4) | 300B | **43.6** | **44.9** |
| (6) CL combining (3) and (4) | 300B | 43.0 | 44.4 |
| (7) Baseline | 200B (1.5x) | 41.9 | 44.0 |
| (8) CL combining (2) and (4) | **200B (1.5x)** | 42.7 | 44.5 |
*Table 1: GPT-3 1.3B pretraining data consumption and average evaluation accuracy on 19 tasks.*
| **Case** | **Pretrain data** | **Avg finetune accuracy** |
| ---------- |---------- |---------- |
| (1) Baseline | 1049B | 85.42 |
| (2) CL truncation-based sequence length | 1049B | 85.77 |
| (3) CL reorder-based sequence length | 1049B | 85.46 |
| (4) CL vocabulary rarity | 1049B | **86.13** |
| (5) CL combining (2) and (4) | 1049B | 85.8 |
| (6) CL combining (3) and (4) | 1049B | 85.61 |
| (7) CL vocabulary rarity | **734B (1.4x)** | 85.59 |
*Table 2: BERT-Large pretraining data consumption and average finetuning accuracy on 4 tasks.*
# Efficient Data Routing via Random Layerwise Token Dropping
## Motivation
Standard data routing usually feeds full images/sequences into all layers of a model. However, this process may not be optimal for training efficiency, since some parts of an image (or words of a sentence) do not require frequent feature updates. As such, token dropping methods have been proposed, as illustrated in Figure 3 (b) below, to skip the compute of some tokens/words of a sentence (i.e., the G-2 tokens in Figure 3 (b)) in order to save compute cost.
Although existing methods show promising results, they also exhibit several caveats: (1) most works solely focus on BERT (encoder-only on text data) pretraining and do not include decoder pretraining and/or other modalities (e.g., images); (2) the ability to skip layers is limited, which bounds the total amount of compute saving. By analyzing existing methods, we found that the main issue limiting their skipping and coverage abilities is likely the loss of the attention mechanism for G-2 tokens in all skipped layers, since multi-head attention focuses on different tokens at different layer depths and the attention map aligns with the dependency relation most strongly in the middle of the transformer architecture.
## Design
To resolve this main issue, we propose random-LTD, a **random** and **layerwise** token dropping mechanism, which processes only a subset of tokens among the entire data batch for all middle layers in order to save compute cost (see more details in [our Random-LTD paper](https://arxiv.org/abs/2211.11586)). As such, each token rarely bypasses all middle layers and its dependency with other tokens can be captured by the model. The illustration of random-LTD compared to baseline is shown in Figure 3 below, where random-LTD splits the input tokens into two groups and only the first group involves the compute.
[ ![random-LTD](/assets/images/data_efficiency/data_efficiecy_fig3.png) ](/assets/images/data_efficiency/data_efficiecy_fig3.png){: .align-center}
*Figure 3: Comparison between baseline, existing token dropping methods, and random-LTD. Note that for random-LTD, only part of the inputs (Group 1) is used for Layer i.*
Random-LTD is simple yet very effective. Compared to other existing token dropping methods, random-LTD (1) performs a purely random selection of the two groups at each layer, so no expert-designed selection criterion is required; (2) can be applied to all middle layers to achieve a better saving ratio; (3) generalizes well to both encoder and decoder models; and (4) is easy to use without significant model changes. These advantages provide maximum flexibility when applying random-LTD to various workloads.
## Evaluation Results
Thanks to its flexibility, we were able to apply the random-LTD method to a broad range of applications, including BERT and GPT pretraining as well as ViT and GPT finetuning tasks. In all cases, random-LTD achieves similar model quality to the baseline while using less data, and/or achieves better model quality while using the same amount of data (Tables 3 to 6). For GPT-3 and BERT-Large pretraining, random-LTD achieves a 1.5x data saving while maintaining the same model quality. For GPT-3 we also tested random-LTD with full data, which further improves model quality compared to the baseline.
| **Case** | **Pretrain data** | **Avg 0-shot accuracy** |
| ---------- |---------- |---------- |
| (1) Baseline | 300B | 42.5 |
| (2) Random-LTD | 300B | **43.7** |
| (3) Random-LTD | **200B (1.5x)** | 42.5 |
*Table 3: GPT-3 1.3B pretraining data consumption and average evaluation accuracy on 19 tasks.*
| **Case** | **Pretrain data** | **Avg finetune accuracy** |
| ---------- |---------- |---------- |
| (1) Baseline | 1049B | 85.42 |
| (2) Random-LTD | **723B (1.5x)** | **86.42** |
*Table 4: BERT-Large pretraining data consumption and average finetuning accuracy on 4 tasks.*
| **Case** | **Train data** | **ImageNet Top-1 Acc** |
| ---------- |---------- |---------- |
| (1) Baseline | 100% | 84.65 |
| (2) Random-LTD | **77.7% (1.3x)** | **84.70** |
*Table 5: Finetuning result of ViT on ImageNet.*
| **Case** | **Train data** | **PTB PPL** |
| ---------- |---------- |---------- |
| (1) Baseline | 100% | 16.11 |
| (2) Random-LTD | 100% | **15.90** |
*Table 6: GPT-2 350M finetuning result on the PTB task.*
# Composing Data Efficiency Techniques to Achieve More
The curriculum learning and random-LTD techniques are complementary. Inside the DeepSpeed Data Efficiency framework, we seamlessly compose the two techniques as shown in Figure 2 above, where curriculum learning helps sample the next data batch and random-LTD helps decide how to route each sampled batch inside the model. DeepSpeed Data Efficiency handles several complexities of composing the two techniques so that users can easily apply either technique or both to their training pipelines. This composability also applies to data sampling and routing techniques in general, providing a platform on which to implement and compose additional data efficiency techniques.
The composed DeepSpeed Data Efficiency solution leverages both data efficiency techniques and achieves even better data and training efficiency. Taking GPT-3 pretraining as an example, composing CL and random-LTD with 100% of the data leads to the best model quality in our experiments (Table 7, Cases (1) to (4)). When pretraining with 50% of the data, baseline training yields worse zero-shot and 10-shot evaluation accuracy, and using either CL or random-LTD alone only recovers part of the 10-shot accuracy loss. The composed data efficiency solution, on the other hand, achieves the same or better accuracy than the baseline with 100% data, demonstrating a 2x data and 2x time saving (Cases (5) to (8)).
| **Case** | **Pretrain data** | **Pretrain time (on 64 V100)** | **Avg 0-shot accuracy** | **Avg 10-shot accuracy** |
| ---------- |---------- |---------- |---------- |---------- |
| (1) Baseline | 300B | 260hr | 42.5 | 44.0 |
| (2) CL best metric | 300B | 259hr | 43.6 | 44.9 |
| (3) random-LTD | 300B | 263hr | 43.7 | 44.9 |
| (4) CL + random-LTD | 300B | 260hr | **43.8** | **45.1** |
| (5) Baseline | 150B (2x) | 130hr (2x) | 42.0 | 42.7 |
| (6) CL best metric | 150B (2x) | 129hr (2x) | 42.6 | 43.7 |
| (7) random-LTD | 150B (2x) | 131hr (2x) | 42.7 | 43.5 |
| (8) CL + random-LTD | **150B (2x)** | **130hr (2x)** | 42.8 | 44.0 |
*Table 7: GPT-3 1.3B pretraining data/time consumption and average evaluation accuracy on 19 tasks.*
# Concluding Remarks
We are very excited to share the DeepSpeed Data Efficiency library with the community and to improve it with your feedback. Please find the code, tutorial, and documentation at the [DeepSpeed GitHub](https://github.com/microsoft/DeepSpeed) and [website](/tutorials/data-efficiency/). For more technical details, please read our [Random-LTD paper](https://arxiv.org/abs/2211.11586) and [DeepSpeed Data Efficiency paper](https://arxiv.org/abs/2212.03597). We believe that our composable library and novel data efficiency techniques will help users reduce training cost while maintaining model quality, or achieve better quality at similar cost. We also hope DeepSpeed Data Efficiency becomes a platform that motivates and accelerates future research on deep learning data efficiency.
......@@ -3,6 +3,10 @@ title: "Curriculum Learning: A Regularization Method for Efficient and Stable Bi
tags: training pre-training
---
**Watch out!**
On 12/12/2022, we released the DeepSpeed Data Efficiency Library, which provides more general curriculum learning support. The legacy curriculum learning feature below is still supported, but we recommend using the Data Efficiency Library.
{: .notice--warning}
**Note:**
This tutorial was updated on 10/29/2021. Changes include: 1) A more detailed tuning strategy. 2) Pipeline parallelism support. 3) Token-based learning rate decay. 4) A new GPT-2 example at [github.com/microsoft/Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed). See details below.
{: .notice--info}
......
---
title: "DeepSpeed Data Efficiency: A composable library that makes better use of data, increases training efficiency, and improves model quality"
tags: training pre-training
---
**What is DeepSpeed Data Efficiency:** DeepSpeed Data Efficiency is a library purpose-built to make better use of data, increase training efficiency, and improve model quality.
**Why use DeepSpeed Data Efficiency:** DeepSpeed Data Efficiency offers novel data efficiency techniques to achieve better training efficiency and/or better model quality. DeepSpeed Data Efficiency takes extensibility, flexibility, and composability into consideration, making it easier to customize the techniques, apply them to various training tasks, and compose multiple techniques together. We highly recommend also reading [our blog](https://www.deepspeed.ai/2022/12/12/data-efficiency.html) to learn more about (at a high level) why we built DeepSpeed Data Efficiency and what benefits it provides to users. Additional technical details can be found in our papers, “[Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers](https://arxiv.org/abs/2211.11586)”, which describes the random-LTD technique, and “[DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing](https://arxiv.org/abs/2212.03597)”, which describes the curriculum learning technique and the overall DeepSpeed Data Efficiency framework.
**How to use DeepSpeed Data Efficiency:** In the following tutorial, the first two sections describe the data efficiency techniques supported by the library. The third section describes how to compose the two techniques to achieve even better training efficiency/model quality.
## 1. Curriculum Learning
### 1.1 What is Curriculum Learning
Curriculum learning (proposed by [Yoshua Bengio et al.](https://dl.acm.org/doi/abs/10.1145/1553374.1553380)) aims to improve training convergence speed by presenting relatively easier or simpler examples earlier during training. Building a curriculum learning solution usually requires two components: a difficulty metric (i.e., how to quantify the difficulty of each data sample) and a pacing function (i.e., how to decide the curriculum difficulty range when sampling the next training data batch).
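As a concrete illustration of the pacing function, the sketch below follows the shape of the `fixed_root` schedule used in DeepSpeed's curriculum learning configuration (the names `total_curriculum_step`, `difficulty_step`, and `root_degree` mirror that configuration). Treat it as a conceptual example under those assumptions, not the library's internal implementation.
```python
def fixed_root_pacing(step,
                      min_difficulty,
                      max_difficulty,
                      total_curriculum_step,
                      difficulty_step=1,
                      root_degree=2):
    """Conceptual pacing function: maps the global training step to the
    maximum accepted difficulty, growing as (step/total)^(1/root_degree)."""
    if step >= total_curriculum_step:
        return max_difficulty
    progress = (step / total_curriculum_step) ** (1.0 / root_degree)
    difficulty = min_difficulty + (max_difficulty - min_difficulty) * progress
    # Round down to a multiple of difficulty_step for stable bucketing.
    difficulty = int(difficulty) // difficulty_step * difficulty_step
    return max(min_difficulty, difficulty)

# Example: with min=2, max=10, 8 curriculum steps, and root_degree=1 (linear),
# the accepted difficulty grows from 2 to 10 over the first 8 steps.
for s in range(0, 10, 2):
    print(s, fixed_root_pacing(s, 2, 10, 8, difficulty_step=2, root_degree=1))
```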
### 1.2 When to use Curriculum Learning
Curriculum learning has been successfully applied to various training tasks (see, for example, [this survey paper](https://arxiv.org/abs/2010.13166)), and last year we also released a specific curriculum learning technique (sequence length warmup) for GPT-style model pretraining (see technical details in our paper “[The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models](https://openreview.net/forum?id=JpZ5du_Kdh)” published in NeurIPS 2022 and the [tutorial for this legacy curriculum learning feature](/tutorials/curriculum-learning/)). The new general curriculum learning library inside DeepSpeed Data Efficiency enables users to apply curriculum learning to their models with **maximum extensibility**: users can easily analyze, index, and sample their training data based on various customizable strategies. Using this library, we were able to explore different CL strategies for GPT-3 and BERT pretraining and identify the best solution, which provides up to **1.5x data saving** while still maintaining similar model quality.
### 1.3 How to use Curriculum Learning
#### 1.3.1 GPT-3 and BERT pretraining
The `examples/data_efficiency` directory in our [Megatron-DeepSpeed repo](https://github.com/microsoft/Megatron-DeepSpeed) includes our examples of how to apply curriculum learning to GPT-3 and BERT pretraining. There are 3 steps: data analysis, pretraining, and eval/finetuning.
**Data analysis:** Curriculum learning requires a data analysis before pretraining that calculates the difficulty of each data sample (based on the metric provided by the user) and builds an index that maps each difficulty value to the corresponding data samples. (There are exceptions: for example, the truncation-based sequence length metric can be achieved by data postprocessing and does not require this analysis.) We provide a data analyzer to perform this offline, CPU-only data analysis.
`examples/data_efficiency/gpt/ds_analyze_*.sh` and `examples/data_efficiency/bert/ds_analyze_*.sh` are example data analysis scripts for GPT-3 and BERT. Our data analyzer employs a simple Map-Reduce scheme. First, at the Map stage, `ds_analyze_*_data_map.sh` is used to split the dataset and compute the difficulty value of each data sample. The user needs to provide a function that computes the metric (we implement ours in `examples/data_efficiency/analyze_data.py`), the raw training dataset, and other configurations such as the number of CPU nodes and the number of threads per node. The data analyzer then automatically splits the dataset based on the number of workers, computes the difficulty values in a batched fashion, and writes the results to two indexes: one index maps each data sample to its difficulty value, and the other maps each distinct difficulty value to the corresponding samples. Second, at the Reduce stage, `ds_analyze_*_data_reduce.sh` is used to merge the index files produced by all workers. Note that, in order to speed up the Map stage through distribution while still being able to merge all the output, the Map stage can generate a large number of output files, proportional to the number of CPU nodes, the number of threads per node, and the number of possible metric values. To avoid generating too many output files, we recommend starting with a smaller number of nodes/threads (the output log provides an estimate of the required time so you can judge whether to increase the number of workers) and limiting the number of possible difficulty values when designing your difficulty metric (in our experience, a few thousand distinct values is already sufficient to enjoy the benefit of curriculum learning).
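For reference, a difficulty metric is just a function from a tokenized sample to a (preferably coarsely quantized) scalar. The hypothetical sketch below illustrates the two kinds of metrics discussed in this tutorial, sequence length and vocabulary rarity, using a toy token-frequency table; the metric implementations we actually use live in `examples/data_efficiency/analyze_data.py`.
```python
import math
from collections import Counter

def seqlen_metric(token_ids):
    """Difficulty = number of tokens (longer sequences are harder)."""
    return len(token_ids)

def vocab_rarity_metric(token_ids, token_frequency, quantization=100):
    """Difficulty = average negative log-frequency of the tokens, quantized
    to a limited number of distinct values so the Reduce stage does not have
    to merge an excessive number of difficulty buckets."""
    total = sum(token_frequency.values())
    rarity = sum(-math.log(token_frequency.get(t, 1) / total)
                 for t in token_ids) / len(token_ids)
    return int(rarity * quantization)

# Toy usage with a hypothetical corpus frequency table.
token_frequency = Counter({0: 1000, 1: 500, 2: 10, 3: 1})
sample = [0, 1, 2, 2, 3]
print(seqlen_metric(sample), vocab_rarity_metric(sample, token_frequency))
```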
**Pretraining:** `examples/data_efficiency/gpt/pretrain` and `examples/data_efficiency/bert/pretrain` include the example pretraining scripts with the curriculum learning feature. Several changes are needed to enable curriculum learning during pretraining: (1) The user needs to provide a DeepSpeed json config file that includes the configurations for curriculum learning (see the [list of configurations](/docs/config-json/#data-efficiency) for details). We provide tested example configurations in `examples/data_efficiency/gpt/pretrain/ds_pretrain_gpt_1.3B_dense_run.sh` and `examples/data_efficiency/bert/pretrain/ds_pretrain_bert_336M_run.sh`. (2) When initializing the DeepSpeed engine via `deepspeed.initialize`, the user needs to provide the training dataset and use the dataloader returned by the initialization (this dataloader includes the curriculum learning capability). We provide an example implementation of this change in the `setup_model_and_optimizer` function of `megatron/training.py`. (3) If the curriculum learning metric requires data postprocessing (such as truncation-based sequence length), the user needs to use the DeepSpeed engine's `set_data_post_process_func` API to provide the postprocessing function. We provide an example implementation of this change in `megatron/training.py`, `pretrain_bert.py`, and `pretrain_gpt.py`. (4) If the curriculum learning metric requires a custom scheduling strategy (pacing function), the user needs to use the DeepSpeed engine's `set_custom_curriculum_learning_schedule` API to provide a function that updates the maximum accepted difficulty during training. The DeepSpeed engine provides the global train step as input to this callback function.
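To make these changes more concrete, the minimal sketch below shows the shape of the curriculum learning section of the DeepSpeed config and how the dataloader returned by `deepspeed.initialize` is used. The config keys mirror those exercised by the unit test added in this PR; `model`, `train_dataset`, `my_post_process_func`, the metric name `"my_metric"`, and the index paths are placeholders, and the full set of options is listed in the [configuration documentation](/docs/config-json/#data-efficiency).
```python
import deepspeed

ds_config = {
    "train_batch_size": 2,
    "optimizer": {"type": "Adam", "params": {"lr": 0.00015}},
    "data_efficiency": {
        "enabled": True,
        "seed": 1234,
        "data_sampling": {
            "enabled": True,
            "num_workers": 0,
            "curriculum_learning": {
                "enabled": True,
                "data_cluster_path": "data_clusters",
                "curriculum_metrics": {
                    # "my_metric" is a placeholder; the index paths point to
                    # the files produced by the offline data analysis step.
                    "my_metric": {
                        "index_to_sample_path": "analysis/index_to_sample",
                        "index_to_metric_path": "analysis/index_to_metric",
                        "difficulty_type": "value",
                        "clustering_type": "single_cluster",
                        "min_difficulty": 2,
                        "max_difficulty": 10,
                        "schedule_type": "fixed_root",
                        "schedule_config": {
                            "total_curriculum_step": 8,
                            "difficulty_step": 2,
                            "root_degree": 1
                        }
                    }
                }
            }
        }
    }
}

# `model` and `train_dataset` are assumed to be defined by your training
# script. The returned data_loader provides the curriculum-ordered batches.
model_engine, _, data_loader, _ = deepspeed.initialize(
    config=ds_config,
    model=model,
    training_data=train_dataset,
    model_parameters=model.parameters())

# Optional: register a postprocessing function (e.g., truncation-based
# sequence length) via the engine API described in step (3).
model_engine.set_data_post_process_func(my_post_process_func)
```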
**Eval/finetuning:** `examples/data_efficiency/gpt/eval/` and `examples/data_efficiency/bert/finetune` include the example scripts for the GPT-3 model's zero-/few-shot evaluation and the BERT model's finetuning. Our [paper](https://arxiv.org/abs/2212.03597) includes the reference eval/finetune results you should expect if you follow our example scripts for pretraining/eval/finetuning.
## 2. Random layerwise token dropping (random-LTD)
### 2.1 What is random-LTD
Random-LTD is an efficient token dropping method applied to each layer with random assignment. Precisely, for each layer, compared to the baseline, random-LTD randomly selects a subset of the tokens and feeds only them into the transformer layer. Afterward, we combine the layer's output with the dropped tokens to recover the full sequence length, so the next layer still receives the full sequence and can repeat this process. For more technical details please read [our random-LTD paper](https://arxiv.org/abs/2211.11586).
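The pure-PyTorch sketch below illustrates the core gather/scatter idea for a single middle layer. It is a conceptual illustration only (not the fused CUDA kernels DeepSpeed actually uses, and it ignores attention masks): a random subset of token positions is gathered, only that subset goes through the layer, and the layer output is scattered back into the full sequence.
```python
import torch

def random_ltd_layer_forward(layer, hidden_states, keep_len):
    """Conceptual random-LTD forward for one middle layer.
    hidden_states: [batch, seq_len, hidden]; keep_len: tokens to process."""
    batch, seq_len, hidden = hidden_states.shape
    perm = torch.randperm(seq_len, device=hidden_states.device)
    kept_idx = perm[:keep_len].sort().values      # Group 1: processed tokens
    dropped_idx = perm[keep_len:].sort().values   # Group 2: bypass this layer

    group1 = hidden_states[:, kept_idx, :]
    group1_out = layer(group1)                    # compute on the subset only

    # Combine: scatter the processed tokens back; dropped tokens pass through
    # unchanged, so the next layer still sees the full sequence length.
    output = hidden_states.clone()
    output[:, kept_idx, :] = group1_out
    return output

# Toy usage with a single linear "layer" standing in for a transformer block.
layer = torch.nn.Linear(16, 16)
x = torch.randn(2, 8, 16)
y = random_ltd_layer_forward(layer, x, keep_len=5)
print(y.shape)  # torch.Size([2, 8, 16])
```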
### 2.2 When to use random-LTD
When you want to pretrain or finetune a transformer-based model, it is always a good idea to try random-LTD, as it can achieve better performance than standard baseline training at the same computational cost. If you have limited resources, random-LTD achieves similar accuracy to the original baseline with up to 33.3% theoretical cost saving and up to 25.6% wall-clock time saving. In particular, if you need to train a much larger model with >=24 layers and a sequence length >=2048, our method will be much more efficient than the baseline.
### 2.3 How to use random-LTD
#### 2.3.1 GPT-3 and BERT pretraining
The `examples/data_efficiency` directory in our [Megatron-DeepSpeed repo](https://github.com/microsoft/Megatron-DeepSpeed) includes our examples of how to apply random-LTD to GPT-3 and BERT pretraining.
`examples/data_efficiency/gpt/pretrain` and `examples/data_efficiency/bert/pretrain` include the example pretraining scripts with the random-LTD feature. Several changes are needed to enable random-LTD during pretraining: (1) The user needs to provide a DeepSpeed json config file that includes the configurations for random-LTD (see the [list of configurations](/docs/config-json/#data-efficiency) for details). We provide tested example configurations in `examples/data_efficiency/gpt/pretrain/ds_pretrain_gpt_1.3B_dense_run.sh` and `examples/data_efficiency/bert/pretrain/ds_pretrain_bert_336M_run.sh`. (2) After initializing the DeepSpeed engine via `deepspeed.initialize`, the user needs to use the `convert_to_random_ltd` API to convert and wrap the model layers in order to enable the random-LTD feature. We provide an example implementation of this change in the `setup_model_and_optimizer` function of `megatron/training.py`. (3) In order for random-LTD to understand the input argument mapping of the forward function, the user needs to change all input arguments (except the hidden_states input) into keyword/named arguments. For example, in `megatron/model/transformer.py` we changed the forward function from `def forward(self, hidden_states, attention_mask, encoder_output=None, enc_dec_attn_mask=None, layer_past=None, get_key_value=False):` to `def forward(self, hidden_states, attention_mask=None, encoder_output=None, enc_dec_attn_mask=None, layer_past=None, get_key_value=False):`. (4) When saving model checkpoints (especially if the state dictionary has a non-traditional structure), the user needs to use the `remove_random_ltd_state_dict` API to convert the random-LTD-wrapped layers back to the original model layers. We provide an example implementation of this change in `megatron/model/language_model.py`.
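Putting steps (2) to (4) together, the hedged sketch below shows roughly where the two random-LTD APIs fit into a training script. The helpers `convert_to_random_ltd` and `remove_random_ltd_state_dict` are assumed to be imported from DeepSpeed's data routing module (see `megatron/training.py` and `megatron/model/language_model.py` in our examples for the exact import path and call pattern, which should be treated as authoritative); `MyTransformerLayer`, `model`, and `ds_config` are placeholders.
```python
import deepspeed

# Hedged sketch; convert_to_random_ltd and remove_random_ltd_state_dict are
# assumed to be imported already (their module path may differ by version),
# and MyTransformerLayer, model, ds_config are placeholders.

# (2) Wrap the transformer layers right after engine initialization.
model_engine, _, _, _ = deepspeed.initialize(
    config=ds_config,
    model=model,
    model_parameters=model.parameters())
model_engine = convert_to_random_ltd(model_engine, MyTransformerLayer)

# (3) The wrapped layer's forward should take everything except the hidden
#     states as keyword arguments, e.g.:
#     def forward(self, hidden_states, attention_mask=None, ...)

# (4) Strip the random-LTD wrappers from the state dict before saving a
#     checkpoint with a non-traditional structure.
state_dict = remove_random_ltd_state_dict(model_engine.module.state_dict())
```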
For eval/finetuning of the pretrained model, see [previous section](#131-gpt-3-and-bert-pretraining) about how to use our example scripts.
#### 2.3.2 GPT-2 and ViT finetuning
The `data_efficiency` directory in our [DeepSpeedExamples repo](https://github.com/microsoft/DeepSpeedExamples) includes our examples of how to apply random-LTD to GPT-2 and ViT finetuning.
Just like in the pretraining case, similar changes are required to enable random-LTD for finetuning: (1) a DeepSpeed json config file; (2) using the `convert_to_random_ltd` API to convert and wrap the model layers; and (3) using the `remove_random_ltd_state_dict` API to convert the random-LTD-wrapped layers back to the original model layers when saving checkpoints.
One can run our GPT finetuning example by:
```shell
DeepSpeedExamples/data_efficiency/gpt_finetuning$ pip install -r requirement.txt
DeepSpeedExamples/data_efficiency/gpt_finetuning$ bash ./bash_script/run_base.sh
DeepSpeedExamples/data_efficiency/gpt_finetuning$ bash ./bash_script/run_medium.sh
```
And the reference final result is:
```shell
For run_base.sh:
'step':1047, 'ppl': 23.9859276900444, 'seq_len': 1024, 'consume layer-tokens': 19534848
For run_medium.sh:
'step':1047, 'ppl': 18.569010769928337, 'seq_len': 1024, 'consume layer-tokens': 35567104
```
One can run our ViT finetuning example by:
```shell
DeepSpeedExamples/data_efficiency/vit_finetuning$ pip install -r requirement.txt
DeepSpeedExamples/data_efficiency/vit_finetuning$ bash ./bash_script/run_cifar.sh
DeepSpeedExamples/data_efficiency/vit_finetuning$ bash ./bash_script/run_imagenet.sh
```
And the reference final result is:
```shell
For run_cifar.sh:
13 epoch at time 480.6546013355255s | researved_length 197
iter 5474 | LR [0.0001]| val_acc 97.97000122070312 | layer_token 305784192
```
## 3. Composing curriculum learning and random-LTD to achieve more
### 3.1 GPT-3 and BERT pretraining
The `examples/data_efficiency` directory in our [Megatron-DeepSpeed repo](https://github.com/microsoft/Megatron-DeepSpeed) includes our examples of how to compose curriculum learning and random-LTD, and how to apply both of them to GPT-3 and BERT pretraining.
The changes needed are the same as described in the previous two sections, since DeepSpeed Data Efficiency already handles the complexity of composing the two techniques. However, note that since both random-LTD and some of the curriculum learning metrics change the sequence length, some extra code may be needed to calculate the effective sequence length at each step. We provide an example implementation of this change in the `train` function of `megatron/training.py`, where we calculate the `actual_seq_length`.
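As a rough illustration of the bookkeeping involved, the hypothetical helper below computes the layer-tokens consumed in one step when curriculum learning shortens the sequence and random-LTD processes only a subset of tokens in the layers it is applied to. It is a self-contained sketch for intuition only; the real accounting is the `actual_seq_length` calculation in `megatron/training.py`.
```python
def effective_layer_tokens(batch_size,
                           curriculum_seqlen,
                           random_ltd_keep_len,
                           num_layers,
                           num_random_ltd_layers):
    """Hypothetical helper: layer-tokens consumed in one training step.
    Curriculum learning shortens every layer's sequence to curriculum_seqlen;
    random-LTD additionally processes only random_ltd_keep_len tokens in the
    layers it is applied to."""
    full_layers = num_layers - num_random_ltd_layers
    tokens_full = full_layers * curriculum_seqlen
    tokens_ltd = num_random_ltd_layers * min(random_ltd_keep_len,
                                             curriculum_seqlen)
    return batch_size * (tokens_full + tokens_ltd)

# Example: a 24-layer model with CL sequence length 512 (out of 2048) and
# random-LTD keeping 256 tokens in 22 middle layers.
print(effective_layer_tokens(batch_size=4,
                             curriculum_seqlen=512,
                             random_ltd_keep_len=256,
                             num_layers=24,
                             num_random_ltd_layers=22))
```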
......@@ -7,11 +7,11 @@ title: "Latest News"
---
<b> DeepSpeed trained the world's most powerful language models ([MT-530B](https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/), [BLOOM](https://huggingface.co/blog/bloom-megatron-deepspeed)); [learn how](https://www.deepspeed.ai/tutorials/large-models-w-deepspeed/).</b>
* [2022/12] [DeepSpeed Data Efficiency: A composable library that makes better use of data, increases training efficiency, and improves model quality](https://www.deepspeed.ai/2022/12/12/data-efficiency.html)
* [2022/11] [Stable Diffusion Image Generation under 1 second w. DeepSpeed MII](https://github.com/microsoft/DeepSpeed-MII/tree/main/examples/benchmark/txt2img)
* [2022/10] [DeepSpeed-MII: instant speedup on 24,000+ open-source DL models with up to 40x cheaper inference](https://www.deepspeed.ai/2022/10/10/mii.html)
* [2022/09] [ZeRO-Inference: Democratizing massive model inference](https://www.deepspeed.ai/2022/09/09/zero-inference.html)
* [2022/07] [Azure and DeepSpeed empower easy-to-use and high-performance model training](https://azure.microsoft.com/en-us/blog/azure-empowers-easytouse-highperformance-and-hyperscale-model-training-using-deepspeed/)
* [2022/07] [DeepSpeed Compression: A composable library for extreme compression](https://www.microsoft.com/en-us/research/blog/deepspeed-compression-a-composable-library-for-extreme-compression-and-zero-cost-quantization/)
# Extreme Speed and Scale for DL Training and Inference
......@@ -109,17 +109,19 @@ comments.
1. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. (2019) ZeRO: memory optimizations toward training trillion parameter models. [arXiv:1910.02054](https://arxiv.org/abs/1910.02054) and [In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20)](https://dl.acm.org/doi/10.5555/3433701.3433727).
2. Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. (2020) DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. [In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '20, Tutorial)](https://dl.acm.org/doi/10.1145/3394486.3406703).
3. Minjia Zhang, Yuxiong He. (2020) Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping. [arXiv:2010.13369](https://arxiv.org/abs/2010.13369) and [NeurIPS 2020](https://proceedings.neurips.cc/paper/2020/hash/a1140a3d0df1c81e24ae954d935e8926-Abstract.html).
4. Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, Yuxiong He. (2021) ZeRO-Offload: Democratizing Billion-Scale Model Training. [arXiv:2101.06840](https://arxiv.org/abs/2101.06840).
4. Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, Yuxiong He. (2021) ZeRO-Offload: Democratizing Billion-Scale Model Training. [arXiv:2101.06840](https://arxiv.org/abs/2101.06840) and [USENIX ATC 2021](https://www.usenix.org/conference/atc21/presentation/ren-jie).
5. Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He. (2021) 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed. [arXiv:2102.02888](https://arxiv.org/abs/2102.02888) and [ICML 2021](http://proceedings.mlr.press/v139/tang21a.html).
6. Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He. (2021) ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning. [arXiv:2104.07857](https://arxiv.org/abs/2104.07857).
7. Conglong Li, Ammar Ahmad Awan, Hanlin Tang, Samyam Rajbhandari, Yuxiong He. (2021) 1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed. [arXiv:2104.06069](https://arxiv.org/abs/2104.06069).
8. Conglong Li, Minjia Zhang, Yuxiong He. (2021) Curriculum Learning: A Regularization Method for Efficient and Stable Billion-Scale GPT Model Pre-Training. [arXiv:2108.06084](https://arxiv.org/abs/2108.06084).
6. Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He. (2021) ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning. [arXiv:2104.07857](https://arxiv.org/abs/2104.07857) and [SC 2021](https://dl.acm.org/doi/abs/10.1145/3458817.3476205).
7. Conglong Li, Ammar Ahmad Awan, Hanlin Tang, Samyam Rajbhandari, Yuxiong He. (2021) 1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed. [arXiv:2104.06069](https://arxiv.org/abs/2104.06069) and [HiPC 2022](https://hipc.org/advance-program/).
8. Conglong Li, Minjia Zhang, Yuxiong He. (2021) The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models. [arXiv:2108.06084](https://arxiv.org/abs/2108.06084) and [NeurIPS 2022](https://openreview.net/forum?id=JpZ5du_Kdh).
9. Yucheng Lu, Conglong Li, Minjia Zhang, Christopher De Sa, Yuxiong He. (2022) Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam. [arXiv:2202.06009](https://arxiv.org/abs/2202.06009).
10. Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He. (2022) DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale [arXiv:2201.05596](https://arxiv.org/abs/2201.05596).
10. Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He. (2022) DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale [arXiv:2201.05596](https://arxiv.org/abs/2201.05596) and [ICML 2022](https://proceedings.mlr.press/v162/rajbhandari22a.html).
11. Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, Bryan Catanzaro. (2022) Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model [arXiv:2201.11990](https://arxiv.org/abs/2201.11990).
12. Xiaoxia Wu, Zhewei Yao, Minjia Zhang, Conglong Li, Yuxiong He. (2022) Extreme Compression for Pre-trained Transformers Made Simple and Efficient. [arXiv:2206.01859](https://arxiv.org/abs/2206.01859).
13. Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, Yuxiong He. (2022) ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers. [arXiv:2206.01861](https://arxiv.org/abs/2206.01861).
14. Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, Yuxiong He. (2022) DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. [arXiv:2207.00032](https://arxiv.org/abs/2207.00032).
12. Xiaoxia Wu, Zhewei Yao, Minjia Zhang, Conglong Li, Yuxiong He. (2022) Extreme Compression for Pre-trained Transformers Made Simple and Efficient. [arXiv:2206.01859](https://arxiv.org/abs/2206.01859) and [NeurIPS 2022](https://openreview.net/forum?id=xNeAhc2CNAl).
13. Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, Yuxiong He. (2022) ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers. [arXiv:2206.01861](https://arxiv.org/abs/2206.01861) and [NeurIPS 2022](https://openreview.net/forum?id=f-fVCElZ-G1).
14. Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, Yuxiong He. (2022) DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. [arXiv:2207.00032](https://arxiv.org/abs/2207.00032) and [SC 2022](https://dl.acm.org/doi/abs/10.5555/3571885.3571946).
15. Zhewei Yao, Xiaoxia Wu, Conglong Li, Connor Holmes, Minjia Zhang, Cheng Li, Yuxiong He. (2022) Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers. [arXiv:2211.11586](https://arxiv.org/abs/2211.11586).
16. Conglong Li, Zhewei Yao, Xiaoxia Wu, Minjia Zhang, Yuxiong He. (2022) DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing. [arXiv:2212.03597](https://arxiv.org/abs/2212.03597).
# Videos
1. DeepSpeed KDD 2020 Tutorial
......
......@@ -7,6 +7,7 @@ from .fused_adam import FusedAdamBuilder
from .fused_lamb import FusedLambBuilder
from .sparse_attn import SparseAttnBuilder
from .transformer import TransformerBuilder
from .random_ltd import RandomLTDBuilder
from .stochastic_transformer import StochasticTransformerBuilder
from .utils import UtilsBuilder
from .async_io import AsyncIOBuilder
......@@ -31,5 +32,6 @@ __op_builders__ = [
QuantizerBuilder(),
InferenceBuilder(),
SpatialInferenceBuilder(),
RandomLTDBuilder()
]
ALL_OPS = {op.name: op for op in __op_builders__}
"""
Copyright 2022 The Microsoft DeepSpeed Team
"""
from .builder import CUDAOpBuilder
class RandomLTDBuilder(CUDAOpBuilder):
BUILD_VAR = "DS_BUILD_RANDOM_LTD"
NAME = "random_ltd"
def __init__(self, name=None):
name = self.NAME if name is None else name
super().__init__(name=name)
def absolute_name(self):
return f'deepspeed.ops.{self.NAME}_op'
def extra_ldflags(self):
if not self.is_rocm_pytorch():
return ['-lcurand']
else:
return []
def sources(self):
return [
'csrc/random_ltd/pt_binding.cpp',
'csrc/random_ltd/gather_scatter.cu',
'csrc/random_ltd/slice_attn_masks.cu',
'csrc/random_ltd/token_sort.cu'
]
def include_paths(self):
includes = ['csrc/includes']
if self.is_rocm_pytorch():
from torch.utils.cpp_extension import ROCM_HOME
includes += [
'{}/hiprand/include'.format(ROCM_HOME),
'{}/rocrand/include'.format(ROCM_HOME)
]
return includes
import torch
import os
import deepspeed
from unit.common import DistributedTest
from unit.simple_model import Curriculum_SimpleModel, random_dataloader
from unit.simple_model import Curriculum_SimpleModel, SimpleModel, random_dataloader, random_dataset
class TestCurriculumScheduler(DistributedTest):
class MPU():
def __init__(self, tp_world_size):
self.rank = deepspeed.comm.get_rank()
self.world_size = deepspeed.comm.get_world_size()
self.tp_world_size = tp_world_size
for i in range(0, self.world_size, tp_world_size):
ranks = range(i, i + tp_world_size)
group = deepspeed.comm.new_group(ranks)
if self.rank in ranks:
self.tp_group = group
for i in range(0, tp_world_size):
ranks = range(i, self.world_size, tp_world_size)
group = deepspeed.comm.new_group(ranks)
if self.rank in ranks:
self.dp_group = group
def get_model_parallel_rank(self):
return self.rank % self.tp_world_size
def get_model_parallel_world_size(self):
return self.tp_world_size
def get_data_parallel_rank(self):
return self.rank // self.tp_world_size
def get_data_parallel_world_size(self):
return self.world_size // self.tp_world_size
def get_data_parallel_group(self):
return self.dp_group
def get_model_parallel_group(self):
return self.tp_group
class TestDataEfficiency(DistributedTest):
world_size = 2
def test_curriculum_learning(self):
config_dict = {
"train_batch_size": 2,
"steps_per_print": 1,
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.00015,
"weight_decay": 0.01
}
},
"gradient_clipping": 1.0,
"fp16": {
"enabled": True,
"loss_scale": 0,
"initial_scale_power": 16
},
"data_efficiency": {
"enabled": True,
"seed": 1234,
"data_sampling": {
"enabled": True,
"num_workers": 0,
"curriculum_learning": {
"enabled": True,
"data_cluster_path": "data_clusters",
"curriculum_metrics": {
"dummy_metric": {
"index_to_sample_path": "dummy",
"index_to_metric_path": "dummy",
"difficulty_type": "value",
"clustering_type": "single_cluster",
"min_difficulty": 2,
"max_difficulty": 10,
"schedule_type": "fixed_root",
"schedule_config": {
"total_curriculum_step": 8,
"difficulty_step": 2,
"root_degree": 1
}
}
}
}
}
}
}
def data_post_process(data, data_sampler_state_dict):
assert 'dummy_metric' in data_sampler_state_dict['current_difficulties']
return data
hidden_dim = 10
model = SimpleModel(hidden_dim)
dataset = random_dataset(20, hidden_dim, torch.device('cpu'), dtype=torch.half)
model, _, data_loader, _ = deepspeed.initialize(config=config_dict,
model=model,
training_data=dataset,
model_parameters=model.parameters(),
mpu=MPU(1))
if model.mpu.get_data_parallel_rank(
) == 0 and not os.path.exists('data_clusters'):
os.makedirs('data_clusters')
model.set_data_post_process_func(data_post_process)
for n, batch in enumerate(data_loader):
x = batch[0].to(torch.cuda.current_device())
y = batch[1].to(torch.cuda.current_device())
loss = model(x, y)
model.backward(loss)
model.step()
if n >= 10:
break
class TestLegacyCurriculumScheduler(DistributedTest):
world_size = 2
def test_fixed_discrete(self):
......