---
title: "DeepSpeed Configuration JSON"
---

### Batch Size Related Parameters

**Note:** configuring ***train\_batch\_size*** is required.
{: .notice--warning}

***train\_batch\_size***: [integer]

| Value                                                                                                                                                                                                                                                                                                                                                                             | Example |
| --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
| The effective training batch size. This is the number of data samples that leads to one step of model update. ***train\_batch\_size*** is the product of the batch size that a single GPU processes in one forward/backward pass (a.k.a., ***train\_micro\_batch\_size\_per\_gpu***), the gradient accumulation steps (a.k.a., ***gradient\_accumulation\_steps***), and the number of GPUs. | `32`    |


***train\_micro\_batch\_size\_per\_gpu***: [integer]

| Description                                                                                                                                                                                                                                                                                                                    | Default                        |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------ |
| Batch size to be processed by one GPU in one step (without gradient accumulation). When specified, ***gradient\_accumulation\_steps*** is automatically calculated using ***train\_batch\_size*** and number of GPUs. Should not be concurrently specified with ***gradient\_accumulation\_steps*** in the configuration JSON. | ***train\_batch\_size*** value |

***gradient\_accumulation\_steps***: [integer]

| Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | Default |
| -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
| Number of training steps to accumulate gradients before averaging and applying them. This feature is sometimes useful to improve scalability since it results in less frequent communication of gradients between steps. Another impact of this feature is the ability to train with larger batch sizes per GPU. When specified, ***train\_micro\_batch\_size\_per\_gpu*** is automatically calculated using ***train\_batch\_size*** and the number of GPUs. Should not be concurrently specified with ***train\_micro\_batch\_size\_per\_gpu*** in the configuration JSON. | `1`     |
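
As a minimal illustrative sketch (values assumed here, not taken from the text above): on 4 GPUs, setting ***train\_batch\_size*** to 32 and ***train\_micro\_batch\_size\_per\_gpu*** to 4 lets DeepSpeed derive ***gradient\_accumulation\_steps*** as 32 / (4 × 4) = 2.

```json
{
  "train_batch_size": 32,
  "train_micro_batch_size_per_gpu": 4
}
```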



### Optimizer Parameters

***optimizer***: [dictionary]

| Fields | Value                                                                                                                                                                                                                                                                                        | Example                      |
| ------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------- |
| type   | The optimizer name. DeepSpeed natively supports **Adam**, **AdamW**, **OneBitAdam**, and **Lamb** optimizers (See [here](https://deepspeed.readthedocs.io/en/latest/optimizers.html) for details) and will import other optimizers from [torch](https://pytorch.org/docs/stable/optim.html). | `"Adam"`                     |
| params | Dictionary of parameters to instantiate optimizer. The parameter names must match the optimizer constructor signature (e.g., for [Adam](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam)).                                                                                       | `{"lr": 0.001, "eps": 1e-8}` |

  Example of ***optimizer*** with Adam

```json
"optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.001,
      "betas": [
        0.8,
        0.999
      ],
      "eps": 1e-8,
      "weight_decay": 3e-7
    }
  }
```
The Adam optimizer also supports the following two params keys/values in addition to the standard parameters from [torch.optim.Adam](https://pytorch.org/docs/stable/_modules/torch/optim/adam.html#Adam):

| "params" key  | Description                                                                 | Default |
| ------------- | --------------------------------------------------------------------------- | ------- |
| torch\_adam   | Use torch's implementation of Adam instead of our fused Adam implementation | false   |
| adam\_w\_mode | Apply decoupled weight decay (also known as AdamW)                          | true    |
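
A minimal sketch (illustrative values, not from the text above) of where these keys go: inside the optimizer's `params` dictionary, alongside the standard Adam arguments.

```json
"optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.001,
      "weight_decay": 3e-7,
      "torch_adam": false,
      "adam_w_mode": true
    }
  }
```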

  Another example of ***optimizer*** with 1-bit Adam

```json
"optimizer": {
    "type": "OneBitAdam",
    "params": {
      "lr": 0.001,
      "betas": [
        0.8,
        0.999
      ],
      "eps": 1e-8,
      "weight_decay": 3e-7,
      "freeze_step": 400,
      "cuda_aware": false,
      "comm_backend_name": "nccl"
    }
  }
```

The 1-bit Adam optimizer supports the following three params keys/values in addition to the standard Adam parameters (learn more in our [tutorial](/tutorials/onebit-adam/)):

| "params" key        | Description                                                                        | Default |
| ------------------- | ---------------------------------------------------------------------------------- | ------- |
| freeze\_step        | Number of warm up steps before 1-bit compression gets applied to the communication | 100000  |
| cuda\_aware         | To indicate that the underlying MPI library supports CUDA-Aware communication      | false   |
| comm\_backend\_name | To indicate which backend implementation to use                                    | "nccl"  |

### Scheduler Parameters

DeepSpeed calls the `step()` method of the scheduler at every training step when `model_engine.step()` is executed.

***scheduler***: [dictionary]

| Fields | Value                                                                                                                      | Example                                        |
| ------ | -------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------- |
| type   | The scheduler name. See [here](https://deepspeed.readthedocs.io/en/latest/schedulers.html) for a list of supported schedulers.       | `"WarmupLR"`                                   |
| params | Dictionary of parameters to instantiate scheduler. The parameter names should match scheduler constructor signature.       | `{"warmup_min_lr": 0, "warmup_max_lr": 0.001}` |

Example of ***scheduler***

```json
 "scheduler": {
      "type": "WarmupLR",
      "params": {
          "warmup_min_lr": 0,
          "warmup_max_lr": 0.001,
          "warmup_num_steps": 1000
      }
  }
```

### Communication options

***fp32\_allreduce***: [boolean]

| Description                                                    | Default |
| -------------------------------------------------------------- | ------- |
| During gradient averaging, perform allreduce with 32-bit values | `false` |

***prescale\_gradients***: [boolean]

| Description                            | Default |
| -------------------------------------- | ------- |
| Scale gradients before doing allreduce | `false` |

***gradient_predivide_factor***: [float]

| Description                                                                                                                                       | Default |
| ------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
| Before gradient averaging, pre-divide gradients by a specified factor. This can sometimes help with fp16 stability when scaling to a large number of GPUs. | `1.0`   |

***sparse\_gradients***: [boolean]

| Description                                                                                                              | Default |
| ------------------------------------------------------------------------------------------------------------------------ | ------- |
| Enable sparse compression of [torch.nn.Embedding](https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding) gradients. | `false` |
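
The communication options above are top-level keys in the configuration JSON. A sketch using the default values from the tables above:

```json
{
  "fp32_allreduce": false,
  "prescale_gradients": false,
  "gradient_predivide_factor": 1.0,
  "sparse_gradients": false
}
```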

### FP16 training options

**Note:** this mode cannot be combined with the `amp` mode described below.
{: .notice--warning}

***fp16***: [dictionary]

| Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | Default |
| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
| Configuration for using mixed precision/FP16 training that leverages [NVIDIA's Apex package](https://nvidia.github.io/apex/). An example, including the available dictionary keys, is illustrated below. NOTE: this does not use Apex's AMP mode, which allows for more flexibility in mixed precision training modes; this mode is similar to AMP's O2 mode. Please see AMP support below if you want to use more complex mixed precision modes. If you want to use ZeRO (currently), you must use this mode. | None    |

```json
"fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 32,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
}
```

***fp16:enabled***: [boolean]

| Description                                                                            | Default |
| -------------------------------------------------------------------------------------- | ------- |
| ***enabled*** is a **fp16** parameter indicating whether or not FP16 training is enabled. | `false` |

***fp16:loss\_scale***: [float]

| Description                                                                                                                                                                                                                  | Default |
| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
| ***loss\_scale*** is a ***fp16*** parameter representing the loss scaling value for FP16 training. The default value of 0.0 results in dynamic loss scaling, otherwise the value will be used for static fixed loss scaling. | `0.0`   |

***fp16:initial\_scale\_power***: [integer]

| Description                                                                                                                                                                                       | Default |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
| ***initial\_scale\_power*** is a **fp16** parameter representing the power of the initial dynamic loss scale value. The actual loss scale is computed as 2<sup>***initial\_scale\_power***</sup>. | `32`    |
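
For example, with the default ***initial\_scale\_power*** of `32`, the initial dynamic loss scale is 2<sup>32</sup>.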

***fp16:loss\_scale\_window***: [integer]

| Description                                                                                                                       | Default |
| --------------------------------------------------------------------------------------------------------------------------------- | ------- |
| ***loss\_scale\_window*** is a **fp16** parameter representing the window over which to raise/lower the dynamic loss scale value. | `1000`  |

***fp16:hysteresis***: [integer]

| Description                                                                                    | Default |
| ---------------------------------------------------------------------------------------------- | ------- |
| ***hysteresis*** is a **fp16** parameter representing the delay shift in dynamic loss scaling. | `2`     |

***fp16:min\_loss\_scale***: [integer]

| Description                                                                                        | Default |
| -------------------------------------------------------------------------------------------------- | ------- |
| ***min\_loss\_scale*** is a **fp16** parameter representing the minimum dynamic loss scale value. | `1000`  |

### Automatic mixed precision (AMP) training options

**Note:** this mode cannot be combined with the `fp16` mode described above. In addition, this mode is not currently compatible with ZeRO.
{: .notice--warning}

***amp***: [dictionary]

| Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | Default |
| ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
| Configuration for using automatic mixed precision (AMP) training that leverages [NVIDIA's Apex AMP package](https://nvidia.github.io/apex/). An example, including the available dictionary keys, is illustrated below. It is not compatible with the `fp16` mode above or with ZeRO. Any parameters outside of "enabled" will be passed to AMP's initialize call; see the API and descriptions in the [apex.amp.initialize documentation](https://nvidia.github.io/apex/amp.html#apex.amp.initialize). | None    |

```json
"amp": {
    "enabled": true,
    ...
    "opt_level": "O1",
    ...
}
```

***amp:enabled***: [boolean]

| Description                                                                              | Default |
| ---------------------------------------------------------------------------------------- | ------- |
| ***enabled*** is an **amp** parameter indicating whether or not AMP training is enabled. | `false` |

***amp params***: [various]

| Description                                                                                                                                                                                                            | Default |
| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
| Any parameters outside of "enabled" will be passed to AMP's initialize call; see the API and descriptions in the [apex.amp.initialize documentation](https://nvidia.github.io/apex/amp.html#apex.amp.initialize). | None    |

### Gradient Clipping

***gradient\_clipping***: [float]

| Description                         | Default |
| ----------------------------------- | ------- |
| Enable gradient clipping with the specified value | `0`     |
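
A minimal sketch (illustrative value, not from the text above) of enabling gradient clipping in the configuration JSON:

```json
{
  "gradient_clipping": 1.0
}
```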



### ZeRO Optimizations for FP16 Training

Enabling and configuring ZeRO memory optimizations
```json
  "zero_optimization": {
    "stage": [0|1|2|3],
    "allgather_partitions": [true|false],
    "allgather_bucket_size": 5e8,
    "overlap_comm": false,
    "reduce_scatter": [true|false],
    "reduce_bucket_size": 5e8,
    "contiguous_gradients" : [true|false],
    "cpu_offload": [true|false],
    "cpu_offload_params" : [true|false],
    "cpu_offload_use_pin_memory" : [true|false],
    "stage3_max_live_parameters" : 1e9,
    "stage3_max_reuse_distance" : 1e9,
    "stage3_prefetch_bucket_size" : 5e8,
    "stage3_param_persistence_threshold" : 1e6,
    "sub_group_size" : 1e12,
    "elastic_checkpoint" : [true|false],
    "stage3_gather_fp16_weights_on_model_save": [true|false]
    }
```

***zero\_optimization***: [dictionary]

| Description                                                                                               | Default |
| --------------------------------------------------------------------------------------------------------- | ------- |
| Enable ZeRO memory optimization wrapper for FP16 Training. Currently compatible only with Adam optimizer. | `false` |

***stage***: [integer]

| Description                                                                                                                                                                                                               | Default |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
| Chooses different stages of the ZeRO Optimizer. Stage 0, 1, 2, and 3 refer to disabled, optimizer state partitioning, optimizer+gradient state partitioning, and optimizer+gradient+parameter partitioning, respectively. | `0`     |

***allgather_partitions***: [boolean]

| Description                                                                                                                                      | Default |
| ------------------------------------------------------------------------------------------------------------------------------------------------ | ------- |
| Chooses between allgather collective or a series of broadcast collectives to gather updated parameters from all the GPUs at the end of each step | `true`  |

***allgather_bucket_size***: [integer]

| Description                                                                                                  | Default |
| ------------------------------------------------------------------------------------------------------------ | ------- |
| Number of elements allgathered at a time. Limits the memory required for the allgather for large model sizes | `5e8`   |

***overlap_comm***: [boolean]

| Description                                                                  | Default |
| ---------------------------------------------------------------------------- | ------- |
| Attempts to overlap the reduction of the gradients with backward computation | `false` |

***reduce_scatter***: [boolean]

| Description                                                             | Default |
| ----------------------------------------------------------------------- | ------- |
| Uses reduce or reduce scatter instead of allreduce to average gradients | `true`  |

***reduce_bucket_size***: [integer]

| Description                                                                                                         | Default |
| ------------------------------------------------------------------------------------------------------------------- | ------- |
| Number of elements reduced/allreduced at a time. Limits the memory required for the reduction for large model sizes | `5e8`   |

***contiguous_gradients***: [boolean]

| Description                                                                                                                                                     | Default |
| --------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
| Copies the gradients to a contiguous buffer as they are produced. Avoids memory fragmentation during backward pass. Only useful when running very large models. | `False` |

***cpu_offload***: [boolean]

| Description                                                                                                              | Default |
| ------------------------------------------------------------------------------------------------------------------------ | ------- |
| Enable offloading of optimizer memory and computation to CPU. This frees up GPU memory for larger models or batch sizes. | `False` |

***cpu_offload_params***: [boolean]

| Description                                                                                                                       | Default |
| --------------------------------------------------------------------------------------------------------------------------------- | ------- |
| Enable offloading of model parameters to CPU. This frees up GPU memory for larger models or batch sizes. Valid only with stage 3. | `False` |

***cpu_offload_use_pin_memory***: [boolean]

| Description                                                                              | Default |
| ---------------------------------------------------------------------------------------- | ------- |
| Use pinned CPU memory when offloading. Can improve performance. Valid only with stage 3. | `False` |

***stage3_max_live_parameters***: [integer]

| Description                                                                                                                         | Default |
| ----------------------------------------------------------------------------------------------------------------------------------- | ------- |
| The maximum number of parameters resident per GPU before releasing. Smaller values use less memory, but perform more communication. | `1e9`   |

***stage3_max_reuse_distance***: [integer]

| Description                                                                                                                                          | Default |
| ---------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
| Do not release a parameter if it will be reused within this threshold of parameters. Smaller values use less memory, but perform more communication. | `1e9`   |

***stage3_prefetch_bucket_size***: [integer]

| Description                                                                                                                            | Default |
| -------------------------------------------------------------------------------------------------------------------------------------- | ------- |
| The size of the fixed buffer for prefetching parameters. Smaller values use less memory, but can increase stalls due to communication. | `5e8`   |


***stage3_param_persistence_threshold***: [integer]

| Description                                                                                                                                                          | Default |
| -------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
| Do not partition parameters smaller than this threshold. Smaller values use less memory, but can greatly increase communication (especially latency-bound messages). | `1e6`   |


***stage3_gather_fp16_weights_on_model_save***: [boolean]

| Description                                                                                                                                                          | Default |
| -------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
| Consolidate the weights before saving the model by `save_fp16_model()`. Since the weights are partitioned across GPUs, they aren't part of `state_dict`, so this function automatically gathers the weights when this option is enabled and then saves the fp16 model weights. | `False` |
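
As a sketch of how these options can be combined (illustrative values only; see the parameter tables above for what each key controls), a ZeRO stage 2 configuration with CPU offloading of the optimizer might look like:

```json
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true,
    "cpu_offload": true
  }
```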

### Logging

***steps\_per\_print***: [integer]

| Description                    | Default |
| ------------------------------ | ------- |
| Print train loss every N steps | `10`    |

***wall\_clock\_breakdown***: [boolean]

| Description                                                             | Default |
| ----------------------------------------------------------------------- | ------- |
| Enable timing of the latency of forward/backward/update training phases | `false` |

***dump_state***: [boolean]

| Description                                                          | Default |
| -------------------------------------------------------------------- | ------- |
| Print out state information of DeepSpeed object after initialization | `false` |
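
A sketch of the logging options combined in one configuration (illustrative values, not from the text above):

```json
{
  "steps_per_print": 100,
  "wall_clock_breakdown": true,
  "dump_state": false
}
```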

### Flops Profiler
```json
{
  "flops_profiler": {
    "enabled": true,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 3,
    "detailed": true,
    }
}
```
***enabled***: [boolean]

| Description                 | Default |
| --------------------------- | ------- |
| Enables the flops profiler. | `false` |

***profile\_step***: [integer]

| Description                                                                                                     | Default |
| --------------------------------------------------------------------------------------------------------------- | ------- |
| The global training step at which to profile. Note that warm up steps are needed for accurate time measurement. | `1`     |

***module\_depth***: [integer]

| Description                                                                                                                                                            | Default |
| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
| The depth of the model at which to print the aggregated module information. When set to `-1`, it prints information on the innermost modules (with the maximum depth). | `-1`    |

***top\_modules***: [integer]

| Description                                                                  | Default |
| ---------------------------------------------------------------------------- | ------- |
| Limits the aggregated profile output to the number of top modules specified. | `3`     |

***detailed***: [boolean]

| Description                                  | Default |
| -------------------------------------------- | ------- |
| Whether to print the detailed model profile. | `true`  |

### Activation Checkpointing
```json
  "activation_checkpointing": {
    "partition_activations": false,
    "cpu_checkpointing": false,
    "contiguous_memory_optimization": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
    }
```
***partition\_activations***: [boolean]

| Description                                                   | Default |
| ------------------------------------------------------------- | ------- |
| Enables partition activation when used with model parallelism | `false` |

***cpu\_checkpointing***: [boolean]

| Description                                                                 | Default |
| --------------------------------------------------------------------------- | ------- |
| Offloads partitioned activations to CPU if partition_activations is enabled | `false` |


***contiguous\_memory\_optimization***: [boolean]

| Description                                                          | Default |
| -------------------------------------------------------------------- | ------- |
| Copies partitioned activations so that they are contiguous in memory | `false` |

***number_checkpoints***: [integer]

| Description                                                                                              | Default |
| -------------------------------------------------------------------------------------------------------- | ------- |
| Total number of activation checkpoints used to allocate memory buffer for contiguous_memory_optimization | `None`  |

***synchronize\_checkpoint\_boundary***: [boolean]

| Description                                                   | Default |
| ------------------------------------------------------------- | ------- |
| Inserts torch.cuda.synchronize() at each checkpoint boundary. | `false` |


***profile***: [boolean]

| Description                                                     | Default |
| --------------------------------------------------------------- | ------- |
| Logs the forward and backward time for each checkpoint function | `false` |

### Sparse Attention

***sparse\_attention***: [dictionary]

| Fields                           | Value                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | Example           |
| -------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------- |
| mode                             | A string determining sparsity structure type. Deepspeed currently supports `"dense"`, `"fixed"`, `"bigbird"`, `"bslongformer"`, and `"variable"`.                                                                                                                                                                                                                                                                                                                                                              | `"fixed"`         |
| block                            | An integer determining the block size. The current implementation of sparse self-attention is based on blocked sparse matrices, in which this parameter defines the size of such blocks, `Block X Block`.                                                                                                                                                                                                                                                                                                       | 16                |
| different\_layout\_per\_head     | A boolean determining if each head should be assigned a different sparsity layout; this will be satisfied based on availability.                                                                                                                                                                                                                                                                                                                                                                               | false             |
| num\_local\_blocks               | An integer determining the number of blocks in the local attention window; only used in `"fixed"` mode.                                                                                                                                                                                                                                                                                                                                                                                                        | 4                 |
| num\_global\_blocks              | An integer determining how many consecutive blocks in a local window are used as the representative of the window for global attention; used in `"fixed"` and `"bigbird"` modes.                                                                                                                                                                                                                                                                                                                               | 1                 |
| attention                        | A string determining the attention type. Attention can be `"unidirectional"`, as in autoregressive models, in which tokens attend only to tokens that appear before them in the context; in that case, the upper triangular part of the attention matrix is empty. Or it can be `"bidirectional"`, as in BERT, in which tokens can attend to any other tokens before or after them; then, the upper triangular part of the attention matrix is the mirror of the lower triangular part; used in `"fixed"` and `"variable"` modes. | `"bidirectional"` |
| horizontal\_global\_attention    | A boolean determining whether blocks that are the global representative of a local window also attend to all other blocks. This is valid only if the attention type is `"bidirectional"`. Looking at the attention matrix, this means global attention includes not only the vertical blocks but also the horizontal blocks; used in `"fixed"` and `"variable"` modes.                                                                                                                                         | false             |
| num\_different\_global\_patterns | An integer determining the number of different global attention layouts. While global attention can be fixed by which block(s) are representative of any local window, since there are multiple heads, each head can use a different global representative; used only in `"fixed"` mode.                                                                                                                                                                                                                       | 4                 |
| num\_random\_blocks              | An integer determining the number of random blocks in each block row; used in `"variable"` and `"bigbird"` modes.                                                                                                                                                                                                                                                                                                                                                                                              | 0                 |
| local\_window\_blocks            | A list of integers determining the number of blocks in each local attention window. The first number determines the number of blocks in the first local window, the second number the second window, ..., and the last number determines the number of blocks in the remaining local windows; only used in `"variable"` mode.                                                                                                                                                                                  | [4]               |
| global\_block\_indices           | A list of integers determining which blocks are considered as global attention. The given blocks are attended to by all other token blocks, and they attend to all other token blocks. Notice that if the global\_block\_end\_indices parameter is set, this parameter is used as the starting index of each global window; used in `"variable"` and `"bslongformer"` modes.                                                                                                                                   | [0]               |
| global\_block\_end\_indices      | A list of integers determining the end indices of global window blocks. By default this is not used. But if it is set, it must have the same size as the global\_block\_indices parameter; combining these two parameters, for each index i, blocks from global\_block\_indices[i] to global\_block\_end\_indices[i], exclusive, are considered as global attention; used in `"variable"` and `"bslongformer"` modes.                                                                                          | None              |
| num\_sliding\_window\_blocks     | An integer determining the number of blocks in sliding local attention window; used in `"bigbird"` and `"bslongformer"` modes.                                                                                                                                                                                                                                                                                                                                                                                 | 3                 |

  Example of ***sparse\_attention***

```json
  "sparse_attention": {
    "mode": "fixed",
    "block": 16,
    "different_layout_per_head": true,
    "num_local_blocks": 4,
    "num_global_blocks": 1,
    "attention": "bidirectional",
    "horizontal_global_attention": false,
    "num_different_global_patterns": 4,
    "num_random_blocks": 0,
    "local_window_blocks": [4],
    "global_block_indices": [0],
    "global_block_end_indices": None,
    "num_sliding_window_blocks": 3
  }
```