v0.3.10

v0.3.10 Release notes

Combined release notes since the v0.3.1 release on November 12th

Various updates to torch.distributed initialization
- New deepspeed.init_distributed API, #608, #645, #644
- Improved AzureML support for patching torch.distributed backend, #542
- Simplify dist init and only init if needed #553
Transformer kernel updates
- Support for different hidden dimensions #559
- Support for arbitrary sequence lengths #587
Elastic training support #602
- NOTE: This feature is still in initial piloting; more details to come.
Module replacement support #586
- NOTE: This will see wider use and be documented in the short term; it helps automatically inject/replace DeepSpeed ops in client models.
Remove the psutil and cpufeature dependencies #528
Various ZeRO 1 and 2 bug fixes and updates: #531, #532, #545, #548
Backwards-compatible checkpoints with older DeepSpeed v0.2 versions #543
Add static_loss_scale support to unfused optimizer #546
Bug fix for norm calculation in absence of model parallel group #551
Switch CI from Azure Pipelines to GitHub Actions
Deprecate client ability to disable gradient reduction #552
Bug fix for tracking optimizer step in cpu-adam when loading checkpoint #564
Improved support for Ampere architecture #572, #570, #577, #578, #591, #642
Fix potential random layout inconsistency issues in sparse attention modules #534
Support customizing kwargs for lr_scheduler #584
Support deepspeed.initialize with a dict configuration instead of args #632
Allow DeepSpeed models to be initialized with optimizer=None #469
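
As a minimal sketch of the dict configuration and custom lr_scheduler kwargs mentioned above, the following builds a configuration in plain Python rather than in a JSON file. The helper names `build_ds_config` and `init_engine` are hypothetical and the hyperparameter values are placeholders. The `config_params` keyword is an assumption about how this release accepts a dict, so check the signature of `deepspeed.initialize` in your installed version.

```python
import json


def build_ds_config(train_batch_size=16, lr=1e-4, warmup_steps=500):
    """Return a DeepSpeed-style config as a plain dict (placeholder values)."""
    return {
        "train_batch_size": train_batch_size,
        "optimizer": {"type": "Adam", "params": {"lr": lr}},
        "scheduler": {
            "type": "WarmupLR",
            # Scheduler keyword arguments are given under "params".
            "params": {
                "warmup_min_lr": 0.0,
                "warmup_max_lr": lr,
                "warmup_num_steps": warmup_steps,
            },
        },
    }


def init_engine(model, ds_config):
    """Hypothetical helper: hand the dict to DeepSpeed instead of a config file.

    Assumption: the dict is accepted through the `config_params` keyword.
    No client optimizer is passed; the one in the config is used instead.
    """
    import deepspeed  # imported lazily so the config helper works without it

    return deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config_params=ds_config,
    )


if __name__ == "__main__":
    # Inspect the configuration without needing DeepSpeed or a GPU.
    print(json.dumps(build_ds_config(), indent=2))
```

Keeping the configuration in a dict makes it easy to adjust values programmatically, for example in a hyperparameter sweep, before calling `init_engine(model, build_ds_config())` under the DeepSpeed launcher.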