- 20 May, 2021 1 commit
Committed by Jeff Rasley
- 16 May, 2021 1 commit
Committed by Olatunji Ruwase
* Round robin partitioning to improve ZeRO-2 Offload CPU copy
* Formatting fixes
* Fix index issues in debug dumps
* Remove debug prints
* Code cleanup
* Remove unintended stage3.py changes
* Add TODO
- 08 May, 2021 1 commit
Committed by Olatunji Ruwase
* Unused parameters assert should be disabled by default
* Fix message
* Invert assert logic in unit test
* Change option for ignoring unused parameters
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- 30 April, 2021 1 commit
Committed by Samyam Rajbhandari
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
- 25 April, 2021 1 commit
Committed by hamlet
* Add find_unused_parameters option. Unused parameters in modules are sometimes unexpected, so add an explicit error message when they occur and an option to avoid the error: https://github.com/microsoft/DeepSpeed/issues/707
* Add find_unused_parameters option. Unused parameters in modules are sometimes unexpected, so add an explicit error message when they occur and an option to avoid the error: https://github.com/microsoft/DeepSpeed/issues/707
* Fix syntax error
* Fix yapf error
* Fix yapf error
* Fix yapf error
* Fix yapf error
* Move stage2 find_unused_parameters to config file
* Add stage2 find_unused_parameters
* Add stage2 find_unused_parameters
* Add stage2_find_unused_parameters option
* Change error msg to reflect zero_optimization config change
* Fix yapf error
* Fix yapf errors
* Change find_unused_parameters option name
* Change find_unused_parameters option name
* Change find_unused_parameters option name
* Change find_unused_parameters option name
* Change find_unused_parameters option name
* Add UnusedParametersModel for test option find_unused_parameters
* Add unit test for stage2 find_unused_parameters
* Add cpu-adam compatible check
* Remove dups import
* Trim spaces
* Fix yapf errors
* Trim spaces
* Add False Positive test check
* Fix find_unused_parameters test
* Trim spaces
* Fix yapf error
- 15 April, 2021 1 commit
Committed by Stas Bekman
* faster flatten/unflatten with apex
* switch to cpp flatten/unflatten
* style
* better comment
* missing import
* switch to build ops at run time
* fixes
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
- 27 March, 2021 1 commit
Committed by hamlet
* Fix zero stage2 cpu_offload when some trainable model parameters are skipped in training, as in https://github.com/microsoft/DeepSpeed/issues/707. When some trainable model parameters are skipped in training, their backward hooks in self.create_reduce_and_remove_grad_hooks() do not run, so they have no norm_for_param_grads
* Trim space
* Trim space
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
- 16 March, 2021 1 commit
Committed by Olatunji Ruwase
* Ensure gradients of other partitions are cleared after reduction
* Remove redundant code
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- 12 March, 2021 1 commit
Committed by Olatunji Ruwase
* Control ZeRO wall clock timers
* Disable more ZeRO3 debug prints
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- 11 March, 2021 1 commit
Committed by Stas Bekman
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- 17 February, 2021 1 commit
Committed by Cheng Li
* check none tensors when splitting buckets
- 25 November, 2020 1 commit
Committed by Olatunji Ruwase
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- 24 November, 2020 1 commit
Committed by Samyam Rajbhandari
In the absence of a model parallel group, model_parallel_allreduce should not do any reduction. This commit fixes a bug where a model parallel allreduce was performed across the world group when the model parallel group is None.
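The guard described in the 24 November, 2020 commit can be sketched as follows. This is a minimal illustration under stated assumptions, not DeepSpeed's code: the `allreduce` callable, the `fake_allreduce` helper, and the use of a list of ranks as a process group are hypothetical stand-ins so the example runs without a distributed backend.

```python
from typing import Callable, Optional, Sequence

# Hypothetical stand-in for a process group: just a sequence of rank ids.
Group = Sequence[int]


def model_parallel_allreduce(
    value: float,
    group: Optional[Group],
    allreduce: Callable[[float, Group], float],
) -> float:
    """Reduce `value` across the model parallel group, if there is one."""
    if group is None:
        # No model parallel group: return the local value unchanged instead
        # of falling through to a reduction over the default (world) group.
        return value
    return allreduce(value, group)


def fake_allreduce(value: float, group: Group) -> float:
    # Hypothetical sum-reduction that pretends every rank holds `value`.
    return value * len(group)


local_only = model_parallel_allreduce(2.0, None, fake_allreduce)
reduced = model_parallel_allreduce(2.0, [0, 1, 2], fake_allreduce)
```

The point of the explicit `None` check is that many collective APIs treat a missing group as "use the default group", which would silently reduce across every rank.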
- 21 November, 2020 1 commit
Committed by Olatunji Ruwase
* Use zero-tensors for missing gradients to avoid size mismatch
* Unit test for unbalanced gradients in ZeRO
* Formatting fixes
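The idea behind the 21 November, 2020 commit (zero placeholders for missing gradients) can be sketched in plain Python. This is an illustration only: `flatten_grads` is a hypothetical helper, and lists of floats stand in for tensors. It assumes each parameter has a known element count and that a missing gradient is represented by `None`.

```python
from typing import List, Optional


def flatten_grads(
    param_sizes: List[int],
    grads: List[Optional[List[float]]],
) -> List[float]:
    """Concatenate per-parameter gradients, zero-filling any that are missing."""
    if len(param_sizes) != len(grads):
        raise ValueError("param_sizes and grads must have the same length")
    flat: List[float] = []
    for size, grad in zip(param_sizes, grads):
        if grad is None:
            # Parameter received no gradient: substitute zeros of the same
            # size so the flat buffer keeps the expected length.
            flat.extend([0.0] * size)
        else:
            if len(grad) != size:
                raise ValueError("gradient size does not match parameter size")
            flat.extend(grad)
    return flat


flat = flatten_grads([2, 3, 1], [[1.0, 2.0], None, [5.0]])
```

Because every rank produces a buffer of the same length regardless of which parameters received gradients, a later reduction over that buffer does not hit a size mismatch.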
- 13 November, 2020 1 commit
Committed by Jeff Rasley
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
- 06 November, 2020 1 commit
Committed by Reza Yazdani
* fixing cpu-adam
* fixing copy with optimizer for data and model parallelism
* fixing cpu-adam
* fix cpu-adam
* fix cpu-adam
- 31 October, 2020 1 commit
Committed by Reza Yazdani
* add adamW to CPU-ADAM implementation
* supporting cpu-adam optimizer for zero-offload on deepspeed side
* bump DSE to match cpu-adam updates
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- 30 September, 2020 1 commit
Committed by Olatunji Ruwase
* Disable default installation of CPU Adam
* Handle cpufeature import/use errors separately
- 28 September, 2020 1 commit
Committed by Haibin Lin
- 17 September, 2020 1 commit
Committed by Haibin Lin
* Update stage2.py
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- 10 September, 2020 2 commits
Committed by Jeff Rasley
Committed by Jeff Rasley
* ZeRO-Offload (squash) (#381)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Jie <37380896+jren73@users.noreply.github.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: arashashari <arashashari@ArashMSLaptop.redmond.corp.microsoft.com>
Co-authored-by: RezaYazdaniAminabadi <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
- 02 September, 2020 1 commit
Committed by Jeff Rasley
* Sparse attn + ops/runtime refactor + v0.3.0
Co-authored-by: Arash Ashari <arashari@microsoft.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>
- 01 September, 2020 1 commit
Committed by Samyam Rajbhandari
* Adding gradient accumulation support for ZeRO Stage 2. Changing all Megatron-LM tests to also test gradient accumulation
* Gradient Accumulation support for Stage 2. Model tests added to test the feature
* formatting
* Update deepspeed_light.py: removing comment
* Update ds_config_func_bs8_zero1.json: reverting this file back. It's not needed for this PR
* defining baseline prefix
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- 14 July, 2020 1 commit
Committed by Olatunji Ruwase
* Support saving and loading ZeRO checkpoints on different data parallelism degree.
* Fix formatting
* Support checkpoint with varying GPU count in ZeRO stage 1
* Fix formatting
* Formatting fixes
* Update model tests
* Remove pprint
* Minor fix
* Fix formatting
* Update model tests
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- 07 July, 2020 1 commit
Committed by Olatunji Ruwase
* Load non-DeepSpeed checkpoints into ZeRO optimizer
* Handle parameters smaller than DP
* Formatting fixes
* Handle empty partitions
* Fix perf bug
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- 24 June, 2020 1 commit
Committed by Olatunji Ruwase
* Load non-DeepSpeed checkpoints into ZeRO optimizer
* Handle parameters smaller than DP
* Formatting fixes
- 20 June, 2020 1 commit
Committed by Samyam Rajbhandari
* Removing handle_overflow debugging code in deepspeed_utils.py
* Removing handle_overflow debugging code in deepspeed_zero_optimizer.py. Removing unnecessary overflow handle code. Not sure why it was there in the first place.
- 05 June, 2020 1 commit
Committed by Chunyang Wen
* Add log util
* replace all occurrences of print and logging
* address format
* disable propagate to avoid duplicate log
- 04 June, 2020 1 commit
Committed by eltonzheng
- 28 May, 2020 2 commits
Committed by Jeff Rasley
* add support for predivide as a flag
* add predivide json config, remove allgather_disable (as it's not currently used anymore)
Committed by Samyam Rajbhandari
* Fix for CPU memory bloating issue caused by PyTorch backward graph creation in allgather. Fixed by calling detach on tensors before calling all_gather
* Fix for CPU memory bloating issue caused by PyTorch backward graph creation in allgather. Fixed by calling detach on tensors before calling all_gather
* Fix for CPU memory bloating issue caused by PyTorch backward graph creation in allgather. Fixed by calling detach on tensors before calling all_gather
- 19 May, 2020 1 commit
Committed by Jeff Rasley
Updates for ZeRO stage 2 + ZeRO stage 1 w. RS
Co-authored-by: Tunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: Elton Zheng <eltonz@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: yuxionghe <yuxhe@microsoft.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>
- 25 April, 2020 1 commit
Committed by Olatunji Ruwase
- 21 April, 2020 1 commit
Committed by Olatunji Ruwase
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
- 03 April, 2020 1 commit
Committed by kouml
- 26 March, 2020 1 commit
Committed by Shaden Smith
- 11 March, 2020 1 commit
Committed by Samyam Rajbhandari
* Enhancement: Ability to load a checkpoint without loading the optimizer states. Unit test covers saving and loading checkpoints with fused, unfused and zero optimizers. The unit test takes about 165s
- 04 February, 2020 1 commit
Committed by Samyam Rajbhandari
Different Optimizers in DeepSpeed.