- 20 May, 2021 1 commit
-
-
Committed by Jeff Rasley
-
- 19 May, 2021 1 commit
-
-
Committed by Olatunji Ruwase
* Align fp16 param swap buffers * Integrating swap buffer manager for fp16 params * Support swapping misaligned fp16 parameters
-
- 16 May, 2021 1 commit
-
-
Committed by Olatunji Ruwase
* Round robin partitioning to improve ZeRO-2 Offload CPU copy * Formatting fixes * Fix index issues in debug dumps * Remove debug prints * Code cleanup * Remove unintended stage3.py changes * Add TODO
-
- 14 May, 2021 4 commits
-
-
Committed by Shaden Smith
* is not -> != * Use pytest-randomly to seed unit tests.
-
Committed by Olatunji Ruwase
-
Committed by Stas Bekman
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
Committed by Stas Bekman
* [configure_distributed_model] improve assert This PR changes the 2 asserts to actually print the names of the params that are wrong. e.g.: ``` fp16 is enabled but the following parameters have dtype that is not fp16: wav2vec2.masked_spec_embed ``` * style Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
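A minimal sketch of the idea in this commit, assuming a plain list of (name, dtype) pairs rather than real model parameters. The helper name `check_param_dtypes` and the string dtypes are hypothetical and are not DeepSpeed's API; the wording of the second message is also an assumption modeled on the first.

```python
# Hypothetical helper for illustration only; not the DeepSpeed implementation.
def check_param_dtypes(named_dtypes, fp16_enabled):
    """Assert that parameter dtypes agree with the fp16 setting.

    named_dtypes: iterable of (parameter_name, dtype_string) pairs.
    """
    named_dtypes = list(named_dtypes)
    if fp16_enabled:
        # Collect the offending names so the message says which params are wrong.
        wrong = [name for name, dtype in named_dtypes if dtype != "float16"]
        assert not wrong, (
            "fp16 is enabled but the following parameters have dtype "
            f"that is not fp16: {', '.join(wrong)}"
        )
    else:
        wrong = [name for name, dtype in named_dtypes if dtype == "float16"]
        # Assumed wording for the mirrored assert.
        assert not wrong, (
            "fp16 is not enabled but the following parameters have dtype "
            f"that is fp16: {', '.join(wrong)}"
        )
```

Calling the helper with a mixed list while `fp16_enabled=True` raises an `AssertionError` whose message ends with the names of the non-fp16 parameters.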
-
- 13 May, 2021 1 commit
-
-
Committed by Cheng Li
* use the original function's name as the key to old_functions dict * update profile output format * print at global rank 0 * add flops calculation in bwd pass using time from ds timers * improve aggregated profiling out to show all depth * print samples/second * update readme and examples * update docs * fix typo and reorder printing * fix format
-
- 08 May, 2021 1 commit
-
-
Committed by Olatunji Ruwase
* Unused parameters assert should be disabled by default * Fix message * Invert assert logic in unit test * Change option for ignoring unused parameters Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
- 06 May, 2021 1 commit
-
-
Committed by Olatunji Ruwase
* NVMe intra-request validation should be on entire file Optimizer swap buffer sizes should be aligned * Add fix message for missing aio lib error.
-
- 04 May, 2021 2 commits
-
-
Committed by Stas Bekman
* fix assert The current assert "Model must initialized in fp16 mode for ZeRO Stage 3." needs TLC - I rewrote it completely to match its cousin assert, so now we have 2 consistent matching asserts: - f"fp16 is enabled but one or several model parameters have dtype that is not fp16" - f"fp16 is not enabled but one or several model parameters have dtype of fp16" * remove f
-
Committed by janEbert
Fix #1032
-
- 03 May, 2021 1 commit
-
-
Committed by Cheng Li
-
- 01 May, 2021 5 commits
-
-
Committed by Sean Naren
* Add additional conditions when checking types of output from the model * Add test * Modify test to use torch.tensor as well Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
Committed by Jiangang Zhu
Co-authored-by: Jiangang Zhu <jiangazh@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Committed by Cheng Li
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Committed by Stas Bekman
-
Committed by Stas Bekman
-
- 30 April, 2021 2 commits
-
-
Committed by Olatunji Ruwase
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Committed by Samyam Rajbhandari
Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
- 29 April, 2021 1 commit
-
-
Committed by Sean Naren
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
- 25 April, 2021 1 commit
-
-
Committed by hamlet
* Add find_unused_parameters option As unused parameters in modules may not be expected sometimes, add an explicit error msg when it occurred and an option to avoid the error: https://github.com/microsoft/DeepSpeed/issues/707 * Add find_unused_parameters option As unused parameters in modules may not be expected sometimes, add an explicit error msg when it occurred and an option to avoid the error: https://github.com/microsoft/DeepSpeed/issues/707 * Fix syntax error * Fix yapf error * Fix yapf error * Fix yapf error * Fix yapf error * Move stage2 find_unused_parameters to config file * Add stage2 find_unused_parameters * Add stage2 find_unused_parameters * Add stage2_find_unused_parameters option * Change error msg to reflect zero_optimization config change * Fix yapf error * Fix yapf errors * Change find_unused_parameters option name * Change find_unused_parameters option name * Change find_unused_parameters option name * Change find_unused_parameters option name * Change find_unused_parameters option name * Add UnusedParametersModel for test option find_unused_parameters * Add unit test for stage2 find_unused_parameters * Add cpu-adam compatible check * Remove dups import * Trim spaces * Fix yapf errors * Trim spaces * Add False Positive test check * Fix find_unused_parameters test * Trim spaces * Fix yapf error
-
- 24 April, 2021 1 commit
-
-
Committed by Olatunji Ruwase
* Use amp autocast in ZeRO3 linear * Fix typo * Handle specific exceptions * CI breaks on torch.distributed * Add autocast unit test * Format fixes * Fix skip logic Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
- 23 April, 2021 1 commit
-
-
Committed by William Buchwalter
* Fix issue where gradient_predivide_factor was called as a func. `gradient_predivide_factor` is a `float`, hence shouldn't be called as func. This crashes when `reduce_scatter` flag is set to `False`.
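The class of bug described here can be sketched as follows. This is an illustration only, assuming a hypothetical `Config` object and helper functions; it is not the DeepSpeed code path.

```python
# Hypothetical stand-in for an engine config; only the attribute name
# gradient_predivide_factor comes from the commit message.
class Config:
    gradient_predivide_factor = 2.0  # a plain float, not a method


def predivide_buggy(grad, cfg):
    # Calling a float raises: TypeError: 'float' object is not callable
    return grad / cfg.gradient_predivide_factor()


def predivide_fixed(grad, cfg):
    # Read the attribute instead of calling it.
    return grad / cfg.gradient_predivide_factor
```

With this sketch, `predivide_fixed(8.0, Config())` divides by the factor as intended, while `predivide_buggy(8.0, Config())` raises a `TypeError`.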
-
- 22 April, 2021 3 commits
-
-
Committed by sdtblck
-
Committed by Olatunji Ruwase
* Make reduce scatter optional for ZeRO-1 as workaround * Make allreduce default for ZeRO 1 Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Committed by Cheng Li
* use weird shaped tensor to avoid silent failures when not registering external params * fix typo Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
- 21 April, 2021 3 commits
-
-
Committed by Conglong Li
1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed. Author: @conglongli, @awan-10, @samyam, Hanlin Tang, Yuxiong He Paper: https://arxiv.org/abs/2104.06069 Co-authored-by: sdtblck <46172032+sdtblck@users.noreply.github.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Committed by Stas Bekman
-
Committed by Sean Naren
* Add check to see if json file is already loaded * Update doc * Address review * Remove doc comment Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
- 20 April, 2021 2 commits
-
-
Committed by Shaden Smith
-
Committed by Shaden Smith
* zinf tutorial * more megatron integration docs * ZInf + tiling docs
-
- 19 April, 2021 1 commit
-
-
Committed by Jeff Rasley
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com> Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
-
- 17 April, 2021 1 commit
-
-
Committed by Olatunji Ruwase
* Fix UnboundLocalError * Get full partition size
-
- 15 April, 2021 1 commit
-
-
Committed by Stas Bekman
* faster flatten/unflatten with apex * switch to cpp flatten/unflatten * style * better comment * missing import * switch to build ops at run time * fixes Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
- 14 April, 2021 2 commits
-
-
Committed by Stas Bekman
* e-notation for large floats * handle ints too * readability * handle bool Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
Committed by Stas Bekman
-
- 08 April, 2021 3 commits
-
-
Committed by Stas Bekman
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Committed by Samyam Rajbhandari
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Committed by Stas Bekman
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-