Commits · cfa63f5dad2b0c45a23ee9e34d47206ee29a356a · Greenplum / DeepSpeed

20 May, 2021 1 commit

ZeRO stage 1 refresh (#1042) · cfa63f5d

Committed by Jeff Rasley on May 19, 2021

cfa63f5d

16 May, 2021 1 commit

ZeRO2-Offload: Load balance gradient copying to CPU (#1067) · ee4deabd

Committed by Olatunji Ruwase on May 15, 2021

* Round robin partitioning to improve ZeRO-2 Offload CPU copy

* Formatting fixes

* Fix index issues in debug dumps

* Remove debug prints

* Code cleanup

* Remove unintended stage3.py changes

* Add TODO

ee4deabd
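
As a rough illustration of the round-robin idea named in this commit, the sketch below assigns tensors to data-parallel ranks in round-robin order, so that each rank ends up copying a similar number of gradient elements. It is not the DeepSpeed implementation: `round_robin_partition`, `elements_per_rank`, and the sample sizes are hypothetical, and plain integers stand in for tensor sizes.

```python
from typing import Dict, List


def round_robin_partition(sizes: List[int], num_ranks: int) -> Dict[int, List[int]]:
    """Assign tensor indices to ranks in round-robin order (hypothetical helper)."""
    assignment: Dict[int, List[int]] = {rank: [] for rank in range(num_ranks)}
    for index in range(len(sizes)):
        # Tensor i goes to rank i mod num_ranks.
        assignment[index % num_ranks].append(index)
    return assignment


def elements_per_rank(sizes: List[int], assignment: Dict[int, List[int]]) -> Dict[int, int]:
    """Count how many gradient elements each rank would have to copy."""
    return {rank: sum(sizes[i] for i in indices) for rank, indices in assignment.items()}


# Assumed example: four large tensors followed by four small ones.
sizes = [4, 4, 4, 4, 2, 2, 2, 2]
assignment = round_robin_partition(sizes, num_ranks=2)
load = elements_per_rank(sizes, assignment)
print(assignment)
print(load)
```

With these assumed sizes, a contiguous split would give the first rank 16 elements and the second rank 8, while the round-robin split gives each rank the same amount of copy work.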

08 May, 2021 1 commit

Avoid unused parameters assert by default (#1039) · 5b393f15

Committed by Olatunji Ruwase on May 07, 2021

* Unused parameters assert should be disabled by default

* Fix message

* Invert assert logic in unit test

* Change option for ignoring unused parameters

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

5b393f15

30 Apr, 2021 1 commit

Samyamr/full precision for ZeRO Stage2 and Stage3 (#1004) · dad26428

Committed by Samyam Rajbhandari on Apr 29, 2021

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

dad26428

25 Apr, 2021 1 commit

Add find_unused_parameters option to DeepSpeedEngine (#945) · d0b61f18

Committed by hamlet on Apr 25, 2021

* Add find_unused_parameters option

Because unused parameters in modules are sometimes unexpected,
add an explicit error message when they occur and an option to avoid the error: https://github.com/microsoft/DeepSpeed/issues/707

* Fix syntax error

* Fix yapf error

* Fix yapf error

* Fix yapf error

* Fix yapf error

* Move stage2 find_unused_parameters to config file

* Add stage2 find_unused_parameters

* Add stage2 find_unused_parameters

* Add stage2_find_unused_parameters option

* Change error msg to reflect zero_optimization config change

* Fix yapf error

* Fix yapf errors

* Change find_unused_parameters option name

* Change find_unused_parameters option name

* Change find_unused_parameters option name

* Change find_unused_parameters option name

* Change find_unused_parameters option name

* Add UnusedParametersModel for test option find_unused_parameters

* Add unit test for stage2 find_unused_parameters

* Add cpu-adam compatible check

* Remove dups import

* Trim spaces

* Fix yapf errors

* Trim spaces

* Add False Positive test check

* Fix find_unused_parameters test

* Trim spaces

* Fix yapf error

d0b61f18

15 Apr, 2021 1 commit

[zero] faster flatten/unflatten (cpp version) (#910) · 8b8ed2a7

Committed by Stas Bekman on Apr 14, 2021

* faster flatten/unflatten with apex

* switch to cpp flatten/unflatten

* style

* better comment

* missing import

* switch to build ops at run time

* fixes

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

8b8ed2a7

27 Mar, 2021 1 commit

Fix zero stage2 cpu_offload when some model trainable parameters skipped in training (#861) · 7fcc8911

Committed by hamlet on Mar 27, 2021

* Fix ZeRO stage 2 cpu_offload when some trainable model parameters are skipped in training, as in https://github.com/microsoft/DeepSpeed/issues/707

When some trainable model parameters are skipped in training,
the backward hooks set up in self.create_reduce_and_remove_grad_hooks() do not run for them,
so they have no norm_for_param_grads.

* Trim space

* Trim space

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

7fcc8911

16 Mar, 2021 1 commit

ZeRO Stage 2: Clear reduced gradients (#856) · a75d971b

Committed by Olatunji Ruwase on Mar 15, 2021

* Ensure gradients of other partitions are cleared after reduction

* Remove redundant code

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

a75d971b

12 Mar, 2021 1 commit

Control ZeRO wall clock timers (#849) · 311795d0

Committed by Olatunji Ruwase on Mar 11, 2021

* Control ZeRO wall clock timers

* Disable more ZeRO3 debug prints

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

311795d0

11 Mar, 2021 1 commit

less scary overflow notice (#833) · 29853c3e

Committed by Stas Bekman on Mar 10, 2021

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

29853c3e

17 Feb, 2021 1 commit

Checks for None tensors and skip them when splitting the buckets in zero stage 2. (#728) · 7cab55c7

Committed by Cheng Li on Feb 16, 2021

* check none tensors when splitting buckets

7cab55c7

25 Nov, 2020 1 commit

Deprecate client ability to disable gradient reduction (#552) · 6e65c2cc

Committed by Olatunji Ruwase on Nov 24, 2020

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

6e65c2cc

24 Nov, 2020 1 commit

Bug fix for norm calculation in absence of model parallel group (#551) · 00c3a254

Committed by Samyam Rajbhandari on Nov 23, 2020

In the absence of a model parallel group, model_parallel_allreduce should not do any reduction. This commit fixes a bug that performed a model parallel allreduce across the world group when the model parallel group is None.

00c3a254
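
The intended behaviour described in this commit can be pictured with a toy stand-in. The sketch below is not DeepSpeed code and does not call torch.distributed: `model_parallel_allreduce` here is a hypothetical pure-Python function, and `mp_group_values` is an assumed way of modelling the contributions of the ranks in a model parallel group.

```python
from typing import Optional, Sequence


def model_parallel_allreduce(local_norm_sq: float,
                             mp_group_values: Optional[Sequence[float]]) -> float:
    """Toy stand-in for a sum all-reduce over a model parallel group.

    mp_group_values holds the contribution of every rank in the group
    (including this one), or None when no model parallel group exists.
    """
    if mp_group_values is None:
        # No model parallel group: return the local value untouched
        # instead of falling back to a reduction over all ranks.
        return local_norm_sq
    return sum(mp_group_values)


without_group = model_parallel_allreduce(9.0, None)
with_group = model_parallel_allreduce(9.0, [9.0, 16.0])
grad_norm = with_group ** 0.5
print(without_group, with_group, grad_norm)
```

In this toy model, reducing across unrelated ranks when no group exists would inflate the squared norm, which is the kind of error the guard avoids.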

21 Nov, 2020 1 commit

Fix unbalanced gradients bug in ZeRO-2 gradient accumulation (#545) · 0178e6cc

Committed by Olatunji Ruwase on Nov 20, 2020

* Use zero-tensors for missing gradients to avoid size mismatch

* Unit test for unbalanced gradients in ZeRO

* Formatting fixes

0178e6cc
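
A minimal sketch of the zero-tensor idea from the first bullet, under stated assumptions: plain Python lists stand in for tensors, and `flatten_grads` is a hypothetical helper, not a DeepSpeed function. A parameter with no gradient contributes zeros of its own size, so every rank builds a flat buffer of the same length.

```python
from typing import List, Optional


def flatten_grads(param_sizes: List[int],
                  grads: List[Optional[List[float]]]) -> List[float]:
    """Flatten per-parameter gradients, using zeros where a gradient is missing."""
    flat: List[float] = []
    for size, grad in zip(param_sizes, grads):
        if grad is None:
            # Missing gradient: pad with zeros so the buffer keeps its full size.
            flat.extend([0.0] * size)
        else:
            flat.extend(grad)
    return flat


param_sizes = [2, 3]
rank0 = flatten_grads(param_sizes, [[1.0, 2.0], [3.0, 4.0, 5.0]])
rank1 = flatten_grads(param_sizes, [[1.0, 2.0], None])  # second gradient missing
print(rank0)
print(rank1)
```

Both buffers have five elements here, so an element-wise reduction across the two ranks would line up even though one rank produced no gradient for the second parameter.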

13 Nov, 2020 1 commit

DeepSpeed JIT op + PyPI support (#496) · 31f46fee

Committed by Jeff Rasley on Nov 12, 2020

Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>

31f46fee

06 Nov, 2020 1 commit

Fixing CPU-Adam convergence issue (#503) · 7d4d742b

Committed by Reza Yazdani on Nov 05, 2020

* fixing cpu-adam

* fixing copy with optimizer for data and model parallelism

* fixing cpu-adam

* fix cpu-adam

* fix cpu-adam

7d4d742b

31 Oct, 2020 1 commit

Add CPUAdam optimizer for zero-offload in deepspeed engine (#484) · f5aa2547

Committed by Reza Yazdani on Oct 30, 2020

* add adamW to CPU-ADAM implementation

* supporting cpu-adam optimizer for zero-offload on deepspeed side

* bump DSE to match cpu-adam updates

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

f5aa2547

30 Sep, 2020 1 commit

Disable default installation of CPU Adam (#450) · 7b8be2a7

Committed by Olatunji Ruwase on Sep 29, 2020

* Disable default installation of CPU Adam

* Handle cpufeature import/use errors separately

7b8be2a7

28 Sep, 2020 1 commit

fix typos (#446) · 6f28ea30

Committed by Haibin Lin on Sep 28, 2020

6f28ea30

17 Sep, 2020 1 commit

Fix a typo in comments (#415) · 4fef478f

Committed by Haibin Lin on Sep 16, 2020

* Update stage2.py

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

4fef478f

10 Sep, 2020 2 commits

fix for 16GB v100 nodes (#393) · b1d4bd73

Committed by Jeff Rasley on Sep 10, 2020

b1d4bd73

ZeRO-Offload release (#391) · 41db1c2f

Committed by Jeff Rasley on Sep 09, 2020

* ZeRO-Offload (squash) (#381)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Jie <37380896+jren73@users.noreply.github.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: arashashari <arashashari@ArashMSLaptop.redmond.corp.microsoft.com>
Co-authored-by: RezaYazdaniAminabadi <44502768+RezaYazdaniAminabadi@users.noreply.github.com>

41db1c2f

02 Sep, 2020 1 commit

Sparse attn + ops/runtime refactor + v0.3.0 (#343) · e5bbc2e5

Committed by Jeff Rasley on Sep 01, 2020

* Sparse attn + ops/runtime refactor + v0.3.0

Co-authored-by: Arash Ashari <arashari@microsoft.com>

e5bbc2e5

01 Sep, 2020 1 commit

Samyamr/grad acc stage2 (#338) · 7240abf3

Committed by Samyam Rajbhandari on Aug 31, 2020

* Adding gradient accumulation support for ZeRO Stage 2. Changing all Megatron-LM tests to also test gradient accumulation

* Gradient Accumulation support for Stage 2. Model tests added to test the feature

* formatting

* Update deepspeed_light.py

removing comment

* Update ds_config_func_bs8_zero1.json

Reverting this file. It's not needed for this PR.

* defining baseline prefix

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

7240abf3

14 Jul, 2020 1 commit

Support loading and saving ZeRO checkpoints with changing DP degree (#240) · 7ccc9daf

Committed by Olatunji Ruwase on Jul 14, 2020

* Support saving and loading ZeRO checkpoints with a different data parallelism degree.

* Fix formatting

* Support checkpoint with varying GPU count in ZeRO stage 1

* Fix formatting

* Formatting fixes

* Update model tests

* Remove pprint

* Minor fix

* Fix formatting

* Update model tests

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

7ccc9daf

07 Jul, 2020 1 commit

ZeRO-2: Handle gradients of empty partitions (#275) · 4a3234e0

Committed by Olatunji Ruwase on Jul 06, 2020

* Load non-DeepSpeed checkpoints into ZeRO optimizer

* Handle parameters smaller than DP

* Formatting fixes

* Handle empty partitions

* Fix perf bug

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

4a3234e0

24 Jun, 2020 1 commit

Handle parameter groups smaller than DP (#273) · 88c319aa

Committed by Olatunji Ruwase on Jun 23, 2020

* Load non-DeepSpeed checkpoints into ZeRO optimizer

* Handle parameters smaller than DP

* Formatting fixes

88c319aa

20 Jun, 2020 1 commit

Update deepspeed_utils.py (#270) · 224494bd

Committed by Samyam Rajbhandari on Jun 19, 2020

* Removing handle_overflow debugging code in deepspeed_utils.py

* Removing handle_overflow debugging code in deepspeed_zero_optimizer.py

Removing unnecessary overflow handle code. Not sure why it was there in the first place.

224494bd

05 Jun, 2020 1 commit

Add log util (#230) · e1ad8803

Committed by Chunyang Wen on Jun 05, 2020

* Add log util

* replace all occurrences of print and logging

* address format

* disable propagate to avoid duplicate log

e1ad8803

04 Jun, 2020 1 commit

reduce memcpy between host and device (#248) · 8353c594

Committed by eltonzheng on Jun 03, 2020

8353c594

28 May, 2020 2 commits

add support for predivide as a config option (#235) · bc36b91d

Committed by Jeff Rasley on May 27, 2020

* add support for predivide as a flag
* add predivide json config, remove allgather_disable (as it's not currently used anymore)

bc36b91d

Samyamr/cpu memory bloat fix zero (#233) · d24d3de9

Committed by Samyam Rajbhandari on May 27, 2020

* Fix for CPU memory bloating issue caused by PyTorch backward graph creation in allgather. Fixed by calling detach on tensors before calling all_gather.

d24d3de9

19 May, 2020 1 commit

ZeRO-2 (#217) · f2ac7eaf

Committed by Jeff Rasley on May 19, 2020

Updates for ZeRO stage 2 + ZeRO stage 1 w. RS

Co-authored-by: Tunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: Elton Zheng <eltonz@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: yuxionghe <yuxhe@microsoft.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>

f2ac7eaf

25 Apr, 2020 1 commit

Fix index out of range error when parameter count is not multiple of ranks (#202) · 512a0d4d

Committed by Olatunji Ruwase on Apr 24, 2020

512a0d4d

21 Apr, 2020 1 commit

Fix perf bug (#194) · bf4797c2

Committed by Olatunji Ruwase on Apr 20, 2020

Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>

bf4797c2

03 Apr, 2020 1 commit

add representation to optimizer (#181) · 3637b86b

Committed by kouml on Apr 04, 2020

3637b86b

26 Mar, 2020 1 commit

Adding static loss scaling for ZeRO. (#166) · a76572dc

Committed by Shaden Smith on Mar 25, 2020

a76572dc

11 Mar, 2020 1 commit

Enhancement: Ability to load checkpoint without loading the optimizer… (#128) · 936117b5

Committed by Samyam Rajbhandari on Mar 10, 2020

* Enhancement: Ability to load checkpoint without loading the optimizer states. The unit test covers saving and loading checkpoints with fused, unfused and ZeRO optimizers. The unit test takes about 165s.

936117b5

04 Feb, 2020 1 commit

Add files via upload · ec79b239

Committed by Samyam Rajbhandari on Feb 03, 2020

Different Optimizers in DeepSpeed.

ec79b239

Greenplum / DeepSpeed last synced about 1 year ago