Commits · 79093d74aa1632e5eac937480774b6fd29084228 · Greenplum / DeepSpeed

02 Sep, 2020 1 commit

Sparse attn + ops/runtime refactor + v0.3.0 (#343) · e5bbc2e5

Committed by Jeff Rasley on Sep 01, 2020

* Sparse attn + ops/runtime refactor + v0.3.0
Co-authored-by: Arash Ashari <arashari@microsoft.com>

e5bbc2e5

01 Sep, 2020 1 commit

Samyamr/grad acc stage2 (#338) · 7240abf3

Committed by Samyam Rajbhandari on Aug 31, 2020

* Adding gradient accumulation support for ZeRO Stage 2. Changing all Megatron-LM tests to also test gradient accumulation

* Gradient Accumulation support for Stage 2. Model tests added to test the feature

* formatting

* Update deepspeed_light.py

removing comment

* Update ds_config_func_bs8_zero1.json

reverting this file back. It's not needed for this PR

* defining baseline prefix
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

7240abf3
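
For readers unfamiliar with the term, gradient accumulation can be illustrated with a minimal, framework-free sketch. This is not the ZeRO Stage 2 code from the commit above; the function `train_with_accumulation`, its scalar "weight", and the learning rate are hypothetical names and values chosen only to show the idea of summing micro-batch gradients and applying one update every `accumulation_steps` steps.

```python
def train_with_accumulation(micro_batch_grads, accumulation_steps, lr=0.1, weight=0.0):
    """Apply one SGD update per `accumulation_steps` micro-batches.

    Hypothetical helper for illustration; a single scalar stands in for the model.
    """
    accumulated = 0.0
    updates = 0
    for step, grad in enumerate(micro_batch_grads, start=1):
        # Scale each micro-batch gradient so the sum equals the average gradient.
        accumulated += grad / accumulation_steps
        if step % accumulation_steps == 0:
            # Boundary reached: apply the averaged gradient and reset the buffer.
            weight -= lr * accumulated
            accumulated = 0.0
            updates += 1
    return weight, updates


weight, updates = train_with_accumulation([1.0, 3.0, 2.0, 2.0], accumulation_steps=2)
```

With four micro-batches and `accumulation_steps=2`, the optimizer steps twice rather than four times, which is the behavior the tests mentioned in the commit would need to exercise.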

14 Jul, 2020 1 commit

Support loading and saving ZeRO checkpoints with changing DP degree (#240) · 7ccc9daf

Committed by Olatunji Ruwase on Jul 14, 2020

* Support saving and loading ZeRO checkpoints on different data parallelism degree.

* Fix formatting

* Support checkpoint with varying GPU count in ZeRO stage 1

* Fix formatting

* Formatting fixes

* Update model tests

* Remove pprint

* Minor fix

* Fix formatting

* Update model tests
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

7ccc9daf

07 Jul, 2020 1 commit

ZeRO-2: Handle gradients of empty partitions (#275) · 4a3234e0

Committed by Olatunji Ruwase on Jul 06, 2020

* Load non-DeepSpeed checkpoints into ZeRO optimizer

* Handle parameters smaller than DP

* Formatting fixes

* Handle empty partitions

* Fix perf bug
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

4a3234e0

24 Jun, 2020 1 commit

Handle parameter groups smaller than DP (#273) · 88c319aa

Committed by Olatunji Ruwase on Jun 23, 2020

* Load non-DeepSpeed checkpoints into ZeRO optimizer

* Handle parameters smaller than DP

* Formatting fixes

88c319aa

20 Jun, 2020 1 commit

Update deepspeed_utils.py (#270) · 224494bd

Committed by Samyam Rajbhandari on Jun 19, 2020

* Removing handle_overflow debugging code in deepspeed_utils.py

* Removing handle_overflow debugging code in deepspeed_zero_optimizer.py

Removing unnecessary overflow handle code. Not sure why it was there in the first place.

224494bd

05 Jun, 2020 1 commit

Add log util (#230) · e1ad8803

Committed by Chunyang Wen on Jun 05, 2020

* Add log util

* replace all occurrences of print and logging

* address format

* disable propagate to avoid duplicate log

e1ad8803
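
The "disable propagate to avoid duplicate log" item can be reproduced with Python's standard `logging` module alone. The sketch below is not the log util added in this commit; the logger names `demo_parent`, `demo_parent.noisy`, and `demo_parent.quiet` and the helper `make_child` are hypothetical. It shows that a logger with its own handler emits each record twice when an ancestor also has a handler, and once when `propagate` is set to `False`.

```python
import io
import logging

# Shared in-memory buffer standing in for the console.
buffer = io.StringIO()

# A parent logger with its own handler, as a library or application might configure.
parent = logging.getLogger("demo_parent")
parent.handlers.clear()
parent.addHandler(logging.StreamHandler(buffer))
parent.propagate = False  # keep the demo isolated from the root logger


def make_child(name, propagate):
    """Create a child logger with its own handler (hypothetical helper)."""
    child = logging.getLogger(name)
    child.setLevel(logging.INFO)
    child.handlers.clear()
    child.addHandler(logging.StreamHandler(buffer))
    child.propagate = propagate
    return child


# With propagation on, the record is handled by the child and again by the parent.
make_child("demo_parent.noisy", propagate=True).info("hello")
count_with_propagate = buffer.getvalue().count("hello")

# Reset the buffer, then log with propagation off: only the child's handler runs.
buffer.seek(0)
buffer.truncate(0)
make_child("demo_parent.quiet", propagate=False).info("hello")
count_without_propagate = buffer.getvalue().count("hello")
```

Disabling propagation is one way to avoid the duplicate; the alternative of attaching a handler only at a single level of the hierarchy works as well.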

04 Jun, 2020 1 commit

reduce memcpy between host and device (#248) · 8353c594

Committed by eltonzheng on Jun 03, 2020

8353c594

28 May, 2020 2 commits

add support for predivide as a config option (#235) · bc36b91d

Committed by Jeff Rasley on May 27, 2020

* add support for predivide as a flag
* add predivide json config, remove allgather_disable (as it's not currently used anymore)

bc36b91d

Samyamr/cpu memory bloat fix zero (#233) · d24d3de9

Committed by Samyam Rajbhandari on May 27, 2020

* Fix for CPU memory bloating issue caused by PyTorch backward graph creation in allgather. Fixed by calling detach on tensors before calling all_gather

d24d3de9

19 May, 2020 1 commit

ZeRO-2 (#217) · f2ac7eaf

Committed by Jeff Rasley on May 19, 2020

Updates for ZeRO stage 2 + ZeRO stage 1 w. RS
Co-authored-by: Tunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: Elton Zheng <eltonz@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: yuxionghe <yuxhe@microsoft.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>

f2ac7eaf

25 Apr, 2020 1 commit

Fix index out of range error when parameter count is not multiple of ranks (#202) · 512a0d4d

Committed by Olatunji Ruwase on Apr 24, 2020

512a0d4d
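
The class of bug named in this commit title (an index error when the element count does not divide evenly across ranks) is commonly avoided by padding before splitting. The sketch below is an illustration under that assumption, not the fix applied in the commit; `partition_flat` and `pad_value` are hypothetical names.

```python
def partition_flat(values, num_ranks, pad_value=0.0):
    """Split a flat list into `num_ranks` equal-sized partitions, padding the tail.

    Hypothetical helper for illustration; not DeepSpeed's implementation.
    """
    if num_ranks <= 0:
        raise ValueError("num_ranks must be positive")
    # Ceiling division so every element fits in some partition.
    part_size = -(-len(values) // num_ranks)
    # Pad so the total length is an exact multiple of the rank count.
    padded = list(values) + [pad_value] * (part_size * num_ranks - len(values))
    return [padded[r * part_size:(r + 1) * part_size] for r in range(num_ranks)]


# Five elements across four ranks: every rank still receives a same-sized slice.
parts = partition_flat([1.0, 2.0, 3.0, 4.0, 5.0], num_ranks=4)
```

Because every partition has the same length, per-rank indexing stays in bounds even when some ranks hold only padding, which also covers the case of fewer elements than ranks.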

21 Apr, 2020 1 commit

Fix perf bug (#194) · bf4797c2

Committed by Olatunji Ruwase on Apr 20, 2020

Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>

bf4797c2

03 Apr, 2020 1 commit

add representation to optimizer (#181) · 3637b86b

Committed by kouml on Apr 04, 2020

3637b86b

26 Mar, 2020 1 commit

Adding static loss scaling for ZeRO. (#166) · a76572dc

Committed by Shaden Smith on Mar 25, 2020

a76572dc

11 Mar, 2020 1 commit

Enhancement: Ability to load checkpoint without loading the optimizer… (#128) · 936117b5

Committed by Samyam Rajbhandari on Mar 10, 2020

* Enhancement: Ability to load checkpoint without loading the optimizer states. Unit test covering saving and loading checkpoints with fused, unfused and ZeRO optimizers. The unit test takes about 165s

936117b5

04 Feb, 2020 1 commit

Add files via upload · ec79b239

Committed by Samyam Rajbhandari on Feb 03, 2020

Different Optimizers in DeepSpeed.

ec79b239

Greenplum / DeepSpeed · last synced about 1 year ago
