Commits · a029239812e15cf35334514449ed3127b915780a · Greenplum / DeepSpeed

29 Jun, 2021 1 commit
- clean up logging (#1190) · a0292398
  Committed by Stas Bekman on Jun 28, 2021
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
26 Jun, 2021 1 commit
- undo noise (#1191) · bc019a53
  Committed by Stas Bekman on Jun 25, 2021
```
* undo noise

* another
```
24 Jun, 2021 2 commits

introduce debug utils (#1136) · c0c4ebf1

Committed by Stas Bekman on Jun 23, 2021

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

ZeRO 2+3 memory estimators (#965) · 0c1802cc

Committed by Stas Bekman on Jun 23, 2021

Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

17 Jun, 2021 1 commit

Samyamr/largest partitioned params calculation fix (#1150) · 4eaf9106

Committed by Samyam Rajbhandari on Jun 16, 2021

* largest_partitioned_params calculation fix

largest partitioned params was getting calculated incorrectly

* Update stage3.py

* Update stage3.py

* formatting fix

* changing sub-group size default to 1e9

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

09 Jun, 2021 1 commit

correct cpu_offload deprecation (#1140) · a8d6dfe8

Committed by Stas Bekman on Jun 08, 2021

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

08 Jun, 2021 1 commit

[zero] fix missed subclasses partitioning bug (#1135) · 5ca81678

Committed by Stas Bekman on Jun 07, 2021

* fix missed subclassed partitioning bug

* fix on exit

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

22 May, 2021 1 commit
- fix ZERO_OPTIMIZATION_REDUCE_SCATTER_DEFAULT default value (#1058) · 093e59ec
  Committed by Meng, Peng on May 22, 2021
```
* fix Reduce Scatter default value

* Update constants.py

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
21 May, 2021 1 commit

ZeRO-Infinity: Swap into unaligned fp16 buffer (#1086) · e9e9d5b8

Committed by Olatunji Ruwase on May 20, 2021

* Align fp16 param swap buffers

* Integrating swap buffer manager for fp16 params

* Support swapping misaligned fp16 parameters

* Support swap into unaligned fp16 buffer

20 May, 2021 1 commit
- ZeRO stage 1 refresh (#1042) · cfa63f5d
  Committed by Jeff Rasley on May 19, 2021
19 May, 2021 1 commit
- ZeRO-Infinity: support swapping misaligned sized fp16 tensors (#1076) · d88d9279
  Committed by Olatunji Ruwase on May 18, 2021
```
* Align fp16 param swap buffers

* Integrating swap buffer manager for fp16 params

* Support swapping misaligned fp16 parameters
```
16 May, 2021 1 commit

ZeRO2-Offload: Load balance gradient copying to CPU (#1067) · ee4deabd

Committed by Olatunji Ruwase on May 15, 2021

* Round robin partitioning to improve ZeRO-2 Offload CPU copy

* Formatting fixes

* Fix index issues in debug dumps

* Remove debug prints

* Code cleanup

* Remove unintended stage3.py changes

* Add TODO
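
The round-robin partitioning named in this commit can be pictured with a small sketch. It is not DeepSpeed's implementation: the helper name `round_robin_partition` and the use of plain integers in place of gradient tensors are assumptions made for illustration. Assigning item `i` to partition `i % num_partitions` spreads neighbouring items across partitions, which is one simple way to even out per-partition copy work.

```python
def round_robin_partition(items, num_partitions):
    """Assign item i to partition i % num_partitions (hypothetical helper)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # A strided slice picks every num_partitions-th item for each partition.
    return [items[start::num_partitions] for start in range(num_partitions)]


# Seven stand-in "gradients" spread over three partitions.
partitions = round_robin_partition(list(range(7)), 3)
print(partitions)
```

With a contiguous split, one partition could end up holding a run of neighbouring (and possibly large) items; interleaving avoids that at the cost of non-contiguous ownership.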


14 May, 2021 2 commits
- Get correct fp16 reuse buffer size (#1071) · 6b49b60e
  Committed by Olatunji Ruwase on May 13, 2021
- ensure only ds params are gathered (#1044) · 29b444b6
  Committed by Stas Bekman on May 13, 2021
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```
08 May, 2021 1 commit

Avoid unused parameters assert by default (#1039) · 5b393f15

Committed by Olatunji Ruwase on May 07, 2021

* Unused parameters assert should be disabled by default

* Fix message

* Invert assert logic in unit test

* Change option for ignoring unused parameters

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

01 May, 2021 2 commits
- [Stage][Fix] Add additional conditions when checking types of output from the model (#1026) · b3870363
  Committed by Sean Naren on May 01, 2021
```
* Add additional conditions when checking types of output from the model

* Add test

* Modify test to use torch.tensor as well

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```
- [fp32] fix default dtype (#1023) · 18a26e86
  Committed by Stas Bekman on Apr 30, 2021
30 Apr, 2021 2 commits
- Handle Norm allreduce when no mp (#1021) · 429dfa6c
  Committed by Olatunji Ruwase on Apr 29, 2021
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
- Samyamr/full precision for ZeRO Stage2 and Stage3 (#1004) · dad26428
  Committed by Samyam Rajbhandari on Apr 29, 2021
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```
29 Apr, 2021 1 commit
- Refactor param_dict to config (#1008) · 41ab660b
  Committed by Sean Naren on Apr 29, 2021
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
25 Apr, 2021 1 commit

Add find_unused_parameters option to DeepSpeedEngine (#945) · d0b61f18

Committed by hamlet on Apr 25, 2021

* Add find_unused_parameters option

As unused parameters in modules may not be expected sometimes, 
add an explicit error msg when it occurred and an option to avoid the error: https://github.com/microsoft/DeepSpeed/issues/707

* Add find_unused_parameters option

As unused parameters in modules may not be expected sometimes, 
add an explicit error msg when it occurred and an option to avoid the error: https://github.com/microsoft/DeepSpeed/issues/707

* Fix syntax error

* Fix yapf error

* Fix yapf error

* Fix yapf error

* Fix yapf error

* Move stage2 find_unused_parameters to config file

* Add stage2 find_unused_parameters

* Add stage2 find_unused_parameters

* Add stage2_find_unused_parameters option

* Change error msg to reflect zero_optimization config change

* Fix yapf error

* Fix yapf errors

* Change find_unused_parameters option name

* Change find_unused_parameters option name

* Change find_unused_parameters option name

* Change find_unused_parameters option name

* Change find_unused_parameters option name

* Add UnusedParametersModel for test option find_unused_parameters

* Add unit test for stage2 find_unused_parameters

* Add cpu-adam compatible check

* Remove dups import

* Trim spaces

* Fix yapf errors

* Trim spaces

* Add False Positive test check

* Fix find_unused_parameters test

* Trim spaces

* Fix yapf error


24 Apr, 2021 1 commit

Use amp autocast in ZeRO3 linear (#990) · e88ebbcf

Committed by Olatunji Ruwase on Apr 23, 2021

* Use amp autocast in ZeRO3 linear

* Fix typo

* Handle specific exceptions

* CI breaks on torch.distributed

* Add autocast unit test

* Format fixes

* Fix skip logic

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

23 Apr, 2021 1 commit

Fix issue where gradient_predivide_factor was called as a func. (#996) · a7118789

Committed by William Buchwalter on Apr 22, 2021

* Fix issue where gradient_predivide_factor was called as a func.

`gradient_predivide_factor` is a `float`, hence shouldn't be called as func.
This crashes when `reduce_scatter` flag is set to `False`.
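
The failure mode described here is easy to reproduce outside DeepSpeed. The following is a minimal sketch, not the library's code: `ToyConfig`, `scale_buggy`, and `scale_fixed` are hypothetical names, and the only point is that calling a `float` attribute raises `TypeError`, while reading it as a value works.

```python
class ToyConfig:
    # Hypothetical stand-in for a config object; the attribute is a plain float.
    gradient_predivide_factor = 2.0


def scale_buggy(grad, cfg):
    # Calling the float raises TypeError: 'float' object is not callable.
    return grad / cfg.gradient_predivide_factor()


def scale_fixed(grad, cfg):
    # Read the attribute as a value instead of calling it.
    return grad / cfg.gradient_predivide_factor


cfg = ToyConfig()
try:
    scale_buggy(4.0, cfg)
    buggy_raised = False
except TypeError:
    buggy_raised = True

print(buggy_raised, scale_fixed(4.0, cfg))
```

Because the error only appears when the offending line runs, a bug like this can stay hidden until a rarely used code path is taken, which matches the commit's note that the crash depends on the `reduce_scatter` flag.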


22 Apr, 2021 2 commits

Make reduce scatter optional for ZeRO-1 as workaround (#971) · 0b80ad06

Committed by Olatunji Ruwase on Apr 21, 2021

* Make reduce scatter optional for ZeRO-1 as workaround

* Make allreduce default for ZeRO 1

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Use odd shape tensor to represent parameter data in partitioned state (#981) · 894f21da

Committed by Cheng Li on Apr 21, 2021

* use weird shaped tensor to avoid silent failures when not registering external params

* fix typo

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

21 Apr, 2021 2 commits
- remove debug prints: (#986) · eecef309
  Committed by Stas Bekman on Apr 20, 2021
- [ZeRO Infinity] Allow Init to take a dict for the deepspeed config (#983) · 35251023
  Committed by Sean Naren on Apr 20, 2021
```
* Add check to see if json file is already loaded

* Update doc

* Address review

* Remove doc comment

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```
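
The idea of accepting either an already-loaded dict or a path to a JSON file can be sketched as follows. This is an illustration under assumptions, not the code from this commit: `load_config` is a hypothetical helper, and the temporary file exists only so the example runs on its own.

```python
import json
import os
import tempfile


def load_config(config):
    """Return a config dict from a dict or from a JSON file path (hypothetical helper)."""
    if isinstance(config, dict):
        # Already loaded: use it as-is instead of treating it as a path.
        return config
    with open(config, "r") as f:
        return json.load(f)


settings = {"zero_optimization": {"stage": 3}}

# Write the same settings to a temporary JSON file and load them back by path.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ds_config.json")
    with open(path, "w") as f:
        json.dump(settings, f)
    from_path = load_config(path)

from_dict = load_config(settings)
print(from_path == from_dict)
```

Checking the type up front lets callers that build their configuration in code skip the round trip through a file.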
20 Apr, 2021 1 commit
- ZeRO-Infinity docs (#979) · 11279ae4
  Committed by Shaden Smith on Apr 19, 2021
```
* zinf tutorial

* more megatron integration docs

* ZInf + tiling docs
```
19 Apr, 2021 1 commit

ZeRO-Infinity (#976) · 0d4a54a0

Committed by Jeff Rasley on Apr 18, 2021

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>

17 Apr, 2021 1 commit
- Fix ZeRO-3 UnboundLocalError (#968) · 2805c393
  Committed by Olatunji Ruwase on Apr 16, 2021
```
* Fix UnboundLocalError

* Get full partition size
```
15 Apr, 2021 1 commit

[zero] faster flatten/unflatten (cpp version) (#910) · 8b8ed2a7

Committed by Stas Bekman on Apr 14, 2021

* faster flatten/unflatten with apex

* switch to cpp flatten/unflatten

* style

* better comment

* missing import

* switch to build ops at run time

* fixes

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

14 Apr, 2021 1 commit
- fix double linear override; spelling (#954) · adac058a
  Committed by Stas Bekman on Apr 14, 2021
08 Apr, 2021 4 commits
- Samyamr/stage 3 skip modules without parameters (#867) · 7b46d11f
  Committed by Samyam Rajbhandari on Apr 07, 2021
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
- improved readability + typos (#895) · 5ca86ae4
  Committed by Stas Bekman on Apr 07, 2021
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
- [zero3] GatheredParameters can now handle a list of params (#884) · 6d94afb5
  Committed by Stas Bekman on Apr 07, 2021
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
- Fix for fragmented linear inputs in ZeRO 3 Linear layers where reshap… (#881) · b5f56b2c
  Committed by Samyam Rajbhandari on Apr 07, 2021
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
02 Apr, 2021 1 commit

zero.Init() clarification (#880) · 5d721e09

Committed by Stas Bekman on Apr 01, 2021

* zero.Init() clarification

clarify that if `model.half()` can't fit into gpu memory `zero.Init()` is a must.

this proposal is via @samyam's clarification shared elsewhere.

Thank you.

* style

* add clarity

* style

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

27 Mar, 2021 2 commits

Fix zero stage2 cpu_offload when some model trainable parameters skipped in training (#861) · 7fcc8911

Committed by hamlet on Mar 27, 2021

* Fix zero stage2 cpu_offload when some trainable model parameters are skipped in training, as in https://github.com/microsoft/DeepSpeed/issues/707

When some trainable model parameters are skipped in training,
their backward hooks in self.create_reduce_and_remove_grad_hooks() will not run,
so they have no norm_for_param_grads

* Trim space

* Trim space

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

save_fp16_model consolidated for zero3 (#893) · 39013dd2
Committed by Stas Bekman on Mar 26, 2021
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```

17 Mar, 2021 1 commit
- Make config objects json serializable (#862) · 7bcd72a2
  Committed by Olatunji Ruwase on Mar 16, 2021
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```

Greenplum / DeepSpeed: last synced about 1 year ago
