1. 02 Sep 2021, 1 commit
  2. 26 Aug 2021, 1 commit
  3. 17 Aug 2021, 1 commit
  4. 07 Aug 2021, 2 commits
  5. 03 Aug 2021, 1 commit
  6. 29 Jul 2021, 2 commits
  7. 20 Jul 2021, 1 commit
  8. 14 Jul 2021, 1 commit
  9. 13 Jul 2021, 3 commits
  10. 12 Jul 2021, 1 commit
  11. 10 Jul 2021, 2 commits
    • [zero.Init] post_init partitioning is to be run only by a child module (#1202) · 497b741f
      Stas Bekman committed
      * post_init to be run only by a child module
      
      * better solution
      
      * add test
      
      * safer attr name
      
      * wants half()
      
      * improve doc
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
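      
      For context, `deepspeed.zero.Init` is the context manager whose post-init partitioning hook this change restricts to child modules. A minimal usage sketch follows (the toy model is hypothetical and the internal fix itself is not reproduced; run it under the deepspeed launcher so torch.distributed is set up):
      
      ```
      import torch
      import deepspeed
      
      class Parent(torch.nn.Module):  # hypothetical model, for illustration only
          def __init__(self):
              super().__init__()
              self.child = torch.nn.Linear(64, 64)  # a child module of Parent
      
      # Building the model inside zero.Init lets ZeRO stage 3 partition each
      # submodule's parameters as soon as that submodule finishes __init__,
      # which is where the post-init hook touched by this PR runs.
      with deepspeed.zero.Init():
          model = Parent()
      ```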
    • [zero3] params_to_reduce isn't always there (#1214) · 91f58c06
      Stas Bekman committed
      * [zero3] params_to_reduce isn't always there
      
      Trying to port HF's Electra model to Deepspeed, I'm getting this on the very first backward step (with some extra debug output):
      
      ```
      Incrementing with parameter id 42
      ------ Before allocating allgather param name=generator_lm_head.weight id=41 shape=torch.Size([1]) status=ZeroParamStatus.NOT_AVAILABLE partition size=327680
      ------allgather param with name=generator_lm_head.weight id=41 shape=torch.Size([1]) status=ZeroParamStatus.NOT_AVAILABLE partition size=327680
      ------ Before allocating allgather param name=generator_lm_head.bias id=42 shape=torch.Size([1]) status=ZeroParamStatus.NOT_AVAILABLE partition size=5120
      ------allgather param with name=generator_lm_head.bias id=42 shape=torch.Size([1]) status=ZeroParamStatus.NOT_AVAILABLE partition size=5120
      Backward name=generator_lm_head.weight id=41 shape=torch.Size([5120, 64])
      Inside reduce ipg buckets. name=generator_lm_head.weight id=41 shape=torch.Size([5120, 64]), ipg elements 0, reduce bucket size 4096
      Params in ipg bucket []
      Reducing []
      GOT 1
      torch.Size([4096])
      Traceback (most recent call last):
        File "examples/pytorch/language-modeling/run_mlm.py", line 533, in <module>
          main()
        File "examples/pytorch/language-modeling/run_mlm.py", line 484, in main
          train_result = trainer.train(resume_from_checkpoint=checkpoint)
        File "/mnt/nvme1/code/huggingface/transformers-ds-zero_to_fp32-tests/src/transformers/trainer.py", line 1269, in train
          tr_loss += self.training_step(model, inputs)
        File "/mnt/nvme1/code/huggingface/transformers-ds-zero_to_fp32-tests/src/transformers/trainer.py", line 1778, in training_step
          loss = self.deepspeed.backward(loss)
        File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/engine.py", line 1188, in backward
          self.optimizer.backward(loss)
        File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 2964, in backward
          self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
        File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/fp16/loss_scaler.py", line 53, in backward
          scaled_loss.backward(retain_graph=retain_graph)
        File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
          torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
        File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
          Variable._execution_engine.run_backward(
        File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 1867, in reduce_partition_and_remove_grads
          self.reduce_ready_partitions_and_remove_grads(param, i)
        File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 2212, in reduce_ready_partitions_and_remove_grads
          self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
        File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 1897, in reduce_independent_p_g_buckets_and_remove_grads
          self.reduce_ipg_grads()
        File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 2193, in reduce_ipg_grads
          self.average_tensor(reduction_list, params_to_reduce)
        File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 1972, in average_tensor
          params_to_reduce[0].reduce_gradients_at_owner(
      ```
      
      Is `params_to_reduce` always guaranteed to be populated?
      
      If I add this check, the problem seems to go away.
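      
      A simplified, self-contained sketch of that check (the names mirror the traceback above, but the bodies are placeholders, not the actual stage3 implementation):
      
      ```
      class Param:
          """Stand-in for a ZeRO-3 partitioned parameter (illustrative only)."""
          def reduce_gradients_at_owner(self):
              print("reducing gradients at owner")
      
      def average_tensor(params_to_reduce):
          # The bucket can legitimately be empty here (as in the Electra run above),
          # so guard before indexing into the list.
          if params_to_reduce:
              params_to_reduce[0].reduce_gradients_at_owner()
      
      average_tensor([])          # no-op instead of an IndexError
      average_tensor([Param()])   # reduces as before
      ```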
      
      * real fix
  12. 02 Jul 2021, 1 commit
  13. 29 Jun 2021, 1 commit
  14. 26 Jun 2021, 1 commit
  15. 24 Jun 2021, 2 commits
  16. 17 Jun 2021, 1 commit
  17. 09 Jun 2021, 1 commit
  18. 08 Jun 2021, 1 commit
  19. 22 May 2021, 1 commit
  20. 21 May 2021, 1 commit
  21. 20 May 2021, 1 commit
  22. 19 May 2021, 1 commit
  23. 16 May 2021, 1 commit
  24. 14 May 2021, 2 commits
  25. 08 May 2021, 1 commit
  26. 01 May 2021, 2 commits
  27. 30 Apr 2021, 2 commits
  28. 29 Apr 2021, 1 commit
  29. 25 Apr 2021, 1 commit
    • Add find_unused_parameters option to DeepSpeedEngine (#945) · d0b61f18
      hamlet committed
      * Add find_unused_parameters option
      
      Since unused parameters in modules are sometimes unintended, add an explicit
      error message when they occur, plus an option to avoid the error
      (a config sketch follows this commit list): https://github.com/microsoft/DeepSpeed/issues/707
      
      * Fix syntax error
      
      * Fix yapf error
      
      * Fix yapf error
      
      * Fix yapf error
      
      * Fix yapf error
      
      * Move stage2 find_unused_parameters to config file
      
      * Add stage2 find_unused_parameters
      
      * Add stage2 find_unused_parameters
      
      * Add stage2_find_unused_parameters option
      
      * Change error msg to reflect zero_optimization config change
      
      * Fix yapf error
      
      * Fix yapf errors
      
      * Change find_unused_parameters option name
      
      * Change find_unused_parameters option name
      
      * Change find_unused_parameters option name
      
      * Change find_unused_parameters option name
      
      * Change find_unused_parameters option name
      
      * Add UnusedParametersModel for test option find_unused_parameters
      
      * Add unit test for stage2 find_unused_parameters
      
      * Add cpu-adam compatible check
      
      * Remove dups import
      
      * Trim spaces
      
      * Fix yapf errors
      
      * Trim spaces
      
      * Add False Positive test check
      
      * Fix find_unused_parameters test
      
      * Trim spaces
      
      * Fix yapf error
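      
      A hedged sketch of what enabling the option might look like. The key name was renamed several times during review (see the commits above), so `stage2_find_unused_parameters`, the toy model, and the config values below are assumptions for illustration; check the current DeepSpeed config docs for the final name:
      
      ```
      import torch
      import deepspeed
      
      class UnusedParamModel(torch.nn.Module):
          """Toy model in which one parameter never contributes to the loss."""
          def __init__(self):
              super().__init__()
              self.used = torch.nn.Linear(8, 8)
              self.unused = torch.nn.Linear(8, 8)  # never called in forward()
      
          def forward(self, x):
              return self.used(x)
      
      # Illustrative ZeRO stage-2 config; per the commits above the flag lives under
      # zero_optimization, but the exact key name shown here is an assumption.
      ds_config = {
          "train_micro_batch_size_per_gpu": 1,
          "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
          "zero_optimization": {
              "stage": 2,
              "stage2_find_unused_parameters": True,
          },
      }
      
      model = UnusedParamModel()
      engine, optimizer, _, _ = deepspeed.initialize(
          model=model,
          model_parameters=model.parameters(),
          config=ds_config,
      )
      ```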
  30. 24 Apr 2021, 1 commit
  31. 23 Apr 2021, 1 commit