- 14 Nov 2021, 1 commit

Committed by Olatunji Ruwase
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 13 Nov 2021, 1 commit

Committed by Olatunji Ruwase

- 31 Oct 2021, 1 commit

Committed by Zhen Zhang
* remove norm(), avoid memcpy after allgather
  1) Remove the norm computation in debug printing.
  2) Change _all_gather to be a sync op in fetch_sub_module. Reason: the async version is not async at all, because each all_gather calls torch.cuda.synchronize() to guarantee the previous communication op has completed.
  3) Add a new function, _allgather_params_split_launch: the existing _allgather_params does an explicit memcpy after the all-gather op. We can avoid the explicit memory copy on the Python side to improve performance. Known issue: `torch.distributed.all_gather` does an implicit memcpy at the end of each `ncclAllgather`.
* WIP: wrapped ncclAllgather as a customized op in DS; a micro benchmark shows the improvement for all-gathering a transformer layer with 9834560 elements in half precision is about 1.1ms on an aws-p4d instance.
* WIP: integrated into partition_parameters. Performance improvement of 5.1B BERT on aws-p4d: fwd: 300ms -> 200ms, bwd: 680ms -> 610ms
* Fix format
* cleaned dead code, modified unit test
* removed customized c++ extension; revert back to using the torch distributed API
* change torch.ones to torch.empty
* typo
* warn if not a cuda tensor for allgather
* fix formatting
* fix: move ds_tensor to cuda device (it is strange that ds_tensor hadn't been moved to cuda)
* remove try clause on the path for fetching params

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
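
For context, a minimal sketch of the memcpy being avoided, assuming a 1-D partition and a flat output buffer; the function names are illustrative, not DeepSpeed's actual `_allgather_params_split_launch`:

```
import torch
import torch.distributed as dist

def allgather_with_copy(partition, world_size):
    # Baseline: gather into per-rank tensors, then explicitly copy each
    # shard into one flat buffer -- the Python-side memcpy removed here.
    shards = [torch.empty_like(partition) for _ in range(world_size)]
    dist.all_gather(shards, partition)
    flat = torch.empty(partition.numel() * world_size,
                       dtype=partition.dtype, device=partition.device)
    for i, shard in enumerate(shards):
        flat.narrow(0, i * partition.numel(), partition.numel()).copy_(shard)
    return flat

def allgather_into_views(partition, world_size):
    # Improved: pre-allocate the flat buffer and gather directly into
    # views of it, so no extra copy is needed afterwards.
    flat = torch.empty(partition.numel() * world_size,
                       dtype=partition.dtype, device=partition.device)
    views = list(torch.chunk(flat, world_size))  # views share flat's storage
    dist.all_gather(views, partition)
    return flat
```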

- 22 Oct 2021, 1 commit

Committed by Olatunji Ruwase
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 02 Oct 2021, 1 commit

Committed by Alex Hedges
* Fix typos in docs/
* Fix typos in code comments and output strings
* Fix typos in the code itself
* Fix typos in tests/

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 30 Sep 2021, 1 commit

Committed by Jeff Rasley
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Shaden Smith <shaden.smith@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

- 14 Jul 2021, 1 commit

Committed by Stas Bekman
* fix reference counting in backward over multiple forwards
* test + cleanup

- 13 Jul 2021, 1 commit

Committed by Stas Bekman
* add live zero checkpoint to fp32 consolidation version
* some more docs
* zero2 model states use a different filename
* fix
* make debug mode cli-configurable
* copy the script only on node 0, process 0
* validate that we have the right number of files
* revamp _get_zero_param_shapes, instrument for easier debugging
* correct assertion
* rename API; add an even simpler API
* style
* improve docs
* update the docs
* revert the unpartitioned_params detection and report, as they are most likely persistent params

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
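
A hedged usage sketch of the consolidation API this PR reworks (the exact module path and function name may differ across DeepSpeed versions):

```
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# Merge the per-rank ZeRO shards under `checkpoint_dir` into a single
# fp32 state dict that plain PyTorch can consume.
state_dict = get_fp32_state_dict_from_zero_checkpoint("checkpoint_dir")
model.load_state_dict(state_dict)  # `model` is your torch.nn.Module
```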

- 12 Jul 2021, 1 commit

Committed by Stas Bekman

- 10 Jul 2021, 1 commit

Committed by Stas Bekman
* [zero3] params_to_reduce isn't always there

Trying to port HF's Electra model to DeepSpeed, I'm getting this on the very first backward step (with some extra debug):

```
Incrementing with parameter id 42
------ Before allocating allgather param name=generator_lm_head.weight id=41 shape=torch.Size([1]) status=ZeroParamStatus.NOT_AVAILABLE partition size=327680
------allgather param with name=generator_lm_head.weight id=41 shape=torch.Size([1]) status=ZeroParamStatus.NOT_AVAILABLE partition size=327680
------ Before allocating allgather param name=generator_lm_head.bias id=42 shape=torch.Size([1]) status=ZeroParamStatus.NOT_AVAILABLE partition size=5120
------allgather param with name=generator_lm_head.bias id=42 shape=torch.Size([1]) status=ZeroParamStatus.NOT_AVAILABLE partition size=5120
Backward name=generator_lm_head.weight id=41 shape=torch.Size([5120, 64])
Inside reduce ipg buckets. name=generator_lm_head.weight id=41 shape=torch.Size([5120, 64]), ipg elements 0, reduce bucket size 4096
Params in ipg bucket []
Reducing []
GOT 1 torch.Size([4096])
Traceback (most recent call last):
  File "examples/pytorch/language-modeling/run_mlm.py", line 533, in <module>
    main()
  File "examples/pytorch/language-modeling/run_mlm.py", line 484, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/mnt/nvme1/code/huggingface/transformers-ds-zero_to_fp32-tests/src/transformers/trainer.py", line 1269, in train
    tr_loss += self.training_step(model, inputs)
  File "/mnt/nvme1/code/huggingface/transformers-ds-zero_to_fp32-tests/src/transformers/trainer.py", line 1778, in training_step
    loss = self.deepspeed.backward(loss)
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/engine.py", line 1188, in backward
    self.optimizer.backward(loss)
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 2964, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/fp16/loss_scaler.py", line 53, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 1867, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 2212, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 1897, in reduce_independent_p_g_buckets_and_remove_grads
    self.reduce_ipg_grads()
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 2193, in reduce_ipg_grads
    self.average_tensor(reduction_list, params_to_reduce)
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 1972, in average_tensor
    params_to_reduce[0].reduce_gradients_at_owner(
```

Is it always the case that `params_to_reduce` is populated? If I add this check, the problem seems to go away.
* real fix
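
A minimal sketch of the shape of the fix, assuming the guard simply skips empty buckets (the real patch in stage3.py may differ):

```
def average_tensor(self, reduction_list, params_to_reduce):
    # The ipg bucket can legitimately be empty on this path,
    # so guard before indexing params_to_reduce[0].
    if not params_to_reduce:
        return
    params_to_reduce[0].reduce_gradients_at_owner(reduction_list)
```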

- 29 Jun 2021, 1 commit

Committed by Stas Bekman
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 26 Jun 2021, 1 commit

Committed by Stas Bekman
* undo noise
* another

- 24 Jun 2021, 2 commits

Committed by Stas Bekman
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Committed by Stas Bekman
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 17 Jun 2021, 1 commit

Committed by Samyam Rajbhandari
* largest_partitioned_params calculation fix: the largest partitioned params value was being calculated incorrectly
* Update stage3.py
* Update stage3.py
* formatting fix
* changing sub-group size default to 1e9

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 21 May 2021, 1 commit

Committed by Olatunji Ruwase
* Align fp16 param swap buffers
* Integrate swap buffer manager for fp16 params
* Support swapping misaligned fp16 parameters
* Support swap into unaligned fp16 buffer
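
A rough illustration of the alignment problem, under the assumption that swap I/O requires block-multiple buffer sizes (the constant and names are invented for the sketch):

```
import torch

ALIGN_ELEMS = 512  # assumed I/O alignment, in fp16 elements

def aligned_numel(numel, align=ALIGN_ELEMS):
    # Round up to the nearest multiple of `align`.
    return ((numel + align - 1) // align) * align

def stage_for_swap(param_fp16):
    # A misaligned fp16 tensor is swapped through a padded staging buffer
    # whose size satisfies the alignment requirement.
    padded = torch.zeros(aligned_numel(param_fp16.numel()),
                         dtype=torch.float16)
    padded[:param_fp16.numel()].copy_(param_fp16.view(-1))
    return padded
```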

- 19 May 2021, 1 commit

Committed by Olatunji Ruwase
* Align fp16 param swap buffers
* Integrate swap buffer manager for fp16 params
* Support swapping misaligned fp16 parameters

- 14 May 2021, 1 commit

Committed by Olatunji Ruwase

- 01 May 2021, 1 commit

Committed by Sean Naren
* Add additional conditions when checking types of output from the model
* Add test
* Modify test to use torch.tensor as well

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
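
A hedged illustration of the kind of output-type check being extended; the actual condition lives in DeepSpeed's engine, and these names are assumptions:

```
import torch

def collect_output_tensors(output):
    # Accept a bare tensor as well as list/tuple/dict containers.
    if isinstance(output, torch.Tensor):
        return [output]
    if isinstance(output, (list, tuple)):
        return [o for o in output if isinstance(o, torch.Tensor)]
    if isinstance(output, dict):
        return [v for v in output.values() if isinstance(v, torch.Tensor)]
    return []
```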

- 30 Apr 2021, 2 commits

Committed by Olatunji Ruwase
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Committed by Samyam Rajbhandari
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 23 Apr 2021, 1 commit

Committed by William Buchwalter
* Fix issue where gradient_predivide_factor was called as a function: `gradient_predivide_factor` is a `float`, so it must not be called; the call crashed whenever the `reduce_scatter` flag was set to `False`.
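
The shape of the bug and fix, sketched with invented surrounding code (the real call site is in DeepSpeed's reduce path):

```
import torch

gradient_predivide_factor = 2.0  # a float, not a callable
grad = torch.randn(8)

# Buggy form -- raises TypeError: 'float' object is not callable:
#   grad = gradient_predivide_factor(grad)

# Fixed form -- divide by the float before the reduce:
grad = grad / gradient_predivide_factor
```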

- 21 Apr 2021, 1 commit

Committed by Stas Bekman

- 19 Apr 2021, 1 commit

Committed by Jeff Rasley
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>

- 15 Apr 2021, 1 commit

Committed by Stas Bekman
* faster flatten/unflatten with apex
* switch to cpp flatten/unflatten
* style
* better comment
* missing import
* switch to building ops at run time
* fixes

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
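
A hedged sketch of the run-time-built op pattern this PR lands; the `UtilsBuilder` import is an assumption about DeepSpeed's op builder, while the `torch._utils` helpers are the pure-Python fallback:

```
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

try:
    from deepspeed.ops.op_builder import UtilsBuilder
    util_ops = UtilsBuilder().load()  # JIT-build the C++ extension at run time
    flatten, unflatten = util_ops.flatten, util_ops.unflatten
except Exception:
    # Fall back to the slower pure-Python implementations.
    flatten, unflatten = _flatten_dense_tensors, _unflatten_dense_tensors

# tensors = [p.data for p in params]
# flat = flatten(tensors)            # one contiguous buffer
# views = unflatten(flat, tensors)   # views shaped like the originals
```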

- 08 Apr 2021, 2 commits

Committed by Samyam Rajbhandari
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Committed by Stas Bekman
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 17 Mar 2021, 1 commit

Committed by Olatunji Ruwase
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 16 Mar 2021, 1 commit

Committed by Samyam Rajbhandari
* Fix mis-aligned grad: when a parameter's size is not divisible by the world size, the partitioned gradients are misaligned due to incorrect padding handling; this PR fixes that
* Formatting fix
* Adding static_scale test back for Z3, and also changing hidden size to be not divisible by world_size
* also removing alignment from flat fp16 buffers
* Testing for hidden dim alignment
* inference hook fix
* Update stage3.py
* formatting
* [bug-fix] move params to gpu if offload params is turned off

Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
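
A small sketch of the padding arithmetic involved (names invented): when `numel` is not a multiple of `world_size`, only the tail partition carries padding, and the gradient copy must respect that boundary:

```
def partition_bounds(numel, world_size, rank):
    base = (numel + world_size - 1) // world_size  # padded partition size
    start = rank * base
    end = min(start + base, numel)                 # real (unpadded) end
    padding = base - (end - start)                 # nonzero only at the tail
    return start, end, padding

# e.g. numel=10, world_size=4 -> partitions of 3 elements;
# rank 3 owns [9, 10) plus 2 elements of padding.
```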

- 12 Mar 2021, 1 commit

Committed by Olatunji Ruwase
* Control ZeRO wall clock timers
* Disable more ZeRO3 debug prints

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
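
For reference, DeepSpeed's documented `wall_clock_breakdown` config key controls wall-clock timing; whether this commit gates the ZeRO timers on exactly this flag is an assumption:

```
ds_config = {
    "train_batch_size": 8,
    "wall_clock_breakdown": True,   # enable per-phase wall-clock timers
    "zero_optimization": {"stage": 3},
}
```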

- 09 Mar 2021, 1 commit

Committed by Samyam Rajbhandari
* Squash stage3 v1 (#146)
  Co-authored-by: Samyam <samyamr@microsoft.com>
  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
  Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
  Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
  Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
  Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
  Co-authored-by: eltonzheng <eltonz@microsoft.com>
* Fix correctness bug (#147)
* formatting fix (#150)
* stage3 bugfix (API) update and simplified FP16 Z3 tests (#151)
* fp16 Z3 API update and bugfix
* revert debug change
* ZeRO-3 detach and race condition bugfixes (#149)
* trying out ZeRO-3 race condition fix
* CUDA sync instead of stream
* reduction stream sync
* remove commented code
* Fix optimizer state_dict KeyError (#148)
  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* fix for smaller SGS sizes, ensures each grad is backed by unique tensors (#152)
* Simplifying the logic for getting averaged gradients (#153)
* skip for now
* Z3 Docs redux (#154)
* removing some TODOs and commented code (#155)
* New Z3 defaults (#156)
  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* formatting
* megatron external params

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>