- 07 May 2022 (1 commit)

Committed by Stas Bekman
* GatheredParameters: accept any iterable
* A torch tensor is itself an iterable, so `collections.abc.Iterable` can't be used to distinguish a single parameter from a list of parameters
* fix
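The pitfall behind this commit can be shown without torch: any class that defines `__iter__` registers as `collections.abc.Iterable`, so an isinstance check against `Iterable` cannot tell one tensor-like parameter from a list of them. A minimal sketch, using a hypothetical `FakeParam` stand-in for `torch.nn.Parameter`:

```python
from collections.abc import Iterable

# FakeParam is a hypothetical stand-in for torch.nn.Parameter: like a
# tensor, it defines __iter__, so isinstance(p, Iterable) is True.
class FakeParam:
    def __init__(self, data):
        self.data = data

    def __iter__(self):
        return iter(self.data)

def normalize_params(params):
    # Check the single-param case first; only then fall back to
    # treating the argument as an iterable of params.
    if isinstance(params, FakeParam):
        return [params]
    return list(params)
```

The key point is that the single-param test must come first; the `Iterable` test alone would happily iterate over a lone tensor's rows.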

- 06 May 2022 (1 commit)

Committed by Olatunji Ruwase
* Fix OOM and type mismatch
* Toggle prefetching
* Disable z3 prefetching for inference (temp workaround)
* Fix zero3 tracing issues
* Remove debug prints
* Enable prefetch for inference
* Code clarity
* Invalidate trace cache
* Trace cache invalidation when needed; separate nvme prefetch from all-gather prefetch
* Track last used step id
* Use debug name in error message
* Construct param trace from module trace

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
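The trace-cache idea running through several of these items can be sketched in plain Python (class and method names are hypothetical, not DeepSpeed's actual ones): record the module execution order on the first iteration, replay it to predict what to prefetch on later ones, and invalidate the cache the moment execution diverges from the recording:

```python
class TraceCache:
    """Hypothetical sketch of a module-execution trace cache."""

    def __init__(self):
        self.trace = []        # module ids in first-seen order
        self.complete = False  # True once a full iteration was recorded
        self._pos = 0          # replay cursor

    def step(self, module_id):
        """Record on the first pass; verify and predict on later passes."""
        if not self.complete:
            self.trace.append(module_id)
            return None
        if self._pos >= len(self.trace) or self.trace[self._pos] != module_id:
            # Execution diverged from the recorded trace: the cache is
            # stale, so drop it and start recording again.
            self.invalidate()
            self.trace = [module_id]
            return None
        self._pos += 1
        # Predict the next module so its params can be prefetched early.
        return self.trace[self._pos] if self._pos < len(self.trace) else None

    def end_iteration(self):
        self.complete = True
        self._pos = 0

    def invalidate(self):
        self.complete = False
        self._pos = 0
```

"Skip nvme prefetch when trace not complete" (in the commit above at 21 Jan) corresponds to gating prefetch on `complete` being True.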

- 27 Apr 2022 (1 commit)

Committed by Jeff Rasley
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>

- 26 Apr 2022 (1 commit)

Committed by Olatunji Ruwase
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 20 Apr 2022 (1 commit)

Committed by Stas Bekman
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 23 Jan 2022 (1 commit)

Committed by Alex Hedges

- 21 Jan 2022 (1 commit)

Committed by Justin Chiu
* Changes for bfloat16 Zero2
* ZeRO stage3 optimizations, with some bug fixes

  Optimizations for stage3:
  - prefetching improvements
  - batching allgather calls to amortize fixed overhead and improve bandwidth utilization
  - batching reduce_scatter calls to amortize fixed overhead and improve bandwidth utilization
  - using *_base variants of allgather and reduce scatter to reduce memory allocations and data movement
  - more fine-grained synchronization for communication that allows blocking on less work
  - precomputation of fetching code: using a fetch queue rather than deciding what to (pre)fetch at each iteration
  - limiting queued coalesced communication ops to reduce memory pressure on the pytorch cuda caching allocator (not an elegant solution)

  Optimizations for stage3-offload:
  - made some host-device tensor copies async to improve performance

  Bug fixes and QoL improvements:
  - fix init context method when parent modules modify child weights
  - speed up model initialization by moving the model to GPU before weight initialization
  - fixed unit test imports so that unit tests can be run from any directory
  - change performance logging to include memory consumption
  - add logging with model size when done partitioning the model

  New features:
  - bfloat16 support for ZeRO 3
* fix import in ut
* ran yapf
* improvements to cache flush warn log
* backwards compatibility with older versions of pytorch
* handle edge case where reduced tensor smaller than world size
* moved event synchronization to allgather handle wait() call
* removed unnecessary barrier call
* formatting fix after resolving merge conflict
* skip nvme prefetch when trace not complete
* opportunistically avoid memory allocation in allgather coalesced where possible
* fix indentation after merge
* fixes to account for parameter offload
* accounting for torch.cuda.memory_stats not being available
* moved partition_all_params to optimizer step
* allgathering on params before item gets called
* fix param status checks needed after moving partition_all_parameters call to optimizer step
* fix grad accumulation with optimizer offload
* grad norm computation fix for optimizer offload
* change post divide in reduce-scatter to pre divide
* fix gradient race condition w/ optimizer offload
* improve inf/nan gradient tracking
* don't prefetch when not in training mode
* format fix after merging
* fix prefetching issue when using NVME offload
* improved defragmentation for fp16 parameters
* relative imports for bf16 tests
* changes for bwd compatibility with pytorch 1.2
* remove buffered_reduce_fallback
* removed unused parameter offset bookkeeping
* fixed tracking for multiple param groups
* unbroke bfloat16 config after merge conflict
* using base allgather params when only 1 param
* cleanup/fixes for fp16 partition defragmentation
* switch to CRLF
* convert to same new-line style as master
* align new line with master
* Fix merge issues
* switch to CRLF
* fix to LF line endings
* minor merge fixes
* remove extra bfloat16_enabled definition
* asserting params inflight for AllGatherHandle
* remove get_cuda_mem_allocated_str
* Format fixes
* fix bfloat16 zero stage check (broken after merge commit)
* +self.communication_data_type, -self.allreduce_always_fp32; delete dead code
* Add self.reduce_scatter
* Format fix
* Fix merge issues
* iterate over params_to_fetch rather than make another iterator
* add some TODOs
* remove unnecessary division by micro_step_id
* rename config keys "bfloat16" -> "bf16"
* rename stage3_gather_fp16_weights_on_model_save -> stage3_gather_16bit_weights_on_model_save
* add unit test to check backwards compatibility for gather_16bit_weights
* added test to confirm bf16 key bwd compatibility
* Format fixes

Co-authored-by: Rana Ali Amjad <raamjad@amazon.com>
Co-authored-by: Justin Chiu <justchiu@amazon.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
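A back-of-envelope latency model motivates the batching items in this commit: in the usual alpha-beta cost model each collective pays a fixed launch/latency cost (alpha) plus a per-element bandwidth term (beta), so coalescing k small allgathers into one pays the fixed cost once instead of k times. A sketch with illustrative numbers only:

```python
def time_per_tensor(sizes, alpha, beta):
    # one collective per tensor: pays the fixed cost len(sizes) times
    return sum(alpha + s * beta for s in sizes)

def time_coalesced(sizes, alpha, beta):
    # one collective over the flattened buffer: a single fixed cost
    return alpha + sum(sizes) * beta
```

The same arithmetic applies to the batched reduce_scatter calls; the savings grow with the number of small tensors coalesced.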

- 02 Dec 2021 (1 commit)

Committed by Jeff Rasley

- 01 Dec 2021 (1 commit)

Committed by Alex Hedges
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 19 Nov 2021 (1 commit)

Committed by Jeff Rasley

- 13 Nov 2021 (1 commit)

Committed by Cheng Li
* [squash] Staging autotuning v4
  Co-authored-by: Cheng Li <pistasable@gmail.com>
  Co-authored-by: Minjia Zhang <minjiaz@microsoft.com>
  Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* add new extra, guard xgboost, cleanup dead files (#268)
* Fix autotuning docs (#1553)
* fix docs
* rewording the goal
* fix typos
* fix typos (#1556)
* fix typos
* fix format
* fix bug (#1557)
* fix bug

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Minjia Zhang <minjiaz@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 11 Nov 2021 (1 commit)

Committed by Olatunji Ruwase

- 31 Oct 2021 (1 commit)

Committed by Zhen Zhang
* remove norm(), avoid memcpy after allgather
  1) Remove the norm computation in debug printing.
  2) Change _all_gather to be a sync op in fetch_sub_module.
     Reason: the async version is not async at all, because each all_gather calls torch.cuda.synchronize() to guarantee the previous communication op has completed.
  3) Add new function _allgather_params_split_launch: the existing _allgather_params has an explicit memcpy after the all-gather op. We can avoid the explicit memory copy on the python side to improve performance.
  Known issue: `torch.distributed.all_gather` does an implicit memcpy at the end of each `ncclAllgather`.
* WIP: wrapped ncclAllgather as a customized op in DS; a micro benchmark shows the improvement of allgathering a transformer layer with 9834560 elements in half precision is about 1.1ms on an aws-p4d instance
* WIP: integrated into partition_parameters; performance improvement of 5.1B bert on aws-p4d: fwd 300ms -> 200ms, bwd 680ms -> 610ms
* Fix format
* cleaned dead code, modified unit test
* removed customized c++ extension; revert back to using the torch distributed API
* change torch.ones to torch.empty
* typo
* warn if not a cuda tensor for allgather
* fix formatting
* fix: move ds_tensor to cuda device (but it is strange that the ds_tensor hadn't been moved to cuda)
* remove try clause on the path for fetching params

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
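The memcpy this commit avoids can be illustrated without NCCL: a list-style gather produces one buffer per rank and then needs an explicit copy into a flat output, whereas gathering directly into a preallocated flat buffer (what the `*_base`-style collectives do at the torch level) skips that second copy. A pure-Python sketch of the two shapes:

```python
def gather_then_copy(shards):
    # list-style all_gather: one output buffer per rank...
    per_rank = [list(s) for s in shards]
    flat = []
    for buf in per_rank:
        flat.extend(buf)          # ...followed by an explicit copy
    return flat

def gather_into_flat(shards, out):
    # base-style all_gather: write each rank's shard directly at its
    # offset inside one preallocated flat buffer
    width = len(shards[0])
    for rank, shard in enumerate(shards):
        out[rank * width:(rank + 1) * width] = shard
    return out
```

Both return the same flat data; the difference is that the second fills a caller-owned buffer in place, with no intermediate per-rank allocations.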

- 22 Oct 2021 (1 commit)

Committed by Olatunji Ruwase
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 02 Oct 2021 (1 commit)

Committed by Alex Hedges
* Fix typos in docs/
* Fix typos in code comments and output strings
* Fix typos in the code itself
* Fix typos in tests/

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 01 Oct 2021 (1 commit)

Committed by Manuel R. Ciosici

- 30 Sep 2021 (1 commit)

Committed by Jeff Rasley
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Shaden Smith <shaden.smith@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

- 16 Sep 2021 (1 commit)

Committed by Stas Bekman
* [zero Init] fix regression
* clean up the warning

- 02 Sep 2021 (1 commit)

Committed by Olatunji Ruwase

- 13 Jul 2021 (1 commit)

Committed by Stas Bekman
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 12 Jul 2021 (1 commit)

Committed by Stas Bekman

- 10 Jul 2021 (1 commit)

Committed by Stas Bekman
* post_init to be run only by a child module
* better solution
* add test
* safer attr name
* wants half()
* improve doc

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 29 Jun 2021 (1 commit)

Committed by Stas Bekman
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 24 Jun 2021 (1 commit)

Committed by Stas Bekman
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 08 Jun 2021 (1 commit)

Committed by Stas Bekman
* fix missed subclassed partitioning bug
* fix on exit

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 14 May 2021 (1 commit)

Committed by Stas Bekman
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 01 May 2021 (1 commit)

Committed by Stas Bekman

- 30 Apr 2021 (1 commit)

Committed by Samyam Rajbhandari
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 29 Apr 2021 (1 commit)

Committed by Sean Naren
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 22 Apr 2021 (1 commit)

Committed by Cheng Li
* use a weird-shaped tensor to avoid silent failures when not registering external params
* fix typo

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 21 Apr 2021 (1 commit)

Committed by Sean Naren
* Add check to see if the json file is already loaded
* Update doc
* Address review
* Remove doc comment

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
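A plausible shape for the "already loaded" check (hypothetical helper; the actual DeepSpeed code path is not shown in this log): accept either a filesystem path to a JSON config or an already-parsed dict, so callers can pass both:

```python
import json

def load_config(config):
    # If the caller already parsed the JSON, use it as-is; otherwise
    # treat the argument as a path and load the file.
    if isinstance(config, dict):
        return config
    with open(config) as f:
        return json.load(f)
```

This kind of check keeps the API forgiving without double-parsing or re-reading the file.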

- 19 Apr 2021 (1 commit)

Committed by Jeff Rasley
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>

- 17 Apr 2021 (1 commit)

Committed by Olatunji Ruwase
* Fix UnboundLocalError
* Get full partition size
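The log does not show which variable was involved, but the general UnboundLocalError pattern such fixes address is a local assigned on only one branch and then read unconditionally (function and variable names below are purely illustrative):

```python
def partition_size_buggy(remaining, flag):
    if flag:
        size = remaining
    return size    # raises UnboundLocalError when flag is False

def partition_size_fixed(remaining, flag):
    size = 0       # initialize before the conditional
    if flag:
        size = remaining
    return size
```

Because `size` is assigned somewhere in the function body, Python treats it as local throughout, so reading it on the untaken branch fails at runtime rather than falling back to any outer name.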

- 14 Apr 2021 (1 commit)

Committed by Stas Bekman

- 08 Apr 2021 (3 commits)

Committed by Stas Bekman
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Committed by Stas Bekman
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Committed by Samyam Rajbhandari
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 02 Apr 2021 (1 commit)

Committed by Stas Bekman
* zero.Init() clarification: clarify that if `model.half()` can't fit into gpu memory, `zero.Init()` is a must; this proposal is via @samyam's clarification shared elsewhere. Thank you.
* style
* add clarity
* style

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
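The reason zero.Init() is a must in that case is simple arithmetic: materializing the full fp16 model takes 2 bytes per parameter on a single device, while constructing the model under zero.Init() leaves each rank holding only its 1/world_size shard. A sketch with illustrative helper functions (not DeepSpeed API):

```python
def full_fp16_bytes(n_params):
    # whole model materialized on one GPU, 2 bytes per fp16 param
    return 2 * n_params

def sharded_fp16_bytes(n_params, world_size):
    # under zero.Init each rank holds only its partition (rounded up)
    return 2 * (-(-n_params // world_size))
```

For a 10B-parameter model this is roughly 20 GB on one device versus about 2.5 GB per rank across 8 ranks, which is the difference between fitting and not fitting on a 16 GB GPU.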

- 16 Mar 2021 (1 commit)

Committed by Samyam Rajbhandari
* Fix misaligned grad: when a parameter is not divisible by world size, the partitioned gradients are misaligned due to incorrect padding handling; this PR fixes that
* Formatting fix
* Adding static_scale test back for Z3, and also changing hidden size to be not divisible by world_size
* also removing alignment from flat fp16 buffers
* Testing for hidden dim alignment
* inference hook fix
* Update stage3.py
* formatting
* [bug-fix] move params to gpu if offload params is turned off

Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
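The padding arithmetic behind the misalignment fix can be sketched as follows (hypothetical helper, not the actual stage3 code): pad the parameter's element count up to a multiple of world_size so every rank owns an equally sized, aligned shard, with the last rank's shard clamped where the padding begins:

```python
def shard_bounds(numel, world_size, rank):
    # round numel up to the next multiple of world_size
    padded = -(-numel // world_size) * world_size
    shard = padded // world_size        # equal shard size on every rank
    start = rank * shard
    end = min(start + shard, numel)     # clamp off the padding elements
    return start, end, shard
```

Mishandling the padded tail (e.g. computing offsets from `numel` instead of `padded`) is exactly the kind of bug that makes the partitioned gradients land at the wrong offsets.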

- 09 Mar 2021 (1 commit)

Committed by Samyam Rajbhandari
* Squash stage3 v1 (#146)
  Co-authored-by: Samyam <samyamr@microsoft.com>
  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
  Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
  Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
  Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
  Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
  Co-authored-by: eltonzheng <eltonz@microsoft.com>
* Fix correctness bug (#147)
* formatting fix (#150)
* stage3 bugfix (API) update and simplified FP16 Z3 tests (#151)
* fp16 Z3 API update and bugfix
* revert debug change
* ZeRO-3 detach and race condition bugfixes (#149)
* trying out ZeRO-3 race condition fix
* CUDA sync instead of stream
* reduction stream sync
* remove commented code
* Fix optimizer state_dict KeyError (#148)
  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* fix for smaller SGS sizes, ensures each grad is backed by unique tensors (#152)
* Simplifying the logic for getting averaged gradients (#153)
* skip for now
* Z3 Docs redux (#154)
* removing some TODOs and commented code (#155)
* New Z3 defaults (#156)
  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* formatting
* megatron external params

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>