Commits · fe6785447d4e7a3918aa5d6b24b5c9473a695483 · Greenplum / DeepSpeed

18 Nov, 2022 1 commit
- Add missing Inference sub-configs (#2518) · fe678544
  Committed by Michael Wyatt on Nov 17, 2022
  
  fe678544
16 Nov, 2022 1 commit
- Update docs to autogenerate pydantic config model docs (#2509) · 43bf035c
  Committed by Michael Wyatt on Nov 15, 2022
```
* update zero config docs
* add autogenerated docs for pydantic models used in ZeRO and Inference configs
```
  43bf035c
12 Aug, 2022 1 commit
- Correctly detect offload configuration (#2208) · 5870f36c
  Committed by Olatunji Ruwase on Aug 11, 2022
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
  5870f36c
01 Aug, 2022 1 commit

Release swap buffers for persisted params (#2089) · 2210ebe7

Committed by Olatunji Ruwase on Jul 31, 2022

* Split parameter offload from z3

* Format fixes

* Bug fixes

* Cleanup

* Remove dead code

* Release swap buffers for persisted params

* Format fixes

* Format fixes

* Pass args correctly

* Use pinned memory for nvme offload

* Merge with master

* Fix missing import

* model persistence params

* Fix merge issues

* Handle none device

* Use log_dist

2210ebe7

27 Jul, 2022 1 commit

Refactor ZeRO configs to use Pydantic (#2004) · 59975896

Committed by Michael Wyatt on Jul 27, 2022

* first pass at pydanticifying Zero Configs

* added pydantic to reqs

* fixed bug with deprecated values not being type-checked

* fixing zero config bugs from unit tests

* fixed access of Config values

* removing zero constants

* formatting/fix broken import

* fixed bad merge

* fixed issue with missing aliased field

* fix for failing tests

* fix how deprecated fields are processed

* only process dep params when they are set

* fix mistyped field name

* fixes, docs, removed more constants

* fix merge

* more fixes after merge w master

* added unit tests

* formatting

* added fix for transformers unit tests

* separated offload config from zero config

* fixed bad import

* formatting and flake fixes

* implement suggestion from review
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

59975896

21 Jan, 2022 1 commit

Various ZeRO Stage3 Optimizations + Improvements (including bfloat16 support) (#1453) · 4912e0ad

Committed by Justin Chiu on Jan 20, 2022

* Changes for bfloat16 Zero2

* ZeRO stage3 optimizations, with some bug fixes

optimizations for stage3:
- prefetching improvements
- batching allgather calls to amortize fixed overhead and improve
  bandwidth utilization
- batching reduce_scatter calls to amortize fixed overhead and
  improve bandwidth utilization
- using *_base variants of allgather and reduce scatter to reduce memory
  allocations and data movement
- more fine grained synchronization for communication that allows
  blocking on less work
- precomputation of fetching code - using a fetch queue rather than
  deciding what to (pre)fetch at each iteration
- limiting queued coalesced communication ops to reduce memory pressure
  on pytorch cuda caching allocator (not elegant solution)

optimizations for stage3-offload:
- made some host-device tensor copies async to improve performance

bug fixes and qol improvements:
- fix init context method when parent modules modify child weights
- speed up model initialization by moving model to GPU before weight
  initialization
- fixed unit test imports so that unit tests can be run from any
  directory
- change performance logging to include memory consumption
- add logging w/ model size when done partitioning model

new features
- bfloat16 support for ZeRO 3

* fix import in ut

* ran yapf

* improvements to cache flush warn log

* backwards compatibility with older versions of pytorch

* handle edge case where reduced tensor smaller than world size

* moved event synchronization to allgather handle wait() call

* removed unnecessary barrier call

* formatting fix after resolving merge conflict

* skip nvme prefetch when trace not complete

* opportunistically avoid memory allocation in allgather coalesced where possible

* fix indentation after merge

* fixes to account for parameter offload

* accounting for torch.cuda.memory_stats not being available

* moved partition_all_params to optimizer step

* allgathering on params before item gets called

* fix param status checks

needed after moving partition_all_parameters call to optimizer step

* fix grad accumulation with optimizer offload

* grad norm computation fix for optimizer offload

* change post divide in reduce-scatter to pre divide

* fix gradient race condition w/ optimizer offload

* improve inf/nan gradient tracking

* don't prefetch when not in training mode

* format fix after merging

* fix prefetching issue when using NVME offload

* improved defragmentation for fp16 parameters

* relative imports for bf16 tests

* changes for bwd compatibility with pytorch 1.2

* remove buffered_reduce_fallback

* removed unused parameter offset bookkeeping

* fixed tracking for multiple param groups

* unbroke bfloat16 config after merge conflict

* using base allgather params when only 1 param

* cleanup/fixes for fp16 partition defragmentation

* switch to CRLF

* convert to same new-line style as master

* align new line with master

* Fix merge issues

* switch to CRLF

* fix to LF line endings

* minor merge fixes

* remove extra bfloat16_enabled definition

* asserting params inflight for AllGatherHandle

* remove get_cuda_mem_allocated_str

* Format fixes

* fix bfloat16 zero stage check (broken after merge commit)

* +self.communication_data_type, -self.allreduce_always_fp32; delete dead code

* Add self.reduce_scatter

* Format fix

* Fix merge issues

* iterate over params_to_fetch rather than make another iterator

* add some TODOs

* remove unnecessary division by micro_step_id

* rename config keys "bfloat16" -> "bf16"

* rename stage3_gather_fp16_weights_on_model_save -> stage3_gather_16bit_weights_on_model_save

* add unit test to check backwards compatibility for gather_16bit_weights

* added test to confirm bf16 key bwd compatibility

* Format fixes
Co-authored-by: Rana Ali Amjad <raamjad@amazon.com>
Co-authored-by: Justin Chiu <justchiu@amazon.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

4912e0ad
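
The commit above renames two config keys ("bfloat16" to "bf16", and stage3_gather_fp16_weights_on_model_save to stage3_gather_16bit_weights_on_model_save) and adds tests for backward compatibility. One way such compatibility could be kept is sketched below. This is not DeepSpeed's actual code: `RENAMED_KEYS` and `normalize_keys` are hypothetical names introduced only for illustration, and the nesting of the keys in the sample config is an assumption.

```python
import warnings

# Hypothetical table of old -> new key names, taken from the renames
# listed in the commit message. Not part of DeepSpeed.
RENAMED_KEYS = {
    "bfloat16": "bf16",
    "stage3_gather_fp16_weights_on_model_save":
        "stage3_gather_16bit_weights_on_model_save",
}


def normalize_keys(config):
    """Return a copy of config with old key names replaced, recursively."""
    result = {}
    for key, value in config.items():
        if isinstance(value, dict):
            value = normalize_keys(value)
        new_key = RENAMED_KEYS.get(key, key)
        if new_key != key:
            warnings.warn(
                f"config key '{key}' is deprecated, use '{new_key}'",
                DeprecationWarning,
            )
            if new_key in config:
                # When both spellings are present, the new one wins.
                continue
        result[new_key] = value
    return result


# Usage: a config written with the old spellings still resolves.
old = {
    "bfloat16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "stage3_gather_fp16_weights_on_model_save": True,
    },
}
new = normalize_keys(old)
```

Normalizing once at load time keeps the rest of the code reading only the new names, so the old spellings can later be dropped in a single place.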

01 Dec, 2021 1 commit
- Improve pre-commit hooks (#1602) · fc2f378e
  Committed by Alex Hedges on Nov 30, 2021
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
  fc2f378e
29 Jul, 2021 2 commits
- Use correct default for round robin gradients (#1258) · 97f7ed9e
  Committed by Olatunji Ruwase on Jul 28, 2021
```
* Make round robin gradient partitioning configurable (default False)

* Use the correct default

* Log config setting
```
  97f7ed9e
- Make round robin gradient partitioning configurable (default False) (#1256) · 4d420df5
  Committed by Olatunji Ruwase on Jul 28, 2021
  
  4d420df5
09 Jun, 2021 1 commit

correct cpu_offload deprecation (#1140) · a8d6dfe8

Committed by Stas Bekman on Jun 08, 2021

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

a8d6dfe8

20 May, 2021 1 commit
- ZeRO stage 1 refresh (#1042) · cfa63f5d
  Committed by Jeff Rasley on May 19, 2021
  
  cfa63f5d
08 May, 2021 1 commit

Avoid unused parameters assert by default (#1039) · 5b393f15

Committed by Olatunji Ruwase on May 07, 2021

* Unused parameters assert should be disabled by default

* Fix message

* Invert assert logic in unit test

* Change option for ignoring unused parameters
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

5b393f15

25 Apr, 2021 1 commit

Add find_unused_parameters option to DeepSpeedEngine (#945) · d0b61f18

Committed by hamlet on Apr 25, 2021

* Add find_unused_parameters option

Because unused parameters in modules are sometimes unexpected,
add an explicit error message when they occur and an option to avoid the error: https://github.com/microsoft/DeepSpeed/issues/707

* Add find_unused_parameters option

Because unused parameters in modules are sometimes unexpected,
add an explicit error message when they occur and an option to avoid the error: https://github.com/microsoft/DeepSpeed/issues/707

* Fix syntax error

* Fix yapf error

* Fix yapf error

* Fix yapf error

* Fix yapf error

* Move stage2 find_unused_parameters to config file

* Add stage2 find_unused_parameters

* Add stage2 find_unused_parameters

* Add stage2_find_unused_parameters option

* Change error msg to reflect zero_optimization config change

* Fix yapf error

* Fix yapf errors

* Change find_unused_parameters option name

* Change find_unused_parameters option name

* Change find_unused_parameters option name

* Change find_unused_parameters option name

* Change find_unused_parameters option name

* Add UnusedParametersModel for test option find_unused_parameters

* Add unit test for stage2 find_unused_parameters

* Add cpu-adam compatible check

* Remove dups import

* Trim spaces

* Fix yapf errors

* Trim spaces

* Add False Positive test check

* Fix find_unused_parameters test

* Trim spaces

* Fix yapf error

d0b61f18

19 Apr, 2021 1 commit

ZeRO-Infinity (#976) · 0d4a54a0

Committed by Jeff Rasley on Apr 18, 2021

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>

0d4a54a0

08 Apr, 2021 1 commit

improved readability + typos (#895) · 5ca86ae4

Committed by Stas Bekman on Apr 07, 2021

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

5ca86ae4

27 Mar, 2021 1 commit
- save_fp16_model consolidated for zero3 (#893) · 39013dd2
  Committed by Stas Bekman on Mar 26, 2021
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```
  39013dd2
17 Mar, 2021 1 commit
- Make config objects json serializable (#862) · 7bcd72a2
  Committed by Olatunji Ruwase on Mar 16, 2021
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
  7bcd72a2
09 Mar, 2021 1 commit

ZeRO 3 Offload (#834) · 599258f9

Committed by Samyam Rajbhandari on Mar 08, 2021

* Squash stage3 v1 (#146)
Co-authored-by: Samyam <samyamr@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>

* Fix correctness bug (#147)

* formatting fix (#150)

* stage3 bugfix (API) update and simplified FP16 Z3 tests (#151)

* fp16 Z3 API update and bugfix

* revert debug change

* ZeRO-3 detach and race condition bugfixes (#149)

* trying out ZeRO-3 race condition fix

* CUDA sync instead of stream

* reduction stream sync

* remove commented code

* Fix optimizer state_dict KeyError (#148)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* fix for smaller SGS sizes, ensures each grad is backed by unique tensors (#152)

* Simplifying the logic for getting averaged gradients (#153)

* skip for now

* Z3 Docs redux (#154)

* removing some TODOs and commented code (#155)

* New Z3 defaults (#156)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* formatting

* megatron external params
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>

599258f9

12 Dec, 2020 1 commit
- add DeepSpeedZeroConfig repr method (#596) · 66268bd3
  Committed by Stas Bekman on Dec 11, 2020
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
  66268bd3
18 Nov, 2020 1 commit

Fix layout bug in ZeRO Stage 1 checkpoint logic (#531) · 7752dc5e

Committed by Olatunji Ruwase on Nov 17, 2020

* Fix layout bug in ZeRO Stage 1 checkpoint logic
Add elastic checkpoint option for ZeRO stage 1, default to True

* Format fixes

7752dc5e

10 Sep, 2020 1 commit

ZeRO-Offload release (#391) · 41db1c2f

Committed by Jeff Rasley on Sep 09, 2020

* ZeRO-Offload (squash) (#381)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Jie <37380896+jren73@users.noreply.github.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: arashashari <arashashari@ArashMSLaptop.redmond.corp.microsoft.com>
Co-authored-by: RezaYazdaniAminabadi <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>

41db1c2f

02 Sep, 2020 1 commit

Sparse attn + ops/runtime refactor + v0.3.0 (#343) · e5bbc2e5

Committed by Jeff Rasley on Sep 01, 2020

* Sparse attn + ops/runtime refactor + v0.3.0
Co-authored-by: Arash Ashari <arashari@microsoft.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>

e5bbc2e5

14 Jul, 2020 1 commit

Support loading and saving ZeRO checkpoints with changing DP degree (#240) · 7ccc9daf

Committed by Olatunji Ruwase on Jul 14, 2020

* Support saving and loading ZeRO checkpoints on different data
parallelism degree.

* Fix formatting

* Support checkpoint with varying GPU count in ZeRO stage 1

* Fix formatting

* Formatting fixes

* Update model tests

* Remove pprint

* Minor fix

* Fix formatting

* Update model tests
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

7ccc9daf
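
Loading a checkpoint under a different data-parallel (DP) degree means that state sharded across N ranks has to be re-sharded across M ranks. The sketch below shows the general idea on a plain Python list: merge the saved shards, drop the padding, and split again. It is an illustration under simplifying assumptions, not DeepSpeed's checkpoint logic. `partition` and `repartition` are hypothetical helpers, and zero-padding to equal shard sizes is assumed.

```python
def partition(flat, dp_degree):
    """Split a flat list into dp_degree equal shards, zero-padding the tail."""
    shard_size = -(-len(flat) // dp_degree)  # ceiling division
    padded = flat + [0.0] * (shard_size * dp_degree - len(flat))
    return [padded[i * shard_size:(i + 1) * shard_size]
            for i in range(dp_degree)]


def repartition(shards, true_length, new_dp_degree):
    """Merge saved shards, strip the padding, and re-split for a new DP degree."""
    merged = [x for shard in shards for x in shard][:true_length]
    return partition(merged, new_dp_degree)


# Usage: state saved with 4 ranks, then loaded with 3 ranks.
state = [float(i) for i in range(10)]
saved = partition(state, 4)
loaded = repartition(saved, len(state), 3)
```

The unpadded length has to be stored alongside the shards. Without it, padding from the old layout cannot be told apart from real values when the state is split again.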

05 Jun, 2020 1 commit

Add log util (#230) · e1ad8803

Committed by Chunyang Wen on Jun 05, 2020

* Add log util

* replace all occurrences of print and logging

* address format

* disable propagate to avoid duplicate log

e1ad8803

28 May, 2020 1 commit
- Default Contiguous Gradients False (#239) · 01e848b3
  Committed by Samyam Rajbhandari on May 27, 2020
```
Contiguous Gradients should be set to false by default. It's not useful unless the model is very large.
```
  01e848b3
19 May, 2020 1 commit

ZeRO-2 (#217) · f2ac7eaf

Committed by Jeff Rasley on May 19, 2020

Updates for ZeRO stage 2 + ZeRO stage 1 w. RS
Co-authored-by: Tunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: Elton Zheng <eltonz@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: yuxionghe <yuxhe@microsoft.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>
f2ac7eaf

Greenplum / DeepSpeed · last synced about 1 year ago
