Commits · d56268f3752433d63b1025435d8c597174bf7033 · Greenplum / DeepSpeed

26 Apr, 2023 1 commit

fixing default communication_data_type for bfloat16_enabled and docs (#3370) · d56268f3

Committed by Alexander Jipa on Apr 25, 2023

Co-authored-by: Alexander Jipa <azzhipa@amazon.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

d56268f3

18 Apr, 2023 1 commit

[fp16/doc] correct initial_scale_power default value (#3275) · 72c2d6df

Committed by Stas Bekman on Apr 17, 2023

72c2d6df

12 Apr, 2023 1 commit

Fix typo (#3164) · c0add889

Committed by dawei-wang on Apr 11, 2023

Fix microsoft/DeepSpeed#3163

c0add889

24 Mar, 2023 1 commit

pre-commit check for torch.cuda in code (#2981) · 090d49e7

Committed by Ma, Guokai on Mar 24, 2023

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

090d49e7

09 Jan, 2023 1 commit

[fp16] lower initial_scale_power (#2663) · f30a0308

Committed by Stas Bekman on Jan 09, 2023

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

f30a0308

04 Jan, 2023 1 commit

[doc] fix `min_loss_scale` default (#2660) · f2ea7a38

Committed by Stas Bekman on Jan 03, 2023

* [doc] fix `min_loss_scale` default

* align

f2ea7a38

13 Dec, 2022 1 commit

DeepSpeed Data Efficiency Library (#2585) · ef869377

Committed by Conglong Li on Dec 12, 2022

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

ef869377

28 Nov, 2022 1 commit

Adding Gradient Accumulation Data Type Config (#2512) · 21c28029

Committed by Joe Mayer on Nov 27, 2022

* Adding gradient accumulation dtype config.

* Switching to new DtypeEnum

* Adding standalone check function, and unit tests

* Variable disambiguation

* Adding checks for unsupported states.

* Updating for PR comments.

* Reorganizing unit test.
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

21c28029

05 Nov, 2022 1 commit

Updating autotune json default in docs. (#2476) · 4a06ecf6

Committed by Joe Mayer on Nov 04, 2022

* Updating autotune default in docs.

* Running pre-commit.

4a06ecf6

25 Oct, 2022 1 commit

Fix Bug #2319 (#2438) · 7d113633

Committed by Joe Mayer on Oct 24, 2022

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

7d113633

24 Sep, 2022 1 commit

fix zero docs (#2350) · 76de924b

Committed by Jeff Rasley on Sep 23, 2022

76de924b

04 Aug, 2022 1 commit

update offload docs to include stage 1 (#2178) · fad0a410

Committed by Jeff Rasley on Aug 03, 2022

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

fad0a410

31 Jul, 2022 1 commit

enable fp16 input autocasting (#2158) · a039e226

Committed by Jeff Rasley on Jul 30, 2022

a039e226

30 Jul, 2022 1 commit

Elastic Training support in DeepSpeed (#2153) (#2156) · 1ed5aa96

Committed by Arpan Jain on Jul 29, 2022

Co-authored-by: Arpan Jain <t-arpanjain@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

1ed5aa96

26 Jul, 2022 1 commit

DeepSpeed Communication Profiling and Logging (#2012) · 5349347b

Committed by Quentin Anthony on Jul 25, 2022

Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

5349347b

22 Jul, 2022 1 commit

[docs] website refresh (#2123) · a2506b54

Committed by Jeff Rasley on Jul 21, 2022

Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: yaozhewei <zheweiy@berkeley.edu>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>

a2506b54

20 Jul, 2022 1 commit

Adding DeepSpeed Compression Composer (#2105) · 0f4f2f98

Committed by Zhewei Yao on Jul 19, 2022

Co-authored-by: yaozhewei <zheweiy@berkeley.edu>
Co-authored-by: xiaoxiawu <yxiaoxiawu@microsoft.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Xiaoxia (Shirley) Wu <94406484+xiaoxiawu-microsoft@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

0f4f2f98

14 Jul, 2022 1 commit

Add missing newline for ZeroOneAdam parameter table (#2088) · db3252b0

Committed by Manuel R. Ciosici on Jul 13, 2022

Co-authored-by: Conglong Li <conglong.li@gmail.com>

db3252b0

16 Jun, 2022 1 commit

DeepSpeed Monitor Module (Master) (#2013) · c87f6ee2

Committed by Quentin Anthony on Jun 16, 2022

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

c87f6ee2

23 Mar, 2022 1 commit

Update config-json.md (#1853) · b61d7199

Committed by Sayed Hadi Hashemi on Mar 22, 2022

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

b61d7199

11 Mar, 2022 1 commit

01 adam optimizer (#1790) · b80e5624

Committed by Yucheng Lu on Mar 10, 2022

Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

b80e5624

23 Jan, 2022 1 commit

Add codespell to pre-commit checks (#1717) · 4cf970e6

Committed by Alex Hedges on Jan 22, 2022

4cf970e6

22 Jan, 2022 1 commit

Align bfloat16 docs (#1715) · 09c065b4

Committed by Manuel R. Ciosici on Jan 21, 2022

09c065b4

21 Jan, 2022 1 commit

Various ZeRO Stage3 Optimizations + Improvements (including bfloat16 support) (#1453) · 4912e0ad

Committed by Justin Chiu on Jan 20, 2022

* Changes for bfloat16 Zero2

* ZeRO stage3 optimizations, with some bug fixes

optimizations for stage3:
- prefetching improvements
- batching allgather calls to amortize fixed overhead and improve
  bandwidth utilization
- batching reduce_scatter calls to amortize fixed overhead and
  improve bandwidth utilization
- using *_base variants of allgather and reduce scatter to reduce memory
  allocations and data movement
- more fine grained synchronization for communication that allows
  blocking on less work
- precomputation of fetching code - using a fetch queue rather than
  deciding what to (pre)fetch at each iteration
- limiting queued coalesced communication ops to reduce memory pressure
  on pytorch cuda caching allocator (not elegant solution)

optimizations for stage3-offload:
- made some host-device tensor copies async to improve performance

bug fixes and qol improvements:
- fix init context method when parent modules modify child weights
- speed up model initialization by moving model to GPU before weight
  initialization
- fixed unit test imports so that unit tests can be run from any
  directory
- change performance logging to include memory consumption
- add logging w/ model size when done partitioning model

new features
- bfloat16 support for ZeRO 3

* fix import in ut

* ran yapf

* improvements to cache flush warn log

* backwards compatibility with older versions of pytorch

* handle edge case where reduced tensor smaller than world size

* moved event synchronization to allgather handle wait() call

* removed unnecessary barrier call

* formatting fix after resolving merge conflict

* skip nvme prefetch when trace not complete

* opportunistically avoid memory allocation in allgather coalesced where possible

* fix indentation after merge

* fixes to account for parameter offload

* accounting for torch.cuda.memory_stats not being available

* moved partition_all_params to optimizer step

* allgathering on params before item gets called

* fix param status checks

needed after moving partition_all_parameters call to optimizer step

* fix grad accumulation with optimizer offload

* grad norm computation fix for optimizer offload

* change post divide in reduce-scatter to pre divide

* fix gradient race condition w/ optimizer offload

* improve inf/nan gradient tracking

* don't prefetch when not in training mode

* format fix after merging

* fix prefetching issue when using NVME offload

* improved defragmentation for fp16 parameters

* relative imports for bf16 tests

* changes for bwd compatibility with pytorch 1.2

* remove buffered_reduce_fallback

* removed unused parameter offset bookkeeping

* fixed tracking for multiple param groups

* unbroke bfloat16 config after merge conflict

* using base allgather params when only 1 param

* cleanup/fixes for fp16 partition defragmentation

* switch to CRLF

* convert to same new-line style as master

* align new line with master

* Fix merge issues

* switch to CRLF

* fix to LF line endings

* minor merge fixes

* remove extra bfloat16_enabled definition

* asserting params inflight for AllGatherHandle

* remove get_cuda_mem_allocated_str

* Format fixes

* fix bfloat16 zero stage check (broken after merge commit)

* +self.communication_data_type, -self.allreduce_always_fp32; delete dead code

* Add self.reduce_scatter

* Format fix

* Fix merge issues

* iterate over params_to_fetch rather than make another iterator

* add some TODOs

* remove unnecessary division by micro_step_id

* rename config keys "bfloat16" -> "bf16"

* rename stage3_gather_fp16_weights_on_model_save -> stage3_gather_16bit_weights_on_model_save

* add unit test to check backwards compatibility for gather_16bit_weights

* added test to confirm bf16 key bwd compatibility

* Format fixes
Co-authored-by: Rana Ali Amjad <raamjad@amazon.com>
Co-authored-by: Justin Chiu <justchiu@amazon.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

4912e0ad

04 Jan, 2022 1 commit

Various small documentation text improvements (#1665) · d0ab7224

Committed by Manuel R. Ciosici on Jan 03, 2022

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

d0ab7224

14 Dec, 2021 1 commit

Refactor ZeRO naming to reduce confusion (#1607) · 1d295ff5

Committed by Jeff Rasley on Dec 13, 2021

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

1d295ff5

27 Nov, 2021 1 commit

allreduce_always_fp16 (#1487) · d14baad9

Committed by Mikhail Druzhinin on Nov 26, 2021

* fp16 allreduce

* Undo sparse sum in nan check

* communication_data_type instead of fp32_allreduce and fp16_allreduce

* sparse_allreduce with fp32 or fp16 data type

* Fix communication_data_type checks

* Allow only torch data types for communication_data_type

* Fix Zero assert messages
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

d14baad9

23 Nov, 2021 1 commit

Add documentation for TensorBoard logging (#1577) · e1b4aa8f

Committed by Manuel R. Ciosici on Nov 23, 2021

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

e1b4aa8f

13 Nov, 2021 2 commits

Autotuning (#1554) · 9caa74e5

Committed by Cheng Li on Nov 13, 2021

* [squash] Staging autotuning v4
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Minjia Zhang <minjiaz@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* add new extra, guard xgboost, cleanup dead files (#268)

* Fix autotuning docs (#1553)

* fix docs

* rewording the goal

* fix typos

* fix typos (#1556)

* fix typos

* fix format

* fix bug (#1557)

* fix bug
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Minjia Zhang <minjiaz@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

9caa74e5

Add documentation for bfloat16 (git commit 648f7bfa) (#1516) · b7cc7c8e

Committed by Manuel R. Ciosici on Nov 12, 2021

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

b7cc7c8e

02 Oct, 2021 1 commit

Fix many typos (#1423) · be789b16

Committed by Alex Hedges on Oct 01, 2021

* Fix typos in docs/

* Fix typos in code comments and output strings

* Fix typos in the code itself

* Fix typos in tests/
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

be789b16

01 Oct, 2021 1 commit

Add assert to ensure we don't skip unsupported grad dtypes (#1418) · 0457bb1c

Committed by Jeff Rasley on Sep 30, 2021

0457bb1c

17 Aug, 2021 1 commit

Curriculum learning (#1307) · b2b34ae3

Committed by Conglong Li on Aug 16, 2021

Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

b2b34ae3

30 Jul, 2021 1 commit

[Doc] round_robin_gradients (#1261) · 40c381df

Committed by Olatunji Ruwase on Jul 29, 2021

* Fix docstring

* Make screenshots clickable for easier viewing

* Navigation menu in alphabetical order; More clickable screenshots

* Rename 1Cycle doc

* Tweak naming

* Remove no longer used flag

* ZeRO3 Offload release

* Single GPU results

* Rearrange figures

* Single GPU text

* tweak intro

* zero3-offload section

* Add asynchronous i/o docs

* Fix print_per_steps doc

* Document round_robin_gradients

* Tweak description

* Trigger CI

40c381df

02 Jul, 2021 1 commit

contiguous gradients should be set to True by default (#1199) · c9fee821

Committed by Samyam Rajbhandari on Jul 01, 2021

* contiguous gradients should be set to True by default

* Set contiguous gradients to True by default

Features such as reduce_scatter depend on contiguous gradients being True. This is also the preferred default configuration.

c9fee821

17 Jun, 2021 1 commit

[Doc] Fix steps_per_print description (#1163) · fa7921e2

Committed by Olatunji Ruwase on Jun 16, 2021

* Fix docstring

* Make screenshots clickable for easier viewing

* Navigation menu in alphabetical order; More clickable screenshots

* Rename 1Cycle doc

* Tweak naming

* Remove no longer used flag

* ZeRO3 Offload release

* Single GPU results

* Rearrange figures

* Single GPU text

* tweak intro

* zero3-offload section

* Add asynchronous i/o docs

* Fix print_per_steps doc

fa7921e2

09 Jun, 2021 1 commit

correct cpu_offload deprecation (#1140) · a8d6dfe8

Committed by Stas Bekman on Jun 08, 2021

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

a8d6dfe8

20 May, 2021 1 commit

ZeRO stage 1 refresh (#1042) · cfa63f5d

Committed by Jeff Rasley on May 19, 2021

cfa63f5d

14 May, 2021 1 commit

[docs] unused parameter handling (#1060) · 63c5070e

Committed by Olatunji Ruwase on May 13, 2021

* Fix docstring

* Make screenshots clickable for easier viewing

* Navigation menu in alphabetical order; More clickable screenshots

* Rename 1Cycle doc

* Tweak naming

* Remove no longer used flag

* ZeRO3 Offload release

* Single GPU results

* Rearrange figures

* Single GPU text

* tweak intro

* zero3-offload section

* Add asynchronous i/o docs

63c5070e

13 May, 2021 1 commit

Improve flops profiler functionality (#1065) · 4544b7d2

Committed by Cheng Li on May 12, 2021

* use the original function's name as the key to old_functions dict

* update profile output format

* print at global rank 0

* add flops calculation in bwd pass using time from ds timers

* improve aggregated profiling out to show all depth

* print samples/second

* update readme and examples

* update docs

* fix typo and reorder printing

* fix format

4544b7d2

Greenplum / DeepSpeed, last synced about 1 year ago