- 07 May 2022, 1 commit

  Committed by Stas Bekman

  * GatheredParameters: accept any iterable (see the sketch below)
  * A torch tensor is itself an iterable, so `collections.abc.Iterable` alone cannot distinguish a single parameter from a list of them
  * fix

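A minimal sketch of the pitfall this commit fixes, using a hypothetical helper `normalize_params` rather than DeepSpeed's actual `GatheredParameters` internals:

```python
# Hypothetical helper illustrating why the tensor case must be tested
# before the generic Iterable check.
from collections.abc import Iterable

import torch


def normalize_params(params):
    """Return a list of parameters from a single tensor or any iterable."""
    # A torch.Tensor is itself iterable (over its first dimension), so a
    # plain isinstance(params, Iterable) check would wrongly unpack a
    # single parameter into its rows.
    if isinstance(params, torch.Tensor):
        return [params]          # single parameter
    if isinstance(params, Iterable):
        return list(params)      # list, tuple, generator, ...
    raise TypeError(f"expected a tensor or an iterable of tensors, got {type(params)}")


p = torch.nn.Parameter(torch.zeros(4, 2))
assert normalize_params(p) == [p]          # not unpacked into 4 rows
assert normalize_params((p, p)) == [p, p]  # any iterable works
```
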
- 06 May 2022, 1 commit

  Committed by Olatunji Ruwase

  * Fix OOM and type mismatch
  * Toggle prefetching
  * Disable z3 prefetching for inference (temp workaround)
  * Fix zero3 tracing issues
  * Remove debug prints
  * Enable prefetch for inference
  * Code clarity
  * Invalidate trace cache
  * Trace cache invalidation when needed
  * Separate nvme prefetch from all-gather prefetch
  * Track last used step id
  * Use debug name in error message
  * Construct param trace from module trace

  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 03 May 2022, 1 commit

  Committed by Jeff Rasley

- 27 April 2022, 2 commits

  Committed by Jeff Rasley

  Committed by Jeff Rasley

  Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
  Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>

- 26 April 2022, 1 commit

  Committed by Olatunji Ruwase

  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 21 April 2022, 1 commit

  Committed by Olatunji Ruwase

  * Fix zero3 tracing issues
  * Remove debug prints
  * Code clarity

- 20 April 2022, 2 commits

  Committed by Stas Bekman

  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

  Committed by Olatunji Ruwase

  * bf16 updates
  * Got bf16 working
  * fp32 reduction; flattened tensors
  * bf16 + zero_stage_1 first cut
  * Finish zero_stage 1 sharding
  * Matching fp16 with debugging codes
  * Matching loss with fp16
  * Fix gradient clipping
  * bf16 gradient clipping fix
  * bf16 checkpoint save/load
  * Unscale grad norm
  * Fix grad norm scaling
  * Enable loading fp16_zero_1 into bf16_zero_1 engine and vice versa
  * Fix clip_grad key error
  * Reduce tied weight gradients
  * Fix grad norm for moe
  * Reduce specified gradients
  * Use O(n) instead of O(n^2)
  * Remove optimizer restriction for bf16
  * Link bf16 & fp32 params (see the sketch below)
  * Clip gradients of last stage tied weights
  * Simplify tied weights reduction logic
  * Also clip all tp rank parameters
  * lp to hp mapping
  * Link lp/hp/optim state; refresh links after checkpoint load
  * Remove debug print
  * Remove debug print
  * Simplify zero_grad logic
  * fp32 accessors
  * Fix update bug

  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

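The "Link bf16 & fp32 params" and "lp to hp mapping" items above describe keeping fp32 master (hp) weights for the bf16 (lp) parameters. A minimal sketch of that idea, with illustrative names (`_hp_param`, `link_hp_params`) that are assumptions, not DeepSpeed's internals:

```python
import torch


def link_hp_params(lp_params):
    """Create fp32 master copies and attach them to the bf16 params."""
    hp_params = []
    for lp in lp_params:
        hp = lp.detach().clone().float()  # fp32 master weight
        hp.requires_grad = True           # the optimizer updates the master
        lp._hp_param = hp                 # lp -> hp link
        hp_params.append(hp)
    return hp_params


model = torch.nn.Linear(8, 8).to(torch.bfloat16)
hp_params = link_hp_params(list(model.parameters()))
optimizer = torch.optim.AdamW(hp_params, lr=1e-3)

# After optimizer.step() on the fp32 masters, refresh the bf16 copies.
for lp in model.parameters():
    lp.data.copy_(lp._hp_param.data.to(torch.bfloat16))
```
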
- 23 March 2022, 1 commit

  Committed by Ammar Ahmad Awan

- 18 March 2022, 1 commit

  Committed by Olatunji Ruwase

  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 17 March 2022, 1 commit

  Committed by Jeff Rasley

- 01 March 2022, 1 commit

  Committed by Ammar Ahmad Awan

  Co-authored-by: yaozhewei <zheweiy@berkeley.edu>
  Co-authored-by: Reza Yazdani <reyazda@microsoft.com>

- 16 February 2022, 1 commit

  Committed by Reza Yazdani

  Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 12 February 2022, 1 commit

  Committed by Olatunji Ruwase

  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 08 February 2022, 1 commit

  Committed by Olatunji Ruwase

  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 31 January 2022, 1 commit

  Committed by Stas Bekman

  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 30 January 2022, 1 commit

  Committed by Stas Bekman

  Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 23 January 2022, 1 commit

  Committed by Alex Hedges

- 21 January 2022, 2 commits

  Committed by Olatunji Ruwase

  Committed by Justin Chiu

  * Changes for bfloat16 Zero2
  * ZeRO stage3 optimizations, with some bug fixes

    Optimizations for stage3:
    - prefetching improvements
    - batching allgather calls to amortize fixed overhead and improve bandwidth utilization
    - batching reduce_scatter calls to amortize fixed overhead and improve bandwidth utilization
    - using `*_base` variants of allgather and reduce_scatter to reduce memory allocations and data movement (see the sketch below)
    - more fine-grained synchronization for communication, which allows blocking on less work
    - precomputation of fetching code: using a fetch queue rather than deciding what to (pre)fetch at each iteration
    - limiting queued coalesced communication ops to reduce memory pressure on the pytorch cuda caching allocator (not an elegant solution)

    Optimizations for stage3-offload:
    - made some host-device tensor copies async to improve performance

    Bug fixes and qol improvements:
    - fix init context method when parent modules modify child weights
    - speed up model initialization by moving the model to GPU before weight initialization
    - fixed unit test imports so that unit tests can be run from any directory
    - change performance logging to include memory consumption
    - add logging with model size when done partitioning the model

    New features:
    - bfloat16 support for ZeRO 3

  * fix import in ut
  * ran yapf
  * improvements to cache flush warn log
  * backwards compatibility with older versions of pytorch
  * handle edge case where the reduced tensor is smaller than world size
  * moved event synchronization to allgather handle wait() call
  * removed unnecessary barrier call
  * formatting fix after resolving merge conflict
  * skip nvme prefetch when trace not complete
  * opportunistically avoid memory allocation in allgather coalesced where possible
  * fix indentation after merge
  * fixes to account for parameter offload
  * accounting for torch.cuda.memory_stats not being available
  * moved partition_all_params to optimizer step
  * allgathering on params before item gets called
  * fix param status checks needed after moving partition_all_parameters call to optimizer step
  * fix grad accumulation with optimizer offload
  * grad norm computation fix for optimizer offload
  * change post-divide in reduce-scatter to pre-divide
  * fix gradient race condition with optimizer offload
  * improve inf/nan gradient tracking
  * don't prefetch when not in training mode
  * format fix after merging
  * fix prefetching issue when using NVME offload
  * improved defragmentation for fp16 parameters
  * relative imports for bf16 tests
  * changes for bwd compatibility with pytorch 1.2
  * remove buffered_reduce_fallback
  * removed unused parameter offset bookkeeping
  * fixed tracking for multiple param groups
  * unbroke bfloat16 config after merge conflict
  * using base allgather params when only 1 param
  * cleanup/fixes for fp16 partition defragmentation
  * switch to CRLF
  * convert to same new-line style as master
  * align new line with master
  * Fix merge issues
  * switch to CRLF
  * fix to LF line endings
  * minor merge fixes
  * remove extra bfloat16_enabled definition
  * asserting params inflight for AllGatherHandle
  * remove get_cuda_mem_allocated_str
  * Format fixes
  * fix bfloat16 zero stage check (broken after merge commit)
  * +self.communication_data_type, -self.allreduce_always_fp32; delete dead code
  * Add self.reduce_scatter
  * Format fix
  * Fix merge issues
  * iterate over params_to_fetch rather than make another iterator
  * add some TODOs
  * remove unnecessary division by micro_step_id
  * rename config keys "bfloat16" -> "bf16"
  * rename stage3_gather_fp16_weights_on_model_save -> stage3_gather_16bit_weights_on_model_save
  * add unit test to check backwards compatibility for gather_16bit_weights
  * added test to confirm bf16 key bwd compatibility
  * Format fixes

  Co-authored-by: Rana Ali Amjad <raamjad@amazon.com>
  Co-authored-by: Justin Chiu <justchiu@amazon.com>
  Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

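The `*_base` item above refers to the flat-buffer collectives in torch.distributed. A minimal sketch of batching several local shards into one flat allgather, assuming an already-initialized process group (`all_gather_into_tensor` is the current name of the former `_all_gather_base`):

```python
import torch
import torch.distributed as dist


def batched_allgather(param_shards, world_size):
    """All-gather several local shards with a single flat collective."""
    # Packing the shards amortizes the fixed per-call overhead that one
    # all_gather per parameter would pay.
    flat_local = torch.cat([s.reshape(-1) for s in param_shards])
    flat_output = torch.empty(world_size * flat_local.numel(),
                              dtype=flat_local.dtype,
                              device=flat_local.device)
    # The *_base / into_tensor variant writes straight into flat_output,
    # avoiding the per-tensor output lists and extra allocations of the
    # list-based all_gather.
    dist.all_gather_into_tensor(flat_output, flat_local)
    return flat_output
```
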
- 19 January 2022, 1 commit

  Committed by Jeff Rasley

  Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
  Co-authored-by: Zhewei Yao <zheweiy@berkeley.edu>
  Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
  Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>

- 15 January 2022, 1 commit

  Committed by Jeff Rasley

  [ZeRO] Default disable elastic ckpt in stage 1+2 and reduce CPU memory overhead during ckpt load (#1525)

  Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 07 January 2022, 1 commit

  Committed by Olatunji Ruwase

- 06 January 2022, 1 commit

  Committed by Olatunji Ruwase

  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 14 December 2021, 1 commit

  Committed by Jeff Rasley

  Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 02 December 2021, 1 commit

  Committed by Jeff Rasley

- 01 December 2021, 1 commit

  Committed by Alex Hedges

  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 30 November 2021, 1 commit

  Committed by Jeff Rasley

  force set lf instead of crlf (https://github.com/pre-commit/pre-commit-hooks#mixed-line-ending) (#1598)

- 28 November 2021, 1 commit

  Committed by Stas Bekman

- 27 November 2021, 1 commit

  Committed by Mikhail Druzhinin

  * fp16 allreduce
  * Undo sparse sum in nan check
  * communication_data_type instead of fp32_allreduce and fp16_allreduce (see the sketch below)
  * sparse_allreduce with fp32 or fp16 data type
  * Fix communication_data_type checks
  * Allow only torch data types for communication_data_type
  * Fix Zero assert messages

  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
  Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

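A sketch of the single communication_data_type idea described above: gradients are cast to one configurable dtype for the collective and cast back afterwards, replacing separate fp32_allreduce/fp16_allreduce flags. Illustrative only, assuming an initialized process group:

```python
import torch
import torch.distributed as dist


def allreduce_in_comm_dtype(grad, communication_data_type=torch.float32):
    """Allreduce `grad` in the configured dtype, returning its own dtype."""
    original_dtype = grad.dtype
    if original_dtype != communication_data_type:
        # e.g. fp16 gradients summed in fp32 for numerical safety
        grad = grad.to(communication_data_type)
    dist.all_reduce(grad)
    return grad.to(original_dtype)
```
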
- 19 November 2021, 1 commit

  Committed by Jeff Rasley

- 17 November 2021, 1 commit

  Committed by Aswin John Mathews

  * Enforce nccl/rccl alignment of the start location of each shard (see the sketch below)
  * Making yapf happy

  Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

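A small worked sketch of the alignment idea: pad the total element count so every rank's shard starts on an aligned offset. The helper name and the example alignment values are illustrative, not NCCL/RCCL constants:

```python
def aligned_partition_size(total_numel, world_size, alignment):
    """Per-rank shard size, padded so each shard start is aligned."""
    chunk = world_size * alignment
    padded = ((total_numel + chunk - 1) // chunk) * chunk  # round up
    return padded // world_size


# 10 elements over 4 ranks with 4-element alignment: total pads to 16,
# each shard gets 4 elements, so shard starts 0, 4, 8, 12 are all aligned.
assert aligned_partition_size(10, 4, 4) == 4
assert aligned_partition_size(16, 4, 2) == 4
```
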
- 14 November 2021, 1 commit

  Committed by Olatunji Ruwase

  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 13 November 2021, 2 commits

  Committed by Cheng Li

  * [squash] Staging autotuning v4
    (Co-authored-by: Cheng Li <pistasable@gmail.com>, Minjia Zhang <minjiaz@microsoft.com>, Olatunji Ruwase <olruwase@microsoft.com>, Jeff Rasley <jerasley@microsoft.com>)
  * add new extra, guard xgboost, cleanup dead files (#268)
  * Fix autotuning docs (#1553): fix docs, reword the goal, fix typos
  * fix typos (#1556): fix typos, fix format
  * fix bug (#1557)

  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
  Co-authored-by: Minjia Zhang <minjiaz@microsoft.com>
  Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

  Committed by Olatunji Ruwase

- 11 November 2021, 1 commit

  Committed by Olatunji Ruwase

- 02 November 2021, 1 commit

  Committed by Rana Ali Amjad

  * Changes for bfloat16 Zero2
  * Cleaned up additional comments and debugging code
  * Adapted fp16_master_weights_and_grads option to cover BF16
  * Reverted fp16_master_weights_and_gradients extension to BFloat16 and minor cleanup
  * Fixed formatting and variable naming errors recognized in testing
  * Added relevant unit tests for bfloat16 with ZeRO-2
  * Updated conditions for skipping BFloat16 unit tests
  * Added check for NCCL's inconsistent version naming convention (see the sketch below)
  * Updated skip message for Bfloat16 tests to mention the additional checks

  Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

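A hedged sketch of the NCCL version check mentioned above. Older PyTorch builds report the NCCL version as a packed integer while newer ones return a tuple, and NCCL itself changed its integer encoding around 2.9, which is the naming inconsistency the commit refers to; treating NCCL >= 2.10 as the bf16 threshold is an assumption for illustration (requires a CUDA/NCCL build of PyTorch):

```python
import torch


def nccl_version_tuple():
    ver = torch.cuda.nccl.version()
    if isinstance(ver, tuple):
        return ver[:2]
    if ver >= 10000:                 # e.g. 21003 -> (2, 10), NCCL >= 2.9 encoding
        return (ver // 10000, (ver % 10000) // 100)
    return (ver // 1000, (ver % 1000) // 100)  # e.g. 2708 -> (2, 7)


def nccl_supports_bf16():
    return nccl_version_tuple() >= (2, 10)
```
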
- 31 October 2021, 1 commit

  Committed by Zhen Zhang

  * remove norm(), avoid memcpy after allgather:
    1) Removed the norm computation in debug printing.
    2) Changed _all_gather to be a sync op in fetch_sub_module. Reason: the async version is not async at all, because each all_gather calls torch.cuda.synchronize() to guarantee the previous communication op has completed.
    3) Added a new function _allgather_params_split_launch: the existing _allgather_params does an explicit memcpy after the all-gather op. We can avoid the explicit memory copy on the python side to improve performance (see the sketch below). Known issue: torch.distributed.all_gather does an implicit memcpy at the end of each ncclAllgather.
  * WIP: wrapped ncclAllgather as a customized op in DS; a micro benchmark shows the improvement for allgathering a transformer layer with 9834560 elements in half precision is about 1.1ms on an aws-p4d instance
  * WIP: integrated into partition_parameters; performance improvement of 5.1B bert on aws-p4d: fwd: 300ms -> 200ms, bwd: 680ms -> 610ms
  * Fix format
  * cleaned dead code, modified unit test
  * removed customized c++ extension, reverting back to the torch distributed API
  * change torch.ones to torch.empty
  * typo
  * warn if not a cuda tensor for allgather
  * fix formatting
  * fix: move ds_tensor to the cuda device (but it is strange that the ds_tensor hadn't already been moved to cuda)
  * remove try clause on the path for fetching params

  Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

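A sketch of the "avoid memcpy after allgather" idea from item 3) above: gather directly into one flat buffer through per-rank views, instead of gathering into a list of fresh chunks and copying them into the parameter afterwards. Illustrative only, assuming an initialized process group:

```python
import torch
import torch.distributed as dist


def allgather_param_no_copy(shard, world_size):
    """Gather `shard` from all ranks into one flat buffer, no trailing copy."""
    flat = torch.empty(world_size * shard.numel(),
                       dtype=shard.dtype, device=shard.device)
    # chunk() returns views sharing storage with `flat`, so all_gather
    # writes each rank's shard in place; no per-chunk copy is needed after.
    chunks = list(flat.chunk(world_size))
    dist.all_gather(chunks, shard)
    return flat
```
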
- 23 October 2021, 1 commit

  Committed by Ammar Ahmad Awan

  * Add unit test to check moe+zero checkpoints
  * Fix zero stage2 checkpoint loading logic to deal with expert-related state dicts (see the sketch below)

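A hypothetical sketch of the expert/non-expert split that loading MoE + ZeRO stage-2 checkpoints needs; the key substring and helper name are illustrative assumptions, not DeepSpeed's API:

```python
import torch


def split_moe_state_dict(state_dict, expert_key="experts"):
    """Split a state dict into expert and non-expert parts by key name."""
    expert_sd, dense_sd = {}, {}
    for name, tensor in state_dict.items():
        (expert_sd if expert_key in name else dense_sd)[name] = tensor
    return expert_sd, dense_sd


sd = {"layer.0.weight": torch.zeros(2),
      "layer.0.experts.0.weight": torch.zeros(2)}
experts, dense = split_moe_state_dict(sd)
assert list(experts) == ["layer.0.experts.0.weight"]
assert list(dense) == ["layer.0.weight"]
```
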