Commits · 4cf970e6bb3c2ff29b2f03fcddb6f2cf26245a23 · Greenplum / DeepSpeed

23 Jan, 2022 · 1 commit

Add codespell to pre-commit checks (#1717) · 4cf970e6

Authored by Alex Hedges on Jan 22, 2022

22 Jan, 2022 · 1 commit

Align bfloat16 docs (#1715) · 09c065b4

Authored by Manuel R. Ciosici on Jan 21, 2022

21 Jan, 2022 · 1 commit

Various ZeRO Stage3 Optimizations + Improvements (including bfloat16 support) (#1453) · 4912e0ad

Authored by Justin Chiu on Jan 20, 2022

* Changes for bfloat16 Zero2

* ZeRO stage3 optimizations, with some bug fixes

optimizations for stage3:
- prefetching improvements
- batching allgather calls to amortize fixed overhead and improve
  bandwidth utilization
- batching reduce_scatter calls to amortize fixed overhead and
  improve bandwidth utilization
- using *_base variants of allgather and reduce scatter to reduce memory
  allocations and data movement
- more fine grained synchronization for communication that allows
  blocking on less work
- precomputation of fetching code - using a fetch queue rather than
  deciding what to (pre)fetch at each iteration
- limiting queued coalesced communication ops to reduce memory pressure
  on pytorch cuda caching allocator (not elegant solution)

optimizations for stage3-offload:
- made some host-device tensor copies async to improve performance

bug fixes and qol improvements:
- fix init context method when parent modules modify child weights
- speed up model initialization by moving model to GPU before weight
  initialization
- fixed unit test imports so that unit tests can be run from any
  directory
- change performance logging to include memory consumption
- add logging w/ model size when done partitioning model

new features
- bfloat16 support for ZeRO 3

* fix import in ut

* ran yapf

* improvements to cache flush warn log

* backwards compatibility with older versions of pytorch

* handle edge case where reduced tensor smaller than world size

* moved event synchronization to allgather handle wait() call

* removed unnecessary barrier call

* formatting fix after resolving merge conflict

* skip nvme prefetch when trace not complete

* opportunistically avoid memory allocation in allgather coalesced where possible

* fix indentation after merge

* fixes to account for parameter offload

* accounting for torch.cuda.memory_stats not being available

* moved partition_all_params to optimizer step

* allgathering on params before item gets called

* fix param status checks

needed after moving partition_all_parameters call to optimizer step

* fix grad accumulation with optimizer offload

* grad norm computation fix for optimizer offload

* change post divide in reduce-scatter to pre divide

* fix gradient race condition w/ optimizer offload

* improve inf/nan gradient tracking

* don't prefetch when not in training mode

* format fix after merging

* fix prefetching issue when using NVME offload

* improved defragmentation for fp16 parameters

* relative imports for bf16 tests

* changes for bwd compatibility with pytorch 1.2

* remove buffered_reduce_fallback

* removed unused parameter offset bookkeeping

* fixed tracking for multiple param groups

* unbroke bfloat16 config after merge conflict

* using base allgather params when only 1 param

* cleanup/fixes for fp16 partition defragmentation

* switch to CRLF

* convert to same new-line style as master

* align new line with master

* Fix merge issues

* switch to CRLF

* fix to LF line endings

* minor merge fixes

* remove extra bfloat16_enabled definition

* asserting params inflight for AllGatherHandle

* remove get_cuda_mem_allocated_str

* Format fixes

* fix bfloat16 zero stage check (broken after merge commit)

* +self.communication_data_type, -self.allreduce_always_fp32; delete dead code

* Add self.reduce_scatter

* Format fix

* Fix merge issues

* iterate over params_to_fetch rather than make another iterator

* add some TODOs

* remove unnecessary division by micro_step_id

* rename config keys "bfloat16" -> "bf16"

* rename stage3_gather_fp16_weights_on_model_save -> stage3_gather_16bit_weights_on_model_save

* add unit test to check backwards compatibility for gather_16bit_weights

* added test to confirm bf16 key bwd compatibility

* Format fixes
Co-authored-by: Rana Ali Amjad <raamjad@amazon.com>
Co-authored-by: Justin Chiu <justchiu@amazon.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
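The two config renames near the end of this change ("bfloat16" -> "bf16" and stage3_gather_fp16_weights_on_model_save -> stage3_gather_16bit_weights_on_model_save) can be pictured as a small key migration. The sketch below is an illustration only: `migrate_legacy_keys` is a hypothetical helper, not DeepSpeed code, and it assumes the gather flag lives under the `zero_optimization` section of the config.

```python
import copy

# Old key -> new key, as named in the commit message.
TOP_LEVEL_RENAMES = {"bfloat16": "bf16"}
ZERO_RENAMES = {
    "stage3_gather_fp16_weights_on_model_save":
        "stage3_gather_16bit_weights_on_model_save",
}


def _rename(section, renames):
    """Rename keys in place; an already present new-style key wins."""
    for old, new in renames.items():
        if old in section:
            value = section.pop(old)
            section.setdefault(new, value)


def migrate_legacy_keys(config):
    """Return a copy of `config` with legacy key names replaced (hypothetical helper)."""
    migrated = copy.deepcopy(config)
    _rename(migrated, TOP_LEVEL_RENAMES)
    # Assumption: the gather flag sits under "zero_optimization".
    zero = migrated.get("zero_optimization")
    if isinstance(zero, dict):
        _rename(zero, ZERO_RENAMES)
    return migrated


legacy = {
    "bfloat16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "stage3_gather_fp16_weights_on_model_save": True,
    },
}
print(migrate_legacy_keys(legacy))
```

The input dictionary is left untouched, so an old config can still be inspected after migration.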

04 Jan, 2022 · 1 commit

Various small documentation text improvements (#1665) · d0ab7224

Authored by Manuel R. Ciosici on Jan 03, 2022

```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```

14 Dec, 2021 · 1 commit

Refactor ZeRO naming to reduce confusion (#1607) · 1d295ff5

Authored by Jeff Rasley on Dec 13, 2021

```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```

27 Nov, 2021 · 1 commit

allreduce_always_fp16 (#1487) · d14baad9

Authored by Mikhail Druzhinin on Nov 26, 2021

* fp16 allreduce

* Undo sparse sum in nan check

* communication_data_type instead of fp32_allreduce and fp16_allreduce

* sparse_allreduce with fp32 or fp16 data type

* Fix communication_data_type checks

* Allow only torch data types for communication_data_type

* Fix Zero assert messages
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
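As a rough picture of what replacing fp32_allreduce and fp16_allreduce with a single communication_data_type setting means, the sketch below collapses two boolean flags into one value and validates it. Both functions are hypothetical, the accepted names are limited to the two types mentioned in this commit, and treating `None` as "use the gradients' own dtype" is an assumption of the sketch rather than documented behavior.

```python
# Assumption: only the two data types named in the commit message are accepted.
SUPPORTED = ("fp16", "fp32")


def pick_communication_data_type(fp32_allreduce=False, fp16_allreduce=False):
    """Collapse two legacy boolean flags into one setting (hypothetical helper)."""
    if fp32_allreduce and fp16_allreduce:
        raise ValueError("fp32_allreduce and fp16_allreduce are mutually exclusive")
    if fp32_allreduce:
        return "fp32"
    if fp16_allreduce:
        return "fp16"
    # Assumption: None means "communicate in the gradients' own dtype".
    return None


def validate_communication_data_type(value):
    """Reject anything that is not a known data type name."""
    if value is not None and value not in SUPPORTED:
        raise ValueError(f"unsupported communication_data_type: {value!r}")
    return value


print(validate_communication_data_type(pick_communication_data_type(fp16_allreduce=True)))
```

A single setting removes the invalid state in which both flags are set at once, which is the main design benefit of the rename.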

23 Nov, 2021 · 1 commit

Add documentation for TensorBoard logging (#1577) · e1b4aa8f

Authored by Manuel R. Ciosici on Nov 23, 2021

```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```

13 Nov, 2021 · 2 commits

Autotuning (#1554) · 9caa74e5

Authored by Cheng Li on Nov 13, 2021

* [squash] Staging autotuning v4
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Minjia Zhang <minjiaz@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* add new extra, guard xgboost, cleanup dead files (#268)

* Fix autotuning docs (#1553)

* fix docs

* rewording the goal

* fix typos

* fix typos (#1556)

* fix typos

* fix format

* fix bug (#1557)

* fix bug
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Minjia Zhang <minjiaz@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

Add documentation for bfloat16 (git commit 648f7bfa) (#1516) · b7cc7c8e

Authored by Manuel R. Ciosici on Nov 12, 2021

```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```

02 Oct, 2021 · 1 commit

Fix many typos (#1423) · be789b16

Authored by Alex Hedges on Oct 01, 2021

* Fix typos in docs/

* Fix typos in code comments and output strings

* Fix typos in the code itself

* Fix typos in tests/
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

01 Oct, 2021 · 1 commit

Add assert to ensure we don't skip unsupported grad dtypes (#1418) · 0457bb1c

Authored by Jeff Rasley on Sep 30, 2021

17 Aug, 2021 · 1 commit

Curriculum learning (#1307) · b2b34ae3

Authored by Conglong Li on Aug 16, 2021

Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

30 Jul, 2021 · 1 commit

[Doc] round_robin_gradients (#1261) · 40c381df

Authored by Olatunji Ruwase on Jul 29, 2021

* Fix docstring

* Make screenshots clickable for easier viewing

* Navigation menu in alphabetical order; More clickable screenshots

* Rename 1Cycle doc

* Tweak naming

* Remove no longer used flag

* ZeRO3 Offload release

* Single GPU results

* Rearrange figures

* Single GPU text

* tweak intro

* zero3-offload section

* Add asynchronous i/o docs

* Fix print_per_steps doc

* Document round_robin_gradients

* Tweak description

* Trigger CI

02 Jul, 2021 · 1 commit

contiguous gradients should be set to True by default (#1199) · c9fee821

Authored by Samyam Rajbhandari on Jul 01, 2021

* contiguous gradients should be set to True by default

* Set contiguous gradients to True by default

Features such as reduce_scatter depend on contiguous gradients being True. This is also the preferred default configuration.
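The dependency described here can be expressed as a simple config check. This is a minimal sketch, not DeepSpeed's own validation: `check_zero_config` is a hypothetical function, and the fallback used for a missing `reduce_scatter` key is chosen for the sketch only.

```python
def check_zero_config(zero_config):
    """Reject reduce_scatter without contiguous gradients (hypothetical check)."""
    # Default of True follows this commit's stated intent.
    contiguous = zero_config.get("contiguous_gradients", True)
    # Assumption for this sketch only: a missing key counts as disabled.
    reduce_scatter = zero_config.get("reduce_scatter", False)
    if reduce_scatter and not contiguous:
        raise ValueError("reduce_scatter requires contiguous_gradients to be True")
    return {"contiguous_gradients": contiguous, "reduce_scatter": reduce_scatter}


print(check_zero_config({"stage": 2, "reduce_scatter": True}))
```

With the new default, a config that enables `reduce_scatter` and says nothing about `contiguous_gradients` passes the check, which is the situation the commit is meant to make safe.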

17 Jun, 2021 · 1 commit

[Doc] Fix steps_per_print description (#1163) · fa7921e2

Authored by Olatunji Ruwase on Jun 16, 2021

* Fix docstring

* Make screenshots clickable for easier viewing

* Navigation menu in alphabetical order; More clickable screenshots

* Rename 1Cycle doc

* Tweak naming

* Remove no longer used flag

* ZeRO3 Offload release

* Single GPU results

* Rearrange figures

* Single GPU text

* tweak intro

* zero3-offload section

* Add asynchronous i/o docs

* Fix print_per_steps doc

09 Jun, 2021 · 1 commit

correct cpu_offload deprecation (#1140) · a8d6dfe8

Authored by Stas Bekman on Jun 08, 2021

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

20 May, 2021 · 1 commit

ZeRO stage 1 refresh (#1042) · cfa63f5d

Authored by Jeff Rasley on May 19, 2021

14 May, 2021 · 1 commit

[docs] unused parameter handling (#1060) · 63c5070e

Authored by Olatunji Ruwase on May 13, 2021

* Fix docstring

* Make screenshots clickable for easier viewing

* Navigation menu in alphabetical order; More clickable screenshots

* Rename 1Cycle doc

* Tweak naming

* Remove no longer used flag

* ZeRO3 Offload release

* Single GPU results

* Rearrange figures

* Single GPU text

* tweak intro

* zero3-offload section

* Add asynchronous i/o docs

13 May, 2021 · 2 commits

Improve flops profiler functionality (#1065) · 4544b7d2

Authored by Cheng Li on May 12, 2021

* use the original function's name as the key to old_functions dict

* update profile output format

* print at global rank 0

* add flops calculation in bwd pass using time from ds timers

* improve aggregated profiling out to show all depth

* print samples/second

* update readme and examples

* update docs

* fix typo and reorder printing

* fix format

[docs] Rename train_step_batch_size to train_micro_batch_size_per_gpu (#1066) · 1f82ab78

Authored by William Buchwalter on May 12, 2021
```
* rename train_step_batch_size to train_micro_batch_size_per_gpu

* clarify batch_size related doc
```
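For context on why the per-GPU name is clearer: in DeepSpeed's batch-size settings, the global batch size is the product of the micro batch size per GPU, the gradient accumulation steps, and the number of GPUs. The sketch below only restates that arithmetic; the function and its argument names are illustrative, not a DeepSpeed API.

```python
def effective_train_batch_size(micro_batch_per_gpu, grad_accum_steps, num_gpus):
    """Samples consumed per optimizer step across all GPUs (illustrative helper)."""
    for name, value in (
        ("micro_batch_per_gpu", micro_batch_per_gpu),
        ("grad_accum_steps", grad_accum_steps),
        ("num_gpus", num_gpus),
    ):
        if value < 1:
            raise ValueError(f"{name} must be at least 1, got {value}")
    return micro_batch_per_gpu * grad_accum_steps * num_gpus


# 4 samples per GPU per step, 8 accumulation steps, 16 GPUs.
print(effective_train_batch_size(4, 8, 16))  # 4 * 8 * 16 = 512
```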

27 Apr, 2021 · 1 commit

fix gradient_clipping default (#656) · b7f97061

Authored by Stas Bekman on Apr 26, 2021

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

25 Apr, 2021 · 1 commit

Add find_unused_parameters option to DeepSpeedEngine (#945) · d0b61f18

Authored by hamlet on Apr 25, 2021

* Add find_unused_parameters option

Because unused parameters in modules are sometimes unexpected, add an explicit error message when they occur and an option to avoid the error: https://github.com/microsoft/DeepSpeed/issues/707

* Add find_unused_parameters option

Because unused parameters in modules are sometimes unexpected, add an explicit error message when they occur and an option to avoid the error: https://github.com/microsoft/DeepSpeed/issues/707

* Fix syntax error

* Fix yapf error

* Fix yapf error

* Fix yapf error

* Fix yapf error

* Move stage2 find_unused_parameters to config file

* Add stage2 find_unused_parameters

* Add stage2 find_unused_parameters

* Add stage2_find_unused_parameters option

* Change error msg to reflect zero_optimization config change

* Fix yapf error

* Fix yapf errors

* Change find_unused_parameters option name

* Change find_unused_parameters option name

* Change find_unused_parameters option name

* Change find_unused_parameters option name

* Change find_unused_parameters option name

* Add UnusedParametersModel for test option find_unused_parameters

* Add unit test for stage2 find_unused_parameters

* Add cpu-adam compatible check

* Remove dups import

* Trim spaces

* Fix yapf errors

* Trim spaces

* Add False Positive test check

* Fix find_unused_parameters test

* Trim spaces

* Fix yapf error

23 Apr, 2021 · 2 commits

Asynchronous I/O docs (#1000) · bff4bc72

Authored by Olatunji Ruwase on Apr 22, 2021

* Fix docstring

* Make screenshots clickable for easier viewing

* Navigation menu in alphabetical order; More clickable screenshots

* Rename 1Cycle doc

* Tweak naming

* Remove no longer used flag

* ZeRO3 Offload release

* Single GPU results

* Rearrange figures

* Single GPU text

* tweak intro

* zero3-offload section

* Add asynchronous i/o docs

[doc] add missing pin_memory entry (#999) · ecf2e1bc

Authored by Stas Bekman on Apr 22, 2021

- `offload_param` was missing `pin_memory` 
- also moved the entry in `offload_optimizer` to have it in the same place.
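For orientation, a config that sets `pin_memory` in both offload sections might look like the sketch below. It is an example only: the values are illustrative, and the assumption is that both sections sit under `zero_optimization` with stage 3 enabled.

```python
import json

# Illustrative values; only the placement of "pin_memory" matters here.
zero_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    }
}

print(json.dumps(zero_config, indent=2))
```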

21 Apr, 2021 · 2 commits

1-bit LAMB optimizer (#970) · 67a48aaa

Authored by Conglong Li on Apr 20, 2021

1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed.
Authors: @conglongli, @awan-10, @samyam, Hanlin Tang, Yuxiong He
Paper: https://arxiv.org/abs/2104.06069

Co-authored-by: sdtblck <46172032+sdtblck@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

make bold+italic work without escaping _ (#775) · 835b4c87

Authored by Stas Bekman on Apr 20, 2021

```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```

19 Apr, 2021 · 1 commit

ZeRO-Infinity (#976) · 0d4a54a0

Authored by Jeff Rasley on Apr 18, 2021

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>

15 Apr, 2021 · 1 commit

update lr scheduler doc for doing per step or epoch update (#913) · c83e49f9

Authored by Cheng Li on Apr 14, 2021

* update lr scheduler doc for doing per step or epoch update

* work

* trigger build
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

08 Apr, 2021 · 1 commit

docs (#909) · 31699291

Authored by Stas Bekman on Apr 07, 2021

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

17 Mar, 2021 · 1 commit

1-bit Adam v2 (#817) · 68c8481b

Authored by Conglong Li on Mar 16, 2021

Authors: @awan-10 @conglongli @samyam @jeffra

What's new:

- NCCL-based implementation, which provides better performance and usability than the MPI-based implementation.
- Support for momentum masks for parameters whose gradients stay at zero throughout training.
- Bug fixes (e.g., #813).
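To make the backend choice concrete, the sketch below builds an optimizer config that selects the NCCL backend. It is an assumption-laden illustration: `onebit_adam_config` is a hypothetical helper, the key names (`freeze_step`, `cuda_aware`, `comm_backend_name`) follow the 1-bit Adam tutorial as best recalled, and the numeric values are placeholders rather than recommendations.

```python
import json


def onebit_adam_config(comm_backend_name="nccl", lr=4e-4, freeze_step=400):
    """Build an illustrative 1-bit Adam config section (hypothetical helper)."""
    # The commit message names NCCL and MPI as the two communication backends.
    if comm_backend_name not in ("nccl", "mpi"):
        raise ValueError(f"unknown comm backend: {comm_backend_name!r}")
    return {
        # fp16 is enabled because the change adds a warning for fp32 use.
        "fp16": {"enabled": True},
        "optimizer": {
            "type": "OneBitAdam",
            "params": {
                "lr": lr,                    # placeholder value
                "freeze_step": freeze_step,  # placeholder value
                "cuda_aware": False,
                "comm_backend_name": comm_backend_name,
            },
        },
    }


print(json.dumps(onebit_adam_config(), indent=2))
```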

* NCCL-based 1-bit Adam + Code Refactor for Comm. Backends (#594)

* NCCL based 1-bit Implementation + Refactor to add communication backends (#593)

* add nccl 1-bit optim.

* temporary commit to save stuff.

* Use dist collectives instead of mpi routines.

* remove old code for comm.

* Fix bugs. still does not work.

* modify to test the nccl side code path

* Initial gather impl. Works intra-node.

* Updates to comm. phase 2. nccl comm. passed the tests.

* refactor code to introduce nccl/mpi as backends for onebit adam.

* Refactor updates to test/engine.

* Fix compile/runtime errors.

* simplify support for nccl/mpi backends.

* Add missing file

* Add compression backend in constructor. Revert later.

* modify test with some perf counting.

* Implement a true non-blocking gather for nccl side.

* Revert "Add compression backend in constructor. Revert later."

This reverts commit df8c40d3.

* improve the 1-bit adam test.

* Refactor comm. and compression backend in 1-bit adam.

* Fix the test.

* Fix runtime errors and typos in nccl backend

* fix mpi backend. modify tests.

* modify nccl perf test.

* fix mpi side errors.

* Add an mpi perf test

* Sync DSE.

* Remove old collectives file.

* Undo a typo.

* Graceful failure for torch versions that don't support nccl pt2pt.

* Revert "Merge branch 'master' into staging-1bit-nccl-v2"

This reverts commit 78400850, reversing
changes made to a6dba72a.

* Revert "Revert "Merge branch 'master' into staging-1bit-nccl-v2""

This reverts commit 6dbdd985.

* comm optimization + 1-bit lamb

* Saving/debugging commit.

* finalizing 1-bit lamb

* add momentum mask and chkpt handling for 1-bit adam

* Cleanup and modify nccl test to be runnable with deepspeed launcher.

* Fix format.

* fix formatting again.

* make test runnable without mpi4py

* Add dist.alltoall and dist.allgather instead of custom functions.

* remove debug prints.

* formatting and renaming

* renaming

* add unit test, fix existing tests

* skip unit test when torch < 1.8

* revert 1-bit lamb

* flatten momentum when dimension is more than 1

* add warning message for 1-bit adam under fp32

* improve version check

* add fp32 test

* 1-bit adam doc

* fix file name

* doc fix

* torch 1.8 is released

* doc fix

* fix tests

* update news

* add doc for momentum mask

* fix checkpoint handling, add unit test

* checkpoint handling doc

* doc final cleanup

* bump dates

* update tests

* url change

* doc fix

* fix test

* doc update
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

12 Mar, 2021 · 1 commit

Add optimizers and schedules to RTD and updated the corresponding part in the website (#799) · e0f36ed5

Authored by Cheng Li on Mar 11, 2021

* add optimizers and schedules to rtd

* update ds website and fix links

* add optimizers and schedules to rtd

* update ds website and fix links

* add flops profiler to rtd

* fix
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>

09 Mar, 2021 · 1 commit

ZeRO 3 Offload (#834) · 599258f9

Authored by Samyam Rajbhandari on Mar 08, 2021

* Squash stage3 v1 (#146)
Co-authored-by: Samyam <samyamr@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>

* Fix correctness bug (#147)

* formatting fix (#150)

* stage3 bugfix (API) update and simplified FP16 Z3 tests (#151)

* fp16 Z3 API update and bugfix

* revert debug change

* ZeRO-3 detach and race condition bugfixes (#149)

* trying out ZeRO-3 race condition fix

* CUDA sync instead of stream

* reduction stream sync

* remove commented code

* Fix optimizer state_dict KeyError (#148)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* fix for smaller SGS sizes, ensures each grad is backed by unique tensors (#152)

* Simplifying the logic for getting averaged gradients (#153)

* skip for now

* Z3 Docs redux (#154)

* removing some TODOs and commented code (#155)

* New Z3 defaults (#156)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* formatting

* megatron external params
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>

21 Feb, 2021 · 1 commit

[doc] fix incorrect param name (#773) · e60e92eb

Authored by Stas Bekman on Feb 20, 2021
```
Invalid param name

Thanks.
```

11 Feb, 2021 · 1 commit

Add flops profiler tutorial (#682) · e2dfe0d1

Authored by Cheng Li on Feb 10, 2021

* work on flops profiler tutorial

* update flops profiler tutorial

* add flops profiler tutorial and fix names

* work on flops profiler tutorial

* update flops profiler tutorial

* add flops profiler tutorial and fix names

* fix trailing ws

* fix names

* remove multistep profiling and update docs

* fix cases where functionals and submodules coexist in a parent module, update readme

* fix typo

* always invoke post hook function

* fix module flops sum and update tests

* update tutorial

21 Jan, 2021 · 1 commit

[tutorials] typos (#676) · 7b0bee0b

Authored by Stas Bekman on Jan 20, 2021

```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```

16 Jan, 2021 · 2 commits

doc fix (#651) · 7b07e123

Authored by Stas Bekman on Jan 15, 2021

```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```

Add AdamW to the supported optimizers (#672) · c5e42641

Authored by Stas Bekman on Jan 15, 2021

```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```

31 Oct, 2020 · 1 commit

Add CPUAdam optimizer for zero-offload in deepspeed engine (#484) · f5aa2547

Authored by Reza Yazdani on Oct 30, 2020

* add adamW to CPU-ADAM implementation

* supporting cpu-adam optimizer for zero-offload on deepspeed side

* bump DSE to match cpu-adam updates
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

11 Oct, 2020 · 1 commit

Add DeepSpeed_Adam optimizer (#468) · 23fc48f3

Authored by Olatunji Ruwase on Oct 10, 2020

* Update installation instructions

* Format fix

* ZeRO tutorial

* Format fixes

* ZeRO-Offload

* ZeRO and ZeRO-Offload tutorials

* Update navigation page

* Format fixes

* Add yuxhe feedback

* Fix blog post link

* Fix OneBit-Adam link
Tweak scheduler example

* Fix date link

* Add DeepSpeed_Adam
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

17 Sep, 2020 · 1 commit

Minor doc fixes (#417) · 7d91be97

Authored by Olatunji Ruwase on Sep 16, 2020

* Update installation instructions

* Format fix

* ZeRO tutorial

* Format fixes

* ZeRO-Offload

* ZeRO and ZeRO-Offload tutorials

* Update navigation page

* Format fixes

* Add yuxhe feedback

* Fix blog post link

* Fix OneBit-Adam link
Tweak scheduler example

* Fix date link
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Greenplum / DeepSpeed · last synced about 1 year ago