- 15 Dec 2020, 1 commit
-
-
Committed by Ammar Ahmad Awan
NCCL-based 1-bit implementation + refactor to add communication backends (#593)
* add NCCL 1-bit optimizer
* temporary commit to save progress
* use dist collectives instead of MPI routines
* remove old communication code
* fix bugs; still does not work
* modify to test the NCCL-side code path
* initial gather implementation; works intra-node
* updates to communication, phase 2; NCCL communication passed the tests
* refactor code to introduce NCCL/MPI as backends for 1-bit Adam
* refactor updates to test/engine
* fix compile/runtime errors
* simplify support for NCCL/MPI backends
* add missing file
* add compression backend in constructor (revert later)
* modify test with some perf counting
* implement a true non-blocking gather on the NCCL side
* Revert "Add compression backend in constructor. Revert later." (reverts commit df8c40d3)
* improve the 1-bit Adam test
* refactor communication and compression backend in 1-bit Adam
* fix the test
* fix runtime errors and typos in the NCCL backend
* fix the MPI backend; modify tests
* modify NCCL perf test
* fix MPI-side errors
* add an MPI perf test
* sync DSE
* remove old collectives file
* undo a typo
* graceful failure for torch versions that don't support NCCL point-to-point
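The compression idea behind 1-bit Adam can be sketched in a few lines: each value is reduced to its sign, plus a single per-tensor scale so the average magnitude survives the round trip. This is a minimal pure-Python illustration of the technique, not DeepSpeed's actual CUDA/NCCL implementation; the function names are hypothetical.

```python
def onebit_compress(values):
    """Compress a list of floats to signs plus a single scale.

    The scale is the mean absolute value, so decompressed entries
    preserve the average magnitude of the original tensor.
    """
    scale = sum(abs(v) for v in values) / len(values)
    signs = [1 if v >= 0 else -1 for v in values]
    return signs, scale


def onebit_decompress(signs, scale):
    """Reconstruct an approximate tensor from signs and scale."""
    return [s * scale for s in signs]


grad = [0.5, -1.5, 2.0, -1.0]
signs, scale = onebit_compress(grad)   # scale = mean(|grad|) = 1.25
approx = onebit_decompress(signs, scale)
```

In the real optimizer the residual between `grad` and `approx` is fed back into the next step's momentum, which is what makes the 1-bit communication converge.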
-
- 10 Dec 2020, 4 commits
-
-
Committed by Jeff Rasley
-
Committed by Jeff Rasley
-
Committed by Jeff Rasley
-
Committed by Jeff Rasley
-
- 09 Dec 2020, 1 commit
-
-
Committed by Shaden Smith
* Switch from deprecated allreduce interface.
* Make pipeline checkpoint files portable.
-
- 08 Dec 2020, 2 commits
-
-
Committed by Stas Bekman
RTX-30 series cards are compute capability 8.6 (compute_86):

```
python -c "import torch; print(torch.cuda.get_device_capability())"
```

This PR adds support for this compute capability. Reference: https://developer.nvidia.com/cuda-gpus

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
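The capability tuple returned by `torch.cuda.get_device_capability()` maps directly onto the arch names used in nvcc `-gencode` flags and `TORCH_CUDA_ARCH_LIST`. A small illustrative helper (not the actual setup.py code) makes the mapping explicit:

```python
def capability_to_arch(capability):
    """Map a (major, minor) compute capability tuple to an nvcc arch name.

    An RTX-30 series card reports (8, 6), which corresponds to the
    compute_86 / sm_86 architecture.
    """
    major, minor = capability
    return f"compute_{major}{minor}"


print(capability_to_arch((8, 6)))  # compute_86
```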
-
Committed by Stas Bekman
-
- 05 Dec 2020, 1 commit
-
-
Committed by Zhun
* 1) Register layout as a buffer of the module so that checkpoints can be saved/loaded; 2) Add a broadcast of the layout at the beginning to ensure different processes have a consistent layout during distributed training.
* Add docstring for the max_seq_length argument in SparseSelfAttention

Co-authored-by: Zhun Liu <zhunliu@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
- 03 Dec 2020, 5 commits
-
-
Committed by Stas Bekman
-
Committed by Jeff Rasley
-
Committed by Stas Bekman
-
Committed by Jeff Rasley
-
Committed by Stas Bekman
* [cifar tutorial] improve readability
-
- 02 Dec 2020, 2 commits
-
-
Committed by Reza Yazdani
* tracking optimizer step in cpu-adam when loading a checkpoint
* add warning/error message for updating the optimizer step count
* resolve build issue
* supporting state update from the Python side
* track step from Python in all cases
* remove comma
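The point of tracking the step count across checkpoint loads is that Adam's bias correction depends on it: resuming at step 0 would distort the first updates. A toy stand-in (names are hypothetical, not DeepSpeed's CPUAdam API) shows the restore-then-continue pattern:

```python
class TinyAdamState:
    """Toy illustration of carrying an optimizer step count across checkpoints."""

    def __init__(self):
        self.step = 0

    def update(self):
        self.step += 1

    def state_dict(self):
        return {"step": self.step}

    def load_state_dict(self, state):
        # Restore the step so subsequent updates continue from the
        # checkpointed count rather than restarting at zero.
        self.step = state["step"]


opt = TinyAdamState()
for _ in range(3):
    opt.update()
ckpt = opt.state_dict()       # {"step": 3}

resumed = TinyAdamState()
resumed.load_state_dict(ckpt)
resumed.update()              # step continues at 4, not 1
```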
-
Committed by Reza Yazdani
* supporting different hidden dimensions
* add support for larger hidden dimensions (greater than 8K)
* remove empty line
* add loop unrolling factor for dropout kernels
* update different kernels based on the reviews

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
- 28 Nov 2020, 1 commit
-
-
Committed by Stas Bekman
This PR:
* fixes a misspelled method name
* `( () )` doesn't read too well until one reads the code and understands that it's not a formatting bug; I proposed to simply say that it's a callable object.
-
- 26 Nov 2020, 4 commits
-
-
Committed by Jeff Rasley
-
Committed by Jeff Rasley
-
Committed by Jeff Rasley
-
Committed by Shaden Smith
-
- 23 Nov 2020, 1 commit
-
-
Committed by Jeff Rasley
-
- 25 Nov 2020, 6 commits
-
-
Committed by Jeff Rasley
-
Committed by Ammar Ahmad Awan
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Committed by Olatunji Ruwase
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Committed by Jeff Rasley
-
Committed by Jeff Rasley
-
Committed by Jeff Rasley
-
- 24 Nov 2020, 1 commit
-
-
Committed by Samyam Rajbhandari
In the absence of a model parallel group, model_parallel_allreduce should not do any reduction. This commit fixes a bug where a model parallel allreduce was performed across the world group when the model parallel group is None.
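The fix described above amounts to a guard clause: return the input untouched when no group exists, rather than falling through to a world-group collective. A minimal sketch, with `allreduce_fn` standing in for the real collective call:

```python
def model_parallel_allreduce(values, group, allreduce_fn):
    """Reduce `values` across the model-parallel group.

    If no model-parallel group exists, return the input unchanged
    instead of falling back to a world-group reduction.
    """
    if group is None:
        return values
    return allreduce_fn(values, group)


# With group=None the values pass through untouched:
out = model_parallel_allreduce([1.0, 2.0], None, lambda v, g: [x * 2 for x in v])
```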
-
- 23 Nov 2020, 1 commit
-
-
Committed by Samyam Rajbhandari
-
- 22 Nov 2020, 1 commit
-
-
Committed by Olatunji Ruwase
-
- 21 Nov 2020, 1 commit
-
-
Committed by Olatunji Ruwase
* Use zero-tensors for missing gradients to avoid size mismatch
* Unit test for unbalanced gradients in ZeRO
* Formatting fixes
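The zero-tensor substitution above keeps partition sizes consistent when some parameters received no gradient on a given rank. A pure-Python sketch of the idea, with lists of floats standing in for tensors (the helper name is hypothetical, not the ZeRO code):

```python
def fill_missing_gradients(params, grads):
    """Replace missing (None) gradients with zeros of matching shape,
    so every parameter contributes a correctly sized gradient to the
    reduction and no size mismatch occurs."""
    return [g if g is not None else [0.0] * len(p)
            for p, g in zip(params, grads)]


params = [[1.0, 2.0], [3.0, 4.0, 5.0]]
grads = [[0.1, 0.2], None]   # second parameter got no gradient
filled = fill_missing_gradients(params, grads)
# the None gradient becomes [0.0, 0.0, 0.0], matching its parameter's size
```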
-
- 20 Nov 2020, 6 commits
-
-
Committed by Jeff Rasley
-
Committed by Ammar Ahmad Awan
* Use AML method to set env vars instead of using mpi4py.

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Committed by Seunghwan Hong
* Add a guard to avoid using `torch.version.cuda` in a no-CUDA environment.
* Fix several typos in setup.py.

Signed-off-by: Seunghwan Hong <seunghwan@scatterlab.co.kr>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
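In a CPU-only PyTorch build, `torch.version.cuda` is None, so any setup code that parses it must guard first. A small sketch of the pattern, using a `SimpleNamespace` to stand in for the real `torch.version` module (the helper name is hypothetical):

```python
from types import SimpleNamespace


def cuda_version_or_none(torch_version):
    """Safely read a CUDA version from a torch.version-like object.

    Returns a (major, minor) tuple, or None when the build has no CUDA.
    """
    cuda = getattr(torch_version, "cuda", None)
    if cuda is None:
        return None
    major, minor = cuda.split(".")[:2]
    return int(major), int(minor)


gpu_build = SimpleNamespace(cuda="11.1")   # CUDA build reports a version string
cpu_build = SimpleNamespace(cuda=None)     # CPU-only build reports None
```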
-
Committed by Jeff Rasley
-
Committed by Jeff Rasley
-
Committed by Jeff Rasley
* zero-1 memory fix
* auto-tune max elems per comm to reduce padding/comm intervals
* clean-up and added previously missing reduction options
* fix testing backend to work with torch 1.7
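The auto-tuning mentioned above exists because each communication interval must be padded to a multiple of the world size, so a poorly chosen max-elements-per-comm wastes bandwidth on padding. A toy search over candidate sizes illustrates the idea (this is an illustration of the trade-off, not the ZeRO-1 implementation):

```python
def pick_max_elems_per_comm(total_elems, world_size, candidates):
    """Pick a per-interval element count that minimizes alignment padding."""

    def padding_for(elems_per_comm):
        pad = 0
        remaining = total_elems
        while remaining > 0:
            chunk = min(elems_per_comm, remaining)
            # Each interval is rounded up to a multiple of world_size.
            pad += (-chunk) % world_size
            remaining -= chunk
        return pad

    return min(candidates, key=padding_for)


# 100 elements on 8 ranks: intervals of 32 pad less than intervals of 30.
best = pick_max_elems_per_comm(100, 8, [30, 32, 64])
```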
-
- 19 Nov 2020, 2 commits
-
-
Committed by Jeff Rasley
-
Committed by Jeff Rasley
-