Commits · 72a30c1eaba0383eba83f2aed006899316ec2b0a · Greenplum / DeepSpeed

19 Apr, 2021 · 2 commits

revert zero-inf change to launcher · 72a30c1e
Committed by Jeff Rasley on Apr 18, 2021

0d4a54a0
Committed by Jeff Rasley on Apr 18, 2021

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>

17 Apr, 2021 · 1 commit
- Fix ZeRO-3 UnboundLocalError (#968) · 2805c393
  Committed by Olatunji Ruwase on Apr 16, 2021
```
* Fix UnboundLocalError

* Get full partition size
```

15 Apr, 2021 · 2 commits

update lr scheduler doc for doing per step or epoch update (#913) · c83e49f9

Committed by Cheng Li on Apr 14, 2021

* update lr scheduler doc for doing per step or epoch update

* work

* trigger build

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

[zero] faster flatten/unflatten (cpp version) (#910) · 8b8ed2a7

Committed by Stas Bekman on Apr 14, 2021

* faster flatten/unflatten with apex

* switch to cpp flatten/unflatten

* style

* better comment

* missing import

* switch to build ops at run time

* fixes

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

14 Apr, 2021 · 3 commits
- [config] turn exponential notation back on for config dump (#955) · c87118b0
  Committed by Stas Bekman on Apr 14, 2021
```
* e-notation for large floats

* handle ints too

* readability

* handle bool

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```
- fix double linear override; spelling (#954) · adac058a
  Committed by Stas Bekman on Apr 14, 2021
- Delete check of pdsh (#941) · e6999ebd
  Committed by Takuya Makino on Apr 14, 2021
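
Note on #955 above: the commit message mentions e-notation for large floats and ints, with separate handling for bools. The sketch below illustrates that idea only. The function name `dump_with_enotation` and the `threshold` of 1e3 are assumptions made for this example and are not taken from the DeepSpeed source.

```python
import json


def dump_with_enotation(value, threshold=1e3):
    """Render a JSON-like value as text, using e-notation for large numbers.

    Hypothetical helper for illustration; not DeepSpeed's implementation.
    """
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        # Large magnitudes are easier to read as 5.000000e+08 than 500000000.
        return f"{value:e}" if abs(value) > threshold else str(value)
    if isinstance(value, dict):
        items = ", ".join(
            f"{json.dumps(str(k))}: {dump_with_enotation(v, threshold)}"
            for k, v in value.items()
        )
        return "{" + items + "}"
    if isinstance(value, (list, tuple)):
        return "[" + ", ".join(dump_with_enotation(v, threshold) for v in value) + "]"
    # Strings and None fall back to the standard JSON encoding.
    return json.dumps(value)
```

Checking `bool` first matters: without it, `True` would be treated as the integer 1.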
08 Apr, 2021 · 6 commits

docs (#909) · 31699291

Committed by Stas Bekman on Apr 07, 2021

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Samyamr/stage 3 skip modules without parameters (#867) · 7b46d11f
Committed by Samyam Rajbhandari on Apr 07, 2021
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```

improved readability + typos (#895) · 5ca86ae4

Committed by Stas Bekman on Apr 07, 2021

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

fix cpu_adam memory leak on deepspeed re-use in the same process (#896) · c79184eb

Committed by Stas Bekman on Apr 07, 2021

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

[zero3] GatheredParameters can now handle a list of params (#884) · 6d94afb5
Committed by Stas Bekman on Apr 07, 2021
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```

Fix for fragmented linear inputs in ZeRO 3 Linear layers where reshap… (#881) · b5f56b2c
Committed by Samyam Rajbhandari on Apr 07, 2021
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```

07 Apr, 2021 · 1 commit
- Add space in help string (#926) · ce14cf1a
  Committed by Takuya Makino on Apr 07, 2021

02 Apr, 2021 · 1 commit

zero.Init() clarification (#880) · 5d721e09

Committed by Stas Bekman on Apr 01, 2021

* zero.Init() clarification

Clarify that `zero.Init()` is a must if `model.half()` can't fit into GPU memory.

This proposal is via @samyam's clarification shared elsewhere.

Thank you.

* style

* add clarity

* style

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

31 Mar, 2021 · 1 commit
- update backward api doc (#903) · 23ff6cb7
  Committed by Jeff Rasley on Mar 30, 2021

27 Mar, 2021 · 3 commits

Fix zero stage2 cpu_offload when some model trainable parameters skipped in training (#861) · 7fcc8911

Committed by hamlet on Mar 27, 2021

* Fix zero stage2 cpu_offload when some model trainable parameters skipped in training, as in https://github.com/microsoft/DeepSpeed/issues/707

As some trainable model parameters are skipped in training, their backward hooks in self.create_reduce_and_remove_grad_hooks() will not run, so they have no norm_for_param_grads.

* Trim space

* Trim space

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
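
Note on #861 above: the message explains that parameters skipped in training never trigger their backward hooks and so have no entry in `norm_for_param_grads`. The sketch below illustrates tolerating those missing entries when combining per-parameter gradient norms. Treating `norm_for_param_grads` as a plain dict keyed by parameter id, and the function name `total_grad_norm`, are assumptions for this example rather than the actual DeepSpeed code.

```python
import math


def total_grad_norm(param_ids, norm_for_param_grads):
    """Combine per-parameter L2 norms, ignoring parameters with no recorded norm.

    Hypothetical helper: `norm_for_param_grads` is assumed to map a parameter id
    to the L2 norm recorded by that parameter's backward hook.
    """
    total_sq = 0.0
    for pid in param_ids:
        # A parameter whose backward hook never ran has no entry; skip it
        # instead of raising KeyError.
        norm = norm_for_param_grads.get(pid)
        if norm is None:
            continue
        total_sq += norm ** 2
    return math.sqrt(total_sq)
```

For example, with parameters 0, 1 and 2 where parameter 1 was skipped, `total_grad_norm([0, 1, 2], {0: 3.0, 2: 4.0})` combines only the two recorded norms.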

save_fp16_model consolidated for zero3 (#893) · 39013dd2
Committed by Stas Bekman on Mar 26, 2021
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```

full fp32 weights reconstruction for zero 2+3 (#892) · 7531c6bf
Committed by Stas Bekman on Mar 26, 2021

25 Mar, 2021 · 1 commit

[debug utils] see_memory_usage fixes (#890) · 7f03282c

Committed by Stas Bekman on Mar 25, 2021

* see_memory_usage fixes

* didn't expect pt-1.2

* fix the order of things

* fix the order of things

18 Mar, 2021 · 1 commit

consistent checkpoint filenaming (#865) · 10c0bea6

Committed by Stas Bekman on Mar 18, 2021

* consistent checkpoint filenaming

* backward compatible rename

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

17 Mar, 2021 · 6 commits

1-bit Adam v2 (#817) · 68c8481b

Committed by Conglong Li on Mar 16, 2021

Authors: @awan-10 @conglongli @samyam @jeffra

What's new:

- NCCL-based implementation, which provides better performance and usability than the MPI-based implementation.
- Support for momentum masks for parameters whose gradients are constantly zero during training.
- Bug fixes (e.g., #813).

* NCCL-based 1-bit Adam + Code Refactor for Comm. Backends (#594)

* NCCL based 1-bit Implementation + Refactor to add communication backends (#593)

* add nccl 1-bit optim.

* temporary commit to save stuff.

* Use dist collectives instead of mpi routines.

* remove old code for comm.

* Fix bugs. still does not work.

* modify to test the nccl side code path

* Initial gather impl. Works intra-node.

* Updates to comm. phase 2. nccl comm. passed the tests.

* refactor code to introduce nccl/mpi as backends for onebit adam.

* Refactor updates to test/engine.

* Fix compile/runtime errors.

* simplify support for nccl/mpi backends.

* Add missing file

* Add compression backend in constructor. Revert later.

* modify test with some perf counting.

* Implement a true non-blocking gather for nccl side.

* Revert "Add compression backend in constructor. Revert later."

This reverts commit df8c40d3.

* improve the 1-bit adam test.

* Refactor comm. and compression backend in 1-bit adam.

* Fix the test.

* Fix runtime errors and typos in nccl backend

* fix mpi backend. modify tests.

* modify nccl perf test.

* fix mpi side errors.

* Add an mpi perf test

* Sync DSE.

* Remove old collectives file.

* Undo a typo.

* Graceful failure for torch versions that don't support nccl pt2pt.

* Revert "Merge branch 'master' into staging-1bit-nccl-v2"

This reverts commit 78400850, reversing
changes made to a6dba72a.

* Revert "Revert "Merge branch 'master' into staging-1bit-nccl-v2""

This reverts commit 6dbdd985.

* comm optimization + 1-bit lamb

* Saving/debugging commit.

* finalizing 1-bit lamb

* add momentum mask and chkpt handling for 1-bit adam

* Cleanup and modify nccl test to be runnable with deepspeed launcher.

* Fix format.

* fix formatting again.

* make test runnable without mpi4py

* Add dist.alltoall and dist.allgather instead of custom functions.

* remove debug prints.

* formatting and renaming

* renaming

* add unit test, fix existing tests

* skip unit test when torch < 1.8

* revert 1-bit lamb

* flatten momentum when dimension is more than 1

* add warning message for 1-bit adam under fp32

* improve version check

* add fp32 test

* 1-bit adam doc

* fix file name

* doc fix

* torch 1.8 is released

* doc fix

* fix tests

* update news

* add doc for momentum mask

* fix checkpoint handling, add unit test

* checkpoint handling doc

* doc final cleanup

* bump dates

* update tests

* url change

* doc fix

* fix test

* doc update

Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Make config objects json serializable (#862) · 7bcd72a2
Committed by Olatunji Ruwase on Mar 16, 2021
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```

Fix ZeRO3 save_checkpoint (#857) · fa87a73a
Committed by Olatunji Ruwase on Mar 16, 2021
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```

Allow args to be optional in deepspeed.initialize (#825) · 871f3048
Committed by Jeff Rasley on Mar 16, 2021

docs: minor spelling tweaks (#858) · 547d1c5f
Committed by brett koonce on Mar 16, 2021

[runner/launch] propagate the error (#854) · 24335d49
Committed by Stas Bekman on Mar 16, 2021
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```

16 Mar, 2021 · 2 commits

ZeRO Stage 2: Clear reduced gradients (#856) · a75d971b

Committed by Olatunji Ruwase on Mar 15, 2021

* Ensure gradients of other partitions are cleared after reduction

* Remove redundant code

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Samyamr/inference hook fix (#851) · 46018859

Committed by Samyam Rajbhandari on Mar 15, 2021

* Fix mis-aligned-grad

When a parameter's size is not divisible by the world size, the partitioned gradients are misaligned due to incorrect padding handling. This PR should fix that.

* Formatting fix

* Adding static_scale test back for Z3, and also changing hidden size to be not divisible by world_size

* also removing alignment from flat fp16 buffers

* Testing for hidden dim alignment

* inference hook fix

* Update stage3.py

* formatting

* [bug-fix] move params to gpu if offload params is turned off

Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
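
Note on #851 above: the "Fix mis-aligned-grad" message describes padding trouble when a parameter's element count is not divisible by the world size. The sketch below shows the arithmetic in generic form: round the per-rank partition size up, then clamp each rank's slice to the real data so padding only ever appears at the tail. The function name `partition_bounds` is hypothetical and this is not the code from `stage3.py`.

```python
def partition_bounds(numel, world_size):
    """Split `numel` elements into `world_size` equally sized, padded partitions.

    Returns (partition_size, padding, bounds), where bounds[rank] is the
    (start, end) slice of real (unpadded) elements owned by that rank.
    Illustrative only; not taken from DeepSpeed.
    """
    # Ceiling division: every rank gets the same number of slots.
    partition_size = -(-numel // world_size)
    # Extra slots needed so the padded buffer divides evenly across ranks.
    padding = partition_size * world_size - numel
    bounds = []
    for rank in range(world_size):
        # Clamp both ends to the real data; anything beyond `numel` is padding.
        start = min(rank * partition_size, numel)
        end = min(start + partition_size, numel)
        bounds.append((start, end))
    return partition_size, padding, bounds
```

With 10 elements on 4 ranks, each rank owns 3 slots and the last rank holds 1 real element plus 2 padding slots.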

13 Mar, 2021 · 1 commit
- Bug fix: Remove client optimizer param_group list item that does not have 'params' (#827) · 458ff028
  Committed by Cheng Li on Mar 12, 2021
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```

12 Mar, 2021 · 3 commits

[WarmupDecayLR] fix log(0) & 1/log(1) bugs (#772) · 18a26f3f

Committed by Stas Bekman on Mar 11, 2021

* fix log(0) & 1/log(1) bugs

* simplify

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
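
Note on #772 above: the two bugs named in the title are easy to reproduce with any logarithmic warmup. The sketch below assumes a warmup factor of the form log(step + 1) / log(warmup_num_steps). That formula, the function name `log_warmup_factor`, and the clamp to a minimum of 2 warmup steps are assumptions for illustration and may differ from the scheduler's actual code.

```python
import math


def log_warmup_factor(step, warmup_num_steps):
    """Return a multiplier in [0, 1] that grows logarithmically during warmup.

    Illustrative sketch of guarding against log(0) and 1/log(1).
    """
    # Guard against 1/log(1): log(1) == 0, so a single warmup step would
    # divide by zero. Clamp to at least 2 steps.
    warmup_num_steps = max(2, warmup_num_steps)
    if step >= warmup_num_steps:
        return 1.0
    # Guard against log(0): shift by one so step 0 evaluates log(1) == 0.
    return math.log(step + 1) / math.log(warmup_num_steps)
```

The learning rate at a given step would then be the base rate scaled by this factor.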

Control ZeRO wall clock timers (#849) · 311795d0

Committed by Olatunji Ruwase on Mar 11, 2021

* Control ZeRO wall clock timers

* Disable more ZeRO3 debug prints

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Add optimizers and schedules to RTD and updated the corresponding part in the website (#799) · e0f36ed5

Committed by Cheng Li on Mar 11, 2021

* add optimizers and schedules to rtd

* update ds website and fix links

* add optimizers and schedules to rtd

* update ds website and fix links

* add flops profiler to rtd

* fix

Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>

11 Mar, 2021 · 2 commits
- less scary overflow notice (#833) · 29853c3e
  Committed by Stas Bekman on Mar 10, 2021
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
- set adamw_mode default true (follows FusedAdam and < 0.3.11 logic) (#844) · dd03cff2
  Committed by Jeff Rasley on Mar 10, 2021

10 Mar, 2021 · 2 commits
- Fix regression in runner (#843) · 2e6692c8
  Committed by Jeff Rasley on Mar 09, 2021
- replace home env with ~ · 49496364
  Committed by Jeff Rasley on Mar 09, 2021

09 Mar, 2021 · 1 commit

ZeRO 3 Offload (#834) · 599258f9

Committed by Samyam Rajbhandari on Mar 08, 2021

* Squash stage3 v1 (#146)
Co-authored-by: Samyam <samyamr@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>

* Fix correctness bug (#147)

* formatting fix (#150)

* stage3 bugfix (API) update and simplified FP16 Z3 tests (#151)

* fp16 Z3 API update and bugfix

* revert debug change

* ZeRO-3 detach and race condition bugfixes (#149)

* trying out ZeRO-3 race condition fix

* CUDA sync instead of stream

* reduction stream sync

* remove commented code

* Fix optimizer state_dict KeyError (#148)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* fix for smaller SGS sizes, ensures each grad is backed by unique tensors (#152)

* Simplifying the logic for getting averaged gradients (#153)

* skip for now

* Z3 Docs redux (#154)

* removing some TODOs and commented code (#155)

* New Z3 defaults (#156)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* formatting

* megatron external params

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>

27 Feb, 2021 · 1 commit
- fixed typo (#802) · db987cf1
  Committed by vfdev on Feb 27, 2021

Greenplum / DeepSpeed · last synced about 1 year ago
