Commits · 42c1e916f63883cd52928ba5c730f91827abda49 · Greenplum / DeepSpeed

29 Aug, 2023 · 1 commit

feat(activation_checkpointing): add `non_reentrant_checkpoint` to support... · 42c1e916

Committed by Hugh Pu on Aug 29, 2023

feat(activation_checkpointing): add `non_reentrant_checkpoint` to support inputs require no grad (#4118)

* feat: add `non_reentrant_checkpoint`

* feat: add missing output postprocess and change the hook to record leaf forward tensor refs

* fix: make the multi_grad_hook registered after graph construction

* fix: backward compatibility for multi_tensor_hook

* fix: nonlocal reference error of deepspeed_saved_tensors

* fix: reduce repeating hook registration

* test: add test for `activation_checkpointing.checkpointing.non_reentrant_checkpoint`

* Pass correct node size for ZeRO++ (#4085)

* Pass correct node size

* formatting

---------
Co-authored-by: Connor Holmes <development@cmikeh2.me>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* add deepspeed chat arxiv report (#4110)

* add deepspeed chat arxiv report

* add zeroquant v2 and fp

* add selective enhencement

* add ignore for 'Youn' in spell checker

---------
Co-authored-by: yaozhewei <zheweiy@berkeley.edu>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* style: change flake8 detected style missmatch

* test: hack to clone the `test_activation_checkpointing` module for reuse and add regression tests

* doc: explain the introduction of `non_reentrant_checkpoint`

* doc: explain the test of `non_reentrant_checkpoint`

---------
Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
Co-authored-by: Connor Holmes <development@cmikeh2.me>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: yaozhewei <zheweiy@berkeley.edu>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>


26 Aug, 2023 · 1 commit

Fix pipline dataloader when batch elements contain tuple (#565) · c69bd1f7

Committed by hamlet on Aug 26, 2023

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>


25 Aug, 2023 · 8 commits

Fixes timer error referenced in #4212 (#4213) · 0b7a760c
Committed by Björn Plüster on Aug 25, 2023
```
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
```
add meta onDevice support for LLAMA2 (#4147) · 0712e299
Committed by Dino Chen on Aug 25, 2023
```
Co-authored-by: Molly Smith <112220543+molly-smith@users.noreply.github.com>
```
Simplify Gradient Attribute Names (#4214) · f6903190
Committed by Joe Mayer on Aug 24, 2023
```
* name changes

* formatting changes
```

Add MuP optimizers (#2043) · 9647ea79

Committed by Michael Wyatt on Aug 24, 2023

* added paths for mup optimizers

* added tests

* formatting

* Add license, fix missing distributed test, formatting

* Add mpi4py to confirm tests work

* Undo requirements change

* Move to runtime folder

* Rework to match new format

* missing comma

* hidden dim fix

---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>


add ulysses blog index (#4215) · d6c2e6b0

Committed by Conglong Li on Aug 24, 2023

DeepSpeed Ulysses Chinese blog translation (#4210) · 63e17769

Committed by Heyang Qin on Aug 24, 2023

* Chinese translation with Conglong's feedback

* fix format

---------
Co-authored-by: Conglong Li <conglong.li@gmail.com>


Add Japanese blog of DS-Ulysses (#4209) · 3808273c

Committed by Masahiro Tanaka on Aug 24, 2023

* add Japanese blog of DS-Ulysses

* fix fig

---------
Co-authored-by: Conglong Li <conglong.li@gmail.com>


Update README.md (#4211) · c274e512

Committed by Sam Ade Jacobs on Aug 24, 2023

24 Aug, 2023 · 8 commits
- Update Ulyssess (#4205) · 10bef7ac
  Committed by Sam Ade Jacobs on Aug 24, 2023
```
* Update README.md

* Update README.md

* Format fix

---------
Co-authored-by: Logan Adams <loadams@microsoft.com>
```
- DS-Ulysses formating (#4204) · 961827be
  Committed by Sam Ade Jacobs on Aug 23, 2023
```
* fix identation

* fix formatting

---------
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
- [docs] fix pypi badge · 3e82cb64
  Committed by Jeff Rasley on Aug 23, 2023
- Ds ulysses news (#4202) · 5de0662c
  Committed by Sam Ade Jacobs on Aug 23, 2023
- Deepspeed-Ulysses blog (#4201) · 4e5d39fe
  Committed by Sam Ade Jacobs on Aug 23, 2023
- DeepSpeed Ulysses release (#4198) · a855405e
  Committed by Sam Ade Jacobs on Aug 23, 2023
```
Co-authored-by: Masahiro Tanaka <mtanaka@microsoft.com>
```
- Load z3 checkpoints for inference (#4171) · 6df15873
  Committed by Olatunji Ruwase on Aug 23, 2023
```
* Load z3 checkpoints for inference

* PR feedback

* Fix API bugs

* Fix typo
```
- DeepSpeed Ulysses tutorial (#4200) · b5453990
  Committed by Minjia Zhang on Aug 23, 2023
```
* add tutorial file from Minjia.

* fix format.

---------
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
```
23 Aug, 2023 · 4 commits

Fix ZeRO parameter initialization for tensors with `requires_grad=True` (#4138) · 426810a2

Committed by Xuehai Pan on Aug 23, 2023

* Fix ZeRO parameter initialization for tensors with `requires_grad=True`

* Simplify detach logic

---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>


Add unit test to check HF low_cpu_mem_usage_flag (#4184) · 9723a879

Committed by Logan Adams on Aug 22, 2023

* Add unittest to check huggingface low_cpu_mem_usageflag

* change lag to true

* Formatting has changes

* Indentation fix

* Fix chanves

* final format fix

* Accidently dropped pytestmark from other test

* Remove invalid model test config as that was removed.

* Whitespace and PR feedback

* Format and PR feedback means that we can remove the import we added.

* Update tests/unit/inference/test_inference.py

* Update tests/unit/inference/test_inference.py

---------
Co-authored-by: Lok Chand Koppaka <lokoppak@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>


Fix path (#4193) · 6c684e12

Committed by Kuan-Ying Lai on Aug 23, 2023

Fix nv-nightly workflow (#4163) · d9a889d5

Committed by Michael Wyatt on Aug 22, 2023

* Disable nv-nightly workflow since it doesn't work

* Run on PRs to debug

* fix for nv-nightly

* fix

* OOM fix?

* Update nv-nightly.yml

---------
Co-authored-by: Logan Adams <loadams@microsoft.com>


22 Aug, 2023 · 2 commits
- enable autoTP for mpt in huggingface model hub without trust_remote_code (#4062) · 5e16eb2c
  Committed by Wang, Yi on Aug 22, 2023
```
see https://github.com/huggingface/transformers/tree/main/src/transformers/models/mpt

Co-authored-by: Molly Smith <112220543+molly-smith@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```
- Treat empty environment variables as unset in (#4185) · 8fb111c0
  Committed by Logan Adams on Aug 21, 2023
21 Aug, 2023 · 2 commits

do allgather only in shared optimizer states groups (#4167) · 7f3e82fe

Committed by mzl on Aug 21, 2023

* skip all-gather

* add notes

---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>


MP ZeRO++ (#3954) · 7711bdbb

Committed by Heyang Qin on Aug 20, 2023

* zero++ tutorial PR (#3783)

* [Fix] _conv_flops_compute when padding is a str and stride=1 (#3169)

* fix conv_flops_compute when padding is a str when stride=1

* fix error

* change type of paddings to tuple

* fix padding calculation

* apply formatting check

---------
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* fix interpolate flops compute (#3782)

* use `Flops Profiler` to test `model.generate()` (#2515)

* Update profiler.py

* pre-commit run --all-files

* Delete .DS_Store

* Delete .DS_Store

* Delete .DS_Store

---------
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>

* revert PR #3611 (#3786)

* bump to 0.9.6

* ZeRO++ chinese blog (#3793)

* zeropp chinese blog

* try better quality images

* make title larger

* even larger...

* various fix

* center captions

* more fixes

* fix format

* remove staging trigger (#3792)

* DeepSpeed-Triton for Inference (#3748)
Co-authored-by: Stephen Youn <styoun@microsoft.com>
Co-authored-by: Arash Bakhtiari <arash@bakhtiari.org>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Ethan Doe <yidoe@microsoft.com>
Co-authored-by: yidoe <68296935+yidoe@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* ZeRO++ (#3784)
Co-authored-by: HeyangQin <heyangqin@microsoft.com>
Co-authored-by: GuanhuaWang <alexwgh333@gmail.com>
Co-authored-by: cmikeh2 <connorholmes@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>

* adding zero++ to navigation panel of deepspeed.ai (#3796)

* Add ZeRO++ Japanese blog (#3797)

* zeropp chinese blog

* try better quality images

* make title larger

* even larger...

* various fix

* center captions

* more fixes

* fix format

* add ZeRO++ Japanese blog

* add links

---------
Co-authored-by: HeyangQin <heyangqin@microsoft.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>

* Bug Fixes for autotuner and flops profiler (#1880)

* fix autotuner when backward is not called

* fix format

---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Missing strided copy for gated MLP (#3788)
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

* Requires grad checking. (#3789)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* bump to 0.10.0

* Fix Bug in transform.cu (#3534)

* Bug fix

* Fixed formatting error

---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

* bug fix: triton importing error (#3799)
Co-authored-by: Stephen Youn <styoun@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* init commit for mixed precision lora

* fix format

* patch _allgather_params & minor fixes

* make sure initial quantization are finished

* make sure dequantization is finished

* skip quantization for small parameters

* fix format

* remove unused async_op

* lazy load of quantizer kernels

* add mixed precision lora tutorial

* cleanup mics

* cleanup mics

* replace get_accelerator().current_device()

* add kwargs to mics

* fix format

* seperate code and tutorial

* fix _all_gather in zero3

---------
Co-authored-by: Bill Luo <50068224+zhiruiluo@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Guorun <84232793+CaffreyR@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: stephen youn <13525892+stephen-youn@users.noreply.github.com>
Co-authored-by: Stephen Youn <styoun@microsoft.com>
Co-authored-by: Arash Bakhtiari <arash@bakhtiari.org>
Co-authored-by: Ethan Doe <yidoe@microsoft.com>
Co-authored-by: yidoe <68296935+yidoe@users.noreply.github.com>
Co-authored-by: GuanhuaWang <alexwgh333@gmail.com>
Co-authored-by: cmikeh2 <connorholmes@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>
Co-authored-by: Ramya Ramineni <62723901+rraminen@users.noreply.github.com>


19 Aug, 2023 · 3 commits
- bump to 0.10.2 · f036f00c
  Committed by Jeff Rasley on Aug 18, 2023
- pin transformers to last known good commit (#4174) · 46d859a7
  Committed by Michael Wyatt on Aug 18, 2023
- Add DSE branch input to nv-ds-chat (#4173) · a3540f17
  Committed by Lev Kurilenko on Aug 18, 2023
```
* Add DSE branch input to nv-ds-chat

* Use provided DSE branch

* Echo DSE branch
```
17 Aug, 2023 · 4 commits

[CPU][Bugfix] Make uid and addr_port part of SHM name in CCL backend (#4115) · 19e9a7c0
Committed by Ma, Guokai on Aug 17, 2023
```
* distinguish shm name with uid and addr_port

* fix formatting
```

Add DS-Chat CI workflow (#4127) · 64c670ef

Committed by Lev Kurilenko on Aug 16, 2023

* Add DS Chat CI workflow

* Add CRITIC_CKPT_DIR env variable to actions.yml

* Update step 2 opt 125m ckpt dir name

* Update test dir

* Add workflow_dispatch

* Add :

* Add nv-ds-chat badge to main README

* Open GH issue if DS Chat CI fails

* Remove pull_request and merge_group conditions

* Update and test torch version

* Remove PR trigger

---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>


fix badges (#4162) · bd65eeaf

Committed by Michael Wyatt on Aug 16, 2023

Handling for SIGTERM as well (#4160) · 1a295739

Committed by Logan Adams on Aug 16, 2023

16 Aug, 2023 · 3 commits
- Fixes #4151 (#4152) · 740b7805
  Committed by Sam Foreman on Aug 16, 2023
- Return nn.parameter type for weights and biases (#4146) · 341cefd2
  Committed by Molly Smith on Aug 15, 2023
```
* Return nn.parameter type for weights and biases

* whitespace

* Fix bias tensor size
```
- Remove incorrect async-io library checking code. (#4150) · a4523018
  Committed by Logan Adams on Aug 15, 2023
```
* Update library installed checker to use check_cmd

* This code was used for checking if aio was installed but this was refactored and this code was left
```
15 Aug, 2023 · 3 commits

Respect memory pinning config (#4131) · 9d79cfd1
Committed by Olatunji Ruwase on Aug 14, 2023
```
* Respect memory pinning config

* Bug fix
```
Generalize frozen weights unit test (#4140) · 7a282db8
Committed by Olatunji Ruwase on Aug 14, 2023
```
* Fix unit test

* Fix unit test
```

Handle PermissionError in os.chmod Call - Update engine.py (#4139) · 629b2039

Committed by Chris M on Aug 14, 2023

* Update engine.py

This branch includes changes to handle potential exceptions that may occur when attempting to change file permissions using the os.chmod function within the DeepSpeed engine. The specific issue addressed is the PermissionError that may arise when working with certain filesystems or under restricted permissions.

* Change to use logger

* Split permissions out and add unit test

* UnitTest(use DistTestClass) + trailing whitespace

* update unit test

* UT parametrize 1, 2 ,3

* trim white space from unit test

* change to PermissionError

* run pre-commit formats

* Catch FileNotFoundError & PermissionError
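
The behaviour described in this commit message can be sketched roughly as follows. This is an illustration only, not the code from `engine.py`: the helper name `try_make_executable` and the choice of the owner-execute bit are assumptions made for the example. It attempts the permission change and logs a warning, rather than raising, when `os.chmod` fails with `FileNotFoundError` or `PermissionError`.

```python
import logging
import os
import stat

logger = logging.getLogger(__name__)


def try_make_executable(path):
    """Hypothetical helper: add the owner-execute bit to `path`.

    Returns True on success. Logs a warning and returns False when the
    file is missing or the filesystem refuses the permission change.
    """
    try:
        current_mode = os.stat(path).st_mode
        os.chmod(path, current_mode | stat.S_IXUSR)
        return True
    except (FileNotFoundError, PermissionError) as exc:
        # Assumption: a failed chmod should not abort the surrounding
        # operation, so the error is logged instead of propagated.
        logger.warning("Could not change permissions on %s: %s", path, exc)
        return False
```

Catching the two exception types explicitly, rather than a bare `OSError`, keeps unrelated I/O failures visible to the caller.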


11 Aug, 2023 · 1 commit

Update torch1.9 tests to 1.10 to match latest accelerate. (#4126) · ff7d5275

Committed by Logan Adams on Aug 10, 2023

* Fix torch19 tests

* test pip list and --no-build-isolation

* Enable verbosity

* pin to older accelerate version

* Update oldest tested torch to 1.10

* Properly rename directories

* Return PR tests to CI again.

* Remove -vv


Greenplum / DeepSpeed · last synced about 1 year ago
