Commits · 5e16eb2c939707d0d0062a458d77998fccb3afad · Greenplum / DeepSpeed

22 Aug, 2023 2 commits
- enable autoTP for mpt in huggingface model hub without trust_remote_code (#4062) · 5e16eb2c
  Committed by Wang, Yi on Aug 22, 2023
```
see https://github.com/huggingface/transformers/tree/main/src/transformers/models/mpt

Co-authored-by: Molly Smith <112220543+molly-smith@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```
  5e16eb2c
- Treat empty environment variables as unset in (#4185) · 8fb111c0
  Committed by Logan Adams on Aug 21, 2023
  
  8fb111c0
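
The title of #4185 says that empty environment variables are treated as unset. A minimal Python sketch of that idea is shown here; the helper name `get_env` is hypothetical and is not taken from the DeepSpeed code base.

```python
import os


def get_env(name, default=None):
    """Return the variable's value, treating an empty string like a missing variable.

    `get_env` is an illustrative helper, not a DeepSpeed API.
    """
    value = os.environ.get(name)
    # An empty string is falsy, so it falls through to the default,
    # exactly as if the variable had never been exported.
    return value if value else default
```

With this helper, `export FOO=` in a launch script behaves the same as never setting `FOO` at all.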
21 Aug, 2023 2 commits

do allgather only in shared optimizer states groups (#4167) · 7f3e82fe

Committed by mzl on Aug 21, 2023

* skip all-gather

* add notes

---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

7f3e82fe

MP ZeRO++ (#3954) · 7711bdbb

Committed by Heyang Qin on Aug 20, 2023

* zero++ tutorial PR (#3783)

* [Fix] _conv_flops_compute when padding is a str and stride=1 (#3169)

* fix conv_flops_compute when padding is a str when stride=1

* fix error

* change type of paddings to tuple

* fix padding calculation

* apply formatting check

---------
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* fix interpolate flops compute (#3782)

* use `Flops Profiler` to test `model.generate()` (#2515)

* Update profiler.py

* pre-commit run --all-files

* Delete .DS_Store

* Delete .DS_Store

* Delete .DS_Store

---------
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>

* revert PR #3611 (#3786)

* bump to 0.9.6

* ZeRO++ chinese blog (#3793)

* zeropp chinese blog

* try better quality images

* make title larger

* even larger...

* various fix

* center captions

* more fixes

* fix format

* remove staging trigger (#3792)

* DeepSpeed-Triton for Inference (#3748)
Co-authored-by: Stephen Youn <styoun@microsoft.com>
Co-authored-by: Arash Bakhtiari <arash@bakhtiari.org>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Ethan Doe <yidoe@microsoft.com>
Co-authored-by: yidoe <68296935+yidoe@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* ZeRO++ (#3784)
Co-authored-by: HeyangQin <heyangqin@microsoft.com>
Co-authored-by: GuanhuaWang <alexwgh333@gmail.com>
Co-authored-by: cmikeh2 <connorholmes@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>

* adding zero++ to navigation panel of deepspeed.ai (#3796)

* Add ZeRO++ Japanese blog (#3797)

* zeropp chinese blog

* try better quality images

* make title larger

* even larger...

* various fix

* center captions

* more fixes

* fix format

* add ZeRO++ Japanese blog

* add links

---------
Co-authored-by: HeyangQin <heyangqin@microsoft.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>

* Bug Fixes for autotuner and flops profiler (#1880)

* fix autotuner when backward is not called

* fix format

---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Missing strided copy for gated MLP (#3788)
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

* Requires grad checking. (#3789)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* bump to 0.10.0

* Fix Bug in transform.cu (#3534)

* Bug fix

* Fixed formatting error

---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

* bug fix: triton importing error (#3799)
Co-authored-by: Stephen Youn <styoun@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* init commit for mixed precision lora

* fix format

* patch _allgather_params & minor fixes

* make sure initial quantization are finished

* make sure dequantization is finished

* skip quantization for small parameters

* fix format

* remove unused async_op

* lazy load of quantizer kernels

* add mixed precision lora tutorial

* cleanup mics

* cleanup mics

* replace get_accelerator().current_device()

* add kwargs to mics

* fix format

* seperate code and tutorial

* fix _all_gather in zero3

---------
Co-authored-by: Bill Luo <50068224+zhiruiluo@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Guorun <84232793+CaffreyR@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: stephen youn <13525892+stephen-youn@users.noreply.github.com>
Co-authored-by: Stephen Youn <styoun@microsoft.com>
Co-authored-by: Arash Bakhtiari <arash@bakhtiari.org>
Co-authored-by: Ethan Doe <yidoe@microsoft.com>
Co-authored-by: yidoe <68296935+yidoe@users.noreply.github.com>
Co-authored-by: GuanhuaWang <alexwgh333@gmail.com>
Co-authored-by: cmikeh2 <connorholmes@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>
Co-authored-by: Ramya Ramineni <62723901+rraminen@users.noreply.github.com>

7711bdbb

19 Aug, 2023 3 commits
- bump to 0.10.2 · f036f00c
  Committed by Jeff Rasley on Aug 18, 2023
  
  f036f00c
- pin transformers to last known good commit (#4174) · 46d859a7
  Committed by Michael Wyatt on Aug 18, 2023
  
  46d859a7
- Add DSE branch input to nv-ds-chat (#4173) · a3540f17
  Committed by Lev Kurilenko on Aug 18, 2023
```
* Add DSE branch input to nv-ds-chat

* Use provided DSE branch

* Echo DSE branch
```
  a3540f17
17 Aug, 2023 4 commits

[CPU][Bugfix] Make uid and addr_port part of SHM name in CCL backend (#4115) · 19e9a7c0
Committed by Ma, Guokai on Aug 17, 2023
```
* distinguish shm name with uid and addr_port

* fix formatting
```
19e9a7c0

Add DS-Chat CI workflow (#4127) · 64c670ef

Committed by Lev Kurilenko on Aug 16, 2023

* Add DS Chat CI workflow

* Add CRITIC_CKPT_DIR env variable to actions.yml

* Update step 2 opt 125m ckpt dir name

* Update test dir

* Add workflow_dispatch

* Add :

* Add nv-ds-chat badge to main README

* Open GH issue if DS Chat CI fails

* Remove pull_request and merge_group conditions

* Update and test torch version

* Remove PR trigger

---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

64c670ef

fix badges (#4162) · bd65eeaf
Committed by Michael Wyatt on Aug 16, 2023

bd65eeaf
Handling for SIGTERM as well (#4160) · 1a295739
Committed by Logan Adams on Aug 16, 2023

1a295739

16 Aug, 2023 3 commits
- Fixes #4151 (#4152) · 740b7805
  Committed by Sam Foreman on Aug 16, 2023
  
  740b7805
- Return nn.parameter type for weights and biases (#4146) · 341cefd2
  Committed by Molly Smith on Aug 15, 2023
```
* Return nn.parameter type for weights and biases

* whitespace

* Fix bias tensor size
```
  341cefd2
- Remove incorrect async-io library checking code. (#4150) · a4523018
  Committed by Logan Adams on Aug 15, 2023
```
* Update library installed checker to use check_cmd

* This code was used for checking if aio was installed but this was refactored and this code was left
```
  a4523018
15 Aug, 2023 3 commits

Respect memory pinning config (#4131) · 9d79cfd1
Committed by Olatunji Ruwase on Aug 14, 2023
```
* Respect memory pinning config

* Bug fix
```
9d79cfd1
Generalize frozen weights unit test (#4140) · 7a282db8
Committed by Olatunji Ruwase on Aug 14, 2023
```
* Fix unit test

* Fix unit test
```
7a282db8

Handle PermissionError in os.chmod Call - Update engine.py (#4139) · 629b2039

Committed by Chris M on Aug 14, 2023

* Update engine.py

This branch includes changes to handle potential exceptions that may occur when attempting to change file permissions using the os.chmod function within the DeepSpeed engine. The specific issue addressed is the PermissionError that may arise when working with certain filesystems or under restricted permissions.

* Change to use logger

* Split permissions out and add unit test

* UnitTest(use DistTestClass) + trailing whitespace

* update unit test

* UT parametrize 1, 2 ,3

* trim white space from unit test

* change to PermissionError

* run pre-commit formats

* Catch FileNotFoundError & PermissionError

629b2039
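
The message for #4139 describes catching `PermissionError` (and later `FileNotFoundError`) around an `os.chmod` call and reporting it through a logger instead of failing. A minimal sketch of that pattern follows; the helper name `try_chmod` and the log wording are assumptions for illustration, not the code in `engine.py`.

```python
import logging
import os
import stat

logger = logging.getLogger(__name__)


def try_chmod(path, mode):
    """Attempt os.chmod and log a warning instead of raising on expected failures.

    `try_chmod` is a hypothetical helper, not a DeepSpeed function.
    Returns True when the permissions were changed, False otherwise.
    """
    try:
        os.chmod(path, mode)
        return True
    except (FileNotFoundError, PermissionError) as exc:
        # Some filesystems or restricted environments reject chmod;
        # the caller can continue because the file itself is still usable.
        logger.warning("Could not change permissions of %s: %s", path, exc)
        return False
```

A caller might use `try_chmod(script_path, stat.S_IRWXU)` and carry on regardless of the return value.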

11 Aug, 2023 1 commit

Update torch1.9 tests to 1.10 to match latest accelerate. (#4126) · ff7d5275

Committed by Logan Adams on Aug 10, 2023

* Fix torch19 tests

* test pip list and --no-build-isolation

* Enable verbosity

* pin to older accelerate version

* Update oldest tested torch to 1.10

* Properly rename directories

* Return PR tests to CI again.

* Remove -vv

ff7d5275

10 Aug, 2023 3 commits

Update nightly workflows to open an issue if CI fails (#3952) · 0c75f4a3

Committed by Logan Adams on Aug 09, 2023

* Update H100 workflow to open an issue if nightly CI fails

* Test running as not CI

* Add all nightly/switch envvar name

* Test with AMD

* Add way to get url, switch path of template

* Add additional checkout step

* Move actions checkout step

* Try absolute path with github workspace

* Create issue without template/path

* Re-enable and add debug logic

* add if failed()

* More debug

* Try without checkout action uses

* Rename file

* Update variables

* Update issue template

* Confirm removing permissions still work

* Revert "Confirm removing permissions still work"

This reverts commit e7c2915a.

* Re-enable permissions

* Remove PR trigger for AMD MI200 tests

* Revert "Remove PR trigger for AMD MI200 tests"

This reverts commit 5c5c5fd6.

* Test update_existing

* Switch to composite action

* Fix line ending encoding issue

* Switch failure to be a variable

* Test with second workflow

* Format fix

* Switch failure to always

* Switch back to previously working way

* Test permission changes

* Revert "Test permission changes"

This reverts commit e051da75.

* Update existing bugs with newest build failure link

* Remove PR triggers for that were used for testing.

0c75f4a3

Add ops (#4119) · d300517f
Committed by Logan Adams on Aug 09, 2023

d300517f

Fix Issue 4083 (#4084) · 8a8683d3

Committed by Joe Mayer on Aug 09, 2023

* removing bad check

* adding offload check for bf16 optimizer

* grad reduce for extra large param

* check grad_accum exists before converting

---------
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

8a8683d3

09 Aug, 2023 6 commits

enable pipeline checkpoint loading mode (#3629) · 1e0c39c6

Committed by leiwen83 on Aug 09, 2023

On machines with limited CPU RAM, loading a checkpoint at startup may
cause OOM, because all ranks on the same node load the optimizer state
at the same time. For this scenario, checkpoint loading can optionally
be done in a pipelined way.
Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>

1e0c39c6
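
The message for #3629 describes ranks on one node loading optimizer state one after another rather than all at once, so that peak host memory stays bounded. A minimal sketch of that turn-taking idea follows, using threads and `threading.Barrier` as stand-ins for ranks and a process-group barrier. The names `load_in_turns` and `simulate` are hypothetical, and this is not the code from the PR.

```python
import threading


def load_in_turns(local_rank, local_size, load_fn, barrier):
    """Each local rank loads on its own turn; all ranks sync after every turn."""
    result = None
    for turn in range(local_size):
        if turn == local_rank:
            result = load_fn()
        barrier.wait()  # stand-in for a barrier across the ranks of one node
    return result


def simulate(local_size=4):
    """Run the scheme with threads and report peak concurrency and load order."""
    barrier = threading.Barrier(local_size)
    lock = threading.Lock()
    active = 0
    peak = 0
    order = []

    def make_loader(rank):
        def load():
            nonlocal active, peak
            with lock:
                active += 1
                peak = max(peak, active)
                order.append(rank)
            # A real loader would read the optimizer state from disk here.
            with lock:
                active -= 1
            return f"state-for-rank-{rank}"
        return load

    results = [None] * local_size

    def worker(rank):
        results[rank] = load_in_turns(rank, local_size, make_loader(rank), barrier)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(local_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak, order, results
```

The trade-off in this sketch is startup time: loading takes `local_size` turns instead of one, in exchange for only one rank holding a freshly read checkpoint at any moment.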

add deepspeed chat arxiv report (#4110) · 78d985ab

Committed by Conglong Li on Aug 08, 2023

* add deepspeed chat arxiv report

* add zeroquant v2 and fp

* add selective enhencement

* add ignore for 'Youn' in spell checker

---------
Co-authored-by: yaozhewei <zheweiy@berkeley.edu>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

78d985ab

Pass correct node size for ZeRO++ (#4085) · f0463b4d

Committed by Connor Holmes on Aug 08, 2023

* Pass correct node size

* formatting

---------
Co-authored-by: Connor Holmes <development@cmikeh2.me>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

f0463b4d

Disable z3 tracing profiler (#4106) · 977254c1
Committed by Olatunji Ruwase on Aug 08, 2023
```
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
```
977254c1

use correct ckpt path when base_dir not available (#4101) · abe293b4

Committed by Polisetty V R K Jyothendra Varma on Aug 09, 2023

* base_dir may not present all time and results in incorrect path

* Update replace_module.py

* Update config.py

---------
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

abe293b4

set temperature to avoid config validation error (#4107) · 975bcbc0
Committed by Michael Wyatt on Aug 08, 2023

975bcbc0

08 Aug, 2023 2 commits
- add type checker ignore to resolve that pylance can't resolved noqa annotation (#4102) · 57a27b08
  Committed by Earlee on Aug 08, 2023
```
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
```
  57a27b08
- zero_to_fp32 script adds support for tag argument (#4089) · 241ae39a
  Committed by Earlee on Aug 08, 2023
  
  241ae39a
05 Aug, 2023 2 commits

update ut/doc for glm/codegen (#4057) · 85dc854b

Committed by mzl on Aug 05, 2023

* update ut/doc for glm/codegen

* formatting/spacing on docs

* re-order/alphabetize the models

---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>

85dc854b

fix typo: change polciies to policies (#4090) · 4cde5da8
Committed by digger yu on Aug 05, 2023

4cde5da8

04 Aug, 2023 2 commits
- Spread layers more uniformly when using partition_uniform (#4053) · e8318634
  Committed by marcobellagente93 on Aug 03, 2023
```
* update partition_uniform util function

* formatting

---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```
  e8318634
- Fix Stable Diffusion Injection (#4078) · 1ba40989
  Committed by Lev Kurilenko on Aug 03, 2023
```
* Initial commit

* Clean up

* Fix formatting
```
  1ba40989
03 Aug, 2023 1 commit
- unpin datasets in UT (#4079) · a7fe3bcc
  Committed by Michael Wyatt on Aug 02, 2023
  
  a7fe3bcc
01 Aug, 2023 3 commits

Refactor autoTP inference for HE (#4040) · 94c7233a

Committed by Molly Smith on Jul 31, 2023

* Refactor autoTP inference for HE

* Formatting

* Move redundant functions to autotp

* Remove self from loading class

* formatting

* Some gpt2 autotp path fixes

* precommit

94c7233a

fix: remove unnessary `#` punct in the second `sed` command (#4061) · e31b4041
Committed by Hugh Pu on Aug 01, 2023

e31b4041

add reproducible compilation environment (#3943) · f763b93d

Committed by Xie Zejian on Aug 01, 2023

* add reproducible compilation environment

* fix ci

* fix typo for formatting check

* Fix casing for format

---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>

f763b93d

29 Jul, 2023 1 commit
- save_non_zero_checkpoint on first partition group (#3787) · 8a63754b
  Committed by Zhen Zhang on Jul 28, 2023
```
Co-authored-by: Zhen Zhang <zhzhn@amazon.com>
```
  8a63754b
28 Jul, 2023 2 commits

Fix deadlock when SHM based allreduce spin too fast (#4048) · 82c498d9

Committed by Ma, Guokai on Jul 28, 2023

* Fix deadlock when allreduce spin too fast

* Change state to enum to increase readability

---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

82c498d9

Multiple zero stage 3 related fixes (#3886) · 7f90ef4b

Committed by Olatunji Ruwase on Jul 28, 2023

* Option to override module apply

* Removing early partitioning in override

* Unit tests

* Cleanup

* Adapt unit test to succeed

* Handle missed params

* Add accelerate

* Code cleanup

* Add doc

* Add doc

* Add doc

7f90ef4b

Greenplum / DeepSpeed · last synced more than 1 year ago
