Commits · v0.9.4 · Greenplum / DeepSpeed

10 Jun, 2023 3 commits
- Update Dockerfile with newer cuda and torch. (#3716) · a65f6b9e
  Authored by Logan Adams on Jun 09, 2023
```
* Add non-interactive prompt, causing issues for some users

* Update pytorch version too
```
- single node pdsh sigkill (#3730) · 26b3e732
  Authored by Abhilash Majumder on Jun 10, 2023
```
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
```
- [Bugfix][CPU] Remove C++ version in CPU OpBuilder (#3643) · 8bfbb0e3
  Authored by Ma, Guokai on Jun 10, 2023
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
09 Jun, 2023 3 commits

Increase tensor creator coverage (#3684) · 046afced

Authored by Olatunji Ruwase on Jun 08, 2023

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

Fix typo in name of hybrid engine function (#3704) · fc8e5c88
Authored by Logan Adams on Jun 08, 2023
```
* Fix typo in name of hybrid engine function

* Fix
```

zero3 performance optimizations (#3622) · 0977106a

Authored by hablb on Jun 08, 2023

* Remove dead code

params_already_reduced is not used

* Prevent evaluation of debug strings

Debug strings are evaluated even when logging is disabled

* Use contiguous gradients tensor reduce scatter between ranks

Use allreduce instead of reduce scatter. Lower CPU overhead.

* move overflow tracker to optimizer.step

Don't check overflow in gradients for every bucket.
Do overflow check once on grad flat buffer just before optimizer step

---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
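The "Prevent evaluation of debug strings" item in the zero3 commit above describes a general Python logging pitfall. A minimal sketch of that pitfall follows, using only the standard `logging` module. It is not the actual DeepSpeed change: the logger name, the `ExpensiveSummary` class, and the three helper functions are hypothetical and exist only to show when the message string gets built.

```python
import logging

logger = logging.getLogger("zero3_sketch")  # hypothetical logger name
logger.setLevel(logging.WARNING)            # DEBUG output is disabled


class ExpensiveSummary:
    """Hypothetical stand-in for a costly debug description."""

    def __init__(self):
        self.calls = 0

    def __str__(self):
        self.calls += 1  # count how often the string is actually built
        return "param summary"


def eager_debug(summary):
    # The f-string is formatted before logger.debug() can check the level,
    # so the cost is paid even though nothing is logged.
    logger.debug(f"reducing {summary}")


def lazy_debug(summary):
    # %-style arguments are formatted only if the record is emitted.
    logger.debug("reducing %s", summary)


def guarded_debug(summary):
    # An explicit level check also skips the formatting work.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug(f"reducing {summary}")
```

With DEBUG disabled, only `eager_debug` calls `__str__`. The lazy and guarded variants leave the counter untouched.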

08 Jun, 2023 6 commits

DeepSpeed overview in Japanese (#3709) · df425097
Authored by Conglong Li on Jun 07, 2023
```
* DeepSpeed overview in Japanese

* DeepSpeed overview in Japanese
```

Small tweak on cuda version mismatch documentation (#3706) · d414678d

Authored by john li on Jun 07, 2023

* Small tweak on cuda version mismatch documentation

* clarify minor versions should also match

---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

Fix unit test typo in tests/unit/ops/transformer/inference (#3697) · fb2b4ab1

Authored by Michael Wyatt on Jun 07, 2023

* fix typo and missing epsilon value

* Touch file to re-build

* revert changes

* Touch file to re-build

* Format

---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>

change partititon_name to partition_name (#3700) · c5edc91e
Authored by digger yu on Jun 08, 2023
```
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
```

Fix gpt-j inference issue (#3639) · 34a9fbf1

Authored by Reza Yazdani on Jun 07, 2023

* fix gpt-j inference issue for mlp_gemm_func call

* bring back the gpt-j inference-test

* fix formatting

* fix the neox and pythia injection issue

Revert "fix typo name (#3689)" (#3702) · 7e59ef12
Authored by Logan Adams on Jun 07, 2023
```
This reverts commit f2f5f21b.
```

07 Jun, 2023 5 commits

fix typo name (#3689) · f2f5f21b

Authored by tensor-tang on Jun 07, 2023

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

Fix incorrectly formatted f string (#3698) · d8aaa581
Authored by Logan Adams on Jun 06, 2023
```
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
```

Correct world_size/backend for mpi (#3694) · c17313fb
Authored by Abhilash Majumder on Jun 07, 2023
```
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
```

Fix local rank mismatch for heterogeneous nodes (#3409) · b7f463dd

Authored by Byungsoo Oh on Jun 07, 2023

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

non-JIT build fix on ROCm (#3638) · 4cd0a003

Authored by Ramya Ramineni on Jun 06, 2023

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

06 Jun, 2023 3 commits
- Update README to add ICS'23 paper (#3687) · 2d737edd
  Authored by Siddharth Singh on Jun 06, 2023
- Use logger in accelerator (#3682) · e5fe5f65
  Authored by Olatunji Ruwase on Jun 05, 2023
```
* Use logger in accelerator

* Handle pre-build cases

* Explain possible import failure
```
- fix some typo (#3675) · 3fb3cfdc
  Authored by digger yu on Jun 06, 2023
```
* fix typo deepspeed/runtime

* fix some typo

---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```
05 Jun, 2023 1 commit

[MiCS] [Fix] saving and loading model checkpoint logic for MiCS sharding (#3440) · c88af214

Authored by Zhen Zhang on Jun 04, 2023

* fix mics save checkpoint hanging

* MiCS load_checkpoint

* copyright

* fix for torch-1.9.0

all_reduce_coalesced api does not support nccl backend

* Naming alignment

* adding more test conditions for mics shard size

* test with different shard sizes

* adding assertion for better error msg

---------
Co-authored-by: Zhen Zhang <zhzhn@amazon.com>

03 Jun, 2023 3 commits

bump to 0.9.4 · f483c034
Authored by Jeff Rasley on Jun 02, 2023

Refactor check_enabled root validator in DeepSpeedMonitorConfig (#3616) · 4559aa9b

Authored by Buğra on Jun 02, 2023

* Refactor check_enabled root validator in DeepSpeedMonitorConfig

* formatting

* formatting

---------
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>

fix typo deepspeed/runtime (#3663) · 5d14afd2
Authored by digger yu on Jun 03, 2023
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```

02 Jun, 2023 5 commits

flops_profiler: add option recompute_fwd_factor for the case of activation recompute (#3362) · 460bec46

Authored by 郭叶军 on Jun 02, 2023

When activation checkpointing is enabled, most of the forward pass is recomputed,
so the FLOPS calculation should be updated with recompute_fwd_factor=1.0.

I couldn't find a way to pass the option from the model script to the deepspeed engine,
so the option is added directly to flops_profiler.
Co-authored-by: Cheng Li <pistasable@gmail.com>
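To illustrate why recomputation changes the FLOPS figure in the flops_profiler commit above, here is a rough arithmetic sketch. The commit message does not say how the profiler combines the terms, so everything in it is an assumption: the function name, the `bwd_to_fwd_ratio` parameter, and the rule of thumb that a backward pass costs about twice the forward pass.

```python
def flops_per_step(fwd_flops, recompute_fwd_factor=0.0, bwd_to_fwd_ratio=2.0):
    """Estimate training FLOPs for one step (illustrative only).

    fwd_flops            -- FLOPs of one forward pass
    recompute_fwd_factor -- fraction of the forward pass that is recomputed
                            during backward (1.0 = full recompute)
    bwd_to_fwd_ratio     -- assumed backward cost relative to forward
    """
    forward = fwd_flops
    recompute = fwd_flops * recompute_fwd_factor
    backward = fwd_flops * bwd_to_fwd_ratio
    return forward + recompute + backward


# Without recomputation: 1x forward + 2x backward = 3x forward.
baseline = flops_per_step(100.0)
# With full recomputation the forward pass is counted a second time.
with_recompute = flops_per_step(100.0, recompute_fwd_factor=1.0)
```

Under these assumptions, setting the factor to 1.0 raises the per-step estimate from three to four forward-pass equivalents.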

fix typo with deepspeed/ (#3547) · cd4e473e

Authored by digger yu on Jun 02, 2023

* fix spelling error with deepspeed/runtime/

* fix typo docs/

* fix typo in comments with deepspeed/

* fix typo deepspeed/

* Update constants.py

Remove the space after nebula

---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

allow dict datatype for checkpoints (#3007) · da8f4e01
Authored by Michael Wyatt on Jun 01, 2023
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```

Fix RuntimeError when using ZeRO Stage3 with mpu: #3564 (#3565) · f5dde36c
Authored by Haodong Lyu on Jun 02, 2023
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```

deepspeed/comm/comm.py: fix typo of warning message (#3636) · 3b299997
Authored by 郭叶军 on Jun 02, 2023
```
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
```

01 Jun, 2023 3 commits

Typo Correction (#3621) · e02b8d0b

Authored by Micah Zoltu on Jun 01, 2023

Code (in this context) is a mass noun, and thus has no plural form.
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

Update megatron.md (#3641) · 8f459c50
Authored by Will Jessup on May 31, 2023
```
grammar fix.
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
```

Skip tests on docs-only changes (#3651) · 8b8c7031
Authored by Michael Wyatt on May 31, 2023
```
* skip test for docs-only changes

* add missing skip to blog changes
```

31 May, 2023 3 commits

Add Ascend NPU accelerator support (#3595) · f3c8eaca

Authored by CurryRice233 on May 31, 2023

* add Ascend NPU accelerator support

* clean code

---------
Co-authored-by: jializheng <jializheng@huawei.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

stage3.py: do not scale if gradient_predivide_factor is 1.0 (#3630) · 52907a66

Authored by 郭叶军 on May 31, 2023

this change also aligns with the logic before reduce_scatter_coalesced
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
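The rule named in the stage3.py commit title above (skip scaling when the factor is 1.0) can be sketched in plain Python. This is not the stage3.py code. It assumes that "scaling" means dividing each gradient by `gradient_predivide_factor` before reduction, and the function name and the list-of-floats gradient representation are hypothetical.

```python
def predivide(grads, gradient_predivide_factor):
    """Divide gradients by the predivide factor, skipping the no-op case."""
    if gradient_predivide_factor == 1.0:
        # Dividing by 1.0 changes nothing, so return the input untouched
        # and avoid the extra pass over the data.
        return grads
    return [g / gradient_predivide_factor for g in grads]


grads = [2.0, 4.0, 8.0]
unchanged = predivide(grads, 1.0)  # same list object, no work done
halved = predivide(grads, 2.0)     # new list with scaled values
```

Returning the input object itself in the no-op case avoids both the arithmetic and the allocation of a new buffer.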

AISC launcher fixes (#3637) · 49a73549

Authored by Jeff Rasley on May 30, 2023

* tmp remove launcher args

* add exclude list for env variables on aisc

* add comment

27 May, 2023 1 commit

Align InferenceEngine to store ms in _model_times (#3501) · d755b9d6

Authored by Danny Semiat on May 27, 2023

* Align InferenceEngine to store ms in _model_times

   When using cuda_events, the measured model time is stored in ms.
   When not using cuda_events, the measured model time was stored in seconds.
   This commit fixes the units and aligns them to store ms, the same as elapsed() function.
   This was observed when running the following pytest:
   unit/inference/test_model_profiling.py::TestModelProfiling::test[False-True-roberta-base-fill-mask]

   Returned values were:
     count=0 e2e_t=895.174312 model_t=0.8529715538024902
     count=1 e2e_t=7.500252 model_t=0.0041310787200927734
     count=2 e2e_t=3.887346 model_t=0.0018568038940429688
     count=3 e2e_t=3.577845 model_t=0.0016334056854248047
     count=4 e2e_t=3.43976 model_t=0.0016703605651855469
     count=5 e2e_t=3.310903 model_t=0.0016107559204101562
     count=6 e2e_t=3.299556 model_t=0.001603841781616211
     count=7 e2e_t=3.605722 model_t=0.0015969276428222656
     count=8 e2e_t=3.273741 model_t=0.0015516281127929688
     count=9 e2e_t=3.46306 model_t=0.0016617774963378906

   The unit mismatch is visible here: model_t is on the order of 10^-3 relative to e2e_t

* Update engine.py

---------
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
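The unit fix described in the InferenceEngine commit above amounts to converting wall-clock seconds to milliseconds before storing them. A minimal sketch under that reading follows. The helper names are hypothetical and this is not the InferenceEngine code; it only shows the conversion, using one of the `model_t` values quoted in the commit message.

```python
import time


def seconds_to_ms(seconds):
    """Convert a duration in seconds to milliseconds."""
    return seconds * 1000.0


def timed_ms(fn, *args, **kwargs):
    """Call fn and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = seconds_to_ms(time.perf_counter() - start)
    return result, elapsed_ms


# model_t for count=3 in the commit message was 0.0016334056854248047,
# which reads as seconds; in ms it is comparable to e2e_t=3.577845.
model_t_ms = seconds_to_ms(0.0016334056854248047)
```

After conversion, that sample is roughly 1.63 ms, on the same scale as the end-to-end time reported beside it.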

26 May, 2023 3 commits
- Expose Consecutive Hysteresis to Users (#3553) · 0411a9f8
  Authored by Quentin Anthony on May 25, 2023
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
- DS init should not broadcast or move zero.Init models (#3611) · d39c311f
  Authored by Olatunji Ruwase on May 25, 2023
- bug fix (#3609) · 736bf185
  Authored by Conglong Li on May 25, 2023
25 May, 2023 1 commit
- Fix op_builder against PyTorch nightly (#3596) · 6622776c
  Authored by Nikita Shulga on May 24, 2023

Greenplum / DeepSpeed · last synced 11 months ago
