Commits · 2e99f6edf6df018d33be71f2cfe64c12bae3b662 · Greenplum / DeepSpeed

26 Apr, 2023 · 2 commits

[DRAFT] Tentative implementation of MiCS (#2964) · 2e99f6ed

Authored by Zhen Zhang on Apr 25, 2023

* include mics config and optimizer

* change private vars to public vars

so the child class can initialize these vars

* Port the init function from stage3

* adding a model test file for mics

* adapt to get_accelerator api and fp16 group defrag

* WIP: porting mics modification to ms master

* WIP: included gradient all-reduce among replication groups

* WIP: ported hierarchical all gather part

did basic loss test on a simple MLP model

* [Bug fix] using the comm group attached on the param

* torch2.0 support

* remove print

* delegate wait op

* [Bug] fix naming

* adding doc string

* resolving recursive import

* fix formatting, typo and license

* fix license and unit test error

---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-14-191.us-west-2.compute.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-7-70.us-west-2.compute.internal>
Co-authored-by: Zhen Zhang <zhzhn@amazon.com>
Co-authored-by: zhzhn <zhzhn@ip-10-2-57-114.us-west-2.compute.internal>

fixing default communication_data_type for bfloat16_enabled and docs (#3370) · d56268f3

Authored by Alexander Jipa on Apr 25, 2023

Co-authored-by: Alexander Jipa <azzhipa@amazon.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

21 Apr, 2023 · 2 commits
- Fix for dist not being initialized when constructing main config (#3324) · ad168a69
  Authored by Michael Wyatt on Apr 20, 2023
```
* move dist init out of Engine
```
- zero3 checkpoint frozen params (#3205) · dd8df20f
  Authored by Olatunji Ruwase on Apr 20, 2023
```
* zero3 checkpoint frozen params

* Remove debug prints

* Move to cpu

* WIP

* WIP

* WIP

* Cleanup

* Cleanup

* Extend unit test for frozen params

* API fix
```
12 Apr, 2023 · 1 commit

DeepSpeed Chat (#3186) · 47f9f13b

Authored by Olatunji Ruwase on Apr 11, 2023

Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: yaozhewei <zheweiy@berkeley.edu>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
Co-authored-by: Lok Chand Koppaka <lokoppak@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

06 Apr, 2023 · 1 commit
- Update engine.py (#2826) · bcb03531
  Authored by Stas Bekman on Apr 05, 2023
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```
31 Mar, 2023 · 1 commit
- Update DeepSpeed copyright license to Apache 2.0 (#3111) · b361c727
  Authored by Michael Wyatt on Mar 30, 2023
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
30 Mar, 2023 · 1 commit
- Make fp32 default communication data type (#2970) · 261d6370
  Authored by Olatunji Ruwase on Mar 30, 2023
```
* Make fp32 default communication data type

* Fix asserts
```
27 Mar, 2023 · 1 commit
- update formatter version and style settings (#3098) · 91d63e02
  Authored by Jeff Rasley on Mar 27, 2023
24 Mar, 2023 · 2 commits
- Empty ZeRO3 partition cache (#3060) · e80ae088
  Authored by Olatunji Ruwase on Mar 23, 2023
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
- Fix nebula in save_16bit_model issue (#3023) · a78d6b89
  Authored by FreyaRao on Mar 24, 2023
```
Co-authored-by: Qinghuan Rao <qinghuanrao@microsoft.com>
```
22 Mar, 2023 · 1 commit
- Remove bf16 from inference config dtye enum (#3010) · 27e1b02d
  Authored by Molly Smith on Mar 22, 2023
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
15 Mar, 2023 · 2 commits
- Convert model parameters from generator to list. (#3017) · 94f7da26
  Authored by Joe Mayer on Mar 15, 2023
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```
- adding attribute checks for bf opt with zero (#3022) · d7c925e4
  Authored by Joe Mayer on Mar 14, 2023
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```
13 Mar, 2023 · 1 commit
- ckpt: create directories in checkpoint_engine (#2988) · 43d58d99
  Authored by Adam Moody on Mar 13, 2023
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```
10 Mar, 2023 · 1 commit
- [zero] prevent poor configs from running w. zero-offload (#2971) · 457850dc
  Authored by Jeff Rasley on Mar 09, 2023
28 Feb, 2023 · 1 commit
- better eval sampler (#2907) · f1d2a15b
  Authored by Mayank Mishra on Feb 28, 2023
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```
22 Feb, 2023 · 2 commits

Make z3 respect comm dtype (#2807) · 81b4d5db

Authored by Olatunji Ruwase on Feb 22, 2023

* Make z3 respect comm dtype

* Support fp32 comm dtype

* Remove obsolete assert

* Code cleanup

Data efficiency library update (#2866) · 7c99def0

Authored by Conglong Li on Feb 21, 2023

* data efficiency library update

* data efficiency library update

* data efficiency update

* data efficiency update

07 Feb, 2023 · 1 commit

remove outdated comment (#2786) · d323abd8

Authored by Stas Bekman on Feb 06, 2023

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

01 Feb, 2023 · 1 commit

Fix hardcoded instances to fp16 in optimizer creation log messages to the correct dtype. (#2743) · 86477538

Authored by Logan Adams on Jan 31, 2023

* Remove hardcoded instances to fp16 in log messages.

* Add model_dtype to print the correct format

* Respond to PR feedback

---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

31 Jan, 2023 · 1 commit
- Bing/formatting correction (#2764) · 8d3b42c2
  Authored by Bing Xie on Jan 30, 2023
```
* modify engine.py for formatting

* commit formatting changes on engine.py
```
27 Jan, 2023 · 2 commits

[zero] remove misleading dtype log (#2732) · a60e31a7
Authored by Jeff Rasley on Jan 26, 2023
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```

Fix autotuning so that it records Floating Point Operations per second, not microsecond (#2711) · d4bfae41

Authored by Dashiell Stander on Jan 26, 2023

* Fix how autotuning reports TFLOPS so that they are reported in FLOPS per second, not millisecond
Co-authored-by: Nick Sarkauskas <nsarka00@gmail.com>
Co-authored-by: Quentin Anthony <anthony.301@osu.edu>
Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Actually it is microseconds -> seconds
Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Actually it is microseconds -> seconds
Signed-off-by: Dashiell Stander <dstander@protonmail.com>
Signed-off-by: Dashiell Stander <dstander@protonmail.com>
Co-authored-by: Nick Sarkauskas <nsarka00@gmail.com>
Co-authored-by: Quentin Anthony <anthony.301@osu.edu>

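The unit fix described in this commit comes down to converting a latency measured in microseconds into seconds before dividing. The following is a minimal sketch of that arithmetic, not the autotuner's actual code; `flops_per_second` and `to_tflops` are hypothetical names introduced here for illustration.

```python
def flops_per_second(total_flops: float, elapsed_us: float) -> float:
    """Return FLOPS given an operation count and a latency in microseconds.

    Hypothetical helper: the name and signature are assumptions,
    not taken from the DeepSpeed autotuner.
    """
    if elapsed_us <= 0:
        raise ValueError("elapsed_us must be positive")
    elapsed_s = elapsed_us / 1_000_000  # microseconds -> seconds
    return total_flops / elapsed_s


def to_tflops(flops: float) -> float:
    """Express a FLOPS value in TFLOPS (10**12 operations per second)."""
    return flops / 1e12
```

Dividing by the raw microsecond count instead would understate the rate by a factor of one million, which is the kind of mismatch the commit title refers to.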

26 Jan, 2023 · 1 commit

Abstract accelerator (step 3) (#2677) · 98cc35b6

Authored by Ma, Guokai on Jan 26, 2023

* Integrate accelerator abstraction interface into deepspeed/

* Fix error message in fp16/fused_optimizer

* fix error message in fp16/unfused_optimizer.py

* assign get_accelerator().pin_memory() result to input Tensor name

* no need to check cuda and whether nvtx supported

* move try-except into inner most block

* call Event() and Stream() in get_accelerator() for data type

* Make Stream and Event as properties of abstract interface so they can be used as data type in deepspeed

* Apply op_builder backend api change from #2705 from @jeffra

* fix tests where Builder NAME is used

* keep original ...Builder.NAME interface instead of ...Builder().NAME interface

* fix builder closure for installation

* fix randomltd builder

* add comments to clarify create_op_builder and get_op_builder

* fix compatibility with pip install -e
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

25 Jan, 2023 · 1 commit
- fixing optimizer sanity check (#2742) · 4be8df72
  Authored by Joe Mayer on Jan 25, 2023
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```
20 Jan, 2023 · 1 commit

Inference Refactor (replace_with_policy, model_implementations) (#2554) · 867da307

Authored by Ammar Ahmad Awan on Jan 19, 2023

Co-authored-by: Lev Kurilenko <lekurile@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

19 Jan, 2023 · 1 commit

BF16 optimizer for BF16+ZeRO Stage 1 (#2706) · 8d87c89e

Authored by Joe Mayer on Jan 18, 2023

* BF16 optimizer only with ZeRO stage 1.

* Updating to grad accum of fp32 for BF16 ZeRO1 case.
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

18 Jan, 2023 · 1 commit
- non-MoE stage 1 requires CG disabled (#2703) · e4ba7222
  Authored by Jeff Rasley on Jan 17, 2023
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```
17 Dec, 2022 · 1 commit

fixes #2498 (#2603) · 0f0e38c5

Authored by Alexander Jipa on Dec 16, 2022

taking gradient accumulation steps into account for throughput calculation
Co-authored-by: Alexander Jipa <azzhipa@amazon.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

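As a rough illustration of why gradient accumulation matters for a throughput figure: one optimizer step consumes several micro-batches per rank, so counting only a single micro-batch understates the samples processed. The sketch below assumes the timing covers one full optimizer step; `samples_per_second` and its parameters are hypothetical and do not reproduce DeepSpeed's timer code.

```python
def samples_per_second(micro_batch_size: int,
                       grad_accum_steps: int,
                       world_size: int,
                       step_seconds: float) -> float:
    """Throughput of one optimizer step, in samples per second.

    Assumption: step_seconds measures a full optimizer step, i.e. all
    grad_accum_steps micro-batches on every one of world_size ranks.
    """
    if step_seconds <= 0:
        raise ValueError("step_seconds must be positive")
    samples_per_step = micro_batch_size * grad_accum_steps * world_size
    return samples_per_step / step_seconds
```

Leaving out `grad_accum_steps` in this formula would report a value smaller than the real throughput by exactly that factor.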

13 Dec, 2022 · 1 commit
- DeepSpeed Data Efficiency Library (#2585) · ef869377
  Authored by Conglong Li on Dec 12, 2022
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
06 Dec, 2022 · 1 commit

Support fp32 gradaccum for bf16 model (#2566) · 06938835

Authored by Ma, Guokai on Dec 06, 2022

* allow bf16 model with fp32 gradient accumulation datatype

* allow fp32 gradient accumulation and bfloat16 model in amp mode

* alternative fix for grad accumulation type mismatch.  In the case of zero optimizer we should have grad accum type == model data type
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

30 Nov, 2022 · 1 commit

encoded ds config into command line argument when launching child processes in autotuning (#2524) · abe4fc6b

Authored by Cheng Li on Nov 29, 2022

* rollback ds config changes

* fix format

* Fix error when output_file is a relative path without a prefix (#2397)
Co-authored-by: Benjamin Steenhoek <benjaminjsteenhoek@gmail.com>

* fix results and exprs path to use absolute path

* use base64 encoded ds config as cmd arg

* fix format

* remove assert

* write out optimal config after tuning

* fix format

* no need to update ds config path when encoding ds config

* update

* do not use abs path for result and expr dir

* fix conflicts

* fix run mode

* fix format

* fix format
Co-authored-by: Benjamin Steenhoek <benjaminjsteenhoek@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

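Passing a JSON config as a single base64 string on the command line, as this commit describes, avoids shell quoting problems and removes the need for a shared config file path. The following is a minimal round-trip sketch using only the Python standard library; `encode_config` and `decode_config` are hypothetical names, and the actual launcher's argument handling may differ.

```python
import base64
import json


def encode_config(config: dict) -> str:
    """Serialize a config dict to a URL-safe base64 string.

    Hypothetical helper; the result contains no spaces or quotes,
    so it can be passed as one command-line argument.
    """
    raw = json.dumps(config).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_config(arg: str) -> dict:
    """Inverse of encode_config, as a child process might apply it."""
    raw = base64.urlsafe_b64decode(arg.encode("ascii"))
    return json.loads(raw.decode("utf-8"))
```

A parent process would append `encode_config(cfg)` to the child's argument list, and the child would call `decode_config` on the value it receives.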

29 Nov, 2022 · 1 commit
- Report progress at gradient accumulation boundary (#2553) · 340fc0cf
  Authored by ShijieZZZZ on Nov 28, 2022
```
* report progress at gradient accumulation boundary

* format

* format
```
28 Nov, 2022 · 1 commit

Adding Gradient Accumulation Data Type Config (#2512) · 21c28029

Authored by Joe Mayer on Nov 27, 2022

* Adding gradient accumulation dtype config.

* Switching to new DtypeEnum

* Adding standalone check function, and unit tests

* Variable disambiguation

* Adding checks for unsupported states.

* Updating for PR comments.

* Reorganizing unit test.
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

11 Nov, 2022 · 1 commit
- Make bf16_optimizer work for non pipeline (#2470) · ee39187d
  Authored by Olatunji Ruwase on Nov 10, 2022
25 Oct, 2022 · 1 commit
- Fix Bug #2319 (#2438) · 7d113633
  Authored by Joe Mayer on Oct 24, 2022
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
22 Oct, 2022 · 1 commit

parallelize writing of layer checkpoint files across data parallel instances (#1419) · b8fb9c3f

Authored by Adam Moody on Oct 21, 2022

* parallelize layer checkpoints across data parallel groups

* use partition_uniform to determine start/end index values

* formatting fix

* config: add option for parallel write of layer checkpoints in pipeline stage

* yapf fixes

* enable parallel layer write according to config param

* avoid extraneous makedir when rank 0 writes all layers
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

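The idea of spreading layer checkpoint writes across data-parallel ranks can be pictured as giving each rank a contiguous start/end range of layer indices. The sketch below is a standalone illustration of uniform partitioning; `uniform_bounds` and `layers_for_rank` are hypothetical names, and DeepSpeed's own `partition_uniform` may distribute the remainder differently.

```python
def uniform_bounds(num_items: int, num_parts: int) -> list[int]:
    """Return num_parts + 1 boundaries splitting num_items as evenly as possible.

    Hypothetical helper: the first (num_items % num_parts) parts
    receive one extra item each.
    """
    if num_parts <= 0:
        raise ValueError("num_parts must be positive")
    base, extra = divmod(num_items, num_parts)
    bounds = [0]
    for part in range(num_parts):
        bounds.append(bounds[-1] + base + (1 if part < extra else 0))
    return bounds


def layers_for_rank(num_layers: int, dp_world_size: int, dp_rank: int) -> range:
    """Layer indices that a given data-parallel rank would write."""
    bounds = uniform_bounds(num_layers, dp_world_size)
    return range(bounds[dp_rank], bounds[dp_rank + 1])
```

With 10 layers and 4 ranks, this sketch assigns ranks 0 and 1 three layers each and ranks 2 and 3 two layers each, so every layer is written exactly once.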

18 Oct, 2022 · 2 commits

Universal checkpoint for zero stage 1 (#2284) · 799120e7

Authored by Olatunji Ruwase on Oct 18, 2022

* Refactor universal checkpointing and tensor fragments

* Formatting

* Support zero stage1; Expand TP dim

* Remove debug prints

* Detect sharded optimizer state

* Format fixes

* Encode reshaping guide

* More symbolic constants
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

Fixing bug 2361 (#2410) · 906b4a02

Authored by Joe Mayer on Oct 17, 2022

* fixing bug 2361

* adding pytest for config initialization

* changing expected output to FusedAdam

* remove print statement

* running yapf on modified files

* running pre-commit formatting
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
