1. 26 Aug 2023 (1 commit)
  2. 25 Aug 2023 (1 commit)
  3. 24 Aug 2023 (1 commit)
  4. 27 Jul 2023 (1 commit)
  5. 26 Jul 2023 (1 commit)
  6. 20 Jul 2023 (1 commit)
  7. 15 Jul 2023 (1 commit)
  8. 09 May 2023 (1 commit)
  9. 03 May 2023 (1 commit)
  10. 02 May 2023 (1 commit)
  11. 30 Apr 2023 (1 commit)
  12. 21 Apr 2023 (1 commit)
  13. 31 Mar 2023 (1 commit)
  14. 27 Mar 2023 (1 commit)
  15. 24 Mar 2023 (1 commit)
  16. 26 Jan 2023 (1 commit)
    • Abstract accelerator (step 3) (#2677) · 98cc35b6
      Committed by Ma, Guokai
      * Integrate accelerator abstraction interface into deepspeed/
      
      * Fix error message in fp16/fused_optimizer
      
      * fix error message in fp16/unfused_optimizer.py
      
      * assign get_accelerator().pin_memory() result to input Tensor name
      
      * no need to check cuda or whether nvtx is supported
      
      * move try-except into inner most block
      
      * call Event() and Stream() in get_accelerator() for data type
      
      * Make Stream and Event properties of the abstract interface so they can be used as data types in deepspeed
      
      * Apply op_builder backend api change from #2705 from @jeffra
      
      * fix tests where Builder NAME is used
      
      * keep original ...Builder.NAME interface instead of ...Builder().NAME interface
      
      * fix builder closure for installation
      
      * fix randomltd builder
      
      * add comments to clarify create_op_builder and get_op_builder
      
      * fix compatibility with pip install -e
      Co-authored-by: Cheng Li <pistasable@gmail.com>
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
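      The pattern this commit describes (a single `get_accelerator()` entry point whose `Stream` and `Event` are properties, so call sites can both instantiate them and use them as data types) can be sketched roughly like this. This is a pure-Python stand-in: everything except the `get_accelerator()` and `pin_memory()` names is illustrative, not DeepSpeed's actual API.

      ```python
      from abc import ABC, abstractmethod

      class Accelerator(ABC):
          """Device-neutral interface; concrete backends (CUDA, CPU, ...) implement it."""

          @property
          @abstractmethod
          def Stream(self):
              """Backend stream class, exposed as a property so it works as a data type."""

          @property
          @abstractmethod
          def Event(self):
              """Backend event class, likewise usable as a data type."""

          @abstractmethod
          def pin_memory(self, tensor):
              """Return a pinned copy; callers rebind the result to the input name."""

      class _CpuStream:
          def synchronize(self):
              pass

      class _CpuEvent:
          def record(self):
              pass

      class CpuAccelerator(Accelerator):
          @property
          def Stream(self):
              return _CpuStream

          @property
          def Event(self):
              return _CpuEvent

          def pin_memory(self, tensor):
              return tensor  # no-op on a CPU backend

      _accelerator = CpuAccelerator()

      def get_accelerator():
          return _accelerator

      # Usage mirroring the commit: the backend, not the call site, decides the type,
      # and the pin_memory() result is assigned back to the input tensor's name.
      stream = get_accelerator().Stream()
      event = get_accelerator().Event()
      buf = [0.0] * 4
      buf = get_accelerator().pin_memory(buf)
      ```

      Exposing `Stream`/`Event` as properties rather than free functions means call sites never import a backend-specific module, which is what lets the same engine code run on CUDA and non-CUDA backends.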
  17. 17 Dec 2022 (1 commit)
  18. 13 Dec 2022 (1 commit)
  19. 22 Oct 2022 (1 commit)
  20. 30 Jul 2022 (1 commit)
  21. 28 Jul 2022 (1 commit)
    • Trajepl/nebula ckpt engine (#2085) · e669aaf5
      Committed by trajep
      * enable checkpoint engine
      
      * separated nebula config
      
      * add __init__.py for nebula importing
      
      * linter fix
      
      * fix: ds_config is None
      
      * fix: ds config
      
      * fix: get sd loader fix
      
      * align the API with torch raw code
      
      * linter fix
      
      * remove duplicate tag params
      
      * make checkpoint_engine a required arg
      
      * fix args
      
      * extract parameters out to config
      
      * fix: load state dict
      
      * separate load engine
      
      * linter fix
      
      * extract checkpoint engine into an abstract class
      
      * linter fix
      
      * construct function args fix
      
      * add docs for dev/customers
      
      * linter fix
      
      * remove load engine
      
      * print->log_dist
      
      * linter fix
      
      * add tag flag to distinguish the loading order
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
      Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
      Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
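      The abstract checkpoint engine this commit extracts can be sketched roughly as follows. This is an in-memory toy under assumptions: the `save`/`load`/`commit` method names and the tag-based commit step are an illustration of the pattern, not necessarily DeepSpeed's exact interface.

      ```python
      from abc import ABC, abstractmethod

      class CheckpointEngine(ABC):
          """Abstract checkpoint engine; concrete engines (torch-native, Nebula, ...) subclass it."""

          @abstractmethod
          def save(self, state_dict, path):
              """Persist one shard's state dict under `path`."""

          @abstractmethod
          def load(self, path, map_location=None):
              """Return the state dict previously saved under `path`."""

          @abstractmethod
          def commit(self, tag):
              """Mark every file saved under `tag` as one complete, loadable checkpoint."""

      class InMemoryEngine(CheckpointEngine):
          """Toy engine that keeps checkpoints in a dict instead of on disk."""

          def __init__(self):
              self._store = {}
              self.committed = []

          def save(self, state_dict, path):
              self._store[path] = dict(state_dict)

          def load(self, path, map_location=None):
              return self._store[path]

          def commit(self, tag):
              self.committed.append(tag)
              return True

      # Usage: the engine is a required constructor arg for the save/load code,
      # and the commit tag distinguishes complete checkpoints at load time.
      engine = InMemoryEngine()
      engine.save({"step": 10}, "ckpt/mp_rank_00.pt")
      engine.commit("global_step10")
      ```

      Keeping the engine abstract is what lets an asynchronous backend such as Nebula be swapped in without touching the engine's save/load call sites.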
  22. 26 Jul 2022 (2 commits)
  23. 21 Jun 2022 (1 commit)
  24. 16 Jun 2022 (1 commit)
  25. 11 Jun 2022 (1 commit)
  26. 12 May 2022 (1 commit)
  27. 10 May 2022 (1 commit)
  28. 04 May 2022 (1 commit)
  29. 27 Apr 2022 (1 commit)
  30. 20 Apr 2022 (1 commit)
    • bf16+pipeline parallelism (#1801) · 56c52238
      Committed by Olatunji Ruwase
      * bf16 updates
      
      * Got bf16 working
      
      * fp32 reduction; flattened tensors
      
      * bf16+zero_stage_1 first cut
      
      * finish zero_stage 1 sharding
      
      * Matching fp16 with debugging codes
      
      * Matching loss with fp16
      
      * Fix gradient clipping
      
      * bf16 gradient clipping fix
      bf16 checkpoint save/load
      
      * Unscale grad norm
      
      * Fix grad norm scaling
      
      * Enable loading fp16_zero_1 into bf16_zero_1 engine and vice versa
      
      * Fix clip_grad key error
      
      * Reduce tied weight gradients
      
      * Fix grad norm for moe
      
      * Reduce specified gradients
      
      * Use O(n) instead of O(n^2)
      
      * Remove optimizer restriction for bf16
      
      * Link bf16 & fp32 params
      
      * Clip gradients of last stage tied weights
      
      * Simplify tied weights reduction logic
      
      * Also clip all tp rank parameters
      
      * lp to hp mapping
      
      * Link lp/hp/optim state; Refresh links after checkpoint load
      
      * Remove debug print
      
      * Remove debug print
      
      * Simplify zero_grad logic
      
      * fp32 accessors
      
      * Fix update bug
      Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
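      The master-weight idea behind the "Link bf16 & fp32 params" and "lp to hp mapping" bullets can be sketched in plain Python. The rounding helper below is a crude stand-in for bf16 precision loss, and all names (`to_bf16`, `Bf16Optimizer`, `lp`, `hp`) are illustrative, not DeepSpeed's.

      ```python
      def to_bf16(x):
          # Crude stand-in for bf16 rounding: keep only 3 significant digits.
          return float(f"{x:.3g}")

      class Bf16Optimizer:
          """Keep a full-precision 'hp' master copy linked to each low-precision 'lp' param."""

          def __init__(self, lp_params, lr=1e-4):
              self.lp = lp_params
              # lp -> hp mapping: one full-precision master value per low-precision param.
              self.hp = [float(p) for p in lp_params]
              self.lr = lr

          def step(self, grads):
              for i, g in enumerate(grads):
                  self.hp[i] -= self.lr * g         # accumulate the update in full precision
                  self.lp[i] = to_bf16(self.hp[i])  # write a rounded copy back to the model

      # Without a master copy, a tiny update vanishes entirely in the rounding:
      assert to_bf16(1.0 - 1e-4) == 1.0

      # With one, updates accumulate in hp and eventually show up in lp:
      params = [1.0]
      opt = Bf16Optimizer(params)
      for _ in range(100):
          opt.step([1.0])
      ```

      This is why the commit links each bf16 parameter to an fp32 partner and refreshes those links after a checkpoint load: the optimizer state lives on the hp side, and the lp side is just a rounded view of it.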
  31. 11 Feb 2022 (1 commit)
  32. 23 Jan 2022 (1 commit)
  33. 22 Oct 2021 (2 commits)
  34. 10 Oct 2021 (1 commit)
  35. 09 Oct 2021 (1 commit)
  36. 08 Oct 2021 (1 commit)
  37. 02 Oct 2021 (1 commit)
  38. 30 Sep 2021 (1 commit)