Commits · 349f845b838c6992c5ea19e80fc728bef9645962 · Greenplum / DeepSpeed

11 Feb, 2023 · 1 commit
- Handle hanged tests in CI (#2808) · 349f845b
  Committed by Michael Wyatt on Feb 10, 2023
  349f845b
09 Feb, 2023 · 2 commits
- Fix Slurm launcher user args (#2806) · d038dbd2
  Committed by Logan Adams on Feb 08, 2023
```
Fix missing connections from --launcher_args to Slurm srun command.
```
  d038dbd2
- Add user defined launcher args for PDSH launcher (#2804) · 4af1f76a
  Committed by Logan Adams on Feb 08, 2023
```
* Add user defined launcher args for PDSH launcher

* Formatting fixes
```
  4af1f76a
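The two launcher commits above both concern forwarding user-supplied launcher arguments to the underlying command line. As a rough illustration only (this is not DeepSpeed's launcher code, and `build_launch_cmd`, `base_cmd`, and `launcher_args` are hypothetical names), the idea can be sketched as splitting the user string and appending it to the launcher command:

```python
import shlex


def build_launch_cmd(base_cmd, launcher_args=""):
    """Append user-supplied launcher arguments to a launcher command.

    Hypothetical helper for illustration; not taken from DeepSpeed.
    """
    # shlex.split honours shell-style quoting, so a quoted value with
    # spaces stays a single argument.
    return list(base_cmd) + shlex.split(launcher_args)


# Assumed example values; the srun flags shown are ordinary Slurm options.
srun_cmd = build_launch_cmd(["srun", "-n", "8"], "--partition=gpu --time=00:30:00")
pdsh_cmd = build_launch_cmd(["pdsh", "-S", "-f", "1024"], "")
print(srun_cmd)
print(pdsh_cmd)
```

Splitting with `shlex` rather than `str.split` is a design choice made here so that quoted values survive intact; the actual launchers may handle this differently.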
08 Feb, 2023 · 2 commits

Add container load checkpoint error reporting + refactor (#2792) · 10f3c301

Committed by Lev Kurilenko on Feb 07, 2023

This PR refactors the organization of meta tensor checkpoint loading as follows:

- Move get_param_names() abstract method definition from TransformerPolicy into MetaTensorContainer
- Model-specific get_param_names() definitions moved from policy into model-specific container
- selected_policy_g, megatron_v2_g, and transformer_config_g globals replaced with a single container_g global, since the container will contain all of the information those globals previously captured
- ckpt_load_enabled flag added to containers that's set to False by default in the base.py container and gets set to True when the MetaTensorContainer feature is inherited
- Assertion added to replace_transformer_layer before performing checkpoint loading to check if ckpt_load_enabled ==True, otherwise an error message will be printed saying that the container does not support meta tensor checkpoint loading.

The aim of these changes is to more closely couple meta tensor checkpoint loading code to the MetaTensorContainer and to allow for better error reporting of load checkpoint use on model types that don't support this feature.

10f3c301
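The flag-plus-assertion pattern described in this commit message can be sketched as follows. This is a minimal illustration under stated assumptions, not the DeepSpeed implementation: `BaseContainer`, `GPTLikeContainer`, `BERTLikeContainer`, `load_param_names`, and the parameter names are hypothetical; only `MetaTensorContainer`, `get_param_names()`, and `ckpt_load_enabled` come from the message.

```python
class BaseContainer:
    """Hypothetical stand-in for the base container."""

    def __init__(self):
        # Checkpoint loading stays disabled unless a feature mixin enables it.
        self.ckpt_load_enabled = False


class MetaTensorContainer:
    """Sketch of a feature mixin: inheriting it enables checkpoint loading."""

    def __init__(self):
        super().__init__()  # runs BaseContainer.__init__ via the MRO
        self.ckpt_load_enabled = True

    def get_param_names(self):
        # Model-specific containers are expected to override this.
        raise NotImplementedError


class GPTLikeContainer(MetaTensorContainer, BaseContainer):
    """Hypothetical model container that supports meta tensor loading."""

    def get_param_names(self):
        return ("attn.qkv.weight", "attn.qkv.bias")  # illustrative names


class BERTLikeContainer(BaseContainer):
    """Hypothetical model container without the feature."""


def load_param_names(container):
    # Fail early with a clear message instead of an obscure error later.
    assert container.ckpt_load_enabled, (
        f"{type(container).__name__} does not support "
        "meta tensor checkpoint loading"
    )
    return container.get_param_names()


print(load_param_names(GPTLikeContainer()))
```

Calling `load_param_names(BERTLikeContainer())` in this sketch raises an `AssertionError` carrying the message above, which mirrors the error-reporting goal stated in the commit.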

Enable page-locked tensors without CUDA (#2775) · c9b08888

Committed by Olatunji Ruwase on Feb 07, 2023

* Enable page-locked memory in cpu only env

* Enable page-locked memory in cpu only env

* Formatting

* Add TODOs; Release page-locked memory

* Update perf microbenchmark; Reduce unit test memory

* Reduce CI mem usage

c9b08888

07 Feb, 2023 · 2 commits
- remove outdated comment (#2786) · d323abd8
  Committed by Stas Bekman on Feb 06, 2023
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
  d323abd8
- Fixing broken link to azureml-examples recipes (#2795) · f376daea
  Committed by Razvan Tanase on Feb 06, 2023
  f376daea
05 Feb, 2023 · 1 commit

Common location to install libaio-dev (#2779) · 7d9fae4d

Committed by Olatunji Ruwase on Feb 04, 2023

* Common location to install libaio-dev

* Update .github/workflows/setup-venv/action.yml
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

---------
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

7d9fae4d

04 Feb, 2023 · 2 commits

Container param cleanup + remove qkv_merging (#2780) · 0a73e6e6

Committed by Lev Kurilenko on Feb 03, 2023

This PR cleans up some container items and removes an unused qkv_merging parameter:

- Remove qkv_merging=True from BERT containers
- Change containers config object to ds_model_config
- Remove qkv_merging param

0a73e6e6

Reset KV-cache at the beginning of text-generation (#2669) · 9f41ffe4

Committed by Reza Yazdani on Feb 03, 2023

Co-authored-by: Martin Cai <martincai@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

9f41ffe4

03 Feb, 2023 · 2 commits
- add support for hjson config files (#2783) · 4079077c
  Committed by Michael Wyatt on Feb 03, 2023
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```
  4079077c
- Fix Checkpoint-loading with Meta-tensor (#2781) · 2c6e8194
  Committed by Reza Yazdani on Feb 02, 2023
```
* Reset KV-cache at the beginning of text-generation

* Pass the ckpt-loading arguments to work with meta-tensor

* remove unrelated changes
```
  2c6e8194
02 Feb, 2023 · 3 commits

Fix broken kernel inject bug (#2776) · c5b983e9
Committed by Molly Smith on Feb 01, 2023

c5b983e9
fix upsample flops compute by skipping unused kargs (#2773) · b5750b64
Committed by Cheng Li on Feb 01, 2023
```
* fix upsample flops compute by skipping unused kargs

* fix format
```
b5750b64

some fix in flops_profiler (#2068) · e2a31d80

Committed by swli on Feb 02, 2023

* bugs in profiler:
1. Tensor.bmm missed in _patch_tensor_methods function
2. missed funtions in _reload_functionals and _reload_tensor_methods functions
3. torch.mm and torch.Tensor.mm will have same __name__ in wrapFunc, my suggustion is use __str__ instead.

* formatting

---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>

e2a31d80
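Item 3 of this commit message notes that a module-level function and a method can share the same `__name__`, so a table keyed by `__name__` silently loses one of them. The sketch below shows the same effect using only the standard library, with `str.join` and `bytes.join` standing in for `torch.mm` and `torch.Tensor.mm` (an assumption made so the example runs without PyTorch); `register` is a hypothetical helper, not profiler code.

```python
def register(funcs, key):
    """Build a lookup table of callables using the given key function."""
    return {key(f): f for f in funcs}


# Two distinct callables that happen to share the same __name__.
funcs = [str.join, bytes.join]

by_name = register(funcs, lambda f: f.__name__)
by_str = register(funcs, str)

# Keying by __name__ collapses both entries into one; keying by str()
# keeps them apart because the string form includes the owning type.
print(len(by_name), len(by_str))
print(sorted(by_str))
```

Whether `str()` is the best key in the real profiler is left open here; the sketch only shows why `__name__` alone is ambiguous.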

01 Feb, 2023 · 4 commits
- Fix for diffusers v0.12.0 (#2753) · ef6a958e
  Committed by Michael Wyatt on Jan 31, 2023
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
  ef6a958e
- Pin minimum `packaging` requirement (#2771) · 02e95e6a
  Committed by Carlos Mocholí on Jan 31, 2023
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
  02e95e6a
- Refactor/Pydantify monitoring config (#2640) · d923f7c8
  Committed by Michael Wyatt on Jan 31, 2023
```
* pydantify monitoring configs

---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```
  d923f7c8
- Fix hardcoded instances to fp16 in optimizer creation log messages to the correct dtype. (#2743) · 86477538
  Committed by Logan Adams on Jan 31, 2023
```
* Remove hardcoded instances to fp16 in log messages.

* Add model_dtype to print the correct format

* Respond to PR feedback

---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```
  86477538
31 Jan, 2023 · 2 commits
- Add links to new azureML examples (#2756) · 1db4ade3
  Committed by cassieesvelt on Jan 30, 2023
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
  1db4ade3
- Bing/formatting correction (#2764) · 8d3b42c2
  Committed by Bing Xie on Jan 30, 2023
```
* modify engine.py for formatting

* commit formatting changes on engine.py
```
  8d3b42c2
29 Jan, 2023 · 1 commit
- Add environment variable to make nvcc compilation more verbose (#2759) · 258d2831
  Committed by Connor Holmes on Jan 28, 2023
  258d2831
27 Jan, 2023 · 5 commits

Skip test_bias_gelu unit test if torch < 1.12 (#2754) · cc3d7cb9

Committed by Lev Kurilenko on Jan 26, 2023

This PR adds a torch version check in the test_bias_gelu unit test to skip if the torch version < 1.12. This is due to gelu implementation differences in versions prior to 1.12.

cc3d7cb9
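A version gate of the kind this commit describes could look roughly like the sketch below. It is an assumption-laden illustration rather than the test's actual code: `version_tuple` and `needs_skip` are hypothetical helpers, and the version string is passed in explicitly so the example runs without PyTorch installed.

```python
def version_tuple(version):
    """Return (major, minor) from a version string such as '1.13.1+cu117'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def needs_skip(torch_version, minimum=(1, 12)):
    """True when the given version is older than the minimum supported one."""
    return version_tuple(torch_version) < minimum


# In a real test module this would typically feed pytest's skipif marker:
#   @pytest.mark.skipif(needs_skip(torch.__version__),
#                       reason="gelu implementation differs before torch 1.12")
for candidate in ("1.11.0", "1.12.0", "1.13.1+cu117", "2.0.0"):
    print(candidate, needs_skip(candidate))
```

Comparing integer tuples avoids the classic string-comparison trap in which "1.9" sorts after "1.12".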

Fix softmax backward (#2709) · 0b06e0cb

Committed by Reza Yazdani on Jan 26, 2023

* Reset KV-cache at the beginning of text-generation

* Add new backward kernel to handle large softmax-length

* remove unrelated changes
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Connor Holmes <connorholmes@microsoft.com>

0b06e0cb

[zero] remove misleading dtype log (#2732) · a60e31a7
Committed by Jeff Rasley on Jan 26, 2023
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```
a60e31a7

fix a mispelled attribute (#2750) · 30d3f5df

Committed by Stas Bekman on Jan 26, 2023

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

30d3f5df

Fix autotuning so that it records Floating Point Operations per second, not microsecond (#2711) · d4bfae41

Committed by Dashiell Stander on Jan 26, 2023

* Fix how autotuning reports TFLOPS so that they are reported in FLOPS per second, not millisecond
Co-authored-by: Nick Sarkauskas <nsarka00@gmail.com>
Co-authored-by: Quentin Anthony <anthony.301@osu.edu>
Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Actually it is microseconds -> seconds
Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Actually it is microseconds -> seconds
Signed-off-by: Dashiell Stander <dstander@protonmail.com>
Signed-off-by: Dashiell Stander <dstander@protonmail.com>
Co-authored-by: Nick Sarkauskas <nsarka00@gmail.com>
Co-authored-by: Quentin Anthony <anthony.301@osu.edu>

d4bfae41
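The unit fix in this commit comes down to converting an elapsed time in microseconds to seconds before dividing. A minimal arithmetic sketch, assuming the latency is measured in microseconds as the follow-up messages state (`flops_per_second` and `to_tflops` are hypothetical helper names, not autotuner functions):

```python
def flops_per_second(flop_count, elapsed_microseconds):
    """Convert a FLOP count and a latency in microseconds to FLOP/s."""
    elapsed_seconds = elapsed_microseconds / 1e6  # microseconds -> seconds
    return flop_count / elapsed_seconds


def to_tflops(flops):
    """Express a FLOP/s figure in TFLOPS (1 TFLOPS = 1e12 FLOP/s)."""
    return flops / 1e12


# 2e12 floating point operations in 500,000 microseconds (0.5 s).
rate = flops_per_second(2e12, 500_000)
print(to_tflops(rate))

# Dividing by the raw microsecond count instead understates the rate
# by a factor of one million.
wrong_rate = 2e12 / 500_000
print(rate / wrong_rate)
```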

26 Jan, 2023 · 2 commits

Abstract accelerator (step 3) (#2677) · 98cc35b6

Committed by Ma, Guokai on Jan 26, 2023

* Integrate accelerator abstraction interface into deepspeed/

* Fix error message in fp16/fused_optimizer

* fix error message in fp16/unfused_optimizer.py

* assign get_accelerator().pin_memory() result to input Tensor name

* no need to check cuda and whether nvtx supported

* move try-except into inner most block

* call Event() and Stream() in get_accelerator() for data type

* Make Stream and Event as properties of abstract interface so they can be used as data type in deepspeed

* Apply op_builder backend api change from #2705 from @jeffra

* fix tests where Builder NAME is used

* keep original ...Builder.NAME interface instead of ...Builder().NAME interface

* fix builder closure for installation

* fix randomltd builder

* add comments to clarify create_op_builder and get_op_builder

* fix compatibility with pip install -e
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

98cc35b6

[GatheredParameters] fix memory leak (#2665) · ddd48b36

Committed by Stas Bekman on Jan 26, 2023

* [GatheredParameters] fix memory leak

* simplify

* cleanup and move

* style

* Formatting

* fix test

* fix test

* fix test take 2

* Trigger CI
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>

ddd48b36

25 Jan, 2023 · 3 commits

fixing optimizer sanity check (#2742) · 4be8df72
Committed by Joe Mayer on Jan 25, 2023
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```
4be8df72

Automatic tensor parallelism v2 (#2670) · d59b5729

Committed by Molly Smith on Jan 24, 2023

* loop through pipe.model

* tp_parser first draft

* client_module must be type object

* Simplify layernorm tracking. Add unittest.

* cleanup

* Add more models to unittest

* cleanup inference pytest for merging

* Add unittest

* cleanup

* pre-commit

* unittest id and pytest marker

* try marian for unittest

* precommit

* Move tp code to seperate file

* Add new auto tp file

* pre-commit and type

* Update deepspeed/module_inject/auto_tp.py
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* Update deepspeed/module_inject/auto_tp.py
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* Update tests/unit/inference/test_inference.py
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* remove unused fillmask function
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

d59b5729

Change zero_grad() argument to match pytorch (#2741) · 34a11688
Committed by loadams on Jan 24, 2023

34a11688

20 Jan, 2023 · 1 commit

Inference Refactor (replace_with_policy, model_implementations) (#2554) · 867da307

Committed by Ammar Ahmad Awan on Jan 19, 2023

Co-authored-by: Lev Kurilenko <lekurile@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

867da307

19 Jan, 2023 · 4 commits
- fix typo (#2718) · 8df50a26
  Committed by Michael Wyatt on Jan 18, 2023
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
  8df50a26
- BF16 optimizer for BF16+ZeRO Stage 1 (#2706) · 8d87c89e
  Committed by Joe Mayer on Jan 18, 2023
```
* BF16 optimizer only with ZeRO stage 1.

* Updating to grad accum of fp32 for BF16 ZeRO1 case.
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```
  8d87c89e
- update for lm-eval==0.3.0 (#2713) · 23e5133c
  Committed by Michael Wyatt on Jan 18, 2023
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
  23e5133c
- [install] only add deepspeed pkg at install (#2714) · 0b549ad7
  Committed by Jeff Rasley on Jan 18, 2023
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```
  0b549ad7
18 Jan, 2023 · 3 commits

remove master branch from CI triggers (#2712) · df2495ca
Committed by Michael Wyatt on Jan 17, 2023

df2495ca

CUDA optional deepspeed ops (#2507) · 3f210c97

Committed by Olatunji Ruwase on Jan 17, 2023

* CPU-Adam: add compile-flag to enable param-copy from CPU to GPU

* guarde the CUDA-related include files and variables

* remove CUDA dependency from op_builder when building against CPU

* fixing the builder issues

* fix formatting

* return true when there is no mismatch on the cuda version

* guard for when cuda is not available & test with cpu-only environment

* Update cpu_adam and cpu_adagrad

* Format fixes

* Add configurable half precision type; Build/run in CUDA environment

* Run cpu_adam and cpu_adagrad in cpu only environment

* Mark CUDA only unit tests

* CPU environment CI

* Format fixes

* Remove --forked

* Add --forked

* CPU only CI should pass

* Format fixes

* Format fixes

* Remove scattered pytest.skip

* Fix cpu_adam unit test

* Update .github/workflows/nv-torch-latest-cpu.yml
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* Update .github/workflows/nv-torch-latest-cpu.yml
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* Address PR feedback

* OpenMP linking

* Fix unit tests
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

3f210c97

bump to 0.8.1 · 7d0e4270
Committed by Jeff Rasley on Jan 17, 2023

7d0e4270

Greenplum / DeepSpeed · last synced about 1 year ago