1. 12 July 2023 (1 commit)
    • Reduce Unit Test Times (Part 3) (#3850) · aef6c65c
      Authored by Michael Wyatt
      * add coverage report
      
      * define env vars in shared action
      
      * reduce time for longest running tests
      
      * fix broken shared action
      
      * reduce test time
      
      * reducing Pipeline test times
      
      * further reducing test times
      
      * rework Z3 test
      
      * testing new mp.pool and persistent dist envs
      
      * fix import
      
      * reuse distributed environment for tests with lots of param combos
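
      A rough sketch of the reuse idea with hypothetical names (the real
      logic lives in tests/unit/common.py): keep one worker pool alive per
      world size so the distributed environment is initialized once, not
      once per parameter combination.

          # Hypothetical sketch, not the actual tests/unit/common.py code.
          import multiprocessing as mp

          _pool_cache = {}  # world_size -> mp.Pool, reused across test cases

          def get_pool(world_size):
              # Workers in a cached pool can each set up their distributed
              # environment once, then serve every test with this world size.
              if world_size not in _pool_cache:
                  _pool_cache[world_size] = mp.Pool(processes=world_size)
              return _pool_cache[world_size]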
      
      * fix for dist teardown
      
      * fix pickling issue with pool cache
      
      * actually fix pickling problem
      
      * avoid running pool cache stuff on non-distributed tests
      
      * fix issues with nested mp.pool
      
      * fix for nested pools in Pipeline Engine
      
      * re-add params
      
      * update workflows with pytest opts
      
      * implement feedback
      
      * resolve race condition with port selection
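
      The standard fix for this kind of race, sketched with illustrative
      names: let the OS assign an unused port instead of picking one by hand.

          import socket

          def get_free_port() -> int:
              # Binding to port 0 makes the OS hand back a currently unused
              # port, which can then be used as MASTER_PORT for the test.
              with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                  s.bind(("127.0.0.1", 0))
                  return s.getsockname()[1]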
      
      * Update tests/unit/common.py
      
      ---------
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
  2. 07 July 2023 (1 commit)
  3. 06 July 2023 (2 commits)
  4. 05 July 2023 (1 commit)
  5. 03 July 2023 (1 commit)
  6. 30 June 2023 (2 commits)
  7. 29 June 2023 (1 commit)
  8. 27 June 2023 (1 commit)
  9. 24 June 2023 (2 commits)
  10. 13 June 2023 (1 commit)
  11. 09 June 2023 (1 commit)
    • zero3 performance optimizations (#3622) · 0977106a
      Authored by hablb
      * Remove dead code
      
      params_already_reduced is not used
      
      * Prevent evaluation of debug strings
      
      Debug strings are evaluated even when logging is disabled
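
      The usual pattern for this fix, shown as a minimal sketch (names are
      illustrative): defer message construction until the logger will
      actually emit the record.

          import logging

          logger = logging.getLogger(__name__)

          def log_param_state(params):
              # Lazy %-formatting: the message is rendered only if emitted.
              logger.debug("tracking %d params", len(params))
              # For genuinely expensive strings, guard the construction itself
              # so it is skipped entirely when debug logging is off.
              if logger.isEnabledFor(logging.DEBUG):
                  logger.debug("shapes: %s",
                               [tuple(p.shape) for p in params])  # assumes tensors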
      
      * Use contiguous gradients tensor reduce scatter between ranks
      
      Use allreduce instead of reduce scatter; lowers CPU overhead.
      
      * move overflow tracker to optimizer.step
      
      Don't check overflow in gradients for every bucket.
      Do the overflow check once on the flat gradient buffer just before optimizer step.
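
      Conceptually the change collapses to one fused scan over the flat
      buffer (a hedged sketch, not the actual optimizer code):

          import torch

          def grads_overflowed(flat_grad_buffer: torch.Tensor) -> bool:
              # Single inf/NaN check over the whole flat buffer just before
              # optimizer.step(), instead of one check per gradient bucket.
              return not torch.isfinite(flat_grad_buffer).all().item()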
      
      ---------
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
  12. 08 June 2023 (2 commits)
  13. 05 June 2023 (1 commit)
  14. 24 May 2023 (1 commit)
    • Fixing bf16 test (#3551) · 49d399cd
      Authored by Joe Mayer
      * Fixing bf16 test that was missing a config.
      
      * Changing train_batch_size to train_micro_batch_size_per_gpu

      * Changing all train_batch_size to train_micro_batch_size_per_gpu
  15. 17 May 2023 (1 commit)
  16. 16 May 2023 (2 commits)
    • [CPU] Support Intel CPU inference (#3041) · 1f72082f
      Authored by Ma, Guokai
      * add fallback path for kernels used in megatron
      
      * temporary numactl workaround for SPR 56-core
      
      * adapt core allocation according to number of ranks
      
      * add switch to turn on numactl
      
      * detect number of cores on the system
      
      * allow selecting a subset of the cores on the system to bind (see the sketch below)
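
      Taken together, the bullets above amount to roughly this core-allocation
      math (an illustrative sketch, not the DeepSpeed launcher code):

          import os

          def numactl_prefix(local_rank, num_local_ranks, cores=None):
              # Use every detected core, or a user-selected subset, split
              # evenly across local ranks; emit a numactl binding prefix.
              cores = cores if cores is not None else list(range(os.cpu_count()))
              per_rank = len(cores) // num_local_ranks
              mine = cores[local_rank * per_rank:(local_rank + 1) * per_rank]
              return ["numactl", "-C", ",".join(str(c) for c in mine)]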
      
      * remove unneeded changes
      
      * add ccl backend
      
      * change nccl to ccl
      
      * remove unused code
      
      * add comm/ccl to ops
      
      * initial ccl comm support
      
      * first broadcast case passed
      
      * add CCL_Backend to DeepSpeed
      
      * support comm timer for CPU
      
      * support barrier for comm backend
      
      * support specifying the master address from the deepspeed command line
      
      * support pytorch 2.0
      
      * remove 'block' from api
      
      * Tweak for debug
      Signed-off-by: Cao, Zhong Z <zhong.z.cao@intel.com>
      
      * Remove unnecessary directory
      Signed-off-by: Cao, Zhong Z <zhong.z.cao@intel.com>
      
      * Add bf16 kernel support for inference
      
      * Add temporary torch implement for cpu inference
      
      * Add softmax ops cpu fallback for inference
      
      * bind cores to numa domain as well
      
      * merge latest change in gma/numactl
      
      * initial bf16 kernel support with fallback path
      
      * initial fallback path for bloom kernel injection
      
      * fix softmax attn mask
      
      * check KMP_AFFINITY to avoid conflict with numactl
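
      The conflict check is conceptually simple; a hypothetical sketch:

          import os
          import warnings

          def core_binding_allowed() -> bool:
              # An explicit KMP_AFFINITY mask can contradict numactl binding,
              # so skip binding rather than fight over CPU affinity.
              if os.environ.get("KMP_AFFINITY"):
                  warnings.warn("KMP_AFFINITY is set; skipping numactl binding")
                  return False
              return True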
      
      * New CCLBackend which utilizes TorchBackend for initialization
      
      * roll back last change because it caused a result error
      
      * fix issue where the bloom injection policy for TP could not work.
      
      injection_policy={BloomBlock: ("self_attention.dense", "mlp.dense_4h_to_h")}
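
      For context, that policy plugs into DeepSpeed inference roughly as
      follows (a usage sketch: the loaded model and the tp_size value are
      assumptions, not taken from this PR):

          import deepspeed
          from transformers.models.bloom.modeling_bloom import BloomBlock

          # injection_policy names the submodules whose outputs must be
          # all-reduced when each block is split across tensor-parallel ranks.
          model = deepspeed.init_inference(
              model,  # assumed: a loaded Hugging Face Bloom model
              tensor_parallel={"tp_size": 2},
              injection_policy={BloomBlock: ("self_attention.dense",
                                             "mlp.dense_4h_to_h")},
          )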
      
      * Use TorchBackend to initialize CCLBackend to make behavior consistent
      
      * remove comm under deepspeed/ops
      
      * add license header
      
      * code clean up
      
      * fix format issue
      
      * remove magic number in main address
      
      * add caching support, but do not turn it on by default
      
      * change name of inference_cuda_module to inference_module
      
      * Check for is_synchronized_device in accelerator before getting Event
      
      * fix typo
      
      * Fix fallback path of softmax kernel on CUDA device for BF16 data type, because CUDA tril does not support BF16 datatype, enforce fp32 data type
      
      * add cpu backend files
      
      * change CPU_Accelerator op_builder_dir
      
      * remove cpu_kernel_path
      
      * using CPU_Accelerator on non-cuda device
      
      * fix deepspeed.op_builder => deepspeed.ops.op_builder
      
      * add alias for num_gpus: num_accelerators
      
      * allow loading cpu_builder in build stage
      
      * Assume cuda available if torch not installed
      
      * add oneccl_binding_pt to requirements
      
      * move oneccl-binding-pt to separate requirements-cpu.txt
      
      * add missing file
      
      * use dependency_links in setuptools.setup() call for additional dependency links
      
      * install oneccl_bind_pt in workflows
      
      * change oneccl_bind_pt's version from 1.13 to 2.0
      
      * use intel_extension_for_pytorch as an indicator that CPU_Accelerator should be used
      
      * Add indicator for Accelerator used
      
      * change foo.c to foo.cpp
      
      * exclude 'cpu' directory in CUDA op builder reflection
      
      * add a cpu-inference workflow
      
      * run cpu-inference workflow on self-hosted instance
      
      * change cpu runs-on node to v100 node
      
      * print out python version in workflow
      
      * add verbose flag to pip command to understand oneccl_bind_pt install issue
      
      * update cpu-inference workflow
      
      * add a stage to detect instance instruction sets
      
      * add back bf16 support for CPU inference
      
      * enable autoTP for bloom
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      
      * update workflow to detect cpu instruction sets
      
      * temporary workaround for Intel Extension for PyTorch AVX2 instruction set detection
      
      * change cpu-inference workflow machine to ubuntu-20.04
      
      * add sharded checkpoint loading for AutoTP path to reduce the peak memory in initialization stage
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      
      * enable policy for llama
      
      * use a special build ipex to test avx2 detection fix
      
      * fix format
      
      * fix test fail issue
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      
      * fix gptj sharded checkpoint loading problem
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      
      * return a not-implemented builder from get_op_builder in cpu_backend
      
      * support cpu device in tests
      
      * use cpuinfo to extract number of CPUs
      
      * use ~/tmp as transformer cache rather than /blob/
      
      * Add support for mpich launcher with prefer_deepspeed_comm
      
      * add missing modification in accelerator
      
      * enable IMPI launcher
      
      * remove unused file and fix formatting
      
      * clean up ccl.cpp
      
      * Less confusing error message when certain op builders are not implemented
      
      * Fix license header
      
      * Add license header
      
      * add license headers
      
      * add license header
      
      * fix cuda specific code in test
      
      * update CPU workflow
      
      * use numactl to bind to core
      
      * allow bind_cores_to_rank in multi-node impi runner
      
      * fix format error
      
      * Remove InferenceBuilder
      
      * fix format error in numa.py
      
      * check whether op is in installed ops in ds_report.py
      
      * allow overriding the accelerator with DS_ACCELERATOR='cuda', 'cpu' or 'xpu'
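
      In use, the override looks roughly like this (the printed value is an
      expectation, not a captured log):

          import os

          # Must be set before DeepSpeed inspects the system.
          os.environ["DS_ACCELERATOR"] = "cpu"  # or "cuda" / "xpu"

          from deepspeed.accelerator import get_accelerator
          print(get_accelerator().device_name())  # expected: cpu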
      
      * lazy init class_dict in CUDA_Accelerator to avoid cyclic initialization of CUDA_Accelerator
      
      * put the short path at the beginning of real_accelerator.py
      
      * device_count return number of NUMA nodes
      
      * fix typo
      
      * install numactl in cpu workflow
      
      * Follow comments
      
      * Better implementation of device_count() and current_device()
      
      * remove dependency_link for Intel Extension for DeepSpeed
      
      * check is_synchronized_device in timer only once
      
      * remove env mapping WA in cpu_accelerator
      
      * fix duplicate definition
      
      * fix format error
      
      * refine ccl backend selection
      
      * move comments to the right place
      
      * remove prefer_deepspeed_comm, use CCLBackend by default
      
      * refactor fallback path
      
      * Fix execution failure in kernel injection path
      
      * do not refactor the kernel injection fallback path in residual_add because it contains a function call with side effects
      
      * guard residual_add fallback path with environ DS_KI_FALLBACK=True
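
      A minimal sketch of that guard, assuming the variable is read as a
      string flag:

          import os

          def ki_fallback_enabled() -> bool:
              # Opt-in only: the residual_add fallback has side effects, so
              # it runs solely when the user sets DS_KI_FALLBACK=True.
              return os.environ.get("DS_KI_FALLBACK", "").lower() == "true"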
      
      * fix format error
      
      * add test for allreduce on CPU workflow
      
      * fix format error
      
      * Fall back to TorchBackend if CCLBackend kernels are not implemented
      
      * Update Intel Extension for PyTorch installation link
      
      * Don't specify version number of Intel Extension for PyTorch
      
      * install oneCCL for CCLBackend
      
      * fix link path for CPU comm kernels
      
      * fix source oneCCL environment
      
      * source oneCCL env before run UT
      
      * Give a more specific instruction when CCL_ROOT is not defined
      
      ---------
      Signed-off-by: Cao, Zhong Z <zhong.z.cao@intel.com>
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      Co-authored-by: sdp <sdp@aia-sdp-spr-108864.jf.intel.com>
      Co-authored-by: Cao, Zhong Z <zhong.z.cao@intel.com>
      Co-authored-by: Zhenhuan Chen <zhenhuan.chen@intel.com>
      Co-authored-by: baodii <di.bao@intel.com>
      Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
      Co-authored-by: jianan-gu <jianan.gu@intel.com>
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
      Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
    • fix typo in comments with deepspeed/ (#3537) · c8d3f5eb
      Authored by digger yu
      * fix spelling error with deepspeed/runtime/
      
      * fix typo docs/
      
      * fix typo in comments with deepspeed/
      
      ---------
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
      Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
  17. 15 May 2023 (1 commit)
  18. 11 May 2023 (1 commit)
  19. 10 May 2023 (1 commit)
  20. 04 May 2023 (1 commit)
  21. 30 April 2023 (1 commit)
  22. 29 April 2023 (1 commit)
  23. 26 April 2023 (1 commit)
  24. 21 April 2023 (1 commit)
  25. 18 April 2023 (1 commit)
  26. 14 April 2023 (1 commit)
  27. 13 April 2023 (2 commits)
  28. 06 April 2023 (3 commits)
  29. 31 March 2023 (1 commit)
  30. 27 March 2023 (1 commit)
  31. 24 March 2023 (2 commits)