1. 18 Jul, 2023 (4 commits)
  2. 15 Jul, 2023 (2 commits)
  3. 14 Jul, 2023 (6 commits)
  4. 12 Jul, 2023 (1 commit)
    • Reduce Unit Test Times (Part 3) (#3850) · aef6c65c
      Michael Wyatt authored
      * add coverage report
      
      * define env vars in shared action
      
      * reduce time for longest running tests
      
      * fix broken shared action
      
      * reduce test time
      
      * reducing Pipeline test times
      
      * further reducing test times
      
      * rework Z3 test
      
      * testing new mp.pool and persistent dist envs
      
      * fix import
      
      * reuse distributed environment for tests with lots of param combos
      
      * fix for dist teardown
      
      * fix pickling issue with pool cache
      
      * actually fix pickling problem
      
      * avoid running pool cache stuff on non-distributed tests
      
      * fix issues with nested mp.pool
      
      * fix for nested pools in Pipeline Engine
      
      * re-add params
      
      * update workflows with pytest opts
      
      * implement feedback
      
      * resolve race condition with port selection (see the port sketch after this entry)
      
      * Update tests/unit/common.py
      
      ---------
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
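      The port-selection race above presumably comes from parallel test workers probing
      for a free master port for torch.distributed and colliding. A minimal sketch of
      the usual fix, not the PR's exact code (get_free_port is a hypothetical helper
      name): let the OS assign an unused port by binding to port 0.

          import socket

          def get_free_port() -> int:
              # Port 0 tells the kernel to pick any currently unused port; this
              # avoids two test workers racing to claim the same hard-coded port.
              with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                  s.bind(("127.0.0.1", 0))
                  return s.getsockname()[1]  # a small reuse window remains after close

          # e.g. exported as MASTER_PORT before torch.distributed.init_process_group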
  5. 07 Jul, 2023 (1 commit)
  6. 06 Jul, 2023 (2 commits)
  7. 30 Jun, 2023 (2 commits)
  8. 29 Jun, 2023 (1 commit)
  9. 27 Jun, 2023 (2 commits)
  10. 24 Jun, 2023 (2 commits)
  11. 22 Jun, 2023 (1 commit)
  12. 14 Jun, 2023 (1 commit)
    • Fix apex install bugs (#3741) · 1b401823
      Logan Adams authored
      * Fix apex installation
      
      * Switch install flag from build-opt to global-opt to fix missing cpp_ext (see the sketch after this entry)
      
      * Try installing with support for newer pip
      
      * Add build packaging
      
      * Update to latest
      
      * Pin to specific commit while pyproject.toml is fixed
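      The build-opt/global-opt switch matters because apex's setup.py only builds
      its C++/CUDA extensions when it finds the corresponding flags in its own
      argv; pip's --global-option forwards flags to setup.py itself, which is
      where that check lives, while the build-option flags evidently did not
      reach it. A simplified illustration of the gating pattern, not apex's
      actual file:

          import sys
          from setuptools import setup

          ext_modules = []
          if "--cpp_ext" in sys.argv:      # only true when pip forwards the flag here
              sys.argv.remove("--cpp_ext")
              # ... append the CppExtension build targets here ...

          setup(name="example-ext-package", ext_modules=ext_modules)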
  13. 01 Jun, 2023 (1 commit)
  14. 16 May, 2023 (1 commit)
    • [CPU] Support Intel CPU inference (#3041) · 1f72082f
      Ma, Guokai authored
      * add fallback path for kernels used in megatron
      
      * temporary numactl WA for SPR 56-core
      
      * adapt core allocation according to number of ranks
      
      * add switch to turn on numactl
      
      * detect number of cores on the system
      
      * allow selecting a subset of the cores on the system to bind
      
      * remove unneeded changes
      
      * add ccl backend
      
      * change nccl to ccl
      
      * remove unused code
      
      * add comm/ccl to ops
      
      * initial ccl comm support
      
      * first broadcast case passed
      
      * add CCL_Backend to DeepSpeed
      
      * support comm timer for CPU
      
      * support barrier for comm backend
      
      * support specifying the master address from the deepspeed command line
      
      * support pytorch 2.0
      
      * remove 'block' from api
      
      * Tweak for debug
      Signed-off-by: Cao, Zhong Z <zhong.z.cao@intel.com>
      
      * Remove unnecessary directory
      Signed-off-by: Cao, Zhong Z <zhong.z.cao@intel.com>
      
      * Add bf16 kernel support for inference
      
      * Add temporary torch implementation for cpu inference
      
      * Add softmax ops cpu fallback for inference
      
      * bind cores to numa domain as well
      
      * merge latest change in gma/numactl
      
      * initial bf16 kernel support with fallback path
      
      * initial fallback path for bloom kernel injection
      
      * fix softmax attn mask
      
      * check KMP_AFFINITY to avoid conflict with numactl
      
      * New CCLBackend which utilize TorchBackend for initialization
      
      * roll back last change because it causes a result error
      
      * fix issue where bloom injection policy TP could not work (see the usage sketch after this entry):
      
      injection_policy={BloomBlock: ("self_attention.dense", "mlp.dense_4h_to_h")}
      
      * Use TorchBackend to initialize CCLBackend, make behavior consistent
      
      * remove comm under deepspeed/ops
      
      * add license header
      
      * code clean up
      
      * fix format issue
      
      * remove magic number in main address
      
      * add caching support but not turn on by default
      
      * change name of inference_cuda_module to inference_module
      
      * Check for is_synchronized_device in accelerator before getting Event
      
      * fix typo
      
      * Fix fallback path of softmax kernel on CUDA device for BF16 data type: because CUDA tril does not support the BF16 datatype, enforce fp32
      
      * add cpu backend files
      
      * change CPU_Accelerator op_builder_dir
      
      * remove cpu_kernel_path
      
      * using CPU_Accelerator on non-cuda device
      
      * fix deepspeed.op_builder => deepspeed.ops.op_builder
      
      * add alias for num_gpus: num_accelerators
      
      * allow loading cpu_builder in build stage
      
      * Assume cuda available if torch not installed
      
      * add oneccl_binding_pt to requirements
      
      * move oneccl-binding-pt to separate requirements-cpu.txt
      
      * add missing file
      
      * use dependency_links in setuptools.setup() call for additional dependency links
      
      * install oneccl_bind_pt in workflows
      
      * change oneccl_bind_pt's version from 1.13 to 2.0
      
      * use intel_extension_for_pytorch as indicator that CPU_Accelerator should be used
      
      * Add indicator for Accelerator used
      
      * change foo.c to foo.cpp
      
      * exclude 'cpu' directory in CUDA op builder reflection
      
      * add a cpu-inference workflow
      
      * run cpu-inference workflow on self-hosted instance
      
      * change cpu runs-on node to v100 node
      
      * print out python version in workflow
      
      * add verbose in pip command to understand oneccl_bind_pt install issue
      
      * update cpu-inference workflow
      
      * add a stage to detect instance instruction sets
      
      * add back bf16 support for CPU inference
      
      * enable autoTP for bloom
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      
      * update workflow to detect cpu instruction sets
      
      * temporary WA for Intel Extension for PyTorch AVX2 instruction set detection
      
      * change cpu-inference workflow machine to ubuntu-20.04
      
      * add sharded checkpoint loading for AutoTP path to reduce the peak memory in initialization stage
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      
      * enable policy for llama
      
      * use a special build ipex to test avx2 detection fix
      
      * fix format
      
      * fix test fail issue
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      
      * fix gptj sharded checkpoint loading problem
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      
      * return a not implemented build in get_op_builder in cpu_backend
      
      * support cpu device in tests
      
      * use cpuinfo to extract number of CPUs
      
      * use ~/tmp as transformer cache rather than /blob/
      
      * Add support for mpich launcher with prefer_deepspeed_comm
      
      * add missing modification in accelerator
      
      * enable IMPI launcher
      
      * remove unused file and fix formatting
      
      * clean up ccl.cpp
      
      * Less confusing error message when certain op builders are not implemented
      
      * Fix license header
      
      * Add license header
      
      * add license headers
      
      * add license header
      
      * fix cuda specific code in test
      
      * update CPU workflow
      
      * use numactl to bind to core
      
      * allow bind_cores_to_rank in multi-node impi runner
      
      * fix format error
      
      * Remove InferenceBuilder
      
      * fix format error in numa.py
      
      * check whether op is in installed ops in ds_report.py
      
      * allow overriding accelerator with DS_ACCELERATOR='cuda', 'cpu' or 'xpu' (see the usage sketch after this entry)
      
      * lazy init class_dict in CUDA_Accelerator to avoid cyclic initialization of CUDA_Accelerator
      
      * put the short path at the beginning of real_accelerator.py
      
      * device_count returns the number of NUMA nodes
      
      * fix typo
      
      * install numactl in cpu workflow
      
      * Follow comments
      
      * Better implementation of device_count() and current_device()
      
      * remove dependency_link for Intel Extension for DeepSpeed
      
      * check is_synchronized_device in timer only once
      
      * remove env mapping WA in cpu_accelerator
      
      * fix duplicate definition
      
      * fix format error
      
      * refine ccl backend selection
      
      * move comments to the right place
      
      * remove prefer_deepspeed_comm, use CCLBackend by default
      
      * refactor fallback path
      
      * Fix execution failure in kernel injection path
      
      * do not refactor the kernel injection fallback path in residual_add because it contains a function call with side effects
      
      * guard residual_add fallback path with environ DS_KI_FALLBACK=True
      
      * fix format error
      
      * add test for allreduce on CPU workflow
      
      * fix format error
      
      * Fall back to TorchBackend if CCLBackend kernels are not implemented
      
      * Update Intel Extension for Pytorch installation link
      
      * Don't specify version number of Intel Extension for PyTorch
      
      * install oneCCL for CCLBackend
      
      * fix link path for CPU comm kernels
      
      * fix source oneCCL environment
      
      * source oneCCL env before running UTs
      
      * Give more specific instruction when CCL_ROOT not defined
      
      ---------
      Signed-off-by: Cao, Zhong Z <zhong.z.cao@intel.com>
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      Co-authored-by: sdp <sdp@aia-sdp-spr-108864.jf.intel.com>
      Co-authored-by: Cao, Zhong Z <zhong.z.cao@intel.com>
      Co-authored-by: Zhenhuan Chen <zhenhuan.chen@intel.com>
      Co-authored-by: baodii <di.bao@intel.com>
      Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
      Co-authored-by: jianan-gu <jianan.gu@intel.com>
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
      Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
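      The injection_policy shown in the bloom autoTP fix above is passed to
      deepspeed.init_inference, which tells tensor parallelism which output
      linears to all-reduce. A usage sketch in which the model name, dtype,
      and mp_size are illustrative, not from the PR; only the policy dict is:

          import torch
          import deepspeed
          from transformers import AutoModelForCausalLM
          from transformers.models.bloom.modeling_bloom import BloomBlock

          model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
          # Shard the model for tensor-parallel inference with the quoted policy.
          model = deepspeed.init_inference(
              model,
              mp_size=2,                  # illustrative tensor-parallel degree
              dtype=torch.bfloat16,
              injection_policy={BloomBlock: ("self_attention.dense",
                                             "mlp.dense_4h_to_h")},
          )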
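      The DS_ACCELERATOR override above can be exercised as follows; a minimal
      usage sketch, assuming the variable is set before the accelerator is first
      resolved in the process:

          import os
          os.environ["DS_ACCELERATOR"] = "cpu"   # or "cuda" / "xpu"

          from deepspeed.accelerator import get_accelerator

          acc = get_accelerator()
          print(acc.device_name())   # "cpu" when the CPU accelerator is selected
          print(acc.device_count())  # per this PR, the number of NUMA nodes on CPU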
  15. 12 May, 2023 (1 commit)
  16. 04 May, 2023 (1 commit)
  17. 25 Apr, 2023 (1 commit)
  18. 19 Apr, 2023 (3 commits)
  19. 15 Apr, 2023 (1 commit)
  20. 14 Apr, 2023 (1 commit)
  21. 13 Apr, 2023 (1 commit)
    • Update AMD workflows (#3179) · 9408a866
      Logan Adams authored
      * Update AMD workflows
      
      * Update MI200 test flow to use torch latest
      
      * Update tolerances to values that pass (will fix before completing PR)
      
      * Revert changes to atol (see the tolerance sketch after this entry)
      
      * Rename workflows
      
      * Fix CI badges
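      The tolerance items above follow the common pattern of loosening atol/rtol
      in elementwise comparisons so results that differ only by platform-level
      numeric drift (e.g. ROCm vs CUDA kernels) still pass. A generic sketch
      with placeholder values, not the PR's:

          import torch

          def assert_close(actual: torch.Tensor, expected: torch.Tensor):
              # A looser absolute tolerance absorbs small cross-platform drift
              # without hiding order-of-magnitude errors.
              assert torch.allclose(actual, expected, rtol=1e-5, atol=1e-3), (
                  f"max abs diff: {(actual - expected).abs().max().item()}")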
  22. 06 Apr, 2023 (1 commit)
  23. 05 Apr, 2023 (1 commit)
  24. 24 Mar, 2023 (1 commit)
  25. 22 Mar, 2023 (1 commit)