提交 · df425097869214c3371d74a4cbd7c506bea3cef7 · Greenplum / DeepSpeed

08 6月, 2023 6 次提交

C
DeepSpeed overview in Japanese (#3709) · df425097
由 Conglong Li 提交于 6月 07, 2023
```
* DeepSpeed overview in Japanese

* DeepSpeed overview in Japanese
```
df425097

Small tweak on cuda version mismatch documentation (#3706) · d414678d

由 john li 提交于 6月 07, 2023

* Small tweak on cuda version mismatch documentation

* clarify minor versions should also match

---------
Co-authored-by: NLogan Adams <114770087+loadams@users.noreply.github.com>

d414678d

Fix unit test typo in tests/unit/ops/transformer/inference (#3697) · fb2b4ab1

由 Michael Wyatt 提交于 6月 07, 2023

* mix typo and missing epsilon value

* Touch file to re-build

* revert changes

* Touch file to re-build

* Format

---------
Co-authored-by: NLogan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: NLogan Adams <loadams@microsoft.com>

fb2b4ab1

D
change partititon_name to partition_name (#3700) · c5edc91e
由 digger yu 提交于 6月 08, 2023
```
Co-authored-by: NLogan Adams <114770087+loadams@users.noreply.github.com>
```
c5edc91e

Fix gpt-j inference issue (#3639) · 34a9fbf1

由 Reza Yazdani 提交于 6月 07, 2023

* fix gpt-j inference issue for mlp_gemm_func call

* bring back the gpt-j inference-test

* fix formatting

* fix the neox and pythia injection issue

34a9fbf1

L
Revert "fix typo name (#3689)" (#3702) · 7e59ef12
由 Logan Adams 提交于 6月 07, 2023
```
This reverts commit f2f5f21b.
```
7e59ef12

07 6月, 2023 5 次提交

fix typo name (#3689) · f2f5f21b

由 tensor-tang 提交于 6月 07, 2023

Co-authored-by: NLogan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: NMichael Wyatt <michaelwyatt@microsoft.com>

f2f5f21b

L
Fix incorrectly formatted f string (#3698) · d8aaa581
由 Logan Adams 提交于 6月 06, 2023
```
Co-authored-by: NMichael Wyatt <michaelwyatt@microsoft.com>
```
d8aaa581
A
Correct world_size/backend for mpi (#3694) · c17313fb
由 Abhilash Majumder 提交于 6月 07, 2023
```
Co-authored-by: NLogan Adams <114770087+loadams@users.noreply.github.com>
```
c17313fb

Fix local rank mismatch for heterogeneous nodes (#3409) · b7f463dd

由 Byungsoo Oh 提交于 6月 07, 2023

Co-authored-by: NLogan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: NOlatunji Ruwase <olruwase@microsoft.com>

b7f463dd

non-JIT build fix on ROCm (#3638) · 4cd0a003

由 Ramya Ramineni 提交于 6月 06, 2023

Co-authored-by: NOlatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: NMichael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: NLogan Adams <114770087+loadams@users.noreply.github.com>

4cd0a003

06 6月, 2023 3 次提交
- S
  
  Update README to add ICS'23 paper (#3687) · 2d737edd
  由 Siddharth Singh 提交于 6月 06, 2023
  
  2d737edd
- O
  Use logger in accelerator (#3682) · e5fe5f65
  由 Olatunji Ruwase 提交于 6月 05, 2023
```
* Use logger in accelerator

* Handle pre-build cases

* Explain possible import failure
```
  e5fe5f65
- D
  fix some typo (#3675) · 3fb3cfdc
  由 digger yu 提交于 6月 06, 2023
```
* fix typo deepspeed/runtime

* fix some typo

---------
Co-authored-by: NOlatunji Ruwase <olruwase@microsoft.com>
```
  3fb3cfdc
05 6月, 2023 1 次提交

[MiCS] [Fix] saving and loading model checkpoint logic for MiCS sharding (#3440) · c88af214

由 Zhen Zhang 提交于 6月 04, 2023

* fix mics save checkpoint hanging

* MiCS load_checkpoint

* copyright

* fix for torch-1.9.0

all_reduce_coalesced api does not support nccl backend

* Naming alignment

* adding more test conditions for mics shard size

* test with different shard sizes

* adding assertion for better error msg

---------
Co-authored-by: NZhen Zhang <zhzhn@amazon.com>

c88af214

03 6月, 2023 3 次提交

J

bump to 0.9.4 · f483c034
由 Jeff Rasley 提交于 6月 02, 2023

f483c034

Refactor check_enabled root validator in DeepSpeedMonitorConfig (#3616) · 4559aa9b

由 Buğra 提交于 6月 02, 2023

* Refactor check_enabled root validator in DeepSpeedMonitorConfig

* formatting

* formatting

---------
Co-authored-by: NMichael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: NMichael Wyatt <mrwyattii@gmail.com>

4559aa9b

D
fix typo deepspeed/runtime (#3663) · 5d14afd2
由 digger yu 提交于 6月 03, 2023
```
Co-authored-by: NOlatunji Ruwase <olruwase@microsoft.com>
```
5d14afd2

02 6月, 2023 5 次提交

flops_profiler: add option recompute_fwd_factor for the case of activation recompute (#3362) · 460bec46

由郭叶军提交于 6月 02, 2023

When activation checkpointing is enabled, most of forward is re-computed,
and so the FLOPS calculation should be updated with recompute_fwd_factor=1.0

I don't find a way to pass the option from model script to deepspeed engine,
and so add option directly for flops_profiler.
Co-authored-by: NCheng Li <pistasable@gmail.com>

460bec46

fix typo with deepspeed/ (#3547) · cd4e473e

由 digger yu 提交于 6月 02, 2023

* fix spelling error with deepspeed/runtime/

* fix typo docs/

* fix typo in comments with deepspeed/

* fix typo deepspeed/

* Update constants.py

Remove the space after nebula

---------
Co-authored-by: NLogan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: NMichael Wyatt <michaelwyatt@microsoft.com>

cd4e473e

M
allow dict datatype for checkpoints (#3007) · da8f4e01
由 Michael Wyatt 提交于 6月 01, 2023
```
Co-authored-by: NOlatunji Ruwase <olruwase@microsoft.com>
```
da8f4e01
H
Fix RuntimeError when using ZeRO Stage3 with mpu: #3564 (#3565) · f5dde36c
由 Haodong Lyu 提交于 6月 02, 2023
```
Co-authored-by: NOlatunji Ruwase <olruwase@microsoft.com>
```
f5dde36c
deepspeed/comm/comm.py: fix typo of warning message (#3636) · 3b299997
由郭叶军提交于 6月 02, 2023
```
Co-authored-by: NMichael Wyatt <michaelwyatt@microsoft.com>
```
3b299997

01 6月, 2023 3 次提交

Typo Correction (#3621) · e02b8d0b

由 Micah Zoltu 提交于 6月 01, 2023

Code (in this context) is mass noun, and thus has no plural form.
Co-authored-by: NMichael Wyatt <michaelwyatt@microsoft.com>

e02b8d0b

W
Update megatron.md (#3641) · 8f459c50
由 Will Jessup 提交于 5月 31, 2023
```
grammar fix.
Co-authored-by: NMichael Wyatt <michaelwyatt@microsoft.com>
```
8f459c50
M
Skip tests on docs-only changes (#3651) · 8b8c7031
由 Michael Wyatt 提交于 5月 31, 2023
```
* skip test for docs-only changes

* add missing skip to blog changes
```
8b8c7031

31 5月, 2023 3 次提交

Add Ascend NPU accelerator support (#3595) · f3c8eaca

由 CurryRice233 提交于 5月 31, 2023

* add Ascend NPU accelerator support

* clean code

---------
Co-authored-by: Njializheng <jializheng@huawei.com>
Co-authored-by: NOlatunji Ruwase <olruwase@microsoft.com>

f3c8eaca

stage3.py: do not scale if gradient_predivide_factor is 1.0 (#3630) · 52907a66

由郭叶军提交于 5月 31, 2023

this change also aligns with the logic before reduce_scatter_coalesced
Co-authored-by: NOlatunji Ruwase <olruwase@microsoft.com>

52907a66

AISC launcher fixes (#3637) · 49a73549

由 Jeff Rasley 提交于 5月 30, 2023

* tmp remove launcher args

* add exclude list for env variables on aisc

* add comment

49a73549

27 5月, 2023 1 次提交

Align InferenceEngine to store ms in _model_times (#3501) · d755b9d6

由 Danny Semiat 提交于 5月 27, 2023

* Align InferenceEngine to store ms in _model_times

   When using cuda_events, the measured model time is stored in ms.
   When not using cuda_events, the measured model time was stored in seconds.
   This commit fixes the units and aligns them to store ms, the same as elapsed() function.
   This was observed when running the following pytest:
   unit/inference/test_model_profiling.py::TestModelProfiling::test[False-True-roberta-base-fill-mask]

   Returned values were:
     count=0 e2e_t=895.174312 model_t=0.8529715538024902
     count=1 e2e_t=7.500252 model_t=0.0041310787200927734
     count=2 e2e_t=3.887346 model_t=0.0018568038940429688
     count=3 e2e_t=3.577845 model_t=0.0016334056854248047
     count=4 e2e_t=3.43976 model_t=0.0016703605651855469
     count=5 e2e_t=3.310903 model_t=0.0016107559204101562
     count=6 e2e_t=3.299556 model_t=0.001603841781616211
     count=7 e2e_t=3.605722 model_t=0.0015969276428222656
     count=8 e2e_t=3.273741 model_t=0.0015516281127929688
     count=9 e2e_t=3.46306 model_t=0.0016617774963378906

   The units difference is observed here, when model_t is in ther order of 10e-3 comparing to e2e_t

* Update engine.py

---------
Co-authored-by: NMichael Wyatt <michaelwyatt@microsoft.com>

d755b9d6

26 5月, 2023 3 次提交
- Q
  Expose Consecutive Hysteresis to Users (#3553) · 0411a9f8
  由 Quentin Anthony 提交于 5月 25, 2023
```
Co-authored-by: NJeff Rasley <jerasley@microsoft.com>
```
  0411a9f8
- O
  
  DS init should not broadcast or move zero.Init models (#3611) · d39c311f
  由 Olatunji Ruwase 提交于 5月 25, 2023
  
  d39c311f
- C
  
  bug fix (#3609) · 736bf185
  由 Conglong Li 提交于 5月 25, 2023
  
  736bf185
25 5月, 2023 1 次提交
- N
  
  Fix op_builder against PyTorch nightly (#3596) · 6622776c
  由 Nikita Shulga 提交于 5月 24, 2023
  
  6622776c
24 5月, 2023 2 次提交

Fix Hybrid Engine for BLOOM (#3580) · 76679884

由 Lev Kurilenko 提交于 5月 23, 2023

This PR fixes Hybrid Engine (HE) support for the BLOOM model, which was accidentally broken during the HE refactor in GH-3425.

The BLOOM container now inherits the HybridEngineContainer feature and defines a set_lora_params() function necessary for the feature to work. get_lora_params() is correspondingly removed from the BLOOM policy class as well.

GPT-NeoX was also cleaned up by removing a get_lora_params() function from its policy due to it no longer being used.

76679884

Fixing bf16 test (#3551) · 49d399cd

由 Joe Mayer 提交于 5月 23, 2023

* Fixing bf16 test that was missing a config.

* Chaning train_batch_size to train_micro_batch_size_per_gpu

* Chaning all train_batch_size to train_micro_batch_size_per_gpu

49d399cd

19 5月, 2023 1 次提交
- M
  
  fix_typo (#3559) · d1c3c0df
  由 mzl 提交于 5月 19, 2023
  
  d1c3c0df
17 5月, 2023 2 次提交

Fix (#3527) · 2fc56841

由 Ramya Ramineni 提交于 5月 16, 2023

Co-authored-by: NLogan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: NOlatunji Ruwase <olruwase@microsoft.com>

2fc56841

Clone tensors to avoid torch.save bloat (#3348) · 5c3ebd7e

由 Olatunji Ruwase 提交于 5月 16, 2023

* Clone tensors to avoid torch.save bloat

* Adddocs

* Fix clang-formatting

* Update docs/code-docs/source/model-checkpointing.rst
Co-authored-by: NStas Bekman <stas00@users.noreply.github.com>

* Update deepspeed/checkpoint/utils.py
Co-authored-by: NStas Bekman <stas00@users.noreply.github.com>

* Update deepspeed/checkpoint/utils.py
Co-authored-by: NStas Bekman <stas00@users.noreply.github.com>

* Fix url

* url fix

* Tweak docs

---------
Co-authored-by: NStas Bekman <stas00@users.noreply.github.com>
Co-authored-by: NLogan Adams <114770087+loadams@users.noreply.github.com>

5c3ebd7e

16 5月, 2023 1 次提交

[CPU] Support Intel CPU inference (#3041) · 1f72082f

由 Ma, Guokai 提交于 5月 16, 2023

* add fallback path for kernels used in megatron

* temporary numactl WA for SPR 56core

* adapt core allocation according to number of ranks

* add switch to turn on numactl

* detect number of cores on the system

* allow select a subset of the cores on the system to bind

* remove unneeded changes

* add ccl backend

* change nccl to ccl

* remove unused code

* add comm/ccl to ops

* initial ccl comm support

* first broadcast case passed

* add CCL_Backend to DeepSpeed

* support comm timer for CPU

* support barrier for comm backend

* support specify master address from deepspeed command line

* support pytorch 2.0

* remove 'block' from api

* Tweak for debug
Signed-off-by: NCao, Zhong Z <zhong.z.cao@intel.com>

* Remove unecessary directory
Signed-off-by: NCao, Zhong Z <zhong.z.cao@intel.com>

* Add bf16 kernel support for inference

* Add temporary torch implement for cpu inference

* Add softmax ops cpu fallback for inference

* bind cores to numa domain as well

* merge latest change in gma/numactl

* initial bf16 kernel support with fallback path

* initial fallback path for bloom kernel injection

* fix softmax attn mask

* check KMP_AFFINITY to avoid conflict with numactl

* New CCLBackend which utilize TorchBackend for initialization

* rollback last change because there is result error

* fix bloom injection policy TP could not work issue.

injection_policy={BloomBlock: ("self_attention.dense", "mlp.dense_4h_to_h")}

* Use TorchBackend to initialize CCLBackend, make behavior consistent

* remove comm under deepspeed/ops

* add license header

* code clean up

* fix format issue

* remove magic number in main address

* add caching support but not turn on by default

* change name of inference_cuda_module to inference_module

* Check for is_synchronized_device in accelerator before get Event

* fix typo

* Fix fallback path of softmax kernel on CUDA device for BF16 data type, because CUDA tril does not support BF16 datatype, enforce fp32 data type

* add cpu backend files

* change CPU_Accelerator op_builder_dir

* remove cpu_kernel_path

* using CPU_Accelerator on non-cuda device

* fix deepspeed.op_builder => deepspeed.ops.op_builder

* add alias for num_gpus: num_accelerators

* allow loading cpu_builder in build stage

* Assume cuda available if torch not installed

* add oneccl_binding_pt to requirements

* move oneccl-binding-pt to seperate requiremetns-cpu.txt

* add missing file

* use dependency_links in setuptools.setup() call for additional dependency links

* install oneccl_bind_pt in workflows

* change oneccl_bind_pt's version from 1.13 to 2.0

* use intel_exention_for_pytorch as indicator that CPU_Accelerator should be used

* Add indicator for Accelerator used

* change foo.c to foo.cpp

* exclude 'cpu' directory in CUDA op builder reflection

* add a cpu-inference workflow

* run cpu-inference workflow on self-hosted instance

* change cpu runs-on node to v100 node

* print out python version in workflow

* add verbose in pip command to understand oneccl_bind_pt install issue

* update cpu-inference workflow

* add a stage to detect instance instruction sets

* add back bf16 support for CPU inference

* enable autoTP for bloom
Signed-off-by: NWang, Yi A <yi.a.wang@intel.com>

* update workflow to detect cpu instruction sets

* temporary WA for Intel Extension for PyTorch AVX2 instructioon set detection

* change cpu-inference workflow machine to ubuntu-20.04

* add sharded checkpoint loading for AutoTP path to reduce the peak memory in initialization stage
Signed-off-by: NWang, Yi A <yi.a.wang@intel.com>

* enable policy for llama

* use a special build ipex to test avx2 detection fix

* fix format

* fix test fail issue
Signed-off-by: NWang, Yi A <yi.a.wang@intel.com>

* fix gptj sharded checkpoint loading problem
Signed-off-by: NWang, Yi A <yi.a.wang@intel.com>

* return a not implemented build in get_op_builder in cpu_backend

* support cpu device in tests

* use cpuinfo to extract number of CPUs

* use ~/tmp as transfomer cache rather than /blob/

* Add support for mpich launcher with prefer_deepspeed_comm

* add missing modification in accelerator

* enable IMPI launcher

* remove unused file and fix formatting

* clean up ccl.cpp

* Less confusing error message when certin op builder are not implemented

* Fix license header

* Add license header

* add license headers

* add license header

* fix cuda specific code in test

* update CPU workflow

* use numactl to bind to core

* allow bind_cores_to_rank in multi-node impi runner

* fix format error

* Remove InferenceBuilder

* fix format error in numa.py

* check whether op is in installed ops in ds_report.py

* allow override accelerator with DS_ACCELERATOR='cuda','cpu' or 'xpu'

* lazy init class_dict in CUDA_Accelerator to avoid cyclic initialization of CUDA_Accelerator

* put short path in the beginning in real_accelerator.py

* device_count return number of NUMA nodes

* fix typo

* install numactl in cpu workflow

* Follow comments

* Better implementation of device_count() and current_device()

* remove dependency_link for Intel Extension for DeepSpeed

* use check is_synchronized_device in timer only once

* remove env mapping WA in cpu_accelerator

* fix duplicate definition

* fix format error

* refine ccl backend selection

* move comments to the right place

* remove prefer_deepspeed_comm, use CCLBackend by default

* refractor fallback path

* Fix execution failure in kernel injection path

* do not refractory kernel injection fallback path in  residual_add because it contains function call with side-effect

* guard residual_add fallback path with environ DS_KI_FALLBACK=True

* fix format error

* add test for allreduce on CPU workflow

* fix format error

* Fallback to TorchBackend if CCLBackend kernel are not implemented

* Update Intel Extension for Pytorch installation link

* Don't specify version number of Intel Extension for PyTorch

* install oneCCL for CCLBackend

* fix link path for CPU comm kernels

* fix source oneCCL environment

* source oneCCL env before run UT

* Give more specific instruction when CCL_ROOT not defined

---------
Signed-off-by: NCao, Zhong Z <zhong.z.cao@intel.com>
Signed-off-by: NWang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Nsdp <sdp@aia-sdp-spr-108864.jf.intel.com>
Co-authored-by: NCao, Zhong Z <zhong.z.cao@intel.com>
Co-authored-by: NZhenhuan Chen <zhenhuan.chen@intel.com>
Co-authored-by: Nbaodii <di.bao@intel.com>
Co-authored-by: NWang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Njianan-gu <jianan.gu@intel.com>
Co-authored-by: NOlatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: NLogan Adams <114770087+loadams@users.noreply.github.com>

1f72082f

Greenplum / DeepSpeed 上一次同步 大约 1 年

Greenplum / DeepSpeed
上一次同步大约 1 年