Commits · a6317eb509cf6a7a3929787530174f7c575cf475 · Greenplum / DeepSpeed

30 Mar, 2023 · 1 commit

replace deprecated functions for communication (#2995) · a6317eb5

Authored by Mayank Mishra on Mar 30, 2023

* 💩 drop dead code

* ♻ replace has_all_gather_base with has_all_gather_into_tensor

* ♻ remove deprecated _all_gather_base

* ♻ remove deprecated _reduce_scatter_base

* 🎨 reformat files

* 🔧 fix _six

* Trigger CI

* Trigger CI

* Trigger CI

* 🎨 formatting

* incorporate suggestion

* incorporate suggestion

---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
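The renames in #2995 above follow PyTorch's move from the private `_all_gather_base` to the public `torch.distributed.all_gather_into_tensor`. A minimal sketch of the feature-detection pattern the bullet list implies is shown below. Only the name `has_all_gather_into_tensor` comes from the commit message. The helper `get_all_gather_function`, the fallback to the private name, and the `SimpleNamespace` stand-ins (used so the sketch runs without torch installed) are assumptions for illustration, not DeepSpeed's actual code.

```python
from types import SimpleNamespace


def has_all_gather_into_tensor(dist_module):
    """Return True if the module exposes the public all_gather_into_tensor API."""
    return hasattr(dist_module, "all_gather_into_tensor")


def get_all_gather_function(dist_module):
    """Hypothetical helper: prefer the public API, else the deprecated private name."""
    if has_all_gather_into_tensor(dist_module):
        return dist_module.all_gather_into_tensor
    if hasattr(dist_module, "_all_gather_base"):
        return dist_module._all_gather_base
    return None


# Stand-ins for an older and a newer torch.distributed (assumption: real code
# would pass the torch.distributed module itself).
old_dist = SimpleNamespace(_all_gather_base=lambda out, inp: "legacy")
new_dist = SimpleNamespace(
    _all_gather_base=lambda out, inp: "legacy",
    all_gather_into_tensor=lambda out, inp: "public",
)

print(get_all_gather_function(old_dist)(None, None))  # legacy
print(get_all_gather_function(new_dist)(None, None))  # public
```

The same detection shape would apply to `reduce_scatter_tensor` versus `_reduce_scatter_base`.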

29 Mar, 2023 · 1 commit

Disable Stage 1&2 CPUAdam pathways (#3097) · 4b6d7c15

Authored by Michael Wyatt on Mar 28, 2023

* disable CPUAdam pathways in optimizer copy/step

* Update stage_1_and_2.py

---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

28 Mar, 2023 · 1 commit

Fix comms benchmark import issues and support MPI/slurm launching (#2932) · 9726bd46

Authored by Quentin Anthony on Mar 27, 2023

* Fix benchmark import issues and support MPI launching with pure torch.dist

* Formatting

* Update comms benchmark README

* Formatting

* Added better error handling and support MPI torch.dist backend

* Update formatting versions

* Formatting again

* Trigger CI

---------
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

27 Mar, 2023 · 1 commit
- update formatter version and style settings (#3098) · 91d63e02
  Authored by Jeff Rasley on Mar 27, 2023
24 Mar, 2023 · 6 commits
- Move cuda check into utils (#3074) · b3ec1c97
  Authored by Logan Adams on Mar 23, 2023
  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- pre-commit check for torch.cuda in code (#2981) · 090d49e7
  Authored by Ma, Guokai on Mar 24, 2023
  Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
- Empty ZeRO3 partition cache (#3060) · e80ae088
  Authored by Olatunji Ruwase on Mar 23, 2023
  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- Goodbye Torch 1.8 (#3082) · 5cdf3593
  Authored by Michael Wyatt on Mar 23, 2023
  * bump torch18 -> torch19
  * fix gptj
  ---------
  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- allow list (#3042) · 5c2a81c2
  Authored by Satpal Singh Rathore on Mar 23, 2023
  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- Fix nebula in save_16bit_model issue (#3023) · a78d6b89
  Authored by FreyaRao on Mar 24, 2023
  Co-authored-by: Qinghuan Rao <qinghuanrao@microsoft.com>
22 Mar, 2023 · 7 commits
- Softmax Scheduling Cleanup (#3046) · 1286e374
  Authored by Connor Holmes on Mar 22, 2023
  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- Remove bf16 from inference config dtye enum (#3010) · 27e1b02d
  Authored by Molly Smith on Mar 22, 2023
  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- fix return prev key and value , added strides to from_blob (#2828) · 871c8a3f
  Authored by Mor Zusman on Mar 22, 2023
  Co-authored-by: Mor Zusman <morz@ai21.com>
  Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- [CI] follow-up fixes (#3072) · 36677588
  Authored by Jeff Rasley on Mar 21, 2023
- Assert mp_size is factor of model dimensions (#2891) · 9ea0fdc2
  Authored by Molly Smith on Mar 21, 2023
  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- Several fixes to unblock CI (#3047) · 4e068623
  Authored by Logan Adams on Mar 21, 2023
  Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- [docs] add MCR-DL paper to readme/docs (#3066) · b38b3036
  Authored by Quentin Anthony on Mar 21, 2023
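The `mp_size` assertion introduced in #2891 above amounts to a divisibility check before a model is split across model-parallel ranks. The sketch below is illustrative only: the function name `assert_mp_size_divides` and the dimension names `hidden_size` and `num_heads` are hypothetical, and the commit log does not say which dimensions the actual change checks.

```python
def assert_mp_size_divides(mp_size, **dims):
    """Raise AssertionError if any named dimension is not divisible by mp_size."""
    for name, value in dims.items():
        assert value % mp_size == 0, (
            f"{name} ({value}) must be divisible by mp_size ({mp_size})"
        )


# Passes silently: 4096 and 32 are both multiples of 4.
assert_mp_size_divides(4, hidden_size=4096, num_heads=32)

# Fails on the first dimension that mp_size does not divide.
try:
    assert_mp_size_divides(3, hidden_size=4096, num_heads=32)
except AssertionError as err:
    print(err)  # hidden_size (4096) must be divisible by mp_size (3)
```

Failing early with the offending dimension in the message is generally easier to debug than a shape mismatch that surfaces later inside a sharded layer.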
18 Mar, 2023 · 1 commit
- Fix Broken Links (#3048) · f1e4fb0b
  Authored by Satpal Singh Rathore on Mar 18, 2023
16 Mar, 2023 · 1 commit
- update email info · bbfd0a6a
  Authored by Jeff Rasley on Mar 15, 2023
15 Mar, 2023 · 6 commits
- Improve loss overflow logs (#3008) · ac2c9ffa
  Authored by Quentin Anthony on Mar 15, 2023
  * Improve overflow logs
  * Trigger CI
  ---------
  Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
- Convert model parameters from generator to list. (#3017) · 94f7da26
  Authored by Joe Mayer on Mar 15, 2023
  Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
- [logger] implement warning_once (#3021) · 50a49e42
  Authored by Stas Bekman on Mar 14, 2023
  Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
- adding attribute checks for bf opt with zero (#3022) · d7c925e4
  Authored by Joe Mayer on Mar 14, 2023
  Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
- Update torch_checkpoint_engine.py (#3019) · e355863b
  Authored by Stas Bekman on Mar 14, 2023
- [docs] add new paper to readme/docs (#3018) · 4292e8c5
  Authored by Jeff Rasley on Mar 14, 2023
  Co-authored-by: Zhewei Yao <zheweiyao@gmail.com>
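The `warning_once` helper named in #3021 above suggests a logger call that emits each distinct message a single time. One common way to get that behavior is to memoize the call with `functools.lru_cache`. The sketch below assumes that approach; whether DeepSpeed's implementation matches it is not stated in this log, and the logger name and `ListHandler` class are hypothetical, added only so the result can be inspected.

```python
import functools
import logging

logger = logging.getLogger("warning_once_demo")  # hypothetical logger name


@functools.lru_cache(maxsize=None)
def warning_once(message):
    """Log `message` at WARNING level on first use; repeat calls hit the cache."""
    logger.warning(message)


class ListHandler(logging.Handler):
    """Collects formatted messages in a list so the demo can inspect them."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record.getMessage())


handler = ListHandler()
logger.addHandler(handler)

for _ in range(3):
    warning_once("loss scale overflow")  # logged only on the first iteration
warning_once("a different message")      # new message, so it is logged

print(handler.records)  # ['loss scale overflow', 'a different message']
```

Because the cache key is the argument tuple, messages that embed changing values (step numbers, timestamps) would each count as new and still be logged.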
14 Mar, 2023 · 1 commit

Fix buffer size for pipeline parallel and communication schedule (#2862) · b528f50e

Authored by Masahiro Tanaka on Mar 13, 2023

* fix buffer size for pipeline parallel (#2800)

* improve explanation of buffer size for pipeline parallelism
Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>

* fix format of comment

---------
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

13 Mar, 2023 · 1 commit
- ckpt: create directories in checkpoint_engine (#2988) · 43d58d99
  Authored by Adam Moody on Mar 13, 2023
  Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
11 Mar, 2023 · 1 commit
- Fix Meta Tensor checkpoint load for OPT models (#2990) · 3798e605
  Authored by Lev Kurilenko on Mar 10, 2023
  This PR fixes Meta Tensor checkpoint loading for OPT models where the SD keys start with `model.`.
10 Mar, 2023 · 1 commit
- [zero] prevent poor configs from running w. zero-offload (#2971) · 457850dc
  Authored by Jeff Rasley on Mar 09, 2023
09 Mar, 2023 · 1 commit
- Fix issue between our abstract accelerator and colossalai's version of op_builder (#2963) · 58a4a4d4
  Authored by Jeff Rasley on Mar 08, 2023
  Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
08 Mar, 2023 · 4 commits
- bug fix for skipping mbs (#2171) · 6379defa
  Authored by Rahil Bathwal on Mar 07, 2023
  Co-authored-by: Rajhans Samdani <rajhans@gmail.com>
- bump to 0.8.3 · d58b4df3
  Authored by Jeff Rasley on Mar 07, 2023
- deepspeed.init_distributed() support for TCP protocols (#2905) · db15ef57
  Authored by noabauma on Mar 07, 2023
  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- [RFC] add device abstraction to allow other device than CUDA be used (#2221) · 0acf7e9c
  Authored by Ma, Guokai on Mar 08, 2023
  Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
07 Mar, 2023 · 3 commits
- Improve overflow handling (#2944) · 80d8fcbd
  Authored by Olatunji Ruwase on Mar 06, 2023
  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- Check for local CUDA graphs when enable_cuda_graph=True (#2941) · 87eaf8f9
  Authored by Lev Kurilenko on Mar 06, 2023
- AutoTP Assert Kernel Injection Support (#2939) · 2ede0d94
  Authored by Molly Smith on Mar 06, 2023
  * check kernel injection supported models
  * Clarify why user should use kernel injection
02 Mar, 2023 · 2 commits

TP unsupported models and assertions (#2810) · 4ae3a3da

Authored by Molly Smith on Mar 01, 2023

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Add MPICH Multinode Runner (#2839) · 8d53ac0c

Authored by mzl on Mar 02, 2023

* MPICH support

* MPICH changes

* MPICH changes

* MPICH changes

* MPICH changes

* accelerator runtime modifications

* Accelerator runtime changes

* Accelerator runtime modifications

* Remove redundant print from single node

* Move hostfile to tmp

* Code cleanup for MPICH class

* Code cleanup, rm whitespace

* Removing mpiexec environment check details

* Not needed tmp hostfile as pass directly

* Remove debugging comments

* rm print statement

* Revert comm changes as WA not needed

* Use MPICHRunner name for class

* Use MPICHRunner as class name

* No need to use args.force_multi and args.launcher .

This should be set in deepspeedexamples gpt-3.6b .sh script as:
$launcher=MPICH
run_cmd=" deepspeed  --hostfile=${hostfile_ds}  --num_nodes ${NUM_WORKERS} --num_gpus ${NUM_GPUS_PER_WORKER} --launcher=${launcher} --force_multi pretrain_gpt2.py $@ ${gpt_options}"

* Adhere to code pattern

* Rm empty lines in MPICHRunner class

* Uncomment check for num nodes and workers when used hostfile_deepspeed in gpt-3.6b.sh

* pass MPICH hostfile through launcher_args in gpt-3.6b.sh

* Clean code and remove args hostfile

* fix merge

* fix merge

---------
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>

* clean up and fix format

* add ut

---------
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

01 Mar, 2023 · 1 commit

Fixes `AttributeError` in #2853 (#2854) · 91d7090e

Authored by Sam Foreman on Mar 01, 2023

Updates `deepspeed/monitor/monitor.py`
to instantiate objects with correct configs

Relevant issue:
https://github.com/microsoft/DeepSpeed/issues/2853

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

Greenplum / DeepSpeed · last synced about 1 year ago