Commits · 488105ebd200bbd1f6d7cbe863412e41d9ab4221 · Greenplum / DeepSpeed

13 Nov, 2021 · 1 commit
- Fix zinf none swapper (#1550) · 488105eb
  Committed by Olatunji Ruwase on Nov 12, 2021
12 Nov, 2021 · 6 commits
- Add warmup_type arguments in WarmupLR and WarmupDecayLR (#1530) · 76847f42
  Committed by Baizhou Huang on Nov 12, 2021
```
* Add warmup_type arguments in WarmupLR and WarmupDecayLR

* Add warmup_type unit test

* replace hardcoded constants with global vars
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```
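The commit message only names the new `warmup_type` argument. A minimal sketch of what switching between a logarithmic and a linear warmup could compute is below. The helper `warmup_lr` and the constant names are hypothetical, and the formulas are an assumption based on the usual log/linear warmup shapes, not a copy of DeepSpeed's `WarmupLR`.

```python
import math

# Hypothetical module-level constants instead of hardcoded strings.
WARMUP_LOG_RATE = "log"
WARMUP_LINEAR_RATE = "linear"


def warmup_lr(step, warmup_min_lr, warmup_max_lr, warmup_num_steps,
              warmup_type=WARMUP_LOG_RATE):
    """Return the learning rate at 0-based `step` (sketch, not DeepSpeed's API)."""
    warmup_num_steps = max(2, warmup_num_steps)  # keeps log(warmup_num_steps) > 0
    if step >= warmup_num_steps:
        gamma = 1.0  # warmup finished: hold the maximum learning rate
    elif warmup_type == WARMUP_LOG_RATE:
        # Logarithmic ramp: rises quickly at first, then flattens out.
        gamma = math.log(step + 1) / math.log(warmup_num_steps)
    elif warmup_type == WARMUP_LINEAR_RATE:
        # Linear ramp: the same increment every step.
        gamma = step / warmup_num_steps
    else:
        raise ValueError(f"unknown warmup_type: {warmup_type!r}")
    return warmup_min_lr + (warmup_max_lr - warmup_min_lr) * gamma
```

For example, `warmup_lr(5, 0.0, 1e-3, 10, "linear")` is halfway to the maximum, while the log schedule reaches the maximum at step 9 of 10.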
- Fix sparse attention for small block-sizes (#1545) · 3ed77304
  Committed by Reza Yazdani on Nov 12, 2021
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
- Tensor-Parallelism general support (#1512) · 9ce00a21
  Committed by Reza Yazdani on Nov 11, 2021
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
- backward compatibility (#1549) · b16dd943
  Committed by Conglong Li on Nov 11, 2021
- bump to 0.5.7 · fa9d3e84
  Committed by Jeff Rasley on Nov 11, 2021
- Fix 1bit extra issue (#1542) · 2665c8b1
  Committed by Jeff Rasley on Nov 11, 2021
11 Nov, 2021 · 1 commit
- Use cuda tensors for allgather (#1548) · bd3ebddf
  Committed by Olatunji Ruwase on Nov 10, 2021
10 Nov, 2021 · 1 commit
- CPU-Adam: Fix compile Issue (#1537) · af443f63
  Committed by Reza Yazdani on Nov 09, 2021

* fixing the softmax masking when using triangular masking

* move the TILE declaration outside of the SIMD loop

* remove unrelated changes

* fix Adagrad compile issue

09 Nov, 2021 · 3 commits
- Modify inference engine (#1520) · f0122007
  Committed by Chunyang Wen on Nov 09, 2021
```
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
- [unit tests] allow unique port for tests · 0af15b98
  Committed by Jeff Rasley on Nov 08, 2021
- fstr for multnode_runner (#1532) · 93c71831
  Committed by Chunyang Wen on Nov 09, 2021
08 Nov, 2021 · 1 commit
- [docs] fix 404 (#1531) · 76f2b5e5
  Committed by Stas Bekman on Nov 07, 2021
```
* [docs] fix 404

This PR fixes a few broken links

* fix 404
```
06 Nov, 2021 · 3 commits
- typo in profiler.py (#1527) · 2c62d657
  Committed by Nathan Frey on Nov 05, 2021

Fix typos in Flops Profiler message
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- Adding Tutel to MoE layer (#1528) · 2887349c
  Committed by alexandremuzio on Nov 05, 2021

Co-authored-by: Alex Muzio <alferre@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- Use fstr in launcher (#1521) · cf1f1601
  Committed by Chunyang Wen on Nov 06, 2021
```
* Use fstr in launcher

* Fix wrong condition for word_info

* Fix typo
```
05 Nov, 2021 · 1 commit
- make conv flops counting general for 1,2,3d (#1518) · f9b37801
  Committed by Cheng Li on Nov 04, 2021
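The title only says that convolution FLOP counting was generalized to 1-D, 2-D and 3-D layers. One way to count that works unchanged for any rank is to take products over the kernel and output spatial sizes, whatever their length. The sketch below assumes 2 FLOPs per multiply-accumulate and one add per output element for the bias; the function `conv_flops` is hypothetical and is not the Flops Profiler's actual code.

```python
from math import prod


def conv_flops(batch_size, in_channels, out_channels, kernel_size, output_size,
               groups=1, bias=True):
    """Estimate FLOPs of an N-d convolution, N = len(kernel_size) (sketch).

    kernel_size and output_size hold one entry per spatial dimension, so the
    same formula covers Conv1d, Conv2d and Conv3d.
    """
    if len(kernel_size) != len(output_size):
        raise ValueError("kernel_size and output_size must have the same rank")
    # Each output element sees in_channels/groups input channels times the
    # kernel volume: that many multiply-accumulates (MACs).
    macs_per_position = (in_channels // groups) * prod(kernel_size)
    output_elements = batch_size * out_channels * prod(output_size)
    flops = 2 * macs_per_position * output_elements  # 1 MAC = 1 multiply + 1 add
    if bias:
        flops += output_elements  # one bias add per output element
    return flops
```

For a 2-D layer with 3 input channels, 4 output channels, a 3x3 kernel and a 5x5 output at batch size 2, this gives 2 · 27 · 200 = 10,800 FLOPs without bias.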
03 Nov, 2021 · 3 commits
- bump to 0.5.6 · 426dd2b5
  Committed by Jeff Rasley on Nov 02, 2021
- Prevent creation of local temp directory (#1494) · 91defd7c
  Committed by Alex Hedges on Nov 02, 2021
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
- Unify use f str (#1511) · df5b0884
  Committed by Chunyang Wen on Nov 03, 2021
```
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
```
02 Nov, 2021 · 3 commits
- [code readability] pipe (#1510) · bf1725bb
  Committed by Stas Bekman on Nov 02, 2021

This PR suggests a small improvement to code readability.

--------------------------

I was puzzling over this code:

https://github.com/microsoft/DeepSpeed/blob/85ce85dd5f4b18c0019a5121b06900e3a2c3933b/deepspeed/runtime/pipe/module.py#L381-L385

I had no idea this construct existed. 

After reading up on it, it appears to be used incorrectly: the only point of using it is together with `break`.

It's explained here: https://docs.python.org/3/tutorial/controlflow.html#break-and-continue-statements-and-else-clauses-on-loops

So I'm proposing to remove the `else` clause and just run the code in its branch normally, since it *always* gets executed when there is no `break` statement.

If I understand it correctly, this code is meant to always run, so let's make that loud and clear.

Here is a quick proof:
```
for i in []:
    print(i)
else:
    print("loop did not finish via break") # runs!

for i in [0]:
    print(i)
else:
    print("loop did not finish via break") # runs!


for i in [0]:
    print(i)
    break
else:
    print("loop did not finish via break") # does not run
```

@tjruwase
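The proposal rests on one equivalence: when a loop body contains no `break`, code in the loop's `else` clause behaves exactly like code placed directly after the loop. A small check of that claim, using two illustrative functions (`with_else` and `without_else` are not from the DeepSpeed code base):

```python
def with_else(items):
    seen = []
    for i in items:
        seen.append(i)
    else:
        # There is no `break` in the loop, so this branch always runs.
        seen.append("done")
    return seen


def without_else(items):
    seen = []
    for i in items:
        seen.append(i)
    # Same behavior, stated plainly: runs once the loop is exhausted.
    seen.append("done")
    return seen


for items in ([], [0], [0, 1, 2]):
    assert with_else(items) == without_else(items)
```

The `else` clause only changes behavior once a `break` can skip it, which is why dropping it here is a pure readability change.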

- Remove redundant pass (#1509) · 85ce85dd
  Committed by Chunyang Wen on Nov 02, 2021
- Bfloat16 zero2 (#1398) · 648f7bfa
  Committed by Rana Ali Amjad on Nov 01, 2021

* Changes for bfloat16 Zero2

* Cleaned up additional comments and debugging code

* Adapted fp16_master_weights_and_grads option to cover BF16

* Reverted fp16_master_weights_and_gradients extension to BFloat16 and minor cleanup

* Fixed formatting and variable naming errors recognized in testing

* Added relevant unit tests for bfloat16 with ZeRO-2

* Updates conditions for skipping BFloat16 unit tests

* Added check for NCCL inconsistent version naming convention

* Update skip message for Bfloat16 tests to mention additional checks
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

01 Nov, 2021 · 1 commit
- Transformer kernel - fix unit test (#1503) · 2c5bba6d
  Committed by Reza Yazdani on Oct 31, 2021
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
31 Oct, 2021 · 1 commit
- ZeRO3, improved parameter all-gather operation (#1188) · c0eeb69d
  Committed by Zhen Zhang on Oct 31, 2021

* remove norm(), avoid memcpy after allgather

1) Removing the norm computation in debug printing
2) Changing _all_gather to be a sync op in fetch_sub_module
    Reason: the async version is not async at all, because each
    all_gather calls torch.cuda.synchronize() to guarantee that the
    previous communication op has completed
3) Adding new function _allgather_params_split_launch
    the existing _allgather_params performs an explicit memcpy after
    the all-gather op. Avoiding that explicit memory copy on the
    Python side improves performance.

Known issue:
    `torch.distributed.all_gather` performs an implicit memcpy
    at the end of each `ncclAllgather`.

* WIP: wrapped ncclAllgather as a custom op in DS

A micro-benchmark shows that all-gathering a transformer layer with
9834560 half-precision elements is about 1.1ms faster on an aws-p4d
instance.

* WIP: integrated into partition_parameters

Performance improvement for a 5.1B-parameter BERT on aws-p4d:
fwd: 300ms -> 200ms
bwd: 680ms -> 610ms

* Fix format

* cleaned dead code, modified unit test

* removed customized c++ extension

revert to the torch.distributed API

* change torch.ones to torch.empty

* typo

* warn if not cuda tensor for allgather

* fix formatting

* fix: move ds_tensor to cuda device

but it is strange that the ds_tensor hadn't already been moved to cuda

* remove try clause on the path for fetching params
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
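Item 3) above argues that receiving each rank's shard into its own buffer and then copying everything into one contiguous parameter costs an extra memcpy. The sketch below illustrates the idea with plain Python buffers rather than `torch.distributed`: each simulated rank's shard is written straight into a writable view of one preallocated flat buffer, so no concatenation step is needed. The shards and helper names are hypothetical. In PyTorch the analogous trick is to pass views of one flat tensor (for example from `Tensor.narrow`) as the output list of `all_gather`.

```python
def gather_with_concat(shards):
    """Baseline: one receive buffer per rank, then a concatenating copy."""
    received = [bytes(shard) for shard in shards]  # one buffer per rank
    return bytearray().join(received)  # extra copy into a contiguous buffer


def gather_into_views(shards):
    """Receive each shard directly into a slice of a preallocated flat buffer."""
    shard_size = len(shards[0])
    if any(len(shard) != shard_size for shard in shards):
        raise ValueError("all shards must have the same size")
    flat = bytearray(shard_size * len(shards))
    view = memoryview(flat)
    for rank, shard in enumerate(shards):
        # The slice aliases `flat`, so this write is the only copy made.
        view[rank * shard_size:(rank + 1) * shard_size] = shard
    view.release()
    return flat
```

Both functions return the same bytes; the difference is only how many intermediate buffers and copies are involved.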

30 Oct, 2021 · 6 commits
- update CL doc (#1506) · 7f5a3add
  Committed by Conglong Li on Oct 29, 2021
```
* update CL doc

* doc fix
```
- allow passing hostfile to ds_ssh (#1504) · 163f568f
  Committed by Jeff Rasley on Oct 29, 2021
- Fixing the transformer APIs to return tuple as the output (if needed) (#1491) · ee6a92c0
  Committed by Reza Yazdani on Oct 29, 2021
- [docs] Update DSE Megatron links (#1500) · a4fff53d
  Committed by Olatunji Ruwase on Oct 29, 2021
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
- Fix OneCycleLR zero division error (#1498) · db5d8ba2
  Committed by Anes Benmerzoug on Oct 29, 2021
```
* Add regression test for onecyclelr zerodivision error

* Wrap computation of lr and mom decay factors in try except block

This handles the case when decay_step_size is set to zero, which is the default case,
and prevents a zero division error

* Use boolean attributes instead of try/except block
```
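The failure mode described above is easy to reproduce in isolation. The sketch below assumes a post-cycle decay rule of the form `lr = min_lr / (1 + decay_lr_rate * step / decay_step_size)`; with the default `decay_step_size=0` that division fails, so a boolean attribute computed once in the constructor skips the decay instead of catching `ZeroDivisionError` on every step. The class `DecayingLR` and its parameter names are hypothetical, not DeepSpeed's scheduler code.

```python
class DecayingLR:
    """Post-cycle learning-rate decay with a guard for decay_step_size == 0."""

    def __init__(self, min_lr, decay_lr_rate=0.0, decay_step_size=0):
        self.min_lr = min_lr
        self.decay_lr_rate = decay_lr_rate
        self.decay_step_size = decay_step_size
        # Decide once whether decay applies, instead of wrapping every
        # computation in try/except ZeroDivisionError.
        self.skip_lr_decay = decay_step_size == 0 or decay_lr_rate == 0.0

    def lr_after_cycle(self, decay_batch_iteration):
        if self.skip_lr_decay:
            return self.min_lr  # defaults: no decay, and no division by zero
        decay_interval = decay_batch_iteration / self.decay_step_size
        return self.min_lr / (1 + self.decay_lr_rate * decay_interval)
```

With the defaults, `DecayingLR(1e-3).lr_after_cycle(100)` simply returns `1e-3`; with `decay_lr_rate=0.5` and `decay_step_size=10`, twenty steps after the cycle halve the learning rate.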
- Fix typo (#1501) · e976accb
  Committed by Manuel R. Ciosici on Oct 29, 2021
29 Oct, 2021 · 1 commit
- enable/disable moe token dropping. (#1492) · 56635d5b
  Committed by Ammar Ahmad Awan on Oct 28, 2021
```
* Add a flag to enable/disable token dropping in moe/top-1 gating.

* fix syntax and formatting.
```
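To make the flag concrete: in top-1 gating each token goes to its highest-scoring expert, and with a fixed per-expert capacity any tokens beyond that capacity are dropped. A minimal sketch, assuming a hypothetical `drop_tokens` switch and a capacity of `ceil(capacity_factor * num_tokens / num_experts)`; when dropping is disabled, the capacity grows to the busiest expert's load so every token is kept. This illustrates the idea only and is not DeepSpeed's gating code.

```python
import math


def top1_assign(scores, capacity_factor=1.0, drop_tokens=True):
    """Route each token to its argmax expert; None marks a dropped token.

    scores[t][e] is the gate score of token t for expert e.
    """
    num_tokens, num_experts = len(scores), len(scores[0])
    choice = [max(range(num_experts), key=lambda e: row[e]) for row in scores]
    if drop_tokens:
        capacity = math.ceil(capacity_factor * num_tokens / num_experts)
    else:
        # No dropping: size the capacity to the most heavily loaded expert.
        capacity = max(choice.count(e) for e in range(num_experts))
    load = [0] * num_experts
    assignment = []
    for expert in choice:  # earlier tokens in the batch get priority
        if load[expert] < capacity:
            load[expert] += 1
            assignment.append(expert)
        else:
            assignment.append(None)  # over this expert's capacity: dropped
    return assignment
```

With four tokens, two experts, and three tokens preferring expert 0, the default capacity of 2 drops the third token; with `drop_tokens=False` all four are routed.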
28 Oct, 2021 · 2 commits
- Synchronize folder creation; Single latest file writer (#1486) · 99bd592d
  Committed by Olatunji Ruwase on Oct 27, 2021
```
* Synchronize folder creation; Single latest file writer

* Address PR feedback
```
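A minimal sketch of the pattern the title describes, with threads standing in for distributed ranks and `threading.Barrier` standing in for a collective barrier such as `torch.distributed.barrier()`: rank 0 alone creates the checkpoint folder, every rank waits, each rank writes its own file, and rank 0 alone writes the `latest` file. The file names, the contents of `latest`, and the helper names are assumptions for illustration, not DeepSpeed's checkpoint code.

```python
import os
import tempfile
import threading


def save_checkpoint(rank, save_dir, tag, barrier):
    """Write one rank's file; rank 0 alone creates the folder and `latest`."""
    ckpt_dir = os.path.join(save_dir, tag)
    if rank == 0:
        os.makedirs(ckpt_dir, exist_ok=True)  # a single rank creates the folder
    barrier.wait()  # no rank writes until the folder exists
    with open(os.path.join(ckpt_dir, f"rank_{rank}.txt"), "w") as f:
        f.write(f"state of rank {rank}\n")
    barrier.wait()  # every file is on disk before the tag is published
    if rank == 0:
        with open(os.path.join(save_dir, "latest"), "w") as f:
            f.write(tag)  # a single writer for the `latest` file


def run_demo(world_size=4, tag="global_step10"):
    """Simulate `world_size` ranks with threads; return the checkpoint root."""
    save_dir = tempfile.mkdtemp()
    barrier = threading.Barrier(world_size)
    threads = [
        threading.Thread(target=save_checkpoint, args=(rank, save_dir, tag, barrier))
        for rank in range(world_size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return save_dir
```

Having one writer for `latest` avoids several ranks racing to write the same file, and the second barrier keeps `latest` from pointing at a checkpoint that is still being written.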
- fix read-the-docs based on this issue: https://github.com/sphinx-doc/sphinx/issues/9727 (#1489) · 24dd285f
  Committed by Jeff Rasley on Oct 27, 2021
27 Oct, 2021 · 2 commits
- Proposal of how we might use sparse tensors for gradients (#1484) · fcb3ca5e
  Committed by Mikhail Druzhinin on Oct 27, 2021
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
- Add cpu adagrad (#1358) · 8abdaee2
  Committed by Wenhao Hu on Oct 27, 2021
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
```
23 Oct, 2021 · 1 commit
- add moe+zero ckpt unit test. (#1429) · 0b77d1d9
  Committed by Ammar Ahmad Awan on Oct 22, 2021

* Add unit test to check moe+zero checkpoints
* Fix zero stage2 checkpoint loading logic to deal with experts related state dicts.

22 Oct, 2021 · 3 commits
- fix pp (#1474) · 29bee73f
  Committed by Conglong Li on Oct 21, 2021
- fix pipeline engine (#1473) · 17a479dd
  Committed by Conglong Li on Oct 21, 2021
```
* fix pp

* better fix
```
- Ensure single zero3 context (#1462) · 58a8e13c
  Committed by Olatunji Ruwase on Oct 21, 2021
```
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
```
Greenplum / DeepSpeed · last synced about 1 year ago