- 15 Dec 2020, 1 commit
-
-
Committed by Ammar Ahmad Awan
NCCL-based 1-bit implementation + refactor to add communication backends (#593)
* add NCCL 1-bit optimizer
* temporary commit to save progress
* use dist collectives instead of MPI routines
* remove old communication code
* fix bugs; still does not work
* modify to test the NCCL-side code path
* initial gather implementation; works intra-node
* updates to communication, phase 2; NCCL communication passed the tests
* refactor code to introduce NCCL/MPI as backends for 1-bit Adam
* refactor updates to test/engine
* fix compile/runtime errors
* simplify support for NCCL/MPI backends
* add missing file
* add compression backend in constructor (revert later)
* modify test with some perf counting
* implement a true non-blocking gather on the NCCL side
* Revert "Add compression backend in constructor. Revert later." (reverts commit df8c40d3)
* improve the 1-bit Adam test
* refactor communication and compression backend in 1-bit Adam
* fix the test
* fix runtime errors and typos in the NCCL backend
* fix the MPI backend; modify tests
* modify NCCL perf test
* fix MPI-side errors
* add an MPI perf test
* sync DSE
* remove old collectives file
* undo a typo
* graceful failure for torch versions that don't support NCCL point-to-point
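The compression idea behind 1-bit Adam can be sketched in a few lines: each value is reduced to its sign, plus a single per-tensor scale so the average magnitude survives the round trip. This is a minimal pure-Python illustration of the technique, not DeepSpeed's actual CUDA/NCCL implementation; the function names are hypothetical.

```python
def onebit_compress(values):
    """Compress a list of floats to signs plus a single scale.

    The scale is the mean absolute value, so decompressed entries
    preserve the average magnitude of the original tensor.
    """
    scale = sum(abs(v) for v in values) / len(values)
    signs = [1 if v >= 0 else -1 for v in values]
    return signs, scale


def onebit_decompress(signs, scale):
    """Reconstruct an approximate tensor from signs and scale."""
    return [s * scale for s in signs]


grad = [0.5, -1.5, 2.0, -1.0]
signs, scale = onebit_compress(grad)   # scale = mean(|grad|) = 1.25
approx = onebit_decompress(signs, scale)
```

In the real optimizer the residual between `grad` and `approx` is fed back into the next step's momentum, which is what makes the 1-bit communication converge.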
-
- 10 Dec 2020, 4 commits
-
-
Committed by Jeff Rasley
-
Committed by Jeff Rasley
-
Committed by Jeff Rasley
-
Committed by Jeff Rasley
-
- 09 Dec 2020, 1 commit
-
-
Committed by Shaden Smith
* Switch from deprecated allreduce interface.
* Make pipeline checkpoint files portable.
-
- 08 Dec 2020, 2 commits
-
-
Committed by Stas Bekman
RTX-30 series cards are compute capability 8.6 (compute_86):

```
python -c "import torch; print(torch.cuda.get_device_capability())"
```

This PR adds support for this compute capability. Reference: https://developer.nvidia.com/cuda-gpus

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
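The capability tuple returned by `torch.cuda.get_device_capability()` maps directly onto the arch names used in nvcc `-gencode` flags and `TORCH_CUDA_ARCH_LIST`. A small illustrative helper (not the actual setup.py code) makes the mapping explicit:

```python
def capability_to_arch(capability):
    """Map a (major, minor) compute capability tuple to an nvcc arch name.

    An RTX-30 series card reports (8, 6), which corresponds to the
    compute_86 / sm_86 architecture.
    """
    major, minor = capability
    return f"compute_{major}{minor}"


print(capability_to_arch((8, 6)))  # compute_86
```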
-
Committed by Stas Bekman
-
- 05 Dec 2020, 1 commit
-
-
Committed by Zhun
* 1) Register layout as a buffer of the module so that checkpoints can be saved/loaded; 2) Add a broadcast of the layout at the beginning to ensure different processes have a consistent layout during distributed training.
* Add docstring for the max_seq_length argument in SparseSelfAttention

Co-authored-by: Zhun Liu <zhunliu@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
- 03 Dec 2020, 5 commits
-
-
Committed by Stas Bekman
-
Committed by Jeff Rasley
-
Committed by Stas Bekman
-
Committed by Jeff Rasley
-
Committed by Stas Bekman
* [cifar tutorial] improve readability
-
- 02 Dec 2020, 2 commits
-
-
Committed by Reza Yazdani
* tracking optimizer step in cpu-adam when loading a checkpoint
* add warning/error message for updating the optimizer step count
* resolve build issue
* supporting state update from the Python side
* track step from Python in all cases
* remove comma
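The point of tracking the step count across checkpoint loads is that Adam's bias correction depends on it: resuming at step 0 would distort the first updates. A toy stand-in (names are hypothetical, not DeepSpeed's CPUAdam API) shows the restore-then-continue pattern:

```python
class TinyAdamState:
    """Toy illustration of carrying an optimizer step count across checkpoints."""

    def __init__(self):
        self.step = 0

    def update(self):
        self.step += 1

    def state_dict(self):
        return {"step": self.step}

    def load_state_dict(self, state):
        # Restore the step so subsequent updates continue from the
        # checkpointed count rather than restarting at zero.
        self.step = state["step"]


opt = TinyAdamState()
for _ in range(3):
    opt.update()
ckpt = opt.state_dict()       # {"step": 3}

resumed = TinyAdamState()
resumed.load_state_dict(ckpt)
resumed.update()              # step continues at 4, not 1
```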
-
Committed by Reza Yazdani
* supporting different hidden dimensions
* add support for larger hidden dimensions (greater than 8K)
* remove empty line
* add loop unrolling factor for dropout kernels
* update different kernels based on the reviews

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
- 28 Nov 2020, 1 commit
-
-
Committed by Stas Bekman
This PR:
* fixes a misspelled method name
* `( () )` doesn't read too well until one reads the code and understands that it's not a formatting bug; I proposed to simply say that it's a callable object.
-
- 26 Nov 2020, 4 commits
-
-
Committed by Jeff Rasley
-
Committed by Jeff Rasley
-
Committed by Jeff Rasley
-
Committed by Shaden Smith
-
- 23 Nov 2020, 1 commit
-
-
Committed by Jeff Rasley
-
- 25 Nov 2020, 6 commits
-
-
Committed by Jeff Rasley
-
Committed by Ammar Ahmad Awan
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Committed by Olatunji Ruwase
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Committed by Jeff Rasley
-
Committed by Jeff Rasley
-
Committed by Jeff Rasley
-
- 24 Nov 2020, 1 commit
-
-
Committed by Samyam Rajbhandari
In the absence of a model parallel group, model_parallel_allreduce should not do any reduction. This commit fixes a bug where a model parallel allreduce was performed across the world group when the model parallel group is None.
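The fix described above amounts to a guard clause: return the input untouched when no group exists, rather than falling through to a world-group collective. A minimal sketch, with `allreduce_fn` standing in for the real collective call:

```python
def model_parallel_allreduce(values, group, allreduce_fn):
    """Reduce `values` across the model-parallel group.

    If no model-parallel group exists, return the input unchanged
    instead of falling back to a world-group reduction.
    """
    if group is None:
        return values
    return allreduce_fn(values, group)


# With group=None the values pass through untouched:
out = model_parallel_allreduce([1.0, 2.0], None, lambda v, g: [x * 2 for x in v])
```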
-
- 23 Nov 2020, 1 commit
-
-
Committed by Samyam Rajbhandari
-
- 22 Nov 2020, 1 commit
-
-
Committed by Olatunji Ruwase
-
- 21 Nov 2020, 1 commit
-
-
Committed by Olatunji Ruwase
* Use zero-tensors for missing gradients to avoid size mismatch
* Unit test for unbalanced gradients in ZeRO
* Formatting fixes
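The zero-tensor substitution above keeps partition sizes consistent when some parameters received no gradient on a given rank. A pure-Python sketch of the idea, with lists of floats standing in for tensors (the helper name is hypothetical, not the ZeRO code):

```python
def fill_missing_gradients(params, grads):
    """Replace missing (None) gradients with zeros of matching shape,
    so every parameter contributes a correctly sized gradient to the
    reduction and no size mismatch occurs."""
    return [g if g is not None else [0.0] * len(p)
            for p, g in zip(params, grads)]


params = [[1.0, 2.0], [3.0, 4.0, 5.0]]
grads = [[0.1, 0.2], None]   # second parameter got no gradient
filled = fill_missing_gradients(params, grads)
# the None gradient becomes [0.0, 0.0, 0.0], matching its parameter's size
```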
-
- 20 Nov 2020, 6 commits
-
-
Committed by Jeff Rasley
-
Committed by Ammar Ahmad Awan
* Use AML method to set env vars instead of using mpi4py.

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Committed by Seunghwan Hong
* Add a guard to avoid using `torch.version.cuda` in a no-CUDA environment.
* Fix several typos in setup.py.

Signed-off-by: Seunghwan Hong <seunghwan@scatterlab.co.kr>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
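In a CPU-only PyTorch build, `torch.version.cuda` is None, so any setup code that parses it must guard first. A small sketch of the pattern, using a `SimpleNamespace` to stand in for the real `torch.version` module (the helper name is hypothetical):

```python
from types import SimpleNamespace


def cuda_version_or_none(torch_version):
    """Safely read a CUDA version from a torch.version-like object.

    Returns a (major, minor) tuple, or None when the build has no CUDA.
    """
    cuda = getattr(torch_version, "cuda", None)
    if cuda is None:
        return None
    major, minor = cuda.split(".")[:2]
    return int(major), int(minor)


gpu_build = SimpleNamespace(cuda="11.1")   # CUDA build reports a version string
cpu_build = SimpleNamespace(cuda=None)     # CPU-only build reports None
```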
-
Committed by Jeff Rasley
-
Committed by Jeff Rasley
-
Committed by Jeff Rasley
* zero-1 memory fix
* auto-tune max elems per comm to reduce padding/comm intervals
* clean-up and added previously missing reduction options
* fix testing backend to work with torch 1.7
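The auto-tuning mentioned above exists because each communication interval must be padded to a multiple of the world size, so a poorly chosen max-elements-per-comm wastes bandwidth on padding. A toy search over candidate sizes illustrates the idea (this is an illustration of the trade-off, not the ZeRO-1 implementation):

```python
def pick_max_elems_per_comm(total_elems, world_size, candidates):
    """Pick a per-interval element count that minimizes alignment padding."""

    def padding_for(elems_per_comm):
        pad = 0
        remaining = total_elems
        while remaining > 0:
            chunk = min(elems_per_comm, remaining)
            # Each interval is rounded up to a multiple of world_size.
            pad += (-chunk) % world_size
            remaining -= chunk
        return pad

    return min(candidates, key=padding_for)


# 100 elements on 8 ranks: intervals of 32 pad less than intervals of 30.
best = pick_max_elems_per_comm(100, 8, [30, 32, 64])
```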
-
- 19 Nov 2020, 2 commits
-
-
Committed by Jeff Rasley
-
Committed by Jeff Rasley
-