- 19 Apr, 2021 2 commits
  - Committed by Jeff Rasley
  - Committed by Jeff Rasley
    Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
    Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
    Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
- 17 Apr, 2021 1 commit
  - Committed by Olatunji Ruwase
    * Fix UnboundLocalError
    * Get full partition size
- 15 Apr, 2021 2 commits
  - Committed by Cheng Li
    * update lr scheduler doc for doing per step or epoch update
    * work
    * trigger build
    Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
  - Committed by Stas Bekman
    * faster flatten/unflatten with apex
    * switch to cpp flatten/unflatten
    * style
    * better comment
    * missing import
    * switch to build ops at run time
    * fixes
    Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
- 14 Apr, 2021 3 commits
  - Committed by Stas Bekman
    * e-notation for large floats
    * handle ints too
    * readability
    * handle bool
    Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
  - Committed by Stas Bekman
  - Committed by Takuya Makino
- 08 Apr, 2021 6 commits
  - Committed by Stas Bekman
    Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
  - Committed by Samyam Rajbhandari
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
  - Committed by Stas Bekman
    Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
  - Committed by Stas Bekman
    Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
    Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
  - Committed by Stas Bekman
    Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
  - Committed by Samyam Rajbhandari
    Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- 07 Apr, 2021 1 commit
  - Committed by Takuya Makino
- 02 Apr, 2021 1 commit
  - Committed by Stas Bekman
    * zero.Init() clarification: clarify that if `model.half()` can't fit into GPU memory, `zero.Init()` is a must. This proposal is via @samyam's clarification shared elsewhere. Thank you.
    * style
    * add clarity
    * style
    Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
- 31 Mar, 2021 1 commit
  - Committed by Jeff Rasley
- 27 Mar, 2021 3 commits
  - Committed by hamlet
    * Fix ZeRO stage 2 cpu_offload when some trainable model parameters are skipped in training, as in https://github.com/microsoft/DeepSpeed/issues/707. Because those parameters are skipped, their backward hooks in self.create_reduce_and_remove_grad_hooks() do not run, so they have no norm_for_param_grads.
    * Trim space
    * Trim space
    Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
  - Committed by Stas Bekman
    Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
  - Committed by Stas Bekman
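The skipped-parameter issue described in hamlet's commit message above can be illustrated with a minimal Python sketch. This is not DeepSpeed's implementation: `total_grad_norm` and `param_ids` are hypothetical names, and only `norm_for_param_grads` comes from the commit message. It is assumed here to be a dict mapping a parameter id to that parameter's gradient norm, filled in by backward hooks.

```python
import math


def total_grad_norm(param_ids, norm_for_param_grads):
    """Combine per-parameter gradient norms into one global L2 norm.

    Parameters whose backward hook never ran (for example, parameters
    skipped in the forward pass) have no entry in norm_for_param_grads,
    so they are skipped instead of raising a KeyError.
    """
    total_sq = 0.0
    for pid in param_ids:
        norm = norm_for_param_grads.get(pid)
        if norm is None:
            # Hypothetical handling: no recorded norm means no gradient.
            continue
        total_sq += norm ** 2
    return math.sqrt(total_sq)
```

For instance, `total_grad_norm([0, 1, 2], {0: 3.0, 2: 4.0})` ignores parameter `1`, which has no recorded norm, and combines the other two.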
- 25 Mar, 2021 1 commit
  - Committed by Stas Bekman
    * see_memory_usage fixes
    * didn't expect pt-1.2
    * fix the order of things
    * fix the order of things
- 18 Mar, 2021 1 commit
  - Committed by Stas Bekman
    * consistent checkpoint filenaming
    * backward compatible rename
    Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
- 17 Mar, 2021 6 commits
  - Committed by Conglong Li
    Authors: @awan-10 @conglongli @samyam @jeffra
    What's new: NCCL-based implementation which provides better performance and usability compared to the MPI-based implementation. Add support to momentum masks for those parameters with constant zero gradients during training. Bug fixes (e.g., #813).
    * NCCL-based 1-bit Adam + Code Refactor for Comm. Backends (#594)
    * NCCL based 1-bit Implementation + Refactor to add communication backends (#593)
    * add nccl 1-bit optim.
    * temporary commit to save stuff.
    * Use dist collectives instead of mpi routines.
    * remove old code for comm.
    * Fix bugs. still does not work.
    * modify to test the nccl side code path
    * Initial gather impl. Works intra-node.
    * Updates to comm. phase 2. nccl comm. passed the tests.
    * refactor code to introduce nccl/mpi as backends for onebit adam.
    * Refactor updates to test/engine.
    * Fix compile/runtime errors.
    * simplify support for nccl/mpi backends.
    * Add missing file
    * Add compression backend in constructor. Revert later.
    * modify test with some perf counting.
    * Implement a true non-blocking gather for nccl side.
    * Revert "Add compression backend in constructor. Revert later." This reverts commit df8c40d3.
    * improve the 1-bit adam test.
    * Refactor comm. and compression backend in 1-bit adam.
    * Fix the test.
    * Fix runtime errors and typos in nccl backend
    * fix mpi backend. modify tests.
    * modify nccl perf test.
    * fix mpi side errors.
    * Add an mpi perf test
    * Sync DSE.
    * Remove old collectives file.
    * Undo a typo.
    * Graceful failure for torch versions that don't support nccl pt2pt.
    * Revert "Merge branch 'master' into staging-1bit-nccl-v2" This reverts commit 78400850, reversing changes made to a6dba72a.
    * Revert "Revert "Merge branch 'master' into staging-1bit-nccl-v2"" This reverts commit 6dbdd985.
    * comm optimization + 1-bit lamb
    * Saving/debugging commit.
    * finalizing 1-bit lamb
    * finalizing 1-bit lamb
    * add momentum mask and chkpt handling for 1-bit adam
    * Cleanup and modify nccl test to be runnable with deepspeed launcher.
    * Fix format.
    * fix formatting again.
    * make test runnable without mpi4py
    * Add dist.alltoall and dist.allgather instead of custom functions.
    * remove debug prints.
    * formatting and renaming
    * renaming
    * renaming
    * add unit test, fix existing tests
    * skip unit test when torch < 1.8
    * revert 1-bit lamb
    * flatten momentum when dimension is more than 1
    * add warning message for 1-bit adam under fp32
    * improve version check
    * add fp32 test
    * 1-bit adam doc
    * fix file name
    * doc fix
    * torch 1.8 is released
    * doc fix
    * fix tests
    * update news
    * add doc for momentum mask
    * fix checkpoint handling, add unit test
    * checkpoint handling doc
    * doc final cleanup
    * bump dates
    * update tests
    * url change
    * doc fix
    * fix test
    * doc update
    Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
  - Committed by Olatunji Ruwase
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
  - Committed by Olatunji Ruwase
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
  - Committed by Jeff Rasley
  - Committed by brett koonce
  - Committed by Stas Bekman
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- 16 Mar, 2021 2 commits
  - Committed by Olatunji Ruwase
    * Ensure gradients of other partitions are cleared after reduction
    * Remove redundant code
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
  - Committed by Samyam Rajbhandari
    * Fix mis-aligned-grad: when a parameter is not divisible by world size, the partitioned gradients are mis-aligned due to incorrect padding handling. This PR should fix that.
    * Formatting fix
    * Adding static_scale test back for Z3, and also changing hidden size to be not divisible by world_size
    * also removing alignment from flat fp16 buffers
    * Testing for hidden dim alignment
    * inference hook fix
    * Update stage3.py
    * formatting
    * [bug-fix] move params to gpu if offload params is turned off
    Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
    Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
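The padding problem named in Samyam Rajbhandari's mis-aligned-grad commit above can be sketched in plain Python. This is an illustrative sketch only, not the code in stage3.py: `padded_partitions` and `partition_flat` are hypothetical helpers, and a flat Python list stands in for a flattened tensor. The idea is that every rank must agree on the same padded length so that slice boundaries line up across ranks.

```python
import math


def padded_partitions(numel, world_size):
    """Return (padded_length, partition_size) for an even split.

    The partition size is rounded up so that world_size equal slices
    cover all numel elements; the remainder is padding.
    """
    partition_size = math.ceil(numel / world_size)
    return partition_size * world_size, partition_size


def partition_flat(values, world_size, pad_value=0.0):
    """Pad a flat sequence and cut it into world_size equal slices."""
    padded_len, part = padded_partitions(len(values), world_size)
    padded = list(values) + [pad_value] * (padded_len - len(values))
    # Slice r holds the elements owned by rank r.
    return [padded[r * part:(r + 1) * part] for r in range(world_size)]
```

With five elements and a world size of two, each rank gets three slots and the last slot of the final rank is padding. Gradients and parameters stay aligned only if both are partitioned with the same padded length.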
- 13 Mar, 2021 1 commit
  - Committed by Cheng Li
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- 12 Mar, 2021 3 commits
  - Committed by Stas Bekman
    * fix log(0) & 1/log(1) bugs
    * simplify
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
    Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
    Co-authored-by: Cheng Li <pistasable@gmail.com>
  - Committed by Olatunji Ruwase
    * Control ZeRO wall clock timers
    * Disable more ZeRO3 debug prints
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
  - Committed by Cheng Li
    * add optimizers and schedules to rtd
    * update ds website and fix links
    * add optimizers and schedules to rtd
    * update ds website and fix links
    * add flops profiler to rtd
    * fix
    Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
- 11 Mar, 2021 2 commits
  - Committed by Stas Bekman
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
  - Committed by Jeff Rasley
- 10 Mar, 2021 2 commits
  - Committed by Jeff Rasley
  - Committed by Jeff Rasley
- 09 Mar, 2021 1 commit
  - Committed by Samyam Rajbhandari
    * Squash stage3 v1 (#146)
      Co-authored-by: Samyam <samyamr@microsoft.com>
      Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
      Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
      Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
      Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
      Co-authored-by: eltonzheng <eltonz@microsoft.com>
    * Fix correctness bug (#147)
    * formatting fix (#150)
    * stage3 bugfix (API) update and simplified FP16 Z3 tests (#151)
    * fp16 Z3 API update and bugfix
    * revert debug change
    * ZeRO-3 detach and race condition bugfixes (#149)
    * trying out ZeRO-3 race condition fix
    * CUDA sync instead of stream
    * reduction stream sync
    * remove commented code
    * Fix optimizer state_dict KeyError (#148)
      Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
    * fix for smaller SGS sizes, ensures each grad is backed by unique tensors (#152)
    * Simplifying the logic for getting averaged gradients (#153)
    * skip for now
    * Z3 Docs redux (#154)
    * removing some TODOs and commented code (#155)
    * New Z3 defaults (#156)
      Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
    * formatting
    * megatron external params
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
    Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
    Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
    Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
    Co-authored-by: eltonzheng <eltonz@microsoft.com>
- 27 Feb, 2021 1 commit
  - Committed by vfdev