1. 19 Apr, 2021 1 commit
  2. 17 Apr, 2021 1 commit
  3. 15 Apr, 2021 3 commits
  4. 14 Apr, 2021 3 commits
  5. 09 Apr, 2021 2 commits
  6. 08 Apr, 2021 9 commits
  7. 07 Apr, 2021 1 commit
  8. 03 Apr, 2021 2 commits
  9. 02 Apr, 2021 1 commit
  10. 31 Mar, 2021 6 commits
  11. 27 Mar, 2021 3 commits
  12. 25 Mar, 2021 1 commit
  13. 24 Mar, 2021 1 commit
  14. 18 Mar, 2021 2 commits
  15. 17 Mar, 2021 4 commits
    • C
      1-bit Adam v2 (#817) · 68c8481b
      Committed by Conglong Li
      Authors: @awan-10 @conglongli @samyam @jeffra
      
      What's new:
      
      NCCL-based implementation, which provides better performance and usability than the MPI-based implementation.
      Add support for momentum masks for parameters whose gradients are constantly zero during training.
      Bug fixes (e.g., #813).
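
The two features above can be illustrated with a minimal sketch. It is not DeepSpeed's implementation: the function names are hypothetical, the scale is a simplified mean absolute value, and the reading of the momentum mask (keeping momentum at zero for parameters whose gradients are always zero, so that compression does not assign them a nonzero value) is an assumption based on the description above.

```python
# Illustrative sketch only: sign-based 1-bit compression with error feedback,
# plus a momentum mask. All names here are hypothetical, not DeepSpeed's API.

def compress_1bit(values, error):
    """Quantize (values + error) to one sign bit per element plus one shared scale.

    Returns the dequantized values and the new compression error, which is
    carried over and added back on the next step (error feedback).
    """
    corrected = [v + e for v, e in zip(values, error)]
    # Simplified scale: mean absolute value of the error-corrected input.
    scale = sum(abs(c) for c in corrected) / len(corrected)
    compressed = [scale if c >= 0 else -scale for c in corrected]
    new_error = [c - q for c, q in zip(corrected, compressed)]
    return compressed, new_error


def apply_momentum_mask(momentum, mask):
    """Zero the momentum wherever the mask is 0 (assumed: constant-zero-gradient parameters)."""
    return [m * k for m, k in zip(momentum, mask)]


momentum = [0.5, -0.25, 0.0, 0.75]
mask = [1, 1, 0, 1]  # third parameter never receives a gradient
error = [0.0, 0.0, 0.0, 0.0]

compressed, error = compress_1bit(momentum, error)
masked = apply_momentum_mask(compressed, mask)
```

In this sketch, compression alone gives the third element a nonzero value even though its true momentum is zero; applying the mask afterwards restores it to zero.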
      
      * NCCL-based 1-bit Adam + Code Refactor for Comm. Backends (#594)
      
      * NCCL based 1-bit Implementation + Refactor to add communication backends (#593)
      
      * add nccl 1-bit optim.
      
      * temporary commit to save stuff.
      
      * Use dist collectives instead of mpi routines.
      
      * remove old code for comm.
      
      * Fix bugs. still does not work.
      
      * modify to test the nccl side code path
      
      * Initial gather impl. Works intra-node.
      
      * Updates to comm. phase 2. nccl comm. passed the tests.
      
      * refactor code to introduce nccl/mpi as backends for onebit adam.
      
      * Refactor updates to test/engine.
      
      * Fix compile/runtime errors.
      
      * simplify support for nccl/mpi backends.
      
      * Add missing file
      
      * Add compression backend in constructor. Revert later.
      
      * modify test with some perf counting.
      
      * Implement a true non-blocking gather for nccl side.
      
      * Revert "Add compression backend in constructor. Revert later."
      
      This reverts commit df8c40d3.
      
      * improve the 1-bit adam test.
      
      * Refactor comm. and compression backend in 1-bit adam.
      
      * Fix the test.
      
      * Fix runtime errors and typos in nccl backend
      
      * fix mpi backend. modify tests.
      
      * modify nccl perf test.
      
      * fix mpi side errors.
      
      * Add an mpi perf test
      
      * Sync DSE.
      
      * Remove old collectives file.
      
      * Undo a typo.
      
      * Graceful failure for torch versions that don't support nccl pt2pt.
      
      * Revert "Merge branch 'master' into staging-1bit-nccl-v2"
      
      This reverts commit 78400850, reversing
      changes made to a6dba72a.
      
      * Revert "Revert "Merge branch 'master' into staging-1bit-nccl-v2""
      
      This reverts commit 6dbdd985.
      
      * comm optimization + 1-bit lamb
      
      * Saving/debugging commit.
      
      * finalizing 1-bit lamb
      
      * finalizing 1-bit lamb
      
      * add momentum mask and chkpt handling for 1-bit adam
      
      * Cleanup and modify nccl test to be runnable with deepspeed launcher.
      
      * Fix format.
      
      * fix formatting again.
      
      * make test runnable without mpi4py
      
      * Add dist.alltoall and dist.allgather instead of custom functions.
      
      * remove debug prints.
      
      * formatting and renaming
      
      * renaming
      
      * renaming
      
      * add unit test, fix existing tests
      
      * skip unit test when torch < 1.8
      
      * revert 1-bit lamb
      
      * flatten momentum when dimension is more than 1
      
      * add warning message for 1-bit adam under fp32
      
      * improve version check
      
      * add fp32 test
      
      * 1-bit adam doc
      
      * fix file name
      
      * doc fix
      
      * torch 1.8 is released
      
      * doc fix
      
      * fix tests
      
      * update news
      
      * add doc for momentum mask
      
      * fix checkpoint handling, add unit test
      
      * checkpoint handling doc
      
      * doc final cleanup
      
      * bump dates
      
      * update tests
      
      * url change
      
      * doc fix
      
      * fix test
      
      * doc update
      Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
      Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
    • J
      bump version 0.3.13 · 12a53b43
      Committed by Jeff Rasley
    • O
      Make config objects json serializable (#862) · 7bcd72a2
      Committed by Olatunji Ruwase
      Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
    • O
      Fix ZeRO3 save_checkpoint (#857) · fa87a73a
      Committed by Olatunji Ruwase
      Co-authored-by: Jeff Rasley <jerasley@microsoft.com>