    1-bit Adam v2 (#817) · 68c8481b
    Committed by Conglong Li
    Authors: @awan-10 @conglongli @samyam @jeffra
    
    What's new:
    
    NCCL-based implementation which provides better performance and usability compared to the MPI-based implementation.
    Add support for momentum masks for parameters whose gradients are constantly zero during training.
    Bug fixes (e.g., #813).
    
    * NCCL-based 1-bit Adam + Code Refactor for Comm. Backends (#594)
    
    * NCCL based 1-bit Implementation + Refactor to add communication backends (#593)
    
    * add nccl 1-bit optim.
    
    * temporary commit to save stuff.
    
    * Use dist collectives instead of mpi routines.
    
    * remove old code for comm.
    
    * Fix bugs. still does not work.
    
    * modify to test the nccl side code path
    
    * Initial gather impl. Works intra-node.
    
    * Updates to comm. phase 2. nccl comm. passed the tests.
    
    * refactor code to introduce nccl/mpi as backends for onebit adam.
    
    * Refactor updates to test/engine.
    
    * Fix compile/runtime errors.
    
    * simplify support for nccl/mpi backends.
    
    * Add missing file
    
    * Add compression backend in constructor. Revert later.
    
    * modify test with some perf counting.
    
    * Implement a true non-blocking gather for nccl side.
    
    * Revert "Add compression backend in constructor. Revert later."
    
    This reverts commit df8c40d3.
    
    * improve the 1-bit adam test.
    
    * Refactor comm. and compression backend in 1-bit adam.
    
    * Fix the test.
    
    * Fix runtime errors and typos in nccl backend
    
    * fix mpi backend. modify tests.
    
    * modify nccl perf test.
    
    * fix mpi side errors.
    
    * Add an mpi perf test
    
    * Sync DSE.
    
    * Remove old collectives file.
    
    * Undo a typo.
    
    * Graceful failure for torch versions that don't support nccl pt2pt.
    
    * Revert "Merge branch 'master' into staging-1bit-nccl-v2"
    
    This reverts commit 78400850, reversing
    changes made to a6dba72a.
    
    * Revert "Revert "Merge branch 'master' into staging-1bit-nccl-v2""
    
    This reverts commit 6dbdd985.
    
    * comm optimization + 1-bit lamb
    
    * Saving/debugging commit.
    
    * finalizing 1-bit lamb
    
    * finalizing 1-bit lamb
    
    * add momentum mask and chkpt handling for 1-bit adam
    
    * Cleanup and modify nccl test to be runnable with deepspeed launcher.
    
    * Fix format.
    
    * fix formatting again.
    
    * make test runnable without mpi4py
    
    * Add dist.alltoall and dist.allgather instead of custom functions.
    
    * remove debug prints.
    
    * formatting and renaming
    
    * renaming
    
    * renaming
    
    * add unit test, fix existing tests
    
    * skip unit test when torch < 1.8
    
    * revert 1-bit lamb
    
    * flatten momentum when dimension is more than 1
    
    * add warning message for 1-bit adam under fp32
    
    * improve version check
    
    * add fp32 test
    
    * 1-bit adam doc
    
    * fix file name
    
    * doc fix
    
    * torch 1.8 is released
    
    * doc fix
    
    * fix tests
    
    * update news
    
    * add doc for momentum mask
    
    * fix checkpoint handling, add unit test
    
    * checkpoint handling doc
    
    * doc final cleanup
    
    * bump dates
    
    * update tests
    
    * url change
    
    * doc fix
    
    * fix test
    
    * doc update
    Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
config-json.md 38.0 KB