1. 23 Jan, 2022 1 commit
  2. 22 Jan, 2022 1 commit
  3. 21 Jan, 2022 1 commit
    • J
      Various ZeRO Stage3 Optimizations + Improvements (including bfloat16 support) (#1453) · 4912e0ad
      Justin Chiu committed
      * Changes for bfloat16 Zero2
      
      * ZeRO stage3 optimizations, with some bug fixes
      
      optimizations for stage3:
      - prefetching improvements
      - batching allgather calls to amortize fixed overhead and improve
        bandwidth utilization
      - batching reduce_scatter calls to amortize fixed overhead and
        improve bandwidth utilization
      - using *_base variants of allgather and reduce scatter to reduce memory
        allocations and data movement
      - more fine grained synchronization for communication that allows
        blocking on less work
      - precomputation of fetching code - using a fetch queue rather than
        deciding what to (pre)fetch at each iteration
      - limiting queued coalesced communication ops to reduce memory pressure
        on the pytorch cuda caching allocator (not an elegant solution)
      
      optimizations for stage3-offload:
      - made some host-device tensor copies async to improve performance
      
      bug fixes and qol improvements:
      - fix init context method when parent modules modify child weights
      - speed up model initialization by moving model to GPU before weight
        initialization
      - fixed unit test imports so that unit tests can be run from any
        directory
      - change performance logging to include memory consumption
      - add logging w/ model size when done partitioning model
      
      new features
      - bfloat16 support for ZeRO 3
      
      * fix import in ut
      
      * ran yapf
      
      * improvements to cache flush warn log
      
      * backwards compatibility with older versions of pytorch
      
      * handle edge case where reduced tensor smaller than world size
      
      * moved event synchronization to allgather handle wait() call
      
      * removed unnecessary barrier call
      
      * formatting fix after resolving merge conflict
      
      * skip nvme prefetch when trace not complete
      
      * opportunistically avoid memory allocation in allgather coalesced where possible
      
      * fix indentation after merge
      
      * fixes to account for parameter offload
      
      * accounting for torch.cuda.memory_stats not being available
      
      * moved partition_all_params to optimizer step
      
      * allgathering on params before item gets called
      
      * fix param status checks
      
      needed after moving partition_all_parameters call to optimizer step
      
      * fix grad accumulation with optimizer offload
      
      * grad norm computation fix for optimizer offload
      
      * change post divide in reduce-scatter to pre divide
      
      * fix gradient race condition w/ optimizer offload
      
      * improve inf/nan gradient tracking
      
      * don't prefetch when not in training mode
      
      * format fix after merging
      
      * fix prefetching issue when using NVME offload
      
      * improved defragmentation for fp16 parameters
      
      * relative imports for bf16 tests
      
      * changes for bwd compatibility with pytorch 1.2
      
      * remove buffered_reduce_fallback
      
      * removed unused parameter offset bookkeeping
      
      * fixed tracking for multiple param groups
      
      * unbroke bfloat16 config after merge conflict
      
      * using base allgather params when only 1 param
      
      * cleanup/fixes for fp16 partition defragmentation
      
      * switch to CRLF
      
      * convert to same new-line style as master
      
      * align new line with master
      
      * Fix merge issues
      
      * switch to CRLF
      
      * fix to LF line endings
      
      * minor merge fixes
      
      * remove extra bfloat16_enabled definition
      
      * asserting params inflight for AllGatherHandle
      
      * remove get_cuda_mem_allocated_str
      
      * Format fixes
      
      * fix bfloat16 zero stage check (broken after merge commit)
      
      * +self.communication_data_type, -self.allreduce_always_fp32; delete dead code
      
      * Add self.reduce_scatter
      
      * Format fix
      
      * Fix merge issues
      
      * iterate over params_to_fetch rather than make another iterator
      
      * add some TODOs
      
      * remove unnecessary division by micro_step_id
      
      * rename config keys "bfloat16" -> "bf16"
      
      * rename stage3_gather_fp16_weights_on_model_save -> stage3_gather_16bit_weights_on_model_save
      
      * add unit test to check backwards compatibility for gather_16bit_weights
      
      * added test to confirm bf16 key bwd compatibility
      
      * Format fixes
      Co-authored-by: Rana Ali Amjad <raamjad@amazon.com>
      Co-authored-by: Justin Chiu <justchiu@amazon.com>
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
      Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
      4912e0ad
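
The two config-key renames listed in this commit ("bfloat16" -> "bf16" and stage3_gather_fp16_weights_on_model_save -> stage3_gather_16bit_weights_on_model_save) can be illustrated with a small migration sketch. This is not DeepSpeed's own backward-compatibility code; `migrate_config` and the two lookup tables are hypothetical names used only for illustration, and the `{"enabled": True}` layout of the bf16 section is an assumption.

```python
import copy

# Old key -> new key, as named in the commit message above.
RENAMED_TOP_LEVEL = {"bfloat16": "bf16"}
RENAMED_ZERO = {
    "stage3_gather_fp16_weights_on_model_save":
        "stage3_gather_16bit_weights_on_model_save",
}


def migrate_config(config):
    """Return a copy of `config` using the new key names (hypothetical helper)."""
    new = copy.deepcopy(config)
    for old, renamed in RENAMED_TOP_LEVEL.items():
        if old in new:
            # If both spellings are present, the new spelling wins.
            new.setdefault(renamed, new.pop(old))
    zero = new.get("zero_optimization", {})
    for old, renamed in RENAMED_ZERO.items():
        if old in zero:
            zero.setdefault(renamed, zero.pop(old))
    return new


old_config = {
    "bfloat16": {"enabled": True},  # assumed section layout
    "zero_optimization": {
        "stage": 3,
        "stage3_gather_fp16_weights_on_model_save": True,
    },
}
new_config = migrate_config(old_config)
print(new_config)
```

The input dictionary is left untouched, so a caller can still log the original spelling when warning about deprecated keys.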
  4. 04 Jan, 2022 1 commit
  5. 14 Dec, 2021 1 commit
  6. 27 Nov, 2021 1 commit
  7. 23 Nov, 2021 1 commit
  8. 13 Nov, 2021 2 commits
  9. 02 Oct, 2021 1 commit
  10. 01 Oct, 2021 1 commit
  11. 17 Aug, 2021 1 commit
  12. 30 Jul, 2021 1 commit
    • O
      [Doc] round_robin_gradients (#1261) · 40c381df
      Olatunji Ruwase committed
      * Fix docstring
      
      * Make screenshots clickable for easier viewing
      
      * Navigation menu in alphabetical order; More clickable screenshots
      
      * Rename 1Cycle doc
      
      * Tweak naming
      
      * Remove no longer used flag
      
      * ZeRO3 Offload release
      
      * Single GPU results
      
      * Rearrange figures
      
      * Single GPU text
      
      * tweak intro
      
      * zero3-offload section
      
      * Add asynchronous i/o docs
      
      * Fix print_per_steps doc
      
      * Document round_robin_gradients
      
      * Tweak description
      
      * Trigger CI
      40c381df
  13. 02 Jul, 2021 1 commit
  14. 17 Jun, 2021 1 commit
    • O
      [Doc] Fix steps_per_print description (#1163) · fa7921e2
      Olatunji Ruwase committed
      * Fix docstring
      
      * Make screenshots clickable for easier viewing
      
      * Navigation menu in alphabetical order; More clickable screenshots
      
      * Rename 1Cycle doc
      
      * Tweak naming
      
      * Remove no longer used flag
      
      * ZeRO3 Offload release
      
      * Single GPU results
      
      * Rearrange figures
      
      * Single GPU text
      
      * tweak intro
      
      * zero3-offload section
      
      * Add asynchronous i/o docs
      
      * Fix print_per_steps doc
      fa7921e2
  15. 09 Jun, 2021 1 commit
  16. 20 May, 2021 1 commit
  17. 14 May, 2021 1 commit
    • O
      [docs] unused parameter handling (#1060) · 63c5070e
      Olatunji Ruwase committed
      * Fix docstring
      
      * Make screenshots clickable for easier viewing
      
      * Navigation menu in alphabetical order; More clickable screenshots
      
      * Rename 1Cycle doc
      
      * Tweak naming
      
      * Remove no longer used flag
      
      * ZeRO3 Offload release
      
      * Single GPU results
      
      * Rearrange figures
      
      * Single GPU text
      
      * tweak intro
      
      * zero3-offload section
      
      * Add asynchronous i/o docs
      63c5070e
  18. 13 May, 2021 2 commits
  19. 27 Apr, 2021 1 commit
  20. 25 Apr, 2021 1 commit
    • H
      Add find_unused_parameters option to DeepSpeedEngine (#945) · d0b61f18
      hamlet committed
      * Add find_unused_parameters option
      
      Because unused parameters in modules are sometimes unexpected,
      add an explicit error message when they occur and an option to avoid the error: https://github.com/microsoft/DeepSpeed/issues/707
      
      * Add find_unused_parameters option
      
      Because unused parameters in modules are sometimes unexpected,
      add an explicit error message when they occur and an option to avoid the error: https://github.com/microsoft/DeepSpeed/issues/707
      
      * Fix syntax error
      
      * Fix yapf error
      
      * Fix yapf error
      
      * Fix yapf error
      
      * Fix yapf error
      
      * Move stage2 find_unused_parameters to config file
      
      * Add stage2 find_unused_parameters
      
      * Add stage2 find_unused_parameters
      
      * Add stage2_find_unused_parameters option
      
      * Change error msg to reflect zero_optimization config change
      
      * Fix yapf error
      
      * Fix yapf errors
      
      * Change find_unused_parameters option name
      
      * Change find_unused_parameters option name
      
      * Change find_unused_parameters option name
      
      * Change find_unused_parameters option name
      
      * Change find_unused_parameters option name
      
      * Add UnusedParametersModel for test option find_unused_parameters
      
      * Add unit test for stage2 find_unused_parameters
      
      * Add cpu-adam compatible check
      
      * Remove dups import
      
      * Trim spaces
      
      * Fix yapf errors
      
      * Trim spaces
      
      * Add False Positive test check
      
      * Fix find_unused_parameters test
      
      * Trim spaces
      
      * Fix yapf error
      d0b61f18
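
The behaviour this commit describes (an explicit error when a module has parameters that never receive gradients, plus an option to avoid the error) can be sketched in plain Python. This is only an illustration of the idea, not the DeepSpeedEngine code: `check_unused_parameters` is a hypothetical helper, and the flag name `ignore_unused_parameters` is an assumption, since the commit message says the option was renamed several times without giving the final name.

```python
def check_unused_parameters(all_params, params_with_grad,
                            ignore_unused_parameters=True):
    """Return the names of parameters that received no gradient.

    Raises RuntimeError when unused parameters exist and the (assumed)
    `ignore_unused_parameters` flag is False.
    """
    unused = sorted(set(all_params) - set(params_with_grad))
    if unused and not ignore_unused_parameters:
        raise RuntimeError(
            "Parameters did not receive gradients: " + ", ".join(unused)
            + ". Set the option to ignore unused parameters to skip this check."
        )
    return unused


# A model with three parameters where `aux.weight` is never used in forward().
params = ["fc1.weight", "fc2.weight", "aux.weight"]
touched = ["fc1.weight", "fc2.weight"]

unused = check_unused_parameters(params, touched)  # tolerated by default here
print(unused)
```

Making the strict mode opt-in in this sketch is a design choice for the example only; the source does not say which default the real option uses.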
  21. 23 Apr, 2021 2 commits
    • O
      Asynchronous I/O docs (#1000) · bff4bc72
      Olatunji Ruwase committed
      * Fix docstring
      
      * Make screenshots clickable for easier viewing
      
      * Navigation menu in alphabetical order; More clickable screenshots
      
      * Rename 1Cycle doc
      
      * Tweak naming
      
      * Remove no longer used flag
      
      * ZeRO3 Offload release
      
      * Single GPU results
      
      * Rearrange figures
      
      * Single GPU text
      
      * tweak intro
      
      * zero3-offload section
      
      * Add asynchronous i/o docs
      bff4bc72
    • S
      [doc] add missing pin_memory entry (#999) · ecf2e1bc
      Stas Bekman committed
      - `offload_param` was missing `pin_memory` 
      - also moved the entry in `offload_optimizer` to have it in the same place.
      ecf2e1bc
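
For context, a config fragment showing `pin_memory` in both offload sections might look like the sketch below. The stage and device values are illustrative assumptions rather than recommendations taken from this commit.

```python
import json

# Illustrative ZeRO config: `pin_memory` appears in both offload sections.
ds_config = {
    "zero_optimization": {
        "stage": 3,  # assumed: parameter offload is a stage 3 feature
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    }
}

print(json.dumps(ds_config, indent=2))
```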
  22. 21 Apr, 2021 2 commits
  23. 19 Apr, 2021 1 commit
  24. 15 Apr, 2021 1 commit
  25. 08 Apr, 2021 1 commit
  26. 17 Mar, 2021 1 commit
    • C
      1-bit Adam v2 (#817) · 68c8481b
      Conglong Li committed
      Authors: @awan-10 @conglongli @samyam @jeffra
      
      What's new:
      
      - NCCL-based implementation, which provides better performance and usability than the MPI-based implementation.
      - Add support for momentum masks for parameters whose gradients are constantly zero during training.
      - Bug fixes (e.g., #813).
      
      * NCCL-based 1-bit Adam + Code Refactor for Comm. Backends (#594)
      
      * NCCL based 1-bit Implementation + Refactor to add communication backends (#593)
      
      * add nccl 1-bit optim.
      
      * temporary commit to save stuff.
      
      * Use dist collectives instead of mpi routines.
      
      * remove old code for comm.
      
      * Fix bugs. still does not work.
      
      * modify to test the nccl side code path
      
      * Initial gather impl. Works intra-node.
      
      * Updates to comm. phase 2. nccl comm. passed the tests.
      
      * refactor code to introduce nccl/mpi as backends for onebit adam.
      
      * Refactor updates to test/engine.
      
      * Fix compile/runtime errors.
      
      * simplify support for nccl/mpi backends.
      
      * Add missing file
      
      * Add compression backend in constructor. Revert later.
      
      * modify test with some perf counting.
      
      * Implement a true non-blocking gather for nccl side.
      
      * Revert "Add compression backend in constructor. Revert later."
      
      This reverts commit df8c40d3.
      
      * improve the 1-bit adam test.
      
      * Refactor comm. and compression backend in 1-bit adam.
      
      * Fix the test.
      
      * Fix runtime errors and typos in nccl backend
      
      * fix mpi backend. modify tests.
      
      * modify nccl perf test.
      
      * fix mpi side errors.
      
      * Add an mpi perf test
      
      * Sync DSE.
      
      * Remove old collectives file.
      
      * Undo a typo.
      
      * Graceful failure for torch versions that don't support nccl pt2pt.
      
      * Revert "Merge branch 'master' into staging-1bit-nccl-v2"
      
      This reverts commit 78400850, reversing
      changes made to a6dba72a.
      
      * Revert "Revert "Merge branch 'master' into staging-1bit-nccl-v2""
      
      This reverts commit 6dbdd985.
      
      * comm optimization + 1-bit lamb
      
      * Saving/debugging commit.
      
      * finalizing 1-bit lamb
      
      * finalizing 1-bit lamb
      
      * add momentum mask and chkpt handling for 1-bit adam
      
      * Cleanup and modify nccl test to be runnable with deepspeed launcher.
      
      * Fix format.
      
      * fix formatting again.
      
      * make test runnable without mpi4py
      
      * Add dist.alltoall and dist.allgather instead of custom functions.
      
      * remove debug prints.
      
      * formatting and renaming
      
      * renaming
      
      * renaming
      
      * add unit test, fix existing tests
      
      * skip unit test when torch < 1.8
      
      * revert 1-bit lamb
      
      * flatten momentum when dimension is more than 1
      
      * add warning message for 1-bit adam under fp32
      
      * improve version check
      
      * add fp32 test
      
      * 1-bit adam doc
      
      * fix file name
      
      * doc fix
      
      * torch 1.8 is released
      
      * doc fix
      
      * fix tests
      
      * update news
      
      * add doc for momentum mask
      
      * fix checkpoint handling, add unit test
      
      * checkpoint handling doc
      
      * doc final cleanup
      
      * bump dates
      
      * update tests
      
      * url change
      
      * doc fix
      
      * fix test
      
      * doc update
      Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
      Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
      68c8481b
  27. 12 Mar, 2021 1 commit
  28. 09 Mar, 2021 1 commit
  29. 21 Feb, 2021 1 commit
  30. 11 Feb, 2021 1 commit
    • C
      Add flops profiler tutorial (#682) · e2dfe0d1
      Cheng Li committed
      * work on flops profiler tutorial
      
      * update flops profiler tutorial
      
      * add flops profiler tutorial and fix names
      
      * work on flops profiler tutorial
      
      * update flops profiler tutorial
      
      * add flops profiler tutorial and fix names
      
      * fix trailing ws
      
      * fix names
      
      * remove multistep profiling and update docs
      
      * fix cases where functionals and submodules coexist in a parent module, update readme
      
      * fix typo
      
      * always invoke post hook function
      
      * fix module flops sum and update tests
      
      * update tutorial
      e2dfe0d1
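
The fix for "cases where functionals and submodules coexist in a parent module" suggests that a parent's total must include both its own functional calls and its children's totals. A minimal sketch of that aggregation follows, assuming a simple tree representation; `ModuleNode` and `total_flops` are hypothetical names and do not reflect the profiler's actual data structures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ModuleNode:
    """Hypothetical stand-in for a profiled module."""
    name: str
    own_flops: int = 0  # FLOPs from functional calls made directly in this module
    children: List["ModuleNode"] = field(default_factory=list)


def total_flops(node: ModuleNode) -> int:
    """Own functional FLOPs plus the recursive totals of all submodules."""
    return node.own_flops + sum(total_flops(child) for child in node.children)


# A parent that calls a functional op itself (10 FLOPs) and owns two submodules.
block = ModuleNode(
    "block",
    own_flops=10,
    children=[ModuleNode("linear1", 100), ModuleNode("linear2", 200)],
)
print(total_flops(block))
```

Counting only the children (or only the parent's own calls) would under-report `block`, which is the kind of mismatch a module-level FLOPs sum has to avoid.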
  31. 21 Jan, 2021 1 commit
  32. 16 Jan, 2021 2 commits
  33. 31 Oct, 2020 1 commit
  34. 11 Oct, 2020 1 commit
  35. 17 Sep, 2020 1 commit