1. 26 Jul, 2022 · 2 commits
  2. 21 Jun, 2022 · 1 commit
  3. 20 Jun, 2022 · 1 commit
  4. 11 Jun, 2022 · 1 commit
  5. 16 May, 2022 · 1 commit
  6. 12 May, 2022 · 1 commit
  7. 10 May, 2022 · 1 commit
  8. 07 May, 2022 · 1 commit
  9. 06 May, 2022 · 1 commit
    • Improve z3 trace management (#1916) · 673cb608
      Committed by Olatunji Ruwase
      * Fix OOM and type mismatch
      
      * Toggle prefetching
      
      * Disable z3 prefetching for inference (temp workaround)
      
      * Fix zero3 tracing issues
      
      * Remove debug prints
      
      * Enable prefetch for inference
      
      * Code clarity
      
      * Invalidate trace cache
      
      * Trace cache invalidation when needed
      Separate nvme prefetch from all-gather prefetch
      
      * Track last used step id
      
      * Use debug name in error message
      
      * Construct param trace from module trace
       Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
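
       The fixes above center on ZeRO-3's execution trace: record the order in which submodule parameters are fetched, drive prefetching from that record, and invalidate it whenever the observed order diverges (for example when switching between training and inference). A minimal sketch of that idea, using hypothetical names rather than DeepSpeed's actual classes:

```python
# Hypothetical illustration of a module-trace cache with invalidation.
# Names and structure are NOT DeepSpeed's actual implementation.
class ModuleTraceCache:
    def __init__(self):
        self.trace = []        # module ids in the order they were fetched
        self.step_pos = 0      # current position within the recorded trace
        self.complete = False  # only prefetch from a completed, trusted trace

    def record_or_check(self, module_id):
        if not self.complete:
            self.trace.append(module_id)   # still recording the first pass
            return
        if self.trace[self.step_pos] != module_id:
            self.invalidate()              # execution diverged: stop trusting the trace
        else:
            self.step_pos = (self.step_pos + 1) % len(self.trace)

    def next_modules(self, lookahead):
        # Candidates to prefetch next, only when the trace is trusted.
        if not self.complete:
            return []
        return [self.trace[(self.step_pos + i) % len(self.trace)]
                for i in range(lookahead)]

    def finish_step(self):
        # Call at the end of a full pass; the recorded trace can now
        # drive prefetching on subsequent steps.
        if self.trace:
            self.complete = True
        self.step_pos = 0

    def invalidate(self):
        self.trace, self.step_pos, self.complete = [], 0, False
```
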
  10. 27 Apr, 2022 · 1 commit
  11. 26 Apr, 2022 · 1 commit
  12. 20 Apr, 2022 · 1 commit
  13. 23 Jan, 2022 · 1 commit
  14. 21 Jan, 2022 · 1 commit
    • Various ZeRO Stage3 Optimizations + Improvements (including bfloat16 support) (#1453) · 4912e0ad
      Committed by Justin Chiu
      * Changes for bfloat16 Zero2
      
      * ZeRO stage3 optimizations, with some bug fixes
      
      optimizations for stage3:
      - prefetching improvements
      - batching allgather calls to amortize fixed overhead and improve
        bandwidth utilization
      - batching reduce_scatter calls to amortize fixed overhead and
        improve bandwidth utilization
      - using *_base variants of allgather and reduce scatter to reduce memory
        allocations and data movement
      - more fine grained synchronization for communication that allows
        blocking on less work
      - precomputation of fetching code - using a fetch queue rather than
        deciding what to (pre)fetch at each iteration
      - limiting queued coalesced communication ops to reduce memory pressure
        on pytorch cuda caching allocator (not elegant solution)
      
      optimizations for stage3-offload:
      - made some host-device tensor copies async to improve performance
      
      bug fixes and qol improvements:
      - fix init context method when parent modules modify child weights
      - speed up model initialization by moving model to GPU before weight
        initialization
      - fixed unit test imports so that unit tests can be run from any
        directory
      - change performance logging to include memory consumption
      - add logging w/ model size when done partitioning model
      
      new features
      - bfloat16 support for ZeRO 3
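
       To make the batching and *_base points above concrete, here is a sketch (an illustration, not DeepSpeed's code) that flattens a bucket of gradients into one contiguous buffer, pre-divides by the world size, and issues a single flat reduce-scatter, so the fixed per-call overhead is paid once per bucket rather than once per tensor. Note that _reduce_scatter_base is a private PyTorch API that newer releases expose as reduce_scatter_tensor.

```python
# Sketch only: batch a bucket of gradients into one flat reduce-scatter.
import torch
import torch.distributed as dist

def reduce_scatter_bucket(grads, group=None):
    world_size = dist.get_world_size(group)
    flat = torch.cat([g.reshape(-1) for g in grads])  # one buffer, one collective launch
    pad = (-flat.numel()) % world_size                 # pad so the buffer divides evenly across ranks
    if pad:
        flat = torch.nn.functional.pad(flat, (0, pad))
    flat.div_(world_size)                              # pre-divide instead of dividing after the reduce
    out = torch.empty(flat.numel() // world_size,
                      dtype=flat.dtype, device=flat.device)
    handle = dist._reduce_scatter_base(out, flat, group=group, async_op=True)
    return out, handle                                 # wait on the handle only when the result is needed
```
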
      
      * fix import in ut
      
      * ran yapf
      
      * improvements to cache flush warn log
      
      * backwards compatibility with older versions of pytorch
      
      * handle edge case where reduced tensor smaller than world size
      
      * moved event synchronization to allgather handle wait() call
      
      * removed unnecessary barrier call
      
      * formatting fix after resolving merge conflict
      
      * skip nvme prefetch when trace not complete
      
      * opportunistically avoid memory allocation in allgather coalesced where possible
      
      * fix indentation after merge
      
      * fixes to account for parameter offload
      
      * accounting for torch.cuda.memory_stats not being available
      
      * moved partition_all_params to optimizer step
      
      * allgathering on params before item gets called
      
      * fix param status checks
      
      needed after moving partition_all_parameters call to optimizer step
      
      * fix grad accumulation with optimizer offload
      
      * grad norm computation fix for optimizer offload
      
      * change post divide in reduce-scatter to pre divide
      
      * fix gradient race condition w/ optimizer offload
      
      * improve inf/nan gradient tracking
      
      * don't prefetch when not in training mode
      
      * format fix after merging
      
      * fix prefetching issue when using NVME offload
      
      * improved defragmentation for fp16 parameters
      
      * relative imports for bf16 tests
      
      * changes for bwd compatibility with pytorch 1.2
      
      * remove buffered_reduce_fallback
      
      * removed unused parameter offset bookkeeping
      
      * fixed tracking for multiple param groups
      
      * unbroke bfloat16 config after merge conflict
      
      * using base allgather params when only 1 param
      
      * cleanup/fixes for fp16 partition defragmentation
      
      * switch to CRLF
      
      * convert to same new-line style as master
      
      * align new line with master
      
      * Fix merge issues
      
      * switch to CRLF
      
      * fix to LF line endings
      
      * minor merge fixes
      
      * remove extra bfloat16_enabled definition
      
      * asserting params inflight for AllGatherHandle
      
      * remove get_cuda_mem_allocated_str
      
      * Format fixes
      
      * fix bfloat16 zero stage check (broken after merge commit)
      
      * +self.communication_data_type, -self.allreduce_always_fp32; delete dead code
      
      * Add self.reduce_scatter
      
      * Format fix
      
      * Fix merge issues
      
      * iterate over params_to_fetch rather than make another iterator
      
      * add some TODOs
      
      * remove unnecessary division by micro_step_id
      
      * rename config keys "bfloat16" -> "bf16"
      
      * rename stage3_gather_fp16_weights_on_model_save -> stage3_gather_16bit_weights_on_model_save
      
      * add unit test to check backwards compatibility for gather_16bit_weights
      
      * added test to confirm bf16 key bwd compatibility
      
      * Format fixes
       Co-authored-by: Rana Ali Amjad <raamjad@amazon.com>
       Co-authored-by: Justin Chiu <justchiu@amazon.com>
       Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
       Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
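
       The renamed config keys from this commit look like the following in a ZeRO-3 config; a minimal fragment with placeholder values rather than recommendations:

```python
# Minimal DeepSpeed config fragment using the renamed keys:
# "bfloat16" -> "bf16", and stage3_gather_fp16_weights_on_model_save ->
# stage3_gather_16bit_weights_on_model_save. Values are illustrative only.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},  # formerly the "bfloat16" section
    "zero_optimization": {
        "stage": 3,
        "stage3_gather_16bit_weights_on_model_save": True,  # formerly ..._fp16_weights_...
    },
}
```

       Passing a dict like this to deepspeed.initialize via its config (or older config_params) argument is one way to exercise the new names; the commit also adds tests confirming that the old "bfloat16" key remains accepted for backward compatibility.
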
  15. 02 Dec, 2021 · 1 commit
  16. 01 Dec, 2021 · 1 commit
  17. 19 Nov, 2021 · 1 commit
  18. 13 Nov, 2021 · 1 commit
  19. 11 Nov, 2021 · 1 commit
  20. 31 Oct, 2021 · 1 commit
    • ZeRO3, improved parameter all-gather operation (#1188) · c0eeb69d
      Committed by Zhen Zhang
      * remove norm(), avoid memcpy after allgather
      
       1) Remove the norm computation from debug printing.
       2) Change _all_gather to a synchronous op in fetch_sub_module.
           Reason: the async version is not actually asynchronous, because each
           all_gather calls torch.cuda.synchronize() to guarantee that the previous
           communication op has completed.
       3) Add a new function _allgather_params_split_launch:
           the existing _allgather_params does an explicit memcpy after the
           all-gather op. Avoiding that explicit copy on the Python side
           improves performance.
      
      Known issue:
          the `torch.distributed.all_gather` will do implicit memcpy
          at the end of each `ncclAllgather`.
      
      * WIP: wrapped ncclAllgather as customized op in DS
      
       A micro-benchmark shows that all-gathering a transformer layer with
       9,834,560 elements in half precision improves by about 1.1 ms on an
       aws-p4d instance.
      
      * WIP: integrated into partition_parameters
      
      Performance improvement of 5.1B bert on aws-p4d:
      fwd: 300ms -> 200ms
      bwd: 680ms -> 610ms
      
      * Fix format
      
      * cleaned dead code, modified unit test
      
      * removed customized c++ extension
      
      revert back to use torch distributed API
      
       * change torch.ones to torch.empty
      
      * typo
      
      * warn if not cuda tensor for allgather
      
      * fix formatting
      
      * fix: move ds_tensor to cuda device
      
       though it is strange that the ds_tensor had not already been moved to CUDA
      
      * remove try clause on the path for fetching params
       Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
       Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
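
       The central change here is gathering each parameter directly into a contiguous, preallocated buffer instead of all-gathering into a list of per-rank tensors and stitching them together in Python afterwards. A rough sketch of the two approaches (illustrative only; not the actual fetch_sub_module / _allgather_params code, and _all_gather_base is a private PyTorch API later exposed as all_gather_into_tensor):

```python
import torch
import torch.distributed as dist

def allgather_param_with_copy(shard, group=None):
    # Baseline: all_gather into a list of per-rank tensors, then concatenate.
    # The torch.cat is an extra device-side copy on top of the collective itself.
    world_size = dist.get_world_size(group)
    parts = [torch.empty_like(shard) for _ in range(world_size)]
    dist.all_gather(parts, shard, group=group)
    return torch.cat(parts)

def allgather_param_flat(shard, group=None):
    # Improved: gather straight into one flat, preallocated buffer so the full
    # parameter can be viewed or narrowed without another copy.
    world_size = dist.get_world_size(group)
    full = torch.empty(shard.numel() * world_size,
                       dtype=shard.dtype, device=shard.device)
    handle = dist._all_gather_base(full, shard.contiguous(), group=group, async_op=True)
    handle.wait()  # explicit sync; the commit likewise made the fetch path synchronous
    return full
```
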
  21. 22 Oct, 2021 · 1 commit
  22. 02 Oct, 2021 · 1 commit
  23. 01 Oct, 2021 · 1 commit
  24. 30 Sep, 2021 · 1 commit
  25. 16 Sep, 2021 · 1 commit
  26. 02 Sep, 2021 · 1 commit
  27. 13 Jul, 2021 · 1 commit
  28. 12 Jul, 2021 · 1 commit
  29. 10 Jul, 2021 · 1 commit
  30. 29 Jun, 2021 · 1 commit
  31. 24 Jun, 2021 · 1 commit
  32. 08 Jun, 2021 · 1 commit
  33. 14 May, 2021 · 1 commit
  34. 01 May, 2021 · 1 commit
  35. 30 Apr, 2021 · 1 commit
  36. 29 Apr, 2021 · 1 commit
  37. 22 Apr, 2021 · 1 commit
  38. 21 Apr, 2021 · 1 commit
  39. 19 Apr, 2021 · 1 commit