1. 08 Feb, 2023 2 commits
  2. 20 Jan, 2023 1 commit
  3. 18 Jan, 2023 2 commits
  4. 15 Jan, 2023 1 commit
    • R
      support mp on xpu (#49815) · 6a56bce7
      Committed by Roc
      1 update xccl lib
      2 when using comm_ctx, the allocator should be set manually.
  5. 13 Jan, 2023 1 commit
    • D
      [Custom Device] Clear ProcessGroup Manually (#49182) · a923a757
      Committed by duanyanhui
      * clear ProcessGroupCustom manually
      
      * fix bug
      
      * fix bug
      
      * move destroy ProcessGroup to ProcessGroupIdMap
      
      * enable destroy to all device
      
      * remove unused comments
      
      * change to internal api
      
      * Update process_group.cc
      
      * Update process_group.cc
  6. 12 Jan, 2023 2 commits
    • W
      Migrate collective communication checks to PHI (#49754) · c24e7fe1
      Committed by Wen Sun
      * refactor: migrate comm checks
      
      * refactor: add check in comm context
      
      * feat: add gloo static check
      
      * refactor: add place param in static check
    • J
      Fix reduce func bug in process_group_bkcl (#49749) · 8e291bf7
      Committed by jameszhang
      * Fix reduce func bug in process_group_bkcl
      
      Also catch up with a recent process_group PR that failed to add an XPU branch.
      Note that reduce is still accomplished by allreduce for xpu. This should be fixed
      if the xccl lib is updated.
      
      * fix compile issue for non-XPU
  7. 09 Jan, 2023 1 commit
  8. 06 Jan, 2023 1 commit
  9. 05 Jan, 2023 1 commit
  10. 19 Dec, 2022 1 commit
  11. 17 Dec, 2022 1 commit
  12. 16 Dec, 2022 1 commit
  13. 15 Dec, 2022 1 commit
  14. 14 Dec, 2022 1 commit
  15. 12 Dec, 2022 1 commit
  16. 05 Dec, 2022 1 commit
  17. 03 Dec, 2022 1 commit
  18. 24 Nov, 2022 1 commit
  19. 23 Nov, 2022 1 commit
  20. 21 Nov, 2022 4 commits
  21. 19 Nov, 2022 1 commit
  22. 18 Nov, 2022 3 commits
    • W
    • J
      correct sync behavior for XPU distributed training (#47882) · aafa9820
      Committed by james
      * correct sync behavior for XPU distributed training
      
      XPU supports an event mechanism similar to CUDA events, so it is advisable to
      use an event to sync the compute/comm streams for performance. However, this
      mechanism has never been fully tested, and inconsistent loss/ending_epochs have
      been reported. Therefore, this PR replaces event sync with stream waiting as
      a temporary solution.
      
      * remove compile warning
    • J
      fix device id issue for xpu eager mode (#48076) · 3b18d96b
      Committed by james
      * fix device id issue for xpu eager
      
      The xpu device id is not correctly set in eager mode, so vars are placed on dev0 unless
      XPUDeviceGuard is called, leading to this error message on every node with rank != 0:
      "NotImplementedError: (Unimplemented) Place Place(xpu:0) is not supported."
      
      * fix typo
      
      * fix pybind error
  23. 17 Nov, 2022 1 commit
  24. 16 Nov, 2022 1 commit
  25. 14 Nov, 2022 3 commits
  26. 10 Nov, 2022 2 commits
    • J
      XPU multi-card support eager mode (#47445) · 3b91f8f3
      Committed by james
      * XPU support eager mode
      
      * add unittest for XPU eager mode
      
      * minor bugfix
      
      * minor bugfix, test=kunlun
      
      * correct copyright info
      
      * 1. remove unused vars/funcs
      2. ProcessGroupBKCL inherits from ProcessGroupStream
      
      * bugfix for fp16 in eager mode multi-card, test=kunlun
      
      * rebase & fix a few issues
      
      * use new processgroup interface, test=kunlun
      
      * fix compile issue, test=kunlun
    • W
      Refactor collective communication P2P C++ API (#47801) · d926c270
      Committed by Wen Sun
      * refactor: send, recv, send_partial, recv_partial
      
      * refactor: rm useless const ref
  27. 09 Nov, 2022 1 commit
  28. 08 Nov, 2022 1 commit
  29. 07 Nov, 2022 1 commit