1. 19 Dec, 2022 1 commit
  2. 17 Dec, 2022 1 commit
  3. 16 Dec, 2022 1 commit
  4. 15 Dec, 2022 1 commit
  5. 14 Dec, 2022 1 commit
  6. 12 Dec, 2022 1 commit
  7. 09 Dec, 2022 1 commit
  8. 05 Dec, 2022 1 commit
  9. 03 Dec, 2022 1 commit
  10. 28 Nov, 2022 1 commit
  11. 24 Nov, 2022 2 commits
    • H
      [PHI decoupling] simplify "convert_utils.h" in fluid (#48168) · de4310e6
      huangjiyi committed
      * rm dependency on "convert_utils.h" in some files
      
      * fix bugs
      
      * replace DataType2String with DataTypeToString
      
      * replace framework::DataTypeSize with phi::SizeOf
      
      * mv convert_function from fluid to phi and rm old map
      
      * recommit with pre-commit
      
      * replace ProtoVarType with ProtoDataType and update comment.
      
      * fix error about include "dnnl.hpp"
      
      * revert add dep mkldnn to convert_utils in phi
      
      * add mkldnn deps in convert_utils.h in phi
      
      * move deps to convert_utils.h in phi
    • J
      processgroup bkcl support reduce (#48232) · 5f995d3f
      james committed
      Note: this is a temporary solution and should be replaced once the reduce
      kernel is natively supported on KL2.
  12. 23 Nov, 2022 1 commit
  13. 21 Nov, 2022 4 commits
  14. 19 Nov, 2022 1 commit
  15. 18 Nov, 2022 3 commits
    • W
    • J
      correct sync behavior for XPU distributed training (#47882) · aafa9820
      james committed
      * correct sync behavior for XPU distributed training
      
      XPU supports an event mechanism similar to CUDA events, so it is advisable
      to use an event to sync the compute and comm streams for performance.
      However, this mechanism has never been fully tested, and inconsistent
      loss/ending_epochs have been reported. Therefore, this PR replaces event
      sync with stream waiting as a temporary solution.
      
      * remove compile warning
    • J
      fix device id issue for xpu eager mode (#48076) · 3b18d96b
      james committed
      * fix device id issue for xpu eager
      
      The XPU device id is not set correctly in eager mode, so vars stay on dev0 unless
      XPUDeviceGuard is called, which produces this error on every node with rank != 0:
      "NotImplementedError: (Unimplemented) Place Place(xpu:0) is not supported."
      
      * fix typo
      
      * fix pybind error
  16. 17 Nov, 2022 1 commit
  17. 16 Nov, 2022 1 commit
  18. 14 Nov, 2022 3 commits
  19. 10 Nov, 2022 3 commits
    • L
      remove the hang checkness (#47806) · 8d99dd0c
      LiYuRio committed
    • J
      XPU multi-card support eager mode (#47445) · 3b91f8f3
      james committed
      * XPU support eager mode
      
      * add unittest for XPU eager mode
      
      * minor bugfix
      
      * minor bugfix, test=kunlun
      
      * correct copyright info
      
      * 1. remove unused vars/funcs
      2. ProcessGroupBKCL inherits from ProcessGroupStream
      
      * bugfix for fp16 in eager mode multi-card, test=kunlun
      
      * rebase & fix a few issues
      
      * use new processgroup interface, test=kunlun
      
      * fix compile issue, test=kunlun
    • W
      Refactor collective communication P2P C++ API (#47801) · d926c270
      Wen Sun committed
      * refactor: send, recv, send_partial, recv_partial
      
      * refactor: rm useless const ref
  20. 09 Nov, 2022 1 commit
  21. 08 Nov, 2022 2 commits
  22. 07 Nov, 2022 1 commit
  23. 04 Nov, 2022 2 commits
  24. 03 Nov, 2022 1 commit
  25. 01 Nov, 2022 2 commits
  26. 31 Oct, 2022 2 commits