1. 23 Nov, 2022 1 commit
  2. 21 Nov, 2022 4 commits
  3. 19 Nov, 2022 1 commit
  4. 18 Nov, 2022 3 commits
    • correct sync behavior for XPU distributed training (#47882) · aafa9820
      james committed
      * correct sync behavior for XPU distributed training
      
      XPU supports an event mechanism similar to CUDA events, so it is advisable to
      use an event to sync the compute/comm streams for performance. However, this
      mechanism has never been fully tested, and inconsistent loss/ending_epochs have
      been reported. Therefore, this PR replaces event sync with stream waiting as
      a temporary solution.
      
      * remove compile warning
    • fix device id issue for xpu eager mode (#48076) · 3b18d96b
      james committed
      * fix device id issue for xpu eager
      
      The XPU device id is not set correctly in eager mode, so vars are placed on dev0 unless
      XPUDeviceGuard is called, leading to this error message on all nodes with rank != 0:
      "NotImplementedError: (Unimplemented) Place Place(xpu:0) is not supported."
      
      * fix typo
      
      * fix pybind error
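The failure mode described in this commit (variables landing on device 0 unless a guard is entered) can be illustrated with a small, framework-free sketch of the scoped device-guard pattern. Everything here is hypothetical and for illustration only: `current_device`, `set_device`, `device_guard`, and `allocate` are made-up names, not Paddle APIs, and the real XPUDeviceGuard is a C++ class whose internals are not shown in the commit message.

```python
import contextlib
import threading

# Per-thread "current device" state. When nothing has been set, the
# current device silently defaults to 0, which mirrors the symptom in
# the commit message (vars end up on dev0).
_state = threading.local()


def current_device() -> int:
    """Return the device id active on this thread (0 if never set)."""
    return getattr(_state, "device_id", 0)


def set_device(device_id: int) -> None:
    """Make device_id the active device for this thread."""
    _state.device_id = device_id


@contextlib.contextmanager
def device_guard(device_id: int):
    """Temporarily switch to device_id and restore the previous device on exit."""
    previous = current_device()
    set_device(device_id)
    try:
        yield
    finally:
        set_device(previous)


def allocate(name: str) -> tuple:
    """Pretend to allocate a variable; it is placed on the current device."""
    return (name, current_device())
```

With this toy model, a process of rank 1 that never sets its device gets `allocate("w") == ("w", 0)`, while the same call inside `with device_guard(1):` yields `("w", 1)`. Setting the device once at startup, rather than relying on every call site to enter a guard, is one way to avoid the mismatch; whether the PR does exactly that is not stated in the message.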
  5. 17 Nov, 2022 1 commit
  6. 16 Nov, 2022 1 commit
  7. 14 Nov, 2022 3 commits
  8. 10 Nov, 2022 3 commits
    • remove the hang checkness (#47806) · 8d99dd0c
      LiYuRio committed
    • XPU multi-card support eager mode (#47445) · 3b91f8f3
      james committed
      * XPU support eager mode
      
      * add unittest for XPU eager mode
      
      * minor bugfix
      
      * minor bugfix, test=kunlun
      
      * correct copyright info
      
      * 1. remove unused vars/funcs
      2. ProcessGroupBKCL inherits from ProcessGroupStream
      
      * bugfix for fp16 in eager mode multi-card, test=kunlun
      
      * rebase & fix a few issues
      
      * use new processgroup interface, test=kunlun
      
      * fix compile issue, test=kunlun
    • Refactor collective communication P2P C++ API (#47801) · d926c270
      Wen Sun committed
      * refactor: send, recv, send_partial, recv_partial
      
      * refactor: rm useless const ref
  9. 09 Nov, 2022 1 commit
  10. 08 Nov, 2022 2 commits
  11. 07 Nov, 2022 1 commit
  12. 04 Nov, 2022 2 commits
  13. 03 Nov, 2022 1 commit
  14. 01 Nov, 2022 2 commits
  15. 31 Oct, 2022 2 commits
  16. 28 Oct, 2022 2 commits
  17. 27 Oct, 2022 1 commit
    • make all cpp tests dynamic linked to libpaddle.so [except windows] (#47088) · 2096448b
      Leo Chen committed
      * make all cpp tests dynamic linked to libpaddle.so
      
      * add comments
      
      * keep old cc_test for some tests
      
      * fix some ut
      
      * make some ut use cc_test_old
      
      * fix typos and fit for win32
      
      * fix lib path
      
      * fix some tests
      
      * skip lite test
      
      * fit for rocm
      
      * fit for cinn
      
      * fit for mac
      
      * fit for win32
      
      * skip inference ut
      
      * skip windows
      
      * fix coverage
  18. 26 Oct, 2022 1 commit
  19. 19 Oct, 2022 2 commits
  20. 17 Oct, 2022 1 commit
    • Support BF16 training for sharding (#46846) · 0b39b244
      Ghost Screaming committed
      * Fix a bug in the reduce_sum op: when input.numel() > INT32_MAX, its result
      is wrong.
      
      * support pure bfloat16
      
      * support bf16 linear
      
      * update PR to pass CI
      
      * tiny fix where_grad_kernel.cu
      
      * Support bfloat16 type for reducer and sharding.
      
      * Fix some bugs.
      
      * Polish code.
      
      * Polish code.
      
      * Add bfloat16 datatype in fill_grad kernels.

      Co-authored-by: sneaxiy <sneaxiy@126.com>
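The first item of this commit message notes that reduce_sum returned a wrong result when input.numel() > INT32_MAX. The message does not state the root cause, so the following is only a sketch of one common source of this class of bug: storing an element count or index in a signed 32-bit integer, which wraps around once the value exceeds INT32_MAX. The helper names `as_int32` and `needs_int64_index` are hypothetical and not taken from Paddle.

```python
import ctypes

INT32_MAX = 2**31 - 1


def as_int32(n: int) -> int:
    """Return n as a signed 32-bit integer would hold it (wraps on overflow)."""
    return ctypes.c_int32(n).value


def needs_int64_index(numel: int) -> bool:
    """True when an element count no longer fits in a signed 32-bit index."""
    return numel > INT32_MAX


if __name__ == "__main__":
    for numel in (INT32_MAX, INT32_MAX + 1):
        print(numel, as_int32(numel), needs_int64_index(numel))
```

A kernel that dispatches on a check like `needs_int64_index` can keep the cheaper 32-bit indexing path for small tensors and switch to 64-bit indices only when required.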
  21. 13 Oct, 2022 1 commit
    • [WIP] PaddlePaddle distributed reinforcement learning feature development (#45998) · f0afcabc
      Xinger committed
      * add rpc module in cpp side
      
      * add rpc module in python side
      
      * support win32 and mac for rpc
      
      * optimize code
      
      * optimize code
      
      * update rpc
      
      * update rpc launch
      
      * rpc remove rank and world_size api
      
      * fix logger import bug
      
      * remove support for win and mac
      
      * remove support for xpu, npu, cinn and rocm
      
      * remove support for xpu, npu, cinn and rocm
      
      * fix shutdown barrier timeout bug
      
      * update:python_rpc_handler to shared ptr
      
      * fix master shutdown first bug
      
      * tests support for cpu
      
      * update log to vlog
      
      * update get service info api
      
      * add single process test case
      
      * remove process group
      
      * remove some useless dependencies
      
      * update rpc api comments
      
      * update rpc comments: Example to Examples
      
      * update rpc api comments
      
      * update rpc api comments
      
      * update launch api comments
      
      * update init_rpc comments
      
      * update rpc sync and async comments
      
      * fix bug: init_rpc can't be called repeatedly in a process
      
      * update rpc api comment: make master endpoint unique
      
      * update rpc api:service to worker, timeout_ms to timeout
      
      * rename ServiceInfo to WorkerInfo
      
      * refactor: rename server to worker, log to vlog
      
      * add launch test
      
      * remove unused codes
      
      * refine
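The commit message above refers to synchronous and asynchronous RPC calls, to a `timeout` parameter, and to workers. As a rough illustration of how the two call styles differ, here is a local toy model built only on the Python standard library. It is not the Paddle RPC module and performs no networking: `ToyWorker` and its methods are hypothetical names, and a single-threaded executor merely stands in for a remote worker.

```python
from concurrent.futures import Future, ThreadPoolExecutor


def add(a, b):
    """Example function to run on the worker."""
    return a + b


class ToyWorker:
    """Stand-in for a remote worker; calls run on a background thread."""

    def __init__(self, name: str):
        self.name = name
        self._pool = ThreadPoolExecutor(max_workers=1)

    def rpc_async(self, fn, args=(), kwargs=None) -> Future:
        """Submit the call and return immediately with a Future."""
        return self._pool.submit(fn, *args, **(kwargs or {}))

    def rpc_sync(self, fn, args=(), kwargs=None, timeout=None):
        """Submit the call and block until the result arrives.

        timeout is in seconds; None waits indefinitely.
        """
        return self.rpc_async(fn, args, kwargs).result(timeout=timeout)

    def shutdown(self) -> None:
        """Wait for pending calls to finish, then release the worker."""
        self._pool.shutdown(wait=True)


if __name__ == "__main__":
    worker = ToyWorker("worker1")
    print(worker.rpc_sync(add, args=(2, 3)))   # blocks for the result
    future = worker.rpc_async(add, args=(4, 5))  # returns at once
    print(future.result(timeout=5))              # collect the result later
    worker.shutdown()
```

In this sketch the synchronous call is just the asynchronous call followed by waiting on the returned future, which is a common way to layer the two.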
  22. 11 Oct, 2022 3 commits
  23. 10 Oct, 2022 1 commit