1. 17 Feb, 2023 1 commit
  2. 03 Nov, 2022 1 commit
  3. 18 Oct, 2022 1 commit
    • Y
      Cherry pick for sharding (#47061) · 5b642140
      Yuang Liu committed
      * [dygraph sharding] Overlap the reduce and the calculation for sharding stage 2. (#46495)
      
      * [dygraph sharding stage 2] sharding broadcast overlap (#46656)
      
      * Multi groups for broadcast of sharding stage 2 (#46894)
  4. 17 Oct, 2022 1 commit
    • W
      [Cherry-pick] Collective communication APIs (#46922) · 5fba2a98
      Wen Sun committed
      * Support both use_calc_stream and sync_op in send recv APIs (#46023)
      
      * Support both use_calc_stream and sync_op in allgather API (#46295)
      
      * Support both use_calc_stream and sync_op in collective communication API (#46761)
      
      * Move group and all reduce from collective to communication (#45848)
      
      * Completes bfloat16 dtype for collective api in eager mode (#45844)
      
      * Fix collective APIs cannot be recognized when building docs (#46962)
      Co-authored-by: LiYuRio <63526175+LiYuRio@users.noreply.github.com>
  5. 27 Sep, 2022 1 commit
  6. 22 Sep, 2022 1 commit
    • R
      logger manager (#45909) (#46087) · 7eb046c7
      Roc committed
      unify the logger manager in FleetAPI.
      hide APIs under distributed/utils which users don't need.
  7. 20 Sep, 2022 2 commits
  8. 19 Sep, 2022 2 commits
    • W
      Recompute unify incubate (#46073) (#46210) · 4bced24a
      wuhuachaocoding committed
    • Y
      [Cherry-pick][Auto Parallel] Improve the APIs (#46164) · c5cc4278
      Yulong Ao committed
      * [AutoParallel] adapt gradient merge pass (#45915)
      
      * adapt gradient merge
      
      * fix op_role
      
      * fix strategy
      
      * [Auto Parallel] Gradient Fuse Allreduce (#45643)
      
      * bugfix (#45332)
      
      * dist embedding support lookup table v1
      
      * add unittest
      
      * customize wait_comm
      
      * group gradients
      
      * bugfix
      
      * update program
      
      * [Auto Parallel] Improve the APIs (#45776)
      
      * [Auto Parallel] Use c++ dist attr in the completion process
      
      * [Auto Parallel] Add minor changes
      
      * [Auto Parallel] Use c++ dist attr in the completion process
      
      * [Auto Parallel] Add minor changes
      
      * [Auto Parallel] Add the serialization process for dist attrs
      
      * [Auto Parallel] Remove unnecessary comments
      
      * [Auto Parallel] Fix some bugs
      
      * [Auto Parallel] Fix the code style
      
      * [Auto Parallel] Remove unnecessary impls
      
      * [Auto Parallel] Fix the importing error
      
      * [Auto Parallel] Fix the copy from bugs of op dist attr
      
      * [Auto Parallel] Replace the use of constexpr if
      
      * [Auto Parallel] Redesign the shard_tensor, shard_op and ProcessMesh
      
      * [Auto Parallel] Change API of the completion unittest
      
      * [Auto Parallel] Fix the bug when set_attr an int
      
      * [Auto Parallel] Add the unittest for the serialization
      
      * [Auto Parallel] Add some unit tests
      
      * [Auto Parallel] Unify the strategy
      
      * [Auto Parallel] Improve the engine api
      
      * [Auto Parallel] Reset the changes made to the framework
      
      * [Auto Parallel] Change the engine unittest
      
      * [Auto Parallel] Update API of the completion and partitioner
      
      * [Auto Parallel] Update unit tests using engine api
      
      * update shard annotation
      
      * [Auto Parallel] Remove the modifications of other modules
      
      * [Auto Parallel] Add docs for APIs
      
      * add new strategy
      
      * [Auto Parallel] Replace the logger
      
      * [Auto Parallel] Restore the test_program.py
      
      * [Auto Parallel] Change the import rules
      
      * [Auto Parallel] Add the examples for Engine
      
      * [Auto Parallel] Do some minor changes
      
      * [Auto Parallel] Remove yaml dependency
      
      * [Auto Parallel] Fix the unittests
      
      * add valid after train
      
      * bug fix
      Co-authored-by: zhaoyingli <zhaoyingli@baidu.com>
      Co-authored-by: caozhou <caozhou@radi.ac.cn>
      Co-authored-by: caozhou <48191911+Caozhou1995@users.noreply.github.com>
      
      * [Auto Parallel] Bugfix allreduce fuse for MP (#46086)
      
      * bugfix
      
      * bugfix
      
      * typos fixed
      
      * update strategy (#46138)
      Co-authored-by: zhaoyingli <86812880+zhaoyinglia@users.noreply.github.com>
      Co-authored-by: JZ-LIANG <jianzhongliang10@gmail.com>
      Co-authored-by: zhaoyingli <zhaoyingli@baidu.com>
      Co-authored-by: caozhou <caozhou@radi.ac.cn>
      Co-authored-by: caozhou <48191911+Caozhou1995@users.noreply.github.com>
  9. 08 Sep, 2022 1 commit
  10. 07 Sep, 2022 1 commit
  11. 06 Sep, 2022 3 commits
  12. 05 Sep, 2022 1 commit
  13. 02 Sep, 2022 1 commit
  14. 01 Sep, 2022 1 commit
  15. 31 Aug, 2022 2 commits
  16. 29 Aug, 2022 1 commit
  17. 26 Aug, 2022 1 commit
    • R
      move collective tests into a collective directory (#45223) · 9eb4d89b
      Roc committed
      * add simple reformatted ci files
      
      * update
      
      * add readme for new unittests
      
      * add readme for new unittests
      
      * add readme for new unittests
      
      * reset mlu
      
      * update for samples
      
      * add base api
      
      * reset some dist unit tests
      
      * add warning in generated cmakelists file
      
      * update readme for new dist unit tests
      
      * add all collective tests
      
      * remain base file and launcher file
      
      * Update README.md
      
      * Update README.md
      
      * fix env PYTHONPATH
      
      * Update gen_ut_cmakelists.py
      
      * add all collective tests
      
      * add docs for gen_ut_cmakelists.py
      
      * prettify code
      
      * comment name == "name"
      
      * update for comments
      
      * update function's help
      
      * update for run type
      
      * update readme
      
      * add all collective tests
      
      * add all collective tests
      
      * mv  collective test files
      
      * update for all collective tests
      
      * update
      
      * update
      
      * update
      
      * update for all tests
      
      * update for checking name
      
      * Update Cmakelists.txt
      
      * update testlist.csv
      
      * remain test_parallel_dygraph_dataparallel in unittests
      
      * set broadcast op all platforms
      
      * update
      
      * remain test_broadcast_tensors_op
      
      * fix
      
      * rm some collective files
      
      * update more collective tests
      
      * update
      
      * update
      
      * update
      gen_ut_supports recursion
      
      * update
      
      * update
      
      * update
      
      * update
      
      * fix nccl version
      
      * update
      
      * update
      
      * update
      
      * update
      
      * fix a bug and try to pass
      
      * update
      
      * add csv
      
      * update for timeout
      
      * remove tcp store
      
      * fix
      
      * fix
      
      * update
      
      * update
      
      * update for more dist tests
      
      * move multi node tests
      
      * update
      
      * update
      
      * update
      
      * fix for auto parallel
      
      * update
      
      * update path in python file
      
      * update
      
      * reset some test in unittests
      
      * fix
      
      * update readme
      
      * fix
      
      * update
      
      * fix port
  18. 18 Aug, 2022 1 commit