1. 13 Jan, 2023 1 commit
    • [Custom Device] Clear ProcessGroup Manually (#49182) · a923a757
      duanyanhui committed
      * clear ProcessGroupCustom manually
      
      * fix bug
      
      * fix bug
      
      * move destroy ProcessGroup to ProcessGroupIdMap
      
      * enable destroy to all device
      
      * remove unused comments
      
      * change to internal api
      
      * Update process_group.cc
      
      * Update process_group.cc
  2. 26 Dec, 2022 1 commit
  3. 08 Dec, 2022 1 commit
    • Clean fluid APIs in distributed and fleet files (#48851) · 911d6bb1
      Ghost Screaming committed
      * Fix bug of reduce_sum op. When input.numel() > INT32_MAX, its result
      is wrong.
      
      * Remove climits.
      
      * Clean fluid APIs in the paddle/distributed and paddle/fleetx folders.
      This includes the following files:
      python/paddle/distributed/__init__.py
      python/paddle/distributed/collective.py
      python/paddle/distributed/fleet/utils/fs.py
      python/paddle/distributed/fleet/utils/hybrid_parallel_inference.py
      python/paddle/distributed/fleet/utils/hybrid_parallel_util.py
      python/paddle/distributed/fleet/utils/internal_storage.py
      python/paddle/distributed/launch/context/device.py
      python/paddle/distributed/parallel.py
      python/paddle/distributed/parallel_with_gloo.py
      python/paddle/distributed/spawn.py
      python/paddle/framework/__init__.py
      Note that 'paddle.fluid.dygraph.parallel.ParallelEnv'
      and 'fluid.framework.core' remain unchanged in those files.
      ParallelEnv is used by paddle.fluid.dygraph.parallel.DataParallel.
      However, the APIs in paddle.fluid.dygraph.parallel can't be
      migrated to paddle.distributed, because there are cyclic import
      dependencies in modules such as paddle.static and paddle.tensor.
      'fluid.framework.core' will be changed to import framework.core
      after fluid.core is migrated.
      
      * Change TODO authors.
  4. 28 Nov, 2022 1 commit
  5. 25 Nov, 2022 1 commit
  6. 16 Nov, 2022 1 commit
    • [remove fluid] under fleet meta_optimizers (#47864) · a2a97cbb
      wangzhen38 committed
      * [remove fluid] under fleet meta_optimizers
      
      * [remove fluid] under fleet meta_optimizers
      
      * [remove fluid] under fleet meta_optimizers
      
      * [remove fluid] under fleet meta_optimizers
      
      * [remove fluid] under fleet meta_optimizers
      
      * [remove fluid] under fleet meta_optimizers
      
      * [remove fluid] under fleet meta_optimizers
      
      * [remove fluid] under fleet meta_optimizers
      
      * [remove fluid] under fleet meta_optimizers
      
      * [remove fluid] under fleet meta_optimizers
      
      * [remove fluid] under fleet meta_optimizers
      
      * [remove fluid] under fleet meta_optimizers
  7. 04 Nov, 2022 1 commit
  8. 23 Oct, 2022 1 commit
  9. 19 Oct, 2022 1 commit
  10. 14 Oct, 2022 1 commit
  11. 13 Oct, 2022 1 commit
    • [WIP] PaddlePaddle distributed reinforcement learning feature development (#45998) · f0afcabc
      Xinger committed
      * add rpc module in cpp side
      
      * add rpc module in python side
      
      * support win32 and mac for rpc
      
      * code optimization
      
      * optimize code
      
      * update rpc
      
      * update rpc launch
      
      * rpc remove rank and world_size api
      
      * fix logger import bug
      
      * remove support for win and mac
      
      * remove support for xpu, npu, cinn and rocm
      
      * remove support for xpu, npu, cinn and rocm
      
      * fix shutdown barrier timeout bug
      
      * update:python_rpc_handler to shared ptr
      
      * fix master shutdown first bug
      
      * tests support for cpu
      
      * update log to vlog
      
      * update get service info api
      
      * add single process test case
      
      * remove process group
      
      * remove some useless dependencies
      
      * update rpc api comments
      
      * update rpc comments: Example to Examples
      
      * update rpc api comments
      
      * update rpc api comments
      
      * update launch api comments
      
      * update init_rpc comments
      
      * update rpc sync and async comments
      
      * fix bug: init_rpc can't be called repeatedly in a process
      
      * update rpc api comment: make master endpoint unique
      
      * update rpc api:service to worker, timeout_ms to timeout
      
      * rename ServiceInfo to WorkerInfo
      
      * refactor: rename server to worker, log to vlog
      
      * add launch test
      
      * remove unused codes
      
      * refine
  12. 20 Sep, 2022 1 commit
    • logger manager (#45909) · 264ad205
      Roc committed
      Unify the logger manager in FleetAPI.
      Hide APIs under distributed/utils that users don't need.
  13. 31 Aug, 2022 1 commit
  14. 28 Jul, 2022 1 commit
  15. 11 Jul, 2022 1 commit
  16. 05 Jun, 2022 1 commit
    • [code format check upgrade] step2: yapf (#42944) · a072fca8
      Sing_chan committed
      * use yapf to format all python files
      
      * exclude two unittest files from yapf because they rely on writing and reading files, and formatting would break them
      
      * disable diff_py_file because too many diff files cause the following command to fail
  17. 12 Apr, 2022 1 commit
  18. 23 Mar, 2022 1 commit
  19. 09 Mar, 2022 1 commit
  20. 26 Nov, 2021 1 commit
  21. 29 Oct, 2021 1 commit
    • [Auto Parallel] Improve the interface and the underlying mechanisms (#36617) · a02532b5
      Yulong Ao committed
      * default dist op
      
      * add dist_attr for dist op
      
      * add unittest
      
      * update inputname
      
      * update function name
      
      * add unittest
      
      * update CMakeLists.txt for CI
      
      * fix dis_matmul
      
      * fix compile error
      
      * update matmul to matmul_v2
      
      * unify api
      
      * unify api
      
      * todo
      
      * update distop forward func
      
      * update distop forward func
      
      * auto parallel backward
      
      * update dist op
      
      * autoparallel backward
      
      * add backward for embedding
      
      * temp1
      
      * temp2
      
      * temp3
      
      * temp4
      
      * backward done1
      
      * backward done2
      
      * backward done3
      
      * dist embedding remove mp mode
      
      * dist matmul remove mp mode
      
      * update dist embedding
      
      * dist op init1
      
      * dist op init 2
      
      * update unittest
      
      * context remove parallel mode
      
      * partitioner remove parallel mode
      
      * update unittest
      
      * a more general method to support varying mesh in pipeline parallel
      
      * support varying mesh in pipeline parallel
      
      * embedding support varying mesh in pipeline parallel
      
      * matmul support varying mesh in pipeline parallel
      
      * default dist op support varying mesh in pipeline parallel
      
      * dist attribute for startup program
      
      * default dist op support varying mesh in pipeline parallel 2
      
      * partitioner support varying mesh in pipeline parallel
      
      * revise logic for auto completion
      
      * revise framework.py
      
      * revise reshard unittest
      
      * revise unittest for parallelize
      
      * chmod
      
      * fixed bug for dist embedding name mapping
      
      * Improve the interface and the underlying mechanisms of auto parallel
      
      * revise completion for backward
      
      * revise completion for update
      
      * revise completion for update
      
      * update unittest
      
      * chmod
      
      * bugfix for grad_op output var's mesh
      
      * Modify codes for pr 36744
      
      * Remove unnecessary comments in framework.py
      
      * Remove unnecessary comments in completion.py
      Co-authored-by: JZ-LIANG <jianzhongliang10@gmail.com>
      Co-authored-by: zhaoyingli <zhaoyingli@baidu.com>
      Co-authored-by: JZ-LIANG <38102074+JZ-LIANG@users.noreply.github.com>
  22. 18 Sep, 2021 1 commit
  23. 17 Sep, 2021 1 commit
  24. 08 Sep, 2021 1 commit
  25. 24 Aug, 2021 1 commit
    • Add auto completion module for auto parallel (#34813) · 93d862b0
      Yulong Ao committed
      * add auto_parallel dir
      
      * mv to paddle.distributed
      
      * add shard_xx api
      
      * add distributed attrs for var
      
      * add ut, test=develop
      
      * add dist
      
      * update
      
      * update
      
      * update
      
      * update
      
      * update
      
      * update, test=develop
      
      * update, test=develop
      
      * update, test=develop
      
      * update, test=develop
      
      * update, test=develop
      
      * update, test=develop
      
      * update, test=develop
      
      * update
      
      * update
      
      * update
      
      * update
      
      * update
      
      * update, test=develop
      
      * update, test=develop
      
      * update
      
      * update
      
      * delete unused proto
      
      * restore op_desc
      
      * restore type_defs
      
      * update var_desc
      
      * remove dims_mapping for proto_pybind
      
      * update interface.py
      
      * update framework.py
      
      * update
      
      * update
      
      * add auto_parallel dir
      
      * mv to paddle.distributed
      
      * add shard_xx api
      
      * add distributed attrs for var
      
      * add ut, test=develop
      
      * [WIP] Add the auto completion feature and related codes
      
      * [WIP] Improve the auto completion and related codes
      
      * [WIP] Make the auto completion to support data-parallel
      
      * [WIP] Make the completion support mp and dp+mp
      
      * [WIP] Refactor auto completion unit test for MLP
      
      * [WIP] Refactor the implementation of DistributedOperatorImpl
      
      * [WIP] Improve dims_mapping update rule and fix a bug
      
      * [WIP] Support auto completion for one transformer decoder layer
      
      * [WIP] Add a minor change
      
      * [WIP] Fix a bug within the unit test
      
      * Shard XShape tensor, add embedding completion and refactor code
      
      * Add the distributed_operators dir to setup.py.in
      
      * Improve the completion process and add the unittest for gpt
      
      * fix process_mesh ut
      
      * fix process_mesh ut
      
      * update
      
      * update, test=develop
      
      * Add support for automatically completing distributed attrs of special ops
      
      * update
      
      * update
      
      * update
      
      * fix doc sample codes, test=develop
      
      * improve coverage, test=develop
      
      * add static_mode check, test=develop
      
      * Model the cluster for cost model and physical mapping
      
      * update, test=develop
      
      * add set_placement, test=develop
      
      * Add the check to make sure the candidate tensors' size is greater than zero
      
      * update doc, test=develop
      
      * update doc, test=develop
      
      * update doc, test=develop
      
      * update doc, test=develop
      
      * update, test=develop
      
      * Auto mark dist attrs annotated by user
      
      * update ndarray to nested list, test=develop
      
      * update, test=develop
      
      * Add auto-completion module for auto-parallel (based on PR#33804)
      
      * Remove unnecessary files
      
      * Remove unrelated files for the auto completion pr
      
      * Update the unit test to improve the coverage
      
      * Modify codes based on reviews
      
      * Minor changes for CI
      
      * Improve some codes based on new comments
      
      * Fix bugs caused by shallow copy in attributes.py
      * Improve amend_distributed_attr_for_program in context.py
      * Other changes for weihang's comments
      Co-authored-by: sandyhouse <lilong12@baidu.com>
  26. 23 Aug, 2021 1 commit
  27. 11 Aug, 2021 1 commit
  28. 06 May, 2021 1 commit
  29. 24 Feb, 2021 1 commit
    • fix entry (#31079) · ebbdf525
      tangwei12 committed
      * fix entry
      
      * fix distributed lookup table fuse case
      
      * fix entry bug at first time
      
      * move entry from paddle.fluid -> paddle.distributed
      
      * fix ut with paddle.enable_static()
      Co-authored-by: malin10 <malin10@baidu.com>
  30. 08 Jan, 2021 1 commit
  31. 28 Sep, 2020 1 commit
  32. 16 Sep, 2020 1 commit
  33. 29 Aug, 2020 1 commit
  34. 28 Aug, 2020 1 commit
    • Add interface to launch parallel dygraph by multiprocessing (#26044) · 31f422ae
      Chen Weihang committed
      * add dygraph parallel run interface
      
      * polish implement & unified env property name
      
      * add print config arg
      
      * refactor init_parallel_env function
      
      * Compatible with multiprocessing and launch modes
      
      * set default trainer start port
      
      * support run in python 2
      
      * polish python2 support code
      
      * remove python2 support
      
      * refine launch import
      
      * polish dome design details
      
      * refactor api implementation & path
      
      * use new method _set_expected_place
      
      * add spawn unittest framework & mnist test
      
      * add more unittests & doc
      
      * fix unittest failed
      
      * polish English doc
      
      * self review and polish details
      
      * refactor code by reviewer's comments
      
      * fix unittest failed
      
      * fix parallel_env unittest
      
      * fix several typos
      
      * fix error introduced when fixing typos
      
      * add unpublic note for start_processes
      
      * polish details by xiaoguang's comment
      
      * verify correctly when spawn nprocs=-1
      
      * refactor spawn & init_parallel_env design
      
      * polish doc details
      
      * open spawn unittests
      
      * try to fix doc compile error
      
      * try to fix unknown doc format error
      
      * add skip unittest when not gpu
  35. 27 Aug, 2020 1 commit
  36. 07 Jul, 2020 1 commit
  37. 08 May, 2020 1 commit
  38. 12 Feb, 2019 1 commit
  39. 24 Jan, 2019 1 commit
  40. 24 Dec, 2018 1 commit
    • Init paddle slim (#14834) · 93870574
      whs committed
      * Init slim.
      
      * Remove distillation demo.
      
      * Fix import errors.
      test=develop
      
      * Fix some issues.
      test=develop
      
      * Fix configs.
      test=develop
      
      * Modify API.spec.
      test=develop
      
      * Fix format.
      test=develop
      
      * Fix format.
      test=develop
      
      * Add some comments.