- 13 Jan, 2023 1 commit
Committed by duanyanhui
* clear ProcessGroupCustom manually * fix bug * fix bug * move destroy ProcessGroup to ProcessGroupIdMap * enable destroy to all device * remove unused comments * change to internal api * Update process_group.cc * Update process_group.cc
-
- 26 Dec, 2022 1 commit
Committed by Wen Sun
* feat: broadcast_object_list & scatter_object_list * chore: update ut conf * get_backend & is_available * docs: update requirements * fix: resolve conflicts Co-authored-by: LiYuRio <liyuruijx@163.com>
-
- 08 Dec, 2022 1 commit
Committed by Ghost Screaming
* Fix bug of reduce_sum op: when input.numel() > INT32_MAX, its result is wrong.
* Remove climits.
* Clean fluid API in paddle/distributed and paddle/fleetx folders, including the following files:
  python/paddle/distributed/__init__.py
  python/paddle/distributed/collective.py
  python/paddle/distributed/fleet/utils/fs.py
  python/paddle/distributed/fleet/utils/hybrid_parallel_inference.py
  python/paddle/distributed/fleet/utils/hybrid_parallel_util.py
  python/paddle/distributed/fleet/utils/internal_storage.py
  python/paddle/distributed/launch/context/device.py
  python/paddle/distributed/parallel.py
  python/paddle/distributed/parallel_with_gloo.py
  python/paddle/distributed/spawn.py
  python/paddle/framework/__init__.py
  Note that 'paddle.fluid.dygraph.parallel.ParallelEnv' and 'fluid.framework.core' remain unchanged in those files. ParallelEnv is used by paddle.fluid.dygraph.parallel.DataParallel. However, APIs in paddle.fluid.dygraph.parallel can't be migrated to paddle.distributed, because there are cyclic import dependencies in modules like paddle.static and paddle.tensor. 'fluid.framework.core' will be changed to import framework.core after fluid.core is migrated.
* Change TODO authors.
-
- 28 Nov, 2022 1 commit
Committed by Wen Sun
* refactor: move wait * refactor: move barrier * fix: fix incorrect import
-
- 25 Nov, 2022 1 commit
Committed by Wen Sun
* refactor: move all_gather
-
- 16 Nov, 2022 1 commit
Committed by wangzhen38
* [remove fluid] under fleet meta_optimizers * [remove fluid] under fleet meta_optimizers * [remove fluid] under fleet meta_optimizers * [remove fluid] under fleet meta_optimizers * [remove fluid] under fleet meta_optimizers * [remove fluid] under fleet meta_optimizers * [remove fluid] under fleet meta_optimizers * [remove fluid] under fleet meta_optimizers * [remove fluid] under fleet meta_optimizers * [remove fluid] under fleet meta_optimizers * [remove fluid] under fleet meta_optimizers * [remove fluid] under fleet meta_optimizers
-
- 04 Nov, 2022 1 commit
Committed by LiYuRio
-
- 23 Oct, 2022 1 commit
Committed by Nyakku Shigure
* update config * re-blacken python code * temporarily disable date and diff_py_file * skip a format
-
- 19 Oct, 2022 1 commit
Committed by Nyakku Shigure
-
- 14 Oct, 2022 1 commit
Committed by Wen Sun
-
- 13 Oct, 2022 1 commit
Committed by Xinger
* add rpc module in cpp side * add rpc module in python side * support win32 and mac for rpc * code optimization * optimize code * update rpc * update rpc launch * rpc remove rank and world_size api * fix logger import bug * remove support for win and mac * remove support for xpu, npu, cinn and rocm * remove support for xpu, npu, cinn and rocm * fix shutdown barrier timeout bug * update:python_rpc_handler to shared ptr * fix master shutdown first bug * tests support for cpu * update log to vlog * update get service info api * add single process test case * remove process group * remove some useless dependencies * update rpc api comments * update rpc comments: Example to Examples * update rpc api comments * update rpc api comments * update launch api comments * update init_rpc comments * update rpc sync and async comments * fix bug: init_rpc can't be called repeatedly in a process * update rpc api comment: make master endpoint unique * update rpc api:service to worker, timeout_ms to timeout * rename ServiceInfo to WorkerInfo * refactor: rename server to worker, log to vlog * add launch test * remove unused codes * refine
-
- 20 Sep, 2022 1 commit
Committed by Roc
Unify logger manager in FleetAPI. Hide APIs under distributed/utils which users don't need.
-
- 31 Aug, 2022 1 commit
Committed by LiYuRio
-
- 28 Jul, 2022 1 commit
Committed by LiYuRio
-
- 11 Jul, 2022 1 commit
Committed by Haohongxiang
* fix conflict * new pg apis * add docs of new apis * update * fix coverage * update * fix bug * fix reduce scatter * fix api * update Co-authored-by: ForFishes <2282912238@qq.com>
-
- 05 Jun, 2022 1 commit
Committed by Sing_chan
* use yapf to format all python files * exclude two unittest files from yapf because they rely on writing and reading files, and formatting would break them * disable diff_py_file because too many diff files cause the following command to fail
-
- 12 Apr, 2022 1 commit
Committed by Yanxing Shi
-
- 23 Mar, 2022 1 commit
Committed by kuizhiqing
-
- 09 Mar, 2022 1 commit
Committed by Baibaifan
-
- 26 Nov, 2021 1 commit
Committed by zhaocaibei123
* test * test * rm test * update * update * update * add unittest * update * update save
-
- 29 Oct, 2021 1 commit
Committed by Yulong Ao
* default dist op * add dist_attr for dist op * add unittest * update inputname * update function name * add unittest * update CMakeLists.txt for CI * fix dis_matmul * fix compile error * update matmul to matmul_v2 * unify api * unify api * todo * update distop forward func * update distop forward func * auto parallel backward * update dist op * autoparallel backward * add backward for embedding * temp1 * temp2 * temp3 * temp4 * backward done1 * backward done2 * backward done3 * dist embedding remove mp mode * dist matmul remove mp mode * update dist embedding * dist op init1 * dist op init 2 * update unittest * context remove parallel mode * partitioner remove parallel mode * update unittest * a more general method to support varying mesh in pipeline parallel * support varying mesh in pipeline parallel * embedding support varying mesh in pipeline parallel * matmul support varying mesh in pipeline parallel * default dist op support varying mesh in pipeline parallel * dist attribute for startup program * default dist op support varying mesh in pipeline parallel 2 * partitioner support varying mesh in pipeline parallel * revise logic for auto completion * revise framework.py * revise reshard unittest * revise unittest for parallelize * chmod * fixed bug for dist embedding name mapping * Improve the interface and the underlying mechanisms of auto parallel * revise completion for backward * revise completion for update * revise completion for update * update unittest * chmod * bugfix for grad_op output var's mesh * Modify codes for pr 36744 * Remove unnecessary comments in framework.py * Remove unnecessary comments in completion.py Co-authored-by: JZ-LIANG <jianzhongliang10@gmail.com> Co-authored-by: zhaoyingli <zhaoyingli@baidu.com> Co-authored-by: JZ-LIANG <38102074+JZ-LIANG@users.noreply.github.com>
-
- 18 Sep, 2021 1 commit
Committed by Guoxia Wang
* fix bug
-
- 17 Sep, 2021 1 commit
Committed by Guoxia Wang
* add launch doc
-
- 08 Sep, 2021 1 commit
Committed by lilong12
* update, test=develop
-
- 24 Aug, 2021 1 commit
Committed by Yulong Ao
* add auto_parallel dir * mv to paddle.distributed * add shard_xx api * add distributed attrs for var * add ut, test=develop * add dist * update * update * update * update * update * update, test=develop * update, test=develop * update, test=develop * update, test=develop * update, test=develop * update, test=develop * update, test=develop * update * update * update * update * update * update, test=develop * update, test=develop * update * update * delete unused proto * restore op_desc * restore type_defs * update var_desc * remove dimss_mapping for proto_pybind * update interface.py * update framework.py * update * update * add auto_parallel dir * mv to paddle.distributed * add shard_xx api * add distributed attrs for var * add ut, test=develop * [WIP] Add the auto completion feature and related codes * [WIP] Improve the auto completion and related codes * [WIP] Make the auto completion to support data-parallel * [WIP] Make the completion support mp and dp+mp * [WIP] Refactor auto completion unit test for MLP * [WIP] Refactor the implementation of DistributedOperatorImpl * [WIP] Improve dims_mapping update rule and fix a bug * [WIP] Support auto completion for one transformer decoder layer * [WIP] Add a minor change * [WIP] Fix a bug within the unit test * Shard XShape tensor, add embedding completion and refactor code * Add the distributed_operators dir to setup.py.in * Improve the completion process and add the unittest for gpt * fix process_mesh ut * fix process_mesh ut * update * update, test=develop * Add support for automatically completing distributed attrs of special ops * update * update * update * fix doc sample codes, test=develop * improve coverage, test=develop * add static_mode check, test=develop * Model the cluster for cost model and physical mapping * update, test=develop * add set_placement, test=develop * Add the check to make sure the candidate tensors' size is greater than zero * update doc, test=develop * update doc, test=develop * update doc, test=develop * update doc, test=develop * update, test=develop * Auto mark dist attrs annotated by user * update ndarray to nested list, test=develop * update, test=develop * Add auto-completion module for auto-parallel (based on PR#33804) * Remove unnecessary files * Remove unrelated files for the auto completion pr * Update the unit test to improve the coverage * Modify codes based on reviews * Minor changes for CI * Improve some codes based on new comments * Fix bugs caused by shallow copy in attributes.py * Improve amend_distributed_attr_for_program in context.py * Other changes for weihang's comments Co-authored-by: sandyhouse <lilong12@baidu.com>
-
- 23 Aug, 2021 1 commit
Committed by Bo Liu
-
- 11 Aug, 2021 1 commit
Committed by lilong12
* add auto_parallel apis
-
- 06 May, 2021 1 commit
Committed by zhiboniu
-
- 24 Feb, 2021 1 commit
Committed by tangwei12
* fix entry * fix distributed lookup table fuse case * fix entry bug at first time * move entry from paddle.fluid -> paddle.distributed * fix ut with paddle.enable_static() Co-authored-by: malin10 <malin10@baidu.com>
-
- 08 Jan, 2021 1 commit
Committed by Chen Weihang
-
- 28 Sep, 2020 1 commit
Committed by yaoxuefeng
-
- 16 Sep, 2020 1 commit
Committed by yaoxuefeng
-
- 29 Aug, 2020 1 commit
Committed by Dong Daxiang
* fix api document
-
- 28 Aug, 2020 1 commit
Committed by Chen Weihang
* add dygraph parallel run interface * polish implementation & unified env property name * add print config arg * refactor init_parallel_env function * Compatible with multiprocessing and launch modes * set default trainer start port * support run in python 2 * polish python2 support code * remove python2 support * refine launch import * polish dome design details * refactor api implementation & path * use new method _set_expected_place * add spawn unittest framework & mnist test * add more unittests & doc * fix unittest failed * polish english doc * self review and polish details * refactor code by reviewer's comments * fix unittest failed * fix parallel_env unittest * fix several typos * fix error introduced when fixing typos * add unpublic note for start_processes * polish details by xiaoguang's comment * verify correctly when spawn nprocs=-1 * refactor spawn & init_parallel_env design * polish doc details * open spawn unittests * try to fix doc compile error * try to fix unknown doc format error * add skip unittest when not gpu
-
- 27 Aug, 2020 1 commit
Committed by lilong12
add collective op for cpu using gloo and paddle.distributed.* apis
-
- 07 Jul, 2020 1 commit
Committed by gongweibao
-
- 08 May, 2020 1 commit
Committed by zhangchunle
-
- 12 Feb, 2019 1 commit
Committed by Yan Xu
* add launch mp distributed mode module test=develop * delete unused file test=develop * refine usage test=develop * refine usage test=develop * move distributed package test=develop * add to whl package test=develop
-
- 24 Jan, 2019 1 commit
Committed by WangZhen
-
- 24 Dec, 2018 1 commit
Committed by whs
* Init slim. * Remove distillation demo. * Fix import errors. test=develop * Fix some issues. test=develop * Fix configs. test=develop * Modify API.spec. test=develop * Fix format. test=develop * Fix format. test=develop * Add some comments.
-