Commits · dec67d6d614efe0b3f3513e5df6687683437902d · PaddlePaddle / Paddle

26 Dec, 2022 1 commit

Add collective communication APIs to improve completeness (#49252) · dec67d6d

Committed by Wen Sun on Dec 26, 2022

* feat: broadcast_object_list & scatter_object_list

* chore: update ut conf

* get_backend & is_available

* docs: update requirements

* fix: resolve conflicts
Co-authored-by: LiYuRio <liyuruijx@163.com>

08 Dec, 2022 1 commit

Clean fluid APIs in distributed and fleet files (#48851) · 911d6bb1

Committed by Ghost Screaming on Dec 08, 2022

* Fix bug of reduce_sum op. When input.numel() > INT32_MAX, its result
is wrong.

* Remove climits.

* Clean fluid API in paddle/distributed and paddle/fleetx folders.
Include following files:
python/paddle/distributed/__init__.py
python/paddle/distributed/collective.py
python/paddle/distributed/fleet/utils/fs.py
python/paddle/distributed/fleet/utils/hybrid_parallel_inference.py
python/paddle/distributed/fleet/utils/hybrid_parallel_util.py
python/paddle/distributed/fleet/utils/internal_storage.py
python/paddle/distributed/launch/context/device.py
python/paddle/distributed/parallel.py
python/paddle/distributed/parallel_with_gloo.py
python/paddle/distributed/spawn.py
python/paddle/framework/__init__.py
To be mentioned, 'paddle.fluid.dygraph.parallel.ParallelEnv'
 and 'fluid.framework.core' keeps unchanged in those files.
ParallelEnv is used by paddle.fluid.dygraph.parallel.DataParallel.
However, APIs in paddle.fluid.dygraph.parallel can't be
migrated to paddle.distributed, as there exists cyclic import
dependencies in modules like paddle.static, paddle.tensor. And
'fluid.framework.core' will be changed to import framework.core
after fluid.core is transmitted.

* Change TODO authors.

28 Nov, 2022 1 commit

Remove unnecessary exports in `distributed.communication` and move `wait` & `barrier` (#48396) · fd689106

Committed by Wen Sun on Nov 28, 2022
```
* refactor: move wait

* refactor: move barrier

* fix: fix incorrect import
```
25 Nov, 2022 1 commit

Move collective communication `all_gather` from collective.py (#48339) · 776aef79

Committed by Wen Sun on Nov 25, 2022
```
* refactor: move all_gather
```
16 Nov, 2022 1 commit

[remove fluid] under fleet meta_optimizers (#47864) · a2a97cbb

Committed by wangzhen38 on Nov 16, 2022

* [remove fluid] under fleet meta_optimizers

* [remove fluid] under fleet meta_optimizers

* [remove fluid] under fleet meta_optimizers

* [remove fluid] under fleet meta_optimizers

* [remove fluid] under fleet meta_optimizers

* [remove fluid] under fleet meta_optimizers

* [remove fluid] under fleet meta_optimizers

* [remove fluid] under fleet meta_optimizers

* [remove fluid] under fleet meta_optimizers

* [remove fluid] under fleet meta_optimizers

* [remove fluid] under fleet meta_optimizers

* [remove fluid] under fleet meta_optimizers

04 Nov, 2022 1 commit

move broadcast, reduce, send, recv, reduce_scatter, scatter, alltoall (#47255) · 99504cbb

Committed by LiYuRio on Nov 04, 2022

23 Oct, 2022 1 commit

[CodeStyle][black] use black instead of yapf (#46014) · 7097630f

Committed by Nyakku Shigure on Oct 23, 2022
```
* update config

* re-blacken python code

* temporarily disable date and diff_py_file

* skip a format
```
19 Oct, 2022 1 commit

[CodeStyle][F403] expand star import (#46946) · 499d2daf

Committed by Nyakku Shigure on Oct 19, 2022

14 Oct, 2022 1 commit

Fix collective APIs cannot be recognized when building docs (#46962) · 2010bdc3

Committed by Wen Sun on Oct 14, 2022

13 Oct, 2022 1 commit

[WIP] PaddlePaddle distributed reinforcement learning feature development (#45998) · f0afcabc

Committed by Xinger on Oct 13, 2022

* add rpc module in cpp side

* add rpc module in python side

* support win32 and mac for rpc

* code optimization

* optimize code

* update rpc

* update rpc launch

* rpc remove rank and world_size api

* fix logger import bug

* remove support for win and mac

* remove support for xpu, npu, cinn and rocm

* remove support for xpu, npu, cinn and rocm

* fix shutdown barrier timeout bug

* update:python_rpc_handler to shared ptr

* fix master shutdown first bug

* tests support for cpu

* update log to vlog

* update get service info api

* add single process test case

* remove process group

* remove some useless dependencies

* update rpc api comments

* update rpc comments: Example to Examples

* update rpc api comments

* update rpc api comments

* update launch api comments

* update init_rpc comments

* update rpc sync and async comments

* fix bug: init_rpc can't be called repeatedly in a process

* update rpc api comment: make master endpoint unique

* update rpc api:service to worker, timeout_ms to timeout

* rename ServiceInfo to WorkerInfo

* refactor: rename server to worker, log to vlog

* add launch test

* remove unused codes

* refine

20 Sep, 2022 1 commit

logger manager (#45909) · 264ad205

Committed by Roc on Sep 20, 2022

uniform logger manager in FleetAPI.
hide API under distributed/utils which users don't need.

31 Aug, 2022 1 commit

add stream.all_reduce API and ProcessGroupStream (#45282) · ce4775cd

Committed by LiYuRio on Aug 31, 2022

28 Jul, 2022 1 commit

Complete the dtypes for all_gather, add all_gather_object api (#44417) · d4cf02bc

Committed by LiYuRio on Jul 28, 2022

11 Jul, 2022 1 commit

[Dygraph] Support new apis in ProcessGroupNCCL (#43918) · 37216a8f

Committed by Haohongxiang on Jul 11, 2022

* fix conflict

* new pg apis

* add docs of new apis

* update

* fix coverage

* update

* fix bug

* fix reduce scatter

* fix api

* update
Co-authored-by: ForFishes <2282912238@qq.com>

05 Jun, 2022 1 commit

【code format check upgrade】 step2：yapf (#42944) · a072fca8

Committed by Sing_chan on Jun 05, 2022

* use yapf to format all python file

* yapf exclude two unittests file for they rely on writing and reading file, and format will break them

* disable diff_py_file because too many diff files cause command following failed

12 Apr, 2022 1 commit

add ParallelMode docs (#41326) · 0835de79

Committed by Yanxing Shi on Apr 12, 2022

23 Mar, 2022 1 commit

enable continuous log; update doc (#40782) · fdafbc7b

Committed by kuizhiqing on Mar 23, 2022

09 Mar, 2022 1 commit

add_sharding_api (#40129) · f40ed5f4

Committed by Baibaifan on Mar 09, 2022

26 Nov, 2021 1 commit

upgrade async distributed training in pscore (#37515) · 74605fc2

Committed by zhaocaibei123 on Nov 26, 2021
```
* test

* test

* rm test

* update

* update

* update

* add unittest

* update

* update save
```
29 Oct, 2021 1 commit

[Auto Parallel] Improve the interface and the underlying mechanisms (#36617) · a02532b5

Committed by Yulong Ao on Oct 29, 2021

* default dist op

* add dist_attr for dist op

* add unitest

* update inputname

* update function name

* add unitest

* update CMakeLists.txt for CI

* fix dis_matmul

* fix compile error

* update matmul to matmul_v2

* unify api

* unify api

* todo

* update distop forward func

* update distop forward func

* auto parallel backward

* update dist op

* autoparallel backward

* add backward for embedding

* temp1

* temp2

* temp3

* temp4

* backward done1

* backward done2

* backward done3

* dist embedding remove mp mode

* dist matmul remove mp mode

* update dist embedding

* dist op init1

* dist op init 2

* update unitest

* context remove parallel mode

* partitioner remove parallel mode

* update unitest

* a more general method to support varying mesh in pipeline parallel

* support varying mesh in pipeline parallel

* embedding support varying mesh in pipeline parallel

* matmul support varying mesh in pipeline parallel

* default dist op support varying mesh in pipeline parallel

* dist attribute for startup program

* default dist op support varying mesh in pipeline parallel 2

* partitioner support varying mesh in pipeline parallel

* revise logic for auto completion

* revise framework.py

* revise reshard unitest

* revise unitest for parallelize

* chmod

* fixed bug for dist embedding name mapping

* Improve the interface and the underlying mechanisms of auto parallel

* revise completion for backward

* revise completion for update

* revise completion for update

* update unitest

* chmod

* bugfix for grad_op output var's mesh

* Modify codes for pr 36744

* Remove unnecessary comments in framework.py

* Remove unnecessary comments in completion.py
Co-authored-by: JZ-LIANG <jianzhongliang10@gmail.com>
Co-authored-by: zhaoyingli <zhaoyingli@baidu.com>
Co-authored-by: JZ-LIANG <38102074+JZ-LIANG@users.noreply.github.com>

18 Sep, 2021 1 commit

fix bug of module 'paddle' has no attribute 'distributed' for python3.6 (#35848) · d4cd2590

Committed by Guoxia Wang on Sep 18, 2021
```
* fix bug
```
17 Sep, 2021 1 commit

add launch doc (#35634) · 5548061b

Committed by Guoxia Wang on Sep 17, 2021
```
* add launch doc
```
08 Sep, 2021 1 commit

hidden the auto parallel apis (#35385) · afd1b372

Committed by lilong12 on Sep 08, 2021
```
* update, test=develop
```
24 Aug, 2021 1 commit

Add auto completion module for auto parallel (#34813) · 93d862b0

Committed by Yulong Ao on Aug 24, 2021

* add auto_parallel dir

* mv to paddle.distributed

* add shard_xx api

* add distributed attrs for var

* add ut, test=develop

* add dist

* update

* update

* update

* update

* update

* update, test=develop

* update, test=develop

* update, test=develop

* update, test=develop

* update, test=develop

* update, test=develop

* update, test=develop

* update

* update

* update

* update

* update

* update, test=develop

* update, test=develop

* update

* update

* delete unused proto

* restore op_desc

* restore type_defs

* update var_desc

* remove dimss_mapping for proto_pybind

* update interface.py

* update framework.py

* update

* update

* add auto_parallel dir

* mv to paddle.distributed

* add shard_xx api

* add distributed attrs for var

* add ut, test=develop

* [WIP] Add the auto completion feature and related codes

* [WIP] Improve the auto completion and related codes

* [WIP] Make the auto completion to support data-parallel

* [WIP] Make the completion support mp and dp+mp

* [WIP] Refactor auto completion unit test for MLP

* [WIP] Refactor the implementation of DistributedOperatorImpl

* [WIP] Improve dims_mapping update rule and fix a bug

* [WIP] Support auto completion for one transformer decoder layer

* [WIP] Add a minor change

* [WIP] Fix a bug within the unit test

* Shard XShape tensor, add embedding completion and refactor code

* Add the distributed_operators dir to setup.py.in

* Improve the completion process and add the unittest for gpt

* fix process_mesh ut

* fix process_mesh ut

* update

* update, test=develop

* Add support for automatically completing distributed attrs of special ops

* update

* update

* update

* fix doc sample codes, test=develop

* improve coverage, test=develop

* add static_mode check, test=develop

* Model the cluster for cost model and physical mapping

* update, test=develop

* add set_placement, test=develop

* Add the check to make sure the candidate tensors' size is greater than zero

* update doc, test=develop

* update doc, test=develop

* update doc, test=develop

* update doc, test=develop

* update, test=develop

* Auto mark dist attrs annotated by user

* update ndarray to nested list, test=develop

* update, test=develop

* Add auto-completion module for auto-parallel (based on PR#33804)

* Remove unnecessary files

* Remove unrelated files for the auto completion pr

* Update the unit test to improve the coverage

* Modify codes based on reviews

* Minor changes for CI

* Improve some codes based on new comments

* Fix bugs caused by shallow copy in attributes.py
* Improve amend_distributed_attr_for_program in context.py
* Other changes for weihang's comments
Co-authored-by: sandyhouse <lilong12@baidu.com>

23 Aug, 2021 1 commit

[CPU] Enable barrier op upon gloo (#34671) · e8f146a9

Committed by Bo Liu on Aug 23, 2021

11 Aug, 2021 1 commit

add the basic apis for auto_parallel (#33804) · 3f962e77

Committed by lilong12 on Aug 11, 2021
```
* add auto_parallel apis
```
06 May, 2021 1 commit

update 2.0 public api in distributed (#32695) · 70eb435c

Committed by zhiboniu on May 06, 2021

24 Feb, 2021 1 commit

fix entry (#31079) · ebbdf525

Committed by tangwei12 on Feb 24, 2021

* fix entry

* fix distributed lookup table fuse case

* fix entry bug at first time

* move entry from paddle.fluid -> paddle.distributed

* fix ut with paddle.enable_static()
Co-authored-by: malin10 <malin10@baidu.com>

08 Jan, 2021 1 commit

remove distributed prepare context (#30219) · 3016ba85

Committed by Chen Weihang on Jan 08, 2021

28 Sep, 2020 1 commit

【paddle.distributed.fleet】add data_generator in distributed.fleet.dataset (#27345) · 78014059

Committed by yaoxuefeng on Sep 28, 2020

16 Sep, 2020 1 commit

refine fleet dataset class api (#27133) · c67c3916

Committed by yaoxuefeng on Sep 16, 2020

29 Aug, 2020 1 commit

【paddle.fleet】fix api documents (#26777) · 994217ea

Committed by Dong Daxiang on Aug 29, 2020
```
* fix api document
```
28 Aug, 2020 1 commit

Add interface to launch parallel dygraph by multiprocessing (#26044) · 31f422ae

Committed by Chen Weihang on Aug 28, 2020

* add dygraph parallel run interface

* polish implement & unified env property name

* add print config arg

* refactor init_parallel_env function

* Compatible with multiprocessing and launch modes

* set default trainer start port

* support run in python 2

* polish python2 support code

* remove python2 support

* refine launch import

* polish dome design details

* refactor api implementation & path

* use new method _set_expected_place

* add spawn unittest framework & mnist test

* add more unittests & doc

* fix unittest failed

* polish english doc

* self review and polish details

* refactor code by reviewer's comments

* fix unittest failed

* fix parallel_env unittest

* fix several typos

* fix error introduced when fixing typos

* add unpublic note for start_processes

* polish details by xiaoguang's comment

* verify correctly when spawn nprocs=-1

* refactor spawn & init_parallel_env design

* polish doc details

* open spawn unittests

* try to fix doc compile error

* try to fix unknown doc format error

* add skip unittest when not gpu

27 Aug, 2020 1 commit

[api 2.0] add collective op for cpu using gloo and paddle.distributed.* apis (#26552) · 1c681383

Committed by lilong12 on Aug 27, 2020
```
add collective op for cpu using gloo and paddle.distributed.* apis
```
07 Jul, 2020 1 commit

Fix typo in interface. (#24779) · 80f1c507

Committed by gongweibao on Jul 07, 2020

08 May, 2020 1 commit

fs_wrapper add __all__ (#24335) · f62dfc62

Committed by zhangchunle on May 08, 2020

12 Feb, 2019 1 commit

add launch mp distributed job py module test=develop (#15620) · d424e5b4

Committed by Yan Xu on Feb 12, 2019

* add launch mp distributed mode module test=develop

* delete unused file test=develop

* refine usage test=develop

* refine usage test=develop

* move distributed package test=develop

* add to whl package test=develop

24 Jan, 2019 1 commit

add quantization freeze pass. · dde19a0f

Committed by WangZhen on Jan 24, 2019

24 Dec, 2018 1 commit

Init paddle slim (#14834) · 93870574

Committed by whs on Dec 24, 2018

* Init slim.

* Remove distillation demo.

* Fix import errors.
test=develop

* Fix some issues.
test=develop

* Fix configs.
test=develop

* Modify API.spec.
test=develop

* Fix format.
test=develop

* Fix format.
test=develop

* Add some comments.

02 Jul, 2018 1 commit

move v2 api and capi to legacy · 8c1326c5

Committed by Xin Pan on Jul 01, 2018

PaddlePaddle / Paddle synced successfully about 1 year ago