Commits · 85bb1a85cdb3bc9927f5047dc81e25f0d7ada844 · 机器未来 / Paddle

13 Oct, 2021 · 1 commit
- support auto parallel data shard (#36055) · 85bb1a85
  Committed by Guoxia Wang on Oct 13, 2021
13 Sep, 2021 · 1 commit
- [HybridParallel]Fix scaler bug in pipeline_parallel/model_parallel (#35556) · 2bb44317
  Committed by ShenLiang on Sep 13, 2021
```
* support grad group

* fix single card condition
```
10 Sep, 2021 · 1 commit
- [Dygraph 4D Parallel] Sharding Support MP-PP-DP Parallelism (#35580) · 2c922d63
  Committed by JZ-LIANG on Sep 10, 2021
```
* sharding support dp

* sharding support mp

* sharding support pp
```
08 Sep, 2021 · 2 commits

[Auto Parallel] Integrate all modules (#35483) · 12155358

Committed by Yulong Ao on Sep 08, 2021

* add auto_parallel dir

* mv to paddle.distributed

* add shard_xx api

* add distributed attrs for var

* add ut, test=develop

* add dist

* update

* update

* update

* update

* update

* update, test=develop

* update, test=develop

* update, test=develop

* update, test=develop

* update, test=develop

* update, test=develop

* update, test=develop

* update

* update

* update

* update

* update

* update, test=develop

* update, test=develop

* update

* update

* delete unused proto

* restore op_desc

* restore type_defs

* update var_desc

* remove dimss_mapping for proto_pybind

* update interface.py

* update framework.py

* update

* update

* add auto_parallel dir

* mv to paddle.distributed

* add shard_xx api

* add distributed attrs for var

* add ut, test=develop

* [WIP] Add the auto completion feature and related codes

* [WIP] Improve the auto completion and related codes

* [WIP] Make the auto completion to support data-parallel

* [WIP] Make the completion support mp and dp+mp

* [WIP] Refactor auto completion unit test for MLP

* [WIP] Refactor the implementation of DistributedOperatorImpl

* [WIP] Improve dims_mapping update rule and fix a bug

* [WIP] Support auto completion for one transformer decoder layer

* [WIP] Add a minor change

* [WIP] Fix a bug within the unit test

* Shard XShape tensor, add embedding completion and refactor code

* Add the distributed_operators dir to setup.py.in

* Improve the completion process and add the unittest for gpt

* fix process_mesh ut

* fix process_mesh ut

* update

* update, test=develop

* Add support for automatically completing distributed attrs of special ops

* update

* update

* update

* fix doc sample codes, test=develop

* improve coverage, test=develop

* add static_mode check, test=develop

* Model the cluster for cost model and physical mapping

* update, test=develop

* add set_placement, test=develop

* Add the check to make sure the candidate tensors' size is greater than zero

* update doc, test=develop

* update doc, test=develop

* update doc, test=develop

* update doc, test=develop

* update, test=develop

* Auto mark dist attrs annotated by user

* update ndarray to nested list, test=develop

* update, test=develop

* Add auto-completion module for auto-parallel (based on PR#33804)

* Remove unnecessary files

* Remove unrelated files for the auto completion pr

* Update the unit test to improve the coverage

* Modify codes based on reviews

* Minor changes for CI

* Improve some codes based on new comments

* Fix bugs caused by shallow copy in attributes.py
* Improve amend_distributed_attr_for_program in context.py
* Other changes for weihang's comments

* support shard reader

* support shard reader

* add parallel mode

* update process mesh

* add method to compute comm_group

* implement dist_embedding forward func

* implement dist matmul forward func

* implement dist reshape forward func

* add transpiler framework

* add transpiler forward

* implement transpiler forward

* implement transpiler backward & update

* add process

* add unit test

* chmod

* chmod

* chmod

* update unit test

* add unit test for gpt

* remove unused print

* rename transpiler --> partitioner

* rename transpiler --> partitioner

* chmod

* chmod

* bug fixed

* remove amp function

* update case for dp mode

* update case for dp mode

* [Auto Parallel] Integrate all parts with the newest code

* Integrate all parts of auto parallel and improve codes

* Integrate all parts by AutoParallelizer
* Add unit test for AutoParallelizer
* Improve auto completion module for pipeline parallel
* Add support for matmul_v2 in dist_matmul
* Correct the typo "stratergy" to "strategy"

* Modify distributed_strategy.proto to conform the main stream

* Restore parts of distributed_strategy to conform the develop branch

Co-authored-by: sandyhouse <lilong12@baidu.com>
Co-authored-by: JZ-LIANG <jianzhongliang10@gmail.com>

Enable program passes on Fleet APIs (#34955) · 5f369881

Committed by Zeng Jinle on Sep 08, 2021

* add fleet api for program pass

* turn on apply pass for CI test

* fix disable fuse_all_optimizer bug

* try to test ci

* fix CI

* fill unspecified op role

* fix fuse_allreduce

* add ut to improve coverage

* remove useless change

* improve c++ coverage

* follow some comments

* test ir pass pipeline

* update doc

* reduce ut time again

30 Jul, 2021 · 1 commit
- add trainer desc config to distributed strategy (#34457) · e6aacd1e
  Committed by wangguanqun on Jul 30, 2021
```
* add trainer desc config to distributed strategy

* code style modified
```
27 Jul, 2021 · 1 commit
- supports mp and dp hybrid (#34377) · 937e21a3
  Committed by Yuang Liu on Jul 27, 2021
01 Jul, 2021 · 1 commit
- Dygraph/sharding (#33633) · f33f2444
  Committed by JZ-LIANG on Jul 01, 2021
```
* dygraph sharding

* update unit test hybrid_parallel_communicate_group
```
25 Jun, 2021 · 1 commit
- static support mp_layers (#33700) · 91a0acdb
  Committed by WangXi on Jun 25, 2021
27 May, 2021 · 1 commit

[PsCore] support ssd (#33031) · 988b5fe1

Committed by Thunderbrook on May 27, 2021

* support ssd in PsCore

* remove log

* remove bz2

* default value

* code style

* parse table class

* code style

* add define

17 May, 2021 · 1 commit
- [HybridParallel]Fix precision problem of model parallel (#32897) · c809530e
  Committed by ShenLiang on May 17, 2021
```
* fix precision of mp

* fix bug of seed

* fix dp

* print group
```
12 May, 2021 · 1 commit

Optimize/fleet save (#32817) · 890f626b

Committed by tangwei12 on May 12, 2021

* fix cpp lint
* fix save/load with unexpected value
* fix save and user interface

06 May, 2021 · 1 commit
- update 2.0 public api in distributed (#32695) · 70eb435c
  Committed by zhiboniu on May 06, 2021
25 Apr, 2021 · 1 commit
- add pipeline for dynamic graph (#32511) · 561dc719
  Committed by lilong12 on Apr 25, 2021
```
* add pp dygraph, test=develop
```
22 Apr, 2021 · 2 commits
- Add fleet get_loss_scaling doc and update alert message (#32419) · d03b0b16
  Committed by Yuang Liu on Apr 22, 2021
- [HybridParallel] Add ClipGradByGlobalNorm & check_finite_and_unscale in Dygraph (#32354) · 7ea999fd
  Committed by ShenLiang on Apr 22, 2021
```
* add clip/check

* add amp & clip grad in dygraph

* add logging
```
21 Apr, 2021 · 1 commit
- add get_loss_scaling to fleet (#32401) · 37bb3342
  Committed by Yuang Liu on Apr 21, 2021
19 Apr, 2021 · 1 commit
- [Hybrid Parallel] Support dp & mp in dygraph (#32323) · ffd40860
  Committed by ShenLiang on Apr 19, 2021
```
* support dp & mp
```
17 Apr, 2021 · 1 commit
- [Hybrid Parallel] Add model parallel support in dygraph (#32248) · 66d46221
  Committed by ShenLiang on Apr 17, 2021
```
* add model parallel support in dygraph
```
07 Apr, 2021 · 1 commit

【NPU】Merge ascend GE&distributed code by 0208 from ascendrc (#31957) · 8c7c53b3

Committed by zhang wenhui on Apr 07, 2021

* Ascend rc (#30483)

* Fix compilation on CANN20.1 and older (#30494)

Fix compilation on CANN20.1 and older

* Add distribution supported (#30578)

Add distribution supported

* Build parser for Hcom* operators (#30627)

Build parser for Hcom* operators

* Pass device_ids info from launch to trainer. (#30632)

Pass device_ids info from launch to trainer

* Add Hccl program group (#30642)

Add Hccl program group

* Add startup bash files of test_ascend_group. (#30645)

Add startup bash files of test_ascend_group

* cleanup (#30646)

cleanup test_ascend_group.py

* [Feature] Build parser to support distributed training (#30658)

[Feature] Build parser to support distributed training

* fix compilation on ascend-20.1 (#30722)

fix compilation on ascend-20.1

* Dev/fix ascend string (#30749)

Dev/fix ascend string

* code style (#30781)

code style

* Merge ascend_optimizer and ascend_parser. (#30776)

Merge ascend_optimizer and ascend_parser.

* Ascendrc add converted op : [range/equal/range/uniform_random/expand/squeeze], fix cast op bug  (#30797)

Ascendrc add converted op : [range/equal/range/uniform_random/expand/squeeze], fix cast op bug

* Add paddle ascend distribution training supported (#30796)

Add paddle ascend distribution training supported

* pass cxx_flags to gloo cmake (#30857)

* Destroy session first. (#30954)

Destroy session first.

* merge

* fix, test=develop

* fix, test=develop

* fix style, test=develop

* fix, test=develop

* fix

* fix log fatal, test=develop

* fix enforce style, test=develop

* fix, test=develop

* fix, test=develop

* fix rccl, test=develop

* fix test, test=develop

* fix, test=develop

* fix, test=develop

* fix, test=develop

* fix node_num, test=develop

* fix ids str, test=develop

* fix ids str, test=develop

* fix ids str, test=develop

* fix, test=develop

* fix, test=develop

* fix, test=develop

* fix, test=develop

* fix, test=develop

* fix, test=develop

* fix, test=develop

* fix, test=develop

* fix style code, test=develop

* fix style code, test=develop

* fix style code, test=develop

* fix style code, test=develop

Co-authored-by: hutuxian <hutuxian2011@sina.cn>
Co-authored-by: gongweibao <weibao.gong@gmail.com>
Co-authored-by: Void Main <voidmain1313113@gmail.com>
Co-authored-by: Leo Chen <chenqiuliang@baidu.com>
Co-authored-by: dingsiyu <18369187719@163.com>
Co-authored-by: OleNet <olenet@126.com>

01 Apr, 2021 · 2 commits
- Support control flow in DataParallel (#31625) · 8460698b
  Committed by ShenLiang on Apr 01, 2021
```
* support control flow

* support sync_parameters_buffers

* fix the bug of sparse embedding
```
- LOG CLEAN (#31819) · 0589ed21
  Committed by tangwei12 on Apr 01, 2021
```
* upgrade vlog

* train from dataset fetch optimize
```
15 Mar, 2021 · 1 commit
- fix amp bug of fleet (#31532) · c3634c6b
  Committed by ShenLiang on Mar 15, 2021
20 Feb, 2021 · 1 commit
- test=develop, save/load, shrink (#30625) · 16b4260b
  Committed by 123malin on Feb 20, 2021
```
* test=develop, save/load, shrink

Co-authored-by: seiriosPlus <tangwei12@baidu.com>
```
01 Feb, 2021 · 1 commit
- Fleet distributed strategy support pure fp16 (#30754) · 31ed9c9e
  Committed by WangXi on Feb 01, 2021
21 Jan, 2021 · 1 commit
- Fix the bug in fleet amp_init. (#30606) · 4a9de931
  Committed by Zhen Wang on Jan 21, 2021
```
* Fix the bug in fleet amp_init.

* Fix the amp_init unit test.
```
20 Jan, 2021 · 1 commit
- Add fleet amp_init() (#30572) · 13862008
  Committed by huangxu96 on Jan 20, 2021
```
* add fleet amp.init()

* add unittest for fleet_amp_init
```
12 Jan, 2021 · 1 commit
- 【Paddle.Fleet】Support local save sparse param (#30175) · d479ae17
  Committed by Chengmo on Jan 12, 2021
```
* add save tensor support

Co-authored-by: seiriosPlus <tangwei12@baidu.com>
```
22 Dec, 2020 · 1 commit
- Support multi-stream communication for dynamic graph distributed (#29525) · 01e2874a
  Committed by ShenLiang on Dec 22, 2020
```
* fix fleet for multi-stream

* fix memcpy for ncclid

* use sync to solve move operation
```
04 Dec, 2020 · 1 commit
- support dp run single card (#29358) · 4064354a
  Committed by ShenLiang on Dec 04, 2020
03 Dec, 2020 · 2 commits
- fix warning of fleet (#29317) · 2d6aa1a5
  Committed by ShenLiang on Dec 03, 2020
- Fix doc of fleet api (#29282) · 2cd0bf57
  Committed by ShenLiang on Dec 03, 2020
```
* fix doc, test=document_fix
```
01 Dec, 2020 · 2 commits
- Change the api of DataParallel and Fleet (#29224) · 46b73e6c
  Committed by ShenLiang on Dec 01, 2020
- test=develop, fix doc (#29200) · cc9c6196
  Committed by 123malin on Dec 01, 2020
```
* fix fleet api doc
```
27 Nov, 2020 · 1 commit

Support dynamic graph distributed (#28997) · e2d01eb6

Committed by ShenLiang on Nov 27, 2020

* add reducer

* refine event for memorycopy

* add concat&split for allreduce

* apply concat & split for fuse tensor

* fix nccl dep

* fix the unit test, compile problem and ddp initialize problem

* fix unit test for mac & add some comments & solve the repeated param in sublayers

* fix unit test for windows & fix document

19 Oct, 2020 · 1 commit
- fleet support paddle.optimzier (#28026) · 55098b97
  Committed by MRXLT on Oct 19, 2020
```
fleet support paddle.optimzier

* bug fix

* fix fleet_base

* bug fix

* fix coverage
```
16 Oct, 2020 · 1 commit
- 【paddle.fleet】fleet add _get_applied_meta_list and _get_applied_graph_list (#27952) · fb641c91
  Committed by WangXi on Oct 16, 2020
15 Oct, 2020 · 2 commits
- Feature/large scale kv save base/delta (#27470) · 202bfab1
  Committed by tangwei12 on Oct 15, 2020
```
* add size method for large scale

* add large scale UT

* add ut for checkpoint
```
- 【paddle.fleet】raise error when using multi-cards in fleet non_distributed mode (#27854) · 8d7908f3
  Committed by danleifeng on Oct 15, 2020
```
* raise error if use multi-cards in fleet non_distributed mode; test=develop
```
14 Oct, 2020 · 1 commit
- remove scale loss and coll grads, test=document_fix (#27874) · ed31dac6
  Committed by Chen Weihang on Oct 14, 2020