- 12 Jan 2021 (2 commits)

  Committed by Chengmo

  * Fix server.h include device_context (#30243)
  * fix cmake
    Co-authored-by: seiriosPlus <tangwei12@baidu.com>
  * 【Paddle.Fleet】Support local save sparse param (#30175)
  * add save tensor support
    Co-authored-by: seiriosPlus <tangwei12@baidu.com>
  * add sparse embedding & load vars for 2.0 & gloo bug fix (#30306)
  * add sparse embedding & load vars for 2.0
    Change-Id: I36b59ed5f015189dc9d9d2e34a9357722d369f1b
  * fix hdfs gloo
    Change-Id: Ia84d579053720ad804183e54c9a04b4f031c79c6
  * fix gloo hdfs
    Change-Id: I5ab982fd483cddc10adcdef0b8aa83aca976cb9e
  * move loadvar/sparse embedding from incubate to static
    Change-Id: I57081d3545ad2efab78c72420d2162c0eacaf3a0
    Co-authored-by: tangwei12 <tangwei12@baidu.com>
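
For context on the sparse-embedding entries above, a minimal static-graph sketch of a distributed sparse lookup, assuming the paddle.static.nn.sparse_embedding interface; the variable names and table size are illustrative, not taken from the PR:

```python
# Hypothetical sketch: a sparse (parameter-server) embedding lookup in a
# static-graph program. Names and sizes are illustrative only.
import paddle

paddle.enable_static()

# int64 ids of the sparse features, fed at runtime.
ids = paddle.static.data(name="word_ids", shape=[None, 1], dtype="int64")

# Unlike a dense embedding parameter, sparse_embedding keeps the table on the
# parameter servers and only pulls the rows that the given ids touch.
emb = paddle.static.nn.sparse_embedding(input=ids, size=[100000, 64])
```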

  Committed by Chengmo

- 11 Jan 2021 (2 commits)

  Committed by WangXi

  * Optimization grad merge performance (#29784)
  * [fleet] combine amp and gradient merge, test=develop (#30086)
  * fix assign_op_xpu concat_op_xpu warning (#30120)
    Co-authored-by: liuyuhui <liuyuhui@baidu.com>
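
The grad-merge/AMP entry above combines the two strategies through fleet's DistributedStrategy; a minimal configuration sketch, with illustrative values for k_steps and avg rather than recommendations:

```python
# Sketch: enabling AMP together with gradient merge via DistributedStrategy.
# The numeric values are examples, not tuned settings.
import paddle.distributed.fleet as fleet

strategy = fleet.DistributedStrategy()
strategy.amp = True                                            # mixed precision
strategy.gradient_merge = True                                 # accumulate grads over k steps
strategy.gradient_merge_configs = {"k_steps": 4, "avg": True}

fleet.init(is_collective=True, strategy=strategy)
```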

  Committed by Chen Weihang

  att, cherry-pick of #30219

- 08 Jan 2021 (1 commit)

  Committed by Chen Weihang

  * Simplify the options of spawn based on fleetrun (#30144)
  * Simplify the options of spawn based on fleetrun
  * polish details
  * polish doc details
  * cleanup enum test=develop (#29294)
    Co-authored-by: gongweibao <weibao.gong@gmail.com>
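
The spawn entries above refer to paddle.distributed.spawn; a minimal launch sketch (the training function body is a placeholder):

```python
# Sketch: launching multi-process data-parallel training with spawn.
# Only the launch pattern matters here; train() is a placeholder.
import paddle
import paddle.distributed as dist

def train():
    dist.init_parallel_env()                 # set up the process group
    model = paddle.DataParallel(paddle.nn.Linear(10, 10))
    # ... build the optimizer and run the training loop ...

if __name__ == "__main__":
    dist.spawn(train, nprocs=2)              # one process per device
```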

- 06 Jan 2021 (1 commit)

  Committed by gongweibao

  * fix log test=release/2.0
  * fix ut test=develop

- 05 Jan 2021 (2 commits)

  Committed by gongweibao

  Committed by Chen Weihang

  Set FLAGS_selected_gpus for spawn. When a child process starts, it inherits the configuration of the main process and sets the FLAGS once, but at that point the environment variable has not been set yet, so FLAGS_selected_gpus stays the same as in the main process (usually empty); the flags therefore have to be updated manually here. Note: a unit test was added and then removed, because nvidia-smi on the CI machines shows only two GPUs, and more than two cards are needed to exercise this problem.
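
A minimal sketch of the idea described above, not the actual Paddle implementation: re-read the environment variable inside the spawned child and push it into the process-level flag, since the value inherited at import time may still be empty:

```python
# Sketch (illustrative, not Paddle's internal code): refresh FLAGS_selected_gpus
# in the child process after the launcher has exported the environment variable.
import os
import paddle

def refresh_selected_gpus():
    selected = os.environ.get("FLAGS_selected_gpus", "")
    if selected:
        # paddle.set_flags pushes the value into the global flags registry,
        # overriding whatever was captured when the process started.
        paddle.set_flags({"FLAGS_selected_gpus": selected})
```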

- 31 Dec 2020 (3 commits)
- 25 Dec 2020 (1 commit)

  Committed by tangwei12

  * add ps table (#29463)
  * add ps table
    Change-Id: I468a04bd071d21ff52654926fcf4d5f3da19e178
  * add service (#29560)
  * add service, remove ut on mac
  * fix heter_profiler & add heter stop method
  * fix code style
  * merge pscore
    Change-Id: Ie7f60d1cdde6755a0c29db26863c6283e9843d57
  * fix cmake
    Change-Id: I6773509a7b4ca79139ecc40b7bf3eb318ceff8bb
  * fix conflict
    Change-Id: I35575be0c96a8520f9d756ea7f1ff0b904a165ba
  * fix conflict
    Change-Id: Ic926ea0b0d67803226d51241397ba3b510226bfa

- 22 Dec 2020 (2 commits)
- 17 Dec 2020 (1 commit)

  Committed by ShenLiang

  * Fix the download bug in the case of multiple machines (#29551)
  * fix the download bug
  * add sort for ips
  * Fix bug of matmul_v2 for broadcast case (#29599)
  * fix bug of matmul_v2 for broadcast
  * Rebuild group automatically in dynamic graph distributed (#29255)
  * add tensor_indices in AssignGroupBySize
  * add rebuild group in reducer
  * fix error message of gather nd (#29521)
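
The matmul_v2 items above concern broadcasting over batch dimensions; a small example of the case in question, with illustrative shapes:

```python
# Sketch: batched matmul where the leading batch dimensions broadcast,
# the situation the matmul_v2 broadcast fix is about.
import paddle

x = paddle.randn([2, 1, 3, 4])   # batch dims (2, 1)
y = paddle.randn([5, 4, 6])      # batch dim  (5,) -> broadcasts to (2, 5)
out = paddle.matmul(x, y)
print(out.shape)                 # [2, 5, 3, 6]
```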

- 16 Dec 2020 (1 commit)

  Committed by JZ-LIANG

  * Sharding add hybrid-dp feature
  * update sharding in distributed_strategy
  * update sharding unittest
  * revise code format for sharding
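
Sharding is switched on through DistributedStrategy; a minimal sketch, where the sharding_configs keys are an assumption about the options of that era and may differ between Paddle versions:

```python
# Sketch: enabling sharding via fleet's DistributedStrategy. Treat the config
# dict as an illustrative assumption, not an authoritative reference.
import paddle.distributed.fleet as fleet

strategy = fleet.DistributedStrategy()
strategy.sharding = True
strategy.sharding_configs = {"fuse_broadcast_MB": 32}

fleet.init(is_collective=True, strategy=strategy)
```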

- 08 Dec 2020 (1 commit)

  Committed by lilong12

  * update, test=develop (#29331)

- 04 Dec 2020 (1 commit)

  Committed by ShenLiang

- 03 Dec 2020 (2 commits)
- 01 Dec 2020 (1 commit)

  Committed by 123malin

  * fix fleet api doc

- 30 Nov 2020 (2 commits)
- 27 Nov 2020 (4 commits)

  Committed by ShenLiang

  * add reducer
  * refine event for memory copy
  * add concat & split for allreduce
  * apply concat & split for fuse tensor
  * fix nccl dep
  * fix the unittest, compile problem and ddp initialize problem
  * fix unittest for mac & add some comments & solve the repeated param in sublayers
  * fix unittest for windows & fix document
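
The reducer entries above follow the usual gradient-fusion pattern: concatenate several gradients into one buffer, allreduce once, then split the result back. A minimal sketch of that pattern, not the reducer's real code:

```python
# Sketch of the fuse-then-allreduce pattern (illustrative, not the actual
# reducer implementation). Assumes init_parallel_env() has been called.
import math
import paddle
import paddle.distributed as dist

def fused_allreduce(grads):
    shapes = [g.shape for g in grads]
    # Pack all gradients into one contiguous buffer so a single allreduce
    # replaces many small ones.
    flat = paddle.concat([g.flatten() for g in grads])
    dist.all_reduce(flat)
    # Split the buffer back and restore the original shapes.
    sizes = [math.prod(s) for s in shapes]
    return [p.reshape(s) for p, s in zip(paddle.split(flat, sizes), shapes)]
```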

  Committed by Chen Long

  Committed by lilong12

  Committed by lilong12

- 26 Nov 2020 (5 commits)

  Committed by ShenLiang

  * add InMemoryDataset

  Committed by JZ-LIANG

  * add lars to fleet meta optimizer
  * add lamb to proto
  * add lamb to fleet meta optimizer
  * fixed syntax bug
  * fixed syntax bug
  * fixed syntax error in lamb, add config setter of lamb in distributed_strategy
  * trigger unittest to rerun
  * add new unittest func for lamb
  * revise unittest for lars and lamb
  * revise dgc meta unittest
  * revise lars document in distributed_strategy
  * revise lars lamb document in distributed_strategy.py
  * revise lars lamb document in distributed_strategy.py
  * add weight decay exclude logic to lars
  * restore optimizer.py
  * restore optimizer.py as develop except lars
  * add epsilon and exclude fn to distributed_strategy
  * add lars epsilon
  * revise unittest for fleet lars and lamb
  * revise lars lamb unittest for CI coverage
  * revise lars argument api
  * revise lars argument api
  * revise lars argument api
  * revise api doc of lars
  * fix op role
  * add sharding save and add_sync_comm_for_test function
  * add comm_analyse to utils
  * revise sharding_utils
  * add sharding saving unittest
  * revise sharding utils for unittest
  * revise sharding en doc
  * update sharding utils api
  * add doc for sharding
  * fixed bug in sharding var size count
  * update varsize count in sharding
  * fix sharding num_nccl_comm
  * Revert "fix sharding num_nccl_comm"
    This reverts commit d51587c15e9323acf226ddd36154275f0d1daf76.
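
The LARS/LAMB meta-optimizer work above is configured through DistributedStrategy; a minimal sketch, where the coefficients and the weight-decay exclude list are illustrative examples rather than recommendations:

```python
# Sketch: turning on the LARS meta optimizer via DistributedStrategy.
# All numbers and the exclude list are illustrative examples.
import paddle.distributed.fleet as fleet

strategy = fleet.DistributedStrategy()
strategy.lars = True
strategy.lars_configs = {
    "lars_coeff": 0.001,
    "lars_weight_decay": 0.0005,
    "epsilon": 0,
    "exclude_from_weight_decay": ["batch_norm", ".b_0"],
}

fleet.init(is_collective=True, strategy=strategy)
```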

  Committed by lilong12

  * update, test=develop

  Committed by WangXi

  Committed by gongweibao

- 24 Nov 2020 (3 commits)

  Committed by Chen Weihang

  * polish parallel api impl & doc details
  * add unittest for coverage
  * remove spawn test in py2.7
  * add parallel api into white list

  Committed by Leo Chen

  * upgrade comment string to raw string
  * fix string in
  * fix string with ' '
  * revert update on comments
  * upgrade only necessary
  * fix sample code checker
  * fix comments with '''
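
The raw-string entries above are about docstrings that contain backslash sequences (for example LaTeX in API docs); prefixing them with r keeps the backslashes literal. A tiny illustration with a hypothetical function:

```python
# Illustration of why docstrings with backslashes are written as raw strings.
# The function is hypothetical; only the r"""...""" prefix matters.
def softmax_doc_example():
    r"""Compute softmax.

    .. math::
        \mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}

    Without the leading ``r``, sequences such as ``\f`` would be treated as
    escape characters and trigger warnings on newer Python versions.
    """
```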

  Committed by 123malin

  * test=develop, optimize global_step

- 23 Nov 2020 (2 commits)

  Committed by lilong12

  * update, test=develop

  Committed by Chen Weihang

- 18 Nov 2020 (1 commit)

  Committed by JZ-LIANG

  * add lars to fleet meta optimizer
  * add lamb to proto
  * add lamb to fleet meta optimizer
  * fixed syntax bug
  * fixed syntax bug
  * fixed syntax error in lamb, add config setter of lamb in distributed_strategy
  * trigger unittest to rerun
  * add new unittest func for lamb
  * revise unittest for lars and lamb
  * revise dgc meta unittest
  * revise lars document in distributed_strategy
  * revise lars lamb document in distributed_strategy.py
  * revise lars lamb document in distributed_strategy.py
  * add weight decay exclude logic to lars
  * restore optimizer.py
  * restore optimizer.py as develop except lars
  * add epsilon and exclude fn to distributed_strategy
  * add lars epsilon
  * revise unittest for fleet lars and lamb
  * revise lars lamb unittest for CI coverage
  * revise lars argument api
  * revise lars argument api
  * revise lars argument api
  * revise api doc of lars
  * fix op role
  * add sharding save and add_sync_comm_for_test function
  * add comm_analyse to utils
  * revise sharding_utils
  * add sharding saving unittest
  * revise sharding utils for unittest

- 17 Nov 2020 (1 commit)

  Committed by lilong12

- 16 Nov 2020 (1 commit)

  Committed by danleifeng