Commits · 7b29c89b268185ef410760e65fa1b5304dbd028b · 机器未来 / Paddle

09 Aug, 2022 1 commit

refine save/load interface for distributed cpups (#44862) · 7b29c89b

Committed by zhaocaibei123 on Aug 09, 2022

* save load

* save load

* add unittest

* first commit

* second commit

* third commit

* remove SaveLocalFS in memory sparse table

* save dense param

* update

* push slot

* fix push show clk: int -> float

* add unittest

* fix sample

* unittest

* add AsExtra for op

* unittest

* modify fs.py

* modify fs.py

* fix some bugs

* add dataset hdfs config

* local change

* dataset use different hadoop ugi/fs_name

* add

* fix conflict

* fix

* remove logs

* code style

* fix

* code style

* code style

* fix

* code style

* save_dense_param

* fix

* fix

* fix

* fix

* change momentum in dense optimizer

* fix

* fix

* change fluid => paddle.static

* remove some useless code
Co-authored-by: esythan <esythan@126.com>

7b29c89b

05 Jun, 2022 1 commit

【code format check upgrade】 step2：yapf (#42944) · a072fca8

Committed by Sing_chan on Jun 05, 2022

* use yapf to format all python file

* exclude two unittest files from yapf, since they rely on writing and reading files and formatting would break them

* disable diff_py_file, because too many diff files cause the following command to fail

a072fca8

22 Apr, 2022 1 commit

Ssd sparse table (#41812) · cca57c4a

Committed by zhaocaibei123 on Apr 22, 2022

* [cherry-pick2.3]fix compile bug of windows cuda11.5 (#41464)

cherry-pick

fix compile bug of windows cuda11.5 #41433

* fix bug of missing boost when compile cache.cc (#41449)

【cherry-pick #41430】fix bug of random compile failure, due to incorrect compile order of dependencies

* Fix eager try catch (#41438) (#41477)

[Cherry-Pick]Fix eager try catch (#41438)

* Cherry-pick-PR41407, fix device_id bug for final_state op in multiprocess testcase (#41407) (#41475)

Cherry-pick PR #41407

* [BugFix] Add error hint for one_hot gpu version (#41335) (#41495)

* add one_hot gpu hint

* move allow_out_of_range judgement

* delete useless unittest

* fix bugs of reshape double grad infermeta (#41459) (#41493)

* [cherrypick-2.3] modify infer gpu memory strategy (#41427), remove cudnn_deterministic=True (#41341)  (#41491)
Co-authored-by: JingZhuangzhuang <75348594+JZZ-NOTE@users.noreply.github.com>

* [Cherry-pick][ROCm] fix dcu error in device event base, test=develop (#41523)

Cherry-pick of #41521

* [Cherry-Pick]Cherry pick PR41200, PR41474, PR41382 (#41509)

* Use `self` as a parameter of _hash_with_id function to avoid error caused by hash_id reuse (#41200)

* Add fill_constant_batch_size YAML and UT (#41474)

* Switch some dy2st UT to eager mode (#41382)

* Switch some dy2st UT to eager mode

* Fix test_lstm and remove test_transformer

* Run test_resnet_v2 in old dy mode

* Unittest recover (#41431)

* update name

* update name

* fix test

* fix fleet bind

* update name

* update name

* fix test

* fix gpups wrapper

* remove Push/Pull/Load/Save with context in client and wrapper base class

* fix

* fix

* remove some interface

* fix

* remove

* code style

* recover

* fix

* remove code unused

* remove some unused table & accessor & CommonDenseTable => MemoryDenseTable

* fix

* fix

* fix

* recover

* remove unused code

* recover unittest

* fix

* remove

* fix

* remove code unuseful

* remove

* fix

* recover

* remove
Co-authored-by: esythan <esythan@126.com>

* add ssd sparse table

* fix

* add cache shuffle

* fix

* fix

* fix

* fix

* fix

* fix

* add unit test

* fix
Co-authored-by: Zhou Wei <1183042833@qq.com>
Co-authored-by: Sing_chan <51314274+betterpig@users.noreply.github.com>
Co-authored-by: 0x45f <23097963+0x45f@users.noreply.github.com>
Co-authored-by: pangyoki <pangyoki@126.com>
Co-authored-by: Siming Dai <908660116@qq.com>
Co-authored-by: YuanRisheng <yuanrisheng@baidu.com>
Co-authored-by: Zhang Jun <ewalker@live.cn>
Co-authored-by: JingZhuangzhuang <75348594+JZZ-NOTE@users.noreply.github.com>
Co-authored-by: Qi Li <qili93@qq.com>
Co-authored-by: esythan <esythan@126.com>

cca57c4a

09 Apr, 2022 1 commit

Unittest recover (#41431) · 7a07c4a5

Committed by zhaocaibei123 on Apr 09, 2022

* update name

* update name

* fix test

* fix fleet bind

* update name

* update name

* fix test

* fix gpups wrapper

* remove Push/Pull/Load/Save with context in client and wrapper base class

* fix

* fix

* remove some interface

* fix

* remove

* code style

* recover

* fix

* remove code unused

* remove some unused table & accessor & CommonDenseTable => MemoryDenseTable

* fix

* fix

* fix

* recover

* remove unused code

* recover unittest

* fix

* remove

* fix

* remove code unuseful

* remove

* fix

* recover

* remove
Co-authored-by: esythan <esythan@126.com>

7a07c4a5

23 Mar, 2022 1 commit

two-phase training for ps (#40762) · b1a4668c

Committed by zhaocaibei123 on Mar 23, 2022

* fix benchmark and communicator config

* fix bugs of the_one_ps

* multi program and fix bug in optimizer

* multi program in the_one_ps

* public commcontext

* ps optimizer multi programs

* cvm & datanorm backend

* fix dim

* fix unittest

* fix

* the one ps merge

* remove comm

* add DownpourLiteWorker

* all

* fix

* fix

* device worker downpour lite

* fix

* fix bug in global shuffle

* save inference model

* fix & add log

* fix

* remove log

* fix

* fix save summary

* fix

* fix pscore

* fix

* fix

* fix

* fix

* fix

* remove logs

* fix

* fix

* fix

* fix

* fix

* add some comments

* fix
Co-authored-by: esythan <esythan@126.com>

b1a4668c

30 Nov, 2021 1 commit

pscore global shuffle&default accessor config (#37626) · 1514eec6

Committed by zhaocaibei123 on Nov 30, 2021

1514eec6

26 May, 2021 1 commit

ut fix (#33102) · e05a7a49

Committed by tangwei12 on May 26, 2021


Change-Id: I2e82dfcee6a1d0512b94cebc32281123fa5bf597

* pretty print for datafeed error

Change-Id: I056a8b6f03608e96679a83846c97aed289cef7e6

* fix fleet dist infer ut

e05a7a49

30 Dec, 2020 1 commit

fix ut (#29989) · ed856d25

Committed by tangwei12 on Dec 30, 2020

* fix ut

Change-Id: I151e152919a1863db07792bffb42d0ca68995756

ed856d25

24 Dec, 2020 1 commit

[Feature] one ps (3/4) (#29604) · 032414ca

Committed by tangwei12 on Dec 24, 2020

* oneps (3/4)
Co-authored-by: MrChengmo <cmchengmo@163.com>
Co-authored-by: malin10 <malin10@baidu.com>
Co-authored-by: chengmo <chengmo@baidu.com>

032414ca

08 Sep, 2020 1 commit

【paddle.fleet】parameter_server_optimizer support auto_strategy (#26838) · f2d68d3e

Committed by 123malin on Sep 08, 2020

* test=develop, add ps auto

f2d68d3e

02 Sep, 2020 1 commit

supplement bug fix of parameter server (#26217) · d0962abd

Committed by Chengmo on Sep 02, 2020

* fix fluid.embedding

d0962abd

20 Aug, 2020 1 commit

add save/load for parameter server (#26235) · 57d434df

Committed by 123malin on Aug 20, 2020

* add save/load for parameter server

57d434df

19 Aug, 2020 1 commit

Fix ps gpu (#26218) · eeeef957

Committed by Chengmo on Aug 19, 2020

* support ps-gpu

eeeef957

07 Aug, 2020 1 commit

【paddle.fleet】fleet_util move to paddle.fleet (#25805) · 2191a083

Committed by 123malin on Aug 07, 2020

* test=develop,test=document_fix, remove the out args

* fleet_util move to paddle.fleet
Co-authored-by: WuHaobo <wuhaobo1994@gmail.com>
Co-authored-by: tangwei12 <tangwei12@baidu.com>

2191a083

30 Jul, 2020 1 commit

Integrated Trainer of Parameter Server (API add... · caa90a65

Committed by tangwei12 on Jul 30, 2020

Integrated Trainer of Parameter Server (API add `fluid.contrib.layers.sparse_embedding` only) (#22957)

* Integrated Trainer of Parameter Server

caa90a65

17 Feb, 2020 1 commit

support dumping params/grads in transpiler mode (#22490) · 00594c1c

Committed by 123malin on Feb 17, 2020

00594c1c

12 Feb, 2020 1 commit

fix bug with compiledProgram (#22495) · b0675c81

Committed by tangwei12 on Feb 12, 2020

* add thread barrier for the compiled program

b0675c81

17 Jan, 2020 1 commit

integrated HALF_ASYNC to communicator (#21869) · 82bc814a

Committed by tangwei12 on Jan 17, 2020

* add half_async in the communicator
* fix DistributedStrategy

82bc814a

06 Jan, 2020 1 commit

add distributed_strategy (#21710) · 7fb817d4

Committed by 123malin on Jan 06, 2020

* add distributed_strategy

7fb817d4

19 Dec, 2019 1 commit

modify the method of skipping CI in distributed unittests (#21764) · 3c334179

Committed by silingtong123 on Dec 19, 2019

3c334179

15 Oct, 2019 1 commit

Fix communicator slow bug & fix communicator stop bug (#20366) · 940c6ff1

Committed by Chengmo on Oct 15, 2019

* test=develop,Fix communicator slow bug

* test=develop, delete if() in stop_worker()

* test=develop

* fix UT, test=develop

* fix bug in fetch handler, test=develop

* fix bug in fetch handler, test=develop

* test=develop, fix fetch barrier bug

* test=develop, bug fix

* test=develop, bug fix

* test=develop, fix bug

940c6ff1

07 Oct, 2019 1 commit

trainer from dataset fetch targets (#19760) · c9139c3d

Committed by tangwei12 on Oct 07, 2019

add executor.FetchHandler for train/infer from the dataset

c9139c3d

27 Sep, 2019 1 commit

the integrated communicator (#19849) · 8f0b3c05

Committed by tangwei12 on Sep 27, 2019

* add a base class for the Communicator
* add AsyncCommunicator Impl for async distributed training

8f0b3c05

28 Aug, 2019 1 commit

Fix the correctness of async mode at distributed training (#18863) · 65c73684

Committed by tangwei12 on Aug 28, 2019

* fix correctness of the communicator

* fix a bug in send thread when sending var context is empty, test=develop

* add lookup_table_prefetch_op and prefetch optimize, test=develop

* remove GPU support for remote prefetch

* word2vec force with CPU, test=develop

* test dist remote lookup table force with CPU, test=develop

65c73684

22 Jul, 2019 1 commit

split different comm method for mnist distributed training (#18715) · ebf9797e

Committed by guru4elephant on Jul 22, 2019

* split different comm method for mnist distributed training

ebf9797e

12 Jun, 2019 1 commit

fix save/load in fleet (#17675) · 101f74cb

Committed by tangwei12 on Jun 12, 2019

* fix save/load in Fleet
* add UT framework of Fleet

101f74cb

29 Oct, 2018 1 commit

[1.1] [project] train imagenet using large batch size (#13766) · 26200f2e

Committed by Wu Yi on Oct 29, 2018

* fix nccl2 lars dist support

* put lars in momentum op

* add tests lars

* fix ci

* fix cpu kernel

* soft warning

* remove lars in test_recognize_digits.py

* move to another op

* add file

* update api.spec test=develop

* update test=develop

* fix api.spec test=develop

* wip

* wip, finish grad merge ops

* wip, finish graph build

* wip test running

* work on 1 gpu

* workable version

* update

* fix tests

* fuse broadcast op

* fix compile failed

* refine

* add batch merge test mnist

* fix CI test=develop

* fix build

* use independent bn params for batch merge test=develop

* update api.spec

* follow comments and for test

* wip

* refine tests test=develop

* follow comments test=develop

* remove startup bn modify test=develop

* follow comments test=develop

* fix merge test=develop

26200f2e

机器未来 / Paddle is in sync with the fork's source project
