Commits · 02c370c3dc5424941efa6e231a122e8ee80593d6 · BaiXuePrincess / Paddle

02 Aug, 2019 1 commit

support filelist size < trainer num && fix pull dense (#18956) · 02c370c3

Committed by jiaqi on Aug 02, 2019

* support filelist size < trainer num
* pull dense params when training stops, so that local dense params match the pserver's and a saved Paddle model contains the same dense model as the pserver
* enable QueueDataset to train on the same filelist several times

02c370c3
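
The first bullet of this commit (filelist size < trainer num) can be illustrated with a small sketch. The round-robin rule below is an assumption made for illustration, not Paddle's actual implementation, and the function body is hypothetical; it only shows that trainers beyond the end of the file list receive an empty shard instead of causing an error.

```python
def split_filelist(filelist, trainer_id, trainer_num):
    """Hypothetical round-robin split of a file list across trainers.

    Assumption: trainer i takes files i, i + trainer_num, i + 2 * trainer_num, ...
    When len(filelist) < trainer_num, the surplus trainers get an empty list.
    """
    if trainer_num <= 0 or not 0 <= trainer_id < trainer_num:
        raise ValueError("trainer_id must be in [0, trainer_num)")
    return filelist[trainer_id::trainer_num]


# Two files shared among four trainers: trainers 2 and 3 get nothing to read.
files = ["part-0", "part-1"]
shards = [split_filelist(files, i, 4) for i in range(4)]
for trainer_id, shard in enumerate(shards):
    print(trainer_id, shard)
```

With this rule every file is assigned to exactly one trainer, so the union of all shards is the original file list.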

01 Aug, 2019 1 commit
- J
  adjust ins weight according to nid slot (#18784) · 768059b3
  Committed by jiaqi on Aug 01, 2019
```
adjust ins weight according to nid slot; user can specify adjust_ins_weight in strategy
```
  768059b3
31 Jul, 2019 1 commit

set fleet_send_batch_num a default value according to trainer num · 233746d8

Committed by jiaqi on Jul 31, 2019

(1) set fleet_send_batch_num to a default value derived from trainer num. The previous value was fixed at 80000; if trainer num is much smaller or larger than 100, global shuffle may hit a timeout error.

(2) fix load one table bug, add barrier

233746d8

29 Jul, 2019 1 commit

add clear_model interface in fleetwrapper (#18815) · 52c1431e

Committed by Thunderbrook on Jul 29, 2019

* dump slot

* test

* proto

* dump slot

* test

* proto

* code style

* code style

* code style

* style

* add delete after unseen days

* add unseen days

* code style

* conflict solve
test=develop

* add clear model

* code style
test=develop

* code style
test=develop

52c1431e

25 Jul, 2019 2 commits
- G
  refine launch_ps and role_maker (#18795) · 30562e37
  Committed by guru4elephant on Jul 25, 2019
```
refine launch_ps and role_maker
```
  30562e37
- F
  Fix shrink-dense and add scale-datanorm (#18746) · c167a4b4
  Committed by fuyinno4 on Jul 25, 2019
```
Fix FleetWrapper:
1. fix shrink dense: just scale show
2. add datanorm scale: divide datanorm's gradient by batch_size
```
  c167a4b4
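
Item 2 of this commit message (divide datanorm's gradient by batch_size) amounts to an element-wise scaling. The snippet below is a minimal sketch of that arithmetic only; the function name and the plain-list gradient are assumptions for illustration and do not mirror FleetWrapper's C++ code.

```python
def scale_datanorm_grad(grad, batch_size):
    """Hypothetical helper: divide each gradient element by the batch size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [g / batch_size for g in grad]


# A gradient accumulated over 4 instances becomes a per-instance average.
scaled = scale_datanorm_grad([8.0, 4.0, 2.0], batch_size=4)
print(scaled)
```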
24 Jul, 2019 1 commit

add slot to sparse table (#18686) · d8396281

Committed by Thunderbrook on Jul 24, 2019

The change includes 2 things:

1. saving the delta model and shrinking the table were previously controlled by the same parameter; delete_after_unseen_days is now added to control table shrinking.
2. values in the sparse table previously had no slot; slot is now added to the sparse table, and DownpourCtrAccessor is added to support the new meta.
test=develop

d8396281
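
The role of delete_after_unseen_days in this commit can be pictured with a toy table. This is a minimal sketch under stated assumptions: the dict layout, the `unseen_days` and `show` field names, and the "drop when unseen_days exceeds the threshold" comparison are all hypothetical and are not taken from pslib.

```python
def shrink_table(table, delete_after_unseen_days):
    """Return a copy of `table` without entries that went unseen for too long.

    Assumption: an entry is dropped when its unseen_days is strictly greater
    than delete_after_unseen_days.
    """
    return {
        key: value
        for key, value in table.items()
        if value["unseen_days"] <= delete_after_unseen_days
    }


# Each value carries a slot id, reflecting the "slot in sparse table" change.
table = {
    101: {"slot": 6001, "show": 12.0, "unseen_days": 0},
    102: {"slot": 6001, "show": 1.0, "unseen_days": 31},
    103: {"slot": 6002, "show": 3.0, "unseen_days": 30},
}
kept = shrink_table(table, delete_after_unseen_days=30)
print(sorted(kept))
```

Keeping the shrink threshold separate from whatever triggers a delta-model save means the two can be tuned independently, which is the point the commit message makes.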

23 Jul, 2019 1 commit

support patch data, add load_one_table, fix bug (#18509) · d18aabb4

Committed by jiaqi on Jul 23, 2019

(1) support patch data (merge slots of instances with the same line id; modify dense layers whose size changes)
(2) add fleet load_one_table interface, supporting load from a Paddle model and from a pslib model
(3) fix a push sparse bug that made push sparse take more time (about 10% in my test case)
(4) when some slots are not in one of your networks (join/update, etc.), data feed, collect label info, and push/pull sparse now skip these slots instead of throwing an error
(5) add more debug info in TrainFilesWithProfiler

d18aabb4
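
Item (1) of this commit, merging the slots of instances that share a line id, can be sketched as follows. The `(line_id, slots)` tuple layout and the function name are assumptions for illustration; the real data feed works on its own record type.

```python
from collections import OrderedDict


def merge_by_line_id(instances):
    """Hypothetical merge: combine slot values of instances with the same line id.

    `instances` is a list of (line_id, {slot_name: [feasign, ...]}) pairs.
    First-seen order of line ids is preserved.
    """
    merged = OrderedDict()
    for line_id, slots in instances:
        target = merged.setdefault(line_id, {})
        for slot, feasigns in slots.items():
            # Append this instance's values to any already collected for the slot.
            target.setdefault(slot, []).extend(feasigns)
    return merged


instances = [
    ("a", {"s1": [1]}),
    ("b", {"s1": [2]}),
    ("a", {"s2": [3], "s1": [4]}),  # patch for line "a"
]
merged = merge_by_line_id(instances)
for line_id, slots in merged.items():
    print(line_id, slots)
```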

22 Jul, 2019 1 commit
- T
  do some odd jobs (#18641) · d8458483
  Committed by tangwei12 on Jul 22, 2019
```
do some odd jobs, test=develop
```
  d8458483
10 Jul, 2019 1 commit
- G
  upgrade collective fleet api (#18533) · 9c17a899
  Committed by guru4elephant on Jul 10, 2019
```
* upgrade collective fleet api
```
  9c17a899
08 Jul, 2019 1 commit
- G
  add random port (#18504) · 1f1cc222
  Committed by guru4elephant on Jul 08, 2019
```
* add random port
```
  1f1cc222
02 Jul, 2019 1 commit
- G
  make fleet support mpi job submit directly (#18441) · 357311fd
  Committed by guru4elephant on Jul 02, 2019
```
make fleet support mpi job submit directly.
```
  357311fd
28 Jun, 2019 1 commit
- G
  add MultiSlotStringDataGenerator for speedup of string based user inp… (#18390) · e83f902b
  Committed by guru4elephant on Jun 28, 2019
```
* add MultiSlotStringDataGenerator for speedup of string based user input data
```
  e83f902b
27 Jun, 2019 2 commits

fix communicator with pyreader (#18350) · 999d9a59
Committed by tangwei12 on Jun 27, 2019
```
* add is_runnning in communicator, test=develop
```
999d9a59

supports collective communicated training (#18175) · b7128bac

Committed by HaoRen on Jun 27, 2019

* fix prepare context redundant code problem, optimize executor by caching create_varaiables
test=develop

* supports collective training in executor

* make fetch_list runnable with variables, add more unit tests for use_program_cache
test=develop

* fix comment
test=develop

* use unique name for nccl_id

* supports output to stream in program_to_code

* insert sync_comm_stream before regularization; add skip_op_callstack capability in program_to_code

* set op role in collective training

* add collective op role

* remove orig file

* add build optimizer by strategy

* add collective strategy

* refine collective strategy

* add multi-process role maker

* refine strategy building factory so that we can easily plug in more strategies

* scale loss grad in collective sgd transpiler

* add support for distributed fc

* code format

* revert some features for dist fc

* add support for distributed fc training

* fix prepare context redundant code problem, optimize executor by caching create_varaiables
test=develop

* supports collective training in executor

* make fetch_list runnable with variables, add more unit tests for use_program_cache
test=develop

* use unique name for nccl_id

* supports output to stream in program_to_code

* insert sync_comm_stream before regularization; add skip_op_callstack capability in program_to_code

* set op role in collective training

* add collective op role

* fix comment
test=develop

* remove orig file

* add build optimizer by strategy

* add collective strategy

* refine collective strategy

* add multi-process role maker

* refine strategy building factory so that we can easily plug in more strategies

* scale loss grad in collective sgd transpiler

* add support for distributed fc

* code format

* revert some features for dist fc

* add support for distributed fc training

* test=develop
add collective op unittest standard

* test=develop
remove the test_collective directory

* test=develop
remove the test_collective directory

* remove slicegather test

* code format for reducescatter

* update attr of shard_index_op

* Modify macro nccl_helper

* remove test without distribute

* macro collective_helper

* macro update

* test=develop
update support python3.5

* test=develop change gpu memory use to 0.1 when test

* test=develop
update ut equal func

* test=develop
set flags to 1.5

* test=develop fix pickle dump py35

* test=develop
fix divide in slice and add sync_comm_stream
update atol and rtol to 1e-05
rm shard_index op and test
modify read input from file to read from memory
remove origin_program in framework and add i/o in c_sync_calc_stream

* test=develop update unittest sync operator I/O

b7128bac

23 Jun, 2019 1 commit
- G
  fix paddle cloud role maker bug (#18269) · ff399fd7
  Committed by guru4elephant on Jun 23, 2019
```
* fix paddle cloud role maker bug
```
  ff399fd7
21 Jun, 2019 1 commit
- S
  
  fix bug in Class MultiSlotDataGenerator's function _gen_str, test=develop (#18222) · 432fda51
  Committed by songhao on Jun 21, 2019
  
  432fda51
17 Jun, 2019 2 commits

assign role_maker before use (#18137) · 23f8a4b1
Committed by Qiao Longfei on Jun 17, 2019
```
fix role_maker bug
test=develop
```
23f8a4b1

add paddle cloud role maker for customized usage, note this is only for... · 58f3e1ba

Committed by guru4elephant on Jun 17, 2019

add paddle cloud role maker for customized usage, note this is only for industrial users that have cloud environment pre-configuration (#18121)

add paddle cloud role maker for specific cloud usage. This PR simplifies user configuration in distributed training.

58f3e1ba

13 Jun, 2019 1 commit
- T
  
  fix bug in fleet, test=develop (#18058) · 4c735f24
  Committed by tangwei12 on Jun 13, 2019
  
  4c735f24
12 Jun, 2019 2 commits

fix save/load in fleet (#17675) · 101f74cb
Committed by tangwei12 on Jun 12, 2019
```
* fix save/load in Fleet
* add UT framework of Fleet
```
101f74cb

fix logging basicConfig cannot be set after import paddle (#17786) · 96ee528e

Committed by Kaipeng Deng on Jun 12, 2019

* fix logging unable. test=develop

* unset sys.stdout for stream handler. test=develop

* fix newly added basicConfig. test=develop

* fix import error. test=develop

96ee528e

11 Jun, 2019 1 commit

add UserDefinedCollectiveRoleMaker for collective mode (#17898) · b5c35ae3

Committed by lilong12 on Jun 11, 2019

* add 'UserDefinedRoleMakerNCCL' for collective mode.

* code style

* add the name UserDefinedRoleMakerNCCL to __all__

* rename to UserDefinedRoleMakerCollective

* rename to UserDefinedCollectiveRoleMaker

b5c35ae3

23 May, 2019 1 commit
- Q
  Async exe support communicator (#17386) · 58f7695a
  Committed by Qiao Longfei on May 23, 2019
```
Async exe support communicator
```
  58f7695a
17 May, 2019 1 commit
- J
  support sparse table get shard_num from TableParameter (#17443) · 05df39ac
  Committed by jiaqi on May 17, 2019
```
test=develop
```
  05df39ac
15 May, 2019 1 commit

support config file, cvm, load, save, shrink (#17319) · 34369944

Committed by jiaqi on May 15, 2019

* support config file, cvm, load, save, shrink
test=develop

* fix error of worker_num & add table.compress_in_save
test=develop

* fix code style
test=develop

* fix save model bug
test=develop

34369944

09 May, 2019 1 commit

Reformat fleet API (#17135) · 565d3095

Committed by tangwei12 on May 09, 2019

* fix some logic in distributed transpiler, test=develop
* reformat fleet API, test=develop

565d3095

25 Apr, 2019 1 commit
- T
  Fleet unify distributed training (#16791) · 1a4a51db
  Committed by tangwei12 on Apr 25, 2019
```
* implement distributed transpiler with fleet
```
  1a4a51db
11 Apr, 2019 1 commit
- D
  
  fix code style for incubator · ceac9df8
  Committed by dongdaxiang on Apr 10, 2019
  
  ceac9df8
10 Apr, 2019 1 commit
- X
  add Example in doc string of split_filelist · e784884e
  Committed by xjqbest on Apr 10, 2019
```
test=develop
```
  e784884e
09 Apr, 2019 3 commits
- X
  fix code style · 1c0ef929
  Committed by xjqbest on Apr 09, 2019
```
test=develop
```
  1c0ef929
- X
  fix code style · 19381329
  Committed by xujiaqi01 on Apr 09, 2019
```
test=develop
```
  19381329
- X
  move split filelist from trainer.py to fleet & fix error · d5ee580c
  Committed by xjqbest on Apr 09, 2019
```
test=develop
```
  d5ee580c
05 Apr, 2019 1 commit
- X
  fix init_worker bug · 126d2a2f
  Committed by xjqbest on Apr 05, 2019
```
test=develop
```
  126d2a2f
04 Apr, 2019 2 commits
- X
  fix code style · 7a759d76
  Committed by xjqbest on Apr 04, 2019
```
test=develop
```
  7a759d76
- X
  fix runtime error · 5e513928
  Committed by xjqbest on Apr 04, 2019
```
test=develop
```
  5e513928
30 Mar, 2019 1 commit
- X
  fix client to client communication bug · a99c8d0c
  Committed by xjqbest on Mar 30, 2019
```
test=develop
```
  a99c8d0c
29 Mar, 2019 3 commits
- X
  fix code style & runtime error · a38b98cb
  Committed by xjqbest on Mar 26, 2019
```
test=develop
```
  a38b98cb
- D
  
  make role maker and distributed optimizer private · 17790188
  Committed by dongdaxiang on Mar 26, 2019
  
  17790188
- X
  add doc string · d52586a9
  Committed by xjqbest on Mar 25, 2019
```
test=develop
```
  d52586a9

BaiXuePrincess / Paddle is up to date with the fork source project
