Commits · 1799c257ad0e72f741064b1fd5965beaeb14f348 · BaiXuePrincess / Paddle

16 Aug, 2019 1 commit
- G
  Remove node_num function. (#19167) · 86f05911
  Committed by gongweibao on Aug 16, 2019
```
node_num is not needed by users, so remove it and fix the related bugs.
```
  86f05911
14 Aug, 2019 3 commits

J
fix default value (#19193) · b86be13c
Committed by jiaqi on Aug 14, 2019
```
* fix default value in ps_pb2.py:   delta_keep_days 30 -> 16
* test=develop
```
b86be13c

add get_last_save_xbox_base/get_last_save_xbox (#19122) · b104ea06

Committed by jiaqi on Aug 14, 2019

* add get_last_save_xbox_base/get_last_save_xbox
* fix fleet_util bug of load paddle model
* add doc string in fleet api

b104ea06

fix default value of fleet desc (#19176) · bfd514c7

Committed by jiaqi on Aug 14, 2019

* fix default value of fleet desc; default values are now the same as jingpai
* print log when save model

bfd514c7

12 Aug, 2019 1 commit
- G
  Polish fleet API to support cuda collective mode and nccl2 mode. (#18966) · 29d87812
  Committed by gongweibao on Aug 12, 2019
```
Polish fleet API to support cuda collective mode and nccl2 mode
```
  29d87812
11 Aug, 2019 1 commit

add save cache model api in fleet& add slots shuffle in dataset module & add... · 9150cf50

Committed by yaoxuefeng on Aug 11, 2019

add save cache model api in fleet& add slots shuffle in dataset module & add metric op to calculate ctr related metrics (#18871)

* add ctr related metric layer test=develop

* add save cache and slots shuffle test=develop

* add save cache and slots shuffle test=develop

* fix error

* fix error

* fix style for ci

* fix for comments

* change SlotsShuffle input to std::string for generality

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* change non-const reference to pointer

* fix style

* fix style

* fix style test=develop

* fix style  test=develop

* add return ins num in ctr metric op

* change dtype to float in metric_op.py

* fix error test=develop

* fix style test=develop

* fix API spec

* fix API spec

* fix API spec test=develop

* add UT test=develop

9150cf50

08 Aug, 2019 1 commit

add fleet util, add some interface in hdfs util (#18752) · a99bc64c

Committed by jiaqi on Aug 08, 2019

* add fleet util (fleet/utils/fleet_util.py): functions for users' convenience
* add some interfaces in hdfs util: hdfs is_file, hdfs cat

a99bc64c
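
The `is_file` and `cat` helpers named in this commit can be pictured as thin wrappers around the Hadoop command line. The following is only a minimal sketch under that assumption, not the code from the commit: the function names and the `hadoop_bin` parameter are hypothetical, and only the standard `hadoop fs -test -f` and `hadoop fs -cat` subcommands are relied on.

```python
import subprocess
from typing import List


def build_is_file_cmd(path: str, hadoop_bin: str = "hadoop") -> List[str]:
    """Command that exits with status 0 when `path` is a regular file."""
    return [hadoop_bin, "fs", "-test", "-f", path]


def build_cat_cmd(path: str, hadoop_bin: str = "hadoop") -> List[str]:
    """Command that writes the content of `path` to stdout."""
    return [hadoop_bin, "fs", "-cat", path]


def hdfs_is_file(path: str, hadoop_bin: str = "hadoop") -> bool:
    """Hypothetical helper: True if the HDFS path exists and is a file."""
    result = subprocess.run(
        build_is_file_cmd(path, hadoop_bin),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def hdfs_cat(path: str, hadoop_bin: str = "hadoop") -> str:
    """Hypothetical helper: return the file content as text."""
    result = subprocess.run(
        build_cat_cmd(path, hadoop_bin),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if the command fails
    )
    return result.stdout.decode("utf-8")
```

Building the argument list separately from running it keeps the command easy to inspect without a Hadoop client installed.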

02 Aug, 2019 1 commit

support filelist size < trainer num && fix pull dense (#18956) · 02c370c3

Committed by jiaqi on Aug 02, 2019

* support filelist size < trainer num
* pull dense params when stopping so that local dense params match the pserver's; saving a paddle model then saves the same dense model as the pserver
* enable QueueDataset to train on the same filelist several times

02c370c3
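
To illustrate what "filelist size < trainer num" implies, here is a small sketch of a contiguous split in which some trainers simply receive an empty list. It is an illustrative assumption, not the fleet implementation; the function name and signature are hypothetical.

```python
from typing import List, Sequence


def split_filelist(filelist: Sequence[str], trainer_num: int, trainer_id: int) -> List[str]:
    """Return the contiguous share of `filelist` owned by `trainer_id`.

    Hypothetical sketch: when there are fewer files than trainers,
    the trainers with the highest ids get an empty list instead of an error.
    """
    if trainer_num <= 0:
        raise ValueError("trainer_num must be positive")
    if not 0 <= trainer_id < trainer_num:
        raise ValueError("trainer_id must be in [0, trainer_num)")
    base, extra = divmod(len(filelist), trainer_num)
    # The first `extra` trainers take one additional file each.
    start = trainer_id * base + min(trainer_id, extra)
    end = start + base + (1 if trainer_id < extra else 0)
    return list(filelist[start:end])
```

With two files and four trainers, trainers 0 and 1 each get one file and trainers 2 and 3 get none, so every trainer can still start.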

01 Aug, 2019 1 commit
- J
  adjust ins weight according to nid slot (#18784) · 768059b3
  Committed by jiaqi on Aug 01, 2019
```
adjust ins weight according to nid slot; users can specify adjust_ins_weight in strategy
```
  768059b3
31 Jul, 2019 1 commit

set fleet_send_batch_num a default value according to trainer num · 233746d8

Committed by jiaqi on Jul 31, 2019

(1) set a default value for fleet_send_batch_num according to trainer num. The previous value was fixed at 80000; if trainer num is much smaller or larger than 100, global shuffle may hit a timeout error.

(2) fix load one table bug, add barrier

233746d8
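
One way to picture a trainer-dependent default is a simple linear rule. The sketch below is an assumption, not the formula used in the commit: the function name is hypothetical, and the per-trainer constant of 800 is chosen only so that 100 trainers reproduce the previous fixed value of 80000.

```python
def default_fleet_send_batch_num(trainer_num: int, per_trainer: int = 800) -> int:
    """Hypothetical default: scale the send batch num linearly with trainer count.

    `per_trainer=800` is an assumed constant (80000 / 100), not a value
    confirmed by the commit.
    """
    if trainer_num <= 0:
        raise ValueError("trainer_num must be positive")
    return per_trainer * trainer_num
```

Under this rule a 10-trainer job would use 8000 and a 1000-trainer job 800000, rather than both sharing one fixed value.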

29 Jul, 2019 1 commit

add clear_model interface in fleetwrapper (#18815) · 52c1431e

Committed by Thunderbrook on Jul 29, 2019

* dump slot

* test

* proto

* dump slot

* test

* proto

* code style

* code style

* code style

* style

* add delete after unseen days

* add unseen days

* code style

* conflict solve
test=develop

* add clear model

* code style
test=develop

* code style
test=develop

52c1431e

25 Jul, 2019 2 commits
- G
  refine launch_ps and role_maker (#18795) · 30562e37
  Committed by guru4elephant on Jul 25, 2019
```
refine launch_ps and role_maker
```
  30562e37
- F
  Fix shrink-dense and add scale-datanorm (#18746) · c167a4b4
  Committed by fuyinno4 on Jul 25, 2019
```
Fix FleetWrapper:
1. fix shrink dense: just scale show
2. add datanorm scale: divide datanorm's gradient by batch_size
```
  c167a4b4
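
The datanorm change is plain arithmetic: each gradient component is divided by the batch size. A minimal sketch, assuming the gradient is a flat list of floats (the function name is hypothetical, and the real code operates inside FleetWrapper, not in Python):

```python
from typing import List, Sequence


def scale_datanorm_grad(grad: Sequence[float], batch_size: int) -> List[float]:
    """Divide every gradient component by batch_size (illustrative only)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [g / batch_size for g in grad]
```

For a batch of 2, a gradient of `[2.0, 4.0]` becomes `[1.0, 2.0]`.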
24 Jul, 2019 1 commit

add slot to sparse table (#18686) · d8396281

Committed by Thunderbrook on Jul 24, 2019

The change includes 2 things:

1. saving the delta model and shrinking the table were previously controlled by the same parameter; delete_after_unseen_days is now added to control table shrinking.
2. values in the sparse table previously had no slot; slot is now added to the sparse table, and DownpourCtrAccessor is added to support the new meta.
test=develop

d8396281
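
The effect of a dedicated `delete_after_unseen_days` threshold can be sketched as a filter over table entries. This is a toy model under stated assumptions: the table is a plain dict, each value carries hypothetical `slot` and `unseen_days` fields, and entries are dropped once `unseen_days` exceeds the threshold. The real accessor logic lives in pslib and may differ.

```python
from typing import Dict


def shrink_sparse_table(table: Dict[int, dict], delete_after_unseen_days: int) -> Dict[int, dict]:
    """Keep only entries seen recently enough (illustrative assumption)."""
    return {
        key: value
        for key, value in table.items()
        if value["unseen_days"] <= delete_after_unseen_days
    }
```

Because shrinking has its own threshold, it can be tuned without touching the setting that governs delta model saving.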

23 Jul, 2019 1 commit

support patch data, add load_one_table, fix bug (#18509) · d18aabb4

Committed by jiaqi on Jul 23, 2019

(1) support patch data (merge slots of instances with the same line id; modify dense layers whose size changes)
(2) add fleet load_one_table interface, which supports loading from a paddle model and from a pslib model
(3) fix a push sparse bug that made push sparse take more time (about 10% in my test case)
(4) when some slots are not in one of your networks (join/update, etc.), data feed, collect label info, and push/pull sparse skip these slots instead of throwing an error
(5) add more debug info in TrainFilesWithProfiler

d18aabb4
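
Items (1) and (4) can be pictured together: instances that share a line id have their slots merged, and slots the current network does not use are skipped rather than rejected. The sketch below is an assumption about the idea only; the data layout (a line id plus a dict of slot name to feature signs) and all names are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Instance = Tuple[str, Dict[str, List[int]]]


def merge_by_line_id(
    instances: Iterable[Instance],
    used_slots: Optional[Set[str]] = None,
) -> Dict[str, Dict[str, List[int]]]:
    """Merge the slots of instances that share a line id.

    Slots not listed in `used_slots` are silently skipped; passing None
    keeps every slot.
    """
    merged: Dict[str, Dict[str, List[int]]] = {}
    for line_id, slots in instances:
        target = merged.setdefault(line_id, {})
        for slot, feasigns in slots.items():
            if used_slots is not None and slot not in used_slots:
                continue  # slot is not part of this network: skip, do not raise
            target.setdefault(slot, []).extend(feasigns)
    return merged
```

Skipping unknown slots lets the same input data feed several networks (for example join and update) that each use a different subset of slots.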

22 Jul, 2019 1 commit
- T
  do some odd jobs (#18641) · d8458483
  Committed by tangwei12 on Jul 22, 2019
```
do some odd jobs, test=develop
```
  d8458483
10 Jul, 2019 1 commit
- G
  upgrade collective fleet api (#18533) · 9c17a899
  Committed by guru4elephant on Jul 10, 2019
```
* upgrade collective fleet api
```
  9c17a899
08 Jul, 2019 1 commit
- G
  add random port (#18504) · 1f1cc222
  Committed by guru4elephant on Jul 08, 2019
```
* add random port
```
  1f1cc222
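
A common way to obtain a random free port, shown here only as a generic sketch and not as the code from this commit, is to bind a socket to port 0 and let the operating system choose. The function name is hypothetical.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port on `host`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS pick a free port
        return sock.getsockname()[1]
```

The port is released when the socket closes, so another process could claim it before the caller binds it again; callers should be prepared to retry.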
02 Jul, 2019 1 commit
- G
  make fleet support mpi job submit directly (#18441) · 357311fd
  Committed by guru4elephant on Jul 02, 2019
```
make fleet support mpi job submit directly.
```
  357311fd
27 Jun, 2019 2 commits

T
fix communicator with pyreader (#18350) · 999d9a59
Committed by tangwei12 on Jun 27, 2019
```
* add is_running in communicator, test=develop
```
999d9a59

supports collective communicated training (#18175) · b7128bac

Committed by HaoRen on Jun 27, 2019

* fix prepare context redundant code problem, optimize executor by caching create_varaiables
test=develop

* supports collective training in executor

* make fetch_list runnable with variables, add more unittest for use_program_cache
test=develop

* fix comment
test=develop

* use unique name for nccl_id

* supports output to stream in program_to_code

* insert sync_comm_stream before regularization; add skip_op_callstack capability in program_to_code

* set op role in collective training

* add collective op role

* remove orig file

* add build optimizer by strategy

* add collective strategy

* refine collective strategy

* add multi-process role maker

* refine strategy building factory so that we can easily plugin more strategy

* scale loss grad in collective sgd transpiler

* add support for distributed fc

* code format

* revert some features for dist fc

* add support for distributed fc training

* fix prepare context redundant code problem, optimize executor by caching create_varaiables
test=develop

* supports collective training in executor

* make fetch_list runnable with variables, add more unittest for use_program_cache
test=develop

* use unique name for nccl_id

* supports output to stream in program_to_code

* insert sync_comm_stream before regularization; add skip_op_callstack capability in program_to_code

* set op role in collective training

* add collective op role

* fix comment
test=develop

* remove orig file

* add build optimizer by strategy

* add collective strategy

* refine collective strategy

* add multi-process role maker

* refine strategy building factory so that we can easily plugin more strategy

* scale loss grad in collective sgd transpiler

* add support for distributed fc

* code format

* revert some features for dist fc

* add support for distributed fc training

* test=develop
add collective op unittest standard

* test=develop
remove the test_collective directory

* test=develop
remove the test_collective directory

* remove slicegather test

* code format for reducescatter

* update attr of shard_index_op

* Modify macro nccl_helper

* remove test without distribute

* macro collective_helper

* macro update

* test=develop
update support python3.5

* test=develop change gpu memory use to 0.1 when test

* test=develop
update ut equal func

* test=develop
set flags to 1.5

* test=develop fix pickle dumple  py35

* test=develop
fix divide in slice and add sync_comm_stream
update atol and rtol to 1e-05
rm shard_index op and test
modify read input from file to read from memory
remove origin_program in framework and add i/o in c_sync_calc_stream

* test=develop update unittest sync operator I/O

b7128bac

23 Jun, 2019 1 commit
- G
  fix paddle cloud role maker bug (#18269) · ff399fd7
  Committed by guru4elephant on Jun 23, 2019
```
* fix paddle cloud role maker bug
```
  ff399fd7
17 Jun, 2019 2 commits

Q
assign role_maker before use (#18137) · 23f8a4b1
Committed by Qiao Longfei on Jun 17, 2019
```
fix role_maker bug
test=develop
```
23f8a4b1

add paddle cloud role maker for customized usage, note this is only for... · 58f3e1ba

Committed by guru4elephant on Jun 17, 2019

add paddle cloud role maker for customized usage, note this is only for industrial users that have cloud environment pre-configuration (#18121)

add paddle cloud role maker for specific cloud usage. This PR simplifies users' configuration in distributed training.

58f3e1ba

13 Jun, 2019 1 commit
- T
  fix bug in fleet, test=develop (#18058) · 4c735f24
  Committed by tangwei12 on Jun 13, 2019
  
  4c735f24
12 Jun, 2019 2 commits

T
fix save/load in fleet (#17675) · 101f74cb
Committed by tangwei12 on Jun 12, 2019
```
* fix save/load in Fleet
* add UT framework of Fleet
```
101f74cb

fix logging basicConfig cannot be set after importing paddle (#17786) · 96ee528e

Committed by Kaipeng Deng on Jun 12, 2019

* fix logging unable. test=develop

* unset sys.stdout for stream handler. test=develop

* fix newly add basicConfig. test=develop

* fix import error. test=develop

96ee528e
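
The underlying issue is a general property of Python's `logging` module: `logging.basicConfig` does nothing once the root logger already has handlers, so a library that configures the root logger at import time blocks the user's later call. A minimal sketch of the usual remedy follows, with a hypothetical helper name; it is not the code from this commit. The library attaches a handler to its own named logger and leaves the root logger untouched.

```python
import logging


def get_library_logger(name: str, level: int = logging.INFO) -> logging.Logger:
    """Configure a library-owned logger without touching the root logger."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()  # no stream argument: uses sys.stderr
        handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
        logger.addHandler(handler)
    logger.propagate = False  # keep records from also reaching root handlers
    return logger
```

With this pattern, a user who imports the library first and calls `logging.basicConfig(...)` afterwards still gets their own root configuration.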

11 Jun, 2019 1 commit

add UserDefinedCollectiveRoleMaker for collective mode (#17898) · b5c35ae3

Committed by lilong12 on Jun 11, 2019

* add 'UserDefinedRoleMakerNCCL' for collective mode.

* code style

* add the name UserDefinedRoleMakerNCCL to __all__

* rename to UserDefinedRoleMakerCollective

* rename to UserDefinedCollectiveRoleMaker

b5c35ae3

23 May, 2019 1 commit
- Q
  Async exe support communicator (#17386) · 58f7695a
  Committed by Qiao Longfei on May 23, 2019
```
Async exe support communicator
```
  58f7695a
17 May, 2019 1 commit
- J
  support sparse table get shard_num from TableParameter (#17443) · 05df39ac
  Committed by jiaqi on May 17, 2019
```
test=develop
```
  05df39ac
15 May, 2019 1 commit

support config file, cvm, load, save, shrink (#17319) · 34369944

Committed by jiaqi on May 15, 2019

* support config file, cvm, load, save, shrink
test=develop

* fix error of worker_num & add table.compress_in_save
test=develop

* fix code style
test=develop

* fix save model bug
test=develop

34369944

09 May, 2019 1 commit

Reformat fleet API (#17135) · 565d3095

Committed by tangwei12 on May 09, 2019

* fix some logic in distributed transpiler, test=develop
* reformat fleet API, test=develop

565d3095

25 Apr, 2019 1 commit
- T
  Fleet unify distributed training (#16791) · 1a4a51db
  Committed by tangwei12 on Apr 25, 2019
```
* implement distributed transpiler with fleet
```
  1a4a51db
11 Apr, 2019 1 commit
- D
  fix code style for incubator · ceac9df8
  Committed by dongdaxiang on Apr 10, 2019
  
  ceac9df8
10 Apr, 2019 1 commit
- X
  add Example in doc string of split_filelist · e784884e
  Committed by xjqbest on Apr 10, 2019
```
test=develop
```
  e784884e
09 Apr, 2019 3 commits
- X
  fix code style · 1c0ef929
  Committed by xjqbest on Apr 09, 2019
```
test=develop
```
  1c0ef929
- X
  fix code style · 19381329
  Committed by xujiaqi01 on Apr 09, 2019
```
test=develop
```
  19381329
- X
  move split filelist from trainer.py to fleet & fix error · d5ee580c
  Committed by xjqbest on Apr 09, 2019
```
test=develop
```
  d5ee580c
05 Apr, 2019 1 commit
- X
  fix init_worker bug · 126d2a2f
  Committed by xjqbest on Apr 05, 2019
```
test=develop
```
  126d2a2f
04 Apr, 2019 1 commit
- X
  fix code style · 7a759d76
  Committed by xjqbest on Apr 04, 2019
```
test=develop
```
  7a759d76

BaiXuePrincess / Paddle is in sync with the fork's source project