Commits · a25a716e878be6798991c4f5bee0302eb07b0d33 · PaddlePaddle / Paddle

06 Sep, 2019 1 commit
- Optimize fleet API: add input check for some interfaces (#18971) · a25a716e
  Committed by 123malin on Sep 06, 2019
```
* fleet api add input check, test=develop
```
  a25a716e
05 Sep, 2019 1 commit
- fix the diff between async mode and async_half mode (#19535) · 2f037c31
  Committed by 123malin on Sep 05, 2019
```
* test=develop,  communicator merge add => merge average
```
  2f037c31
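
The commit message above says the communicator's merge step changed from adding to averaging. A minimal sketch of that difference, using plain Python lists in place of tensors; the function names `merge_add` and `merge_average` are hypothetical and are not Paddle's API.

```python
from typing import List


def merge_add(grads: List[List[float]]) -> List[float]:
    """Element-wise sum of the queued gradients (the old behaviour, as described)."""
    if not grads:
        raise ValueError("nothing to merge")
    return [sum(vals) for vals in zip(*grads)]


def merge_average(grads: List[List[float]]) -> List[float]:
    """Element-wise mean of the queued gradients (the new behaviour, as described)."""
    if not grads:
        raise ValueError("nothing to merge")
    n = len(grads)
    return [sum(vals) / n for vals in zip(*grads)]


# Two pending gradients for the same (hypothetical) variable.
pending = [[1.0, 2.0], [3.0, 4.0]]
print(merge_add(pending))      # [4.0, 6.0]
print(merge_average(pending))  # [2.0, 3.0]
```

With averaging, the magnitude of the merged update no longer grows with the number of gradients queued between sends.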
30 Aug, 2019 1 commit

add thread scope stat accurate metrics test=develop (#19480) · 10ca3f96

Committed by yaoxuefeng on Aug 30, 2019

* add thread scope stat accurate metrics test=develop

* fix style

* fix style

* fix style

* fix style test=develop

* fix style test=develop

* fix style test=develop

* fix style test=develop

* fix style test=develop

* fix style test=develop

* fix style test=develop

* fix conflict

* fix style

* fix style test=develop

* fix error test=develop

* fix error test=develop

10ca3f96

29 Aug, 2019 2 commits

support debug each output of each ins (#19004) · 1fe468d3

Committed by Thunderbrook on Aug 29, 2019

* dump slot

* test

* proto

* dump slot

* test

* proto

* code style

* code style

* code style

* style

* add delete after unseen days

* add unseen days

* code style

* conflict solve
test=develop

* add clear model

* code style
test=develop

* code style
test=develop

* support debug tensor of each ins
test=develop

* support debug tensor of each ins
test=develop

* learning rate

* code style

* code style

* code style

* code style

* code style

* code style

* code style

* code style

* code style

* code style

* code style

* code style

* code style
test=develop

* code style
test=develop

* unittest

* style

* style

* multi phase

* add channel

* code style

* style

* style

* unittest

* style

* define

* define
test=develop

* style
test=develop

* rm define
test=develop

* linux

* linux
test=develop

* style
test=develop

* output format
test=develop

* windows ci
test=develop

1fe468d3

support fc sort by number, test=develop (#19466) · bd35a7f0
Committed by zhang wenhui on Aug 29, 2019
```
fleet_desc sorts fc names in dictionary (lexicographic) order, but we want them sorted by number.
```
bd35a7f0
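
The difference between dictionary order and numeric order can be sketched as follows. The parameter names (`fc_1.w_0` and so on) and the helper `numeric_key` are illustrative assumptions, not taken from the commit.

```python
import re
from typing import List


def numeric_key(name: str) -> List[int]:
    """Sort key: every run of digits in the name, compared as integers."""
    return [int(digits) for digits in re.findall(r"\d+", name)]


# Hypothetical fc parameter names.
names = ["fc_10.w_0", "fc_2.w_0", "fc_1.w_0"]

print(sorted(names))                   # ['fc_1.w_0', 'fc_10.w_0', 'fc_2.w_0']
print(sorted(names, key=numeric_key))  # ['fc_1.w_0', 'fc_2.w_0', 'fc_10.w_0']
```

Plain string sorting puts `fc_10` before `fc_2` because it compares character by character; the integer key restores the expected layer order.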

28 Aug, 2019 2 commits

adapt fleet api for localsgd and support nccl comm configuration in executor (#19443) · 4ef6b845
Committed by Yi Liu on Aug 28, 2019
```
test=develop
```
4ef6b845

Fix the correctness of async mode at distributed training (#18863) · 65c73684

Committed by tangwei12 on Aug 28, 2019

* fix correctness of the communicator

* fix a bug in send thread when sending var context is empty, test=develop

* add lookup_table_prefetch_op and prefetch optimize, test=develop

* remove GPU support for remote prefetch

* word2vec force with CPU, test=develop

* test dist remote lookup table force with CPU, test=develop

65c73684

27 Aug, 2019 1 commit
- fix fleet_desc bug && support format for abacus hotstart (#19430) · 0d794983
  Committed by zhang wenhui on Aug 27, 2019
```
fix fleet_desc dense_table unsorted bug; format for abacus hotstart is not supported yet.
```
  0d794983
23 Aug, 2019 1 commit
- add fleet_desc config feature & multi_sparse table, test=develop (#18827) · 4a3c4b8f
  Committed by zhang wenhui on Aug 23, 2019
```
add fleet_desc config feature & multi_sparse table
```
  4a3c4b8f
16 Aug, 2019 1 commit
- Remove node_num function. (#19167) · 86f05911
  Committed by gongweibao on Aug 16, 2019
```
node_num is not needed by users, so remove it and fix the related bugs.
```
  86f05911
14 Aug, 2019 3 commits

fix default value (#19193) · b86be13c
Committed by jiaqi on Aug 14, 2019
```
* fix default value in ps_pb2.py:   delta_keep_days 30 -> 16
* test=develop
```
b86be13c

add get_last_save_xbox_base/get_last_save_xbox (#19122) · b104ea06

Committed by jiaqi on Aug 14, 2019

* add get_last_save_xbox_base/get_last_save_xbox
* fix fleet_util bug of load paddle model
* add doc string in fleet api

b104ea06

fix default value of fleet desc (#19176) · bfd514c7

Committed by jiaqi on Aug 14, 2019

* fix default value of fleet desc, default values are the same as jingpai
* print log when save model

bfd514c7

12 Aug, 2019 1 commit
- Polish fleet API to support cuda collective mode and nccl2 mode. (#18966) · 29d87812
  Committed by gongweibao on Aug 12, 2019
```
Polish fleet API to support cuda collective mode and nccl2 mode
```
  29d87812
11 Aug, 2019 1 commit

add save cache model api in fleet& add slots shuffle in dataset module & add... · 9150cf50

Committed by yaoxuefeng on Aug 11, 2019

add save cache model api in fleet& add slots shuffle in dataset module & add metric op to calculate ctr related metrics (#18871)

* add ctr related metric layer test=develop

* add save cache and slots shuffle test=develop

* add save cache and slots shuffle test=develop

* fix error

* fix error

* fix style for ci

* fix for comments

* change SlotsShuffle input to std::string for generality

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* change non-const reference to pointer

* fix style

* fix style

* fix style test=develop

* fix style  test=develop

* add return ins num in ctr metric op

* change dtype to float in metric_op.py

* fix error test=develop

* fix style test=develop

* fix API spec

* fix API spec

* fix API spec test=develop

* add UT test=develop

9150cf50

08 Aug, 2019 1 commit

add fleet util, add some interface in hdfs util (#18752) · a99bc64c

Committed by jiaqi on Aug 08, 2019

* add fleet util (fleet/utils/fleet_util.py): functions for users' convenience
* add some interfaces in hdfs util: hdfs is_file, hdfs cat

a99bc64c

02 Aug, 2019 1 commit

support filelist size < trainer num && fix pull dense (#18956) · 02c370c3

Committed by jiaqi on Aug 02, 2019

* support filelist size < trainer num
* pull dense when stopping, so that local dense params are the same as on the pserver and saving a paddle model saves the same dense model as the pserver
* enable QueueDataset to train on the same filelist several times

02c370c3

01 Aug, 2019 1 commit
- adjust ins weight according to nid slot (#18784) · 768059b3
  Committed by jiaqi on Aug 01, 2019
```
adjust ins weight according to nid slot; users can specify adjust_ins_weight in strategy
```
  768059b3
31 Jul, 2019 1 commit

set fleet_send_batch_num a default value according to trainer num · 233746d8

Committed by jiaqi on Jul 31, 2019

(1) set fleet_send_batch_num to a default value according to trainer num; the previous value was fixed at 80000, so if trainer num is much smaller or larger than 100, global shuffle may hit a timeout error.

(2) fix load one table bug, add barrier

233746d8
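
The message above describes replacing a fixed 80000 with a default derived from the trainer count. The following is only a sketch of one such rule, assuming linear scaling that reproduces 80000 at 100 trainers; the constant `PER_TRAINER_BATCH`, the function name, and the scaling rule itself are assumptions and are not confirmed by the commit message.

```python
from typing import Optional

# Assumption: chosen so that 100 trainers reproduce the old fixed value of 80000.
PER_TRAINER_BATCH = 800


def default_send_batch_num(trainer_num: int, user_value: Optional[int] = None) -> int:
    """Return the user's setting if given, else a default scaled by trainer count."""
    if user_value is not None:
        return user_value
    if trainer_num <= 0:
        raise ValueError("trainer_num must be positive")
    return PER_TRAINER_BATCH * trainer_num


print(default_send_batch_num(100))       # 80000
print(default_send_batch_num(10))        # 8000
print(default_send_batch_num(10, 5000))  # 5000 (explicit user value wins)
```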

29 Jul, 2019 1 commit

add clear_model interface in fleetwrapper (#18815) · 52c1431e

Committed by Thunderbrook on Jul 29, 2019

* dump slot

* test

* proto

* dump slot

* test

* proto

* code style

* code style

* code style

* style

* add delete after unseen days

* add unseen days

* code style

* conflict solve
test=develop

* add clear model

* code style
test=develop

* code style
test=develop

52c1431e

25 Jul, 2019 2 commits
- refine launch_ps and role_maker (#18795) · 30562e37
  Committed by guru4elephant on Jul 25, 2019
```
refine launch_ps and role_maker
```
  30562e37
- Fix shrink-dense and add scale-datanorm (#18746) · c167a4b4
  Committed by fuyinno4 on Jul 25, 2019
```
Fix FleetWrapper:
1. fix shrink dense: just scale show
2. add datanorm scale: divide datanorm's gradient by batch_size
```
  c167a4b4
24 Jul, 2019 1 commit

add slot to sparse table (#18686) · d8396281

Committed by Thunderbrook on Jul 24, 2019

The change includes 2 things:

1. save delta model and shrink table were previously controlled by the same parameter; delete_after_unseen_days is now added to control shrink table.
2. values in the sparse table previously had no slot; slot is now added to the sparse table, and DownpourCtrAccessor is added to support the new meta.
test=develop

d8396281
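
A minimal sketch of the two ideas in this message: a sparse-table value that carries a slot, and a shrink step governed by its own `delete_after_unseen_days` threshold. The dictionary layout, the `unseen_days` field, and the function `shrink_table` are hypothetical and only illustrate the described behaviour; they are not pslib's actual data structures.

```python
from typing import Dict


def shrink_table(table: Dict[int, dict], delete_after_unseen_days: int) -> Dict[int, dict]:
    """Keep only entries seen within the allowed number of days."""
    return {
        key: value
        for key, value in table.items()
        if value["unseen_days"] <= delete_after_unseen_days
    }


# Hypothetical sparse table: feature id -> value that now records its slot.
table = {
    1001: {"slot": 6, "unseen_days": 2},
    1002: {"slot": 6, "unseen_days": 40},
    1003: {"slot": 9, "unseen_days": 30},
}

kept = shrink_table(table, delete_after_unseen_days=30)
print(sorted(kept))  # [1001, 1003]
```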

23 Jul, 2019 1 commit

support patch data, add load_one_table, fix bug (#18509) · d18aabb4

Committed by jiaqi on Jul 23, 2019

(1) support patch data (merge slots of instances of the same line id, modify dense layer which changes its size)
(2) add fleet load_one_table interface, supporting load from paddle model and load from pslib model
(3) fix a push sparse bug that made push sparse take more time (about 10% in my test case)
(4) when some slots are not in one of your networks (join/update, etc.), data feed, collect label info, and push/pull sparse will skip these slots instead of throwing an error.
(5) add more debug info in TrainFilesWithProfiler

d18aabb4
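
Item (1) describes merging the slots of instances that share a line id. A rough sketch of that idea, assuming each instance is a dict with a `line_id` and a `slots` mapping; this record layout and the function `merge_by_line_id` are hypothetical and not the data feed's real format.

```python
from typing import Dict, List


def merge_by_line_id(instances: List[dict]) -> Dict[str, Dict[str, List[int]]]:
    """Combine the slot features of all instances that share a line id."""
    merged: Dict[str, Dict[str, List[int]]] = {}
    for ins in instances:
        target = merged.setdefault(ins["line_id"], {})
        for slot, features in ins["slots"].items():
            target.setdefault(slot, []).extend(features)
    return merged


# Hypothetical instances: the second one patches extra slots onto line "a".
instances = [
    {"line_id": "a", "slots": {"1": [11, 12]}},
    {"line_id": "a", "slots": {"2": [21]}},
    {"line_id": "b", "slots": {"1": [13]}},
]

print(merge_by_line_id(instances))
# {'a': {'1': [11, 12], '2': [21]}, 'b': {'1': [13]}}
```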

22 Jul, 2019 1 commit
- do some odd jobs (#18641) · d8458483
  Committed by tangwei12 on Jul 22, 2019
```
do some odd jobs, test=develop
```
  d8458483
10 Jul, 2019 1 commit
- upgrade collective fleet api (#18533) · 9c17a899
  Committed by guru4elephant on Jul 10, 2019
```
* upgrade collective fleet api
```
  9c17a899
08 Jul, 2019 1 commit
- add random port (#18504) · 1f1cc222
  Committed by guru4elephant on Jul 08, 2019
```
* add random port
```
  1f1cc222
02 Jul, 2019 1 commit
- make fleet support mpi job submit directly (#18441) · 357311fd
  Committed by guru4elephant on Jul 02, 2019
```
make fleet support mpi job submit directly.
```
  357311fd
27 Jun, 2019 2 commits

fix communicator with pyreader (#18350) · 999d9a59
Committed by tangwei12 on Jun 27, 2019
```
* add is_running in communicator, test=develop
```
999d9a59

supports collective communicated training (#18175) · b7128bac

Committed by HaoRen on Jun 27, 2019

* fix prepare context redundant code problem, optimize executor by caching create_varaiables
test=develop

* supports collective training in executor

* make fetch_list runnable with variables, add more unittest for use_program_cache
test=develop

* fix comment
test=develop

* use unique name for nccl_id

* supports output to stream in program_to_code

* insert sync_comm_stream before regularization; add skip_op_callstack capability in program_to_code

* set op role in collective training

* add collective op role

* remove orig file

* add build optimizer by strategy

* add collective strategy

* refine collective strategy

* add multi-process role maker

* refine strategy building factory so that we can easily plugin more strategy

* scale loss grad in collective sgd transpiler

* add support for distributed fc

* code format

* revert some features for dist fc

* add support for distributed fc training

* fix prepare context redundant code problem, optimize executor by caching create_varaiables
test=develop

* supports collective training in executor

* make fetch_list runnable with variables, add more unittest for use_program_cache
test=develop

* use unique name for nccl_id

* supports output to stream in program_to_code

* insert sync_comm_stream before regularization; add skip_op_callstack capability in program_to_code

* set op role in collective training

* add collective op role

* fix comment
test=develop

* remove orig file

* add build optimizer by strategy

* add collective strategy

* refine collective strategy

* add multi-process role maker

* refine strategy building factory so that we can easily plugin more strategy

* scale loss grad in collective sgd transpiler

* add support for distributed fc

* code format

* revert some features for dist fc

* add support for distributed fc training

* test=develop
add collective op unittest standard

* test=develop
remove the test_collective directory

* test=develop
remove the test_collective directory

* remove slicegather test

* code format for reducescatter

* update attr of shard_index_op

* Modify macro nccl_helper

* remove test without distribute

* macro collective_helper

* macro update

* test=develop
update support python3.5

* test=develop change gpu memory use to 0.1 when test

* test=develop
update ut equal func

* test=develop
set flags to 1.5

* test=develop fix pickle dumple  py35

* test=develop
fix divide in slice and add sync_comm_stream
update atol and rtol to 1e-05
rm shard_index op and test
modify read input from file to read from memory
remove origin_program in framework and add i/o in c_sync_calc_stream

* test=develop update unittest sync operator I/O

b7128bac

23 Jun, 2019 1 commit
- fix paddle cloud role maker bug (#18269) · ff399fd7
  Committed by guru4elephant on Jun 23, 2019
```
* fix paddle cloud role maker bug
```
  ff399fd7
17 Jun, 2019 2 commits

assign role_maker before use (#18137) · 23f8a4b1
Committed by Qiao Longfei on Jun 17, 2019
```
fix role_maker bug
test=develop
```
23f8a4b1

add paddle cloud role maker for customized usage, note this is only for... · 58f3e1ba

Committed by guru4elephant on Jun 17, 2019

add paddle cloud role maker for customized usage, note this is only for industrial users that have cloud environment pre-configuration (#18121)

add paddle cloud role maker for specific cloud usage. This PR simplifies users' configuration in distributed training.

58f3e1ba

13 Jun, 2019 1 commit
- fix bug in fleet, test=develop (#18058) · 4c735f24
  Committed by tangwei12 on Jun 13, 2019
  
  4c735f24
12 Jun, 2019 2 commits

fix save/load in fleet (#17675) · 101f74cb
Committed by tangwei12 on Jun 12, 2019
```
* fix save/load in Fleet
* add UT framework of Fleet
```
101f74cb

fix logging basicConfig cannot be set after import paddle (#17786) · 96ee528e

Committed by Kaipeng Deng on Jun 12, 2019

* fix logging unable. test=develop

* unset sys.stdout for stream handler. test=develop

* fix newly add basicConfig. test=develop

* fix import error. test=develop

96ee528e
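
Some background on why this class of bug occurs: in Python's standard `logging` module, `basicConfig` does nothing when the root logger already has handlers, so a library that configures the root logger at import time can silently disable the application's own `basicConfig` call. One common way for a library to avoid this is to attach its handler to its own named logger and leave the root logger alone. The sketch below shows that pattern; the logger name `my_library` and the helper `get_library_logger` are hypothetical, and this is not claimed to be the change made in the commit.

```python
import logging


def get_library_logger(name: str = "my_library") -> logging.Logger:
    """Return a library-private logger without touching the root logger."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    # Do not forward records to the root logger, so they are not printed twice
    # once the application installs its own root handler.
    logger.propagate = False
    return logger


lib_logger = get_library_logger()

# The application can still configure logging afterwards, because the
# library never added a handler to the root logger.
logging.basicConfig(level=logging.INFO)
lib_logger.warning("library message")
```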

11 Jun, 2019 1 commit

add UserDefinedCollectiveRoleMaker for collective mode (#17898) · b5c35ae3

Committed by lilong12 on Jun 11, 2019

* add 'UserDefinedRoleMakerNCCL' for collective mode.

* code style

* add the name UserDefinedRoleMakerNCCL to __all__

* rename to UserDefinedRoleMakerCollective

* rename to UserDefinedCollectiveRoleMaker

b5c35ae3

23 May, 2019 1 commit
- Async exe support communicator (#17386) · 58f7695a
  Committed by Qiao Longfei on May 23, 2019
```
Async exe support communicator
```
  58f7695a
17 May, 2019 1 commit
- support sparse table get shard_num from TableParameter (#17443) · 05df39ac
  Committed by jiaqi on May 17, 2019
```
test=develop
```
  05df39ac
15 May, 2019 1 commit

support config file, cvm, load, save, shrink (#17319) · 34369944

Committed by jiaqi on May 15, 2019

* support config file, cvm, load, save, shrink
test=develop

* fix error of worker_num & add table.compress_in_save
test=develop

* fix code style
test=develop

* fix save model bug
test=develop

34369944

PaddlePaddle / Paddle synced successfully about 1 year ago
