Commits · eec9c9cbe7e841a9b3fef301d8dc9518d6db2452 · BaiXuePrincess / Paddle

15 Nov, 2019 1 commit

Committed by xujiaqi01 on Nov 15, 2019

* copy some feasigns and corresponding embeddings from one sparse table to another
* copy all feasigns and corresponding embeddings from one sparse table to another
* copy all dense params from one table to another
* copy some local vars to other local vars

9e045170
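
The first bullet of this commit, copying a subset of feasigns and their embeddings from one sparse table to another, can be pictured with plain dictionaries. This is only an illustrative sketch: the `copy_feasigns` helper and the dict-based table layout are assumptions made here, not the pslib or Fleet API that the commit actually touches.

```python
from typing import Dict, Iterable, List

# A toy "sparse table": feasign (an integer feature id) -> embedding vector.
SparseTable = Dict[int, List[float]]


def copy_feasigns(src: SparseTable, dst: SparseTable, feasigns: Iterable[int]) -> int:
    """Copy the embeddings of the given feasigns from src into dst.

    Feasigns that are missing from src are skipped.
    Returns the number of entries copied.
    Hypothetical helper, for illustration only.
    """
    copied = 0
    for sign in feasigns:
        if sign in src:
            dst[sign] = list(src[sign])  # copy so the tables do not share storage
            copied += 1
    return copied


src: SparseTable = {101: [0.1, 0.2], 202: [0.3, 0.4], 303: [0.5, 0.6]}
dst: SparseTable = {}
n = copy_feasigns(src, dst, [101, 303, 999])  # 999 is absent from src and is skipped
print(n, sorted(dst))
```

Copying every feasign, as in the second bullet, would be the same call with `src.keys()` as the list of feasigns.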

12 Nov, 2019 1 commit

modify the implementation of save_persistables and save_inference_model for... · 53148e06

Committed by lilong12 on Nov 12, 2019

modify the implementation of save_persistables and save_inference_model for fleet collective mode (#20802)

* modify the implementation of  save_persistables and save_inference_model functions for fleet collective, test=develop

* add ut, test=develop

53148e06

04 Nov, 2019 1 commit

find lookup table in order (#20932) · 5970e8ac

Committed by Thunderbrook on Nov 04, 2019

```
test=develop
```

5970e8ac

31 Oct, 2019 3 commits

Fix Paddle Cloud role maker (#20860) · 16596f64

Committed by Chengmo on Oct 31, 2019

* fix PaddleCloud Role maker & add warning in distribute transpiler  & change rpc_retry_times

16596f64

fix hdfs.download, test=develop (#20907) · ac87d4e6

Committed by Bai Yifan on Oct 31, 2019

ac87d4e6

support dump param of model into afs (#20302) · 59bcdc8a

Committed by Thunderbrook on Oct 31, 2019

* support dump param to afs
test=develop

* code style
test=develop

* code style
test=develop

* dump param
test=develop

* dump param
test=develop

* dump param
test=develop

* dump param
test=develop

59bcdc8a

25 Oct, 2019 1 commit

fix several sparse table issuses (#20686) · 48669aa8

Committed by xujiaqi01 on Oct 25, 2019

* no longer need to define all embedding layers (no one less) of all slots in each program. make trainer_param repeated in ps.proto.
* add find_distributed_lookup_table_grads instead of hard code GRAD
* support embedding stop gradient. push sparse has error before fix this.* 
* fix fill sparse, skip slots which do not have embedding. each slot's embedding in a sparse table should be used in all training programs before fix this.
* fix pull sparse, skip slots which do not have embedding.
* fix collect feasign label info, skip slots which do not have embedding.
* support when there are multi sparse tables in one or multi training programs, each program can pull/push its own related sparse tables instead of all sparse tables.
* test=develop

48669aa8

18 Oct, 2019 1 commit

add check nan / inf in downpour worker (#20694) · 5223b0dd

Committed by xujiaqi01 on Oct 18, 2019

```
* add check nan / inf in downpour worker during training
* test=develop
```

5223b0dd

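
The kind of check this commit describes, detecting NaN or infinite values during training, might look like the following in plain Python. It is a minimal sketch under stated assumptions: `has_nan_or_inf` is a hypothetical name, and the real check lives in Paddle's downpour worker, not in Python.

```python
import math
from typing import Iterable


def has_nan_or_inf(values: Iterable[float]) -> bool:
    """Return True if any value is NaN or +/-infinity (hypothetical helper)."""
    return any(math.isnan(v) or math.isinf(v) for v in values)


healthy = [0.0, 1.5, -3.25]
broken = [1.0, float("nan"), 2.0]
overflowed = [1.0, float("inf")]

print(has_nan_or_inf(healthy), has_nan_or_inf(broken), has_nan_or_inf(overflowed))
```

A training loop could run such a check on each output and stop or log as soon as it returns true, so that a bad value is caught before it is pushed to the parameter server.
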
15 Oct, 2019 3 commits

Fix communicator slow bug & fix communicator stop bug (#20366) · 940c6ff1

Committed by Chengmo on Oct 15, 2019

* test=develop,Fix communicator slow bug

* test=develop, delete if() in stop_worker()

* test=develop

* fix UT, test=develop

* fix bug in fetch handler, test=develop

* fix bug in fetch handler, test=develop

* test=develop, fix fetch barrier bug

* test=develop, bug fix

* test=develop, bug fix

* test=develop, fix bug

940c6ff1

fix dgc test and bug when not set trainers_endpoints_, test=develop (#20617) · cadc6a97

Committed by WangXi on Oct 14, 2019

cadc6a97

Fleet: deal with special case: strategy is None (#20359) · f55d1c68

Committed by mapingshuo on Oct 15, 2019

```
* special case: strategy is None
```

f55d1c68

14 Oct, 2019 1 commit

dump fix dov vec file num (#20539) · f76a32df

Committed by Thunderbrook on Oct 14, 2019

```
* support dump multi file
test=develop

* dump fix num file
test=develop
```

f76a32df

12 Oct, 2019 1 commit

fix converter , test=develop (#20522) · b5219920

Committed by zhang wenhui on Oct 12, 2019

b5219920

11 Oct, 2019 1 commit

fix pslib datanorm double bug (#20297) · b82e6520

Committed by zhang wenhui on Oct 11, 2019

```
* fix fc sort . test=develop
```

b82e6520

07 Oct, 2019 1 commit

fix fleet_desc delete_after_unseen_day bug in node.py (#20091) · b28d4a82

Committed by zhang wenhui on Oct 07, 2019

b28d4a82

30 Sep, 2019 1 commit

Add GEO-SGD distribute training algorithm (#20018) · 728ec1b4

Committed by Chengmo on Sep 30, 2019

```
* refector geo sgd & communicator
```

728ec1b4

24 Sep, 2019 1 commit

support change shuffle and train thread num (#19841) · cedc0477

Committed by xujiaqi01 on Sep 24, 2019

* support change shuffle thread num
* support change train thread num
* fix receive shuffle data of each channel
* data norm stop gradient
* add check thread_tensor type and root_tensor type when merge metric
* remove sleep in shuffle, add config
* add config of pslib client to client communication
* fix xbox str
* add data norm op testcase
* add flush in trainer finalize

cedc0477

23 Sep, 2019 2 commits

Forward recompute3 (#19913) · 9901f696

Committed by mapingshuo on Sep 23, 2019

* add recompute based checkpoints methods for large batch training
test=develop

* add append_backward_with_forward_recomputation
test=develop

* refine optimizer
test=develop

* update backward and optimizer
test=develop

* make Variable usable
test=develop

* add recompute code

* refine optimizer
test=develop

* refine addup _append_backward_ops_with_checkpoints_
1) for recompute part, just cache the grad_op_desc without appending to block
2) before appending grad_op_desc to backward part, addup_repetitive_vars, remove unused branch
test=develop

* make method private

* add recompute strategy into DistributedStrategy
test=develop

* checkpoint version3
test=develop

* remove some print information
test=develop

* remove unused sumop
test=develop

* try to fix recompute with graph building modules

* add input names to vars should be held

* add memory debug tool

* backup backward

* Fix bugs

* add backward desc for op not in any segments

* add exception info for sub_block

test=develop

* modify code style

test=develop

* modify code style

test=develop

* remove print functions

test=develop

* add API spec

test=develop
test=document_preview

* make Recompute a child class of Optimizer

test=develop
test=document_preview

* add API spec

test=develop
test=document_preview

* modify API spec

test=develop
test=document_preview

* add document for Recompute

test=develop
test=document_preview

* change API doc of Rcompute

test=develop
test=document_preview

* code cleaning

test=develop
test=document_preview

* modify API spec

* fix bugs when segments hold no element

* add testcase for Recompute Optimizer

test=develop
test=document_preview

* add test for apply_gradient, and code cleaning

test=develop
test=document_preview

* add test case for load function

* enable CI

test=develop
test=document

* add test case

test=develop
test=document_preview

* add sample code for 4 function of recompute optimizer

test=develop
test=document_preview

9901f696

paddle cloud role maker fix (#19646) · 278dd003

Committed by tangwei12 on Sep 23, 2019

```
* optimize cloud rolemaker, test=develop
```

278dd003

19 Sep, 2019 1 commit

change _origin_program test=develop (#19863) · e8d3745c

Committed by gongweibao on Sep 19, 2019

```
change _origin_program test=develop
```

e8d3745c

17 Sep, 2019 1 commit

support preload thread, optimize hdfs log, fix master+patch bug (#19695) · 6bf298bf

Committed by xujiaqi01 on Sep 17, 2019

```
* support preload thread
* sleep before fleet wrapper exit for pslib core dump
* optimize hdfs log
* fix master+patch bug
```

6bf298bf

10 Sep, 2019 1 commit

Fix float16 optimizer. (#19682) · 6c2bc29c

Committed by gongweibao on Sep 10, 2019

```
Fix float16 optimizer
```

6c2bc29c

06 Sep, 2019 1 commit

Optimize fleet API: add input check for some interfaces (#18971) · a25a716e

Committed by 123malin on Sep 06, 2019

```
* fleet api add input check, test=develop
```

a25a716e

05 Sep, 2019 1 commit

fix the diff between async mode and async_half mode (#19535) · 2f037c31

Committed by 123malin on Sep 05, 2019

```
* test=develop,  communicator merge add => merge average
```

2f037c31

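
The message for this commit says the communicator's merge step changed from adding to averaging. The difference can be sketched in a few lines of Python. The function names and the list-of-lists gradient layout are assumptions for illustration, not Paddle's communicator code.

```python
from typing import List, Sequence


def merge_add(grads: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise sum of several gradient vectors (hypothetical helper)."""
    return [sum(column) for column in zip(*grads)]


def merge_average(grads: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of several gradient vectors (hypothetical helper)."""
    count = len(grads)
    return [total / count for total in merge_add(grads)]


pending = [[1.0, 2.0], [3.0, 4.0]]  # two queued gradients for the same variable
print(merge_add(pending))      # grows with the number of queued gradients
print(merge_average(pending))  # stays on the scale of a single gradient
```

With a sum, the size of the merged update depends on how many gradients happened to be queued. With a mean, it stays on the scale of one gradient.
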
30 Aug, 2019 1 commit

add thread scope stat accurate metrics test=develop (#19480) · 10ca3f96

Committed by yaoxuefeng on Aug 30, 2019

* add thread scope stat accurate metrics test=develop

* fix style

* fix style

* fix style

* fix style test=develop

* fix style test=develop

* fix style test=develop

* fix style test=develop

* fix style test=develop

* fix style test=develop

* fix style test=develop

* fix conflict

* fix style

* fix style test=develop

* fix error test=develop

* fix error test=develop

10ca3f96

29 Aug, 2019 2 commits

support debug each output of each ins (#19004) · 1fe468d3

Committed by Thunderbrook on Aug 29, 2019

* dump slot

* test

* proto

* dump slot

* test

* proto

* code style

* code style

* code style

* style

* add delete after unseen days

* add unseen days

* code style

* conflict solve
test=develop

* add clear model

* code style
test=develop

* code style
test=develop

* support debug tensor of each ins
test=develop

* support debug tensor of each ins
test=develop

* learning rate

* code style

* code style

* code style

* code style

* code style

* code style

* code style

* code style

* code style

* code style

* code style

* code style

* code style
test=develop

* code style
test=develop

* unitest

* style

* style

* multi phase

* add channel

* code style

* style

* style

* unitest

* style

* define

* define
test=develop

* style
test=develop

* rm define
test=develop

* linux

* linux
test=develop

* style
test=develop

* output format
test=develop

* windows ci
test=develop

1fe468d3

support fc sort by number, test=develop (#19466) · bd35a7f0

Committed by zhang wenhui on Aug 29, 2019

```
fleet_desc sort fc name by dictionary sort, but we want to sort by number.
```

bd35a7f0
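
The problem this commit describes, dictionary order versus numeric order, is easy to reproduce. In the sketch below the parameter names and the `fc_index` helper are made up for illustration and are not taken from fleet_desc.

```python
import re
from typing import List

# Hypothetical fc parameter names; real names in fleet_desc may differ.
names: List[str] = ["fc_10.w_0", "fc_2.w_0", "fc_1.w_0"]


def fc_index(name: str) -> int:
    """Extract the first run of digits in the name as an integer sort key."""
    match = re.search(r"\d+", name)
    return int(match.group()) if match else -1


dictionary_order = sorted(names)             # plain string comparison
numeric_order = sorted(names, key=fc_index)  # compare by the layer number

print(dictionary_order)
print(numeric_order)
```

String comparison places `fc_10` before `fc_2` because the character `1` sorts before `2`. Sorting on the extracted integer restores the intended layer order.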

28 Aug, 2019 2 commits

adapte fleet api for localsgd and support nccl comm configuration in executor (#19443) · 4ef6b845

Committed by Yi Liu on Aug 28, 2019

```
test=develop
```

4ef6b845

Fix the correctness of async mode at distributed training (#18863) · 65c73684

Committed by tangwei12 on Aug 28, 2019

* fix correctness of the communicator

* fix a bug in send thread when sending var context is empty, test=develop

* add lookup_table_prefetch_op and prefetch optimize, test=develop

* remove remote prefetch GPU supported

* word2vec force with CPU, test=develop

* test dist remote lookup table force with CPU, test=develop

65c73684

27 Aug, 2019 1 commit

fix fleet_desc bug && support format for abacus hotstart (#19430) · 0d794983

Committed by zhang wenhui on Aug 27, 2019

```
fix fleet_desc dense_table unsort bug, not support format for abacus hotstart yet.
```

0d794983

23 Aug, 2019 1 commit

add fleet_desc config feature & multi_sparse table, test=develop (#18827) · 4a3c4b8f

Committed by zhang wenhui on Aug 23, 2019

```
 add fleet_desc config feature & multi_sparse table,
```

4a3c4b8f

16 Aug, 2019 1 commit

Remove node_num function. (#19167) · 86f05911

Committed by gongweibao on Aug 16, 2019

```
node_num is not needed for users, so remove them and fix the bugs about it!
```

86f05911

14 Aug, 2019 3 commits

fix default value (#19193) · b86be13c

Committed by jiaqi on Aug 14, 2019

```
* fix default value in ps_pb2.py:   delta_keep_days 30 -> 16
* test=develop
```

b86be13c

add get_last_save_xbox_base/get_last_save_xbox (#19122) · b104ea06

Committed by jiaqi on Aug 14, 2019

* add get_last_save_xbox_base/get_last_save_xbox
* fix fleet_util bug of load paddle model
* add doc string in fleet api

b104ea06

fix default value of fleet desc (#19176) · bfd514c7

Committed by jiaqi on Aug 14, 2019

* fix default value of fleet desc, default values are same with jingpai
* print log when save model

bfd514c7

12 Aug, 2019 1 commit

Polish fleet API to support cuda collective mode and nccl2 mode. (#18966) · 29d87812

Committed by gongweibao on Aug 12, 2019

```
Polish fleet API to support cuda collective mode and nccl2 mode
```

29d87812

11 Aug, 2019 1 commit

add save cache model api in fleet& add slots shuffle in dataset module & add... · 9150cf50

Committed by yaoxuefeng on Aug 11, 2019

add save cache model api in fleet& add slots shuffle in dataset module & add metric op to calculate ctr related metrics (#18871)

* add ctr related metric layer test=develop

* add save cache and slots shuffle test=develop

* add save cache and slots shuffle test=develop

* fix error

* fix error

* fix style for ci

* fix for comments

* change SlotsShuffle input to std::strinf for generality

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix stylr

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style

* change non-const reference to pointer

* fix style

* fix style

* fix style test=develop

* fix style  test=develop

* add return ins num in ctr metric op

* change dtype to float in metric_op.py

* fix error test=develop

* fix style test=develop

* fix API spec

* fix API spec

* fix API spec test=develop

* add UT test=develop

9150cf50

08 Aug, 2019 1 commit

add fleet util, add some interface in hdfs util (#18752) · a99bc64c

Committed by jiaqi on Aug 08, 2019

* add fleet util (fleet/utils/fleet_util.py): functions for users' convenience
* add some interface in hdfs util : hdfs is_file, hdfs cat

a99bc64c

02 Aug, 2019 1 commit

support filelist size < trainer num && fix pull dense (#18956) · 02c370c3

Committed by jiaqi on Aug 02, 2019

* support filelist size < trainer num
* pull dense when stop, to make sure local dense params are same as pserver, so save paddle model will save dense model same as pserver
*  enable QueueDataset train same filelist for serveral times

02c370c3

01 Aug, 2019 1 commit

adjust ins weight according to nid slot (#18784) · 768059b3

Committed by jiaqi on Aug 01, 2019

```
adjust ins weight according to nid slot , user can specify adjust_ins_weight in strategy
```

768059b3
