- 15 11月, 2019 1 次提交
-
-
由 xujiaqi01 提交于
* copy some feasigns and corresponding embeddings from one sparse table to another * copy all feasigns and corresponding embeddings from one sparse table to another * copy all dense params from one table to another * copy some local vars to other local vars
-
- 12 11月, 2019 1 次提交
-
-
由 lilong12 提交于
modify the implementation of save_persistables and save_inference_model for fleet collective mode (#20802) * modify the implementation of save_persistables and save_inference_model functions for fleet collective, test=develop * add ut, test=develop
-
- 04 11月, 2019 1 次提交
-
-
由 Thunderbrook 提交于
test=develop
-
- 31 10月, 2019 3 次提交
-
-
由 Chengmo 提交于
* fix PaddleCloud Role maker & add warning in distribute transpiler & change rpc_retry_times
-
由 Bai Yifan 提交于
-
由 Thunderbrook 提交于
* support dump param to afs test=develop * code style test=develop * code style test=develop * dump param test=develop * dump param test=develop * dump param test=develop * dump param test=develop
-
- 25 10月, 2019 1 次提交
-
-
由 xujiaqi01 提交于
* no longer need to define all embedding layers (no one less) of all slots in each program. make trainer_param repeated in ps.proto. * add find_distributed_lookup_table_grads instead of hard code GRAD * support embedding stop gradient. push sparse has error before fix this.* * fix fill sparse, skip slots which do not have embedding. each slot's embedding in a sparse table should be used in all training programs before fix this. * fix pull sparse, skip slots which do not have embedding. * fix collect feasign label info, skip slots which do not have embedding. * support when there are multi sparse tables in one or multi training programs, each program can pull/push its own related sparse tables instead of all sparse tables. * test=develop
-
- 18 10月, 2019 1 次提交
-
-
由 xujiaqi01 提交于
* add check nan / inf in downpour worker during training * test=develop
-
- 15 10月, 2019 3 次提交
-
-
由 Chengmo 提交于
* test=develop,Fix communicator slow bug * test=develop, delete if() in stop_worker() * test=develop * fix UT, test=develop * fix bug in fetch handler, test=develop * fix bug in fetch handler, test=develop * test=develop, fix fetch barrier bug * test=develop, bug fix * test=develop, bug fix * test=develop, fix bug
-
由 WangXi 提交于
-
由 mapingshuo 提交于
* special case: strategy is None
-
- 14 10月, 2019 1 次提交
-
-
由 Thunderbrook 提交于
* support dump multi file test=develop * dump fix num file test=develop
-
- 12 10月, 2019 1 次提交
-
-
由 zhang wenhui 提交于
-
- 11 10月, 2019 1 次提交
-
-
由 zhang wenhui 提交于
* fix fc sort . test=develop
-
- 07 10月, 2019 1 次提交
-
-
由 zhang wenhui 提交于
-
- 30 9月, 2019 1 次提交
-
-
由 Chengmo 提交于
* refector geo sgd & communicator
-
- 24 9月, 2019 1 次提交
-
-
由 xujiaqi01 提交于
* support change shuffle thread num * support change train thread num * fix receive shuffle data of each channel * data norm stop gradient * add check thread_tensor type and root_tensor type when merge metric * remove sleep in shuffle, add config * add config of pslib client to client communication * fix xbox str * add data norm op testcase * add flush in trainer finalize
-
- 23 9月, 2019 2 次提交
-
-
由 mapingshuo 提交于
* add recompute based checkpoints methods for large batch training test=develop * add append_backward_with_forward_recomputation test=develop * refine optimizer test=develop * update backward and optimizer test=develop * make Variable usable test=develop * add recompute code * refine optimizer test=develop * refine addup _append_backward_ops_with_checkpoints_ 1) for recompute part, just cache the grad_op_desc without appending to block 2) before appending grad_op_desc to backward part, addup_repetitive_vars, remove unused branch test=develop * make method private * add recompute strategy into DistributedStrategy test=develop * checkpoint version3 test=develop * remove some print information test=develop * remove unused sumop test=develop * try to fix recompute with graph building modules * add input names to vars should be held * add memory debug tool * backup backward * Fix bugs * add backward desc for op not in any segments * add exception info for sub_block test=develop * modify code style test=develop * modify code style test=develop * remove print functions test=develop * add API spec test=develop test=document_preview * make Recompute a child class of Optimizer test=develop test=document_preview * add API spec test=develop test=document_preview * modify API spec test=develop test=document_preview * add document for Recompute test=develop test=document_preview * change API doc of Rcompute test=develop test=document_preview * code cleaning test=develop test=document_preview * modify API spec * fix bugs when segments hold no element * add testcase for Recompute Optimizer test=develop test=document_preview * add test for apply_gradient, and code cleaning test=develop test=document_preview * add test case for load function * enable CI test=develop test=document * add test case test=develop test=document_preview * add sample code for 4 function of recompute optimizer test=develop test=document_preview
-
由 tangwei12 提交于
* optimize cloud rolemaker, test=develop
-
- 19 9月, 2019 1 次提交
-
-
由 gongweibao 提交于
change _origin_program test=develop
-
- 17 9月, 2019 1 次提交
-
-
由 xujiaqi01 提交于
* support preload thread * sleep before fleet wrapper exit for pslib core dump * optimize hdfs log * fix master+patch bug
-
- 10 9月, 2019 1 次提交
-
-
由 gongweibao 提交于
Fix float16 optimizer
-
- 06 9月, 2019 1 次提交
-
-
由 123malin 提交于
* fleet api add input check, test=develop
-
- 05 9月, 2019 1 次提交
-
-
由 123malin 提交于
* test=develop, communicator merge add => merge average
-
- 30 8月, 2019 1 次提交
-
-
由 yaoxuefeng 提交于
* add thread scope stat accurate metrics test=develop * fix style * fix style * fix style * fix style test=develop * fix style test=develop * fix style test=develop * fix style test=develop * fix style test=develop * fix style test=develop * fix style test=develop * fix conflict * fix style * fix style test=develop * fix error test=develop * fix error test=develop
-
- 29 8月, 2019 2 次提交
-
-
由 Thunderbrook 提交于
* dump slot * test * proto * dump slot * test * proto * code style * code style * code style * style * add delete after unseen days * add unseen days * code style * conflict solve test=develop * add clear model * code style test=develop * code style test=develop * support debug tensor of each ins test=develop * support debug tensor of each ins test=develop * learning rate * code style * code style * code style * code style * code style * code style * code style * code style * code style * code style * code style * code style * code style test=develop * code style test=develop * unitest * style * style * multi phase * add channel * code style * style * style * unitest * style * define * define test=develop * style test=develop * rm define test=develop * linux * linux test=develop * style test=develop * output format test=develop * windows ci test=develop
-
由 zhang wenhui 提交于
fleet_desc sort fc name by dictionary sort, but we want to sort by number.
-
- 28 8月, 2019 2 次提交
-
-
由 Yi Liu 提交于
test=develop
-
由 tangwei12 提交于
* fix correctness of the communicator * fix a bug in send thread when sending var context is empty, test=develop * add lookup_table_prefetch_op and prefetch optimize, test=develop * remove remote prefetch GPU supported * word2vec force with CPU, test=develop * test dist remote lookup table force with CPU, test=develop
-
- 27 8月, 2019 1 次提交
-
-
由 zhang wenhui 提交于
fix fleet_desc dense_table unsort bug ,not support format for abacus hotstart yet.
-
- 23 8月, 2019 1 次提交
-
-
由 zhang wenhui 提交于
add fleet_desc config feature & multi_sparse table,
-
- 16 8月, 2019 1 次提交
-
-
由 gongweibao 提交于
node_num is not needed for users, so remove them and fix the bugs about it!
-
- 14 8月, 2019 3 次提交
-
-
由 jiaqi 提交于
* fix default value in ps_pb2.py: delta_keep_days 30 -> 16 * test=develop
-
由 jiaqi 提交于
* add get_last_save_xbox_base/get_last_save_xbox * fix fleet_util bug of load paddle model * add doc string in fleet api
-
由 jiaqi 提交于
* fix default value of fleet desc, default values are same with jingpai * print log when save model
-
- 12 8月, 2019 1 次提交
-
-
由 gongweibao 提交于
Polish fleet API to support cuda collective mode and nccl2 mode
-
- 11 8月, 2019 1 次提交
-
-
由 yaoxuefeng 提交于
add save cache model api in fleet& add slots shuffle in dataset module & add metric op to calculate ctr related metrics (#18871) * add ctr related metric layer test=develop * add save cache and slots shuffle test=develop * add save cache and slots shuffle test=develop * fix error * fix error * fix style for ci * fix for comments * change SlotsShuffle input to std::strinf for generality * fix style * fix style * fix style * fix style * fix style * fix style * fix stylr * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * change non-const reference to pointer * fix style * fix style * fix style test=develop * fix style test=develop * add return ins num in ctr metric op * change dtype to float in metric_op.py * fix error test=develop * fix style test=develop * fix API spec * fix API spec * fix API spec test=develop * add UT test=develop
-
- 08 8月, 2019 1 次提交
-
-
由 jiaqi 提交于
* add fleet util (fleet/utils/fleet_util.py): functions for users' convenience * add some interface in hdfs util : hdfs is_file、hdfs cat
-
- 02 8月, 2019 1 次提交
-
-
由 jiaqi 提交于
* support filelist size < trainer num * pull dense when stop, to make sure local dense params are same as pserver, so save paddle model will save dense model same as pserver * enable QueueDataset train same filelist for serveral times
-
- 01 8月, 2019 1 次提交
-
-
由 jiaqi 提交于
adjust ins weight according to nid slot , user can specify adjust_ins_weight in strategy
-