- 12 Nov, 2019 1 commit
-
-
Committed by lilong12
modify the implementation of save_persistables and save_inference_model for fleet collective mode (#20802) * modify the implementation of save_persistables and save_inference_model functions for fleet collective, test=develop * add ut, test=develop
-
- 04 Nov, 2019 1 commit
-
-
Committed by Thunderbrook
test=develop
-
- 31 Oct, 2019 3 commits
-
-
Committed by Chengmo
* fix PaddleCloud Role maker & add warning in distribute transpiler & change rpc_retry_times
-
Committed by Bai Yifan
-
Committed by Thunderbrook
* support dump param to afs test=develop * code style test=develop * code style test=develop * dump param test=develop * dump param test=develop * dump param test=develop * dump param test=develop
-
- 25 Oct, 2019 1 commit
-
-
Committed by xujiaqi01
* no longer need to define all embedding layers (no one less) of all slots in each program. make trainer_param repeated in ps.proto. * add find_distributed_lookup_table_grads instead of hard code GRAD * support embedding stop gradient. push sparse had an error before this fix. * fix fill sparse, skip slots which do not have embedding. before this fix, each slot's embedding in a sparse table had to be used in all training programs. * fix pull sparse, skip slots which do not have embedding. * fix collect feasign label info, skip slots which do not have embedding. * support when there are multi sparse tables in one or multi training programs, each program can pull/push its own related sparse tables instead of all sparse tables. * test=develop
-
- 18 Oct, 2019 1 commit
-
-
Committed by xujiaqi01
* add check nan / inf in downpour worker during training * test=develop
-
- 15 Oct, 2019 3 commits
-
-
Committed by Chengmo
* test=develop,Fix communicator slow bug * test=develop, delete if() in stop_worker() * test=develop * fix UT, test=develop * fix bug in fetch handler, test=develop * fix bug in fetch handler, test=develop * test=develop, fix fetch barrier bug * test=develop, bug fix * test=develop, bug fix * test=develop, fix bug
-
Committed by WangXi
-
Committed by mapingshuo
* special case: strategy is None
-
- 14 Oct, 2019 1 commit
-
-
Committed by Thunderbrook
* support dump multi file test=develop * dump fix num file test=develop
-
- 12 Oct, 2019 1 commit
-
-
Committed by zhang wenhui
-
- 11 Oct, 2019 1 commit
-
-
Committed by zhang wenhui
* fix fc sort . test=develop
-
- 07 Oct, 2019 1 commit
-
-
Committed by zhang wenhui
-
- 30 Sep, 2019 1 commit
-
-
Committed by Chengmo
* refactor geo sgd & communicator
-
- 24 Sep, 2019 1 commit
-
-
Committed by xujiaqi01
* support change shuffle thread num * support change train thread num * fix receive shuffle data of each channel * data norm stop gradient * add check thread_tensor type and root_tensor type when merge metric * remove sleep in shuffle, add config * add config of pslib client to client communication * fix xbox str * add data norm op testcase * add flush in trainer finalize
-
- 23 Sep, 2019 2 commits
-
-
Committed by mapingshuo
* add recompute based checkpoints methods for large batch training test=develop * add append_backward_with_forward_recomputation test=develop * refine optimizer test=develop * update backward and optimizer test=develop * make Variable usable test=develop * add recompute code * refine optimizer test=develop * refine addup _append_backward_ops_with_checkpoints_ 1) for recompute part, just cache the grad_op_desc without appending to block 2) before appending grad_op_desc to backward part, addup_repetitive_vars, remove unused branch test=develop * make method private * add recompute strategy into DistributedStrategy test=develop * checkpoint version3 test=develop * remove some print information test=develop * remove unused sumop test=develop * try to fix recompute with graph building modules * add input names to vars should be held * add memory debug tool * backup backward * Fix bugs * add backward desc for op not in any segments * add exception info for sub_block test=develop * modify code style test=develop * modify code style test=develop * remove print functions test=develop * add API spec test=develop test=document_preview * make Recompute a child class of Optimizer test=develop test=document_preview * add API spec test=develop test=document_preview * modify API spec test=develop test=document_preview * add document for Recompute test=develop test=document_preview * change API doc of Recompute test=develop test=document_preview * code cleaning test=develop test=document_preview * modify API spec * fix bugs when segments hold no element * add testcase for Recompute Optimizer test=develop test=document_preview * add test for apply_gradient, and code cleaning test=develop test=document_preview * add test case for load function * enable CI test=develop test=document * add test case test=develop test=document_preview * add sample code for 4 function of recompute optimizer test=develop test=document_preview
-
Committed by tangwei12
* optimize cloud rolemaker, test=develop
-
- 19 Sep, 2019 1 commit
-
-
Committed by gongweibao
change _origin_program test=develop
-
- 17 Sep, 2019 1 commit
-
-
Committed by xujiaqi01
* support preload thread * sleep before fleet wrapper exit for pslib core dump * optimize hdfs log * fix master+patch bug
-
- 10 Sep, 2019 1 commit
-
-
Committed by gongweibao
Fix float16 optimizer
-
- 06 Sep, 2019 1 commit
-
-
Committed by 123malin
* fleet api add input check, test=develop
-
- 05 Sep, 2019 1 commit
-
-
Committed by 123malin
* test=develop, communicator merge add => merge average
-
- 30 Aug, 2019 1 commit
-
-
Committed by yaoxuefeng
* add thread scope stat accurate metrics test=develop * fix style * fix style * fix style * fix style test=develop * fix style test=develop * fix style test=develop * fix style test=develop * fix style test=develop * fix style test=develop * fix style test=develop * fix conflict * fix style * fix style test=develop * fix error test=develop * fix error test=develop
-
- 29 Aug, 2019 2 commits
-
-
Committed by Thunderbrook
* dump slot * test * proto * dump slot * test * proto * code style * code style * code style * style * add delete after unseen days * add unseen days * code style * conflict solve test=develop * add clear model * code style test=develop * code style test=develop * support debug tensor of each ins test=develop * support debug tensor of each ins test=develop * learning rate * code style * code style * code style * code style * code style * code style * code style * code style * code style * code style * code style * code style * code style test=develop * code style test=develop * unitest * style * style * multi phase * add channel * code style * style * style * unitest * style * define * define test=develop * style test=develop * rm define test=develop * linux * linux test=develop * style test=develop * output format test=develop * windows ci test=develop
-
Committed by zhang wenhui
fleet_desc sorted fc names in dictionary (lexicographic) order, but they should be sorted by number.
-
- 28 Aug, 2019 2 commits
-
-
Committed by Yi Liu
test=develop
-
Committed by tangwei12
* fix correctness of the communicator * fix a bug in send thread when sending var context is empty, test=develop * add lookup_table_prefetch_op and prefetch optimize, test=develop * remove remote prefetch GPU supported * word2vec force with CPU, test=develop * test dist remote lookup table force with CPU, test=develop
-
- 27 Aug, 2019 1 commit
-
-
Committed by zhang wenhui
fix fleet_desc dense_table unsorted bug; the format for abacus hotstart is not supported yet.
-
- 23 Aug, 2019 1 commit
-
-
Committed by zhang wenhui
add fleet_desc config feature & multi_sparse table
-
- 16 Aug, 2019 1 commit
-
-
Committed by gongweibao
node_num is not needed for users, so remove them and fix the bugs about it!
-
- 14 Aug, 2019 3 commits
-
-
Committed by jiaqi
* fix default value in ps_pb2.py: delta_keep_days 30 -> 16 * test=develop
-
Committed by jiaqi
* add get_last_save_xbox_base/get_last_save_xbox * fix fleet_util bug of load paddle model * add doc string in fleet api
-
Committed by jiaqi
* fix default value of fleet desc, default values are same with jingpai * print log when save model
-
- 12 Aug, 2019 1 commit
-
-
Committed by gongweibao
Polish fleet API to support cuda collective mode and nccl2 mode
-
- 11 Aug, 2019 1 commit
-
-
Committed by yaoxuefeng
add save cache model api in fleet & add slots shuffle in dataset module & add metric op to calculate ctr related metrics (#18871) * add ctr related metric layer test=develop * add save cache and slots shuffle test=develop * add save cache and slots shuffle test=develop * fix error * fix error * fix style for ci * fix for comments * change SlotsShuffle input to std::string for generality * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * change non-const reference to pointer * fix style * fix style * fix style test=develop * fix style test=develop * add return ins num in ctr metric op * change dtype to float in metric_op.py * fix error test=develop * fix style test=develop * fix API spec * fix API spec * fix API spec test=develop * add UT test=develop
-
- 08 Aug, 2019 1 commit
-
-
Committed by jiaqi
* add fleet util (fleet/utils/fleet_util.py): functions for users' convenience * add some interfaces in hdfs util: hdfs is_file, hdfs cat
-
- 02 Aug, 2019 1 commit
-
-
Committed by jiaqi
* support filelist size < trainer num * pull dense when stop, to make sure local dense params are same as pserver, so save paddle model will save dense model same as pserver * enable QueueDataset to train the same filelist several times
-
- 01 Aug, 2019 1 commit
-
-
Committed by jiaqi
adjust ins weight according to nid slot; user can specify adjust_ins_weight in strategy
-
- 31 Jul, 2019 1 commit
-
-
Committed by jiaqi
(1) set a default value for fleet_send_batch_num according to trainer num; the previous value was fixed at 80000, so if trainer num is much smaller or larger than 100, global shuffle may hit a timeout error. (2) fix load one table bug, add barrier
-