- 16 Aug, 2019 1 commit
-
-
Committed by gongweibao
node_num is not needed by users, so remove it and fix the related bugs.
-
- 14 Aug, 2019 3 commits
-
-
Committed by jiaqi
* fix default value in ps_pb2.py: delta_keep_days 30 -> 16 * test=develop
-
Committed by jiaqi
* add get_last_save_xbox_base/get_last_save_xbox * fix fleet_util bug when loading paddle model * add docstrings in fleet API
-
Committed by jiaqi
* fix default values of fleet desc; default values are the same as jingpai * print log when saving model
-
- 12 Aug, 2019 1 commit
-
-
Committed by gongweibao
Polish fleet API to support cuda collective mode and nccl2 mode
-
- 11 Aug, 2019 1 commit
-
-
Committed by yaoxuefeng
add save cache model api in fleet & add slots shuffle in dataset module & add metric op to calculate ctr related metrics (#18871) * add ctr related metric layer test=develop * add save cache and slots shuffle test=develop * add save cache and slots shuffle test=develop * fix error * fix error * fix style for ci * fix for comments * change SlotsShuffle input to std::string for generality * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * fix style * change non-const reference to pointer * fix style * fix style * fix style test=develop * fix style test=develop * add return ins num in ctr metric op * change dtype to float in metric_op.py * fix error test=develop * fix style test=develop * fix API spec * fix API spec * fix API spec test=develop * add UT test=develop
-
- 08 Aug, 2019 1 commit
-
-
Committed by jiaqi
* add fleet util (fleet/utils/fleet_util.py): functions for users' convenience * add some interfaces in hdfs util: hdfs is_file, hdfs cat
-
- 02 Aug, 2019 1 commit
-
-
Committed by jiaqi
* support filelist size < trainer num * pull dense when stopping, to make sure local dense params are the same as on the pserver, so that saving the paddle model saves the same dense model as the pserver * enable QueueDataset to train on the same filelist several times
-
- 01 Aug, 2019 1 commit
-
-
Committed by jiaqi
adjust ins weight according to nid slot; users can specify adjust_ins_weight in strategy
-
- 31 Jul, 2019 1 commit
-
-
Committed by jiaqi
(1) set a default value for fleet_send_batch_num according to trainer num; the previous value was fixed at 80000, so if trainer num is much less or larger than 100, global shuffle may hit a timeout error. (2) fix load one table bug, add barrier
-
- 29 Jul, 2019 1 commit
-
-
Committed by Thunderbrook
* dump slot * test * proto * dump slot * test * proto * code style * code style * code style * style * add delete after unseen days * add unseen days * code style * conflict solve test=develop * add clear model * code style test=develop * code style test=develop
-
- 25 Jul, 2019 2 commits
-
-
Committed by guru4elephant
refine launch_ps and role_maker
-
Committed by fuyinno4
Fix FleetWrapper: 1. fix shrink dense: just scale show 2. add datanorm scale: divide datanorm's gradient by batch_size
-
- 24 Jul, 2019 1 commit
-
-
Committed by Thunderbrook
The change includes 2 things: 1. saving the delta model and shrinking the table were previously controlled by the same parameter; now delete_after_unseen_days is added to control table shrinking. 2. values in the sparse table previously had no slot; now a slot is added to the sparse table, and DownpourCtrAccessor is added to support the new meta. test=develop
-
- 23 Jul, 2019 1 commit
-
-
Committed by jiaqi
(1) support patch data (merge slots of instances with the same line id, modify dense layer which changes its size) (2) add fleet load_one_table interface, supporting loading from a paddle model and from a pslib model (3) fix push sparse bug which caused push sparse to take more time (about 10% in my test case) (4) when some slots are not in one of your networks (join/update, etc.), data feed, collect label info, and push/pull sparse will skip these slots instead of throwing an error. (5) add more debug info in TrainFilesWithProfiler
-
- 22 Jul, 2019 1 commit
-
-
Committed by tangwei12
do some odd jobs, test=develop
-
- 10 Jul, 2019 1 commit
-
-
Committed by guru4elephant
* upgrade collective fleet api
-
- 08 Jul, 2019 1 commit
-
-
Committed by guru4elephant
* add random port
-
- 02 Jul, 2019 1 commit
-
-
Committed by guru4elephant
make fleet support submitting MPI jobs directly.
-
- 27 Jun, 2019 2 commits
-
-
Committed by tangwei12
* add is_runnning in communicator, test=develop
-
Committed by HaoRen
* fix prepare context redundant code problem, optimize executor by caching create_varaiables test=develop * supports collective training in executor * make fetch_list runnable with variables, add more unittest for use_program_cache test=develop * fix comment test=develop * use unique name for nccl_id * supports output to stream in program_to_code * insert sync_comm_stream before regularization; add skip_op_callstack capability in program_to_code * set op role in collective training * add collective op role * remove orig file * add build optimizer by strategy * add collective strategy * refine collective strategy * add multi-process role maker * refine strategy building factory so that we can easily plug in more strategies * scale loss grad in collective sgd transpiler * add support for distributed fc * code format * revert some features for dist fc * add support for distributed fc training * fix prepare context redundant code problem, optimize executor by caching create_varaiables test=develop * supports collective training in executor * make fetch_list runnable with variables, add more unittest for use_program_cache test=develop * use unique name for nccl_id * supports output to stream in program_to_code * insert sync_comm_stream before regularization; add skip_op_callstack capability in program_to_code * set op role in collective training * add collective op role * fix comment test=develop * remove orig file * add build optimizer by strategy * add collective strategy * refine collective strategy * add multi-process role maker * refine strategy building factory so that we can easily plug in more strategies * scale loss grad in collective sgd transpiler * add support for distributed fc * code format * revert some features for dist fc * add support for distributed fc training * test=develop add collective op unittest standard * test=develop remove the test_collective directory * test=develop remove the test_collective directory * remove slicegather test * code format for reducescatter * update attr of shard_index_op * Modify macro nccl_helper * remove test without distribute * macro collective_helper * macro update * test=develop update support python3.5 * test=develop change gpu memory use to 0.1 when test * test=develop update ut equal func * test=develop set flags to 1.5 * test=develop fix pickle dumple py35 * test=develop fix divide in slice and add sync_comm_stream; update atol and rtol to 1e-05; rm shard_index op and test; modify read input from file to read from memory; remove origin_program in framework and add i/o in c_sync_calc_stream * test=develop update unittest sync operator I/O
-
- 23 Jun, 2019 1 commit
-
-
Committed by guru4elephant
* fix paddle cloud role maker bug
-
- 17 Jun, 2019 2 commits
-
-
Committed by Qiao Longfei
fix role_maker bug test=develop
-
Committed by guru4elephant
add paddle cloud role maker for customized usage; note this is only for industrial users that have a pre-configured cloud environment (#18121) add paddle cloud role maker for specific cloud usage. This PR simplifies users' configuration in distributed training.
-
- 13 Jun, 2019 1 commit
-
-
Committed by tangwei12
-
- 12 Jun, 2019 2 commits
-
-
Committed by tangwei12
* fix save/load in Fleet * add UT framework of Fleet
-
Committed by Kaipeng Deng
* fix logging unable. test=develop * unset sys.stdout for stream handler. test=develop * fix newly added basicConfig. test=develop * fix import error. test=develop
-
- 11 Jun, 2019 1 commit
-
-
Committed by lilong12
* add 'UserDefinedRoleMakerNCCL' for collective mode. * code style * add the name UserDefinedRoleMakerNCCL to __all__ * rename to UserDefinedRoleMakerCollective * rename to UserDefinedCollectiveRoleMaker
-
- 23 May, 2019 1 commit
-
-
Committed by Qiao Longfei
Async exe support communicator
-
- 17 May, 2019 1 commit
-
-
Committed by jiaqi
test=develop
-
- 15 May, 2019 1 commit
-
-
Committed by jiaqi
* support config file, cvm, load, save, shrink test=develop * fix error of worker_num & add table.compress_in_save test=develop * fix code style test=develop * fix save model bug test=develop
-
- 09 May, 2019 1 commit
-
-
Committed by tangwei12
* fix some logic in distributed transpiler, test=develop * reformat fleet API, test=develop
-
- 25 Apr, 2019 1 commit
-
-
Committed by tangwei12
* implement distributed transpiler with fleet
-
- 11 Apr, 2019 1 commit
-
-
Committed by dongdaxiang
-
- 10 Apr, 2019 1 commit
-
-
Committed by xjqbest
test=develop
-
- 09 Apr, 2019 3 commits
- 05 Apr, 2019 1 commit
-
-
Committed by xjqbest
test=develop
-
- 04 Apr, 2019 1 commit
-
-
Committed by xjqbest
test=develop
-