1. 30 9月, 2019 1 次提交
  2. 24 9月, 2019 1 次提交
    • X
      support change shuffle and train thread num (#19841) · cedc0477
      xujiaqi01 提交于
      * support change shuffle thread num
      * support change train thread num
      * fix receive shuffle data of each channel
      * data norm stop gradient
      * add check thread_tensor type and root_tensor type when merge metric
      * remove sleep in shuffle, add config
      * add config of pslib client to client communication
      * fix xbox str
      * add data norm op testcase
      * add flush in trainer finalize
      cedc0477
  3. 23 9月, 2019 2 次提交
    • M
      Forward recompute3 (#19913) · 9901f696
      mapingshuo 提交于
      * add recompute based checkpoints methods for large batch training
      test=develop
      
      * add append_backward_with_forward_recomputation
      test=develop
      
      * refine optimizer
      test=develop
      
      * update backward and optimizer
      test=develop
      
      * make Variable usable
      test=develop
      
      * add recompute code
      
      * refine optimizer
      test=develop
      
      * refine addup _append_backward_ops_with_checkpoints_
      1) for recompute part, just cache the grad_op_desc without appending to block
      2) before appending grad_op_desc to backward part, addup_repetitive_vars, remove unused branch
      test=develop
      
      * make method private
      
      * add recompute strategy into DistributedStrategy
      test=develop
      
      * checkpoint version3
      test=develop
      
      * remove some print information
      test=develop
      
      * remove unused sumop
      test=develop
      
      * try to fix recompute with graph building modules
      
      * add input names to vars should be held
      
      * add memory debug tool
      
      * backup backward
      
      * Fix bugs
      
      * add backward desc for op not in any segments
      
      * add exception info for sub_block
      
      test=develop
      
      * modify code style
      
      test=develop
      
      * modify code style
      
      test=develop
      
      * remove print functions
      
      test=develop
      
      * add API spec
      
      test=develop
      test=document_preview
      
      * make Recompute a child class of Optimizer
      
      test=develop
      test=document_preview
      
      * add API spec
      
      test=develop
      test=document_preview
      
      * modify API spec
      
      test=develop
      test=document_preview
      
      * add document for Recompute
      
      test=develop
      test=document_preview
      
      * change API doc of Rcompute
      
      test=develop
      test=document_preview
      
      * code cleaning
      
      test=develop
      test=document_preview
      
      * modify API spec
      
      * fix bugs when segments hold no element
      
      * add testcase for Recompute Optimizer
      
      test=develop
      test=document_preview
      
      * add test for apply_gradient, and code cleaning
      
      test=develop
      test=document_preview
      
      * add test case for load function
      
      * enable CI
      
      test=develop
      test=document
      
      * add test case
      
      test=develop
      test=document_preview
      
      * add sample code for 4 function of recompute optimizer
      
      test=develop
      test=document_preview
      9901f696
    • T
      paddle cloud role maker fix (#19646) · 278dd003
      tangwei12 提交于
      * optimize cloud rolemaker, test=develop
      278dd003
  4. 19 9月, 2019 1 次提交
  5. 17 9月, 2019 1 次提交
  6. 10 9月, 2019 1 次提交
  7. 06 9月, 2019 1 次提交
  8. 05 9月, 2019 1 次提交
  9. 30 8月, 2019 1 次提交
    • Y
      add thread scope stat accurate metrics test=develop (#19480) · 10ca3f96
      yaoxuefeng 提交于
      * add thread scope stat accurate metrics test=develop
      
      * fix style
      
      * fix style
      
      * fix style
      
      * fix style test=develop
      
      * fix style test=develop
      
      * fix style test=develop
      
      * fix style test=develop
      
      * fix style test=develop
      
      * fix style test=develop
      
      * fix style test=develop
      
      * fix conflict
      
      * fix style
      
      * fix style test=develop
      
      * fix error test=develop
      
      * fix error test=develop
      10ca3f96
  10. 29 8月, 2019 2 次提交
    • T
      support debug each output of each ins (#19004) · 1fe468d3
      Thunderbrook 提交于
      * dump slot
      
      * test
      
      * proto
      
      * dump slot
      
      * test
      
      * proto
      
      * code style
      
      * code style
      
      * code style
      
      * style
      
      * add delete after unseen days
      
      * add unseen days
      
      * code style
      
      * conflict solve
      test=develop
      
      * add clear model
      
      * code style
      test=develop
      
      * code style
      test=develop
      
      * support debug tensor of each ins
      test=develop
      
      * support debug tensor of each ins
      test=develop
      
      * learning rate
      
      * code style
      
      * code style
      
      * code style
      
      * code style
      
      * code style
      
      * code style
      
      * code style
      
      * code style
      
      * code style
      
      * code style
      
      * code style
      
      * code style
      
      * code style
      test=develop
      
      * code style
      test=develop
      
      * unitest
      
      * style
      
      * style
      
      * multi phase
      
      * add channel
      
      * code style
      
      * style
      
      * style
      
      * unitest
      
      * style
      
      * define
      
      * define
      test=develop
      
      * style
      test=develop
      
      * rm define
      test=develop
      
      * linux
      
      * linux
      test=develop
      
      * style
      test=develop
      
      * output format
      test=develop
      
      * windows ci
      test=develop
      1fe468d3
    • Z
      support fc sort by number, test=develop (#19466) · bd35a7f0
      zhang wenhui 提交于
      fleet_desc sort fc name by dictionary sort, but we want to sort by number.
      bd35a7f0
  11. 28 8月, 2019 2 次提交
  12. 27 8月, 2019 1 次提交
  13. 23 8月, 2019 1 次提交
  14. 16 8月, 2019 1 次提交
  15. 14 8月, 2019 3 次提交
  16. 12 8月, 2019 1 次提交
  17. 11 8月, 2019 1 次提交
    • Y
      add save cache model api in fleet& add slots shuffle in dataset module & add... · 9150cf50
      yaoxuefeng 提交于
      add save cache model api in fleet& add slots shuffle in dataset module & add metric op to calculate ctr related metrics (#18871)
      
      * add ctr related metric layer test=develop
      
      * add save cache and slots shuffle test=develop
      
      * add save cache and slots shuffle test=develop
      
      * fix error
      
      * fix error
      
      * fix style for ci
      
      * fix for comments
      
      * change SlotsShuffle input to std::strinf for generality
      
      * fix style
      
      * fix style
      
      * fix style
      
      * fix style
      
      * fix style
      
      * fix style
      
      * fix stylr
      
      * fix style
      
      * fix style
      
      * fix style
      
      * fix style
      
      * fix style
      
      * fix style
      
      * fix style
      
      * fix style
      
      * fix style
      
      * fix style
      
      * fix style
      
      * fix style
      
      * fix style
      
      * change non-const reference to pointer
      
      * fix style
      
      * fix style
      
      * fix style test=develop
      
      * fix style  test=develop
      
      * add return ins num in ctr metric op
      
      * change dtype to float in metric_op.py
      
      * fix error test=develop
      
      * fix style test=develop
      
      * fix API spec
      
      * fix API spec
      
      * fix API spec test=develop
      
      * add UT test=develop
      9150cf50
  18. 08 8月, 2019 1 次提交
  19. 02 8月, 2019 1 次提交
    • J
      support filelist size < trainer num && fix pull dense (#18956) · 02c370c3
      jiaqi 提交于
      * support filelist size < trainer num
      * pull dense when stop, to make sure local dense params are same as pserver, so save paddle model will save dense model same as pserver
      *  enable QueueDataset train same filelist for serveral times
      02c370c3
  20. 01 8月, 2019 1 次提交
  21. 31 7月, 2019 1 次提交
    • J
      set fleet_send_batch_num a default value according to trainer num · 233746d8
      jiaqi 提交于
      (1) set fleet_send_batch_num a default value according to trainer num, the previous 80000 is fixed,if trainer num is much less or larger than 100,global shuffle may have timeout error.
      
      (2) fix load one table bug, add barrier
      233746d8
  22. 29 7月, 2019 1 次提交
    • T
      add clear_model interface in fleetwrapper (#18815) · 52c1431e
      Thunderbrook 提交于
      * dump slot
      
      * test
      
      * proto
      
      * dump slot
      
      * test
      
      * proto
      
      * code style
      
      * code style
      
      * code style
      
      * style
      
      * add delete after unseen days
      
      * add unseen days
      
      * code style
      
      * conflict solve
      test=develop
      
      * add clear model
      
      * code style
      test=develop
      
      * code style
      test=develop
      52c1431e
  23. 25 7月, 2019 2 次提交
  24. 24 7月, 2019 1 次提交
    • T
      add slot to sparse table (#18686) · d8396281
      Thunderbrook 提交于
      The change includes 2 things:
      
      1. save delta model and shrink table are control by the same parameter before, now add delete_after_unseen_days to control shrink table.
      2. value in sparse table has no slot before, now add slot in sparse table, and add DownpureCtrAccessor to support the new meta.
      test=develop
      d8396281
  25. 23 7月, 2019 1 次提交
    • J
      support patch data, add load_one_table, fix bug (#18509) · d18aabb4
      jiaqi 提交于
      (1)support patch data (merge slots of instances of same line id, modify dense layer which
      changes its size)
      (2)add fleet load_one_table interface, support load from paddle model and load from pslib model
      (3)fix push sparse bug which cause push sparse cost more time(about 10% in my testcase)
      (4)when some slots are not in one of your network (join/update, etc.),data feed、collect label info、push/pull sparse will skip these slots, instead of throw error.
      (5)add more debug info in TrainFilesWithProfiler
      d18aabb4
  26. 22 7月, 2019 1 次提交
  27. 10 7月, 2019 1 次提交
  28. 08 7月, 2019 1 次提交
  29. 02 7月, 2019 1 次提交
  30. 27 6月, 2019 2 次提交
    • T
      fix communicator with pyreader (#18350) · 999d9a59
      tangwei12 提交于
      * add is_runnning in communicator, test=develop
      999d9a59
    • H
      supports collective communicated training (#18175) · b7128bac
      HaoRen 提交于
      * fix prepare context redundant code problem, optimize executor by caching create_varaiables
      test=develop
      
      * supports collective training in executor
      
      * make fetch_list runable with variables, add more unittest for use_program_cache
      test=develop
      
      * fix comment
      test=develop
      
      * use unique name for nccl_id
      
      * supports output to stream in program_to_code
      
      * insert sync_comm_stream before regularization; add skip_op_callstack capability in program_to_code
      
      * set op role in collective training
      
      * add collective op role
      
      * remove orig file
      
      * add build optimizer by strategy
      
      * add collective strategy
      
      * refine collective strategy
      
      * add multi-process role maker
      
      * refine strategy building factory so that we can easily plugin more strategy
      
      * scale loss grad in collective sgd transpiler
      
      * add support for distributed fc
      
      * code format
      
      * revert some features for dist fc
      
      * add support for distributed fc training
      
      * fix prepare context redundant code problem, optimize executor by caching create_varaiables
      test=develop
      
      * supports collective training in executor
      
      * make fetch_list runable with variables, add more unittest for use_program_cache
      test=develop
      
      * use unique name for nccl_id
      
      * supports output to stream in program_to_code
      
      * insert sync_comm_stream before regularization; add skip_op_callstack capability in program_to_code
      
      * set op role in collective training
      
      * add collective op role
      
      * fix comment
      test=develop
      
      * remove orig file
      
      * add build optimizer by strategy
      
      * add collective strategy
      
      * refine collective strategy
      
      * add multi-process role maker
      
      * refine strategy building factory so that we can easily plugin more strategy
      
      * scale loss grad in collective sgd transpiler
      
      * add support for distributed fc
      
      * code format
      
      * revert some features for dist fc
      
      * add support for distributed fc training
      
      * test=develop
      add collective op unittest standard
      
      * test=develop
      remove the test_collective directory
      
      * test=develop
      remove the test_collective directory
      
      * remove slicegather test
      
      * code format for reducescatter
      
      * update attr of shard_index_op
      
      * Modify macro nccl_helper
      
      * remove test without distribute
      
      * macro collective_helper
      
      * marcro update
      
      * test=develop
      update support python3.5
      
      * test=develop change gpu memory use to 0.1 when test
      
      * test=develop
      update ut equal func
      
      * test=develop
      set flags to 1.5
      
      * test=develop fix pickle dumple  py35
      
      * test=develop
      fix divide in slice and add sync_comm_stream
      update atol and rtol to 1e-05
      rm shard_index op and test
      modify read input from file to read from memory
      remove origin_program in framework and add i/o in c_sync_calc_stream
      
      * test=develop update unittest sync operator I/O
      b7128bac
  31. 23 6月, 2019 1 次提交
  32. 17 6月, 2019 2 次提交