1. 07 Sep, 2018 2 commits
    • Dev allreduce2 (#1211) · e1b30bd5
      Jinhui Yuan committed
      * add ReduceScatter2, ReduceAdd2, ReduceGather2 op and kernel
      
      * add ReduceScatter2, ReduceAdd2, ReduceGather2 task node and actor
      
      * complete Reduce2 op
      
      * TODO: complete ReduceAdd2 kernel
      
      * add ReduceScatter2 task to accept model_diff
      
      * sketch of connecting ReduceScatter2/Add2/Gather2
      
      * build allreduce2 logical graph
      
      * connect allreduce2 task graph
      
      * ReduceScatter2 task node
      
      * complete ReduceAdd2, ReduceGather2 task node
      
      * simplify ReduceAdd2 actor
      
      * refactor ReduceAdd2 task node
      
      * let global add -> gather share path
      
      * separate ReduceLocalAdd2 and ReduceGlobalAdd2
      
      * connect AllReduce2 task graph
      
      * complete ReduceGlobalAdd2 op
      
      * refine ReduceLocalAdd2 task node
      
      * complete ReduceGlobalAdd2 task node
      
      * global AllReduce2 works
      
      * add device_num_of_each_machine to parallel_context
      
      * simplify ReduceGlobalAdd2 runtime
      
      * multi machine multi gpus AllReduce2 works
      
      * add mem sharing and ctrl edge for AllReduce2
      
      * single machine multiple gpu mem sharing works
      
      * refine
      
      * remove the previous allreduce
      
      * change AllReduce2 to AllReduce variable convention
      
      * change filename
      
      * complete transfer to allreduce2
      
      * remove unnecessary format change
      
      * remove unnecessary format change
      
      * simplify
      
      * simplify mem sharing rule for reduce add and gather
      
      * check for local add
      
      * fix reduce_global_add actor bug
      
      * refine reduce task node
      
      * refine variable name
      
      * refine
      
      * refine
      
      
      Former-commit-id: 5909cc43
    • fix bug in add kernel of allreduce (#1214) · a76f47b3
      Jinhui Yuan committed
      
      
      Former-commit-id: 34ce4862
  2. 06 Sep, 2018 1 commit
  3. 04 Sep, 2018 5 commits
  4. 03 Sep, 2018 2 commits
  5. 02 Sep, 2018 3 commits
  6. 01 Sep, 2018 2 commits
  7. 31 Aug, 2018 1 commit
  8. 30 Aug, 2018 1 commit
  9. 29 Aug, 2018 1 commit
    • sketch of merge reduce project (#1159) · 0252bca8
      Jinhui Yuan committed
      * sketch of merge reduce project
      
      * add reduce_concat, reduce_split in logical graph (#1160)
      
      * add reduce_concat, reduce_split in logical graph
      
      * init ReduceTaskNodes in CollectReduceTaskNodes
      
      * add CompTaskNode for ReduceConcat & ReduceSplit
      
      * set ReduceConcat/Split color index
      
      * copy blob desc from ReduceConcat in to ReduceSplit out
      
      * refine CollectReduceTaskNodes
      
      * SetMemSharing for ReduceConcat, ReduceSplit regst
      
      * complete ReduceConcat & ReduceSplit op
      
      * fill ReduceConcat & ReduceSplit kernel
      
      * simplify ReduceConcatCompActor
      
      * make ReduceScatter & ReduceSplit as input-wise actor
      
      * reduce_scatter & reduce_split use is_inplace
      
      * use ByteSizeOfBlobBody for reduce related packed blob
      
      * Fix dev merge reduce (#1168)
      
      * check concat and split occur simultaneously
      
      * fix ReduceScatter & ReduceSplit as Inputwise actor
      
      * ReduceConcat & ReduceSplit works
      
      * fix single gpu issue
      
      * Refactor reduce (#1170)
      
      * backup, not complete yet
      
      * remove reduce_id
      
      * rm useless comment
      
      * add reduce_graph (#1169)
      
      * add reduce_graph
      
      * fix iter
      
      * add IsLogicalNodeMergeable and fix bug
      
      * remove needless constructor calls
      
      * node VisualStr may conflict, using node_id_str instead
      
      * reduce group works (#1171)
      
      * refine
      
      * sort nodes in topo (#1172)
      
      * add reduce_group_size in job_conf, fix 121 config of ReduceSplit and MdUpdt
      
      * resolve code review issues (variable names)
      
      * refine variable names
      
      * Dev merge reduce rename reduce group (#1174)
      
      * ReduceGraph=>ChainLogicalGraph
      
      * rename Group=>Chain
      
      * reformat
      
      * use pointer instead of reference for mutable argument
      
      * format change
      
      * worker node only pull sub_plan (#1176)
      
      * log compile time
      
      * use c++11 member initialization syntax
      
      * FixPackedBlobDescOfProducedRegst for ReduceSplit
      
      * Dev merge reduce refine chain logical graph (#1177)
      
      * remove IsMerageable
      
      * split TryMergeOneChain and rename to TryMergeTwoChains
      
      * reformat
      
      * resolve review issues
      
      
      Former-commit-id: 3aa79c70
  10. 27 Aug, 2018 1 commit
  11. 25 Aug, 2018 2 commits
  12. 24 Aug, 2018 3 commits
  13. 22 Aug, 2018 3 commits
  14. 21 Aug, 2018 1 commit
    • Dev refine runtime (#1147) · 8e16abdd
      Jinhui Yuan committed
      * clear act_event_logger act_event_bin_filename
      
      * cluster_thrd_ids_key
      
      * simplify ofrecord_decoder multi-thread
      
      * let decoder use AllocateCpuThrdIdEvenly
      
      * let ofrecord_decoder use local thread pool
      
      
      Former-commit-id: a4860e5b
  15. 20 Aug, 2018 6 commits
  16. 19 Aug, 2018 6 commits