1. 23 Sep 2018, 1 commit
  2. 19 Sep 2018, 2 commits
  3. 18 Sep 2018, 1 commit
    • Dev define test blob (#1247) · 8ebe859c
      Committed by Li Xinqi
      * define_test_blob
      
      * decode random compute task node
      
      * rename define_test_blob_conf.name => define_test_blob_conf.out
      
      * decode random task node color
      
      
      Former-commit-id: 0476d2c2
  4. 17 Sep 2018, 6 commits
    • moving model (#1234) · 3d5244c8
      Committed by Li Xinqi
      * moving model
      
      * moving_model => forward_model
      
      * add todo commit
      
      * two model save node
      
      * let md_updt actor handle forward_model
      
      * remove useless code
      
      * rename local variable
      
      
      Former-commit-id: baa146bd
    • refine model update conf (#1240) · 33868c01
      Committed by Shiyuan Shang-Guan
      * refine model update conf
      
      * make todo
      
      * add primary_lr and secondary_lr
      
      
      Former-commit-id: 5ccd29d7
    • b3286301
    • Dev refactor channel (#1181) · b012dc22
      Committed by Juncheng
      * add enum ChannelStatus
      
      * merge CloseSendEnd and CloseReceiveEnd
      
      * update channel_test
      
      
      Former-commit-id: fda25987
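The channel refactor above (a single `ChannelStatus` enum, with `CloseSendEnd` and `CloseReceiveEnd` merged into one close operation) can be sketched roughly as follows. This is an illustrative toy, not OneFlow's actual `Channel`; the names `Send`, `Receive`, and `Close` are assumptions:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>

// Status returned by channel operations after the merge of the two
// close-end states into a single kClosed.
enum class ChannelStatus { kOpen, kClosed };

template <typename T>
class Channel {
 public:
  ChannelStatus Send(const T& item) {
    std::unique_lock<std::mutex> lock(mutex_);
    if (status_ == ChannelStatus::kClosed) { return ChannelStatus::kClosed; }
    queue_.push(item);
    cond_.notify_one();
    return ChannelStatus::kOpen;
  }

  // Blocks until an item is available or the channel is closed and drained.
  ChannelStatus Receive(T* item) {
    std::unique_lock<std::mutex> lock(mutex_);
    cond_.wait(lock, [this] {
      return !queue_.empty() || status_ == ChannelStatus::kClosed;
    });
    if (queue_.empty()) { return ChannelStatus::kClosed; }
    *item = queue_.front();
    queue_.pop();
    return ChannelStatus::kOpen;
  }

  // One Close() for both ends, replacing CloseSendEnd/CloseReceiveEnd.
  void Close() {
    std::unique_lock<std::mutex> lock(mutex_);
    status_ = ChannelStatus::kClosed;
    cond_.notify_all();
  }

 private:
  std::queue<T> queue_;
  std::mutex mutex_;
  std::condition_variable cond_;
  ChannelStatus status_ = ChannelStatus::kOpen;
};
```

Note that a receiver can still drain buffered items after `Close()`; only an empty, closed channel reports `kClosed`.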
    • Refine runtime (#1108) · 03c635ba
      Committed by Jinhui Yuan
      * only master machine saves plan and has event logger
      
      * separate Data, Persistence, Cache, Log FileSystem config
      
      * refine
      
      * only specify data and snapshot path conf
      
      * forbid multiple machines from using localfs as snapshot fs
      
      * networkfs as localfs
      
      * refine
      
      * Store log to snapshot (#1109)
      
      * use machine id, drop machine name
      
      * ensure setting machine id
      
      * allow save snapshot to localfs for distributed training (#1113)
      
      * Snapshot to master (#1116)
      
      * allow save snapshot to localfs for distributed training
      
      * fix mdSave to master for model parallel
      
      * fix review comment issues
      
      * add sanity check for machine id
      
      * rm useless comments
      
      * update example
      
      * Dev refine runtime add log stream mgr (#1142)
      
      * add LogStreamMgr
      
      * refine and refactor OutStream=>LogStream
      
      * bugfix
      
      * use LogStreamMgr to write graph, dot, plan, profile and proto
      
      * refine
      
      * simplify, remove LogStreamMgr (#1243)
      
      * simplify, remove LogStreamMgr
      
      * TeePersistentLogStream add static factory (#1244)
      
      
      Former-commit-id: d76513b3
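The log-stream work above ends with a tee-style stream (`TeePersistentLogStream`) that gets a static factory. A minimal sketch of the tee idea, assuming plain `std::ostream` targets rather than OneFlow's persistent file systems (class and method names here are illustrative):

```cpp
#include <cassert>
#include <memory>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Duplicates every write to several underlying streams, so e.g. a plan
// or dot graph can go to both a log file and a snapshot directory.
class TeeLogStream {
 public:
  // Static factory in the spirit of the commit above.
  static std::unique_ptr<TeeLogStream> Create(std::vector<std::ostream*> outs) {
    return std::unique_ptr<TeeLogStream>(new TeeLogStream(std::move(outs)));
  }

  TeeLogStream& operator<<(const std::string& s) {
    for (std::ostream* out : outs_) { (*out) << s; }
    return *this;
  }

 private:
  explicit TeeLogStream(std::vector<std::ostream*> outs) : outs_(std::move(outs)) {}
  std::vector<std::ostream*> outs_;
};
```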
    • fix bug of forward model -> copyD2H conflict with out regst (#1242) · b3f6e061
      Committed by cheng cheng
      * fix bug of forward model -> copyD2H conflict with out regst
      
      * use 1 line
      
      
      Former-commit-id: 0da0646c
  5. 16 Sep 2018, 2 commits
  6. 15 Sep 2018, 2 commits
    • pb list data type (#1237) · d66ad601
      Committed by Li Xinqi
      
      
      Former-commit-id: 58f43ff5
    • separate model for update (#1232) · 9f22ecaa
      Committed by Shiyuan Shang-Guan
      * make each blob of the packed blob be updated separately in the ModelUpdate
      
      * make blob descs in regst be consistent in bw->md_diff_acc->shared_md_diff_add->md_update->fw
      
      * copy lbi2blob_descs from model
      
      * add shared_model_diff_add kernel
      
      * refine model_update actor and kernel
      
      * rm useless TODO
      
      * add shared_model_diff_add kernel
      
      * refine code
      
      
      Former-commit-id: 11408363
  7. 14 Sep 2018, 2 commits
  8. 13 Sep 2018, 1 commit
  9. 10 Sep 2018, 2 commits
  10. 09 Sep 2018, 1 commit
  11. 07 Sep 2018, 3 commits
    • feat: update the data members to use RegstSlot in Actor (#1208) · d0f50ede
      Committed by Niu Chong
      * feat(register_slot): add the RegstSlot
      
      * feat(register_slot): update RegstSlot if
      
      * feat(actor): update member of Actor to use RegstSlot
      
      * fix(register_slot): fix the available_regst_desc_cnt init val
      
      * refine(register_slot): rename PushBack/PopFront, FindTheRegstDescId to TryPushBack/TryPopFront, HasRegstDescId
      
      * feat(regst_slot): rename ForEachCurRegstDeq/ForEachCurFrontRegst to ForEachRegstDeq/ForEachFrontRegst
      
      * feat(regst_slot): add ForChosenRegstDeq/ForChosenFrontRegst, add CHECK empty in ForEachFrontRegst
      
      * fix(register_slot): fix the CHECK empty
      
      
      Former-commit-id: 38a50de4
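The RegstSlot commits above describe its shape well enough to sketch: a map from `regst_desc_id` to a deque of registers, `Try*` operations that report success instead of failing a CHECK, and an `available_regst_desc_cnt` counting descs that currently hold at least one register. This is a toy reconstruction, not OneFlow's actual class; `InitedDescId` and the use of `int64_t` as a stand-in for `Regst*` are assumptions:

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <unordered_map>

class RegstSlot {
 public:
  // Register a desc id so TryPushBack on it can succeed later.
  void InitedDescId(int64_t regst_desc_id) { id2regsts_[regst_desc_id]; }

  bool HasRegstDescId(int64_t regst_desc_id) const {
    return id2regsts_.count(regst_desc_id) > 0;
  }

  // Renamed from PushBack: returns false for an unknown desc id
  // instead of CHECK-failing.
  bool TryPushBack(int64_t regst_desc_id, int64_t regst) {
    auto it = id2regsts_.find(regst_desc_id);
    if (it == id2regsts_.end()) { return false; }
    if (it->second.empty()) { available_regst_desc_cnt_ += 1; }
    it->second.push_back(regst);
    return true;
  }

  bool TryPopFront(int64_t regst_desc_id) {
    auto it = id2regsts_.find(regst_desc_id);
    if (it == id2regsts_.end() || it->second.empty()) { return false; }
    it->second.pop_front();
    if (it->second.empty()) { available_regst_desc_cnt_ -= 1; }
    return true;
  }

  // Number of desc ids with at least one available regst (the counter
  // whose init value one of the fixes above corrects).
  int64_t available_regst_desc_cnt() const { return available_regst_desc_cnt_; }

 private:
  std::unordered_map<int64_t, std::deque<int64_t>> id2regsts_;
  int64_t available_regst_desc_cnt_ = 0;
};
```

An actor can then test `available_regst_desc_cnt()` against the number of consumed desc ids to decide whether it is ready to act.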
    • Dev allreduce2 (#1211) · e1b30bd5
      Committed by Jinhui Yuan
      * add ReduceScatter2, ReduceAdd2, ReduceGather2 op and kernel
      
      * add ReduceScatter2, ReduceAdd2, ReduceGather2 task node and actor
      
      * complete Reduce2 op
      
      * TODO: complete ReduceAdd2 kernel
      
      * add ReduceScatter2 task to accept model_diff
      
      * sketch of connecting ReduceScatter2/Add2/Gather2
      
      * build allreduce2 logical graph
      
      * connect allreduce2 task graph
      
      * ReduceScatter2 task node
      
      * complete ReduceAdd2, ReduceGather2 task node
      
      * simplify ReduceAdd2 actor
      
      * refactor ReduceAdd2 task node
      
      * let global add -> gather share path
      
      * separate ReduceLocalAdd2 and ReduceGlobalAdd2
      
      * connect AllReduce2 task graph
      
      * complete ReduceGlobalAdd2 op
      
      * refine ReduceLocalAdd2 task node
      
      * complete ReduceGlobalAdd2 task node
      
      * global AllReduce2 works
      
      * add device_num_of_each_machine to parallel_context
      
      * simplify ReduceGlobalAdd2 runtime
      
      * multi machine multi gpus AllReduce2 works
      
      * add mem sharing and ctrl edge for AllReduce2
      
      * single machine multiple gpu mem sharing works
      
      * refine
      
      * remove the previous allreduce
      
      * change AllReduce2 to AllReduce variable convention
      
      * change filename
      
      * complete transfer to allreduce2
      
      * remove unnecessary format change
      
      * remove unnecessary format change
      
      * simplify
      
      * simplify mem sharing rule for reduce add and gather
      
      * check for local add
      
      * fix reduce_global_add actor bug
      
      * refine reduce task node
      
      * refine variable name
      
      * refine
      
      * refine
      
      
      Former-commit-id: 5909cc43
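The ReduceScatter2 → ReduceAdd2 → ReduceGather2 pipeline above can be illustrated numerically on plain vectors. In the real system the slices are regsts moving between task nodes and actors, with memory sharing and ctrl edges; this toy keeps only the arithmetic, and the function name is illustrative:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy all-reduce over n "devices": scatter splits each model_diff into n
// slices; add sums slice i across devices (device i owns slice i); gather
// broadcasts the reduced slices back so every device holds the full sum.
std::vector<std::vector<double>> AllReduce(std::vector<std::vector<double>> diffs) {
  const size_t n = diffs.size();        // number of devices
  const size_t len = diffs[0].size();   // model_diff length, assumed divisible by n
  const size_t part = len / n;
  std::vector<double> reduced(len, 0.0);
  // ReduceAdd step: device i sums everyone's i-th slice.
  for (size_t i = 0; i < n; ++i) {
    for (size_t j = i * part; j < (i + 1) * part; ++j) {
      for (size_t k = 0; k < n; ++k) { reduced[j] += diffs[k][j]; }
    }
  }
  // ReduceGather step: every device ends up with the full reduced diff.
  for (size_t k = 0; k < n; ++k) { diffs[k] = reduced; }
  return diffs;
}
```

The local/global split in the commits (`ReduceLocalAdd2` vs `ReduceGlobalAdd2`) does this addition in two tiers, first across GPUs on one machine, then across machines, which is why `device_num_of_each_machine` was added to `parallel_context`.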
    • fix bug in add kernel of allreduce (#1214) · a76f47b3
      Committed by Jinhui Yuan
      
      
      Former-commit-id: 34ce4862
  12. 06 Sep 2018, 1 commit
  13. 04 Sep 2018, 5 commits
  14. 03 Sep 2018, 2 commits
  15. 02 Sep 2018, 3 commits
  16. 01 Sep 2018, 2 commits
  17. 31 Aug 2018, 1 commit
  18. 30 Aug 2018, 1 commit
  19. 29 Aug 2018, 1 commit
    • sketch of merge reduce project (#1159) · 0252bca8
      Committed by Jinhui Yuan
      * sketch of merge reduce project
      
      * add reduce_concat, reduce_split in logical graph (#1160)
      
      * add reduce_concat, reduce_split in logical graph
      
      * init ReduceTaskNodes in CollectReduceTaskNodes
      
      * add CompTaskNode for ReduceConcat & ReduceSplit
      
      * set ReduceConcat/Split color index
      
      * copy blob desc from ReduceConcat in to ReduceSplit out
      
      * refine CollectReduceTaskNodes
      
      * SetMemSharing for ReduceConcat, ReduceSplit regst
      
      * complete ReduceConcat & ReduceSplit op
      
      * fill ReduceConcat & ReduceSplit kernel
      
      * simplify ReduceConcatCompActor
      
      * make ReduceScatter & ReduceSplit as input-wise actor
      
      * reduce_scatter & reduce_split use is_inplace
      
      * use ByteSizeOfBlobBody for reduce related packed blob
      
      * Fix dev merge reduce (#1168)
      
      * check concat and split occur simultaneously
      
      * fix ReduceScatter & ReduceSplit as Inputwise actor
      
      * ReduceConcat & ReduceSplit works
      
      * fix single gpu issue
      
      * Refactor reduce (#1170)
      
      * backup, not complete yet
      
      * remove reduce_id
      
      * rm useless comment
      
      * add reduce_graph (#1169)
      
      * add reduce_graph
      
      * fix iter
      
      * add IsLogicalNodeMergeable and fix bug
      
      * remove needless constructor calls
      
      * node VisualStr may conflict, using node_id_str instead
      
      * reduce group works (#1171)
      
      * refine
      
      * sort nodes in topo (#1172)
      
      * add reduce_group_size in job_conf, fix 121 config of ReduceSplit and MdUpdt
      
      * resolve code review issues (variable names)
      
      * refine variable names
      
      * Dev merge reduce rename reduce group (#1174)
      
      * ReduceGraph=>ChainLogicalGraph
      
      * rename Group=>Chain
      
      * reformat
      
      * use pointer instead of reference for mutable argument
      
      * format change
      
      * worker node only pull sub_plan (#1176)
      
      * log compile time
      
      * use c++11 member initialization syntax
      
      * FixPackedBlobDescOfProducedRegst for ReduceSplit
      
      * Dev merge reduce refine chain logical graph (#1177)
      
      * remove IsMerageable
      
      * split TryMergeOneChain and rename to TryMergeTwoChains
      
      * reformat
      
      * resolve review issues
      
      
      Former-commit-id: 3aa79c70
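The ReduceConcat/ReduceSplit pair above relies on memory sharing: many small model-diff blobs are viewed as one packed buffer laid out back to back, reduced once, then "split" by reading the same offsets rather than copying. A minimal sketch of that layout computation, with all names hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A view into the packed buffer: where a blob starts and how long it is.
struct BlobView {
  std::size_t offset;
  std::size_t size;
};

// Lay blobs out contiguously. ReduceConcat writes through these views and
// ReduceSplit reads through the same ones, so no copy is needed and the
// whole packed blob can be all-reduced in one shot.
std::vector<BlobView> PackOffsets(const std::vector<std::size_t>& blob_sizes) {
  std::vector<BlobView> views;
  std::size_t offset = 0;
  for (std::size_t s : blob_sizes) {
    views.push_back({offset, s});
    offset += s;
  }
  return views;
}
```

The `reduce_group_size` knob mentioned above then controls how many such packed groups the chain-merged logical graph produces.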
  20. 27 Aug 2018, 1 commit