1. 01 Oct 2018, 2 commits
  2. 30 Sep 2018, 1 commit
    • Refactor Actor (#1259) · 9fda43bf
      Committed by Niu Chong
      * feat(register_slot): add the RegstSlot
      
      * feat(register_slot): update RegstSlot if
      
      * feat(actor): update member of Actor to use RegstSlot
      
      * fix(register_slot): fix the available_regst_desc_cnt init val
      
      * refine(register_slot): rename PushBack/PopFront, FindTheRegstDescId to TryPushBack/TryPopFront, HasRegstDescId
      
      * feat(regst_slot): rename ForEachCurRegstDeq/ForEachCurFrontRegst to ForEachRegstDeq/ForEachFrontRegst
      
      * feat(regst_slot): add ForChosenRegstDeq/ForChosenFrontRegst, add CHECK empty in ForEachFrontRegst
      
      * fix(register_slot): fix the CHECK empty
      
      * feat: remove actual_writeable_regst_desc_id_ from Actor, add Naive/CustomizedProducedRegst
      
      * fix(normal_model_update_actor): bug: not send customized regst to consumer when SendIntialModel
      
      * fix(normal_forward_compute_actor): bug: not add kLoss/kAccuracy produced regst to NaiveProducedRegst
      
      * fix(actor): UNIMPLEMENTED() for AsyncSendCustomizedProducedRegstMsgToConsumer
      
      * fix(normal_forward_compute_actor): set const_buf_regst to nullptr when recv from consumers
      
      * fix(actor): total_reading_data_regst_cnt, not total_reading_ctrl_regst_cnt
      
      * refactor: update GetNaiveConsumedRegstDescName to GetNaiveOrCustomizedConsumedRegstDescName(same for Produced)
      
      * feat: combine data_regst and ctrl_regst in Actor
      
      * fix: fix bugs
      
      * fix: fix bugs
      
      * fix: remove .swp files and unused LOG
      
      * feat: split Act and SendMsg (#1255)
      
      * feat: split Act and SendMsg
      
      * refine: rename HandleProduced/ConsumedDataRegst.. to HandleProduced/ConsumedNaiveDatRegst..
      
      * fix(input_wise_comp_actor): bug: not set piece id
      
      * fix(actor): potential bug: produced msg with no allowed actor still pop from queue
      
      * refactor: mv some protected member function to private
      
      * fix(actor): fix the condition about sending EORD msg
      
      * refactor(input_wise_actor): use RegstSlot in InputWiseActor
      
      * fix(copy_comm_net_actor): rename piece_id2regst_ctx to piece_id2regst_ctx_
      
      * refactor: rename Name2RegstDescId to Name2RegstDescIds
      
      * refactor(naive_actor): "override final" instead of only "final"
      
      * refine(actor): little refine
      
      * feat: update the return type of GetNaiveOrCustomizedNamesRegstDescName to enum class RegstNameType
      
      
      Former-commit-id: e042befc
      9fda43bf
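
The RegstSlot refactor above replaces CHECK-failing PushBack/PopFront with Try-variants that report success, adds HasRegstDescId, and tracks available_regst_desc_cnt. Below is a minimal sketch of such a container, with illustrative signatures assumed here; it is not the actual OneFlow source:

```cpp
#include <cstdint>
#include <deque>
#include <functional>
#include <unordered_map>

class Regst;  // stand-in for oneflow::Regst

class RegstSlot {
 public:
  bool HasRegstDescId(int64_t regst_desc_id) const {
    return regst_desc_id2deq_.count(regst_desc_id) > 0;
  }
  // Try* variants report failure instead of CHECK-failing on an unknown id.
  bool TryPushBack(int64_t regst_desc_id, Regst* regst) {
    auto it = regst_desc_id2deq_.find(regst_desc_id);
    if (it == regst_desc_id2deq_.end()) { return false; }
    if (it->second.empty()) { available_regst_desc_cnt_ += 1; }
    it->second.push_back(regst);
    return true;
  }
  bool TryPopFront(int64_t regst_desc_id) {
    auto it = regst_desc_id2deq_.find(regst_desc_id);
    if (it == regst_desc_id2deq_.end() || it->second.empty()) { return false; }
    it->second.pop_front();
    if (it->second.empty()) { available_regst_desc_cnt_ -= 1; }
    return true;
  }
  // Visit the front regst of every deque; this sketch skips empty deques,
  // whereas the refactor above CHECKs non-emptiness instead.
  void ForEachFrontRegst(const std::function<void(Regst*)>& Handler) const {
    for (const auto& pair : regst_desc_id2deq_) {
      if (!pair.second.empty()) { Handler(pair.second.front()); }
    }
  }
  int64_t available_regst_desc_cnt() const { return available_regst_desc_cnt_; }

 private:
  std::unordered_map<int64_t, std::deque<Regst*>> regst_desc_id2deq_;
  int64_t available_regst_desc_cnt_ = 0;  // the init-val fix: start at 0
};
```

A Try* convention like this would let an Actor attempt the naive produced/consumed slot first and fall back to customized handling when the push fails, which matches the Naive/CustomizedProducedRegst split described above.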
  3. 26 Sep 2018, 2 commits
    • add impl of lars (#1163) · 388b945f
      Committed by Shiyuan Shang-Guan
      * add lars set
      
      * add lars
      
      * override ibn&obn to lbi
      
      * make model update consistent
      
      * check cuda stream sync
      
      * add LARSUpdateModelGpu
      
      * checkout naive & momentum model update
      
      * use cublas::dot compute SumOfSquare
      
      * update lars for master
      
      * refine lars for master
      
      
      Former-commit-id: 9518970b
      388b945f
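
LARS (layer-wise adaptive rate scaling) computes a per-layer local learning rate from the ratio of the weight norm to the gradient norm, so layers with large weights are not swamped by one global rate. Below is a CPU sketch of the usual LARS-with-momentum update; the name LarsUpdate and its signature are illustrative, not the kernel added above (which, per the commits, computes the sums of squares with cublas::dot on GPU):

```cpp
#include <cmath>
#include <cstddef>

// One step of LARS with momentum on a single blob ("layer").
void LarsUpdate(std::size_t n, float learning_rate, float momentum_beta,
                float weight_decay, float epsilon, float lars_coefficient,
                const float* model_diff, float* momentum, float* model) {
  double model_norm2 = 0.0, diff_norm2 = 0.0;  // sums of squares
  for (std::size_t i = 0; i < n; ++i) {
    model_norm2 += static_cast<double>(model[i]) * model[i];
    diff_norm2 += static_cast<double>(model_diff[i]) * model_diff[i];
  }
  const double model_norm = std::sqrt(model_norm2);
  const double diff_norm = std::sqrt(diff_norm2);
  // Layer-wise local learning rate: eta * ||w|| / (||g|| + wd * ||w|| + eps).
  const double local_lr = learning_rate * lars_coefficient * model_norm /
                          (diff_norm + weight_decay * model_norm + epsilon);
  for (std::size_t i = 0; i < n; ++i) {
    const float grad = model_diff[i] + weight_decay * model[i];
    momentum[i] = momentum_beta * momentum[i] +
                  static_cast<float>(local_lr) * grad;
    model[i] -= momentum[i];
  }
}
```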
    • Hinge loss test (#1263) · 3343e9b5
      Committed by qq_22305325
      * hinge_loss_kernel_test
      
      * fix opkernel_test
      
      * fix test file
      
      * optimize test file
      
      * optimize opkernel test
      
      * complete opkernel test interface
      
      
      Former-commit-id: 7faf75a6
      3343e9b5
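
For context, hinge loss for a binary label y ∈ {-1, +1} and prediction p is max(0, 1 - y·p). A tiny self-check in the spirit of an opkernel test; the helper and values are made up for illustration:

```cpp
#include <algorithm>
#include <cassert>

// Hinge loss for binary label y in {-1, +1}: L(y, p) = max(0, 1 - y * p).
float HingeLoss(float label, float pred) {
  return std::max(0.0f, 1.0f - label * pred);
}

int main() {
  assert(HingeLoss(+1.0f, 2.0f) == 0.0f);  // confident and correct: no loss
  assert(HingeLoss(+1.0f, 0.5f) == 0.5f);  // correct side, inside the margin
  assert(HingeLoss(-1.0f, 0.5f) == 1.5f);  // wrong side of the margin
  return 0;
}
```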
  4. 25 Sep 2018, 2 commits
  5. 24 Sep 2018, 1 commit
    • Dev use nccl (#1198) · 9201b815
      Committed by Jinhui Yuan
      * add nccl dependency
      
      * add nccl comm handle
      
      * nccl allreduce works
      
      * NcclAllreduce -> NcclAllReduce
      
      * fix header guard
      
      * add NcclReduceScatter, NcclAllGather
      
      * complete ReduceScatter and AllGather (with cuda error)
      
      * change variable name
      
      * reduce-scatter, all-gather works
      
      * add NcclScatter and NcclGather work type
      
      * Dev use nccl add nccl comm manager (#1206)
      
      * add parallel_set_id
      
      * add nccl_comm_manager
      
      * log nccl comm create
      
      * use NcclCommMgr
      
      * bugfix
      
      * OF_DISALLOW_COPY_AND_MOVE
      
      * remove nccl_scatter_handle and nccl_gather_handle from DeviceCtx
      
      * remove nccl handles from cuda_stream_handle
      
      * nccl_util and GetNcclDataType
      
      * fix rank_num
      
      * fix rank_id
      
      * CudaCheck->NcclCheck
      
      * only GPU
      
      * PoorCompTaskNode
      
      SoleIn, SoleOut, SoleOp, SoleIbn, SoleObn
      
      * PoorCompTaskNode
      
      * reformat
      
      * format change
      
      * Dev use nccl merge reduce share mem (#1216)
      
      * add parallel_set_id
      
      * add nccl_comm_manager
      
      * log nccl comm create
      
      * use NcclCommMgr
      
      * bugfix
      
      * OF_DISALLOW_COPY_AND_MOVE
      
      * remove nccl_scatter_handle and nccl_gather_handle from DeviceCtx
      
      * remove nccl handles from cuda_stream_handle
      
      * nccl_util and GetNcclDataType
      
      * fix rank_num
      
      * fix rank_id
      
      * CudaCheck->NcclCheck
      
      * only GPU
      
      * PoorCompTaskNode
      
      SoleIn, SoleOut, SoleOp, SoleIbn, SoleObn
      
      * PoorCompTaskNode
      
      * reformat
      
      * ReduceGather
      
      * GlobalAdd
      
      * ReduceScatter
      
      * EnableIfNeed
      
      * ConcatSplit
      
      * EnableMemSharing for pred if need
      
      * CtrlEdge for Gather
      
      * CtrlEdge for GlobalAdd
      
      * LocalAdd CtrlEdge
      
      * CollectReduceTaskNode
      
      * reverse nodes
      
      * local_add_mem_sharing
      
      * global add mem sharing
      
      * reduce_mem_sharing
      
      * bugfix
      
      * refine
      
      * format change (remove empty lines)
      
      * format change
      
      * fix local_add and gather issues
      
      * Dev refactor reduce add (#1218)
      
      * change ReduceGlobalAdd to ReduceAdd
      
      * rm ReduceLocalAdd
      
      * no mem sharing case works
      
      * let ReduceAddCompActor decide whether it is local or global
      
      * multi machine multi gpus Nccl and Oneflow allreduce works
      
      * refine
      
      * extract SortEdges
      
      * make EdgeInfo protected
      
      * Dev use nccl refine (#1220)
      
      * const qualifier
      
      * PoorCompTaskNode=>PipeCompTaskNode
      
      * int=>int32_t
      
      * refine ReduceMemSharingCtx
      
      * NcclDeviceCtx and NcclActor
      
      * empty line
      
      * CudaDeviceCtx<-NcclDeviceCtx
      
      * fix wrong rank_id in reduce_add_actor (#1229)
      
      * fix wrong rank_id in reduce_add_actor
      
      * rm device_num_of_each_machine from parallel_ctx
      
      * fix reduce gather control edge (#1235)
      
      * fix reduce gather control edge
      
      * extract FindNearestReduceAddCompTaskNode
      
      * extract method ReduceCompTaskNodeIf::FindPredRduceTaskNodeIf
      
      * CHECK nearest_add_copy_d2h
      
      * Dev use nccl cross machine nccl all reduce (#1246)
      
      * support ncclAllReduce cross machine
      
      * fix rank_id and rank_num for mix
      
      * reformat
      
      * reformat
      
      * simplify nccl_kernel (#1256)
      
      * simplify REGISTER_BLD_SUB_TSK_GPH_MTHD (#1260)
      
      * simplify REGISTER_BLD_SUB_TSK_GPH_MTHD
      
      * note
      
      * Dev use nccl reduce ranking ctx (#1252)
      
      * reformat
      
      * compute rank_id and rank_num with FixCompTaskNode
      
      * reformat
      
      * fix rank_id for reduceadd
      
      * ReduceRankingCtx
      
      * New Ranking and MemSharing for Reduce
      
      * DECLARE_REDUCE_LOGICAL_NODE
      
      * Ranking4NcclAllReduce
      
      * fix ranking
      
      * remove AsTaskNode
      
      * reformat
      
      * runtime rank ctx
      
      * rank_set
      
      * bugfix
      
      * bugfix
      
      * unittest
      
      * change use_nccl_all_reduce_cross_machine to use_nccl_inter_node_communication
      
      * refine
      
      * move BuildCtrlRegstBetweenReduceCopyNodes to ReduceAddCompTaskNode
      
      * CHECK mem_size_
      
      
      Former-commit-id: 55496813
      9201b815
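
The NCCL work above wraps ncclAllReduce / ncclReduceScatter / ncclAllGather in task nodes and manages communicators through an NcclCommMgr. For reference, a minimal single-process, multi-GPU ncclAllReduce against the stock NCCL API looks roughly like this; it is illustrative, not OneFlow's actor-based integration, and the CHECK macros merely stand in for the CudaCheck/NcclCheck mentioned in the commits:

```cpp
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>
#include <nccl.h>

#define CUDA_CHECK(cmd) do { if ((cmd) != cudaSuccess) std::abort(); } while (0)
#define NCCL_CHECK(cmd) do { if ((cmd) != ncclSuccess) std::abort(); } while (0)

int main() {
  int dev_cnt = 0;
  CUDA_CHECK(cudaGetDeviceCount(&dev_cnt));
  const size_t count = 1024;  // elements per device
  std::vector<int> devs(dev_cnt);
  for (int i = 0; i < dev_cnt; ++i) { devs[i] = i; }
  // One communicator per local GPU (the "nccl comm handle" above).
  std::vector<ncclComm_t> comms(dev_cnt);
  NCCL_CHECK(ncclCommInitAll(comms.data(), dev_cnt, devs.data()));
  std::vector<float*> bufs(dev_cnt);
  std::vector<cudaStream_t> streams(dev_cnt);
  for (int i = 0; i < dev_cnt; ++i) {
    CUDA_CHECK(cudaSetDevice(i));
    CUDA_CHECK(cudaMalloc(&bufs[i], count * sizeof(float)));
    CUDA_CHECK(cudaMemset(bufs[i], 0, count * sizeof(float)));
    CUDA_CHECK(cudaStreamCreate(&streams[i]));
  }
  // In-place sum across devices; group the per-rank calls together.
  NCCL_CHECK(ncclGroupStart());
  for (int i = 0; i < dev_cnt; ++i) {
    NCCL_CHECK(ncclAllReduce(bufs[i], bufs[i], count, ncclFloat, ncclSum,
                             comms[i], streams[i]));
  }
  NCCL_CHECK(ncclGroupEnd());
  for (int i = 0; i < dev_cnt; ++i) {
    CUDA_CHECK(cudaSetDevice(i));
    CUDA_CHECK(cudaStreamSynchronize(streams[i]));
    CUDA_CHECK(cudaFree(bufs[i]));
    NCCL_CHECK(ncclCommDestroy(comms[i]));
  }
  return 0;
}
```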
  6. 23 Sep 2018, 1 commit
  7. 19 Sep 2018, 2 commits
  8. 18 Sep 2018, 1 commit
    • Dev define test blob (#1247) · 8ebe859c
      Committed by Li Xinqi
      * define_test_blob
      
      * decode random compute task node
      
      * rename define_test_blob_conf.name => define_test_blob_conf.out
      
      * decode random task node color
      
      
      Former-commit-id: 0476d2c2
      8ebe859c
  9. 17 Sep 2018, 6 commits
    • moving model (#1234) · 3d5244c8
      Committed by Li Xinqi
      * moving model
      
      * moving_model => forward_model
      
      * add todo commit
      
      * two model save node
      
      * let md_updt actor handle forward_model
      
      * remove useless code
      
      * rename local variable
      
      
      Former-commit-id: baa146bd
      3d5244c8
    • refine model update conf (#1240) · 33868c01
      Committed by Shiyuan Shang-Guan
      * refine model update conf
      
      * make todo
      
      * add primary_lr and secondary_lr
      
      
      Former-commit-id: 5ccd29d7
      33868c01
    • b3286301
    • Dev refactor channel (#1181) · b012dc22
      Committed by Juncheng
      * add enum ChannelStatus
      
      * merge CloseSendEnd and CloseReceiveEnd
      
      * update channel_test
      
      
      Former-commit-id: fda25987
      b012dc22
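
A minimal sketch of what merging CloseSendEnd/CloseReceiveEnd into a single Close() with an enum ChannelStatus might look like; this is illustrative, not the OneFlow implementation:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

enum class ChannelStatus { kSuccess = 0, kClosed };

template<typename T>
class Channel {
 public:
  ChannelStatus Send(const T& item) {
    std::unique_lock<std::mutex> lock(mutex_);
    if (is_closed_) { return ChannelStatus::kClosed; }
    queue_.push(item);
    cond_.notify_one();
    return ChannelStatus::kSuccess;
  }
  ChannelStatus Receive(T* item) {
    std::unique_lock<std::mutex> lock(mutex_);
    cond_.wait(lock, [this]() { return !queue_.empty() || is_closed_; });
    if (queue_.empty()) { return ChannelStatus::kClosed; }
    *item = queue_.front();
    queue_.pop();
    return ChannelStatus::kSuccess;
  }
  // One Close() for both ends, as in the refactor above.
  void Close() {
    std::unique_lock<std::mutex> lock(mutex_);
    is_closed_ = true;
    cond_.notify_all();
  }

 private:
  std::queue<T> queue_;
  std::mutex mutex_;
  std::condition_variable cond_;
  bool is_closed_ = false;
};
```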
    • Refine runtime (#1108) · 03c635ba
      Committed by Jinhui Yuan
      * only master machine saves plan and has event logger
      
      * separate Data, Persistence, Cache, Log FileSystem config
      
      * refine
      
      * only specify data and snapshot path conf
      
      * forbid multiple machines from using localfs as snapshot fs
      
      * networkfs as localfs
      
      * refine
      
      * Store log to snapshot (#1109)
      
      * use machine id, drop machine name
      
      * ensure setting machine id
      
      * allow save snapshot to localfs for distributed training (#1113)
      
      * Snapshot to master (#1116)
      
      * allow save snapshot to localfs for distributed training
      
      * fix mdSave to master for model parallel
      
      * fix review comment issues
      
      * add sanity check for machine id
      
      * rm useless comments
      
      * update example
      
      * Dev refine runtime add log stream mgr (#1142)
      
      * add LogStreamMgr
      
      * refine and refactor OutStream=>LogStream
      
      * bugfix
      
      * use LogStreamMgr to write graph, dot, plan, profile and proto
      
      * refine
      
      * simplify, remove LogStreamMgr (#1243)
      
      * simplify, remove LogStreamMgr
      
      * TeePersistentLogStream add static factory (#1244)
      
      
      Former-commit-id: d76513b3
      03c635ba
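
The final commit's static factory presumably routes construction of TeePersistentLogStream through a named Create function so call sites cannot instantiate it directly. A generic sketch of that pattern follows; the class name comes from the commit message, while the signature and body are assumptions:

```cpp
#include <memory>
#include <string>

class TeePersistentLogStream {
 public:
  // Static factory: the only way to obtain an instance.
  static std::unique_ptr<TeePersistentLogStream> Create(const std::string& path) {
    return std::unique_ptr<TeePersistentLogStream>(new TeePersistentLogStream(path));
  }
  void Write(const std::string& content) { /* tee to the persistent log(s) */ }

 private:
  explicit TeePersistentLogStream(const std::string& path) : path_(path) {}
  std::string path_;
};
```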
    • fix bug of forward model -> copyD2H conflict with out regst (#1242) · b3f6e061
      Committed by cheng cheng
      * fix bug of forward model -> copyD2H conflict with out regst
      
      * use 1 line
      
      
      Former-commit-id: 0da0646c
      b3f6e061
  10. 16 Sep 2018, 2 commits
  11. 15 Sep 2018, 2 commits
    • pb list data type (#1237) · d66ad601
      Committed by Li Xinqi
      
      
      Former-commit-id: 58f43ff5
      d66ad601
    • separate model for update (#1232) · 9f22ecaa
      Committed by Shiyuan Shang-Guan
      * make each blob of the packed blob be updated separately in the ModelUpdate
      
      * make blob descs in regst be consistent in bw->md_diff_acc->shared_md_diff_add->md_update->fw
      
      * copy lbi2blob_descs from model
      
      * add shared_model_diff_add kernel
      
      * refine model_update actor and kernel
      
      * rm useless TODO
      
      * add shared_model_diff_add kernel
      
      * refine code
      
      
      Former-commit-id: 11408363
      9f22ecaa
  12. 14 Sep 2018, 2 commits
  13. 13 Sep 2018, 1 commit
  14. 10 Sep 2018, 2 commits
  15. 09 Sep 2018, 1 commit
  16. 07 Sep 2018, 3 commits
    • feat: update the data members to use RegstSlot in Actor (#1208) · d0f50ede
      Committed by Niu Chong
      * feat(register_slot): add the RegstSlot
      
      * feat(register_slot): update RegstSlot if
      
      * feat(actor): update member of Actor to use RegstSlot
      
      * fix(register_slot): fix the available_regst_desc_cnt init val
      
      * refine(register_slot): rename PushBack/PopFront, FindTheRegstDescId to TryPushBack/TryPopFront, HasRegstDescId
      
      * feat(regst_slot): rename ForEachCurRegstDeq/ForEachCurFrontRegst to ForEachRegstDeq/ForEachFrontRegst
      
      * feat(regst_slot): add ForChosenRegstDeq/ForChosenFrontRegst, add CHECK empty in ForEachFrontRegst
      
      * fix(register_slot): fix the CHECK empty
      
      
      Former-commit-id: 38a50de4
      d0f50ede
    • Dev allreduce2 (#1211) · e1b30bd5
      Committed by Jinhui Yuan
      * add ReduceScatter2, ReduceAdd2, ReduceGather2 op and kernel
      
      * add ReduceScatter2, ReduceAdd2, ReduceGather2 task node and actor
      
      * complete Reduce2 op
      
      * TODO: complete ReduceAdd2 kernel
      
      * add ReduceScatter2 task to accept model_diff
      
      * sketch of connecting ReduceScatter2/Add2/Gather2
      
      * build allreduce2 logical graph
      
      * connect allreduce2 task graph
      
      * ReduceScatter2 task node
      
      * complete ReduceAdd2, ReduceGather2 task node
      
      * simplify ReduceAdd2 actor
      
      * refactor ReduceAdd2 task node
      
      * let global add -> gather share path
      
      * separate ReduceLocalAdd2 and ReduceGlobalAdd2
      
      * connect AllReduce2 task graph
      
      * complete ReduceGlobalAdd2 op
      
      * refine ReduceLocalAdd2 task node
      
      * complete ReduceGlobalAdd2 task node
      
      * global AllReduce2 works
      
      * add device_num_of_each_machine to parallel_context
      
      * simplify ReduceGlobalAdd2 runtime
      
      * multi machine multi gpus AllReduce2 works
      
      * add mem sharing and ctrl edge for AllReduce2
      
      * single machine multiple gpu mem sharing works
      
      * refine
      
      * remove the previous allreduce
      
      * change AllReduce2 to AllReduce variable convention
      
      * change filename
      
      * complete transfer to allreduce2
      
      * remove unnecessary format change
      
      * remove unnecessary format change
      
      * simplify
      
      * simplify mem sharing rule for reduce add and gather
      
      * check for local add
      
      * fix reduce_global_add actor bug
      
      * refine reduce task node
      
      * refine variable name
      
      * refine
      
      * refine
      
      
      Former-commit-id: 5909cc43
      e1b30bd5
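
The Scatter2/Add2/Gather2 pipeline above is the standard decomposition of all-reduce: a reduce-scatter leaves each rank owning the reduced copy of one chunk, and an all-gather redistributes the chunks to every rank. A CPU toy model of that data flow, purely illustrative of the arithmetic rather than the task-graph implementation:

```cpp
#include <cstdio>
#include <vector>

int main() {
  const int ranks = 4, chunk = 2;  // 4 ranks, 8 gradient elements in total
  std::vector<std::vector<float>> grad(ranks,
      std::vector<float>(ranks * chunk, 1.0f));
  std::vector<std::vector<float>> owned(ranks, std::vector<float>(chunk, 0.0f));
  // ReduceScatter + Add: rank r reduces chunk r across all source ranks.
  for (int r = 0; r < ranks; ++r) {
    for (int src = 0; src < ranks; ++src) {
      for (int i = 0; i < chunk; ++i) { owned[r][i] += grad[src][r * chunk + i]; }
    }
  }
  // Gather: every rank collects the reduced chunks back into its buffer.
  for (int r = 0; r < ranks; ++r) {
    for (int owner = 0; owner < ranks; ++owner) {
      for (int i = 0; i < chunk; ++i) {
        grad[r][owner * chunk + i] = owned[owner][i];
      }
    }
  }
  std::printf("grad[0][0] = %.1f\n", grad[0][0]);  // 4.0: summed over 4 ranks
  return 0;
}
```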
    • fix bug in add kernel of allreduce (#1214) · a76f47b3
      Committed by Jinhui Yuan
      
      
      Former-commit-id: 34ce4862
      a76f47b3
  17. 06 Sep 2018, 1 commit
  18. 04 Sep 2018, 5 commits
  19. 03 Sep 2018, 2 commits
  20. 02 Sep 2018, 1 commit