1. 13 Nov 2018, 1 commit
    • Dev jinyi offline build (#1476) · b4258477
      Jin Yi committed
      * chore: remove pre compiler funcs
      
      * chore: add submodules
      
      * fix: fix project build URL from git_url -> submodule_dir_url
      
      * fix: fix submodule commit id
      
      * fix: fix .gitmodules
      
      * chore: mv third_party dir
      
      * chore: remove test-driver(glog#188) link in glog submodule
      
      * fix: update glog from: da816ea70645e463aa04f9564544939fa327d5a7 ==> to: 4f3e18bf26cdb794fc66cec348f57b5838a0c929
      
      * chore: update README.md
      
      
      Former-commit-id: 8cc052f38cfd53c40186dc487df41b0c1f4a7189
  2. 06 Nov 2018, 1 commit
    • Dev crop with random size (#1468) · 5d034a39
      cheng cheng committed
      * random size crop proto
      
      * ImagePreprocessImpl::<kCropWithRandomSize>
      
      * clang format
      
      * MaxVal
      
      
      Former-commit-id: c027432320cc0f03248f9165994150fce058f00a
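The commit above adds a crop-with-random-size image preprocessor (configured via proto and applied by `ImagePreprocessImpl::<kCropWithRandomSize>`). A minimal standalone C++ sketch of the idea, assuming independently sampled side ratios and a uniformly placed crop box; the function name and the min/max ratio parameters are illustrative, not OneFlow's actual proto fields:

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <random>

// Hypothetical crop box; the real preprocessor works on
// protobuf-configured image buffers instead.
struct CropBox { int32_t x, y, w, h; };

CropBox CropWithRandomSize(int32_t img_w, int32_t img_h, float min_ratio,
                           float max_ratio, std::mt19937* rng) {
  std::uniform_real_distribution<float> ratio(min_ratio, max_ratio);
  // Sample each side length independently, then a position that keeps
  // the whole box inside the image.
  int32_t w = std::max<int32_t>(1, static_cast<int32_t>(img_w * ratio(*rng)));
  int32_t h = std::max<int32_t>(1, static_cast<int32_t>(img_h * ratio(*rng)));
  std::uniform_int_distribution<int32_t> dx(0, img_w - w);
  std::uniform_int_distribution<int32_t> dy(0, img_h - h);
  return CropBox{dx(*rng), dy(*rng), w, h};
}

int main() {
  std::mt19937 rng(42);
  CropBox box = CropWithRandomSize(640, 480, 0.5f, 1.0f, &rng);
  std::cout << box.x << "," << box.y << " " << box.w << "x" << box.h << "\n";
}
```
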
  3. 05 Nov 2018, 1 commit
  4. 30 Oct 2018, 1 commit
    • Fix normalization epsilon check (#1441) · 9e6347a0
      QiaoJing committed
      * fix normalization epsilon check
      
      * remove check, fix epsilon value in op_conf
      
      
      Former-commit-id: 8ad160577179646a4d83f47a40d5de275ad19952
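The fix above drops a hard epsilon CHECK in favor of a corrected default value in op_conf. A hedged sketch of the validation pattern involved; `kMinEpsilon` and the clamping policy are assumptions (cuDNN batch normalization, for comparison, requires epsilon >= CUDNN_BN_MIN_EPSILON):

```cpp
#include <cmath>
#include <iostream>

// Hypothetical lower bound; the exact constraint in OneFlow's
// normalization op_conf may differ.
constexpr double kMinEpsilon = 1e-5;

double SanitizeEpsilon(double epsilon) {
  // Clamp instead of CHECK-failing, mirroring "remove check, fix epsilon
  // value in op_conf" above.
  return std::fmax(epsilon, kMinEpsilon);
}

int main() {
  std::cout << SanitizeEpsilon(1e-7) << "\n";  // raised to 1e-5
  std::cout << SanitizeEpsilon(1e-3) << "\n";  // kept as-is
}
```
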
  5. 29 Oct 2018, 1 commit
  6. 26 Oct 2018, 2 commits
  7. 25 Oct 2018, 1 commit
  8. 21 Oct 2018, 1 commit
  9. 20 Oct 2018, 1 commit
  10. 19 Oct 2018, 1 commit
    • feat: enhance cmake download & options (#1281) · b3fb9acf
      Jin Yi committed
      * feat: enhance cmake download & options
      
      * feat(tools/): add share libs build scripts
      
      * fix: add cmake options
      
      * feat: add 3rd party download
      
      * chore: update README
      
      * fix: fix protobuf & cmake repo
      
      * fix: fix options name
      
      * chore: merge 3rd_party.cmake & third_party.cmake
      
      * chore: revert pre cmake URL fix
      
      * chore: update ExternalProject check
      
      * fix: fix typo & missing download
      
      * fix: fix download url
      
      * chore: update readme
      
      * chore: fix typo
      
      * fix: fix bugs
      
      * fix: fix bugs
      
      * fix: fix pre
      
      * print all third party libs
      
      * refine readme
      
      * DOWNLOAD_THIRD_PARTY -> PRECOMPILED_THIRD_PARTY
      
      * refine readme
      
      * minor typo fix
      
      
      Former-commit-id: d7d1ec98a868c32e3a43658823ae136caa73feb5
  11. 17 Oct 2018, 1 commit
    • Fix snapshot (#1320) · 71d34a97
      Shiyuan Shang-Guan committed
      * fix bug of snapshot
      
      * refine distribute.sh
      
      * use more accurate function calls
      
      * rename function
      
      * update for model parallel
      
      * refine code
      
      
      Former-commit-id: e0c2ad2b2dad82e0cb3adce6de9fba98f0c4434c
  12. 14 Oct 2018, 1 commit
    • gpu (#1310) · e5764885
      Juncheng committed
      
      
      Former-commit-id: 82681d523fa9e521e2c04b5fd32e6f435f9ba722
  13. 12 Oct 2018, 3 commits
  14. 11 Oct 2018, 1 commit
  15. 09 Oct 2018, 3 commits
  16. 05 Oct 2018, 2 commits
  17. 03 Oct 2018, 2 commits
  18. 02 Oct 2018, 2 commits
  19. 01 Oct 2018, 5 commits
    • Dev pod desc (#1268) · 1c29eb42
      Li Xinqi committed
      * available instance num
      
      * import shape.proto
      
      * PodProto
      
      * rename message
      
      * union pod is useless
      
      * PodPtr
      
      * rename: PodPtr::get() => PodPtr::Get()
      
      * BlobDescProto.pod
      
      * mv register_desc.time_shape into another pr
      
      * pod_helper.h
      
      * FieldAlignedByteSize
      
      * pod_desc
      
      * PodDesc copy constructor
      
      * BlobDesc::body_shape_pod_desc_
      
      * add BlobDesc::opaque_header_pod_desc_
      
      * align_shift => alignment
      
      * default alignment
      
      * add field Blob::header_pod_ptr_
      
      * rename AlignedFieldPodProto => FieldPodProto
      
      * bugfix
      
      * check
      
      * FieldId
      
      * simplify RtBlobDesc
      
      * simplify Blob
      
      * ShapedPod => TensorPod
      
      * refine ComputePackedBlobDesc
      
      
      Former-commit-id: 8800da93
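The pod-desc series above models a blob's header as plain-old-data: named fields, each with a size and an alignment, from which packed offsets and the total byte size are computed (cf. `FieldAlignedByteSize` and `ComputePackedBlobDesc`). A rough standalone sketch of that bookkeeping, with invented struct and field names rather than the real oneflow proto definitions:

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Illustrative stand-in for FieldPodProto: a named field with a byte size
// and an alignment requirement.
struct FieldPod {
  std::string name;
  size_t byte_size;
  size_t alignment;
};

// Round a field's size up to its alignment, as FieldAlignedByteSize suggests.
size_t AlignedByteSize(const FieldPod& f) {
  return (f.byte_size + f.alignment - 1) / f.alignment * f.alignment;
}

// A struct-like pod's packed size: the sum of its aligned field sizes.
size_t PackedByteSize(const std::vector<FieldPod>& fields) {
  size_t total = 0;
  for (const FieldPod& f : fields) { total += AlignedByteSize(f); }
  return total;
}

int main() {
  std::vector<FieldPod> header = {
      {"data_id", 37, 8},  // 37 bytes, 8-byte aligned -> 40
      {"col_num", 4, 4},   // 4 bytes -> 4
  };
  std::cout << PackedByteSize(header) << "\n";  // 44
}
```
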
    • fix: add AsyncSednRegstMsgToConsumer() for send single produced regst, e.g. forward_model_regst (#1274) · 09761973
      Niu Chong committed
      
      * fix(normal_model_update_compute_actor): fix send forward_model_regst_ to consumer
      
      * fix: add AsyncSednRegstMsgToConsumer() for send single produced regst, e.g. forward_model_regst
      
      
      Former-commit-id: 139c2241
    • refine cudnn_limit_buf (#1271) · 8626f4c2
      Shiyuan Shang-Guan committed
      * refine cudnn_limit_buf
      
      * rename default_cudnn_buf_limit_mbyte -> cudnn_buf_limit_mbyte
      
      
      Former-commit-id: 7390c2f7
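The rename above turns the cuDNN workspace cap (`cudnn_buf_limit_mbyte`) into a plainly named config knob. A trivial hedged sketch of the megabyte-to-byte plumbing such a knob implies; the `JobConf` mirror here is invented, and the 4096 MB default anticipates the "enlarge the cudnn buf to 4GB" commit below:

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical config mirror; the real field lives in OneFlow's job conf.
struct JobConf { int64_t cudnn_buf_limit_mbyte = 4096; };

// Convert the configured megabyte cap into the byte budget handed to cuDNN
// workspace allocation.
int64_t CudnnBufLimitByte(const JobConf& conf) {
  return conf.cudnn_buf_limit_mbyte * 1024 * 1024;
}

int main() {
  JobConf conf;
  std::cout << CudnnBufLimitByte(conf) << " bytes\n";  // 4294967296
}
```
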
    • fix(normal_forward_compute_actor): fix SendMsgToForwardModelSaveActor() (#1270) · 99d64b78
      Niu Chong committed
      * fix(normal_forward_compute_actor): fix SendMsgToForwardModelSaveActor()
      
      * refine(normal_forward_compute_actor)
      
      
      Former-commit-id: d746016e
    • enlarge the cudnn buf to 4GB (#1269) · ce674856
      Jinhui Yuan committed
      
      
      Former-commit-id: 28f981eb
  20. 30 Sep 2018, 1 commit
    • Refactor Actor (#1259) · 9fda43bf
      Niu Chong committed
      * feat(register_slot): add the RegstSlot
      
      * feat(register_slot): update RegstSlot if
      
      * feat(actor): update member of Actor to use RegstSlot
      
      * fix(register_slot): fix the available_regst_desc_cnt init val
      
      * refine(register_slot): rename PushBack/PopFront, FindTheRegstDescId to TryPushBack/TryPopFront, HasRegstDescId
      
      * feat(regst_slot): rename ForEachCurRegstDeq/ForEachCurFrontRegst to ForEachRegstDeq/ForEachFrontRegst
      
      * feat(regst_slot): add ForChosenRegstDeq/ForChosenFrontRegst, add CHECK empty in ForEachFrontRegst
      
      * fix(register_slot): fix the CHECK empty
      
      * feat: remove actual_writeable_regst_desc_id_ from Actor, add Naive/CustomizedProducedRegst
      
      * fix(normal_model_update_actor): bug: not send customized regst to consumer when SendIntialModel
      
      * fix(normal_forward_compute_actor): bug: not add kLoss/kAccuracy produced regst to NaiveProducedRegst
      
      * fix(actor): UNIMPLEMENTED() for AsyncSendCustomizedProducedRegstMsgToConsumer
      
      * fix(normal_forward_compute_actor): set const_buf_regst to nullptr when recv from consumers
      
      * fix(actor): total_reading_data_regst_cnt, not total_reading_ctrl_regst_cnt
      
      * refactor: update GetNaiveConsumedRegstDescName to GetNaiveOrCustomizedConsumedRegstDescName(same for Produced)
      
      * feat: combine data_regst and ctrl_regst in Actor
      
      * fix: fix bugs
      
      * fix: fix bugs
      
      * fix: remove .swp files and unused LOG
      
      * feat: split Act and SendMsg (#1255)
      
      * feat: split Act and SendMsg
      
      * refine: rename HandleProduced/ConsumedDataRegst.. to HandleProduced/ConsumedNaiveDatRegst..
      
      * fix(input_wise_comp_actor): bug: not set piece id
      
      * fix(actor): potential bug: produced msg with no allowed actor still pop from queue
      
      * refactor: mv some protected member function to private
      
      * fix(actor): fix the condition about sending EORD msg
      
      * refactor(input_wise_actor): use RegstSlot in InputWiseActor
      
      * fix(copy_comm_net_actor): rename piece_id2regst_ctx to piece_id2regst_ctx_
      
      * refactor: rename Name2RegstDescId to Name2RegstDescIds
      
      * refactor(naive_actor): "override final" instead of only "final"
      
      * refine(actor): little refine
      
      * feat: update the return type of GetNaiveOrCustomizedNamesRegstDescName to enum class RegstNameType
      
      
      Former-commit-id: e042befc
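The refactor above centers on `RegstSlot`, whose Try-prefixed mutators report failure instead of CHECK-failing (`PushBack/PopFront` -> `TryPushBack/TryPopFront`, plus `HasRegstDescId` and an `available_regst_desc_cnt` counter). A condensed sketch of such a container, assuming a deque of register pointers per regst_desc_id; the real class tracks considerably more state:

```cpp
#include <cstdint>
#include <deque>
#include <iostream>
#include <unordered_map>

// Toy register stand-in; OneFlow's Regst carries much more state.
struct Regst { int64_t regst_desc_id; int64_t piece_id; };

class RegstSlot {
 public:
  // Returns false instead of crashing on an unknown desc id, matching the
  // PushBack -> TryPushBack rename above.
  bool TryPushBack(Regst* regst) {
    auto it = desc_id2regsts_.find(regst->regst_desc_id);
    if (it == desc_id2regsts_.end()) { return false; }
    if (it->second.empty()) { available_regst_desc_cnt_ += 1; }
    it->second.push_back(regst);
    return true;
  }
  bool TryPopFront(int64_t regst_desc_id) {
    auto it = desc_id2regsts_.find(regst_desc_id);
    if (it == desc_id2regsts_.end() || it->second.empty()) { return false; }
    it->second.pop_front();
    if (it->second.empty()) { available_regst_desc_cnt_ -= 1; }
    return true;
  }
  bool HasRegstDescId(int64_t id) const { return desc_id2regsts_.count(id) > 0; }
  void InsertRegstDescId(int64_t id) { desc_id2regsts_[id]; }
  size_t available_regst_desc_cnt() const { return available_regst_desc_cnt_; }

 private:
  std::unordered_map<int64_t, std::deque<Regst*>> desc_id2regsts_;
  size_t available_regst_desc_cnt_ = 0;
};

int main() {
  RegstSlot slot;
  slot.InsertRegstDescId(7);
  Regst r{7, 0};
  std::cout << slot.TryPushBack(&r) << " "
            << slot.available_regst_desc_cnt() << "\n";  // 1 1
}
```
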
  21. 26 Sep 2018, 2 commits
    • add impl of lars (#1163) · 388b945f
      Shiyuan Shang-Guan committed
      * add lars set
      
      * add lars
      
      * override ibn&obn to lbi
      
      * make model update consistent
      
      * check cuda stream sync
      
      * add LARSUpdateModelGpu
      
      * checkout naive & momentum model update
      
      * use cublas::dot compute SumOfSquare
      
      * update lars for master
      
      * refine lars for master
      
      
      Former-commit-id: 9518970b
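LARS (layer-wise adaptive rate scaling) scales each layer's step by the ratio of its weight norm to its gradient norm, which is why the commit computes `SumOfSquare` with `cublas::dot` on GPU. A scalar CPU sketch of the core update following You et al. 2017; the parameter names and the plain-SGD inner step (no momentum) are simplifications, not OneFlow's kernel:

```cpp
#include <cmath>
#include <iostream>
#include <vector>

// Sum of squares: the quantity the commit computes with cublas::dot on GPU.
double SumOfSquare(const std::vector<double>& v) {
  double s = 0;
  for (double x : v) { s += x * x; }
  return s;
}

// One LARS step for a single layer. trust_coef and weight_decay follow the
// LARS paper; the real model-update kernel adds momentum.
void LarsUpdate(std::vector<double>* w, const std::vector<double>& grad,
                double lr, double trust_coef, double weight_decay) {
  double w_norm = std::sqrt(SumOfSquare(*w));
  double g_norm = std::sqrt(SumOfSquare(grad));
  // Layer-local learning rate: trust * ||w|| / (||g|| + wd * ||w||).
  double local_lr =
      trust_coef * w_norm / (g_norm + weight_decay * w_norm + 1e-9);
  for (size_t i = 0; i < w->size(); ++i) {
    (*w)[i] -= lr * local_lr * (grad[i] + weight_decay * (*w)[i]);
  }
}

int main() {
  std::vector<double> w = {1.0, -2.0};
  std::vector<double> g = {0.1, 0.3};
  LarsUpdate(&w, g, 0.1, 0.001, 0.0005);
  std::cout << w[0] << " " << w[1] << "\n";
}
```
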
    • Hinge loss test (#1263) · 3343e9b5
      qq_22305325 committed
      * hinge_loss_kernel_test
      
      * fix opkernel_test
      
      * fix test file
      
      * optimize test file
      
      * optimize opkernel test
      
      * complete opkernel test interface
      
      
      Former-commit-id: 7faf75a6
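As background for the kernel test above: the one-vs-all hinge loss accumulates max(0, 1 - sign * prediction) over classes. A tiny sketch of the forward computation such an opkernel test would compare against; the layout and reduction choices here are assumptions, not the test's:

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

// One-vs-all hinge loss for a single sample: per-class prediction scores,
// label is the index of the true class. A common formulation; OneFlow's
// hinge_loss_kernel may differ in margin and reduction details.
double HingeLoss(const std::vector<double>& pred, int label) {
  double loss = 0;
  for (int j = 0; j < static_cast<int>(pred.size()); ++j) {
    double sign = (j == label) ? 1.0 : -1.0;
    loss += std::max(0.0, 1.0 - sign * pred[j]);
  }
  return loss;
}

int main() {
  std::cout << HingeLoss({2.0, -0.5, 0.3}, 0) << "\n";  // 0 + 0.5 + 1.3 = 1.8
}
```
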
  22. 25 Sep 2018, 2 commits
  23. 24 Sep 2018, 1 commit
    • Dev use nccl (#1198) · 9201b815
      Jinhui Yuan committed
      * add nccl dependency
      
      * add nccl comm handle
      
      * nccl allreduce works
      
      * NcclAllreduce -> NcclAllReduce
      
      * fix header guard
      
      * add NcclReduceScatter, NcclAllGather
      
      * complete ReduceScatter and AllGather, (with cuda error)
      
      * change variable name
      
      * reduce-scatter, all-gather works
      
      * add NcclScatter and NcclGather work type
      
      * Dev use nccl add nccl comm manager (#1206)
      
      * add parallel_set_id
      
      * add nccl_comm_manager
      
      * log nccl comm create
      
      * use NcclCommMgr
      
      * bugfix
      
      * OF_DISALLOW_COPY_AND_MOVE
      
      * remove nccl_scatter_handle and nccl_gather_handle from DeviceCtx
      
      * remove nccl handles from cuda_stream_handle
      
      * nccl_util and GetNcclDataType
      
      * fix rank_num
      
      * fix rank_id
      
      * CudaCheck->NcclCheck
      
      * only GPU
      
      * PoorCompTaskNode
      
      SoleIn, SoleOut, SoleOp, SoleIbn, SoleObn
      
      * PoorCompTaskNode
      
      * reformat
      
      * format change
      
      * Dev use nccl merge reduce share mem (#1216)
      
      * add parallel_set_id
      
      * add nccl_comm_manager
      
      * log nccl comm create
      
      * use NcclCommMgr
      
      * bugfix
      
      * OF_DISALLOW_COPY_AND_MOVE
      
      * remove nccl_scatter_handle and nccl_gather_handle from DeviceCtx
      
      * remove nccl handles from cuda_stream_handle
      
      * nccl_util and GetNcclDataType
      
      * fix rank_num
      
      * fix rank_id
      
      * CudaCheck->NcclCheck
      
      * only GPU
      
      * PoorCompTaskNode
      
      SoleIn, SoleOut, SoleOp, SoleIbn, SoleObn
      
      * PoorCompTaskNode
      
      * reformat
      
      * ReduceGather
      
      * GlobalAdd
      
      * ReduceScatter
      
      * EnableIfNeed
      
      * ConcatSplit
      
      * EnableMemSharing for pred if need
      
      * CtrlEdge for Gather
      
      * CtrlEdge for GlobalAdd
      
      * LocalAdd CtrlEdge
      
      * CollectReduceTaskNode
      
      * reverse nodes
      
      * local_add_mem_sharing
      
      * global add mem sharing
      
      * reduce_mem_sharing
      
      * bugfix
      
      * refine
      
      * format change (remove empty lines)
      
      * format change
      
      * fix local_add and gather issues
      
      * Dev refactor reduce add (#1218)
      
      * change ReduceGlobalAdd to ReduceAdd
      
      * rm ReduceLocalAdd
      
      * no mem sharing case works
      
      * let ReduceAddCompActor decide whether it is local or global
      
      * multi machine multi gpus Nccl and Oneflow allreduce works
      
      * refine
      
      * extract SortEdges
      
      * make EdgeInfo protected
      
      * Dev use nccl refine (#1220)
      
      * const qualifier
      
      * PoorCompTaskNode=>PipeCompTaskNode
      
      * int=>int32_t
      
      * refine ReduceMemSharingCtx
      
      * NcclDeviceCtx and NcclActor
      
      * empty line
      
      * CudaDeviceCtx<-NcclDeviceCtx
      
      * fix wrong rank_id in reduce_add_actor (#1229)
      
      * fix wrong rank_id in reduce_add_actor
      
      * rm device_num_of_each_machine from parallel_ctx
      
      * fix reduce gather control edge (#1235)
      
      * fix reduce gather control edge
      
      * extract FindNearestReduceAddCompTaskNode
      
      * extract method ReduceCompTaskNodeIf::FindPredRduceTaskNodeIf
      
      * CHECK nearest_add_copy_d2h
      
      * Dev use nccl cross machine nccl all reduce (#1246)
      
      * support ncclAllReduce cross machine
      
      * fix rank_id and rank_num for mix
      
      * reformat
      
      * reformat
      
      * simplify nccl_kernel (#1256)
      
      * simplify REGISTER_BLD_SUB_TSK_GPH_MTHD (#1260)
      
      * simplify REGISTER_BLD_SUB_TSK_GPH_MTHD
      
      * note
      
      * Dev use nccl reduce ranking ctx (#1252)
      
      * reformat
      
      * compute rank_id and rank_num with FixCompTaskNode
      
      * reformat
      
      * fix rank_id for reduceadd
      
      * ReduceRankingCtx
      
      * New Ranking and MemSharing for Reduce
      
      * DECLARE_REDUCE_LOGICAL_NODE
      
      * Ranking4NcclAllReduce
      
      * fix ranking
      
      * remove AsTaskNode
      
      * reformat
      
      * runtime rank ctx
      
      * rank_set
      
      * bugfix
      
      * bugfix
      
      * unittest
      
      * change use_nccl_all_reduce_cross_machine to use_nccl_inter_node_communication
      
      * refine
      
      * move BuildCtrlRegstBetweenReduceCopyNodes to ReduceAddCompTaskNode
      
      * CHECK mem_size_
      
      
      Former-commit-id: 55496813
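The PR above wires NCCL collectives (ncclAllReduce, ncclReduceScatter, ncclAllGather) into the task graph behind a cached `NcclCommMgr`. A minimal single-process, multi-GPU sketch of the grouped ncclAllReduce call, using only documented NCCL/CUDA APIs; OneFlow instead drives these from actors across machines, so everything beyond the library calls is illustrative:

```cpp
#include <cuda_runtime.h>
#include <nccl.h>
#include <cstdio>
#include <vector>

#define CUDA_CHECK(x) do { cudaError_t e = (x); if (e != cudaSuccess) { \
  printf("CUDA error: %s\n", cudaGetErrorString(e)); return 1; } } while (0)
#define NCCL_CHECK(x) do { ncclResult_t r = (x); if (r != ncclSuccess) { \
  printf("NCCL error: %s\n", ncclGetErrorString(r)); return 1; } } while (0)

int main() {
  int dev_cnt = 0;
  CUDA_CHECK(cudaGetDeviceCount(&dev_cnt));
  std::vector<ncclComm_t> comms(dev_cnt);
  std::vector<int> devs(dev_cnt);
  for (int i = 0; i < dev_cnt; ++i) { devs[i] = i; }
  // One communicator per local GPU; NcclCommMgr caches these per parallel set.
  NCCL_CHECK(ncclCommInitAll(comms.data(), dev_cnt, devs.data()));

  const size_t n = 1 << 20;  // arbitrary element count
  std::vector<float*> bufs(dev_cnt);
  std::vector<cudaStream_t> streams(dev_cnt);
  for (int i = 0; i < dev_cnt; ++i) {
    CUDA_CHECK(cudaSetDevice(i));
    CUDA_CHECK(cudaMalloc(&bufs[i], n * sizeof(float)));
    CUDA_CHECK(cudaStreamCreate(&streams[i]));
  }
  // Grouped in-place all-reduce (sum) across all local GPUs.
  NCCL_CHECK(ncclGroupStart());
  for (int i = 0; i < dev_cnt; ++i) {
    NCCL_CHECK(ncclAllReduce(bufs[i], bufs[i], n, ncclFloat, ncclSum,
                             comms[i], streams[i]));
  }
  NCCL_CHECK(ncclGroupEnd());
  for (int i = 0; i < dev_cnt; ++i) {
    CUDA_CHECK(cudaSetDevice(i));
    CUDA_CHECK(cudaStreamSynchronize(streams[i]));
    CUDA_CHECK(cudaFree(bufs[i]));
    NCCL_CHECK(ncclCommDestroy(comms[i]));
  }
  printf("allreduce done on %d GPUs\n", dev_cnt);
  return 0;
}
```
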
  24. 23 Sep 2018, 1 commit
  25. 19 Sep 2018, 2 commits