- 14 Aug, 2023 1 commit
Committed by Azure
* temp commit
* distribute best cfg
* update metric extracting
* fix bugs of prune and reading log
* fix adding cfg bug
* reset status
* remove alarm and set logdir
* deepcopy ctx
* change alarm
* fix restart bug
* best no need alarm
* add gbs search, add gpu memory to history csv, add memory detect
* fix bug
* fix memory read bug; fix etcd connection bug
* fix memory read bug, add oom detection for all ranks
* fix read log and oom detection, add error code for read log
* add unit test
* Update master.py

---------

Co-authored-by: Ncaozhou <caozhou@radi.ac.cn>
-
- 14 Jul, 2023 1 commit
Committed by caozhou
* distribute best cfg
* adapt to multi args transmission
* update metric extracting
* fix bugs of prune and reading log
* fix time default value
* remove time record
* adjust the order of searching dim
* fix prune bugs
* fix adding cfg bug
* fix multi nodes bug
* reset status
* remove alarm and set logdir
* deepcopy ctx
* change alarm
* fix restart bug
* add exit
* best no need alarm
* add warmup time
-
- 20 Jun, 2023 1 commit
Committed by Azure
* add auto tuner
* compare and record module
* revert launch main
* add prune rule
* add unit test
* add auto tuner
* revert launch main
* add prune rule
* modify unit test script
* fix bug for dump nodes; fix bug for checking log file
* fix bug

---------

Co-authored-by: Ncaozhou <caozhou@radi.ac.cn>
-
- 14 Jun, 2023 1 commit
Committed by caozhou
* add auto tuner
* fix prune
* fix sharding prune and mbs candidates
* fix cfg
* fix launch
* fix launch
* add unittest
* fix code style
-