- 14 Aug, 2023 1 commit
Committed by Azure
* temp commit
* distribute best cfg
* update metric extracting
* fix bugs of prune and reading log
* fix adding cfg bug
* reset status
* remove alarm and set logdir
* deepcopy ctx
* change alarm
* fix restart bug
* best no need alarm
* add gbs search, add gpu memory to history csv, add memory detect
* fix bug
* fix memory read bug; fix etcd connection bug
* fix memory read bug, add oom detection for all ranks
* fix read log and oom detection, add error code for read log
* add unit test
* Update master.py

---------

Co-authored-by: caozhou <caozhou@radi.ac.cn>
-
- 14 Jun, 2023 1 commit
Committed by caozhou
* add auto tuner
* fix prune
* fix sharding prune and mbs candidates
* fix cfg
* fix launch
* fix launch
* add unittest
* fix code style
-