[AutoTuner] Add GBS search, gpu memory usage (#55466)
* temp commit
* distribute best cfg
* update metric extracting
* fix bugs of prune and reading log
* fix adding cfg bug
* reset status
* remove alarm and set logdir
* deepcopy ctx
* change alarm
* fix restart bug
* best no need alarm
* add gbs search, add gpu memory to history csv, add memory detect
* fix bug
* fix memory read bug; fix etcd connection bug
* fix memory read bug, add oom detection for all ranks
* fix read log and oom detection, add error code for read log
* add unit test
* Update master.py
---------
Co-authored-by: Ncaozhou <caozhou@radi.ac.cn>