1. 14 Aug, 2023 1 commit
    • [AutoTuner] Add GBS search, gpu memory usage (#55466) · 4c0c458a
      Committed by Azure
      * temp commit
      
      * distribute best cfg
      
      * update metric extracting
      
      * fix bugs of prune and reading log
      
      * fix adding cfg bug
      
      * reset status
      
      * remove alarm and set logdir
      
      * deepcopy ctx
      
      * change alarm
      
      * fix restart bug
      
      * best no need alarm
      
      * add gbs search, add gpu memory to history csv, add memory detect
      
      * fix bug
      
      * fix memory read bug; fix etcd connection bug
      
      * fix memory read bug, add oom detection for all ranks
      
      * fix read log and oom detection, add error code for read log
      
      * add unit test
      
      * Update master.py
      
      ---------
      Co-authored-by: caozhou <caozhou@radi.ac.cn>
  2. 14 Jul, 2023 1 commit
    • [AutoTuner] Distribute best cfg (#54834) · 7f6d222f
      Committed by caozhou
      * distribute best cfg
      
      * adapt to multi args transmission
      
      * update metric extracting
      
      * fix bugs of prune and reading log
      
      * fix time default value
      
      * remove time record
      
      * adjust the order of searching dim
      
      * fix prune bugs
      
      * fix adding cfg bug
      
      * fix multi nodes bug
      
      * reset status
      
      * remove alarm and set logdir
      
      * deepcopy ctx
      
      * change alarm
      
      * fix restart bug
      
      * add exit
      
      * best no need alarm
      
      * add warmup time
  3. 20 Jun, 2023 1 commit
    • [AutoTuner] Add compare and record (#54668) · 6fe7b5e2
      Committed by Azure
      * add auto tuner
      
      * compare and record module
      
      * revert launch main
      
      * add prune rule
      
      * add unit test
      
      * add auto tuner
      
      * revert launch main
      
      * add prune rule
      
      * modify unit test script
      
      * fix bug for dump nodes; fix bug for checking log file
      
      * fix bug
      
      ---------
      Co-authored-by: caozhou <caozhou@radi.ac.cn>
  4. 14 Jun, 2023 1 commit