    Add MPICH Multinode Runner (#2839) · 8d53ac0c
    Committed by mzl
    * MPICH support
    
    * MPICH changes
    
    * MPICH changes
    
    * MPICH changes
    
    * MPICH changes
    
    * accelerator runtime modifications
    
    * Accelerator runtime changes
    
    * Accelerator runtime modifications
    
    * Remove redundant print from single node
    
    * Move hostfile to tmp
    
    * Code cleanup for MPICH class
    
    * Code cleanup, rm whitespace
    
    * Removing mpiexec environment check details
    
    * Tmp hostfile not needed, as it is passed directly
    
    * Remove debugging comments
    
    * rm print statement
    
    * Revert comm changes as WA not needed
    
    * Use MPICHRunner name for class
    
    * Use MPICHRunner as class name
    
    * No need to use args.force_multi and args.launcher.

    This should be set in the DeepSpeedExamples gpt-3.6b.sh script as:
    launcher=MPICH
    run_cmd=" deepspeed  --hostfile=${hostfile_ds}  --num_nodes ${NUM_WORKERS} --num_gpus ${NUM_GPUS_PER_WORKER} --launcher=${launcher} --force_multi pretrain_gpt2.py $@ ${gpt_options}"
    
    * Adhere to code pattern
    
    * Rm empty lines in MPICHRunner class
    
    * Uncomment check for num nodes and workers when hostfile_deepspeed is used in gpt-3.6b.sh
    
    * pass MPICH hostfile through launcher_args in gpt-3.6b.sh
    
    * Clean code and remove args hostfile
    
    * fix merge
    
    * fix merge
    
    ---------
    Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
    
    * clean up and fix format
    
    * add ut
    
    ---------
    Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
    Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
    Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
runner.py 22.1 KB