- 30 Mar, 2023 1 commit
Committed by Mayank Mishra
* 💩 drop dead code
* ♻ replace has_all_gather_base with has_all_gather_into_tensor
* ♻ remove deprecated _all_gather_base
* ♻ remove deprecated _reduce_scatter_base
* 🎨 reformat files
* 🔧 fix _six
* Trigger CI
* Trigger CI
* Trigger CI
* 🎨 formatting
* incorporate suggestion
* incorporate suggestion
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
- 29 Mar, 2023 1 commit
Committed by Michael Wyatt
* disable CPUAdam pathways in optimizer copy/step
* Update stage_1_and_2.py
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
- 28 Mar, 2023 1 commit
Committed by Quentin Anthony
* Fix benchmark import issues and support MPI launching with pure torch.dist
* Formatting
* Update comms benchmark README
* Formatting
* Added better error handling and support MPI torch.dist backend
* Update formatting versions
* Formatting again
* Trigger CI
---------
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
- 27 Mar, 2023 1 commit
Committed by Jeff Rasley
-
- 24 Mar, 2023 6 commits
Committed by Logan Adams
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Committed by Ma, Guokai
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
Committed by Olatunji Ruwase
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Committed by Michael Wyatt
* bump torch18 -> torch19
* fix gptj
---------
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Committed by Satpal Singh Rathore
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Committed by FreyaRao
Co-authored-by: Qinghuan Rao <qinghuanrao@microsoft.com>
-
- 22 Mar, 2023 7 commits
Committed by Connor Holmes
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Committed by Molly Smith
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Committed by Mor Zusman
Co-authored-by: Mor Zusman <morz@ai21.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Committed by Jeff Rasley
-
Committed by Molly Smith
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Committed by Logan Adams
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Committed by Quentin Anthony
-
- 18 Mar, 2023 1 commit
Committed by Satpal Singh Rathore
-
- 16 Mar, 2023 1 commit
Committed by Jeff Rasley
-
- 15 Mar, 2023 6 commits
Committed by Quentin Anthony
* Improve overflow logs
* Trigger CI
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
Committed by Joe Mayer
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
Committed by Stas Bekman
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
Committed by Joe Mayer
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
Committed by Stas Bekman
-
Committed by Jeff Rasley
Co-authored-by: Zhewei Yao <zheweiyao@gmail.com>
-
- 14 Mar, 2023 1 commit
Committed by Masahiro Tanaka
* fix buffer size for pipeline parallel (#2800)
* improve explanation of buffer size for pipeline parallelism
  Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>
* fix format of comment
---------
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
- 13 Mar, 2023 1 commit
Committed by Adam Moody
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
- 11 Mar, 2023 1 commit
Committed by Lev Kurilenko
This PR fixes Meta Tensor checkpoint loading for OPT models whose state dict (SD) keys start with `model.`.
-
- 10 Mar, 2023 1 commit
Committed by Jeff Rasley
-
- 09 Mar, 2023 1 commit
Committed by Jeff Rasley
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
-
- 08 Mar, 2023 4 commits
Committed by Rahil Bathwal
Co-authored-by: Rajhans Samdani <rajhans@gmail.com>
-
Committed by Jeff Rasley
-
Committed by noabauma
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Committed by Ma, Guokai
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
- 07 Mar, 2023 3 commits
Committed by Olatunji Ruwase
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Committed by Lev Kurilenko
-
Committed by Molly Smith
* check kernel injection supported models
* Clarify why user should use kernel injection
-
- 02 Mar, 2023 2 commits
Committed by Molly Smith
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Committed by mzl
* MPICH support
* MPICH changes
* MPICH changes
* MPICH changes
* MPICH changes
* accelerator runtime modifications
* Accelerator runtime changes
* Accelerator runtime modifications
* Remove redundant print from single node
* Move hostfile to tmp
* Code cleanup for MPICH class
* Code cleanup, rm whitespace
* Removing mpiexec environment check details
* Not needed tmp hostfile as pass directly
* Remove debugging comments
* rm print statement
* Revert comm changes as WA not needed
* Use MPICHRunner name for class
* Use MPICHRunner as class name
* No need to use args.force_multi and args.launcher. This should be set in deepspeedexamples gpt-3.6b.sh script as:
  $launcher=MPICH
  run_cmd=" deepspeed --hostfile=${hostfile_ds} --num_nodes ${NUM_WORKERS} --num_gpus ${NUM_GPUS_PER_WORKER} --launcher=${launcher} --force_multi pretrain_gpt2.py $@ ${gpt_options}"
* Adhere to code pattern
* Rm empty lines in MPICHRunner class
* Uncomment check for num nodes and workers when used hostfile_deepspeed in gpt-3.6b.sh
* pass MPICH hostfile through launcher_args in gpt-3.6b.sh
* Clean code and remove args hostfile
* fix merge
* fix merge
---------
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
* clean up and fix format
* add ut
---------
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
- 01 Mar, 2023 1 commit
Committed by Sam Foreman
Updates `deepspeed/monitor/monitor.py` to instantiate objects with correct configs.
Relevant issue: https://github.com/microsoft/DeepSpeed/issues/2853
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-