- 15 Aug, 2023 3 commits
-
-
Committed by Olatunji Ruwase
* Respect memory pinning config * Bug fix
-
Committed by Olatunji Ruwase
* Fix unit test * Fix unit test
-
Committed by Chris M
* Update engine.py This branch includes changes to handle potential exceptions that may occur when attempting to change file permissions using the os.chmod function within the DeepSpeed engine. The specific issue addressed is the PermissionError that may arise when working with certain filesystems or under restricted permissions. * Change to use logger * Split permissions out and add unit test * UnitTest(use DistTestClass) + trailing whitespace * update unit test * UT parametrize 1, 2 ,3 * trim white space from unit test * change to PermissionError * run pre-commit formats * Catch FileNotFoundError & PermissionError
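The handling this commit message describes can be sketched roughly as follows. This is a minimal illustration, not the actual DeepSpeed code: `try_chmod` is a hypothetical helper name, and the logger setup is an assumption.

```python
import logging
import os

logger = logging.getLogger(__name__)


def try_chmod(path, mode):
    """Hypothetical helper: attempt os.chmod and report failure instead of raising.

    Returns True if the permissions were changed, False if the file was
    missing or the filesystem/user did not allow the change.
    """
    try:
        os.chmod(path, mode)
        return True
    except (FileNotFoundError, PermissionError) as exc:
        # Log a warning rather than aborting the surrounding operation.
        logger.warning("Could not change permissions of %s: %s", path, exc)
        return False
```

A caller can then treat a failed permission change as non-fatal, for example by continuing a save operation when `try_chmod(path, 0o644)` returns False.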
-
- 11 Aug, 2023 1 commit
-
-
Committed by Logan Adams
* Fix torch19 tests * test pip list and --no-build-isolation * Enable verbosity * pin to older accelerate version * Update oldest tested torch to 1.10 * Properly rename directories * Return PR tests to CI again. * Remove -vv
-
- 10 Aug, 2023 3 commits
-
-
Committed by Logan Adams
* Update H100 workflow to open an issue if nightly CI fails * Test running as not CI * Add all nightly/switch envvar name * Test with AMD * Add way to get url, switch path of template * Add additional checkout step * Move actions checkout step * Try absolute path with github workspace * Create issue without template/path * Re-enable and add debug logic * add if failed() * More debug * Try without checkout action uses * Rename file * Update variables * Update issue template * Confirm removing permissions still work * Revert "Confirm removing permissions still work" This reverts commit e7c2915a. * Re-enable permissions * Remove PR trigger for AMD MI200 tests * Revert "Remove PR trigger for AMD MI200 tests" This reverts commit 5c5c5fd6. * Test update_existing * Switch to composite action * Fix line ending encoding issue * Switch failure to be a variable * Test with second workflow * Format fix * Switch failure to always * Switch back to previously working way * Test permission changes * Revert "Test permission changes" This reverts commit e051da75. * Update existing bugs with newest build failure link * Remove PR triggers for that were used for testing.
-
Committed by Logan Adams
-
Committed by Joe Mayer
* removing bad check * adding offload check for bf16 optimizer * grad reduce for extra large param * check grad_accum exists before converting --------- Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
-
- 09 Aug, 2023 6 commits
-
-
Committed by leiwen83
On machines with limited CPU RAM, loading a checkpoint at startup may cause an OOM, because all ranks on the same node load the optimizer state at the same time. For this scenario, checkpoint loading can be done in a pipelined way. Signed-off-by: Lei Wen <wenlei03@qiyi.com> Co-authored-by: Lei Wen <wenlei03@qiyi.com>
-
Committed by Conglong Li
* add deepspeed chat arxiv report * add zeroquant v2 and fp * add selective enhancement * add ignore for 'Youn' in spell checker --------- Co-authored-by: yaozhewei <zheweiy@berkeley.edu> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
-
Committed by Connor Holmes
* Pass correct node size * formatting --------- Co-authored-by: Connor Holmes <development@cmikeh2.me> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
-
Committed by Olatunji Ruwase
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
-
* base_dir may not present all time and results in incorrect path * Update replace_module.py * Update config.py --------- Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
-
Committed by Michael Wyatt
-
- 08 Aug, 2023 2 commits
-
-
Committed by Earlee
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
-
Committed by Earlee
-
- 05 Aug, 2023 2 commits
-
-
Committed by mzl
* update ut/doc for glm/codegen * formatting/spacing on docs * re-order/alphabetize the models --------- Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Logan Adams <loadams@microsoft.com>
-
Committed by digger yu
-
- 04 Aug, 2023 2 commits
-
-
Committed by marcobellagente93
* update partition_uniform util function * formatting --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
Committed by Lev Kurilenko
* Initial commit * Clean up * Fix formatting
-
- 03 Aug, 2023 1 commit
-
-
Committed by Michael Wyatt
-
- 01 Aug, 2023 3 commits
-
-
Committed by Molly Smith
* Refactor autoTP inference for HE * Formatting * Move redundant functions to autotp * Remove self from loading class * formatting * Some gpt2 autotp path fixes * precommit
-
Committed by Hugh Pu
-
Committed by Xie Zejian
* add reproducible compilation environment * fix ci * fix typo for formatting check * Fix casing for format --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Michael Wyatt <mrwyattii@gmail.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Logan Adams <loadams@microsoft.com>
-
- 29 Jul, 2023 1 commit
-
-
Committed by Zhen Zhang
Co-authored-by: Zhen Zhang <zhzhn@amazon.com>
-
- 28 Jul, 2023 8 commits
-
-
Committed by Ma, Guokai
* Fix deadlock when allreduce spin too fast * Change state to enum to increase readability --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
Committed by Olatunji Ruwase
* Option to override module apply * Removing early partitioning in override * Unit tests * Cleanup * Adapt unit test to succeed * Handle missed params * Add accelerate * Code cleanup * Add doc * Add doc * Add doc
-
Committed by Ma, Guokai
-
Committed by mzl
* autoTP for fused qkv weight * fix format * clean up * clean up * clean up * update * make logic flow to util and move to file * fix formatting * remove empty line --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Committed by Wang, Yi
* enable autoTP for MPT Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * add model specific func to auto_tp_model_utils.py Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Committed by Wang, Yi
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
-
Committed by Ma, Guokai
-
Committed by digger yu
-
- 27 Jul, 2023 2 commits
-
-
Committed by Minjia Zhang
* Engine side fix for loading llama checkpoint fine-tuned with zero3 * Fixes to support llama fine-tuning in ds-chat * Refactored the code to avoid using an except block. * formatting * revert permissions change --------- Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
-
Committed by Alexander Jipa
Co-authored-by: Alexander Jipa <azzhipa@amazon.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
- 26 Jul, 2023 5 commits
-
-
Committed by Hugh Pu
-
Committed by Joe Mayer
* Moving losses tracking above monitor statement * Fixing loss calculations. * Restoring deleted comment. * formatting --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
-
Committed by Puneesh Khanna
-
Committed by Xuehai Pan
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
-
Committed by mzl
* remove duplicate check for pp and zero stage * remove line * use ZeroStageEnum --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
- 25 Jul, 2023 1 commit
-
-
Committed by Ma, Guokai
* fused adam can build * use cpu adam to implement fused adam * enable zero stage 1 and 2 for synchronized accelerator (a.k.a. CPU) * remove unused parameters * fix format error * Remove adam class * fix format * support stage3 * reuse simd.h * fix format * make memory_stat return meaningful dict * fix format * add cpu_adam * reuse cpu_adam * header cleanup * fix cpu_adam * fix format, add missing file --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-