- 16 Nov, 2022 1 commit
-
-
Committed by Michael Wyatt
* update zero config docs * add autogenerated docs for pydantic models used in ZeRO and Inference configs
-
- 15 Nov, 2022 2 commits
-
-
Committed by Ammar Ahmad Awan
Changes the inference API to accept a config dict and cleans up the Inference Engine to use the newly added inference config. Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
-
Committed by Jeff Rasley
-
- 14 Nov, 2022 1 commit
-
-
Committed by iLeGend
-
- 12 Nov, 2022 1 commit
-
-
Committed by lokoppakmsft
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
-
- 11 Nov, 2022 2 commits
-
-
Committed by Michael Wyatt
* fix for lm-eval nightly tests and add gpt-j to MPtest because of OOM on a single GPU * add nv-nightly badge
-
Committed by Olatunji Ruwase
-
- 10 Nov, 2022 5 commits
-
-
Committed by 郭叶军
-
Committed by Connor Holmes
Co-authored-by: cmikeh2 <connorholmes@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
-
Committed by Kevin Ko
* Add scale_attn_by_inverse_layer_idx feature * Fix layer_id bug * Fix scaling value Co-authored-by: Connor Holmes <connorholmes@microsoft.com> Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
-
Committed by Jeff Rasley
-
Committed by Jeff Rasley
-
- 09 Nov, 2022 1 commit
-
-
Committed by Michael Wyatt
* remove any cupy install when setting up environments * revert previous changes to run on cu111 runners * fix for when no cupy is installed * remove cupy uninstall for workflows not using latest torch version * update to cu116 for inference tests * fix pip uninstall line * move python environment list to after DS install * remove cupy uninstall * re-add --forked * fix how we get cupy version (should be based on nvcc version)
-
- 08 Nov, 2022 2 commits
-
-
Committed by Reza Yazdani
Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
-
Committed by kyoto7250
-
- 05 Nov, 2022 2 commits
-
-
Committed by savitamittal1
* Added MLflow environment variables for logging metrics within training script * exporting MLflow env variables from AML env Co-authored-by: Cheng Li <pistasable@gmail.com>
-
Committed by Joe Mayer
* Updating autotune default in docs. * Running pre-commit.
-
- 04 Nov, 2022 2 commits
-
-
Committed by 郭叶军
* don't gather partitioned activations for mp size 1 * add inline comment for the change Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
Committed by Ammar Ahmad Awan
-
- 03 Nov, 2022 2 commits
-
-
Committed by Reza Yazdani
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
-
Committed by Connor Holmes
Co-authored-by: Reza Yazdani <reyazda@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
- 02 Nov, 2022 1 commit
-
-
Committed by Michael Wyatt
* check only major CUDA version in CI * update expected torch latest version * pin torch latest to 1.12 until issues with 1.13 are resolved * wrong expected torch version * Update nv-torch18-v100.yml * remove forked from pytest option due to cuda re-initialization errors * removed expected torch version from inference tests, causing errors currently * fix various bugs that popped up * move all tests over to cu111 runners, cu113 runners having problems
-
- 28 Oct, 2022 2 commits
-
-
Committed by 郭叶军
-
Committed by Connor Holmes
* Initial reduction_utils.h implementation * Add initialization helper, ensures correct min/max behavior * Remove unnecessary warp sync
-
- 27 Oct, 2022 3 commits
-
-
Committed by Joe Mayer
-
Committed by Michael Wyatt
* use cuda event timers for model profiling
-
Committed by Cheng Li
* rollback ds config changes * fix format * Fix error when output_file is a relative path without a prefix (#2397) Co-authored-by: Benjamin Steenhoek <benjaminjsteenhoek@gmail.com> * fix results and exprs path to use absolute path * write out optimal config after tuning * fix format * assert tuning result dir creation Co-authored-by: Benjamin Steenhoek <benjaminjsteenhoek@gmail.com> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
-
- 26 Oct, 2022 2 commits
-
-
Committed by eltonzheng
* Fix build issues on Windows * small fix to compile with new version of Microsoft C++ Build Tools Co-authored-by: Reza Yazdani <reyazda@microsoft.com> Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
-
Committed by Cheng Li
* update pytorch pool operator function signature * fix the case where kwargs is None
-
- 25 Oct, 2022 1 commit
-
-
Committed by Joe Mayer
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
- 22 Oct, 2022 3 commits
-
-
Committed by Jeff Rasley
-
Committed by lekurile
Co-authored-by: Lev Kurilenko <lekurile@microsoft.com>
-
Committed by Adam Moody
* parallelize layer checkpoints across data parallel groups * use partition_uniform to determine start/end index values * formatting fix * config: add option for parallel write of layer checkpoints in pipeline stage * yapf fixes * enable parallel layer write according to config param * avoid extraneous makedir when rank 0 writes all layers Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
- 20 Oct, 2022 1 commit
-
-
Committed by Stas Bekman
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
- 19 Oct, 2022 2 commits
-
-
Committed by lekurile
This PR adds a TestInjectionPolicy inference unittest class for testing custom injection policies. This test differs from the existing tests in that the injection_policy dictionary is explicitly specified when calling the DeepSpeed init_inference API. The google/t5-v1_1-small text2text-generation model and the roberta-large fill-mask model are added as tests with the injection policy explicitly specified. This is done to expand our unittest coverage to test the path where the replace_wo_policy function is invoked (see GH-2387). Co-authored-by: Lev Kurilenko <lekurile@microsoft.com> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
-
Committed by Jeff Rasley
-
- 18 Oct, 2022 3 commits
-
-
Committed by Olatunji Ruwase
* Refactor universal checkpointing and tensor fragments * Formatting * Support zero stage1; Expand TP dim * Remove debug prints * Detect sharded optimizer state * Format fixes * Encode reshaping guide * More symbolic constants Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
-
Committed by Joe Mayer
* fixing bug 2361 * adding pytest for config initialization * changing expected output to FusedAdam * remove print statement * running yapf on modified files * running pre-commit formatting Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
Committed by Michael Wyatt
* fix for gpt-j failing due to tokenizer error * limit number of gpt-j tokens generated due to low memory
-
- 15 Oct, 2022 1 commit
-
-
Committed by Alexander Jipa
truncating expert param storage for checkpointing Co-authored-by: Alexander Jipa <azzhipa@amazon.com> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
-