Unverified commit be5ec506, authored by Michael Wyatt and committed by GitHub

Fix nightly CI tests (#2493)

* Fix the lm-eval nightly tests, and add gpt-j to the MP test because it OOMs on a single GPU

* Add the nv-nightly badge to the README
Parent ee39187d
@@ -45,6 +45,13 @@ jobs:
pip install .[dev,1bit,autotuning,inf]
ds_report
- name: Install lm-eval
run: |
pip uninstall --yes lm-eval
pip install git+https://github.com/EleutherAI/lm-evaluation-harness
# This is required until lm-eval makes a new release. v0.2.0 is
# broken for latest version of transformers
- name: Python environment
run: |
pip list
@@ -54,4 +61,4 @@ jobs:
unset TORCH_CUDA_ARCH_LIST # only jit compile for current arch
if [[ -d ./torch-extensions ]]; then rm -rf ./torch-extensions; fi
cd tests
TORCH_EXTENSIONS_DIR=./torch-extensions pytest --color=yes --durations=0 --forked --verbose -m 'nightly' unit/ --torch_ver="1.13" --cuda_ver="11.6"
TRANSFORMERS_CACHE=/blob/transformers_cache/ TORCH_EXTENSIONS_DIR=./torch-extensions pytest --color=yes --durations=0 --forked --verbose -m 'nightly' unit/ --torch_ver="1.13" --cuda_ver="11.6"
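For context (not part of the commit), the `TRANSFORMERS_CACHE` variable added to the pytest invocation above is a standard Hugging Face Transformers setting: it points model downloads at the shared blob mount so the large nightly models are not re-downloaded on every run. A minimal sketch of how it is picked up; the model name is illustrative:

```python
# Sketch only: how the TRANSFORMERS_CACHE setting above is honored by Hugging Face
# Transformers. The model chosen here is illustrative, not taken from the commit.
import os

# Must be set before transformers is imported for the default cache path to change.
os.environ.setdefault("TRANSFORMERS_CACHE", "/blob/transformers_cache/")

from transformers import AutoModelForCausalLM

# The same location can also be passed explicitly per call.
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-3b",
    cache_dir=os.environ["TRANSFORMERS_CACHE"],
)
```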
@@ -102,7 +102,7 @@ DeepSpeed has been integrated with several different popular open-source DL frameworks such as:
| Description | Status |
| ----------- | ------ |
| NVIDIA | [![nv-torch12-p40](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-torch12-p40.yml/badge.svg)](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-torch12-p40.yml) [![nv-torch18-v100](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-torch18-v100.yml/badge.svg)](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-torch18-v100.yml) [![nv-torch-latest-v100](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-torch-latest-v100.yml/badge.svg)](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-torch-latest-v100.yml) [![nv-inference](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-inference.yml/badge.svg)](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-inference.yml) |
| NVIDIA | [![nv-torch12-p40](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-torch12-p40.yml/badge.svg)](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-torch12-p40.yml) [![nv-torch18-v100](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-torch18-v100.yml/badge.svg)](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-torch18-v100.yml) [![nv-torch-latest-v100](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-torch-latest-v100.yml/badge.svg)](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-torch-latest-v100.yml) [![nv-inference](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-inference.yml/badge.svg)](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-inference.yml) [![nv-nightly](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-nightly.yml/badge.svg)](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-nightly.yml) |
| AMD | [![amd](https://github.com/microsoft/DeepSpeed/actions/workflows/amd.yml/badge.svg)](https://github.com/microsoft/DeepSpeed/actions/workflows/amd.yml) |
| PyTorch Nightly | [![nv-torch-nightly-v100](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-torch-nightly-v100.yml/badge.svg)](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-torch-nightly-v100.yml) |
| Integrations | [![nv-transformers-v100](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-transformers-v100.yml/badge.svg)](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-transformers-v100.yml) [![nv-lightning-v100](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-lightning-v100.yml/badge.svg)](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-lightning-v100.yml) [![nv-accelerate-v100](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-accelerate-v100.yml/badge.svg)](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-accelerate-v100.yml) |
@@ -293,10 +293,13 @@ class TestModelTask(DistributedTest):
("EleutherAI/gpt-neox-20b",
"text-generation"),
("bigscience/bloom-3b",
"text-generation"),
("EleutherAI/gpt-j-6B",
"text-generation")],
ids=["gpt-neo",
"gpt-neox",
"bloom"])
"bloom",
"gpt-j"])
class TestMPSize(DistributedTest):
world_size = 4
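(Not part of the diff.) Moving gpt-j to TestMPSize is plausible on back-of-the-envelope grounds: roughly 6B parameters in fp16 is about 12 GB of weights before activations and CUDA workspace, which can exceed a single GPU's memory, while sharding across the four ranks of the MP test leaves roughly 3 GB of weights per device. A hedged sketch of how such a model is typically sharded with DeepSpeed inference; the exact arguments may differ from what the test fixture actually does:

```python
# Sketch (not the actual test code): loading gpt-j-6B with tensor/model parallelism so
# its fp16 weights (~12 GB) are split across the 4 ranks used by TestMPSize above.
import torch
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B",
                                             torch_dtype=torch.half)

# mp_size mirrors TestMPSize.world_size; replace_with_kernel_inject enables DeepSpeed's
# fused inference kernels. Argument names are from the DeepSpeed releases of this era
# and may have changed in later versions.
engine = deepspeed.init_inference(model,
                                  mp_size=4,
                                  dtype=torch.half,
                                  replace_with_kernel_inject=True)

# engine wraps the sharded model; each rank now holds roughly 1/4 of the weights.
```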
@@ -433,7 +436,7 @@ class TestLMCorrectness(DistributedTest):
else:
lm = lm_eval.models.get_model(model_family).create_from_arg_string(
f"pretrained={model_name}",
{"device": f"cuda:{local_rank}"})
{"device": "cuda"})
torch.cuda.synchronize()
start = time.time()
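As a side note (not from the commit), the hunk above swaps an explicit per-rank device string for the generic "cuda" string; in PyTorch, once a process has pinned its GPU with torch.cuda.set_device, a plain "cuda" resolves to that current device. A minimal sketch of the construction pattern, with placeholder values:

```python
# Sketch of the lm-eval construction pattern from the hunk above. model_family,
# model_name, and local_rank are illustrative placeholders; the real test
# parametrizes them and gets the rank from the distributed launcher.
import os
import torch
import lm_eval.models

model_family, model_name = "gpt2", "EleutherAI/gpt-neo-2.7B"  # illustrative
local_rank = int(os.getenv("LOCAL_RANK", "0"))

# Once the process has pinned its GPU, the generic "cuda" device string used in the
# new code resolves to this device.
torch.cuda.set_device(local_rank)

lm = lm_eval.models.get_model(model_family).create_from_arg_string(
    f"pretrained={model_name}",
    {"device": "cuda"})
```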