1. 23 Aug, 2023 7 commits
  2. 22 Aug, 2023 5 commits
  3. 21 Aug, 2023 7 commits
  4. 19 Aug, 2023 3 commits
  5. 17 Aug, 2023 4 commits
  6. 16 Aug, 2023 3 commits
  7. 15 Aug, 2023 3 commits
    • O
      Respect memory pinning config (#4131) · 9d79cfd1
      Olatunji Ruwase committed
      * Respect memory pinning config
      
      * Bug fix
      9d79cfd1
    • O
      Generalize frozen weights unit test (#4140) · 7a282db8
      Olatunji Ruwase committed
      * Fix unit test
      
      * Fix unit test
      7a282db8
    • C
      Handle PermissionError in os.chmod Call - Update engine.py (#4139) · 629b2039
      Chris M committed
      * Update engine.py
      
      This branch includes changes to handle potential exceptions that may occur when attempting to change file permissions using the os.chmod function within the DeepSpeed engine. The specific issue addressed is the PermissionError that may arise when working with certain filesystems or under restricted permissions.
      
      * Change to use logger
      
      * Split permissions out and add unit test
      
      * UnitTest(use DistTestClass) + trailing whitespace
      
      * update unit test
      
      * UT parametrize 1, 2 ,3
      
      * trim white space from unit test
      
      * change to PermissionError
      
      * run pre-commit formats
      
      * Catch FileNotFoundError & PermissionError
      629b2039
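The error handling described in #4139 can be illustrated with a minimal sketch. This is not the code from engine.py: the helper name `try_add_exec_permission` and the choice of the owner-execute bit are assumptions made for illustration. It only shows the general pattern of catching `FileNotFoundError` and `PermissionError` around `os.chmod` and reporting through a logger instead of raising.

```python
import logging
import os
import stat

logger = logging.getLogger(__name__)


def try_add_exec_permission(path: str) -> bool:
    """Hypothetical helper: add the owner-execute bit without raising.

    Returns True if the mode was updated, False if the update was skipped.
    """
    try:
        current_mode = os.stat(path).st_mode
        os.chmod(path, current_mode | stat.S_IEXEC)
        return True
    except (FileNotFoundError, PermissionError) as exc:
        # Some filesystems or restricted environments reject chmod;
        # log a warning and let the caller continue.
        logger.warning("Could not update permissions on %s: %s", path, exc)
        return False
```

Returning a boolean rather than re-raising lets the caller decide whether a failed permission change matters, which may suit cases where the file is still usable without the extra bit.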
  8. 11 Aug, 2023 1 commit
  9. 10 Aug, 2023 3 commits
    • L
      Update nightly workflows to open an issue if CI fails (#3952) · 0c75f4a3
      Logan Adams committed
      * Update H100 workflow to open an issue if nightly CI fails
      
      * Test running as not CI
      
      * Add all nightly/switch envvar name
      
      * Test with AMD
      
      * Add way to get url, switch path of template
      
      * Add additional checkout step
      
      * Move actions checkout step
      
      * Try absolute path with github workspace
      
      * Create issue without template/path
      
      * Re-enable and add debug logic
      
      * add if failed()
      
      * More debug
      
      * Try without checkout action uses
      
      * Rename file
      
      * Update variables
      
      * Update issue template
      
      * Confirm removing permissions still work
      
      * Revert "Confirm removing permissions still work"
      
      This reverts commit e7c2915a.
      
      * Re-enable permissions
      
      * Remove PR trigger for AMD MI200 tests
      
      * Revert "Remove PR trigger for AMD MI200 tests"
      
      This reverts commit 5c5c5fd6.
      
      * Test update_existing
      
      * Switch to composite action
      
      * Fix line ending encoding issue
      
      * Switch failure to be a variable
      
      * Test with second workflow
      
      * Format fix
      
      * Switch failure to always
      
      * Switch back to previously working way
      
      * Test permission changes
      
      * Revert "Test permission changes"
      
      This reverts commit e051da75.
      
      * Update existing bugs with newest build failure link
      
      * Remove PR triggers that were used for testing.
      0c75f4a3
    • L
      Add ops (#4119) · d300517f
      Logan Adams committed
      d300517f
    • J
      Fix Issue 4083 (#4084) · 8a8683d3
      Joe Mayer committed
      * removing bad check
      
      * adding offload check for bf16 optimizer
      
      * grad reduce for extra large param
      
      * check grad_accum exists before converting
      
      ---------
      Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
      8a8683d3
  10. 09 Aug, 2023 4 commits