1. 11 Feb, 2023 1 commit
  2. 09 Feb, 2023 2 commits
  3. 08 Feb, 2023 2 commits
    • Add container load checkpoint error reporting + refactor (#2792) · 10f3c301
      Lev Kurilenko committed
      This PR refactors the organization of meta tensor checkpoint loading as follows:
      
      - Move get_param_names() abstract method definition from TransformerPolicy into MetaTensorContainer
      - Model-specific get_param_names() definitions moved from policy into model-specific container
      - selected_policy_g, megatron_v2_g, and transformer_config_g globals replaced with a single container_g global, since the container will contain all of the information those globals previously captured
      - ckpt_load_enabled flag added to containers that's set to False by default in the base.py container and gets set to True when the MetaTensorContainer feature is inherited
      - Assertion added to replace_transformer_layer, before checkpoint loading is performed, to check that ckpt_load_enabled == True; if it is not, an error message reports that the container does not support meta tensor checkpoint loading.
      
      The aim of these changes is to more closely couple meta tensor checkpoint loading code to the MetaTensorContainer and to allow for better error reporting of load checkpoint use on model types that don't support this feature.
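The flag-and-assertion pattern described in this PR can be sketched roughly as follows. This is a minimal illustration, not the DeepSpeed implementation: `BaseContainer`, `SupportedContainer`, `UnsupportedContainer`, and `load_checkpoint` are hypothetical names, and only `MetaTensorContainer`, `get_param_names()`, and `ckpt_load_enabled` come from the PR description.

```python
from abc import ABC, abstractmethod


class BaseContainer:
    """Hypothetical stand-in for the base container (base.py in the PR)."""

    def __init__(self):
        # Checkpoint loading is disabled unless a feature mixin enables it.
        self.ckpt_load_enabled = False


class MetaTensorContainer(ABC):
    """Hypothetical stand-in for the meta tensor feature mixin."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Inheriting this feature turns meta tensor checkpoint loading on.
        self.ckpt_load_enabled = True

    @abstractmethod
    def get_param_names(self):
        """Return the parameter names the model stores in its checkpoint."""


class SupportedContainer(MetaTensorContainer, BaseContainer):
    """Hypothetical model container that inherits the feature."""

    def get_param_names(self):
        # Illustrative names only; a real container lists its own tensors.
        return ("attention.weight", "mlp.weight")


class UnsupportedContainer(BaseContainer):
    """Hypothetical model container without the feature."""


def load_checkpoint(container):
    # Assumed shape of the check added before checkpoint loading.
    assert container.ckpt_load_enabled, (
        f"{type(container).__name__} does not support "
        "meta tensor checkpoint loading"
    )
    return container.get_param_names()
```

In this sketch, `load_checkpoint(SupportedContainer())` returns the parameter names, while `load_checkpoint(UnsupportedContainer())` fails with an assertion message that names the container type, so the failure is reported before any loading starts.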
    • Enable page-locked tensors without CUDA (#2775) · c9b08888
      Olatunji Ruwase committed
      * Enable page-locked memory in cpu only env
      
      * Enable page-locked memory in cpu only env
      
      * Formatting
      
      * Add TODOs; Release page-locked memory
      
      * Update perf microbenchmark; Reduce unit test memory
      
      * Reduce CI mem usage
  4. 07 Feb, 2023 2 commits
  5. 05 Feb, 2023 1 commit
  6. 04 Feb, 2023 2 commits
  7. 03 Feb, 2023 2 commits
  8. 02 Feb, 2023 3 commits
  9. 01 Feb, 2023 4 commits
  10. 31 Jan, 2023 2 commits
  11. 29 Jan, 2023 1 commit
  12. 27 Jan, 2023 5 commits
  13. 26 Jan, 2023 2 commits
    • Abstract accelerator (step 3) (#2677) · 98cc35b6
      Ma, Guokai committed
      * Integrate accelerator abstraction interface into deepspeed/
      
      * Fix error message in fp16/fused_optimizer
      
      * fix error message in fp16/unfused_optimizer.py
      
      * assign get_accelerator().pin_memory() result to input Tensor name
      
      * no need to check cuda and whether nvtx supported
      
      * move try-except into inner most block
      
      * call Event() and Stream() in get_accelerator() for data type
      
      * Make Stream and Event as properties of abstract interface so they can be used as data type in deepspeed
      
      * Apply op_builder backend api change from #2705 from @jeffra
      
      * fix tests where Builder NAME is used
      
      * keep original ...Builder.NAME interface instead of ...Builder().NAME interface
      
      * fix builder closure for installation
      
      * fix randomltd builder
      
      * add comments to clarify create_op_builder and get_op_builder
      
      * fix compatibility with pip install -e
      Co-authored-by: Cheng Li <pistasable@gmail.com>
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
    • [GatheredParameters] fix memory leak (#2665) · ddd48b36
      Stas Bekman committed
      * [GatheredParameters] fix memory leak
      
      * simplify
      
      * cleanup and move
      
      * style
      
      * Formatting
      
      * fix test
      
      * fix test
      
      * fix test take 2
      
      * Trigger CI
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
      Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>
  14. 25 Jan, 2023 3 commits
  15. 20 Jan, 2023 1 commit
  16. 19 Jan, 2023 4 commits
  17. 18 Jan, 2023 3 commits