1. 12 July 2023 (1 commit)
    • Reduce Unit Test Times (Part 3) (#3850) · aef6c65c
      Authored by Michael Wyatt
      * add coverage report
      
      * define env vars in shared action
      
      * reduce time for longest running tests
      
      * fix broken shared action
      
      * reduce test time
      
      * reducing Pipeline test times
      
      * further reducing test times
      
      * rework Z3 test
      
      * testing new mp.pool and persistent dist envs
      
      * fix import
      
      * reuse distributed environment for tests with lots of param combos
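
      A rough sketch of the reuse idea with hypothetical names (the real
      logic lives in tests/unit/common.py): keep one worker pool alive per
      world size so the distributed environment is initialized once, not
      once per parameter combination.

          # Hypothetical sketch, not the actual tests/unit/common.py code.
          import multiprocessing as mp

          _pool_cache = {}  # world_size -> mp.Pool, reused across test cases

          def get_pool(world_size):
              # Workers in a cached pool can each set up their distributed
              # environment once, then serve every test with this world size.
              if world_size not in _pool_cache:
                  _pool_cache[world_size] = mp.Pool(processes=world_size)
              return _pool_cache[world_size]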
      
      * fix for dist teardown
      
      * fix pickling issue with pool cache
      
      * actually fix pickling problem
      
      * avoid running pool cache stuff on non-distributed tests
      
      * fix issues with nested mp.pool
      
      * fix for nested pools in Pipeline Engine
      
      * re-add params
      
      * update workflows with pytest opts
      
      * implement feedback
      
      * resolve race condition with port selection
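
      The standard fix for this kind of race, sketched with illustrative
      names: let the OS assign an unused port instead of picking one by hand.

          import socket

          def get_free_port() -> int:
              # Binding to port 0 makes the OS hand back a currently unused
              # port, which can then be used as MASTER_PORT for the test.
              with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                  s.bind(("127.0.0.1", 0))
                  return s.getsockname()[1]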
      
      * Update tests/unit/common.py
      
      ---------
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
  2. 07 July 2023 (1 commit)
  3. 06 July 2023 (2 commits)
  4. 05 July 2023 (1 commit)
  5. 03 July 2023 (1 commit)
  6. 30 June 2023 (2 commits)
  7. 29 June 2023 (1 commit)
  8. 27 June 2023 (1 commit)
  9. 24 June 2023 (2 commits)
  10. 13 June 2023 (1 commit)
  11. 09 June 2023 (1 commit)
    • zero3 performance optimizations (#3622) · 0977106a
      Authored by hablb
      * Remove dead code
      
      params_already_reduced is not used
      
      * Prevent evaluation of debug strings
      
      Debug strings are evaluated even when logging is disabled
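
      The usual pattern for this fix, shown as a minimal sketch (names are
      illustrative): defer message construction until the logger will
      actually emit the record.

          import logging

          logger = logging.getLogger(__name__)

          def log_param_state(params):
              # Lazy %-formatting: the message is rendered only if emitted.
              logger.debug("tracking %d params", len(params))
              # For genuinely expensive strings, guard the construction itself
              # so it is skipped entirely when debug logging is off.
              if logger.isEnabledFor(logging.DEBUG):
                  logger.debug("shapes: %s",
                               [tuple(p.shape) for p in params])  # assumes tensors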
      
      * Use contiguous gradients tensor reduce scatter between ranks
      
      Use allreduce instead of reduce scatter; lowers CPU overhead.
      
      * move overflow tracker to optimizer.step
      
      Don't check overflow in gradients for every bucket.
      Do the overflow check once on the flat gradient buffer just before optimizer step.
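
      Conceptually the change collapses to one fused scan over the flat
      buffer (a hedged sketch, not the actual optimizer code):

          import torch

          def grads_overflowed(flat_grad_buffer: torch.Tensor) -> bool:
              # Single inf/NaN check over the whole flat buffer just before
              # optimizer.step(), instead of one check per gradient bucket.
              return not torch.isfinite(flat_grad_buffer).all().item()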
      
      ---------
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
  12. 08 June 2023 (2 commits)
  13. 05 June 2023 (1 commit)
  14. 24 May 2023 (1 commit)
    • Fixing bf16 test (#3551) · 49d399cd
      Authored by Joe Mayer
      * Fixing bf16 test that was missing a config.
      
      * Changing train_batch_size to train_micro_batch_size_per_gpu

      * Changing all train_batch_size to train_micro_batch_size_per_gpu
  15. 17 May 2023 (1 commit)
  16. 16 May 2023 (2 commits)
    • [CPU] Support Intel CPU inference (#3041) · 1f72082f
      Authored by Ma, Guokai
      * add fallback path for kernels used in megatron
      
      * temporary numactl workaround for SPR 56-core
      
      * adapt core allocation according to number of ranks
      
      * add switch to turn on numactl
      
      * detect number of cores on the system
      
      * allow selecting a subset of the cores on the system to bind (see the sketch below)
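
      Taken together, the bullets above amount to roughly this core-allocation
      math (an illustrative sketch, not the DeepSpeed launcher code):

          import os

          def numactl_prefix(local_rank, num_local_ranks, cores=None):
              # Use every detected core, or a user-selected subset, split
              # evenly across local ranks; emit a numactl binding prefix.
              cores = cores if cores is not None else list(range(os.cpu_count()))
              per_rank = len(cores) // num_local_ranks
              mine = cores[local_rank * per_rank:(local_rank + 1) * per_rank]
              return ["numactl", "-C", ",".join(str(c) for c in mine)]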
      
      * remove unneeded changes
      
      * add ccl backend
      
      * change nccl to ccl
      
      * remove unused code
      
      * add comm/ccl to ops
      
      * initial ccl comm support
      
      * first broadcast case passed
      
      * add CCL_Backend to DeepSpeed
      
      * support comm timer for CPU
      
      * support barrier for comm backend
      
      * support specifying the master address from the deepspeed command line
      
      * support pytorch 2.0
      
      * remove 'block' from api
      
      * Tweak for debug
      Signed-off-by: Cao, Zhong Z <zhong.z.cao@intel.com>
      
      * Remove unnecessary directory
      Signed-off-by: Cao, Zhong Z <zhong.z.cao@intel.com>
      
      * Add bf16 kernel support for inference
      
      * Add temporary torch implement for cpu inference
      
      * Add softmax ops cpu fallback for inference
      
      * bind cores to numa domain as well
      
      * merge latest change in gma/numactl
      
      * initial bf16 kernel support with fallback path
      
      * initial fallback path for bloom kernel injection
      
      * fix softmax attn mask
      
      * check KMP_AFFINITY to avoid conflict with numactl
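
      The conflict check is conceptually simple; a hypothetical sketch:

          import os
          import warnings

          def core_binding_allowed() -> bool:
              # An explicit KMP_AFFINITY mask can contradict numactl binding,
              # so skip binding rather than fight over CPU affinity.
              if os.environ.get("KMP_AFFINITY"):
                  warnings.warn("KMP_AFFINITY is set; skipping numactl binding")
                  return False
              return True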
      
      * New CCLBackend which utilizes TorchBackend for initialization
      
      * roll back last change because it caused a result error
      
      * fix issue where the bloom injection policy for TP could not work.
      
      injection_policy={BloomBlock: ("self_attention.dense", "mlp.dense_4h_to_h")}
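
      For context, that policy plugs into DeepSpeed inference roughly as
      follows (a usage sketch: the loaded model and the tp_size value are
      assumptions, not taken from this PR):

          import deepspeed
          from transformers.models.bloom.modeling_bloom import BloomBlock

          # injection_policy names the submodules whose outputs must be
          # all-reduced when each block is split across tensor-parallel ranks.
          model = deepspeed.init_inference(
              model,  # assumed: a loaded Hugging Face Bloom model
              tensor_parallel={"tp_size": 2},
              injection_policy={BloomBlock: ("self_attention.dense",
                                             "mlp.dense_4h_to_h")},
          )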
      
      * Use TorchBackend to initialize CCLBackend to make behavior consistent
      
      * remove comm under deepspeed/ops
      
      * add license header
      
      * code clean up
      
      * fix format issue
      
      * remove magic number in main address
      
      * add caching support, but do not turn it on by default
      
      * change name of inference_cuda_module to inference_module
      
      * Check for is_synchronized_device in accelerator before getting Event
      
      * fix typo
      
      * Fix fallback path of softmax kernel on CUDA device for BF16 data type, because CUDA tril does not support BF16 datatype, enforce fp32 data type
      
      * add cpu backend files
      
      * change CPU_Accelerator op_builder_dir
      
      * remove cpu_kernel_path
      
      * using CPU_Accelerator on non-cuda device
      
      * fix deepspeed.op_builder => deepspeed.ops.op_builder
      
      * add alias for num_gpus: num_accelerators
      
      * allow loading cpu_builder in build stage
      
      * Assume cuda available if torch not installed
      
      * add oneccl_binding_pt to requirements
      
      * move oneccl-binding-pt to separate requirements-cpu.txt
      
      * add missing file
      
      * use dependency_links in setuptools.setup() call for additional dependency links
      
      * install oneccl_bind_pt in workflows
      
      * change oneccl_bind_pt's version from 1.13 to 2.0
      
      * use intel_extension_for_pytorch as an indicator that CPU_Accelerator should be used
      
      * Add indicator for Accelerator used
      
      * change foo.c to foo.cpp
      
      * exclude 'cpu' directory in CUDA op builder reflection
      
      * add a cpu-inference workflow
      
      * run cpu-inference workflow on self-hosted instance
      
      * change cpu runs-on node to v100 node
      
      * print out python version in workflow
      
      * add verbose flag to pip command to understand oneccl_bind_pt install issue
      
      * update cpu-inference workflow
      
      * add a stage to detect instance instruction sets
      
      * add back bf16 support for CPU inference
      
      * enable autoTP for bloom
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      
      * update workflow to detect cpu instruction sets
      
      * temporary workaround for Intel Extension for PyTorch AVX2 instruction set detection
      
      * change cpu-inference workflow machine to ubuntu-20.04
      
      * add sharded checkpoint loading for AutoTP path to reduce the peak memory in initialization stage
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      
      * enable policy for llama
      
      * use a special build ipex to test avx2 detection fix
      
      * fix format
      
      * fix test fail issue
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      
      * fix gptj sharded checkpoint loading problem
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      
      * return a not-implemented builder from get_op_builder in cpu_backend
      
      * support cpu device in tests
      
      * use cpuinfo to extract number of CPUs
      
      * use ~/tmp as transformer cache rather than /blob/
      
      * Add support for mpich launcher with prefer_deepspeed_comm
      
      * add missing modification in accelerator
      
      * enable IMPI launcher
      
      * remove unused file and fix formatting
      
      * clean up ccl.cpp
      
      * Less confusing error message when certain op builders are not implemented
      
      * Fix license header
      
      * Add license header
      
      * add license headers
      
      * add license header
      
      * fix cuda specific code in test
      
      * update CPU workflow
      
      * use numactl to bind to core
      
      * allow bind_cores_to_rank in multi-node impi runner
      
      * fix format error
      
      * Remove InferenceBuilder
      
      * fix format error in numa.py
      
      * check whether op is in installed ops in ds_report.py
      
      * allow overriding the accelerator with DS_ACCELERATOR='cuda', 'cpu' or 'xpu'
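
      In use, the override looks roughly like this (the printed value is an
      expectation, not a captured log):

          import os

          # Must be set before DeepSpeed inspects the system.
          os.environ["DS_ACCELERATOR"] = "cpu"  # or "cuda" / "xpu"

          from deepspeed.accelerator import get_accelerator
          print(get_accelerator().device_name())  # expected: cpu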
      
      * lazy init class_dict in CUDA_Accelerator to avoid cyclic initialization of CUDA_Accelerator
      
      * put the short path at the beginning of real_accelerator.py
      
      * device_count return number of NUMA nodes
      
      * fix typo
      
      * install numactl in cpu workflow
      
      * Follow comments
      
      * Better implementation of device_count() and current_device()
      
      * remove dependency_link for Intel Extension for DeepSpeed
      
      * check is_synchronized_device in timer only once
      
      * remove env mapping WA in cpu_accelerator
      
      * fix duplicate definition
      
      * fix format error
      
      * refine ccl backend selection
      
      * move comments to the right place
      
      * remove prefer_deepspeed_comm, use CCLBackend by default
      
      * refactor fallback path
      
      * Fix execution failure in kernel injection path
      
      * do not refactor the kernel injection fallback path in residual_add because it contains a function call with side effects
      
      * guard residual_add fallback path with environ DS_KI_FALLBACK=True
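
      A minimal sketch of that guard, assuming the variable is read as a
      string flag:

          import os

          def ki_fallback_enabled() -> bool:
              # Opt-in only: the residual_add fallback has side effects, so
              # it runs solely when the user sets DS_KI_FALLBACK=True.
              return os.environ.get("DS_KI_FALLBACK", "").lower() == "true"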
      
      * fix format error
      
      * add test for allreduce on CPU workflow
      
      * fix format error
      
      * Fall back to TorchBackend if CCLBackend kernels are not implemented
      
      * Update Intel Extension for PyTorch installation link
      
      * Don't specify version number of Intel Extension for PyTorch
      
      * install oneCCL for CCLBackend
      
      * fix link path for CPU comm kernels
      
      * fix source oneCCL environment
      
      * source oneCCL env before run UT
      
      * Give a more specific instruction when CCL_ROOT is not defined
      
      ---------
      Signed-off-by: Cao, Zhong Z <zhong.z.cao@intel.com>
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      Co-authored-by: sdp <sdp@aia-sdp-spr-108864.jf.intel.com>
      Co-authored-by: Cao, Zhong Z <zhong.z.cao@intel.com>
      Co-authored-by: Zhenhuan Chen <zhenhuan.chen@intel.com>
      Co-authored-by: baodii <di.bao@intel.com>
      Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
      Co-authored-by: jianan-gu <jianan.gu@intel.com>
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
      Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
    • fix typo in comments with deepspeed/ (#3537) · c8d3f5eb
      Authored by digger yu
      * fix spelling error with deepspeed/runtime/
      
      * fix typo docs/
      
      * fix typo in comments with deepspeed/
      
      ---------
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
      Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
  17. 15 May 2023 (1 commit)
  18. 11 May 2023 (1 commit)
  19. 10 May 2023 (1 commit)
  20. 04 May 2023 (1 commit)
  21. 30 April 2023 (1 commit)
  22. 29 April 2023 (1 commit)
  23. 26 April 2023 (1 commit)
  24. 21 April 2023 (1 commit)
  25. 18 April 2023 (1 commit)
  26. 14 April 2023 (1 commit)
  27. 13 April 2023 (2 commits)
  28. 06 April 2023 (3 commits)
  29. 31 March 2023 (1 commit)
  30. 27 March 2023 (1 commit)
  31. 24 March 2023 (2 commits)