- 12 Jul 2023, 1 commit
-
-
Submitted by Michael Wyatt

* add coverage report
* define env vars in shared action
* reduce time for longest running tests
* fix broken shared action
* reduce test time
* reducing Pipeline test times
* further reducing test times
* rework Z3 test
* testing new mp.pool and persistent dist envs
* fix import
* reuse distributed environment for tests with lots of param combos
* fix for dist teardown
* fix pickling issue with pool cache
* actually fix pickling problem
* avoid running pool cache stuff on non-distributed tests
* fix issues with nested mp.pool
* fix for nested pools in Pipeline Engine
* re-add params
* update workflows with pytest opts
* implement feedback
* resolve race condition with port selection
* Update tests/unit/common.py

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
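The last bullet fixes a race in how the tests pick rendezvous ports. A common pattern for this (a sketch under assumptions, not necessarily the fix that landed; `get_free_port` is a hypothetical helper name) is to let the OS assign an ephemeral port by binding to port 0:

```python
import socket

def get_free_port() -> int:
    """Ask the OS for an ephemeral port by binding to port 0.

    Note: closing the socket before the caller rebinds still leaves a
    small race window, so test harnesses typically also retry on bind
    failure rather than trusting the port to stay free.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(("127.0.0.1", 0))  # port 0 -> OS picks an unused port
        return s.getsockname()[1]
```

A distributed test would then pass the returned port as the master port for its process group.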
-
- 07 Jul 2023, 1 commit
-
-
Submitted by Ramya Ramineni

* Workaround to pass unit/ops/accelerators/test_accelerator_backward.py unit tests on ROCm
* Rearranged is_rocm_pytorch()
* Introduced is_rocm_pytorch() for ROCm
* Fixed formatting errors
* Function call
* formatting fix

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
-
- 06 Jul 2023, 2 commits
-
-
Submitted by Ammar Ahmad Awan

* extend the test and fix fp16 typo.
* guard reset params with z3 enabled check.

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
Submitted by Reza Yazdani

* Add FALCON auto-tp support
* added (skipped) unit test, refactored code to be more readable

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
-
- 05 Jul 2023, 1 commit
-
-
Submitted by Xingjian Shi

* fix lora fuse unfuse in hybrid_engine
* fix name
* fix typo
* remove empty lines
* Update gptj.py
* add lora test-case + fix gptneo implementation
* try to fix format
* try to accelerate testcase by reducing max length
* reduce test runtime
* Fix bloom / gpt-neox and add test for bloom
* fix CI + fix issue in engine

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
- 03 Jul 2023, 1 commit
-
-
Submitted by hablb

Grad tensors that don't fit in the bucket's flat buffer are not added to it, but are still added to params_in_ipg_bucket. If such tensors exist, use reduce-scatter over params_in_ipg_bucket instead of allreduce, since allreduce assumes all grads are in the ipg_bucket flat buffer. Add a test for reduce_scatter=False. Fix padding to zeros instead of undefined values.

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
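The switch from allreduce to reduce-scatter matters because the two collectives leave different data on each rank. A toy single-process model of the difference (hypothetical helpers for illustration, not DeepSpeed's communication code):

```python
from typing import List

def allreduce(rank_grads: List[List[float]]) -> List[List[float]]:
    """Every rank ends up holding the full elementwise sum."""
    total = [sum(vals) for vals in zip(*rank_grads)]
    return [total[:] for _ in rank_grads]

def reduce_scatter(rank_grads: List[List[float]]) -> List[List[float]]:
    """Each rank ends up holding only its own shard of the sum,
    which is all a ZeRO partitioned optimizer needs."""
    world = len(rank_grads)
    total = [sum(vals) for vals in zip(*rank_grads)]
    shard = len(total) // world
    return [total[r * shard:(r + 1) * shard] for r in range(world)]
```

With `grads = [[1, 2, 3, 4], [10, 20, 30, 40]]`, allreduce gives both ranks `[11, 22, 33, 44]`, while reduce-scatter gives rank 0 `[11, 22]` and rank 1 `[33, 44]` — which is why the flat-buffer layout must match the shard boundaries (and why padding must be zeros, so it cannot perturb the sums).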
-
- 30 Jun 2023, 2 commits
-
-
Submitted by Alexander Jipa

Co-authored-by: Alexander Jipa <azzhipa@amazon.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
Submitted by Michael Wyatt

* utilize shorter tests for MII
* use cached torch download
* rework zero++ unit tests
* formatting

Co-authored-by: HeyangQin <heyangqin@microsoft.com>
-
- 29 Jun 2023, 1 commit
-
-
Submitted by Michael Wyatt
-
- 27 Jun 2023, 1 commit
-
-
Submitted by Masahiro Tanaka

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
- 24 Jun 2023, 2 commits
-
-
Submitted by Heyang Qin

Co-authored-by: Sam Abe Jacobs <samjacobs@microsoft.com>
Co-authored-by: GuanhuaWang <alexwgh333@gmail.com>
Co-authored-by: cmikeh2 <connorholmes@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
-
Submitted by stephen youn

Co-authored-by: Stephen Youn <styoun@microsoft.com>
Co-authored-by: Arash Bakhtiari <arash@bakhtiari.org>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Ethan Doe <yidoe@microsoft.com>
Co-authored-by: yidoe <68296935+yidoe@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
- 13 Jun 2023, 1 commit
-
-
Submitted by Joe Mayer

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
- 09 Jun 2023, 1 commit
-
-
Submitted by hablb

* Remove dead code: params_already_reduced is not used
* Prevent evaluation of debug strings: debug strings are evaluated even when logging is disabled
* Use a contiguous gradients tensor for reduce-scatter between ranks; use allreduce instead of reduce-scatter for lower CPU overhead
* Move overflow tracking to optimizer.step: don't check overflow in gradients for every bucket; do the overflow check once on the flat gradient buffer just before the optimizer step

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
- 08 Jun 2023, 2 commits
-
-
Submitted by Michael Wyatt

* fix typo and missing epsilon value
* Touch file to re-build
* revert changes
* Touch file to re-build
* Format

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
-
Submitted by Reza Yazdani

* fix gpt-j inference issue for mlp_gemm_func call
* bring back the gpt-j inference-test
* fix formatting
* fix the neox and pythia injection issue
-
- 05 Jun 2023, 1 commit
-
-
Submitted by Zhen Zhang

* fix mics save checkpoint hanging
* MiCS load_checkpoint
* copyright
* fix for torch-1.9.0: the all_reduce_coalesced api does not support the nccl backend
* Naming alignment
* adding more test conditions for mics shard size
* test with different shard sizes
* adding assertion for better error msg

Co-authored-by: Zhen Zhang <zhzhn@amazon.com>
-
- 24 May 2023, 1 commit
-
-
Submitted by Joe Mayer

* Fixing bf16 test that was missing a config.
* Changing train_batch_size to train_micro_batch_size_per_gpu
* Changing all train_batch_size to train_micro_batch_size_per_gpu
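The rename works because DeepSpeed ties the three batch-size knobs together: train_batch_size equals the per-GPU micro-batch times gradient accumulation steps times the data-parallel world size. Specifying only train_micro_batch_size_per_gpu keeps a test config valid at any GPU count. A sketch of the invariant (helper name is illustrative):

```python
def effective_train_batch_size(micro_batch_per_gpu: int,
                               grad_accum_steps: int,
                               world_size: int) -> int:
    """DeepSpeed's batch-size invariant:
    train_batch_size == micro_batch * grad_accum * data-parallel world size.

    A config that pins train_batch_size breaks when the test runs on a
    different number of GPUs; pinning only the per-GPU micro-batch does not.
    """
    return micro_batch_per_gpu * grad_accum_steps * world_size
```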
-
- 17 May 2023, 1 commit
-
-
Submitted by Olatunji Ruwase

* Clone tensors to avoid torch.save bloat
* Add docs
* Fix clang-formatting
* Update docs/code-docs/source/model-checkpointing.rst
* Update deepspeed/checkpoint/utils.py
* Update deepspeed/checkpoint/utils.py
* Fix url
* url fix
* Tweak docs

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
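"Clone tensors to avoid torch.save bloat" refers to the fact that serializing a tensor view writes out its entire backing storage, not just the slice the view covers; cloning first copies only the needed elements. A stdlib analogue of the same effect (plain bytes standing in for tensor storage; not the actual checkpointing code):

```python
import pickle

big = bytes(1_000_000)  # stands in for a large tensor storage

# A "view": keeps a reference to the full buffer plus offset/length,
# so serializing it drags the whole storage along.
view = {"storage": big, "offset": 0, "length": 10}

# A "clone": copies just the bytes the view actually covers.
clone = {"storage": big[:10], "offset": 0, "length": 10}

view_size = len(pickle.dumps(view))    # ~1 MB
clone_size = len(pickle.dumps(clone))  # a few dozen bytes
```

The commit applies the same idea before torch.save, so checkpoints shrink to the data actually referenced.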
-
- 16 May 2023, 2 commits
-
-
Submitted by Ma, Guokai

* add fallback path for kernels used in megatron
* temporary numactl WA for SPR 56core
* adapt core allocation according to number of ranks
* add switch to turn on numactl
* detect number of cores on the system
* allow selecting a subset of the cores on the system to bind
* remove unneeded changes
* add ccl backend
* change nccl to ccl
* remove unused code
* add comm/ccl to ops
* initial ccl comm support
* first broadcast case passed
* add CCL_Backend to DeepSpeed
* support comm timer for CPU
* support barrier for comm backend
* support specifying master address from deepspeed command line
* support pytorch 2.0
* remove 'block' from api
* Tweak for debug
* Remove unnecessary directory
* Add bf16 kernel support for inference
* Add temporary torch implementation for cpu inference
* Add softmax ops cpu fallback for inference
* bind cores to numa domain as well
* merge latest change in gma/numactl
* initial bf16 kernel support with fallback path
* initial fallback path for bloom kernel injection
* fix softmax attn mask
* check KMP_AFFINITY to avoid conflict with numactl
* New CCLBackend which utilizes TorchBackend for initialization
* roll back last change because of a result error
* fix bloom injection policy TP not working issue: injection_policy={BloomBlock: ("self_attention.dense", "mlp.dense_4h_to_h")}
* Use TorchBackend to initialize CCLBackend, make behavior consistent
* remove comm under deepspeed/ops
* add license header
* code clean up
* fix format issue
* remove magic number in main address
* add caching support but not turned on by default
* change name of inference_cuda_module to inference_module
* Check for is_synchronized_device in accelerator before getting Event
* fix typo
* Fix fallback path of softmax kernel on CUDA device for BF16 data type: because CUDA tril does not support the BF16 datatype, enforce fp32 data type
* add cpu backend files
* change CPU_Accelerator op_builder_dir
* remove cpu_kernel_path
* using CPU_Accelerator on non-cuda device
* fix deepspeed.op_builder => deepspeed.ops.op_builder
* add alias for num_gpus: num_accelerators
* allow loading cpu_builder in build stage
* Assume cuda available if torch not installed
* add oneccl_binding_pt to requirements
* move oneccl-binding-pt to separate requirements-cpu.txt
* add missing file
* use dependency_links in setuptools.setup() call for additional dependency links
* install oneccl_bind_pt in workflows
* change oneccl_bind_pt's version from 1.13 to 2.0
* use intel_extension_for_pytorch as indicator that CPU_Accelerator should be used
* Add indicator for Accelerator used
* change foo.c to foo.cpp
* exclude 'cpu' directory in CUDA op builder reflection
* add a cpu-inference workflow
* run cpu-inference workflow on self-hosted instance
* change cpu runs-on node to v100 node
* print out python version in workflow
* add verbose in pip command to understand oneccl_bind_pt install issue
* update cpu-inference workflow
* add a stage to detect instance instruction sets
* add back bf16 support for CPU inference
* enable autoTP for bloom
* update workflow to detect cpu instruction sets
* temporary WA for Intel Extension for PyTorch AVX2 instruction set detection
* change cpu-inference workflow machine to ubuntu-20.04
* add sharded checkpoint loading for AutoTP path to reduce the peak memory in initialization stage
* enable policy for llama
* use a special build ipex to test avx2 detection fix
* fix format
* fix test fail issue
* fix gptj sharded checkpoint loading problem
* return a not implemented build in get_op_builder in cpu_backend
* support cpu device in tests
* use cpuinfo to extract number of CPUs
* use ~/tmp as transformer cache rather than /blob/
* Add support for mpich launcher with prefer_deepspeed_comm
* add missing modification in accelerator
* enable IMPI launcher
* remove unused file and fix formatting
* clean up ccl.cpp
* Less confusing error message when certain op builders are not implemented
* Fix license header
* Add license header
* add license headers
* add license header
* fix cuda specific code in test
* update CPU workflow
* use numactl to bind to core
* allow bind_cores_to_rank in multi-node impi runner
* fix format error
* Remove InferenceBuilder
* fix format error in numa.py
* check whether op is in installed ops in ds_report.py
* allow override accelerator with DS_ACCELERATOR='cuda','cpu' or 'xpu'
* lazy init class_dict in CUDA_Accelerator to avoid cyclic initialization of CUDA_Accelerator
* put short path in the beginning in real_accelerator.py
* device_count returns number of NUMA nodes
* fix typo
* install numactl in cpu workflow
* Follow comments
* Better implementation of device_count() and current_device()
* remove dependency_link for Intel Extension for DeepSpeed
* use check is_synchronized_device in timer only once
* remove env mapping WA in cpu_accelerator
* fix duplicate definition
* fix format error
* refine ccl backend selection
* move comments to the right place
* remove prefer_deepspeed_comm, use CCLBackend by default
* refactor fallback path
* Fix execution failure in kernel injection path
* do not refactor kernel injection fallback path in residual_add because it contains a function call with side effects
* guard residual_add fallback path with environ DS_KI_FALLBACK=True
* fix format error
* add test for allreduce on CPU workflow
* fix format error
* Fallback to TorchBackend if CCLBackend kernels are not implemented
* Update Intel Extension for PyTorch installation link
* Don't specify version number of Intel Extension for PyTorch
* install oneCCL for CCLBackend
* fix link path for CPU comm kernels
* fix source oneCCL environment
* source oneCCL env before running UT
* Give more specific instruction when CCL_ROOT is not defined

Signed-off-by: Cao, Zhong Z <zhong.z.cao@intel.com>
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: sdp <sdp@aia-sdp-spr-108864.jf.intel.com>
Co-authored-by: Cao, Zhong Z <zhong.z.cao@intel.com>
Co-authored-by: Zhenhuan Chen <zhenhuan.chen@intel.com>
Co-authored-by: baodii <di.bao@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: jianan-gu <jianan.gu@intel.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
-
Submitted by digger yu

* fix spelling errors in deepspeed/runtime/
* fix typos in docs/
* fix typos in comments in deepspeed/

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
-
- 15 May 2023, 1 commit
-
-
Submitted by Yizhou Wang

* try to fix broadcast error on multi-node training with ZeroStage3 and TensorParallel=2
* fix format error
* fix format issue
* add TODO for integrated testing of TP and ZeRO 1/2/3
* fix default pg error

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
- 11 May 2023, 1 commit
-
-
Submitted by Lev Kurilenko

Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
-
- 10 May 2023, 1 commit
-
-
Submitted by Wang, Yi

Fix a regression in shard checkpoint loading in the AutoTP path caused by the deletion of qkv_copy(), and add a UT case for shard checkpoint loading in AutoTP (#3457).

* add UT case for shard checkpoint loading in AutoTP
* autoTP path also supports shard loading

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
- 04 May 2023, 1 commit
-
-
Submitted by Connor Holmes

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
- 30 Apr 2023, 1 commit
-
-
Submitted by hablb

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
- 29 Apr 2023, 1 commit
-
-
Submitted by Jeff Rasley

* remove megatron-lm, no longer pip installable
* Add skips to tests that require megatron-lm and can't be run currently.
* formatting
* Formatting

Co-authored-by: Logan Adams <loadams@microsoft.com>
-
- 26 Apr 2023, 1 commit
-
-
Submitted by Alexander Jipa

Co-authored-by: Alexander Jipa <azzhipa@amazon.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
-
- 21 Apr 2023, 1 commit
-
-
Submitted by Olatunji Ruwase

* zero3 checkpoint frozen params
* Remove debug prints
* Move to cpu
* WIP
* WIP
* WIP
* Cleanup
* Cleanup
* Extend unit test for frozen params
* API fix
-
- 18 Apr 2023, 1 commit
-
-
Submitted by Heyang Qin

* Fixes for asymmetric quantization
* additional offset to further improve accuracy
* put the 0.5 into offset rather than applying it later
* update unit test for quantization
* fix format
* attempt to fix format

Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
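Folding the 0.5 rounding term into the zero-point offset means each element needs only one multiply-add before truncation, instead of a separate rounding step. A minimal asymmetric min/max quantizer in that style (a generic textbook sketch, not DeepSpeed's kernel; function names are illustrative):

```python
def quantize_asym(xs, num_bits=8):
    """Asymmetric min/max quantization; the +0.5 round-half-up term is
    folded into the zero-point offset instead of applied per element."""
    qmin, qmax = 0, (1 << num_bits) - 1
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # guard hi == lo
    # zero-point with the rounding 0.5 folded in:
    offset = qmin - lo / scale + 0.5
    q = [max(qmin, min(qmax, int(x / scale + offset))) for x in xs]
    return q, scale, lo

def dequantize_asym(q, scale, lo):
    """Invert the mapping: q * scale shifted back to the original range."""
    return [qi * scale + lo for qi in q]
```

With the offset formulated this way, truncation via `int()` behaves like round-half-up for the non-negative intermediate values, so reconstruction error stays within half a quantization step per element.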
-
- 14 Apr 2023, 1 commit
-
-
Submitted by Masahiro Tanaka

* support nesting zero.Init() and dynamically defined modules
* throw an error if a model class defined in zero.Init is not wrapped
* fix check on new classes that are not wrapped in zero.Init()
* add tests of nesting zero.Init() and dynamically defined classes
* fix tests for zero.Init
* fix style

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
- 13 Apr 2023, 2 commits
-
-
Submitted by Ma, Guokai

* add fallback path for kernels used in megatron
* temporary numactl WA for SPR 56core
* adapt core allocation according to number of ranks
* add switch to turn on numactl
* detect number of cores on the system
* allow selecting a subset of the cores on the system to bind
* remove unneeded changes
* use current_env to set OMP_NUM_THREADS in subprocess
* add test for ds_arguments
* change --bind_cores_to_rank option to store_true
* add test for parse_range_list
* add comment for parse range list
* add test for parse range list, rewrite parse_range_list
* fix format error
* fix format
* add -m parameter to numactl when necessary
* Check KMP_AFFINITY to avoid conflict with numactl
* fix format
* negative case for parse_range_list
* detect whether numactl is installed before using numactl to bind cores
* check numactl with the distro's package manager

Co-authored-by: sdp <sdp@aia-sdp-spr-108864.jf.intel.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
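The parse_range_list mentioned above presumably turns a numactl-style core specification such as "0-3,7,9-11" into a concrete core list for --bind_cores_to_rank. A plausible re-implementation (an illustration under that assumption, not DeepSpeed's exact code):

```python
def parse_range_list(spec: str):
    """Parse a numactl-style core list like "0-3,7,9-11" into sorted ints.

    Accepts comma-separated single cores and inclusive lo-hi ranges;
    rejects ranges whose start exceeds their end (the "negative case"
    the commit adds a test for).
    """
    cores = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-"))
            if lo > hi:
                raise ValueError(f"bad range {part!r}: start > end")
            cores.update(range(lo, hi + 1))
        else:
            cores.add(int(part))
    return sorted(cores)
```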
-
Submitted by Alexander van Eck

* feat: Add support for `NamedTuple` when sharding parameters [#3029]
* Formatting

Co-authored-by: Alexander van Eck <alexander.vaneck@paige.ai>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-
- 06 Apr 2023, 3 commits
-
-
Submitted by Logan Adams

This reverts commit 1ec34e54.
-
Submitted by Logan Adams
-
Submitted by Logan Adams

* Replace old torch version checks with existing function
* Clean up formatting
-
- 31 Mar 2023, 1 commit
-
-
Submitted by Michael Wyatt

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
- 27 Mar 2023, 1 commit
-
-
Submitted by Jeff Rasley
-
- 24 Mar 2023, 2 commits
-
-
Submitted by Logan Adams

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
-
Submitted by Ma, Guokai

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
-