- 07 May 2022 (1 commit)

Committed by Stas Bekman
* GatheredParameters: accept any iterable
* A torch tensor is itself an iterable, so `collections.abc.Iterable` can't be used to distinguish a single parameter from a list of parameters
* fix
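The pitfall behind this commit can be shown without torch: any class that defines `__iter__` registers as `collections.abc.Iterable`, so an isinstance check against `Iterable` cannot tell one tensor-like parameter from a list of them. A minimal sketch, using a hypothetical `FakeParam` stand-in for `torch.nn.Parameter`:

```python
from collections.abc import Iterable

# FakeParam is a hypothetical stand-in for torch.nn.Parameter: like a
# tensor, it defines __iter__, so isinstance(p, Iterable) is True.
class FakeParam:
    def __init__(self, data):
        self.data = data

    def __iter__(self):
        return iter(self.data)

def normalize_params(params):
    # Check the single-param case first; only then fall back to
    # treating the argument as an iterable of params.
    if isinstance(params, FakeParam):
        return [params]
    return list(params)
```

The key point is that the single-param test must come first; the `Iterable` test alone would happily iterate over a lone tensor's rows.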

- 06 May 2022 (1 commit)

Committed by Olatunji Ruwase
* Fix OOM and type mismatch
* Toggle prefetching
* Disable z3 prefetching for inference (temp workaround)
* Fix zero3 tracing issues
* Remove debug prints
* Enable prefetch for inference
* Code clarity
* Invalidate trace cache
* Trace cache invalidation when needed; separate nvme prefetch from all-gather prefetch
* Track last used step id
* Use debug name in error message
* Construct param trace from module trace

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
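The trace-cache idea running through several of these items can be sketched in plain Python (class and method names are hypothetical, not DeepSpeed's actual ones): record the module execution order on the first iteration, replay it to predict what to prefetch on later ones, and invalidate the cache the moment execution diverges from the recording:

```python
class TraceCache:
    """Hypothetical sketch of a module-execution trace cache."""

    def __init__(self):
        self.trace = []        # module ids in first-seen order
        self.complete = False  # True once a full iteration was recorded
        self._pos = 0          # replay cursor

    def step(self, module_id):
        """Record on the first pass; verify and predict on later passes."""
        if not self.complete:
            self.trace.append(module_id)
            return None
        if self._pos >= len(self.trace) or self.trace[self._pos] != module_id:
            # Execution diverged from the recorded trace: the cache is
            # stale, so drop it and start recording again.
            self.invalidate()
            self.trace = [module_id]
            return None
        self._pos += 1
        # Predict the next module so its params can be prefetched early.
        return self.trace[self._pos] if self._pos < len(self.trace) else None

    def end_iteration(self):
        self.complete = True
        self._pos = 0

    def invalidate(self):
        self.complete = False
        self._pos = 0
```

"Skip nvme prefetch when trace not complete" (in the commit above at 21 Jan) corresponds to gating prefetch on `complete` being True.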

- 27 Apr 2022 (1 commit)

Committed by Jeff Rasley
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>

- 26 Apr 2022 (1 commit)

Committed by Olatunji Ruwase
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 20 Apr 2022 (1 commit)

Committed by Stas Bekman
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 23 Jan 2022 (1 commit)

Committed by Alex Hedges

- 21 Jan 2022 (1 commit)

Committed by Justin Chiu
* Changes for bfloat16 Zero2
* ZeRO stage3 optimizations, with some bug fixes

  Optimizations for stage3:
  - prefetching improvements
  - batching allgather calls to amortize fixed overhead and improve bandwidth utilization
  - batching reduce_scatter calls to amortize fixed overhead and improve bandwidth utilization
  - using *_base variants of allgather and reduce scatter to reduce memory allocations and data movement
  - more fine-grained synchronization for communication that allows blocking on less work
  - precomputation of fetching code: using a fetch queue rather than deciding what to (pre)fetch at each iteration
  - limiting queued coalesced communication ops to reduce memory pressure on the pytorch cuda caching allocator (not an elegant solution)

  Optimizations for stage3-offload:
  - made some host-device tensor copies async to improve performance

  Bug fixes and QoL improvements:
  - fix init context method when parent modules modify child weights
  - speed up model initialization by moving the model to GPU before weight initialization
  - fixed unit test imports so that unit tests can be run from any directory
  - change performance logging to include memory consumption
  - add logging with model size when done partitioning the model

  New features:
  - bfloat16 support for ZeRO 3
* fix import in ut
* ran yapf
* improvements to cache flush warn log
* backwards compatibility with older versions of pytorch
* handle edge case where reduced tensor smaller than world size
* moved event synchronization to allgather handle wait() call
* removed unnecessary barrier call
* formatting fix after resolving merge conflict
* skip nvme prefetch when trace not complete
* opportunistically avoid memory allocation in allgather coalesced where possible
* fix indentation after merge
* fixes to account for parameter offload
* accounting for torch.cuda.memory_stats not being available
* moved partition_all_params to optimizer step
* allgathering on params before item gets called
* fix param status checks needed after moving partition_all_parameters call to optimizer step
* fix grad accumulation with optimizer offload
* grad norm computation fix for optimizer offload
* change post divide in reduce-scatter to pre divide
* fix gradient race condition w/ optimizer offload
* improve inf/nan gradient tracking
* don't prefetch when not in training mode
* format fix after merging
* fix prefetching issue when using NVME offload
* improved defragmentation for fp16 parameters
* relative imports for bf16 tests
* changes for bwd compatibility with pytorch 1.2
* remove buffered_reduce_fallback
* removed unused parameter offset bookkeeping
* fixed tracking for multiple param groups
* unbroke bfloat16 config after merge conflict
* using base allgather params when only 1 param
* cleanup/fixes for fp16 partition defragmentation
* switch to CRLF
* convert to same new-line style as master
* align new line with master
* Fix merge issues
* switch to CRLF
* fix to LF line endings
* minor merge fixes
* remove extra bfloat16_enabled definition
* asserting params inflight for AllGatherHandle
* remove get_cuda_mem_allocated_str
* Format fixes
* fix bfloat16 zero stage check (broken after merge commit)
* +self.communication_data_type, -self.allreduce_always_fp32; delete dead code
* Add self.reduce_scatter
* Format fix
* Fix merge issues
* iterate over params_to_fetch rather than make another iterator
* add some TODOs
* remove unnecessary division by micro_step_id
* rename config keys "bfloat16" -> "bf16"
* rename stage3_gather_fp16_weights_on_model_save -> stage3_gather_16bit_weights_on_model_save
* add unit test to check backwards compatibility for gather_16bit_weights
* added test to confirm bf16 key bwd compatibility
* Format fixes

Co-authored-by: Rana Ali Amjad <raamjad@amazon.com>
Co-authored-by: Justin Chiu <justchiu@amazon.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
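A back-of-envelope latency model motivates the batching items in this commit: in the usual alpha-beta cost model each collective pays a fixed launch/latency cost (alpha) plus a per-element bandwidth term (beta), so coalescing k small allgathers into one pays the fixed cost once instead of k times. A sketch with illustrative numbers only:

```python
def time_per_tensor(sizes, alpha, beta):
    # one collective per tensor: pays the fixed cost len(sizes) times
    return sum(alpha + s * beta for s in sizes)

def time_coalesced(sizes, alpha, beta):
    # one collective over the flattened buffer: a single fixed cost
    return alpha + sum(sizes) * beta
```

The same arithmetic applies to the batched reduce_scatter calls; the savings grow with the number of small tensors coalesced.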

- 02 Dec 2021 (1 commit)

Committed by Jeff Rasley

- 01 Dec 2021 (1 commit)

Committed by Alex Hedges
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 19 Nov 2021 (1 commit)

Committed by Jeff Rasley

- 13 Nov 2021 (1 commit)

Committed by Cheng Li
* [squash] Staging autotuning v4
  Co-authored-by: Cheng Li <pistasable@gmail.com>
  Co-authored-by: Minjia Zhang <minjiaz@microsoft.com>
  Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* add new extra, guard xgboost, cleanup dead files (#268)
* Fix autotuning docs (#1553)
* fix docs
* rewording the goal
* fix typos
* fix typos (#1556)
* fix typos
* fix format
* fix bug (#1557)
* fix bug

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Minjia Zhang <minjiaz@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 11 Nov 2021 (1 commit)

Committed by Olatunji Ruwase

- 31 Oct 2021 (1 commit)

Committed by Zhen Zhang
* remove norm(), avoid memcpy after allgather
  1) Remove the norm computation in debug printing.
  2) Change _all_gather to be a sync op in fetch_sub_module.
     Reason: the async version is not async at all, because each all_gather calls torch.cuda.synchronize() to guarantee the previous communication op has completed.
  3) Add new function _allgather_params_split_launch: the existing _allgather_params has an explicit memcpy after the all-gather op. We can avoid the explicit memory copy on the python side to improve performance.
  Known issue: `torch.distributed.all_gather` does an implicit memcpy at the end of each `ncclAllgather`.
* WIP: wrapped ncclAllgather as a customized op in DS; a micro benchmark shows the improvement of allgathering a transformer layer with 9834560 elements in half precision is about 1.1ms on an aws-p4d instance
* WIP: integrated into partition_parameters; performance improvement of 5.1B bert on aws-p4d: fwd 300ms -> 200ms, bwd 680ms -> 610ms
* Fix format
* cleaned dead code, modified unit test
* removed customized c++ extension; revert back to using the torch distributed API
* change torch.ones to torch.empty
* typo
* warn if not a cuda tensor for allgather
* fix formatting
* fix: move ds_tensor to cuda device (but it is strange that the ds_tensor hadn't been moved to cuda)
* remove try clause on the path for fetching params

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
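The memcpy this commit avoids can be illustrated without NCCL: a list-style gather produces one buffer per rank and then needs an explicit copy into a flat output, whereas gathering directly into a preallocated flat buffer (what the `*_base`-style collectives do at the torch level) skips that second copy. A pure-Python sketch of the two shapes:

```python
def gather_then_copy(shards):
    # list-style all_gather: one output buffer per rank...
    per_rank = [list(s) for s in shards]
    flat = []
    for buf in per_rank:
        flat.extend(buf)          # ...followed by an explicit copy
    return flat

def gather_into_flat(shards, out):
    # base-style all_gather: write each rank's shard directly at its
    # offset inside one preallocated flat buffer
    width = len(shards[0])
    for rank, shard in enumerate(shards):
        out[rank * width:(rank + 1) * width] = shard
    return out
```

Both return the same flat data; the difference is that the second fills a caller-owned buffer in place, with no intermediate per-rank allocations.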

- 22 Oct 2021 (1 commit)

Committed by Olatunji Ruwase
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 02 Oct 2021 (1 commit)

Committed by Alex Hedges
* Fix typos in docs/
* Fix typos in code comments and output strings
* Fix typos in the code itself
* Fix typos in tests/

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 01 Oct 2021 (1 commit)

Committed by Manuel R. Ciosici

- 30 Sep 2021 (1 commit)

Committed by Jeff Rasley
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Shaden Smith <shaden.smith@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

- 16 Sep 2021 (1 commit)

Committed by Stas Bekman
* [zero Init] fix regression
* clean up the warning

- 02 Sep 2021 (1 commit)

Committed by Olatunji Ruwase

- 13 Jul 2021 (1 commit)

Committed by Stas Bekman
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 12 Jul 2021 (1 commit)

Committed by Stas Bekman

- 10 Jul 2021 (1 commit)

Committed by Stas Bekman
* post_init to be run only by a child module
* better solution
* add test
* safer attr name
* wants half()
* improve doc

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 29 Jun 2021 (1 commit)

Committed by Stas Bekman
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 24 Jun 2021 (1 commit)

Committed by Stas Bekman
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 08 Jun 2021 (1 commit)

Committed by Stas Bekman
* fix missed subclassed partitioning bug
* fix on exit

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 14 May 2021 (1 commit)

Committed by Stas Bekman
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 01 May 2021 (1 commit)

Committed by Stas Bekman

- 30 Apr 2021 (1 commit)

Committed by Samyam Rajbhandari
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 29 Apr 2021 (1 commit)

Committed by Sean Naren
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 22 Apr 2021 (1 commit)

Committed by Cheng Li
* use a weird-shaped tensor to avoid silent failures when not registering external params
* fix typo

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 21 Apr 2021 (1 commit)

Committed by Sean Naren
* Add check to see if the json file is already loaded
* Update doc
* Address review
* Remove doc comment

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
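A plausible shape for the "already loaded" check (hypothetical helper; the actual DeepSpeed code path is not shown in this log): accept either a filesystem path to a JSON config or an already-parsed dict, so callers can pass both:

```python
import json

def load_config(config):
    # If the caller already parsed the JSON, use it as-is; otherwise
    # treat the argument as a path and load the file.
    if isinstance(config, dict):
        return config
    with open(config) as f:
        return json.load(f)
```

This kind of check keeps the API forgiving without double-parsing or re-reading the file.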

- 19 Apr 2021 (1 commit)

Committed by Jeff Rasley
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>

- 17 Apr 2021 (1 commit)

Committed by Olatunji Ruwase
* Fix UnboundLocalError
* Get full partition size
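The log does not show which variable was involved, but the general UnboundLocalError pattern such fixes address is a local assigned on only one branch and then read unconditionally (function and variable names below are purely illustrative):

```python
def partition_size_buggy(remaining, flag):
    if flag:
        size = remaining
    return size    # raises UnboundLocalError when flag is False

def partition_size_fixed(remaining, flag):
    size = 0       # initialize before the conditional
    if flag:
        size = remaining
    return size
```

Because `size` is assigned somewhere in the function body, Python treats it as local throughout, so reading it on the untaken branch fails at runtime rather than falling back to any outer name.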

- 14 Apr 2021 (1 commit)

Committed by Stas Bekman

- 08 Apr 2021 (3 commits)

Committed by Stas Bekman
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Committed by Stas Bekman
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Committed by Samyam Rajbhandari
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 02 Apr 2021 (1 commit)

Committed by Stas Bekman
* zero.Init() clarification: clarify that if `model.half()` can't fit into gpu memory, `zero.Init()` is a must; this proposal is via @samyam's clarification shared elsewhere. Thank you.
* style
* add clarity
* style

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
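The reason zero.Init() is a must in that case is simple arithmetic: materializing the full fp16 model takes 2 bytes per parameter on a single device, while constructing the model under zero.Init() leaves each rank holding only its 1/world_size shard. A sketch with illustrative helper functions (not DeepSpeed API):

```python
def full_fp16_bytes(n_params):
    # whole model materialized on one GPU, 2 bytes per fp16 param
    return 2 * n_params

def sharded_fp16_bytes(n_params, world_size):
    # under zero.Init each rank holds only its partition (rounded up)
    return 2 * (-(-n_params // world_size))
```

For a 10B-parameter model this is roughly 20 GB on one device versus about 2.5 GB per rank across 8 ranks, which is the difference between fitting and not fitting on a 16 GB GPU.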

- 16 Mar 2021 (1 commit)

Committed by Samyam Rajbhandari
* Fix misaligned grad: when a parameter is not divisible by world size, the partitioned gradients are misaligned due to incorrect padding handling; this PR fixes that
* Formatting fix
* Adding static_scale test back for Z3, and also changing hidden size to be not divisible by world_size
* also removing alignment from flat fp16 buffers
* Testing for hidden dim alignment
* inference hook fix
* Update stage3.py
* formatting
* [bug-fix] move params to gpu if offload params is turned off

Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
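The padding arithmetic behind the misalignment fix can be sketched as follows (hypothetical helper, not the actual stage3 code): pad the parameter's element count up to a multiple of world_size so every rank owns an equally sized, aligned shard, with the last rank's shard clamped where the padding begins:

```python
def shard_bounds(numel, world_size, rank):
    # round numel up to the next multiple of world_size
    padded = -(-numel // world_size) * world_size
    shard = padded // world_size        # equal shard size on every rank
    start = rank * shard
    end = min(start + shard, numel)     # clamp off the padding elements
    return start, end, shard
```

Mishandling the padded tail (e.g. computing offsets from `numel` instead of `padded`) is exactly the kind of bug that makes the partitioned gradients land at the wrong offsets.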

- 09 Mar 2021 (1 commit)

Committed by Samyam Rajbhandari
* Squash stage3 v1 (#146)
  Co-authored-by: Samyam <samyamr@microsoft.com>
  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
  Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
  Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
  Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
  Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
  Co-authored-by: eltonzheng <eltonz@microsoft.com>
* Fix correctness bug (#147)
* formatting fix (#150)
* stage3 bugfix (API) update and simplified FP16 Z3 tests (#151)
* fp16 Z3 API update and bugfix
* revert debug change
* ZeRO-3 detach and race condition bugfixes (#149)
* trying out ZeRO-3 race condition fix
* CUDA sync instead of stream
* reduction stream sync
* remove commented code
* Fix optimizer state_dict KeyError (#148)
  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* fix for smaller SGS sizes, ensures each grad is backed by unique tensors (#152)
* Simplifying the logic for getting averaged gradients (#153)
* skip for now
* Z3 Docs redux (#154)
* removing some TODOs and commented code (#155)
* New Z3 defaults (#156)
  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* formatting
* megatron external params

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>