- 07 Aug 2021, 1 commit

By Olatunji Ruwase

- 06 Aug 2021, 1 commit

By Denis Tarasov
Make the add operation in-place. Without it, the momentum buffer decays to zero and training has no effect on the corresponding parameters.
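A minimal PyTorch sketch of the failure mode (illustrative buffer names, not the actual optimizer code):

```python
import torch

beta = 0.9
grad = torch.ones(4)

# Buggy pattern: `add` is out of place, so its result is discarded; the
# stored buffer only ever sees the decay and shrinks toward zero each step.
momentum_buf = torch.zeros(4)
momentum_buf.mul_(beta)
momentum_buf.add(grad)  # returns a new tensor; momentum_buf is unchanged

# Fixed pattern: `add_` mutates the buffer the optimizer state actually holds.
momentum_buf = torch.zeros(4)
momentum_buf.mul_(beta).add_(grad)
```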
- 03 Aug 2021, 1 commit

By Jeff Rasley
* Fix empty-grad zero tests
* Don't clear grads in the stage 1 code path
* Prevent None grads from being reduced
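A hedged sketch of the last point, assuming an initialized process group and a simplified reduction loop (hypothetical helper, not the stage 1/2 code itself):

```python
import torch.distributed as dist

def reduce_grads(params):
    # Skip parameters whose .grad is None (e.g. frozen, or unused in this
    # step) so the collective never receives a missing gradient.
    for p in params:
        if p.grad is None:
            continue
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad.div_(dist.get_world_size())
```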
- 31 Jul 2021, 1 commit

By Jeff Rasley

- 30 Jul 2021, 1 commit

By Olatunji Ruwase
* Fix docstring
* Make screenshots clickable for easier viewing
* Put the navigation menu in alphabetical order; more clickable screenshots
* Rename the 1Cycle doc
* Tweak naming
* Remove a no-longer-used flag
* ZeRO-3 Offload release
* Single-GPU results
* Rearrange figures
* Single-GPU text
* Tweak intro
* ZeRO-3 Offload section
* Add asynchronous I/O docs
* Fix print_per_steps doc
* Document round_robin_gradients
* Tweak description
* Trigger CI

- 29 Jul 2021, 4 commits

By Adam Moody
* aio: test for libaio with various package managers
* aio: note the typical tool used to install the libaio package
* setup: abort with an error if a requested op cannot be built
* setup: define op_envvar to return an op's build environment variable
* setup: call is_compatible once for each op
* setup: only print the suggestion to disable an op when its envvar is not set
* setup: add a method to abort from a fatal error
* Revert "setup: add method to abort from fatal error" (reverts commit 0e4cde6b0a650591c3fafface7e27b4efd9aad4f)
* setup: add a method to abort from a fatal error

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
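A hedged sketch of the setup behavior these commits describe; the helper names follow the commit subjects, and the env-var naming scheme is an assumption:

```python
import os
import sys

def op_envvar(op_name):
    # Assumed naming scheme for the per-op build environment variable.
    return f"DS_BUILD_{op_name.upper()}"

def abort(msg):
    # Fail the build loudly instead of silently skipping a requested op.
    print(f"ERROR: {msg}", file=sys.stderr)
    sys.exit(1)

def check_op(op_name, is_compatible):
    requested = os.environ.get(op_envvar(op_name)) == "1"
    if requested and not is_compatible:
        abort(f"Cannot build requested op '{op_name}' on this system")
    if not is_compatible and op_envvar(op_name) not in os.environ:
        # Only suggest disabling the op when its envvar is not already set.
        print(f"Hint: set {op_envvar(op_name)}=0 to skip building '{op_name}'")
    return is_compatible
```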
By Olatunji Ruwase
* Make round-robin gradient partitioning configurable (default: False)
* Use the correct default
* Log the config setting
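For illustration, the knob would be enabled like this in a DeepSpeed config dict; the key name is taken from the commit text, so verify it against your version:

```python
ds_config = {
    "zero_optimization": {
        "stage": 2,
        # Off by default per this commit; spreads gradient partitions
        # across ranks in round-robin order when enabled.
        "round_robin_gradients": True,
    },
}
```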
By Ivan Komarov
Co-authored-by: Ivan Komarov <dfyz@yandex-team.ru>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>

By Olatunji Ruwase
- 27 Jul 2021, 2 commits

By Jeff Rasley
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

By Adam Moody
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 25 Jul 2021, 1 commit

By Adam Moody

- 21 Jul 2021, 1 commit

By Reza Yazdani
* Fix the inference API for FP32 and non-masking GPT-based models
* Use a dummy tensor if input_mask is None
* Fix input_mask
* Minor fix
* Send input_mask to the compute_attn function for checking
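A hedged sketch of the dummy-mask workaround; `compute_attn` here is a hypothetical stand-in for the real kernel call:

```python
import torch

def compute_attn(hidden_states, input_mask):
    # Stand-in for the real attention kernel, which expects a mask tensor.
    assert input_mask is not None
    return hidden_states

def attention_forward(hidden_states, input_mask=None):
    # If the caller passes no mask, substitute a harmless all-zero dummy so
    # the kernel's signature and mask checks are still satisfied.
    if input_mask is None:
        input_mask = torch.zeros(1, 1, 1, hidden_states.size(1),
                                 dtype=hidden_states.dtype,
                                 device=hidden_states.device)
    return compute_attn(hidden_states, input_mask)

out = attention_forward(torch.randn(2, 16, 64))  # works without a mask
```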
- 20 Jul 2021, 1 commit

By Stas Bekman
* zero_param_shapes: switch to round_robin_fp16_groups
* Add a test
* Workaround for old torch
- 16 Jul 2021, 2 commits

By Stas Bekman
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

By Adam Moody
* Enable the async I/O op on PowerPC architectures
* Drop any empty strings returned by cxx_args

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
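The second point amounts to filtering blank compiler flags, which some toolchains treat as a bad argument; a minimal sketch with an assumed flag list:

```python
def cxx_args():
    # Some toolchains/platforms yield blank entries; an empty string passed
    # to the compiler is treated as a (bad) argument, so drop them.
    raw_args = ["-O3", "", "-std=c++14", "-g", ""]
    return [arg for arg in raw_args if arg]

print(cxx_args())  # ['-O3', '-std=c++14', '-g']
```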
- 14 Jul 2021, 3 commits

By Stas Bekman
* Fix reference counting in backward over multiple forwards
* Test + cleanup
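A hedged sketch of the reference-counting idea (all names hypothetical): count each forward use of a parameter and only release it after the matching number of backward events, so a module run twice before backward is not freed too early.

```python
from collections import defaultdict

forward_counts = defaultdict(int)

def on_forward(param_id):
    # A module run twice before backward registers two pending uses.
    forward_counts[param_id] += 1

def on_backward(param_id, release_fn):
    forward_counts[param_id] -= 1
    if forward_counts[param_id] == 0:
        release_fn(param_id)  # safe: no backward passes still pending

# Two forwards, then two backwards: release fires only on the second one.
on_forward(0); on_forward(0)
on_backward(0, lambda pid: None)
on_backward(0, print)
```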
By Stas Bekman
* Release temporary memory when consolidating fp16 weights (take 2)
* Cleanup

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

By dependabot[bot]
Bumps [addressable](https://github.com/sporkmonger/addressable) from 2.7.0 to 2.8.0.
- [Release notes](https://github.com/sporkmonger/addressable/releases)
- [Changelog](https://github.com/sporkmonger/addressable/blob/main/CHANGELOG.md)
- [Commits](https://github.com/sporkmonger/addressable/compare/addressable-2.7.0...addressable-2.8.0)

updated-dependencies:
- dependency-name: addressable
  dependency-type: indirect

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- 13 Jul 2021, 7 commits

By Olatunji Ruwase
* Disable the copy stream
* Format fixes
* Remove debug code
* Remove debug code
* Fix indentation

By Jeff Rasley

By Jeff Rasley

By Stas Bekman
https://github.com/microsoft/DeepSpeed/pull/1220 fixed the leak but led to another problem. This reverts that part so that we can do the release; we will work on it after the release. @jeffra

By Stas Bekman
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

By Stas Bekman
* Add a live ZeRO checkpoint to fp32 consolidation version
* Some more docs
* ZeRO-2 model states use a different filename
* Fix
* Make debug mode configurable from the CLI
* Copy the script only on node 0, process 0
* Validate that we have the right number of files
* Revamp _get_zero_param_shapes; instrument for easier debugging
* Correct an assertion
* Rename the API; add an even simpler API
* Style
* Improve the docs
* Update the docs
* Revert the unpartitioned_params detection and report, as they are most likely persistent params

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
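A usage sketch of the consolidation helper this series of commits builds out; the import path and function name are as documented around this release (verify against your DeepSpeed version), and the checkpoint path is illustrative:

```python
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# Rebuild a full fp32 state dict on CPU from the sharded ZeRO checkpoint.
state_dict = get_fp32_state_dict_from_zero_checkpoint("checkpoints/run1")
# model.load_state_dict(state_dict)  # then load into an fp32 model
```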
By Adam Moody
* Enable the CPU Adam op on PowerPC
* Fix formatting

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
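A hedged sketch of how a build can branch on architecture so x86-only SIMD flags are not passed on POWER machines; the specific flag choices are assumptions:

```python
import platform

def cpu_arch_flags():
    machine = platform.machine()
    if machine.startswith("ppc64"):
        # PowerPC: no AVX; rely on generic tuning instead.
        return ["-mcpu=native"]
    # x86 path with SIMD enabled.
    return ["-march=native", "-mavx2"]
```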
- 12 Jul 2021, 2 commits

By senwang
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

By Stas Bekman

- 10 Jul 2021, 2 commits

By Stas Bekman
* post_init to be run only by a child module
* Better solution
* Add a test
* Safer attribute name
* Wants half()
* Improve the doc

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

By Stas Bekman
[zero3] params_to_reduce isn't always there

Trying to port HF's Electra model to DeepSpeed, I'm getting this on the very first backward step (with some extra debug):

```
Incrementing with parameter id 42
------ Before allocating allgather param name=generator_lm_head.weight id=41 shape=torch.Size([1]) status=ZeroParamStatus.NOT_AVAILABLE partition size=327680
------ allgather param with name=generator_lm_head.weight id=41 shape=torch.Size([1]) status=ZeroParamStatus.NOT_AVAILABLE partition size=327680
------ Before allocating allgather param name=generator_lm_head.bias id=42 shape=torch.Size([1]) status=ZeroParamStatus.NOT_AVAILABLE partition size=5120
------ allgather param with name=generator_lm_head.bias id=42 shape=torch.Size([1]) status=ZeroParamStatus.NOT_AVAILABLE partition size=5120
Backward name=generator_lm_head.weight id=41 shape=torch.Size([5120, 64])
Inside reduce ipg buckets. name=generator_lm_head.weight id=41 shape=torch.Size([5120, 64]), ipg elements 0, reduce bucket size 4096
Params in ipg bucket []
Reducing []
GOT 1 torch.Size([4096])
Traceback (most recent call last):
  File "examples/pytorch/language-modeling/run_mlm.py", line 533, in <module>
    main()
  File "examples/pytorch/language-modeling/run_mlm.py", line 484, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/mnt/nvme1/code/huggingface/transformers-ds-zero_to_fp32-tests/src/transformers/trainer.py", line 1269, in train
    tr_loss += self.training_step(model, inputs)
  File "/mnt/nvme1/code/huggingface/transformers-ds-zero_to_fp32-tests/src/transformers/trainer.py", line 1778, in training_step
    loss = self.deepspeed.backward(loss)
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/engine.py", line 1188, in backward
    self.optimizer.backward(loss)
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 2964, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/fp16/loss_scaler.py", line 53, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 1867, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 2212, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 1897, in reduce_independent_p_g_buckets_and_remove_grads
    self.reduce_ipg_grads()
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 2193, in reduce_ipg_grads
    self.average_tensor(reduction_list, params_to_reduce)
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 1972, in average_tensor
    params_to_reduce[0].reduce_gradients_at_owner(
```

Is it always the case that `params_to_reduce` is populated? If I add this check, the problem goes away.

* Real fix
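The guard described above amounts to checking for an empty bucket before indexing; a simplified sketch (signature and call shape trimmed from the traceback, not the exact fix):

```python
def average_tensor(reduction_list, params_to_reduce):
    # params_to_reduce can legitimately be empty on this code path, so
    # guard before touching params_to_reduce[0].
    if not params_to_reduce:
        return
    params_to_reduce[0].reduce_gradients_at_owner(params_to_reduce)
```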
- 08 Jul 2021, 1 commit

By Jeff Rasley

- 02 Jul 2021, 2 commits

By Samyam Rajbhandari
* contiguous_gradients should be set to True by default
* Set contiguous_gradients to True by default. Features such as reduce_scatter depend on contiguous gradients being True; this is also the preferred default configuration.
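In config terms (keys as named in the commit), omitting the flag now behaves as if it were set explicitly:

```python
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": True,  # now the default when omitted
        "reduce_scatter": True,        # relies on contiguous gradients
    },
}
```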
By Jeff Rasley

- 29 Jun 2021, 1 commit

By Stas Bekman
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 26 Jun 2021, 1 commit

By Stas Bekman
* Undo noise
* Another

- 24 Jun 2021, 5 commits

By Hyunwoong Ko
* Fix bugs with broadcasting non-contiguous tensors
* Fix typo

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
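A hedged sketch of the broadcasting fix: collectives assume a contiguous buffer, so a sliced or transposed tensor is staged through a contiguous copy (helper name hypothetical, process group assumed initialized):

```python
import torch
import torch.distributed as dist

def broadcast_tensor(tensor, src=0):
    if tensor.is_contiguous():
        dist.broadcast(tensor, src=src)
        return tensor
    # Stage through a contiguous buffer, then copy the result back in place.
    buf = tensor.contiguous()
    dist.broadcast(buf, src=src)
    tensor.copy_(buf)
    return tensor
```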
By Stas Bekman
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

By Stas Bekman
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

By Stas Bekman
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

By Jeff Rasley