- 14 Nov 2021, 1 commit

Committed by Olatunji Ruwase
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 13 Nov 2021, 1 commit

Committed by Olatunji Ruwase

- 31 Oct 2021, 1 commit

Committed by Zhen Zhang
* remove norm(), avoid memcpy after allgather
  1) Remove the norm computation in debug printing.
  2) Change _all_gather to be a sync op in fetch_sub_module. Reason: the async version is not async at all, because each all_gather calls torch.cuda.synchronize() to guarantee the previous communication op has completed.
  3) Add a new function, _allgather_params_split_launch: the existing _allgather_params does an explicit memcpy after the all-gather op. We can avoid the explicit memory copy on the Python side to improve performance. Known issue: `torch.distributed.all_gather` does an implicit memcpy at the end of each `ncclAllgather`.
* WIP: wrapped ncclAllgather as a customized op in DS; a micro benchmark shows the improvement for all-gathering a transformer layer with 9834560 elements in half precision is about 1.1ms on an aws-p4d instance.
* WIP: integrated into partition_parameters. Performance improvement of 5.1B BERT on aws-p4d: fwd: 300ms -> 200ms, bwd: 680ms -> 610ms
* Fix format
* cleaned dead code, modified unit test
* removed customized c++ extension; revert back to using the torch distributed API
* change torch.ones to torch.empty
* typo
* warn if not a cuda tensor for allgather
* fix formatting
* fix: move ds_tensor to cuda device (it is strange that ds_tensor hadn't been moved to cuda)
* remove try clause on the path for fetching params

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
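
For context, a minimal sketch of the memcpy being avoided, assuming a 1-D partition and a flat output buffer; the function names are illustrative, not DeepSpeed's actual `_allgather_params_split_launch`:

```
import torch
import torch.distributed as dist

def allgather_with_copy(partition, world_size):
    # Baseline: gather into per-rank tensors, then explicitly copy each
    # shard into one flat buffer -- the Python-side memcpy removed here.
    shards = [torch.empty_like(partition) for _ in range(world_size)]
    dist.all_gather(shards, partition)
    flat = torch.empty(partition.numel() * world_size,
                       dtype=partition.dtype, device=partition.device)
    for i, shard in enumerate(shards):
        flat.narrow(0, i * partition.numel(), partition.numel()).copy_(shard)
    return flat

def allgather_into_views(partition, world_size):
    # Improved: pre-allocate the flat buffer and gather directly into
    # views of it, so no extra copy is needed afterwards.
    flat = torch.empty(partition.numel() * world_size,
                       dtype=partition.dtype, device=partition.device)
    views = list(torch.chunk(flat, world_size))  # views share flat's storage
    dist.all_gather(views, partition)
    return flat
```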

- 22 Oct 2021, 1 commit

Committed by Olatunji Ruwase
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 02 Oct 2021, 1 commit

Committed by Alex Hedges
* Fix typos in docs/
* Fix typos in code comments and output strings
* Fix typos in the code itself
* Fix typos in tests/

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 30 Sep 2021, 1 commit

Committed by Jeff Rasley
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Shaden Smith <shaden.smith@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

- 14 Jul 2021, 1 commit

Committed by Stas Bekman
* fix reference counting in backward over multiple forwards
* test + cleanup

- 13 Jul 2021, 1 commit

Committed by Stas Bekman
* add live zero checkpoint to fp32 consolidation version
* some more docs
* zero2 model states use a different filename
* fix
* make debug mode cli-configurable
* copy the script only on node 0, process 0
* validate that we have the right number of files
* revamp _get_zero_param_shapes, instrument for easier debugging
* correct assertion
* rename API; add an even simpler API
* style
* improve docs
* update the docs
* revert the unpartitioned_params detection and report, as they are most likely persistent params

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
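
A hedged usage sketch of the consolidation API this PR reworks (the exact module path and function name may differ across DeepSpeed versions):

```
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# Merge the per-rank ZeRO shards under `checkpoint_dir` into a single
# fp32 state dict that plain PyTorch can consume.
state_dict = get_fp32_state_dict_from_zero_checkpoint("checkpoint_dir")
model.load_state_dict(state_dict)  # `model` is your torch.nn.Module
```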

- 12 Jul 2021, 1 commit

Committed by Stas Bekman

- 10 Jul 2021, 1 commit

Committed by Stas Bekman
* [zero3] params_to_reduce isn't always there

Trying to port HF's Electra model to DeepSpeed, I'm getting this on the very first backward step (with some extra debug):

```
Incrementing with parameter id 42
------ Before allocating allgather param name=generator_lm_head.weight id=41 shape=torch.Size([1]) status=ZeroParamStatus.NOT_AVAILABLE partition size=327680
------allgather param with name=generator_lm_head.weight id=41 shape=torch.Size([1]) status=ZeroParamStatus.NOT_AVAILABLE partition size=327680
------ Before allocating allgather param name=generator_lm_head.bias id=42 shape=torch.Size([1]) status=ZeroParamStatus.NOT_AVAILABLE partition size=5120
------allgather param with name=generator_lm_head.bias id=42 shape=torch.Size([1]) status=ZeroParamStatus.NOT_AVAILABLE partition size=5120
Backward name=generator_lm_head.weight id=41 shape=torch.Size([5120, 64])
Inside reduce ipg buckets. name=generator_lm_head.weight id=41 shape=torch.Size([5120, 64]), ipg elements 0, reduce bucket size 4096
Params in ipg bucket []
Reducing []
GOT 1 torch.Size([4096])
Traceback (most recent call last):
  File "examples/pytorch/language-modeling/run_mlm.py", line 533, in <module>
    main()
  File "examples/pytorch/language-modeling/run_mlm.py", line 484, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/mnt/nvme1/code/huggingface/transformers-ds-zero_to_fp32-tests/src/transformers/trainer.py", line 1269, in train
    tr_loss += self.training_step(model, inputs)
  File "/mnt/nvme1/code/huggingface/transformers-ds-zero_to_fp32-tests/src/transformers/trainer.py", line 1778, in training_step
    loss = self.deepspeed.backward(loss)
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/engine.py", line 1188, in backward
    self.optimizer.backward(loss)
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 2964, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/fp16/loss_scaler.py", line 53, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 1867, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 2212, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 1897, in reduce_independent_p_g_buckets_and_remove_grads
    self.reduce_ipg_grads()
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 2193, in reduce_ipg_grads
    self.average_tensor(reduction_list, params_to_reduce)
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 1972, in average_tensor
    params_to_reduce[0].reduce_gradients_at_owner(
```

Is it always the case that `params_to_reduce` is populated? If I add this check, the problem seems to go away.
* real fix
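
A minimal sketch of the shape of the fix, assuming the guard simply skips empty buckets (the real patch in stage3.py may differ):

```
def average_tensor(self, reduction_list, params_to_reduce):
    # The ipg bucket can legitimately be empty on this path,
    # so guard before indexing params_to_reduce[0].
    if not params_to_reduce:
        return
    params_to_reduce[0].reduce_gradients_at_owner(reduction_list)
```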

- 29 Jun 2021, 1 commit

Committed by Stas Bekman
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 26 Jun 2021, 1 commit

Committed by Stas Bekman
* undo noise
* another

- 24 Jun 2021, 2 commits

Committed by Stas Bekman
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Committed by Stas Bekman
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 17 Jun 2021, 1 commit

Committed by Samyam Rajbhandari
* largest_partitioned_params calculation fix: the largest partitioned params value was being calculated incorrectly
* Update stage3.py
* Update stage3.py
* formatting fix
* changing sub-group size default to 1e9

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 21 May 2021, 1 commit

Committed by Olatunji Ruwase
* Align fp16 param swap buffers
* Integrate swap buffer manager for fp16 params
* Support swapping misaligned fp16 parameters
* Support swap into unaligned fp16 buffer
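
A rough illustration of the alignment problem, under the assumption that swap I/O requires block-multiple buffer sizes (the constant and names are invented for the sketch):

```
import torch

ALIGN_ELEMS = 512  # assumed I/O alignment, in fp16 elements

def aligned_numel(numel, align=ALIGN_ELEMS):
    # Round up to the nearest multiple of `align`.
    return ((numel + align - 1) // align) * align

def stage_for_swap(param_fp16):
    # A misaligned fp16 tensor is swapped through a padded staging buffer
    # whose size satisfies the alignment requirement.
    padded = torch.zeros(aligned_numel(param_fp16.numel()),
                         dtype=torch.float16)
    padded[:param_fp16.numel()].copy_(param_fp16.view(-1))
    return padded
```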

- 19 May 2021, 1 commit

Committed by Olatunji Ruwase
* Align fp16 param swap buffers
* Integrate swap buffer manager for fp16 params
* Support swapping misaligned fp16 parameters

- 14 May 2021, 1 commit

Committed by Olatunji Ruwase

- 01 May 2021, 1 commit

Committed by Sean Naren
* Add additional conditions when checking types of output from the model
* Add test
* Modify test to use torch.tensor as well

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
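
A hedged illustration of the kind of output-type check being extended; the actual condition lives in DeepSpeed's engine, and these names are assumptions:

```
import torch

def collect_output_tensors(output):
    # Accept a bare tensor as well as list/tuple/dict containers.
    if isinstance(output, torch.Tensor):
        return [output]
    if isinstance(output, (list, tuple)):
        return [o for o in output if isinstance(o, torch.Tensor)]
    if isinstance(output, dict):
        return [v for v in output.values() if isinstance(v, torch.Tensor)]
    return []
```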

- 30 Apr 2021, 2 commits

Committed by Olatunji Ruwase
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Committed by Samyam Rajbhandari
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 23 Apr 2021, 1 commit

Committed by William Buchwalter
* Fix issue where gradient_predivide_factor was called as a function: `gradient_predivide_factor` is a `float`, so it must not be called; the call crashed whenever the `reduce_scatter` flag was set to `False`.
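
The shape of the bug and fix, sketched with invented surrounding code (the real call site is in DeepSpeed's reduce path):

```
import torch

gradient_predivide_factor = 2.0  # a float, not a callable
grad = torch.randn(8)

# Buggy form -- raises TypeError: 'float' object is not callable:
#   grad = gradient_predivide_factor(grad)

# Fixed form -- divide by the float before the reduce:
grad = grad / gradient_predivide_factor
```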

- 21 Apr 2021, 1 commit

Committed by Stas Bekman

- 19 Apr 2021, 1 commit

Committed by Jeff Rasley
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>

- 15 Apr 2021, 1 commit

Committed by Stas Bekman
* faster flatten/unflatten with apex
* switch to cpp flatten/unflatten
* style
* better comment
* missing import
* switch to building ops at run time
* fixes

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
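
A hedged sketch of the run-time-built op pattern this PR lands; the `UtilsBuilder` import is an assumption about DeepSpeed's op builder, while the `torch._utils` helpers are the pure-Python fallback:

```
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

try:
    from deepspeed.ops.op_builder import UtilsBuilder
    util_ops = UtilsBuilder().load()  # JIT-build the C++ extension at run time
    flatten, unflatten = util_ops.flatten, util_ops.unflatten
except Exception:
    # Fall back to the slower pure-Python implementations.
    flatten, unflatten = _flatten_dense_tensors, _unflatten_dense_tensors

# tensors = [p.data for p in params]
# flat = flatten(tensors)            # one contiguous buffer
# views = unflatten(flat, tensors)   # views shaped like the originals
```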

- 08 Apr 2021, 2 commits

Committed by Samyam Rajbhandari
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Committed by Stas Bekman
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 17 Mar 2021, 1 commit

Committed by Olatunji Ruwase
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 16 Mar 2021, 1 commit

Committed by Samyam Rajbhandari
* Fix mis-aligned grad: when a parameter's size is not divisible by the world size, the partitioned gradients are misaligned due to incorrect padding handling; this PR fixes that
* Formatting fix
* Adding static_scale test back for Z3, and also changing hidden size to be not divisible by world_size
* also removing alignment from flat fp16 buffers
* Testing for hidden dim alignment
* inference hook fix
* Update stage3.py
* formatting
* [bug-fix] move params to gpu if offload params is turned off

Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
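
A small sketch of the padding arithmetic involved (names invented): when `numel` is not a multiple of `world_size`, only the tail partition carries padding, and the gradient copy must respect that boundary:

```
def partition_bounds(numel, world_size, rank):
    base = (numel + world_size - 1) // world_size  # padded partition size
    start = rank * base
    end = min(start + base, numel)                 # real (unpadded) end
    padding = base - (end - start)                 # nonzero only at the tail
    return start, end, padding

# e.g. numel=10, world_size=4 -> partitions of 3 elements;
# rank 3 owns [9, 10) plus 2 elements of padding.
```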

- 12 Mar 2021, 1 commit

Committed by Olatunji Ruwase
* Control ZeRO wall clock timers
* Disable more ZeRO3 debug prints

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
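
For reference, DeepSpeed's documented `wall_clock_breakdown` config key controls wall-clock timing; whether this commit gates the ZeRO timers on exactly this flag is an assumption:

```
ds_config = {
    "train_batch_size": 8,
    "wall_clock_breakdown": True,   # enable per-phase wall-clock timers
    "zero_optimization": {"stage": 3},
}
```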

- 09 Mar 2021, 1 commit

Committed by Samyam Rajbhandari
* Squash stage3 v1 (#146)
  Co-authored-by: Samyam <samyamr@microsoft.com>
  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
  Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
  Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
  Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
  Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
  Co-authored-by: eltonzheng <eltonz@microsoft.com>
* Fix correctness bug (#147)
* formatting fix (#150)
* stage3 bugfix (API) update and simplified FP16 Z3 tests (#151)
* fp16 Z3 API update and bugfix
* revert debug change
* ZeRO-3 detach and race condition bugfixes (#149)
* trying out ZeRO-3 race condition fix
* CUDA sync instead of stream
* reduction stream sync
* remove commented code
* Fix optimizer state_dict KeyError (#148)
  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* fix for smaller SGS sizes, ensures each grad is backed by unique tensors (#152)
* Simplifying the logic for getting averaged gradients (#153)
* skip for now
* Z3 Docs redux (#154)
* removing some TODOs and commented code (#155)
* New Z3 defaults (#156)
  Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* formatting
* megatron external params

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>