- 02 Sep 2021, 3 commits

Committed by Olatunji Ruwase

Committed by Olatunji Ruwase

Committed by Hari Prasad
* Added drop_last to DeepSpeedDataLoader; solves issue #326
* Added drop_last as a ds_config option in engine.py, as mentioned by @tjruwase
* Updated config.py and constants.py; added the dataloader_ prefix
* Added a dataloader_drop_last unit test in test_data.py
* Updated simple_model.py and test_data.py; removed batch_size from test_data.py
* Corrected yapf, pre-commit, and formatting issues
* Fixed unit test issues; use fp32 to make things work
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
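
The drop_last flag added above mirrors the usual data-loader semantics: when the dataset size is not a multiple of the batch size, the final short batch is discarded. A minimal pure-Python sketch of that behavior (the `batches` helper is hypothetical, for illustration only):

```python
def batches(data, batch_size, drop_last=False):
    """Yield consecutive batches; optionally drop the final partial batch,
    mirroring the semantics of a data loader's drop_last flag."""
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        if drop_last and len(batch) < batch_size:
            return  # discard the incomplete trailing batch
        yield batch

# 10 samples with batch_size=4: drop_last keeps only the two full batches
data = list(range(10))
full = list(batches(data, 4))                     # 3 batches, last has 2 items
trimmed = list(batches(data, 4, drop_last=True))  # 2 full batches
```

With drop_last enabled every step sees a full batch, which keeps per-step shapes uniform across data-parallel ranks.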

- 31 Aug 2021, 1 commit

Committed by Ammar Ahmad Awan
* Remove the wrong function with a duplicate name
* Fix formatting
* Add an mpu check; fix tests

- 28 Aug 2021, 3 commits

Committed by Olatunji Ruwase

Committed by Olatunji Ruwase
* Rename PA_TO_cpu
* Code cleanup
* Revert an accidental change

Committed by Reza Yazdani
* Add more synchronizations and barriers to resolve a GPU-halt issue
* Remove unhelpful broadcasts

- 27 Aug 2021, 1 commit

Committed by Reza Yazdani
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 26 Aug 2021, 1 commit

Committed by Olatunji Ruwase
* Callable option for optimizer and scheduler
* Add a unit test
* Use the base optimizer to construct the lr scheduler
* Disable debug prints; fix formatting; remove a dead import

- 25 Aug 2021, 1 commit

Committed by Jeff Rasley
* Restore fp16 params if no ZeRO checkpoints are available
* Formatting

- 18 Aug 2021, 1 commit

Committed by Jeff Rasley

- 17 Aug 2021, 2 commits

Committed by Ammar Ahmad Awan
Co-authored-by: Alex Muzio <Alex.Muzio@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Felipe Cruz Salinas <Andres.Cruz@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <shaden.smith@microsoft.com>
Co-authored-by: Young Jin Kim <youki@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>

Committed by Conglong Li
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 07 Aug 2021, 2 commits

Committed by Olatunji Ruwase
* Use the correct input size for splits
* Use smarter partitioning

Committed by Olatunji Ruwase

- 06 Aug 2021, 1 commit

Committed by Denis Tarasov
Make the add operation in-place. Without it the momentum buffer decays to zero and training has no effect on the corresponding parameters.
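
A toy reproduction of the failure mode described above, in pure Python with hypothetical names (the real fix lives in the optimizer code): if the decay is applied in place but the gradient is added out of place, the sum never lands back in the stored buffer, so the state only ever shrinks.

```python
def buggy_update(buf, grad, momentum=0.9):
    # Decay happens in place, but the add builds a new list that is
    # silently discarded -- the stored buffer only ever decays.
    for i in range(len(buf)):
        buf[i] *= momentum
    _ = [b + g for b, g in zip(buf, grad)]

def fixed_update(buf, grad, momentum=0.9):
    # In-place add: the accumulated value stays in the stored buffer.
    for i in range(len(buf)):
        buf[i] = momentum * buf[i] + grad[i]

buggy, fixed = [1.0], [1.0]
for _ in range(100):
    buggy_update(buggy, [1.0])
    fixed_update(fixed, [1.0])
# buggy[0] has decayed toward zero; fixed[0] converges toward
# grad / (1 - momentum) = 10, as a momentum buffer should
```

The same pattern applies to tensor code: an out-of-place `add` returns a fresh tensor that nothing holds onto, while `add_` mutates the state the optimizer keeps between steps.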

- 03 Aug 2021, 1 commit

Committed by Jeff Rasley
* Fix empty-grad zero tests
* Don't clear grads in the stage 1 code path
* Prevent None grads from being reduced

- 29 Jul 2021, 2 commits

Committed by Olatunji Ruwase
* Make round-robin gradient partitioning configurable (default False)
* Use the correct default
* Log the config setting

Committed by Olatunji Ruwase

- 27 Jul 2021, 1 commit

Committed by Jeff Rasley
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 21 Jul 2021, 1 commit

Committed by Reza Yazdani
* Fix the inference API for FP32 and non-masking GPT-based models
* Use a dummy tensor if input_mask is None
* Fix input_mask; minor fix
* Send input_mask to the compute_attn function for checking

- 20 Jul 2021, 1 commit

Committed by Stas Bekman
* zero_param_shapes: switch to round_robin_fp16_groups
* Add a test
* Work around old torch

- 16 Jul 2021, 1 commit

Committed by Stas Bekman
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 14 Jul 2021, 2 commits

Committed by Stas Bekman
* Fix reference counting in backward over multiple forwards
* Test + cleanup

Committed by Stas Bekman
* Release temporary memory when consolidating fp16 weights (take 2)
* Cleanup
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 13 Jul 2021, 4 commits

Committed by Olatunji Ruwase
* Disable the copy stream
* Format fixes
* Remove debug code
* Fix indentation

Committed by Stas Bekman
https://github.com/microsoft/DeepSpeed/pull/1220 fixed the leak but led to another problem. Reverting that part so that we can do the release; will work on it after the release. @jeffra

Committed by Stas Bekman
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

Committed by Stas Bekman
* Add a live zero-checkpoint-to-fp32 consolidation version
* More docs; improve and update the docs
* zero2 model states use a different filename
* Make debug mode CLI-configurable
* Copy the script only on node 0, process 0
* Validate that we have the right number of files
* Revamp _get_zero_param_shapes; instrument for easier debugging
* Correct an assertion; rename the API; add an even simpler API
* Style fixes
* Revert the unpartitioned_params detection and report, as it is most likely persistent params
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

- 12 Jul 2021, 2 commits

Committed by senwang
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

Committed by Stas Bekman

- 10 Jul 2021, 2 commits

Committed by Stas Bekman
* Run post_init only from a child module
* Better solution; add a test
* Use a safer attribute name
* wants half()
* Improve the doc
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

Committed by Stas Bekman
* [zero3] params_to_reduce isn't always there

Trying to port HF's Electra model to DeepSpeed, I'm getting this on the very first backward step (with some extra debug):

```
Incrementing with parameter id 42
------ Before allocating allgather param name=generator_lm_head.weight id=41 shape=torch.Size([1]) status=ZeroParamStatus.NOT_AVAILABLE partition size=327680
------allgather param with name=generator_lm_head.weight id=41 shape=torch.Size([1]) status=ZeroParamStatus.NOT_AVAILABLE partition size=327680
------ Before allocating allgather param name=generator_lm_head.bias id=42 shape=torch.Size([1]) status=ZeroParamStatus.NOT_AVAILABLE partition size=5120
------allgather param with name=generator_lm_head.bias id=42 shape=torch.Size([1]) status=ZeroParamStatus.NOT_AVAILABLE partition size=5120
Backward name=generator_lm_head.weight id=41 shape=torch.Size([5120, 64])
Inside reduce ipg buckets. name=generator_lm_head.weight id=41 shape=torch.Size([5120, 64]), ipg elements 0, reduce bucket size 4096
Params in ipg bucket []
Reducing []
GOT 1 torch.Size([4096])
Traceback (most recent call last):
  File "examples/pytorch/language-modeling/run_mlm.py", line 533, in <module>
    main()
  File "examples/pytorch/language-modeling/run_mlm.py", line 484, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/mnt/nvme1/code/huggingface/transformers-ds-zero_to_fp32-tests/src/transformers/trainer.py", line 1269, in train
    tr_loss += self.training_step(model, inputs)
  File "/mnt/nvme1/code/huggingface/transformers-ds-zero_to_fp32-tests/src/transformers/trainer.py", line 1778, in training_step
    loss = self.deepspeed.backward(loss)
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/engine.py", line 1188, in backward
    self.optimizer.backward(loss)
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 2964, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/fp16/loss_scaler.py", line 53, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 1867, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 2212, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 1897, in reduce_independent_p_g_buckets_and_remove_grads
    self.reduce_ipg_grads()
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 2193, in reduce_ipg_grads
    self.average_tensor(reduction_list, params_to_reduce)
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 1972, in average_tensor
    params_to_reduce[0].reduce_gradients_at_owner(
```

Is it always the case that `params_to_reduce` is populated? If I add this check the problem seems to go away.

* Real fix

- 02 Jul 2021, 1 commit

Committed by Samyam Rajbhandari
* Set contiguous_gradients to True by default. Features such as reduce_scatter depend on contiguous gradients being True, and this is also the preferred default configuration.
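
For reference, this setting lives under `zero_optimization` in the DeepSpeed JSON config. A minimal fragment sketched as a Python dict (the surrounding values here are illustrative, not a recommended configuration):

```python
# Illustrative DeepSpeed config fragment. After this change
# contiguous_gradients is True by default, so it only needs to be
# spelled out explicitly in order to turn it off.
ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": True,  # now the default
        "reduce_scatter": True,        # depends on contiguous gradients
    },
}
```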

- 29 Jun 2021, 1 commit

Committed by Stas Bekman
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

- 26 Jun 2021, 1 commit

Committed by Stas Bekman
* Undo noise
* Another

- 24 Jun 2021, 4 commits

Committed by Hyunwoong Ko
* Fix bugs in non-contiguous tensor broadcasting
* Fix a typo
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>

Committed by Stas Bekman
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Committed by Stas Bekman
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Committed by Stas Bekman
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>