1. 14 Nov 2021: 1 commit
  2. 13 Nov 2021: 1 commit
  3. 31 Oct 2021: 1 commit
    • ZeRO3, improved parameter all-gather operation (#1188) · c0eeb69d
      Authored by Zhen Zhang
      * remove norm(), avoid memcpy after allgather
      
      1) Removing the norm computation in debug printing.
      2) Changing _all_gather to a synchronous op in fetch_sub_module.
          Reason: the async version is not actually async, because each
          all_gather calls torch.cuda.synchronize() to guarantee that the
          previous communication op has completed.
      3) Adding a new function, _allgather_params_split_launch.
          The existing _allgather_params does an explicit memcpy after the
          all-gather op. Avoiding that explicit copy on the Python side
          improves performance (see the sketch after the known-issue note below).
      
      Known issue:
          `torch.distributed.all_gather` still performs an implicit memcpy
          at the end of each `ncclAllGather`.
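
      As an aside, here is a minimal sketch of the gather-into-views idea behind the new split-launch path, assuming an initialized torch.distributed process group. It is illustrative only: the `ds_*` attribute names and the helper name are stand-ins for this example, not DeepSpeed's actual internals.

      ```python
      import torch
      import torch.distributed as dist

      def allgather_param_into_views(param, world_size):
          """Illustrative only: gather this rank's ZeRO-3 partition directly into
          per-rank views of one pre-allocated flat buffer, so no extra Python-side
          memcpy is needed once the collective finishes."""
          partition = param.ds_tensor          # this rank's 1/world_size shard (assumed attribute)
          numel = partition.numel()
          # The buffer is fully overwritten by all_gather, so torch.empty is enough;
          # initializing it with torch.ones/zeros would just waste a kernel launch.
          flat = torch.empty(numel * world_size,
                             dtype=partition.dtype,
                             device=partition.device)
          # One contiguous view per rank; all_gather writes into them in place.
          views = [flat.narrow(0, r * numel, numel) for r in range(world_size)]
          dist.all_gather(views, partition)
          # Re-expose the gathered weights under the parameter's original shape.
          param.data = flat.narrow(0, 0, param.ds_numel).view(param.ds_shape)
      ```

      Note this only removes the explicit copy on the Python side; per the known issue above, `torch.distributed.all_gather` still performs an implicit copy at the end of the underlying `ncclAllGather`.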
      
      * WIP: wrapped ncclAllGather as a customized op in DS
      
      A micro-benchmark shows that all-gathering a transformer layer with
      9,834,560 half-precision elements improves by about 1.1 ms on an
      aws-p4d instance.
      
      * WIP: integrated into partition_parameters
      
      Performance improvement of a 5.1B-parameter BERT on aws-p4d:
      fwd: 300 ms -> 200 ms
      bwd: 680 ms -> 610 ms
      
      * Fix format
      
      * cleaned dead code, modified unit test
      
      * removed the customized C++ extension
      
      reverted back to using the torch.distributed API
      
      * change torch.ones to torch.empty
      
      * typo
      
      * warn if the input to allgather is not a CUDA tensor
      
      * fix formatting
      
      * fix: move ds_tensor to the CUDA device
      
      though it is strange that ds_tensor had not already been moved to CUDA
      
      * remove try clause on the path for fetching params
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
      Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
  4. 22 Oct 2021: 1 commit
  5. 02 Oct 2021: 1 commit
  6. 30 Sep 2021: 1 commit
  7. 14 Jul 2021: 1 commit
  8. 13 Jul 2021: 1 commit
    • [model weights] zero_to_fp32 multiple improvements (#1181) · 2a921069
      Authored by Stas Bekman
      * add a version that consolidates a live ZeRO checkpoint to fp32
      
      * some more docs
      
      * zero2 model states use a different filename
      
      * fix
      
      * make debug mode configurable from the CLI
      
      * copy the script only on node 0 process 0
      
      * validate that we have the right number of files
      
      * revamp _get_zero_param_shapes and instrument it for easier debugging
      
      * correct assertion
      
      * rename API; add even simpler API
      
      * style
      
      * improve the docs
      
      * update the docs
      
      * revert the unpartitioned_params detection and reporting, since these are most likely persistent params
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
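
      For context, a hedged sketch of how the consolidation helpers from this PR are typically used. The module path and function names below are assumptions based on the PR description; check the zero_to_fp32.py script that DeepSpeed copies into the checkpoint directory for the authoritative interface.

      ```python
      # Hedged sketch, not the authoritative API: consolidate a sharded ZeRO
      # checkpoint into a single fp32 state_dict. Paths and names are illustrative.
      from deepspeed.utils.zero_to_fp32 import (
          get_fp32_state_dict_from_zero_checkpoint,
          load_state_dict_from_zero_checkpoint,
      )

      checkpoint_dir = "output/checkpoint-1000"   # hypothetical checkpoint directory

      # Offline consolidation: reconstruct a full fp32 state_dict from the shards.
      state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)

      # The "even simpler" route mentioned above: load the consolidated weights
      # straight into an existing fp32 model instance.
      # model = load_state_dict_from_zero_checkpoint(model, checkpoint_dir)
      ```

      The standalone script copied to the checkpoint directory serves the same purpose offline, roughly `python zero_to_fp32.py <checkpoint_dir> <output_file>`, with the debug flag now exposed on the command line.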
  9. 12 Jul 2021: 1 commit
  10. 10 Jul 2021: 1 commit
    • [zero3] params_to_reduce isn't always there (#1214) · 91f58c06
      Authored by Stas Bekman
      * [zero3] params_to_reduce isn't always there
      
      While trying to port HF's Electra model to DeepSpeed, I'm getting this on the very first backward step (with some extra debug output):
      
      ```
      Incrementing with parameter id 42
      ------ Before allocating allgather param name=generator_lm_head.weight id=41 shape=torch.Size([1]) status=ZeroParamStatus.NOT_AVAILABLE partition size=327680
      ------allgather param with name=generator_lm_head.weight id=41 shape=torch.Size([1]) status=ZeroParamStatus.NOT_AVAILABLE partition size=327680
      ------ Before allocating allgather param name=generator_lm_head.bias id=42 shape=torch.Size([1]) status=ZeroParamStatus.NOT_AVAILABLE partition size=5120
      ------allgather param with name=generator_lm_head.bias id=42 shape=torch.Size([1]) status=ZeroParamStatus.NOT_AVAILABLE partition size=5120
      Backward name=generator_lm_head.weight id=41 shape=torch.Size([5120, 64])
      Inside reduce ipg buckets. name=generator_lm_head.weight id=41 shape=torch.Size([5120, 64]), ipg elements 0, reduce bucket size 4096
      Params in ipg bucket []
      Reducing []
      GOT 1
      torch.Size([4096])
      Traceback (most recent call last):
        File "examples/pytorch/language-modeling/run_mlm.py", line 533, in <module>
          main()
        File "examples/pytorch/language-modeling/run_mlm.py", line 484, in main
          train_result = trainer.train(resume_from_checkpoint=checkpoint)
        File "/mnt/nvme1/code/huggingface/transformers-ds-zero_to_fp32-tests/src/transformers/trainer.py", line 1269, in train
          tr_loss += self.training_step(model, inputs)
        File "/mnt/nvme1/code/huggingface/transformers-ds-zero_to_fp32-tests/src/transformers/trainer.py", line 1778, in training_step
          loss = self.deepspeed.backward(loss)
        File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/engine.py", line 1188, in backward
          self.optimizer.backward(loss)
        File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 2964, in backward
          self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
        File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/fp16/loss_scaler.py", line 53, in backward
          scaled_loss.backward(retain_graph=retain_graph)
        File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
          torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
        File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
          Variable._execution_engine.run_backward(
        File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 1867, in reduce_partition_and_remove_grads
          self.reduce_ready_partitions_and_remove_grads(param, i)
        File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 2212, in reduce_ready_partitions_and_remove_grads
          self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
        File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 1897, in reduce_independent_p_g_buckets_and_remove_grads
          self.reduce_ipg_grads()
        File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 2193, in reduce_ipg_grads
          self.average_tensor(reduction_list, params_to_reduce)
        File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 1972, in average_tensor
          params_to_reduce[0].reduce_gradients_at_owner(
      ```
      
      Is `params_to_reduce` always populated?
      
      If I add a check for this, the problem seems to go away.
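
      For illustration, a minimal, self-contained sketch of the kind of guard described here; the merged "real fix" below may differ.

      ```python
      # Illustrative guard only, not DeepSpeed's merged fix: never index
      # params_to_reduce[0] when the flushed IPG bucket turned out to be empty.
      def reduce_bucket(params_to_reduce):
          if not params_to_reduce:
              # Nothing was queued for reduction in this bucket, so there is
              # no owner rank to hand gradients to; just return.
              return
          first = params_to_reduce[0]
          # ... continue with first.reduce_gradients_at_owner(...) as before
      ```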
      
      * real fix
  11. 29 Jun 2021: 1 commit
  12. 26 Jun 2021: 1 commit
  13. 24 Jun 2021: 2 commits
  14. 17 Jun 2021: 1 commit
  15. 21 May 2021: 1 commit
  16. 19 May 2021: 1 commit
  17. 14 May 2021: 1 commit
  18. 01 May 2021: 1 commit
  19. 30 Apr 2021: 2 commits
  20. 23 Apr 2021: 1 commit
  21. 21 Apr 2021: 1 commit
  22. 19 Apr 2021: 1 commit
  23. 15 Apr 2021: 1 commit
  24. 08 Apr 2021: 2 commits
  25. 17 Mar 2021: 1 commit
  26. 16 Mar 2021: 1 commit
  27. 12 Mar 2021: 1 commit
  28. 09 Mar 2021: 1 commit