1. 02 Sep 2021, 1 commit
  2. 26 Aug 2021, 1 commit
  3. 17 Aug 2021, 1 commit
  4. 07 Aug 2021, 2 commits
  5. 03 Aug 2021, 1 commit
  6. 29 Jul 2021, 2 commits
  7. 20 Jul 2021, 1 commit
  8. 14 Jul 2021, 1 commit
  9. 13 Jul 2021, 3 commits
  10. 12 Jul 2021, 1 commit
  11. 10 Jul 2021, 2 commits
    • [zero.Init] post_init partitioning is to be run only by a child module (#1202) · 497b741f
      Stas Bekman committed
      * post_init to be run only by a child module
      
      * better solution
      
      * add test
      
      * safer attr name
      
      * wants half()
      
      * improve doc
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
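      
      For context, `deepspeed.zero.Init` is the context manager whose post-init partitioning hook this change restricts to child modules. A minimal usage sketch follows (the toy model is hypothetical and the internal fix itself is not reproduced; run it under the deepspeed launcher so torch.distributed is set up):
      
      ```
      import torch
      import deepspeed
      
      class Parent(torch.nn.Module):  # hypothetical model, for illustration only
          def __init__(self):
              super().__init__()
              self.child = torch.nn.Linear(64, 64)  # a child module of Parent
      
      # Building the model inside zero.Init lets ZeRO stage 3 partition each
      # submodule's parameters as soon as that submodule finishes __init__,
      # which is where the post-init hook touched by this PR runs.
      with deepspeed.zero.Init():
          model = Parent()
      ```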
    • [zero3] params_to_reduce isn't always there (#1214) · 91f58c06
      Stas Bekman committed
      * [zero3] params_to_reduce isn't always there
      
      Trying to port HF's Electra model to Deepspeed, I'm getting this on the very first backward step (with some extra debug output):
      
      ```
      Incrementing with parameter id 42
      ------ Before allocating allgather param name=generator_lm_head.weight id=41 shape=torch.Size([1]) status=ZeroParamStatus.NOT_AVAILABLE partition size=327680
      ------allgather param with name=generator_lm_head.weight id=41 shape=torch.Size([1]) status=ZeroParamStatus.NOT_AVAILABLE partition size=327680
      ------ Before allocating allgather param name=generator_lm_head.bias id=42 shape=torch.Size([1]) status=ZeroParamStatus.NOT_AVAILABLE partition size=5120
      ------allgather param with name=generator_lm_head.bias id=42 shape=torch.Size([1]) status=ZeroParamStatus.NOT_AVAILABLE partition size=5120
      Backward name=generator_lm_head.weight id=41 shape=torch.Size([5120, 64])
      Inside reduce ipg buckets. name=generator_lm_head.weight id=41 shape=torch.Size([5120, 64]), ipg elements 0, reduce bucket size 4096
      Params in ipg bucket []
      Reducing []
      GOT 1
      torch.Size([4096])
      Traceback (most recent call last):
        File "examples/pytorch/language-modeling/run_mlm.py", line 533, in <module>
          main()
        File "examples/pytorch/language-modeling/run_mlm.py", line 484, in main
          train_result = trainer.train(resume_from_checkpoint=checkpoint)
        File "/mnt/nvme1/code/huggingface/transformers-ds-zero_to_fp32-tests/src/transformers/trainer.py", line 1269, in train
          tr_loss += self.training_step(model, inputs)
        File "/mnt/nvme1/code/huggingface/transformers-ds-zero_to_fp32-tests/src/transformers/trainer.py", line 1778, in training_step
          loss = self.deepspeed.backward(loss)
        File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/engine.py", line 1188, in backward
          self.optimizer.backward(loss)
        File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 2964, in backward
          self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
        File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/fp16/loss_scaler.py", line 53, in backward
          scaled_loss.backward(retain_graph=retain_graph)
        File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
          torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
        File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
          Variable._execution_engine.run_backward(
        File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 1867, in reduce_partition_and_remove_grads
          self.reduce_ready_partitions_and_remove_grads(param, i)
        File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 2212, in reduce_ready_partitions_and_remove_grads
          self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
        File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 1897, in reduce_independent_p_g_buckets_and_remove_grads
          self.reduce_ipg_grads()
        File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 2193, in reduce_ipg_grads
          self.average_tensor(reduction_list, params_to_reduce)
        File "/mnt/nvme1/code/github/00optimize/DeepSpeed-zero-init-child-only-post_init/deepspeed/runtime/zero/stage3.py", line 1972, in average_tensor
          params_to_reduce[0].reduce_gradients_at_owner(
      ```
      
      Is `params_to_reduce` always guaranteed to be populated?
      
      If I add this check, the problem seems to go away.
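      
      A simplified, self-contained sketch of that check (the names mirror the traceback above, but the bodies are placeholders, not the actual stage3 implementation):
      
      ```
      class Param:
          """Stand-in for a ZeRO-3 partitioned parameter (illustrative only)."""
          def reduce_gradients_at_owner(self):
              print("reducing gradients at owner")
      
      def average_tensor(params_to_reduce):
          # The bucket can legitimately be empty here (as in the Electra run above),
          # so guard before indexing into the list.
          if params_to_reduce:
              params_to_reduce[0].reduce_gradients_at_owner()
      
      average_tensor([])          # no-op instead of an IndexError
      average_tensor([Param()])   # reduces as before
      ```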
      
      * real fix
  12. 02 Jul 2021, 1 commit
  13. 29 Jun 2021, 1 commit
  14. 26 Jun 2021, 1 commit
  15. 24 Jun 2021, 2 commits
  16. 17 Jun 2021, 1 commit
  17. 09 Jun 2021, 1 commit
  18. 08 Jun 2021, 1 commit
  19. 22 May 2021, 1 commit
  20. 21 May 2021, 1 commit
  21. 20 May 2021, 1 commit
  22. 19 May 2021, 1 commit
  23. 16 May 2021, 1 commit
  24. 14 May 2021, 2 commits
  25. 08 May 2021, 1 commit
  26. 01 May 2021, 2 commits
  27. 30 Apr 2021, 2 commits
  28. 29 Apr 2021, 1 commit
  29. 25 Apr 2021, 1 commit
    • Add find_unused_parameters option to DeepSpeedEngine (#945) · d0b61f18
      hamlet committed
      * Add find_unused_parameters option
      
      Since unused parameters in modules are sometimes unintended, add an explicit
      error message when they occur, plus an option to avoid the error
      (a config sketch follows this commit list): https://github.com/microsoft/DeepSpeed/issues/707
      
      * Fix syntax error
      
      * Fix yapf error
      
      * Fix yapf error
      
      * Fix yapf error
      
      * Fix yapf error
      
      * Move stage2 find_unused_parameters to config file
      
      * Add stage2 find_unused_parameters
      
      * Add stage2 find_unused_parameters
      
      * Add stage2_find_unused_parameters option
      
      * Change error msg to reflect zero_optimization config change
      
      * Fix yapf error
      
      * Fix yapf errors
      
      * Change find_unused_parameters option name
      
      * Change find_unused_parameters option name
      
      * Change find_unused_parameters option name
      
      * Change find_unused_parameters option name
      
      * Change find_unused_parameters option name
      
      * Add UnusedParametersModel for test option find_unused_parameters
      
      * Add unit test for stage2 find_unused_parameters
      
      * Add cpu-adam compatible check
      
      * Remove dups import
      
      * Trim spaces
      
      * Fix yapf errors
      
      * Trim spaces
      
      * Add False Positive test check
      
      * Fix find_unused_parameters test
      
      * Trim spaces
      
      * Fix yapf error
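      
      A hedged sketch of what enabling the option might look like. The key name was renamed several times during review (see the commits above), so `stage2_find_unused_parameters`, the toy model, and the config values below are assumptions for illustration; check the current DeepSpeed config docs for the final name:
      
      ```
      import torch
      import deepspeed
      
      class UnusedParamModel(torch.nn.Module):
          """Toy model in which one parameter never contributes to the loss."""
          def __init__(self):
              super().__init__()
              self.used = torch.nn.Linear(8, 8)
              self.unused = torch.nn.Linear(8, 8)  # never called in forward()
      
          def forward(self, x):
              return self.used(x)
      
      # Illustrative ZeRO stage-2 config; per the commits above the flag lives under
      # zero_optimization, but the exact key name shown here is an assumption.
      ds_config = {
          "train_micro_batch_size_per_gpu": 1,
          "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
          "zero_optimization": {
              "stage": 2,
              "stage2_find_unused_parameters": True,
          },
      }
      
      model = UnusedParamModel()
      engine, optimizer, _, _ = deepspeed.initialize(
          model=model,
          model_parameters=model.parameters(),
          config=ds_config,
      )
      ```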
  30. 24 Apr 2021, 1 commit
  31. 23 Apr 2021, 1 commit