1. 21 Jul, 2022 1 commit
    • Checkpoint reshaping (#1953) · 80d0a32f
      Olatunji Ruwase committed
      * unit test, remove exception, add notes
      
      * Move param_shapes to model files
      
      * Remove hard-coded constants
      
      * Conditioned to zero optimizer
      
      * Add zero checkpoint merging
      
      * Print checkpoint version
      
      * Reshape zero_* ckpt files
      
      * Merge zero* files contraction
      
      * Utils for 3D contraction reshaping
      
      * Remove bogus import
      
      * Support bf16_zero ckpts
      
      * Add param slice mappings
      
      * Load universal checkpoints
      
      * Per group mappings from Stas
      
      * Hack to load bf16 zero files
      
      * Param attributes
      
      * WIP
      
      * Fix api bug
      
      * Update lp with local/remote hp
      
      * Disable vocab padding handling
      
      * Update z2 checkpoint
      
      * Remove debug prints
      
      * Remove debug prints; Rebase unit test
      
      * Add reshape assert
      
      * Padding
      
      * Typo
      
      * Catch nonexistent checkpoint path
      
      * Cleanup
      
      * Restore checkpoint state comparisons
      
      * Add torch version guards
      
      * More precise avoidance of false positives.
      Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
  2. 14 Jul, 2022 1 commit
  3. 08 Jul, 2022 1 commit
  4. 07 Jul, 2022 1 commit
  5. 28 Jun, 2022 1 commit
  6. 21 Jun, 2022 1 commit
  7. 14 Jun, 2022 1 commit
  8. 11 Jun, 2022 1 commit
  9. 16 May, 2022 1 commit
  10. 12 May, 2022 1 commit
  11. 20 Apr, 2022 1 commit
    • bf16+pipeline parallelism (#1801) · 56c52238
      Olatunji Ruwase committed
      * bf16 updates
      
      * Got bf16 working
      
      * fp32 reduction; flattened tensors
      
      * bf16+zero_stage_1 first cut
      
      * finish zero_stage 1 sharding
      
      * Matching fp16 with debugging codes
      
      * Matching loss with fp16
      
      * Fix gradient clipping
      
      * bf16 gradient clipping fix
      bf16 checkpoint save/load
      
      * Unscale grad norm
      
      * Fix grad norm scaling
      
      * Enable loading fp16_zero_1 into bf16_zero_1 engine and vice versa
      
      * Fix clip_grad key error
      
      * Reduce tied weight gradients
      
      * Fix grad norm for moe
      
      * Reduce specified gradients
      
      * Use O(n) instead of O(n^2)
      
      * Remove optimizer restriction for bf16
      
      * Link bf16 & fp32 params
      
      * Clip gradients of last stage tied weights
      
      * Simplify tied weights reduction logic
      
      * Also clip all tp rank parameters
      
      * lp to hp mapping
      
      * Link lp/hp/optim state; Refresh links after checkpoint load
      
      * Remove debug print
      
      * Remove debug print
      
      * Simplify zero_grad logic
      
      * fp32 accessors
      
      * Fix update bug
      Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
  12. 23 Mar, 2022 1 commit
  13. 18 Mar, 2022 1 commit
  14. 17 Mar, 2022 1 commit
  15. 01 Mar, 2022 1 commit
  16. 16 Feb, 2022 1 commit
  17. 08 Feb, 2022 1 commit
  18. 23 Jan, 2022 1 commit
  19. 19 Jan, 2022 1 commit
  20. 15 Jan, 2022 1 commit
  21. 07 Jan, 2022 1 commit
  22. 14 Dec, 2021 1 commit
  23. 28 Nov, 2021 1 commit
  24. 27 Nov, 2021 1 commit
  25. 17 Nov, 2021 1 commit
  26. 14 Nov, 2021 1 commit
  27. 13 Nov, 2021 1 commit
  28. 02 Nov, 2021 1 commit
    • Bfloat16 zero2 (#1398) · 648f7bfa
      Rana Ali Amjad committed
      * Changes for bfloat16 Zero2
      
      * Cleaned up additional comments and debugging code
      
      * Adapted fp16_master_weights_and_grads option to cover BF16
      
      * Reverted fp16_master_weights_and_gradients extension to BFloat16 and minor cleanup
      
      * Fixed formatting and variable naming errors recognized in testing
      
      * Added relevant unit tests for bfloat16 with ZeRO-2
      
      * Updates conditions for skipping BFloat16 unit tests
      
      * Added check for NCCL inconsistent version naming convention
      
      * Update skip message for Bfloat16 tests to mention additional checks
      Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
  29. 23 Oct, 2021 1 commit
  30. 07 Oct, 2021 1 commit
  31. 03 Oct, 2021 1 commit
  32. 02 Oct, 2021 1 commit
  33. 30 Sep, 2021 1 commit
  34. 10 Sep, 2021 1 commit
  35. 26 Aug, 2021 1 commit
  36. 17 Aug, 2021 1 commit
  37. 07 Aug, 2021 1 commit
  38. 03 Aug, 2021 1 commit
  39. 29 Jul, 2021 2 commits