  1. 21 Jul, 2022 1 commit
    • Checkpoint reshaping (#1953) · 80d0a32f
      Committed by Olatunji Ruwase
      * unit test, remove exception, add notes
      
      * Move param_shapes to model files
      
      * Remove hard-coded constants
      
      * Conditioned to zero optimizer
      
      * Add zero checkpoint merging
      
      * Print checkpoint version
      
      * Reshape zero_* ckpt files
      
      * Merge zero* files contraction
      
      * Utils for 3D contraction reshaping
      
      * Remove bogus import
      
      * Support bf16_zero ckpts
      
      * Add param slice mappings
      
      * Load universal checkpoints
      
      * Per group mappings from Stas
      
      * Hack to load bf16 zero files
      
      * Param attributes
      
      * WIP
      
      * Fix api bug
      
      * Update lp with local/remote hp
      
      * Disable vocab padding handling
      
      * Update z2 checkpoint
      
      * Remove debug prints
      
      * Remove debug prints; Rebase unit test
      
      * Add reshape assert
      
      * Padding
      
      * Typo
      
      * Catch nonexistent checkpoint path
      
      * Cleanup
      
      * Restore checkpoint state comparisons
      
      * Add torch version guards
      
      * More precise avoidance of false positives.
      Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
  2. 20 Jul, 2022 1 commit
  3. 19 Jul, 2022 1 commit
  4. 14 Jul, 2022 1 commit
  5. 08 Jul, 2022 1 commit
  6. 07 Jul, 2022 1 commit
  7. 28 Jun, 2022 1 commit
  8. 22 Jun, 2022 2 commits
  9. 21 Jun, 2022 1 commit
  10. 20 Jun, 2022 1 commit
  11. 16 Jun, 2022 2 commits
  12. 14 Jun, 2022 1 commit
  13. 11 Jun, 2022 1 commit
  14. 23 May, 2022 1 commit
  15. 19 May, 2022 1 commit
  16. 16 May, 2022 1 commit
  17. 14 May, 2022 1 commit
  18. 12 May, 2022 1 commit
  19. 10 May, 2022 2 commits
  20. 07 May, 2022 1 commit
  21. 06 May, 2022 1 commit
    • Improve z3 trace management (#1916) · 673cb608
      Committed by Olatunji Ruwase
      * Fix OOM and type mismatch
      
      * Toggle prefetching
      
      * Disable z3 prefetching for inference (temp workaround)
      
      * Fix zero3 tracing issues
      
      * Remove debug prints
      
      * Enable prefetch for inference
      
      * Code clarity
      
      * Invalidate trace cache
      
      * Trace cache invalidation when needed
      Separate nvme prefetch from all-gather prefetch
      
      * Track last used step id
      
      * Use debug name in error message
      
      * Construct param trace from module trace
      Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
  22. 04 May, 2022 1 commit
  23. 03 May, 2022 1 commit
  24. 29 Apr, 2022 1 commit
  25. 27 Apr, 2022 2 commits
  26. 26 Apr, 2022 1 commit
  27. 21 Apr, 2022 2 commits
  28. 20 Apr, 2022 3 commits
    • fb00e6a1
    • Use f-strings where possible (#1900) · 3b5b92cb
      Committed by Manuel R. Ciosici
      Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
    • bf16+pipeline parallelism (#1801) · 56c52238
      Committed by Olatunji Ruwase
      * bf16 updates
      
      * Got bf16 working
      
      * fp32 reduction; flattened tensors
      
      * bf16+zero_stage_1 first cut
      
      * finish zero_stage 1 sharding
      
      * Matching fp16 with debugging codes
      
      * Matching loss with fp16
      
      * Fix gradient clipping
      
      * bf16 gradient clipping fix
      bf16 checkpoint save/load
      
      * Unscale grad norm
      
      * Fix grad norm scaling
      
      * Enable loading fp16_zero_1 into bf16_zero_1 engine and vice versa
      
      * Fix clip_grad key error
      
      * Reduce tied weight gradients
      
      * Fix grad norm for moe
      
      * Reduce specified gradients
      
      * Use O(n) instead of O(n^2)
      
      * Remove optimizer restriction for bf16
      
      * Link bf16 & fp32 params
      
      * Clip gradients of last stage tied weights
      
      * Simplify tied weights reduction logic
      
      * Also clip all tp rank parameters
      
      * lp to hp mapping
      
      * Link lp/hp/optim state; Refresh links after checkpoint load
      
      * Remove debug print
      
      * Remove debug print
      
      * Simplify zero_grad logic
      
      * fp32 accessors
      
      * Fix update bug
      Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
  29. 24 Mar, 2022 1 commit
  30. 23 Mar, 2022 1 commit
  31. 18 Mar, 2022 1 commit
  32. 17 Mar, 2022 1 commit
  33. 12 Mar, 2022 1 commit