[MiCS] [Fix] saving and loading model checkpoint logic for MiCS sharding (#3440)
* fix MiCS save checkpoint hanging
* MiCS load_checkpoint
* copyright
* fix for torch-1.9.0: the all_reduce_coalesced API does not support the NCCL backend
* Naming alignment
* adding more test conditions for mics shard size
* test with different shard sizes
* adding assertion for better error msg
---------
Co-authored-by: Zhen Zhang <zhzhn@amazon.com>