- 26 Aug, 2023, 1 commit
  - Committed by hamlet
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
    Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
- 25 Aug, 2023, 1 commit
  - Committed by Björn Plüster
    Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
- 24 Aug, 2023, 1 commit
  - Committed by Olatunji Ruwase
    * Load z3 checkpoints for inference
    * PR feedback
    * Fix API bugs
    * Fix typo
- 27 Jul, 2023, 1 commit
  - Committed by Alexander Jipa
    Co-authored-by: Alexander Jipa <azzhipa@amazon.com>
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
    Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
- 26 Jul, 2023, 1 commit
  - Committed by mzl
    * remove duplicate check for pp and zero stage
    * remove line
    * use ZeroStageEnum

    Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
- 20 Jul, 2023, 1 commit
  - Committed by Olatunji Ruwase
    * Option to exclude frozen weights for checkpoint save
    * Extend unit test
    * Support PP training
- 15 Jul, 2023, 1 commit
  - Committed by mzl
    Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
- 09 May, 2023, 1 commit
  - Committed by YiSheng5
    Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
    Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
- 03 May, 2023, 1 commit
  - Committed by Joe Mayer
    * Add ZeRO 1 support to PP for BF16.
    * Switching enum.

    Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
- 02 May, 2023, 1 commit
  - Committed by Nr Wu
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- 30 Apr, 2023, 1 commit
  - Committed by hablb
    Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
- 21 Apr, 2023, 1 commit
  - Committed by Olatunji Ruwase
    * zero3 checkpoint frozen params
    * Remove debug prints
    * Move to cpu
    * WIP
    * WIP
    * WIP
    * Cleanup
    * Cleanup
    * Extend unit test for frozen params
    * API fix
- 31 Mar, 2023, 1 commit
  - Committed by Michael Wyatt
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- 27 Mar, 2023, 1 commit
  - Committed by Jeff Rasley
- 24 Mar, 2023, 1 commit
  - Committed by Satpal Singh Rathore
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- 26 Jan, 2023, 1 commit
  - Committed by Ma, Guokai
    * Integrate accelerator abstraction interface into deepspeed/
    * Fix error message in fp16/fused_optimizer
    * fix error message in fp16/unfused_optimizer.py
    * assign get_accelerator().pin_memory() result to input Tensor name
    * no need to check cuda and whether nvtx supported
    * move try-except into inner most block
    * call Event() and Stream() in get_accelerator() for data type
    * Make Stream and Event as properties of abstract interface so they can be used as data type in deepspeed
    * Apply op_builder backend api change from #2705 from @jeffra
    * fix tests where Builder NAME is used
    * keep original ...Builder.NAME interface instead of ...Builder().NAME interface
    * fix builder closure for installation
    * fix randomltd builder
    * add comments to clarify create_op_builder and get_op_builder
    * fix compatibility with pip install -e

    Co-authored-by: Cheng Li <pistasable@gmail.com>
    Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
- 17 Dec, 2022, 1 commit
  - Committed by Alexander Jipa
    taking gradient accumulation steps into account for throughput calculation

    Co-authored-by: Alexander Jipa <azzhipa@amazon.com>
    Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
- 13 Dec, 2022, 1 commit
  - Committed by Conglong Li
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- 22 Oct, 2022, 1 commit
  - Committed by Adam Moody
    * parallelize layer checkpoints across data parallel groups
    * use partition_uniform to determine start/end index values
    * formatting fix
    * config: add option for parallel write of layer checkpoints in pipeline stage
    * yapf fixes
    * enable parallel layer write according to config param
    * avoid extraneous makedir when rank 0 writes all layers

    Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
- 30 Jul, 2022, 1 commit
  - Committed by Arpan Jain
    Co-authored-by: Arpan Jain <t-arpanjain@microsoft.com>
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- 28 Jul, 2022, 1 commit
  - Committed by trajep
    * enable checkpoint engine
    * separated nebula config
    * add __init__.py for nebula importing
    * linter fix
    * fix: ds_config is None
    * fix: ds config
    * fix: get sd loader fix
    * align the API with torch raw code
    * linter fix
    * remove duplicate tag params
    * make checkpoint_engine as required args
    * fix args
    * extract parameters out to config
    * fix: load state dict
    * separate load engine
    * linter fix
    * extract checkpoint engine to abstract class
    * linter fix
    * construct function args fix
    * add docs for dev/customers
    * linter fix
    * remove load engine
    * print->log_dist
    * linter fix
    * add tag flag to distinguish the loading order

    Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
    Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- 26 Jul, 2022, 2 commits
  - Committed by Alex Hedges
  - Committed by Quentin Anthony
    Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- 21 Jun, 2022, 1 commit
  - Committed by Karim Foda
    Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
- 16 Jun, 2022, 1 commit
  - Committed by Quentin Anthony
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- 11 Jun, 2022, 1 commit
  - Committed by Ammar Ahmad Awan
    Co-authored-by: Quentin Anthony <qganthony@yahoo.com>
    Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- 12 May, 2022, 1 commit
  - Committed by Jeff Rasley
- 10 May, 2022, 1 commit
  - Committed by Stas Bekman
    * [pipe] prevent deadlock with multiple evals sequence
    * style
    * style
    * style
    * align DSE commit w. latest master

    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
    Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
- 04 May, 2022, 1 commit
  - Committed by Zhengqiang Yin
- 27 Apr, 2022, 1 commit
  - Committed by Jeff Rasley
    Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
    Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
- 20 Apr, 2022, 1 commit
  - Committed by Olatunji Ruwase
    * bf16 updates
    * Got bf16 working
    * fp32 reduction; flattened tensors
    * bf16+zero_stage_1 first cut
    * finish zero_stage 1 sharding
    * Matching fp16 with debugging codes
    * Matching loss with fp16
    * Fix gradient clipping
    * bf16 gradient clipping fix bf16 checkpoint save/load
    * Unscale grad norm
    * Fix grad norm scaling
    * Enable loading fp16_zero_1 into bf16_zero_1 engine and vice versa
    * Fix clip_grad key error
    * Reduce tied weight gradients
    * Fix grad norm for moe
    * Reduce specified gradients
    * Use O(n) instead of O(n^2)
    * Remove optimizer restriction for bf16
    * Link bf16 & fp32 params
    * Clip gradients of last stage tied weights
    * Simplify tied weights reduction logic
    * Also clip all tp rank parameters
    * lp to hp mapping
    * Link lp/hp/optim state; Refresh links after checkpoint load
    * Remove debug print
    * Remove debug print
    * Simplify zero_grad logic
    * fp32 accessors
    * Fix update bug

    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
- 11 Feb, 2022, 1 commit
  - Committed by Du Li
- 23 Jan, 2022, 1 commit
  - Committed by Alex Hedges
- 22 Oct, 2021, 2 commits
  - Committed by Conglong Li
  - Committed by Conglong Li
    * fix pp
    * better fix
- 10 Oct, 2021, 1 commit
  - Committed by Conglong Li
- 09 Oct, 2021, 1 commit
  - Committed by Conglong Li
    * CL+PP
    * add TODO
- 08 Oct, 2021, 1 commit
  - Committed by Thomas Wang
    Co-authored-by: Thomas <thomas@Thomass-MacBook-Pro.local>
    Co-authored-by: Shaden Smith <shaden.smith@microsoft.com>
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
    Co-authored-by: Tunji Ruwase <olruwase@microsoft.com>
    Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
- 02 Oct, 2021, 1 commit
  - Committed by Hyunwoong Ko
    * Add flexibility of pipeline module and engine
    * Separate PRs
    * Separate PRs

    Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
- 30 Sep, 2021, 1 commit
  - Committed by Jeff Rasley
    Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
    Co-authored-by: Shaden Smith <shaden.smith@microsoft.com>
    Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
    Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
    Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
    Co-authored-by: eltonzheng <eltonz@microsoft.com>
    Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>