1. 18 Feb 2022, 2 commits
    • [AMP] support GPU BF16 amp for dygraph (#39029) · 7d6d3848
      Committed by zhangbo9674
      * support dtype param for auto_cast
      
      * add amp_dtype for tracer
      
      * add unsupported bf16 list
      
      * support bf16 amp for O2
      
      * refine python interface for bfloat16
      
      * refine code
      
      * refine code
      
      * refine unittest
      
      * refine code
      
      * refine code
      
      * add bf16 o1
      
      * refine code by comment
      
      * add gradient accumulator
      
      * add recompute
    • Fix sharding group (#39668) · bc3ca678
      Committed by Baibaifan
      * fix_sharding_group
      
      * fix_sharding_group
  2. 17 Feb 2022, 1 commit
  3. 16 Feb 2022, 1 commit
    • sync/geo test ok & fix heter_worker program ok (#39511) · b2986bab
      Committed by ziyoujiyi
      * delete gloo connect retry
      
      * the_one_ps dirs reconstruct
      
      * .
      
      * .
      
      * create the_one_ps dirs
      
      * create the_one_ps dirs
      
      * create the_one_ps dirs
      
      * create the_one_ps dirs
      
      * create the_one_ps dirs
      
      * create the_one_ps dirs
      
      * the one ps dirs modify
      
      * the one ps dirs modify
      
      * the one ps dirs modify
      
      * the one ps dirs modify
      
      * refactor ps optimize
      
      * refactor ps optimize
      
      * refactor ps optimize
      
      * .
      
      * .
      
      * .
      
      * .
      
      * .
      
      * .
      
      * refactor theoneps
      
      * the_one_ps
      
      * add ps pass unittest
      
      * add ps pass unittest
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * add cpu_async_ps_mode test
      
      * add cpu_async_ps_mode test
      
      * add cpu_async_ps_mode test
      
      * ps unittest ready
      
      * ps unittest ready
      
      * solve dist_pass init conflict
      
      * solve import CommContext error
      
      * unittest ok
      
      * implement AllocateFrom
      
      * solve setup.py.in conflict
      
      * solve conflict
      
      * solve conflict
      
      * solve conflict
      
      * .
      
      * .
      
      * cpu-async-ps minimize test ok & gpu minimize test ok
      
      * add heter 2stage unittest
      
      * add heter 2stage unittest
      
      * add heter 2stage unittest
      
      * sync/geo test ok & fix heter_worker program ok
      
      * .
      Co-authored-by: Nzkh2016 <zhangkaihuo@baidu.com>
  4. 11 Feb 2022, 1 commit
    • Unify ps development - python (#39431) · 22c67d14
      Committed by ziyoujiyi
      * delete gloo connect retry
      
      * the_one_ps dirs reconstruct
      
      * .
      
      * .
      
      * create the_one_ps dirs
      
      * create the_one_ps dirs
      
      * create the_one_ps dirs
      
      * create the_one_ps dirs
      
      * create the_one_ps dirs
      
      * create the_one_ps dirs
      
      * the one ps dirs modify
      
      * the one ps dirs modify
      
      * the one ps dirs modify
      
      * the one ps dirs modify
      
      * refactor ps optimize
      
      * refactor ps optimize
      
      * refactor ps optimize
      
      * .
      
      * .
      
      * .
      
      * .
      
      * .
      
      * .
      
      * refactor theoneps
      
      * the_one_ps
      
      * add ps pass unittest
      
      * add ps pass unittest
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * add cpu_async_ps_mode test
      
      * add cpu_async_ps_mode test
      
      * add cpu_async_ps_mode test
      
      * ps unittest ready
      
      * ps unittest ready
      
      * solve dist_pass init conflict
      
      * solve import CommContext error
      
      * unittest ok
      
      * implement AllocateFrom
      
      * solve setup.py.in conflict
      
      * solve conflict
      
      * solve conflict
      
      * solve conflict
      
      * .
      
      * .
      
      * cpu-async-ps minimize test ok & gpu minimize test ok
      Co-authored-by: Nzkh2016 <zhangkaihuo@baidu.com>
  5. 09 Feb 2022, 1 commit
  6. 08 Feb 2022, 2 commits
    • ps optimize refactor (#38982) · 196dbfc2
      Committed by ziyoujiyi
      * delete gloo connect retry
      
      * the_one_ps dirs reconstruct
      
      * .
      
      * .
      
      * create the_one_ps dirs
      
      * create the_one_ps dirs
      
      * create the_one_ps dirs
      
      * create the_one_ps dirs
      
      * create the_one_ps dirs
      
      * create the_one_ps dirs
      
      * the one ps dirs modify
      
      * the one ps dirs modify
      
      * the one ps dirs modify
      
      * the one ps dirs modify
      
      * refactor ps optimize
      
      * refactor ps optimize
      
      * refactor ps optimize
      
      * .
      
      * .
      
      * .
      
      * .
      
      * .
      
      * .
      
      * refactor theoneps
      
      * the_one_ps
      
      * add ps pass unittest
      
      * add ps pass unittest
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * ps unittest frame
      
      * add cpu_async_ps_mode test
      
      * add cpu_async_ps_mode test
      
      * add cpu_async_ps_mode test
      
      * ps unittest ready
      
      * ps unittest ready
      
      * solve dist_pass init conflict
      
      * solve import CommContext error
      
      * unittest ok
      
      * implement AllocateFrom
      
      * solve setup.py.in conflict
      
      * solve conflict
      
      * solve conflict
      
      * solve conflict
      
      * .
      
      * .
      Co-authored-by: Nzkh2016 <zhangkaihuo@baidu.com>
    • optimize sharding stage3 (#39334) · 23d559dd
      Committed by Baibaifan
  7. 30 Jan 2022, 1 commit
  8. 28 Jan 2022, 2 commits
    • [PSLIB] Add Metrics Module, Support User-defined Add Metric (#38789) · 2e6be886
      Committed by Fan Zhang
      * [PSLIB] Add Metrics Module, Support User-defined Add Metric
      
      * [PSLIB] Modify According to CI
      
      * [PSLIB] Modify According to CI
      
      * [PSLIB] Modify According to CI
      
      * [PSLIB] Modify According to CI Coverage
      
      * [PSLIB] Modify According to CI
      
      * [PSLIB] Modify According to CI
      
      * [PSLIB] Modify According to CI
      
      * [PSLIB] Modify According to CI
      
      * [PSLIB] Modify According to CI
      
      * [PSLIB] Modify According to CI Coverage
      
      * [PSLIB] Modify According to CI Coverage
      
      * [PSLIB] Modify According to CI Coverage
      
      * modify role_maker
      
      * update CMakeLists.txt
    • fix_stage2_minimize (#39285) · 90f44c6f
      Committed by Baibaifan
  9. 25 Jan 2022, 3 commits
  10. 24 Jan 2022, 2 commits
  11. 18 Jan 2022, 1 commit
  12. 17 Jan 2022, 2 commits
  13. 14 Jan 2022, 1 commit
  14. 06 Jan 2022, 1 commit
  15. 30 Dec 2021, 1 commit
  16. 21 Dec 2021, 3 commits
  17. 20 Dec 2021, 1 commit
  18. 19 Dec 2021, 1 commit
  19. 17 Dec 2021, 3 commits
  20. 09 Dec 2021, 2 commits
  21. 07 Dec 2021, 2 commits
    • Buf fix for reset grad inplace version (#37811) · cf586021
      Committed by Zhanlue Yang
      * Debug
      
      * Fixed issue with reset_grad_inplace_version when used with clear_gradient & cross-batch accumulation
      
      * Rearranged interfaces
      
      * Fixed ci issues
    • [Auto para] Relaunch with auto mapping function (#37326) · 506e79d1
      Committed by Yulong Ao
      * [Auto Parallel] Add the unified cluster representation
      
      * [Auto Parallel] Add the graph class for physical mapping
      
      * [Auto Parallel] Add the simple physical mapper
      
      * Set the timeout of the mapper
      
      * Merge the upstream develop unittests cmake files
      
      * Fix a bug of the process group
      
      * Remove mapper unittest from platforms which is not GPU
      
      * Move the instantiation of process group after resharding
      
      * Add the local id for devices
      
      * Update the rank mapping format
      
      * [Auto Parallel] Relaunch with the rank mapping file
      
      * Remove the unnecessary json file
      
      * Avoid entering get_device_proc_info for auto mapping
      
      * Correct the mapper unit test
      
      * Add some comments
      
      * Remove the related files about mapping
      
      * Update the unittest for auto mapping
      
      * Remove unused rank_mapping unittest
      
      * Improve the unittest coverage
      
      * Improve the unittest coverage
      
      * Improve the unittest of relaunch
      
      * Fix the unittest problem in CI
      
      * Improve the unittest of relaunch
      
      * Remove unnecessary statements
      
      * Update the unittest cmakefile
      
      * Correct the cmakefile of auto parallel unittests
      
      * Modify codes based on the new elastic change
      
      * Use the GPUs exclusively in the unittest
      
      * Correct the cmakefile
      
      * Set the timeout of the unittest
  22. 06 Dec 2021, 2 commits
  23. 02 Dec 2021, 2 commits
  24. 01 Dec 2021, 1 commit
  25. 30 Nov 2021, 1 commit
    • [Auto Parallel] elastic support auto parallel re-launch (#37523) · 5440d2f9
      Committed by xiayanming
      * [Auto Parallel] elastic support auto parallel re-launch
      
      * [Auto Parallel] elastic support auto parallel re-launch
      
      * fix ci issue
      
      * fix ci issue
      
      * fix rank mapping unittest
      
      * fix rank mapping unittest
      
      * fix ci issue
      
      * fix ci issue
      
      * fix ci issue
      
      * fix ci issue
      
      * fix ci issue
      
      * fix ci issue
      
      * fix ci issue
      
      * fix ci issue
      
      * fix ci issue
      
      * fix ci issue
      
      * fix ci issue
      
      * fix ci issue
      
      * fix ci issue