1. 07 Dec, 2021 1 commit
    • Y
      [Auto para] Relaunch with auto mapping function (#37326) · 506e79d1
      Committed by Yulong Ao
      * [Auto Parallel]  Add the unified cluster representation
      
      * [Auto Parallel] Add the graph class for physical mapping
      
      * [Auto Parallel] Add the simple physical mapper
      
      * Set the timeout of the mapper
      
      * Merge the upstream develop unittests cmake files
      
      * Fix a bug of the process group
      
      * Remove mapper unittest from platforms which is not GPU
      
      * Move the instantiation of process group after resharding
      
      * Add the local id for devices
      
      * Update the rank mapping format
      
      * [Auto Parallel] Relaunch with the rank mapping file
      
      * Remove the unnecessary json file
      
      * Avoid entering get_device_proc_info for auto mapping
      
      * Correct the mapper unit test
      
      * Add some comments
      
      * Remove the related files about mapping
      
      * Update the unittest for auto mapping
      
      * Remove unused rank_mapping unittest
      
      * Improve the unittest coverage
      
      * Improve the unittest coverage
      
      * Improve the unittest of relaunch
      
      * Fix the unittest problem in CI
      
      * Improve the unittest of relaunch
      
      * Remove unnecessary statements
      
      * Update the unittest cmakefile
      
      * Correct the cmakefile of auto parallel unittests
      
      * Modify codes based on the new elastic change
      
      * Use the GPUs exclusively in the unittest
      
      * Correct the cmakefile
      
      * Set the timeout of the unittest
  2. 06 Dec, 2021 2 commits
  3. 02 Dec, 2021 2 commits
  4. 01 Dec, 2021 1 commit
  5. 30 Nov, 2021 3 commits
    • X
      [Auto Parallel] elastic support auto parallel re-launch (#37523) · 5440d2f9
      Committed by xiayanming
      * [Auto Parallel] elastic support auto parallel re-launch
      
      * [Auto Parallel] elastic support auto parallel re-launch
      
      * fix ci issue
      
      * fix ci issue
      
      * fix rank mapping unittest
      
      * fix rank mapping unittest
      
      * fix ci issue
      
      * fix ci issue
      
      * fix ci issue
      
      * fix ci issue
      
      * fix ci issue
      
      * fix ci issue
      
      * fix ci issue
      
      * fix ci issue
      
      * fix ci issue
      
      * fix ci issue
      
      * fix ci issue
      
      * fix ci issue
      
      * fix ci issue
    • Z
      1514eec6
    • Y
      [Auto Parallel] Do the physical mapping between the process graph and the cluster graph (#37094) · b0dff05d
      Committed by Yulong Ao
      * [Auto Parallel]  Add the unified cluster representation
      
      * [Auto Parallel] Add the graph class for physical mapping
      
      * [Auto Parallel] Add the simple physical mapper
      
      * Set the timeout of the mapper
      
      * Merge the upstream develop unittests cmake files
      
      * Fix a bug of the process group
      
      * Remove mapper unittest from platforms which is not GPU
      
      * Move the instantiation of process group after resharding
      
      * Add the local id for devices
      
      * Update the rank mapping format
      
      * Add some comments
      
      * Remove the related files about mapping
      
      * Update the unittest for auto mapping
      
      * Remove unused rank_mapping unittest
      
      * Improve the unittest coverage
      
      * Improve the unittest coverage
  6. 29 Nov, 2021 2 commits
  7. 27 Nov, 2021 1 commit
    • Y
      [Auto Parallel] Add the graph class for the process and cluster (#37482) · 48faf638
      Committed by Yulong Ao
      * [Auto Parallel]  Add the unified cluster representation
      
      * [Auto Parallel] Add the graph class for physical mapping
      
      * [Auto Parallel] Add the simple physical mapper
      
      * Set the timeout of the mapper
      
      * Merge the upstream develop unittests cmake files
      
      * Fix a bug of the process group
      
      * Remove mapper unittest from platforms which is not GPU
      
      * Move the instantiation of process group after resharding
      
      * Add the local id for devices
      
      * Update the rank mapping format
      
      * Add some comments
      
      * Remove the related files about mapping
      
      * Remove unused rank_mapping unittest
      
      * Improve the unittest coverage
  8. 26 Nov, 2021 2 commits
    • Z
      upgrade async distributed training in pscore (#37515) · 74605fc2
      Committed by zhaocaibei123
      * test
      
      * test
      
      * rm test
      
      * update
      
      * update
      
      * update
      
      * add unittest
      
      * update
      
      * update save
    • W
      TDM2 (#37044) · 4826167c
      Committed by wangzhen38
      * add tdm sample
      
      * add tdm sample in c++
      
      * update tdm sample
      
      * modify sample count
      
      * fix conflict
      
      * add set_date
      
      * fix cmake error
      
      * fix bug of proto
      
      * update index_dataset proto
      
      * update cmake
      
      * fix error cmake
      
      * fix cmake mkldnn
      
      * fix cmake proto
      
      * update cmake proto
      
      * update cmake
      
      * update rec
      
      * update dataset
      
      * update dataset
      
      * update dataset
      
      * update dataset
      
      * update dataset
      
      * update coverage
      
      * update ci
      
      * goback4
      
      * fix npu ci
      
      * add xxhash dep
  9. 25 Nov, 2021 2 commits
  10. 24 Nov, 2021 2 commits
  11. 23 Nov, 2021 1 commit
  12. 22 Nov, 2021 3 commits
  13. 19 Nov, 2021 1 commit
  14. 18 Nov, 2021 3 commits
    • Z
      [heterps]change default executor for heter trainer (#37314) · c98d175d
      Committed by zmx
      * fix pslib. test=develop
      
      * add device to train_from_dataset. test=develop
      
      * refine fleet.stop_worker. test=develop
      
      * fix ut. test=develop
      
      * fix ut. test=develop
      
      * fix executor & ut. test=develop
      
      * fix executor & ut. test=develop
      
      * fix executor & ut. test=develop
    • X
      Optimize fleet elastic scale in/out (#37177) · 6d34d266
      Committed by xiayanming
      * fleet support elastic train
      
      * fleet support elastic train
      
      * support elastic
      
      * add unittest
      
      * fix unittest bug
      
      * fix unittest bug
      
      * fix unittest bug
      
      * fix unittest coverage
      
      * fix unittest coverage
      
      * fix unittest coverage
      
      * fix unittest coverage
      
      * fix unittest coverage
      
      * fix elastic bug
      
      * fix ci fail
      
      * fix ci fail
      
      * fix elastic bug
      
      * fix elastic bug
      
      * fix joint debugging bug
      
      * fix joint debugging bug
      
      * fix windows ci failed
      
      * fix windows ci failed
      
      * Optimize fleet elastic scale in/out
      
      * elastic support pre hook
      
      * add prehook unittest
    • Z
      [heterps]add heterps mode judgement (#37298) · dd7189ff
      Committed by zmx
  15. 17 Nov, 2021 3 commits
    • Z
      update dataset (#37194) · ca8c4f3e
      Committed by zhaocaibei123
    • Z
      [heterps]Refactor heterogenous worker (#37244) · 54d2626a
      Committed by zmx
      * fix. test=develop
      
      * fix. test=develop
      
      * fix. test=develop
      
      * fix. test=develop
      
      * fix. test=develop
      
      * fix ut. test=develop
      
      * fix ut. test=develop
      
      * fix ut. test=develop
      
      * refactor heter trainer. test=develop
      
      * fix. test=develop
      
      * fix ut. test=develop
      
      * fix ut. test=develop
      
      * fix ut. test=develop
      
      * fix ut. test=develop
      
      * fix ut. test=develop
      
      * fix ut. test=develop
      
      * fix ut. test=develop
      
      * fix ut. test=develop
      
      * fix. test=develop
      
      * fix. test=develop
      
      * fix. test=develop
      
      * fix. test=develop
      
      * fix ut. test=develop
      
      * fix ut. test=develop
      
      * fix ut. test=develop
    • W
      [npu][hybrid] support offload (#37224) · 762819a8
      Committed by WangXi
  16. 16 Nov, 2021 1 commit
  17. 15 Nov, 2021 2 commits
  18. 12 Nov, 2021 1 commit
    • Z
      [AutoParallel] Add AutoConvert (#36958) · 1773afd7
      Committed by zhaoyingli
      * add AutoConvert
      
      * add unittest
      
      * amend merge&slice
      
      * amend default dist_attr
      
      * update doc&improve coverage
      
      * add interface dist_context
      
      * tiny modify
  19. 11 Nov, 2021 2 commits
    • X
      fleet support elastic scale up/down (#36684) · 6af531b7
      Committed by xiayanming
      * fleet support elastic train
      
      * fleet support elastic train
      
      * support elastic
      
      * add unittest
      
      * fix unittest bug
      
      * fix unittest bug
      
      * fix unittest bug
      
      * fix unittest coverage
      
      * fix unittest coverage
      
      * fix unittest coverage
      
      * fix unittest coverage
      
      * fix unittest coverage
      
      * fix elastic bug
      
      * fix ci fail
      
      * fix ci fail
      
      * fix elastic bug
      
      * fix elastic bug
      
      * fix joint debugging bug
      
      * fix joint debugging bug
      
      * fix windows ci failed
      
      * fix windows ci failed
    • Z
      [Heterps]Refactor Heter Pipeline Parameter Server (#36845) · a2da1efa
      Committed by zmx
      * change username
      
      * fix
      
      * fix
      
      * fix
      
      * fix
      
      * fix
      
      * update
      
      * update
      
      * update unittests
      
      * fix
      
      * update
      
      * fix
      
      * update
      
      * fix
      
      * fix
      
      * fix
      
      * update
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update send_and_recv op. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * fix. test=develop
      
      * fix. test=develop
      
      * fix. test=develop
      
      * fix. test=develop
      
      * fix ut. test=develop
      
      * fix unit. notest,test=coverage
      
      * fix ut. notest, test=coverage
      
      * update. notest,test=coverage
      
      * fix ut. notest, test=coverage
      
      * fix ut. notest, test=coverage
      
      * fix. notest, test=coverage
      
      * fix. notest, test=coverage
      
      * fix ut. notest, test=coverage
      
      * fix ut. notest, test=coverage
      
      * fix ut. notest, test=coverage
      
      * fix ut. notest, test=coverage
      
      * add func. notest, test=coverage
      
      * fix ut. notest, test=coverage
      
      * fix. test=develop
      
      * fix. test=develop
  20. 08 Nov, 2021 1 commit
  21. 02 Nov, 2021 1 commit
    • Z
      [AutoParallel] Save&Load Module (#36558) · b9defb4f
      Committed by zhaoyingli
      * AutoParallel Save&Load
      
      * tiny modi
      
      * update func name
      
      * tiny fix
      
      * add NotImplementedError
      
      * fix doc
      
      * update func name
      
      * update func param
      
      * update interface
      
      * add unittest & modi make_data_unshard
      
      * update unittest
      
      * update unittest
      
      * fix unittest
      
      * fix cmakelist
      
      * update unittest
  22. 29 Oct, 2021 1 commit
    • Y
      [Auto Parallel] Improve the interface and the underlying mechanisms (#36617) · a02532b5
      Committed by Yulong Ao
      * default dist op
      
      * add dist_attr for dist op
      
      * add unittest
      
      * update inputname
      
      * update function name
      
      * add unittest
      
      * update CMakeLists.txt for CI
      
      * fix dis_matmul
      
      * fix compile error
      
      * update matmul to matmul_v2
      
      * unify api
      
      * unify api
      
      * todo
      
      * update distop forward func
      
      * update distop forward func
      
      * auto parallel backward
      
      * update dist op
      
      * autoparallel backward
      
      * add backward for embedding
      
      * temp1
      
      * temp2
      
      * temp3
      
      * temp4
      
      * backward done1
      
      * backward done2
      
      * backward done3
      
      * dist embedding remove mp mode
      
      * dist matmul remove mp mode
      
      * update dist embedding
      
      * dist op init1
      
      * dist op init 2
      
      * update unittest
      
      * context remove parallel mode
      
      * partitioner remove parallel mode
      
      * update unittest
      
      * a more general method to support varying mesh in pipeline parallel
      
      * support varying mesh in pipeline parallel
      
      * embedding support varying mesh in pipeline parallel
      
      * matmul support varying mesh in pipeline parallel
      
      * default dist op support varying mesh in pipeline parallel
      
      * dist attribute for startup program
      
      * default dist op support varying mesh in pipeline parallel 2
      
      * partitioner support varying mesh in pipeline parallel
      
      * revise logic for auto completion
      
      * revise framework.py
      
      * revise reshard unittest
      
      * revise unittest for parallelize
      
      * chmod
      
      * fixed bug for dist embedding name mapping
      
      * Improve the interface and the underlying mechanisms of auto parallel
      
      * revise completion for backward
      
      * revise completion for update
      
      * revise completion for update
      
      * update unittest
      
      * chmod
      
      * bugfix for grad_op output var's mesh
      
      * Modify codes for pr 36744
      
      * Remove unnecessary comments in framework.py
      
      * Remove unnecessary comments in completion.py
      Co-authored-by: JZ-LIANG <jianzhongliang10@gmail.com>
      Co-authored-by: zhaoyingli <zhaoyingli@baidu.com>
      Co-authored-by: JZ-LIANG <38102074+JZ-LIANG@users.noreply.github.com>
  23. 28 Oct, 2021 2 commits