1. 26 Nov, 2021 2 commits
    • upgrade async distributed training in pscore (#37515) · 74605fc2
      Committed by zhaocaibei123
      * test
      
      * test
      
      * rm test
      
      * update
      
      * update
      
      * update
      
      * add unittest
      
      * update
      
      * update save
    • TDM2 (#37044) · 4826167c
      Committed by wangzhen38
      * add tdm sample
      
      * add tdm sample in c++
      
      * update tdm sample
      
      * modify sample count
      
      * fix conflict
      
      * add set_date
      
      * fix cmake error
      
      * fix bug of proto
      
      * update index_dataset proto
      
      * update cmake
      
      * fix error cmake
      
      * fix cmake mkldnn
      
      * fix cmake proto
      
      * update cmake proto
      
      * update cmake
      
      * update rec
      
      * update dataset
      
      * update dataset
      
      * update dataset
      
      * updata dataset
      
      * updata dataset
      
      * updata coverage
      
      * updata ci
      
      * goback4
      
      * fix npu ci
      
      * add xxhash dep
  2. 25 Nov, 2021 2 commits
  3. 24 Nov, 2021 2 commits
  4. 23 Nov, 2021 1 commit
  5. 22 Nov, 2021 3 commits
  6. 19 Nov, 2021 1 commit
  7. 18 Nov, 2021 3 commits
    • [heterps]change default executor for heter trainer (#37314) · c98d175d
      Committed by zmx
      * fix pslib. test=develop
      
      * add device to train_from_dataset. test=develop
      
      * refine fleet.stop_worker. test=develop
      
      * fix ut. test=develop
      
      * fix ut. test=develop
      
      * fix executor & ut. test=develop
      
      * fix executor & ut. test=develop
      
      * fix executor & ut. test=develop
    • Optimize fleet elastic scale in/out (#37177) · 6d34d266
      Committed by xiayanming
      * fleet support elastic train
      
      * fleet support elastic train
      
      * support elastic
      
      * add unittest
      
      * fix unitest bug
      
      * fix unittest bug
      
      * fix unittest bug
      
      * fix unittest coverage
      
      * fix unittest coverage
      
      * fix unittest coverage
      
      * fix unittest coverage
      
      * fix unittest coverage
      
      * fix elastic bug
      
      * fix ci fail
      
      * fix ci fail
      
      * fix elastic bug
      
      * fix elastic bug
      
      * fix joint debugging bug
      
      * fix joint debugging bug
      
      * fix windows ci failed
      
      * fix windows ci failed
      
      * Optimize fleet elastic scale in/out
      
      * elastic support pre hook
      
      * add prehook unittest
    • [heterps]add heterps mode judgement (#37298) · dd7189ff
      Committed by zmx
  8. 17 Nov, 2021 3 commits
    • update dataset (#37194) · ca8c4f3e
      Committed by zhaocaibei123
    • [heterps]Refactor heterogenous worker (#37244) · 54d2626a
      Committed by zmx
      * fix. test=develop
      
      * fix. test=develop
      
      * fix. test=develop
      
      * fix. test=develop
      
      * fix. test=develop
      
      * fix ut. test=develop
      
      * fix ut. test=develop
      
      * fix ut. test=develop
      
      * refactor heter trainer. test=develop
      
      * fix. test=develop
      
      * fix ut. test=develop
      
      * fix ut. test=develop
      
      * fix ut. test=develop
      
      * fix ut. test=develop
      
      * fix ut. test=develop
      
      * fix ut. test=develop
      
      * fix ut. test=develop
      
      * fix ut. test=develop
      
      * fix. test=develop
      
      * fix. test=develop
      
      * fix. test=develop
      
      * fix. test=develop
      
      * fix ut. test=develop
      
      * fix ut. test=develop
      
      * fix ut. test=develop
    • [npu][hybrid] support offload (#37224) · 762819a8
      Committed by WangXi
  9. 16 Nov, 2021 1 commit
  10. 15 Nov, 2021 2 commits
  11. 12 Nov, 2021 1 commit
    • [AutoParallel] Add AutoConvert (#36958) · 1773afd7
      Committed by zhaoyingli
      * add AutoConvert
      
      * add unitest
      
      * amend merge&slice
      
      * amend default dist_attr
      
      * update doc&improve coverage
      
      * add interface dist_context
      
      * tiny modify
  12. 11 Nov, 2021 2 commits
    • fleet support elastic scale up/down (#36684) · 6af531b7
      Committed by xiayanming
      * fleet support elastic train
      
      * fleet support elastic train
      
      * support elastic
      
      * add unittest
      
      * fix unitest bug
      
      * fix unittest bug
      
      * fix unittest bug
      
      * fix unittest coverage
      
      * fix unittest coverage
      
      * fix unittest coverage
      
      * fix unittest coverage
      
      * fix unittest coverage
      
      * fix elastic bug
      
      * fix ci fail
      
      * fix ci fail
      
      * fix elastic bug
      
      * fix elastic bug
      
      * fix joint debugging bug
      
      * fix joint debugging bug
      
      * fix windows ci failed
      
      * fix windows ci failed
    • [Heterps]Refactor Heter Pipeline Parameter Server (#36845) · a2da1efa
      Committed by zmx
      * change username
      
      * fix
      
      * fix
      
      * fix
      
      * fix
      
      * fix
      
      * update
      
      * update
      
      * update unittests
      
      * fix
      
      * update
      
      * fix
      
      * update
      
      * fix
      
      * fix
      
      * fix
      
      * update
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update send_and_recv op. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * update. test=develop
      
      * fix. test=develop
      
      * fix. test=develop
      
      * fix. test=develop
      
      * fix. test=develop
      
      * fix ut. test=develop
      
      * fix unit. notest,test=coverage
      
      * fix ut. notest, test=coverage
      
      * update. notest,test=coverage
      
      * fix ut. notest, test=coverage
      
      * fix ut. notest, test=coverage
      
      * fix. notest, test=coverage
      
      * fix. notest, test=coverage
      
      * fix ut. notest, test=coverage
      
      * fix ut. notest, test=coverage
      
      * fix ut. notest, test=coverage
      
      * fix ut. notest, test=coverage
      
      * add func. notest, test=coverage
      
      * fix ut. notest, test=coverage
      
      * fix. test=develop
      
      * fix. test=develop
  13. 08 Nov, 2021 1 commit
  14. 02 Nov, 2021 1 commit
    • [AutoParallel] Save&Load Module (#36558) · b9defb4f
      Committed by zhaoyingli
      * AutoParallel Save&Load
      
      * tiny modi
      
      * update func name
      
      * tiny fix
      
      * add NotImplementedError
      
      * fix doc
      
      * update func name
      
      * update func param
      
      * update interface
      
      * add unitest & modi make_data_unshard
      
      * update unittest
      
      * update unittest
      
      * fix unittest
      
      * fix cmakelist
      
      * update unittest
  15. 29 Oct, 2021 1 commit
    • [Auto Parallel] Improve the interface and the underlying mechanisms (#36617) · a02532b5
      Committed by Yulong Ao
      * default dist op
      
      * add dist_attr for dist op
      
      * add unitest
      
      * update inputname
      
      * update function name
      
      * add unitest
      
      * update CMakeLists.txt for CI
      
      * fix dis_matmul
      
      * fix compile error
      
      * update matmul to matmul_v2
      
      * unify api
      
      * unify api
      
      * todo
      
      * update distop forward func
      
      * update distop forward func
      
      * auto parallel backward
      
      * update dist op
      
      * autoparallel backward
      
      * add backward for embedding
      
      * temp1
      
      * temp2
      
      * temp3
      
      * temp4
      
      * backward done1
      
      * backward done2
      
      * backward done3
      
      * dist embedding remove mp mode
      
      * dist matmul remove mp mode
      
      * update dist embedding
      
      * dist op init1
      
      * dist op init 2
      
      * update unitest
      
      * context remove parallel mode
      
      * partitioner remove parallel mode
      
      * update unitest
      
      * a more general method to support varying mesh in pipeline parallel
      
      * support varying mesh in pipeline parallel
      
      * embedding support varying mesh in pipeline parallel
      
      * matmul support varying mesh in pipeline parallel
      
      * default dist op support varying mesh in pipeline parallel
      
      * dist attribute for startup program
      
      * default dist op support varying mesh in pipeline parallel 2
      
      * partitoner support varying mesh in pipeline parallel
      
      * revise logic for auto compeletion
      
      * revise framework.py
      
      * revise reshard unitest
      
      * revise unitest for parallelize
      
      * chmod
      
      * fixed bug for dist embedding name mapping
      
      * Improve the interface and the underlying mechanisms of auto parallel
      
      * revise completion for backward
      
      * revise completion for update
      
      * revise completion for update
      
      * update unitest
      
      * chmod
      
      * bugfix for grad_op output var's mesh
      
      * Modify codes for pr 36744
      
      * Remove unnecessary comments in framework.py
      
      * Remove unnecessary comments in completion.py
      Co-authored-by: JZ-LIANG <jianzhongliang10@gmail.com>
      Co-authored-by: zhaoyingli <zhaoyingli@baidu.com>
      Co-authored-by: JZ-LIANG <38102074+JZ-LIANG@users.noreply.github.com>
  16. 28 Oct, 2021 3 commits
  17. 27 Oct, 2021 2 commits
  18. 25 Oct, 2021 1 commit
  19. 21 Oct, 2021 2 commits
  20. 20 Oct, 2021 3 commits
    • fix bugs of ClipGradByGlobalNorm in HybridParallel (#36555) · 6a3941e3
      Committed by Haohongxiang
      * fix bugs of ClipGradByGlobalNorm
      
      * add unittests
      
      * add unittests
    • Fix global gather and global scatter operators (#36517) · 17b4dd70
      Committed by 李季
      * fix global gather and global scatter operators
    • [Auto Parallel] Generalization for Partition and Completion (#35735) · 797bd40d
      Committed by JZ-LIANG
      * default dist op
      
      * add dist_attr for dist op
      
      * add unitest
      
      * update inputname
      
      * update function name
      
      * add unitest
      
      * update CMakeLists.txt for CI
      
      * fix dis_matmul
      
      * fix compile error
      
      * update matmul to matmul_v2
      
      * unify api
      
      * unify api
      
      * todo
      
      * update distop forward func
      
      * update distop forward func
      
      * auto parallel backward
      
      * update dist op
      
      * autoparallel backward
      
      * add backward for embedding
      
      * temp1
      
      * temp2
      
      * temp3
      
      * temp4
      
      * backward done1
      
      * backward done2
      
      * backward done3
      
      * dist embedding remove mp mode
      
      * dist matmul remove mp mode
      
      * update dist embedding
      
      * dist op init1
      
      * dist op init 2
      
      * update unitest
      
      * context remove parallel mode
      
      * partitioner remove parallel mode
      
      * update unitest
      
      * a more general method to support varying mesh in pipeline parallel
      
      * support varying mesh in pipeline parallel
      
      * embedding support varying mesh in pipeline parallel
      
      * matmul support varying mesh in pipeline parallel
      
      * default dist op support varying mesh in pipeline parallel
      
      * dist attribute for startup program
      
      * default dist op support varying mesh in pipeline parallel 2
      
      * partitoner support varying mesh in pipeline parallel
      
      * revise logic for auto compeletion
      
      * revise framework.py
      
      * revise reshard unitest
      
      * revise unitest for parallelize
      
      * chmod
      
      * fixed bug for dist embedding name mapping
      Co-authored-by: zhaoyingli <zhaoyingli@baidu.com>
  21. 19 Oct, 2021 3 commits