- 07 Dec, 2021 1 commit
-
-
Committed by Yulong Ao
* [Auto Parallel] Add the unified cluster representation * [Auto Parallel] Add the graph class for physical mapping * [Auto Parallel] Add the simple physical mapper * Set the timeout of the mapper * Merge the upstream develop unittests cmake files * Fix a bug of the process group * Remove mapper unittest from platforms which is not GPU * Move the instantiation of process group after resharding * Add the local id for devices * Update the rank mapping format * [Auto Parallel] Relaunch with the rank mapping file * Remove the unnecessary json file * Avoid entering get_device_proc_info for auto mapping * Correct the mapper unit test * Add some comments * Remove the related files about mapping * Update the unittest for auto mapping * Remove unused rank_mapping unittest * Improve the unittest coverage * Improve the unittest coverage * Improve the unittest of relaunch * Fix the unittest problem in CI * Improve the unittest of relaunch * Remove unnecessary statements * Update the unittest cmakefile * Correct the cmakefile of auto parallel unittests * Modify codes based on the new elastic change * Use the GPUs exclusively in the unittest * Correct the cmakefile * Set the timeout of the unittest
-
- 06 Dec, 2021 2 commits
-
-
Committed by Baibaifan
-
Committed by kuizhiqing
-
- 02 Dec, 2021 2 commits
-
-
Committed by xiayanming
-
Committed by Baibaifan
-
- 01 Dec, 2021 1 commit
-
-
Committed by zmxdream
* fix launch_utils.py. test=develop * fix launch_utils.py. test=develop
-
- 30 Nov, 2021 2 commits
-
-
Committed by xiayanming
* [Auto Parallel] elastic support auto parallel re-launch * [Auto Parallel] elastic support auto parallel re-launch * fix ci issue * fix ci issue * fix rank mapping unittest * fix rank mapping unittest * fix ci issue * fix ci issue * fix ci issue * fix ci issue * fix ci issue * fix ci issue * fix ci issue * fix ci issue * fix ci issue * fix ci issue * fix ci issue * fix ci issue * fix ci issue
-
Committed by zhaocaibei123
-
- 29 Nov, 2021 2 commits
-
-
Committed by Baibaifan
-
Committed by 李季
Co-authored-by: Chen Long <1300851984@qq.com>
-
- 26 Nov, 2021 2 commits
-
-
Committed by zhaocaibei123
* test * test * rm test * update * update * update * add unittest * update * update save
-
Committed by wangzhen38
* add tdm sample * add tdm sample in c++ * update tdm sample * modify sample count * fix conflict * add set_date * fix cmake error * fix bug of proto * update index_dataset proto * update cmake * fix error cmake * fix cmake mkldnn * fix cmake proto * update cmake proto * update cmake * update rec * update dataset * update dataset * update dataset * updata dataset * updata dataset * updata coverage * updata ci * goback4 * fix npu ci * add xxhash dep
-
- 25 Nov, 2021 2 commits
- 24 Nov, 2021 1 commit
-
-
Committed by zhaoyingli
* adapt auto search * adapt auto search * fix matmulv2 compatible * del debug
-
- 22 Nov, 2021 2 commits
- 19 Nov, 2021 1 commit
-
-
Committed by wangguanqun
-
- 18 Nov, 2021 3 commits
-
-
Committed by zmx
* fix pslib. test=develop * add device to train_from_dataset. test=develop * refine fleet.stop_worker. test=develop * fix ut. test=develop * fix ut. test=develop * fix executor & ut. test=develop * fix executor & ut. test=develop * fix executor & ut. test=develop
-
Committed by xiayanming
* fleet support elastic train * fleet support elastic train * support elastic * add unittest * fix unitest bug * fix unittest bug * fix unittest bug * fix unittest coverage * fix unittest coverage * fix unittest coverage * fix unittest coverage * fix unittest coverage * fix elastic bug * fix ci fail * fix ci fail * fix elastic bug * fix elastic bug * fix joint debugging bug * fix joint debugging bug * fix windows ci failed * fix windows ci failed * Optimize fleet elastic scale in/out * elastic support pre hook * add prehook unittest
-
Committed by zmx
-
- 17 Nov, 2021 3 commits
-
-
Committed by zhaocaibei123
-
Committed by zmx
* fix. test=develop * fix. test=develop * fix. test=develop * fix. test=develop * fix. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop * refactor heter trainer. test=develop * fix. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop * fix. test=develop * fix. test=develop * fix. test=develop * fix. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop
-
Committed by WangXi
-
- 15 Nov, 2021 2 commits
-
-
Committed by Zeng Jinle
* add split_program * make ut faster * increase ut timeout * make result deterministic * add fuse_all_reduce pass * add ut framework, update * fix ut framework * remove useless code * add coverage support * update * fix CI * fix some bugs and fix ci coverage * fix conflict
-
Committed by zmx
* fix. test=develop * fix. test=develop * fix. test=develop * fix. test=develop * fix. test=develop * fix ut. test=develop * fix ut. test=develop * fix ut. test=develop
-
- 11 Nov, 2021 2 commits
-
-
Committed by xiayanming
* fleet support elastic train * fleet support elastic train * support elastic * add unittest * fix unitest bug * fix unittest bug * fix unittest bug * fix unittest coverage * fix unittest coverage * fix unittest coverage * fix unittest coverage * fix unittest coverage * fix elastic bug * fix ci fail * fix ci fail * fix elastic bug * fix elastic bug * fix joint debugging bug * fix joint debugging bug * fix windows ci failed * fix windows ci failed
-
Committed by zmx
* change username * fix * fix * fix * fix * fix * update * update * update unittests * fix * update * fix * update * fix * fix * fix * update * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update send_and_recv op. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * update. test=develop * fix. test=develop * fix. test=develop * fix. test=develop * fix. test=develop * fix ut. test=develop * fix unit. notest,test=coverage * fix ut. notest, test=coverage * update. notest,test=coverage * fix ut. notest, test=coverage * fix ut. notest, test=coverage * fix. notest, test=coverage * fix. notest, test=coverage * fix ut. notest, test=coverage * fix ut. notest, test=coverage * fix ut. notest, test=coverage * fix ut. notest, test=coverage * add func. notest, test=coverage * fix ut. notest, test=coverage * fix. test=develop * fix. test=develop
-
- 08 Nov, 2021 1 commit
-
-
Committed by kuizhiqing
-
- 28 Oct, 2021 3 commits
-
-
Committed by wangguanqun
* add trainer desc config to distributed strategy * code style modified * data_feed set lod * fix bug * code style * fix bug * save load * save load * save unittest * add unittest of the_one_ps * unittest * add todo in communicator sendsparse
-
Committed by seemingwang
-
Committed by Bo Liu
-
- 27 Oct, 2021 1 commit
-
-
Committed by xiongkun
* bugfix: only check backend when mode == Collecive * fix bug
-
- 25 Oct, 2021 1 commit
-
-
Committed by Haohongxiang
* fix bug of check_inf * fix allreduce
-
- 21 Oct, 2021 2 commits
-
-
Committed by danleifeng
-
Committed by xiongkun
-
- 20 Oct, 2021 1 commit
-
-
Committed by Haohongxiang
* fix bugs of ClipGradByGlobalNorm * add unittests * add unittests
-
- 19 Oct, 2021 2 commits
-
-
Committed by danleifeng
-
Committed by WangXi
-
- 18 Oct, 2021 1 commit
-
-
Committed by Haohongxiang
* [HybridParallel]Support fp16 in dygraph hybrid parallel * update * update * update for recompute * add unittest of pp+fp16 * add unittest of recompute+fp16 * update * modify ut
-