- 23 Apr, 2021 1 commit
-
-
Committed by Baibaifan
solve hccl communicate conflict (#32447)
-
- 21 Apr, 2021 1 commit
-
-
Committed by zhang wenhui
* add allreduce and broadcast without test (#31024) add allreduce and broadcast without test * Refactor HCCLCommContext to be compatible with Paddle (#31359) Refactor HCCLCommContext to be compatible with Paddle (#31359) * [NPU] add npu kernel for communication op (#31437) * add allreduce and broadcast without test * add c_broadcast_test case * build c_comm_init and c_create_group operators * make the whole thing compile * add broadcast and init op test case but run failed * make unit test compile * fix broadcast test bug and change into hcom for ccl * change c_comm_init and c_create_group ops accordingly * make tests compile * transfer code to 27 * compiled successfully in 28, but run failed * test broadcast in 28, but failed * make hcom primitives work * change hccl data type for base.h * fix broadcast bug * make attributes work * fix group name bug * add allreduce but test failed * allreduce bug for qiuliang * allreduce finished * add allgather and reducescatter * merge all op code * add allgather test * finish run all ccl op test exclude send/recv * all all op and test exclude send/recv * send_v2_npu.cc recv_v2_npiu.cc compiled * fix ccl core dump bug and test allgather, reducescatter, broadcast op * fix allreduce bug just for test * hcom send&recv test pass, without hcom_destroy * for qiuliang test * Ascend Send&Recv Test Pass * all op (ex send/recv) ok * fix bug * merge all ccl op * style merge to PaddlePaddle * merge style * new merge style * merge style 2 * insert an empty at the end * disable ctest for hcom to pass ci Co-authored-by: void-main <voidmain1313113@gmail.com> Co-authored-by: f2hkop <f2huestc@outlook.com> * Add auto-increasing tag id for Hcom OPs (#31702) * add c_reduce_sum op (#31793) add c_reduce_sum op * update Ascendrc hccl to 20.3 (#32126) update Ascendrc hccl to 20.3 (#32126) * fix merge code * change cmake.txt1 * [NPU] Support npu kernel for c sync stream op (#31386) * sync stream npu op * add with_ascend_acl * update c++ unittest *
compile all failed * try to pre commit * after pre commit * merge&compile&test hccl successfully! * fix code style * fix code style * fix bugs about hccl * fix some bugs * fix code style * fix style * fix style * fix * fixed * merge develop Co-authored-by: lw921014 <liuwei921014@yeah.net> Co-authored-by: Void Main <voidmain1313113@gmail.com> Co-authored-by: f2hkop <f2huestc@outlook.com> Co-authored-by: xiayanming <41795079@qq.com>
-
- 07 Apr, 2021 1 commit
-
-
Committed by zhang wenhui
* Ascend rc (#30483) * Fix compilation on CANN20.1 and older (#30494) Fix compilation on CANN20.1 and older * Add distribution supported (#30578) Add distribution supported * Build parser for Hcom* operators (#30627) Build parser for Hcom* operators * Pass device_ids info from launch to trainer. (#30632) Pass device_ids info from launch to trainer * Add Hccl program group (#30642) Add Hccl program group * Add startup bash files of test_ascend_group. (#30645) Add startup bash files of test_ascend_group * cleanup (#30646) cleanup test_ascend_group.py * [Feature] Build parser to support distributed training (#30658) [Feature] Build parser to support distributed training * fix compilation on ascend-20.1 (#30722) fix compilation on ascend-20.1 * Dev/fix ascend string (#30749) Dev/fix ascend string * code style (#30781) code style * Merge ascend_optimizer and ascend_parser. (#30776) Merge ascend_optimizer and ascend_parser. * Ascendrc add converted op : [range/equal/range/uniform_random/expand/squeeze], fix cast op bug (#30797) Ascendrc add converted op : [range/equal/range/uniform_random/expand/squeeze], fix cast op bug * Add paddle ascend distribution training supported (#30796) Add paddle ascend distribution training supported * pass cxx_flags to gloo cmake (#30857) * Destroy session first. (#30954) Destroy session first.
* merge * fix, test=develop * fix, test=develop * fix style, test=develop * fix, test=develop * fix * fix log fatal, test=develop * fix enforce style, test=develop * fix, test=develop * fix, test=develop * fix rccl, test=develop * fix test, test=develop * fix, test=develop * fix, test=develop * fix, test=develop * fix node_num, test=develop * fix ids str, test=develop * fix ids str, test=develop * fix ids str, test=develop * fix, test=develop * fix, test=develop * fix, test=develop * fix, test=develop * fix, test=develop * fix, test=develop * fix, test=develop * fix, test=develop * fix style code, test=develop * fix style code, test=develop * fix style code, test=develop * fix style code, test=develop Co-authored-by: hutuxian <hutuxian2011@sina.cn> Co-authored-by: gongweibao <weibao.gong@gmail.com> Co-authored-by: Void Main <voidmain1313113@gmail.com> Co-authored-by: Leo Chen <chenqiuliang@baidu.com> Co-authored-by: dingsiyu <18369187719@163.com> Co-authored-by: OleNet <olenet@126.com>
-
- 24 Feb, 2021 1 commit
-
-
Committed by Thunderbrook
* push multi node * multi node * MultiThread * remove log * solve bug in 30829
-
- 30 Jul, 2020 1 commit
-
-
Committed by tangwei12
Integrated Trainer of Parameter Server (API add `fluid.contrib.layers.sparse_embedding` only) (#22957) * Integrated Trainer of Parameter Server
-
- 01 Jul, 2020 1 commit
-
-
Committed by Chengmo
* test=develop, fix_embedding
-
- 22 May, 2020 1 commit
-
-
Committed by ShenLiang
-
- 14 May, 2020 1 commit
-
-
Committed by swtkiwi
-
- 26 Apr, 2020 1 commit
-
-
Committed by Chen Weihang
* add to_readable_code method, test=develop * polish doc details, test=develop * polish doc note, test=develop * fix unittest error, test=develop * fix coverage, test=develop * add print test, test=develop * add print param, test=develop * hidden to_readable_code api, test=develop * remove original tool methods, test=develop * remove old api using code, test=develop
-
- 18 Mar, 2020 1 commit
-
-
Committed by tangwei12
-
- 28 Feb, 2020 1 commit
-
-
Committed by tianshuo78520a
-
- 25 Feb, 2020 1 commit
-
-
Committed by hutuxian
* Add two types of Metric Calculator: MultiTaskCalculator & CmatchRankCalculator. * Add a config for DynamicAdjustChannelNum function to denote whether we will discard the remaining instances when they cannot be distributed evenly. * Remove CPU code in Pull/PushSparse and we will add it back when testing it fully. * Fix some known issues: such as copying persistable vars after one epoch running.
-
- 23 Feb, 2020 1 commit
-
-
Committed by tianshuo78520a
-
- 22 Feb, 2020 1 commit
-
-
Committed by tangwei12
* add sync communicator and implement
-
- 15 Feb, 2020 1 commit
-
-
Committed by tangwei12
* add deprecated for distribute transpiler, will delete it after 2.0.0, test=develop
-
- 17 Jan, 2020 1 commit
-
-
Committed by tangwei12
* add half_async in the communicator * fix DistributedStrategy
-
- 13 Jan, 2020 1 commit
-
-
Committed by 123malin
* test=develop, bug fix for sparse recorder
-
- 07 Jan, 2020 2 commits
- 06 Jan, 2020 1 commit
-
-
Committed by 123malin
* add distributed_strategy
-
- 12 Dec, 2019 1 commit
-
-
Committed by tangwei12
* add fake init for the trainer, fix large memory hold in the trainer * do not merge recv vars from a remote endpoint, test=develop * add recv and save op, merge slice var in one op, save memory * remove hsigmoid with pull sparse, test=develop
-
- 06 Dec, 2019 1 commit
-
-
Committed by hutuxian
* Add a single_process_multi_thread transpiler. * Add some UTs. * Fix some API description.
-
- 28 Nov, 2019 1 commit
-
-
Committed by Kaipeng Deng
* add Adam beta1/beta2 support Variable. test=develop
-
- 01 Nov, 2019 1 commit
-
-
Committed by 123malin
* update pserver decay blocks * update distributed notify handler
-
- 17 Oct, 2019 1 commit
-
-
Committed by tangwei12
* fix fetch handler error with pslib * fix distributed lookup table op with 1 pserver
-
- 15 Oct, 2019 2 commits
-
-
Committed by Chengmo
* test=develop,Fix communicator slow bug * test=develop, delete if() in stop_worker() * test=develop * fix UT, test=develop * fix bug in fetch handler, test=develop * fix bug in fetch handler, test=develop * test=develop, fix fetch barrier bug * test=develop, bug fix * test=develop, bug fix * test=develop, fix bug
-
Committed by 123malin
* bug fix: invalid learning rate decay in pserver async mode
-
- 11 Oct, 2019 1 commit
-
-
Committed by tangwei12
* doc fix, test=develop, test=document_fix
-
- 09 Oct, 2019 1 commit
-
-
Committed by Chengmo
* test=develop,test=document_fix,fix transpiler doc,add API.spec
-
- 07 Oct, 2019 2 commits
- 30 Sep, 2019 2 commits
-
-
Committed by Chengmo
* refactor geo sgd & communicator
-
Committed by Zeng Jinle
* add deprecated memory optimize doc, test=develop, test=document_fix * merge develop to solve conflict, test=develop, test=document_fix
-
- 26 Sep, 2019 1 commit
-
-
Committed by 123malin
* fix DistributeTranspilerConfig document, test=develop
-
- 16 Sep, 2019 1 commit
-
-
Committed by tangwei12
fix wrong place with distributed_lookup_table
-
- 06 Sep, 2019 1 commit
-
-
Committed by 123malin
* fleet api add input check, test=develop
-
- 28 Aug, 2019 2 commits
-
-
Committed by Yi Liu
test=develop
-
Committed by tangwei12
* fix correctness of the communicator * fix a bug in send thread when sending var context is empty, test=develop * add lookup_table_prefetch_op and prefetch optimize, test=develop * remove remote prefetch GPU supported * word2vec force with CPU, test=develop * test dist remote lookup table force with CPU, test=develop
-
- 26 Aug, 2019 1 commit
-
-
Committed by tangwei12
* fix sync mode hang in transpiler * remove sync mode in send/recv * replace PADDLE_ENFORCE with PADDLE_ENFORCE_NE
-
- 16 Aug, 2019 1 commit
-
-
Committed by Tao Luo
* remove unused inference_transpiler unit-tests test=develop * remove InferenceTranspiler usage in quantize_transpiler.py test=develop
-