1. 24 Mar 2023, 1 commit
    • add phi operator allreduce/reduce (#51857) · 47f87ad3
      Committed by TaoTao Li
      * add all_reduce, reduce kernel and api
      
      * fix all_reduce reduce ut
      
      fix reduce op maker conflict
      
      fix merge conflicts
      
      * fix conflicts, rename ReduceOp->ReduceBaseOp in reduce_ops
      
      rename allreduce op, to remove
      
      * fix code format
      
      fix comments
      
      * modify test_collective_reduce_api ut timeout
      
      * fix PR-CI-Build
      
      fix comments: format phi operator
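
      The new kernels land behind the existing Python collective API; a minimal sketch of how all_reduce and reduce are typically exercised (the launch command, tensor shapes, and values are illustrative, not taken from the PR):

      ```python
      # Minimal sketch, assuming a two-GPU launch such as:
      #   python -m paddle.distributed.launch --gpus 0,1 demo.py
      import paddle
      import paddle.distributed as dist

      dist.init_parallel_env()  # set up the process group for collectives

      # every rank contributes a tensor derived from its rank id
      data = paddle.to_tensor([float(dist.get_rank() + 1)])
      dist.all_reduce(data, op=dist.ReduceOp.SUM)  # every rank now holds the sum
      print("all_reduce:", data.numpy())

      data = paddle.to_tensor([float(dist.get_rank() + 1)])
      dist.reduce(data, dst=0, op=dist.ReduceOp.SUM)  # only rank 0 holds the sum
      if dist.get_rank() == 0:
          print("reduce on rank 0:", data.numpy())
      ```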
  2. 13 Mar 2023, 1 commit
  3. 09 Mar 2023, 1 commit
    • Add comm context manager, add phi broadcast op (#51072) · c191b707
      Committed by TaoTao Li
      * add comm context for device context
      
      * add broadcast phi operator kernel and api
      
      * add broadcast support dtype, update ut
      
      * fix broadcast bfloat16 type
      
      * fix ut
      
      * update test_collective_broadcast_api timeout to 300
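
      For context, a hedged sketch of the broadcast call these kernels serve; the commit's dtype work (including bfloat16) sits underneath this API, and the shapes and values below are illustrative:

      ```python
      import paddle
      import paddle.distributed as dist

      dist.init_parallel_env()

      # rank 0 owns the payload; the other ranks start from a placeholder
      if dist.get_rank() == 0:
          data = paddle.to_tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
      else:
          data = paddle.zeros([2, 3])

      dist.broadcast(data, src=0)  # afterwards every rank holds rank 0's tensor
      print(data.numpy())
      ```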
  4. 20 Jan 2023, 1 commit
    • Fluid clean remove io data (#49301) · 5670644c
      Committed by GGBond8488
      * replace paddle.fluid.layers.data and remove io.data
      
      * partial commit
      
      * partial commit
      
      * partial commit
      
      * partial commit
      
      * partial commit
      
      * partial commit
      
      * remove data in fluid.layers.io.__all__
      
      * fix errors
      
      * fix unittests
      
      * fix unittest
      
      * fix unittests
      
      * fix unittest
      
      * fix unittest
      
      * fix unittests
      
      * fix unittest
      
      * fix test_layers unittests
      
      * fix typo
      
      * fix unittest
      
      * fix unittest
      
      * fix unittest
      
      * fix typo
      
      * fix unittest test_model_cast_to_bf16
      
      * fix test_reducescatter
      
      * fix collective unittest
      
      * fix collective unittests
      
      * fix collective unittests
      
      * add coverage
      
      * fix add layers.data
      
      * re run ci
      
      * fix some typos
      
      * fix sample code error
      
      * fix sample code error
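
      The replacement pattern behind this cleanup, sketched with an assumed static-graph snippet (variable and layer names are illustrative): paddle.fluid.layers.data calls give way to paddle.static.data, which takes the full shape including the batch dimension.

      ```python
      import paddle

      paddle.enable_static()

      # before: x = paddle.fluid.layers.data(name="x", shape=[784], dtype="float32")
      # after:  paddle.static.data, with the batch dimension spelled out as -1
      x = paddle.static.data(name="x", shape=[-1, 784], dtype="float32")
      y = paddle.static.data(name="y", shape=[-1, 1], dtype="int64")

      logits = paddle.static.nn.fc(x, size=10)
      loss = paddle.nn.functional.cross_entropy(logits, y)
      ```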
  5. 25 Nov 2022, 1 commit
  6. 23 Oct 2022, 1 commit
  7. 29 Sep 2022, 1 commit
  8. 27 Sep 2022, 1 commit
  9. 26 Aug 2022, 1 commit
    • move collective tests into a collective directory (#45223) · 9eb4d89b
      Committed by Roc
      * add simple reformatted ci files
      
      * update
      
      * add readme for new unittests
      
      * add readme for new unittests
      
      * add readme for new unittests
      
      * reset mlu
      
      * update for samples
      
      * add base api
      
      * reset some dist unit tests
      
      * add warning in generated cmakelists file
      
      * update readme for new dist unit tests
      
      * add all collective tests
      
      * remain base file and launcher file
      
      * Update README.md
      
      * Update README.md
      
      * fix env PYTHONPATH
      
      * Update gen_ut_cmakelists.py
      
      * add all collective tests
      
      * add docs for gen_ut_cmakelists.py
      
      * prettify code
      
      * comment name == "name"
      
      * update for comments
      
      * update function's help
      
      * update for run type
      
      * update readme
      
      * add all collective tests
      
      * add all collective tests
      
      * mv collective test files
      
      * update for all collective tests
      
      * update
      
      * update
      
      * update
      
      * update for all tests
      
      * update for checking name
      
      * Update Cmakelists.txt
      
      * update testlist.csv
      
      * remain test_parallel_dygraph_dataparallel in unittests
      
      * set broadcast op all platforms
      
      * update
      
      * remain test_broadcast_tensors_op
      
      * fix
      
      * rm some collective files
      
      * update more collective tests
      
      * update
      
      * update
      
      * update
      gen_ut_supports recursion
      
      * update
      
      * update
      
      * update
      
      * update
      
      * fix nccl version
      
      * update
      
      * update
      
      * update
      
      * update
      
      * fix a bug and try to pass
      
      * update
      
      * add csv
      
      * update for timeout
      
      * remove tcp store
      
      * fix
      
      * fix
      
      * update
      
      * update
      
      * update for more dist tests
      
      * move multi node tests
      
      * update
      
      * update
      
      * update
      
      * fix for auto parallel
      
      * update
      
      * update path in python file
      
      * update
      
      * reset some test in unittests
      
      * fix
      
      * update readme
      
      * fix
      
      * update
      
      * fix port
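
      The tests gathered under the new collective directory share a common shape; a hedged, self-contained sketch of such a test (the file name and expected values are hypothetical), launched with paddle.distributed.launch across two GPUs:

      ```python
      # hypothetical tests/collective/demo_allreduce.py, run with
      #   python -m paddle.distributed.launch --gpus 0,1 demo_allreduce.py
      import numpy as np
      import paddle
      import paddle.distributed as dist


      def main():
          dist.init_parallel_env()
          world_size = dist.get_world_size()

          # each rank contributes a tensor filled with rank id + 1
          local = paddle.full([4], fill_value=float(dist.get_rank() + 1))
          dist.all_reduce(local, op=dist.ReduceOp.SUM)

          expected = np.full([4], sum(range(1, world_size + 1)), dtype="float32")
          np.testing.assert_allclose(local.numpy(), expected, rtol=1e-05)


      if __name__ == "__main__":
          main()
      ```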
  10. 05 Jun 2022, 1 commit
    • [code format check upgrade] step 2: yapf (#42944) · a072fca8
      Committed by Sing_chan
      * use yapf to format all python files
      
      * exclude two unittest files from yapf because they rely on writing and reading files, and formatting would break them
      
      * disable diff_py_file because too many diff files cause the subsequent command to fail
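
      Roughly what the formatting pass does to each Python file, sketched through yapf's Python API (the style name and sample source are assumptions; the repository drives yapf through its own tooling):

      ```python
      # minimal sketch: reformat one source string the way a yapf pass would
      from yapf.yapflib.yapf_api import FormatCode

      source = "def add ( a,b ):\n    return a+ b\n"
      formatted, changed = FormatCode(source, style_config="pep8")
      print(formatted)            # def add(a, b): return a + b, properly indented
      print("changed:", changed)  # True when yapf had to rewrite anything
      ```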
  11. 22 Sep 2020, 1 commit
    • Use dygraph mode by default (#27443) · 827ac36f
      Committed by pangyoki
      * enable dygraph mode by default
      
      * fix CI-Mac
      
      * fix Mac-CI other unittest file
      
      * fix CI-Py3
      
      * fix test_communicator_geo and test_buffer_shared_memory_reuse_pass
      
      * add enable_static to fix CI-Py3
      
      * add enable_static to fix CI-coverage
      
      * delete try except
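
      The pattern most of the fixed unittests needed after this change, sketched below with an illustrative program: once dygraph is the default, any test that builds a static Program has to opt back in explicitly and restore the default afterwards.

      ```python
      import paddle

      print(paddle.in_dynamic_mode())  # True: dygraph is now the default

      paddle.enable_static()           # static-graph tests opt back in
      x = paddle.static.data(name="x", shape=[-1, 2], dtype="float32")
      y = x * 2.0

      exe = paddle.static.Executor(paddle.CPUPlace())
      exe.run(paddle.static.default_startup_program())

      paddle.disable_static()          # hand the dygraph default back to later tests
      ```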
  12. 27 Aug 2020, 1 commit
  13. 03 Dec 2019, 1 commit
  14. 27 Jun 2019, 1 commit
    • supports collective communicated training (#18175) · b7128bac
      Committed by HaoRen
      * fix prepare context redundant code problem, optimize executor by caching create_variables
      test=develop
      
      * supports collective training in executor
      
      * make fetch_list runnable with variables, add more unittests for use_program_cache
      test=develop
      
      * fix comment
      test=develop
      
      * use unique name for nccl_id
      
      * supports output to stream in program_to_code
      
      * insert sync_comm_stream before regularization; add skip_op_callstack capability in program_to_code
      
      * set op role in collective training
      
      * add collective op role
      
      * remove orig file
      
      * add build optimizer by strategy
      
      * add collective strategy
      
      * refine collective strategy
      
      * add multi-process role maker
      
      * refine strategy building factory so that we can easily plug in more strategies
      
      * scale loss grad in collective sgd transpiler
      
      * add support for distributed fc
      
      * code format
      
      * revert some features for dist fc
      
      * add support for distributed fc training
      
      * test=develop
      add collective op unittest standard
      
      * test=develop
      remove the test_collective directory
      
      * test=develop
      remove the test_collective directory
      
      * remove slicegather test
      
      * code format for reducescatter
      
      * update attr of shard_index_op
      
      * Modify macro nccl_helper
      
      * remove test without distribute
      
      * macro collective_helper
      
      * macro update
      
      * test=develop
      update support python3.5
      
      * test=develop change gpu memory use to 0.1 when testing
      
      * test=develop
      update ut equal func
      
      * test=develop
      set flags to 1.5
      
      * test=develop fix pickle dump on py35
      
      * test=develop
      fix divide in slice and add sync_comm_stream
      update atol and rtol to 1e-05
      rm shard_index op and test
      modify read input from file to read from memory
      remove origin_program in framework and add i/o in c_sync_calc_stream
      
      * test=develop update unittest sync operator I/O
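
      For orientation, a hedged sketch of collective data-parallel training using today's fleet entry point, which descends from the collective strategy and multi-process role maker introduced here (the model, optimizer, and data are illustrative, not the PR's code):

      ```python
      # run with: python -m paddle.distributed.launch --gpus 0,1 train.py
      import paddle
      from paddle.distributed import fleet

      fleet.init(is_collective=True)  # multi-process collective role maker

      model = paddle.nn.Linear(16, 1)
      opt = paddle.optimizer.SGD(learning_rate=0.01, parameters=model.parameters())

      # wrapping makes gradients all-reduce across ranks during backward
      model = fleet.distributed_model(model)
      opt = fleet.distributed_optimizer(opt)

      x = paddle.randn([8, 16])
      loss = model(x).mean()
      loss.backward()
      opt.step()
      opt.clear_grad()
      ```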