1. 23 Feb, 2022 1 commit
  2. 22 Feb, 2022 1 commit
  3. 21 Feb, 2022 1 commit
  4. 19 Feb, 2022 1 commit
    • S
      Add the DistributedFusedLamb optimizer (#39148) · 5df3cd61
      sneaxiy committed
      * add DistributedFusedLamb op
      
      * polish code
      
      * fix compile error
      
      * compatible with pten changes
      
      * fix rocm compile error
      
      * improve coverage
      
      * update upstream/develop
      
      * fix cast_with_ptr.h
      
      * add FLAGS_distributed_lamb_divide_nranks_when_allreduce=1
      
      * fix clip before allreduce
      
      * add use_master_param_norm
      
      * code polish
      
      * fix bug
      
      * fix ROCM ci
      5df3cd61
  5. 09 Feb, 2022 1 commit
    • J
      Replace EagerTensor with Tensor (#39376) · 945a3ce9
      Jiabin Yang committed
      * merge legacy to fluid
      
      * Remove legacy code
      
      * Remove legacy code
      
      * Remove DataType test
      
      * Using Tensor directly instead of using EagerTensor
      
      * support gradient_accumulation
      
      * make test_imperative_lod_tensor_to_selected_rows longer
      
      * make test_imperative_lod_tensor_to_selected_rows longer
      945a3ce9
  6. 28 Jan, 2022 1 commit
  7. 25 Jan, 2022 1 commit
  8. 24 Jan, 2022 1 commit
    • Z
      Refactored python-level trace_op to call through _C_ops instead of... · c3796061
      Zhanlue Yang committed
      Refactored python-level trace_op to call through _C_ops instead of Tracer::TraceOp, under eager_mode (#38338)
      
      * Replaced core.ops with _C_ops
      
      * Refactored python-level trace_op to call through _C_ops instead of Tracer::TraceOp, under eager_mode
      
      * Modified trace_op interface
      
      * Refactored trace_op logic for eager mode
      
      * Added Eager Dygraph support for OpTest
      
      * Fixed ci issues
      
      * Fixed CI failures
      
      * Fixed Coverage CI Issues
      
      * Fixed XPU CI Issues
      c3796061
  9. 21 Jan, 2022 1 commit
  10. 19 Jan, 2022 1 commit
    • J
      ipu python interface p1 (#38096) · 0837a2cc
      jianghaicheng committed
      * ipu_commit_tests p1
      
      * resolve comments
      
      * resolve comments
      
      * resolve comments
      
      * resolve comments
      
      * resolve comments
      
      * resolve comments
      
      * resolve comments
      
      * update lint and ipustrategy introduction
      
      * update ipu_config
      
      * update __init__ of static
      
      * update doc
      
      * update doc 2
      
      * update doc 3
      
      * update doc 4
      
      * update doc 5
      
      * update doc 5
      
      * update doc 6
      
      * update lint
      
      * update lint 2
      
      * update ipustrategy
      
      * add IpuStrategy to all
      
      * update ipustrategy
      
      * update ipu_shard_guard
      
      * update ipu_shard_guard 2
      Co-authored-by: Nyaozhixin <522190855@qq.com>
      0837a2cc
  11. 14 Jan, 2022 2 commits
  12. 11 Jan, 2022 1 commit
  13. 10 Jan, 2022 1 commit
  14. 31 Dec, 2021 1 commit
  15. 23 Dec, 2021 1 commit
  16. 21 Dec, 2021 1 commit
  17. 20 Dec, 2021 1 commit
  18. 08 Dec, 2021 1 commit
  19. 07 Dec, 2021 1 commit
    • Y
      [Auto para] Relaunch with auto mapping function (#37326) · 506e79d1
      Yulong Ao committed
      * [Auto Parallel]  Add the unified cluster representation
      
      * [Auto Parallel] Add the graph class for physical mapping
      
      * [Auto Parallel] Add the simple physical mapper
      
      * Set the timeout of the mapper
      
      * Merge the upstream develop unittests cmake files
      
      * Fix a bug of the process group
      
      * Remove mapper unittest from platforms which are not GPU
      
      * Move the instantiation of process group after resharding
      
      * Add the local id for devices
      
      * Update the rank mapping format
      
      * [Auto Parallel] Relaunch with the rank mapping file
      
      * Remove the unnecessary json file
      
      * Avoid entering get_device_proc_info for auto mapping
      
      * Correct the mapper unit test
      
      * Add some comments
      
      * Remove the related files about mapping
      
      * Update the unittest for auto mapping
      
      * Remove unused rank_mapping unittest
      
      * Improve the unittest coverage
      
      * Improve the unittest coverage
      
      * Improve the unittest of relaunch
      
      * Fix the unittest problem in CI
      
      * Improve the unittest of relaunch
      
      * Remove unnecessary statements
      
      * Update the unittest cmakefile
      
      * Correct the cmakefile of auto parallel unittests
      
      * Modify codes based on the new elastic change
      
      * Use the GPUs exclusively in the unittest
      
      * Correct the cmakefile
      
      * Set the timeout of the unittest
      506e79d1
  20. 02 Dec, 2021 1 commit
  21. 30 Nov, 2021 1 commit
    • Y
      [Auto Parallel] Do the physical mapping between the process graph and the cluster graph (#37094) · b0dff05d
      Yulong Ao committed
      * [Auto Parallel]  Add the unified cluster representation
      
      * [Auto Parallel] Add the graph class for physical mapping
      
      * [Auto Parallel] Add the simple physical mapper
      
      * Set the timeout of the mapper
      
      * Merge the upstream develop unittests cmake files
      
      * Fix a bug of the process group
      
      * Remove mapper unittest from platforms which are not GPU
      
      * Move the instantiation of process group after resharding
      
      * Add the local id for devices
      
      * Update the rank mapping format
      
      * Add some comments
      
      * Remove the related files about mapping
      
      * Update the unittest for auto mapping
      
      * Remove unused rank_mapping unittest
      
      * Improve the unittest coverage
      
      * Improve the unittest coverage
      b0dff05d
  22. 27 Nov, 2021 1 commit
    • Y
      [Auto Parallel] Add the graph class for the process and cluster (#37482) · 48faf638
      Yulong Ao committed
      * [Auto Parallel]  Add the unified cluster representation
      
      * [Auto Parallel] Add the graph class for physical mapping
      
      * [Auto Parallel] Add the simple physical mapper
      
      * Set the timeout of the mapper
      
      * Merge the upstream develop unittests cmake files
      
      * Fix a bug of the process group
      
      * Remove mapper unittest from platforms which are not GPU
      
      * Move the instantiation of process group after resharding
      
      * Add the local id for devices
      
      * Update the rank mapping format
      
      * Add some comments
      
      * Remove the related files about mapping
      
      * Remove unused rank_mapping unittest
      
      * Improve the unittest coverage
      48faf638
  23. 26 Nov, 2021 1 commit
  24. 25 Nov, 2021 2 commits
  25. 15 Nov, 2021 1 commit
  26. 12 Nov, 2021 3 commits
  27. 05 Nov, 2021 1 commit
  28. 03 Nov, 2021 1 commit
  29. 02 Nov, 2021 1 commit
    • Z
      [AutoParallel] Save&Load Module (#36558) · b9defb4f
      zhaoyingli committed
      * AutoParallel Save&Load
      
      * tiny modification
      
      * update func name
      
      * tiny fix
      
      * add NotImplementedError
      
      * fix doc
      
      * update func name
      
      * update func param
      
      * update interface
      
      * add unittest & modify make_data_unshard
      
      * update unittest
      
      * update unittest
      
      * fix unittest
      
      * fix cmakelist
      
      * update unittest
      b9defb4f
  30. 28 Oct, 2021 1 commit
  31. 26 Oct, 2021 3 commits
  32. 25 Oct, 2021 1 commit
    • Z
      add op: fused_feedforward(forward) (#35843) · b18cbfb2
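      zhangkaihuo committed
      This PR contains only the forward code for fused_feedforward.
      
      Related kernel implementations: fused_dropout_act_bias, fused_residual_dropout_bias, fused_layernorm_residual_dropout_bias
      
      fused_feedforward is a fused operator: it fuses and wraps the operators in a transformer model's feed-forward layer so that the front end exposes a single interface. The fusion removes some memory accesses and kernel-launch time, which improves performance.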
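
The memory-access saving from fusion can be illustrated with a small pure-Python sketch. It is not the PR's kernel code: the function names are hypothetical, the activation is assumed to be ReLU, and dropout is assumed to be the inverted form driven by a precomputed mask. The unfused version makes three passes and materializes two intermediate buffers, while the fused version reads and writes each element once, which is roughly what a kernel named like fused_dropout_act_bias would be expected to do.

```python
# Illustrative sketch only; names and the ReLU/inverted-dropout choices are assumptions.

def relu(v):
    return v if v > 0.0 else 0.0

def unfused_bias_act_dropout(x, bias, mask, keep_prob):
    # Three separate passes, each writing a full intermediate buffer.
    biased = [xi + bi for xi, bi in zip(x, bias)]
    activated = [relu(v) for v in biased]
    return [v * m / keep_prob for v, m in zip(activated, mask)]

def fused_bias_act_dropout(x, bias, mask, keep_prob):
    # One pass: bias add, activation, and dropout are applied per element
    # without storing intermediate results.
    return [relu(xi + bi) * m / keep_prob for xi, bi, m in zip(x, bias, mask)]

x = [1.0, -2.0, 3.0, 0.5]
bias = [0.5, 0.5, -1.0, -1.0]
mask = [1, 0, 1, 1]  # hypothetical precomputed dropout mask (1 = keep)

result = fused_bias_act_dropout(x, bias, mask, 0.5)
assert result == unfused_bias_act_dropout(x, bias, mask, 0.5)
```

Both versions produce the same values; only the number of passes over memory differs, which is where a real GPU kernel would save bandwidth and launch overhead.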
      b18cbfb2
  33. 22 Oct, 2021 1 commit
    • L
      Fused attention op forward (#35905) · d4906214
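      Li Min committed
      Purpose: this PR aims to improve the computational performance of the attention module.
      To reduce the framework's per-op scheduling overhead, it implements the attention module by hand in C++ and exposes it as a single large attention op.
      To reduce memory-access overhead, it applies two optimizations:
      (1) when computing q, k, and v, the shared input X lets the gemm, transpose, and bias add at that step run once instead of three times;
      (2) kernel fusion is used so that data is passed through registers between different CUDA kernels.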
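
Optimization (1) can be sketched in pure Python as follows. This is a hedged illustration, not the PR's C++ code: all function names are hypothetical, the three projections are assumed to have the same output width, and the transpose step is left out for brevity. Because q, k, and v all multiply the same input X, their weight matrices can be concatenated column-wise so that a single gemm and a single bias add replace three of each.

```python
# Illustrative sketch only; helper names and shapes are assumptions.

def matmul(a, b):
    # Plain triple-loop matrix product on lists of lists.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add_bias(m, bias):
    return [[v + b for v, b in zip(row, bias)] for row in m]

def qkv_separate(x, wq, bq, wk, bk, wv, bv):
    # Three gemm calls and three bias adds, all reading the same x.
    return (add_bias(matmul(x, wq), bq),
            add_bias(matmul(x, wk), bk),
            add_bias(matmul(x, wv), bv))

def qkv_shared(x, wq, bq, wk, bk, wv, bv):
    # Concatenate the weights column-wise and the biases end to end,
    # then run one gemm and one bias add and split the result.
    w = [rq + rk + rv for rq, rk, rv in zip(wq, wk, wv)]
    b = bq + bk + bv
    out = add_bias(matmul(x, w), b)
    d = len(bq)  # assumes q, k, v share the same width
    return ([row[:d] for row in out],
            [row[d:2 * d] for row in out],
            [row[2 * d:] for row in out])

x = [[1, 2], [3, 4]]
wq, bq = [[1, 0], [0, 1]], [1, 1]
wk, bk = [[2, 0], [0, 2]], [0, 0]
wv, bv = [[1, 1], [1, 1]], [0, 1]

q, k, v = qkv_shared(x, wq, bq, wk, bk, wv, bv)
assert (q, k, v) == qkv_separate(x, wq, bq, wk, bk, wv, bv)
```

In practice the concatenated weight would be stored once rather than rebuilt per call; it is rebuilt here only to keep the sketch short.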
      d4906214
  34. 21 Oct, 2021 1 commit