1. 26 Oct, 2021 2 commits
    • L
      [cherry-pick-2.2] Fused attention op forward (#35905) (#36708) · d2be870a
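      Li Min committed
      Purpose: this PR aims to improve the compute performance of the attention module.
      To reduce the framework's per-op scheduling overhead, this PR implements the attention module by hand in C++ and exposes it as a single large attention op.
      To reduce memory-access overhead, this PR applies two optimizations:
      (1) when computing q, k, and v, the input X is shared, so the gemm, transpose, and bias add are each called once instead of three times;
      (2) kernel fusion is used, so data is passed between different CUDA kernels through registers.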
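As a rough illustration of optimization (1), the sketch below compares three separate projections of X with a single projection over concatenated weights. It is plain Python rather than the PR's C++/CUDA code; the function names and the list-of-rows matrix layout are assumptions made for this example, and the transpose step is left out.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_bias(m, bias):
    """Add a bias vector to every row of a matrix."""
    return [[v + b for v, b in zip(row, bias)] for row in m]


def qkv_separate(x, wq, wk, wv, bq, bk, bv):
    # Baseline: three gemm calls and three bias adds, each reading X again.
    q = add_bias(matmul(x, wq), bq)
    k = add_bias(matmul(x, wk), bk)
    v = add_bias(matmul(x, wv), bv)
    return q, k, v


def qkv_fused(x, wq, wk, wv, bq, bk, bv):
    # Concatenate the weights column-wise and the biases end to end,
    # so X is read once by a single gemm followed by a single bias add.
    w = [rq + rk + rv for rq, rk, rv in zip(wq, wk, wv)]
    b = bq + bk + bv
    out = add_bias(matmul(x, w), b)
    d = len(bq)  # assumes q, k, v share the same output width
    q = [row[:d] for row in out]
    k = [row[d:2 * d] for row in out]
    v = [row[2 * d:] for row in out]
    return q, k, v
```

Both functions return the same q, k, and v; the fused version only changes how many times the gemm and bias add are invoked.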
    • X
      [cherry-pick] Support CPU Parallel in DataParallel Interface by GLOO to speed... · beb920cd
      xiongkun committed
      [cherry-pick] Support CPU Parallel in DataParallel Interface by GLOO to speed up training (#35745) (#36605)
      
      * User specified backend (#35745)
      
      * remove tensordot
  2. 25 Oct, 2021 9 commits
  3. 24 Oct, 2021 1 commit
    • J
      Add viterbi decode (#35778) (#36615) · 1906c746
      Jack Zhou committed
      * add viterbi decode cpu kernel
      
      * add viterbi decoder api in paddle.text
      
      * allocate a data buffer once to avoid frequently creating many small data buffers
      
      * fix viterbi max_seq_length bug
      
      * fix seq_len=1 bug
      
      * fix device context
      
      * move split out of for loop
      
      * remove INVERSE_SUB
      
      * remove 2 GET_CAST_MASK
      
      * remove 1 loop
      
      * remove Functor
      
      * add to_static deploy code
      
      * use MAX_FUNC instead of ELE_MAX
      
      * add MaxFunctor
      
      * impl max_func
      
      * remove MaxFunctor
      
      * remove cast op
      
      * use REGISTER_OP_WITHOUT_GRADIENT
      
      * add viterbi cuda kernel
      
      * add FIX_BLOCKDIM_CASE macro
      
      * add MKL add, mul; add get data mask
      
      * add arange mkl impl
      
      * add CPU Argmax
      
      * add cpu gather
      
      * use EXECUTE_MKL_ELEMENT_BINARY_OP instead of some ADD, MUL
      
      * use SameDimsBinaryOP instead of EXECUTE_MKL_ELEMENT_BINARY_OP
      
      * use SAME_DIMS_ELEMENT_BINARY_OP
      
      * add SimpleBroadcastBinaryOP
      
      * use int instead of int64_t to accelerate
      
      * optimize SimpleBroadcastBinaryOP
      
      * optimize SimpleBroadcastBinaryOP
      
      * optimize performance in both single thread and multithread situation
      
      * remove useless line
      
      * remove useless code
      
      * add CREATE_TENSOR_BUFFER macro
      
      * add INIT_REQUIRED_TENSOR macro
      
      * add comment
      
      * fix windows ci
      
      * add viterbi unittest
      
      * remove cuda add functor
      
      * remove cuda equal
      
      * remove a template function
      
      * fix windows ci
      
      * fix windows dtype
      
      * remove some template instance
      
      * remove useless header file
      
      * remove some blockdim
      
      * remove transpose impl
      
      * accelerate cpu performance on single thread situation
      
      * viterbi_decode->crf_decode
      
      * rename crf params name
      
      * add viterbi api test
      
      * remove useless import
      
      * add enable_static
      
      * use viterbi decoder
      
      * fix viterbi len=1
      
      * fix viterbi unittest
      
      * remove useless comments
      
      * reconstruct viterbi decode
      
      * remove ADD,SUB,MUL structure
      
      * fix coverage
      
      * remove CREATE_TENSOR
      
      * add name args
      
      * crf.py->ops.py; with_start_stop_tag->include_start_end_tag
      
      * update crf_decode en docs
      
      * fix viterbi decode en docs
      
      * fix some review comments
      
      * add FIXED_BLOCK_DIM_CASE in cuda
      
      * push_back->emplace_back
      
      * crf_decode->viterbi_decode; include_start_end_tag->include_bos_eos_tag
      
      * paddle.text.ops.viterbi_decode->paddle.text.viterbi_decode
      
      * fix viterbi_decode en docs
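For readers unfamiliar with the algorithm behind this commit, here is a minimal Viterbi decoding sketch in plain Python. It is not the CPU or CUDA kernel added by the PR and does not reproduce the `paddle.text.viterbi_decode` signature; the function name, the argument layout, and the omission of BOS/EOS tag handling are all assumptions made for illustration.

```python
def viterbi_decode_sketch(emissions, transitions):
    """Return (best_score, best_path) for a single sequence.

    emissions:   list of length seq_len, each a list of num_tags scores.
    transitions: transitions[i][j] is the score of moving from tag i to tag j.
    """
    num_tags = len(transitions)
    scores = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []            # one list of best previous tags per later step

    for emit in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(num_tags):
            # Pick the previous tag that maximizes the score of reaching tag j.
            best_prev = max(range(num_tags), key=lambda i: scores[i] + transitions[i][j])
            pointers.append(best_prev)
            new_scores.append(scores[best_prev] + transitions[best_prev][j] + emit[j])
        scores = new_scores
        backpointers.append(pointers)

    # Trace the best path backwards from the highest-scoring final tag.
    last = max(range(num_tags), key=lambda i: scores[i])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return scores[last], path
```

A sequence of length 1 needs no backpointers and simply returns the best tag of the first step, which is the edge case several of the fixes above mention.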
  4. 21 Oct, 2021 2 commits
  5. 20 Oct, 2021 2 commits
  6. 19 Oct, 2021 3 commits
    • L
      [cherry-pick]Add sparse attention cherrypick (#36447) · 36edb0e1
      Liu-xiandong committed
      The code in this PR supports only CUDA 11.2. CI currently has no GPU with CUDA 11.2, so all tests are skipped automatically.

      The new OP is paddle._C_ops.sparse_attention. The Python API will be added in a follow-up PR.

      This PR lacks tests for dynamic graphs and static graphs; they will be added in subsequent PRs.
    • C
      quant support matmul_v2 (#36469) (#36499) · b8167ed2
      ceci3 committed
      * quant support matmul_v2
      
      * fix format
    • S
      Add operators for async read & async write (#36333) (#36501) · d65f8af8
      Siming Dai committed
      * fix async_read bug
      
      * change index place to cpu
      
      * add tensor size judge
      
      * add async_read & async_write test
      
      * fix bug in async_write
      
      * fix mac py3 ci
      
      * fix bug for cpu version paddle
      
      * fix windows ci bug
      
      * change input argument error type
      
      * change const_cast to mutable_data
      
      * add async_write out-of-bound check and improve error hint
      
      * fix a small bug for dst_tensor
      
      * add docs and refine codes
      
      * refine docs
      
      * notest,test=windows_ci
      
      * fix windows ci
      
      * fix require
      
      * fix code-block
      
      * add core.is_compiled_with_cuda()
  7. 18 Oct, 2021 1 commit
  8. 15 Oct, 2021 2 commits
  9. 14 Oct, 2021 1 commit
  10. 13 Oct, 2021 3 commits
  11. 12 Oct, 2021 1 commit
  12. 11 Oct, 2021 2 commits
  13. 30 Sep, 2021 5 commits
  14. 29 Sep, 2021 4 commits
  15. 28 Sep, 2021 1 commit
    • Z
      [cherry-pick] update multi_dot exposure rules (#36018) (#36131) · 632a0064
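      zhangkaihuo committed
      Update the API exposure rules for multi_dot to follow the linear algebra library's API exposure rules:
      1. Implement it in python/paddle/tensor/linalg.py
      2. Import it in python/paddle/linalg.py and add it to the __all__ list
      3. Import it in python/paddle/tensor/__init__.py and add it to the tensor_method_func list
      4. Remove the import from python/paddle/__init__.py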
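The four steps above can be mimicked with stand-in modules, as in the sketch below. Everything here is hypothetical scaffolding: the `types.ModuleType` objects only imitate the files named in the commit message, the toy `multi_dot` multiplies lists of rows left to right, and the reading of `tensor_method_func` as a list of names later attached to the tensor type is an assumption.

```python
import types

# Hypothetical stand-ins for the real files in the Paddle source tree.
tensor_linalg = types.ModuleType("paddle.tensor.linalg")
linalg = types.ModuleType("paddle.linalg")
tensor_pkg = types.ModuleType("paddle.tensor")
top_level = types.ModuleType("paddle")


def multi_dot(matrices):
    """Toy multi_dot: multiply a list of matrices (lists of rows) left to right."""
    result = matrices[0]
    for m in matrices[1:]:
        result = [[sum(a * b for a, b in zip(row, col)) for col in zip(*m)]
                  for row in result]
    return result


# Step 1: the implementation lives in the tensor.linalg module.
tensor_linalg.multi_dot = multi_dot

# Step 2: the public linalg namespace re-exports it and lists it in __all__.
linalg.multi_dot = tensor_linalg.multi_dot
linalg.__all__ = ["multi_dot"]

# Step 3: the tensor package imports it and records its name in
# tensor_method_func (assumed to drive registration as a tensor method).
tensor_pkg.multi_dot = tensor_linalg.multi_dot
tensor_pkg.tensor_method_func = ["multi_dot"]

# Step 4: the top-level package no longer imports multi_dot,
# so nothing is attached to top_level here.
```

Under this layout the function is reachable through the linalg namespace and as a tensor method, but not directly from the top-level package.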
  16. 27 Sep, 2021 1 commit