1. 01 November 2021, 3 commits
    • Paddle Tensor Operation Library initial implementation (#34425) · b9fdd3bc
      Chen Weihang committed
      * initial tensor design & sign kernel demo
      
      * add move constructor for meta & add lodtensor
      
      * add dirs & sign xpu kernel
      
      * add mean cpu&cuda kernel impl
      
      * move sign & mean xpu & npu kernel
      
      * add selected_rows basic impl
      
      * refactor design, BaseTensor to DenseTensor, etc.
      
      * add scale mkldnn kernel
      
      * polish xpu & npu impl details
      
      * fix mkldnn reuse compile failed
      
      * change tensor operation lib name
      
      * rename util filename
      
      * add more comments
      
      * change TensorImplInterface to TensorInterface
      
      * add kernel key and factory
      
      * remove MKLDNNTensorMeta, add MKLDNNDenseTensor
      
      * change XXDeviceContext to XXContext
      
      * add base kernel registrar utils & test on sign
      
      * replace boost::any by paddle::any
      
      * fix several ci failed
      
      * fix npu compile error
      
      * add ordered map util
      
      * fix multiple ordered_map compile errors
      
      * move dev into include dir
      
      * support sign op in static op run
      
      * fix static op run error
      
      * fix new executor compile failed
      
      * add dygraph branch & remove sign_op.h
      
      * fix test_infer_no_need_buffer_slots
      
      * fix rocm compile link error
      
      * fix unitybuild error & clear glog
      
      * fix npu compile failed
      
      * skip quant trans test
      
      * fix part windows compile problem
      
      * fix xpu enforce error
      
      * fix inference test failed
      
      * remove ordered_map to solve quant failed
      
      * fix part of rocm compile failed
      
      * add more register kernels
      
      * revert scale kernel temporarily
      
      * fix code format error
      
      * add new kernel registrar macro
      
      * rename top to tcmpt
      
      * revert xpu, npu, mkldnn impl & remove op def
      
      * add kernel args parse functor to auto parse args
      
      * revert some change & add scale kernels
      
      * add op proto in dygraph kernelcontext building
      
      * polish kernel dispatch logic & naming rule
      
      * fix scale kernel match error
      
      * fix scale test failed
      
      * add mean API and unittest
      
      * test mean api success
      
      * add branch to solve compiled error
      
      * skip clang format error
      
      * add mean skip rule in op_library
      
      * add dot kernel, api and unittest (#6)
      
      * remove old kernel and add symbol link
      
      * fix dot compile failed
      
      * add macro for module declare
      
      * fix npu and xpu compile error
      
      * revert sign, mean, scale, dot kernel removing
      
      * add comment for keeping old kernel impl
      
      * fix mutable_data error
      
      * fix bfloat16 conflict
      
      * fix inference undef error
      
      * adapt to msvc compile rules
      
      * polish comment for template inst
      
      * add cmake template instantiation for win
      
      * fix backend to place device id bug
      
      * fix ifdef error
      
      * Op2functor (#7)
      
      * add kernel args maker class
      
      * make args maker non-const
      
      * remove debug log
      
      * modify codes by review options
      
      * split constructPrKernelContext function
      
      * fix output name bug
      
      * fix test_mean_op test_sign_op failed
      
      * fill_any_like kernel refactor (#10)
      
      * fill_any_like kernel refactor
      
      * remove useless code of full_like c++ api
      
      * skip dtype for fill_any_like
      
      * add attrs for kernel key construct
      
      * add use_pt_kernel Flags to control whether to use pt kernel (#13)
      
      * add use_pt_kernel Flags to control whether to use pt kernel
      
      * change the default value to true for checking pt kernels
      
      * fix mutable_data cuda place error
      
      * move high level apis into hapi
      
      * remove selectedrows adapting temporarily
      
      * Support Scalar in Tensor Compute Library (#14)
      
      * fill_any_like kernel refactor
      
      * remove useless code of full_like c++ api
      
      * Support Scalar in Tensor Compute Library
      
      * add scalar in dygraph and static graph mode
      
      * keep the basic type for attr, instead of using scalar for all
      
      * merge the code
      
      * remove mkldnn tensor & polish details
      
      * use flat_hash_map and small_vector in kernel factory
      
      * Refactor flatten kernel (#12)
      
      * refactor flatten kernel
      
      * update infershape function
      
      * fix compile bugs
      
      * fix bugs when merge
      
      * fix compiler bugs
      
      * fix bugs when run test_flatten_api
      
      * fix bugs when run test
      
      * Revert "use flat_hash_map and small_vector in kernel factory"
      
      This reverts commit 23091495cfdd3df8cc1be592d30f09ea66a7c72b.
      
      * Move cpu, cuda and other device code into kernels (#15)
      
      * fill_any_like kernel refactor
      
      * remove useless code of full_like c++ api
      
      * Support Scalar in Tensor Compute Library
      
      * add scalar in dygraph and static graph mode
      
      * keep the basic type for attr, instead of using scalar for all
      
      * merge the code
      
      * start refactor matmul
      
      * move cpu, cuda and other device modules into kernels
      
      * merge code
      
      * polish code in operator.cc
      
      * Perfect unittests (#16)
      
      * perfect unittest
      
      * update license
      
      * replace with flat_hash_map, small_vector (#19)
      
      * fix small_vector build error on windows platform
      
      * replace with flat_hash_map, small_vector
      
      * remove todo
      
      * Perfect unittests (#20)
      
      * perfect unittest
      
      * update license
      
      * fix bug when run tcmpt_utils_test
      
      * refactor execution adapting impl
      
      * fix insert conflict
      
      * Fix CI bug of test_yolov3 (#21)
      
      * fill_any_like kernel refactor
      
      * remove useless code of full_like c++ api
      
      * Support Scalar in Tensor Compute Library
      
      * add scalar in dygraph and static graph mode
      
      * keep the basic type for attr, instead of using scalar for all
      
      * merge the code
      
      * start refactor matmul
      
      * move cpu, cuda and other device modules into kernels
      
      * merge code
      
      * polish code in operator.cc
      
      * Fix CI bug of test_yolov3
      
      * add the tensor base class, test=develop (#17)
      
      * update the tensor base class, test=develop
      
      * remove two funcs, test=develop
      
      * update the error msg, test=develop
      Co-authored-by: Chen Weihang <chenweihang@baidu.com>
      
      * [no-verify] commit backend and tensor signature changes
      
      * Rename tcmpt to pten (#23)
      
      * rename tcmpt to pten
      
      * update omitted files for rename to pten
      
      * update omitted file for rename to pten
      
      * remove k of all enum var
      
      * remove kernel_instantiate (#26)
      
      * remove symbols and spatial_tensor
      
      * change common to functions
      
      * readd share tensor impl methods
      
      * add a candidate dense tensor class, test=develop (#28)
      
      * change all Pt to Pten
      
      * resolve conflict with xiaowei
      
      * Op2functor opt1 (#27)
      
      * replace to small vector and change to const &
      
      * add std::move
      Co-authored-by: Chen Weihang <chenweihang@baidu.com>
      
      * polish kernel factory and kernel registry
      
      * fix operator test error msg mismatch
      
      * remove tensor signature and backend set member
      
      * move scalar and polish enforce
      
      * revert dtype layout change to fix error
      
      * fix enum operator override error
      
      * add several base unittests
      
      * add pten utils tests
      
      * polish some details
      
      * Dev/op2func refactor 3 (#30)
      
      * add a candidate dense tensor class, test=develop
      
      * remove TensorBase::backend(), test=develop
      
      * remove some ops, test=develop
      
      * cherry-pick the pr of tensor meta, test=develop
      
      * moves the dense tensor and some ops, test=develop
      
      * update the linalg operator, test=develop
      
      * update other operators, test=develop
      
      * fix errors, test=develop
      
      * fix bugs, test=develop
      
      * try to resolve the problem of windows ci, test=develop
      
      * updates codes, test=develop
      
      * fix the tensor_utils.cc, test=develop
      
      * modify the dense tensor, test=develop
      
      * fix the data type, test=develop
      Co-authored-by: shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
      
      * polish some details
      
      * polish kernel signature details
      
      * fix a bug about offsets of the tensor, test=develop (#31)
      Co-authored-by: shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
      
      * polish some details
      Co-authored-by: chentianyu03 <ctychentianyu@gmail.com>
      Co-authored-by: zyfncg <1370305206@qq.com>
      Co-authored-by: YuanRisheng <yuanrisheng@baidu.com>
      Co-authored-by: 石晓伟 <39303645+Shixiaowei02@users.noreply.github.com>
      b9fdd3bc
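      The entry above builds the new tensor operation library around a kernel registry and factory that pick kernels by a key (see "add kernel key and factory", "add base kernel registrar utils & test on sign", "add attrs for kernel key construct"). The real registry is C++ inside the pten directory; the Python sketch below only illustrates the dispatch-by-key idea, and every name in it (KernelKey, register_kernel, dispatch) is invented for illustration, not the library's API.

```python
# Illustrative sketch only: the real pten kernel factory is C++ and keyed by
# backend / layout / dtype. All names below are hypothetical.
from collections import namedtuple

KernelKey = namedtuple("KernelKey", ["backend", "layout", "dtype"])
_registry = {}  # op name -> {KernelKey: kernel callable}

def register_kernel(op, backend, layout, dtype):
    def deco(fn):
        _registry.setdefault(op, {})[KernelKey(backend, layout, dtype)] = fn
        return fn
    return deco

def dispatch(op, backend, layout, dtype, *args):
    key = KernelKey(backend, layout, dtype)
    kernels = _registry.get(op, {})
    if key not in kernels:
        raise KeyError(f"no kernel registered for {op} with key {key}")
    return kernels[key](*args)

@register_kernel("sign", "CPU", "NCHW", "float32")
def sign_cpu_fp32(xs):
    # elementwise sign, the demo kernel the entry starts from
    return [(v > 0) - (v < 0) for v in xs]

print(dispatch("sign", "CPU", "NCHW", "float32", [-2.0, 0.0, 3.5]))  # [-1, 0, 1]
```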
    • [NPU] fix lookup_table_v2_grad ACL error for model BoW (#36864) · 792d3d76
      Aganlengzi committed
      * [NPU] fix lookup_table_v2_grad ACL error for model BoW
      
      * add more unit tests
      792d3d76
    • add cinn_launch_op for using CINN to optimize graph (#36600) · 0a963ee9
      CtfGo committed
      Adds CinnLaunchOp, which executes the compiled result of a CINN sub-graph. Key points:
      1. In the sub-graph partitioning pass BuildCinnPass, each sub-graph is replaced in the original graph by a CinnLaunchOp, which calls CINN to compile and execute that sub-graph.
      2. The inputs/outputs of CinnLaunchOp are exactly the inputs and outputs of the sub-graph; in addition a `compilation_key` attribute is added, with which the sub-graph object and its compiled result can be fetched from the global cache. The attribute is set by BuildCinnPass when it creates the op.
      3. The execution flow of CinnLaunchOp is (a rough sketch follows this entry):
              - fetch the sub-graph object from the global cache
              - fetch the sub-graph's compiled result from the global cache, JIT-compiling it on a cache miss
              - initialize runtime data from the variable info (data type, shape) in the compiled result, allocating host/device memory
              - pack the runtime data into arguments and call CINN's executable runtime program to do the computation
              - the sub-graph's results are synced back to Paddle-side tensors through the argument pointers
      0a963ee9
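      A minimal Python sketch of the run flow listed in point 3 above; the real CinnLaunchOp is a C++ operator, and the cache/compiler objects used here are hypothetical stand-ins.

```python
import numpy as np

# Hypothetical sketch of the CinnLaunchOp run flow described above; the real op
# is C++ inside Paddle, and these cache/compile objects are invented stand-ins.
def cinn_launch(compilation_key, paddle_inputs, graph_cache, compiled_cache, jit_compile):
    subgraph = graph_cache[compilation_key]              # 1. fetch the sub-graph object by key
    compiled = compiled_cache.get(compilation_key)       # 2. fetch the compiled result,
    if compiled is None:                                 #    JIT-compiling on a cache miss
        compiled = jit_compile(subgraph)
        compiled_cache[compilation_key] = compiled
    # 3. initialize runtime buffers from the compiled variable info (dtype, shape)
    buffers = {name: np.zeros(shape, dtype=dtype)
               for name, (dtype, shape) in compiled["var_infos"].items()}
    buffers.update(paddle_inputs)                        #    bind the sub-graph inputs
    compiled["runtime_program"](buffers)                 # 4. run CINN's executable runtime program
    # 5. results are synced back to Paddle-side tensors through the shared buffers
    return {name: buffers[name] for name in compiled["outputs"]}
```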
  2. 29 October 2021, 4 commits
  3. 28 October 2021, 4 commits
  4. 27 October 2021, 8 commits
  5. 26 October 2021, 7 commits
  6. 25 October 2021, 6 commits
    • [NPU] modifications for model ernie-1.0 (#36642) · 19b02d95
      Aganlengzi committed
      * [NPU] modifications for model ernie-1.0
      
      * rollback 503003 and change cast to dtype
      19b02d95
    • add op: fused_feedforward(backward) (#35611) · 2dd0a46a
      zhangkaihuo committed
      This PR contains the backward code of fused_feedforward.
      
      Related kernel implementations: fused_dropout_act_bias, fused_residual_dropout_bias, fused_layernorm_residual_dropout_bias.
      
      fused_feedforward is a fused operator: it fuses and wraps the operators of the transformer feed-forward layer so that the frontend sees only a single interface; the fusion cuts part of the memory-access and kernel-launch time and thereby improves performance. (A rough sketch of one of the fused patterns follows this entry.)
      2dd0a46a
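      The kernels named above each collapse a short chain of element-wise ops into a single pass. As a rough numpy illustration only (not the real CUDA kernels, and the exact ordering/scaling is an assumption), the pattern behind fused_residual_dropout_bias can be written as:

```python
import numpy as np

# Rough reference of the pattern fused by fused_residual_dropout_bias:
# out = residual + dropout(x + bias); the keep-mask is returned so a backward
# kernel can reuse it. The exact formula here is assumed for illustration.
def residual_dropout_bias_ref(x, bias, residual, p, rng):
    mask = (rng.random(x.shape) >= p).astype(x.dtype)   # 1 = keep, 0 = drop
    out = residual + (x + bias) * mask / (1.0 - p)      # inverted-dropout scaling
    return out, mask

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4)).astype("float32")
out, mask = residual_dropout_bias_ref(x, bias=0.1, residual=x, p=0.5, rng=rng)
```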
    • Add bincount op (#36317) · 39f19127
      smallv0221 committed
      * Add bincount op
      
      * upload cpu version
      
      * fix unitest
      
      * fix unittest
      
      * fix unittest
      
      * fix en doc
      
      * add more test
      
      * fix en doc
      
      * add more test case
      
      * fix test
      
      * fix input validation
      
      * fix input check
      
      * fix unittest
      
      * fix test
      
      * fix en doc
      39f19127
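      For reference, the semantics this op is named after (counting occurrences of each non-negative integer, optionally with per-element weights and a minimum output length) presumably match numpy's bincount, shown below; the Paddle-side API name and signature are not reproduced here.

```python
import numpy as np

# bincount semantics: count occurrences of each non-negative integer value,
# optionally accumulating per-element weights, padded to at least `minlength`.
x = np.array([0, 1, 1, 3, 2, 1, 7])
print(np.bincount(x))                                        # [1 3 1 1 0 0 0 1]
print(np.bincount(x, weights=[1.0, 0.5, 0.5, 1, 1, 1, 1]))   # weighted counts
print(np.bincount(x, minlength=10))                          # result padded to length 10
```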
    • add some ops to train ssd on kunlun (#36407) · 50778ad6
      TTerror committed
      * add some ops to train ssd on kunlun
      
      * add some ops to train ssd on kunlun
      
      * add some ops to train ssd on kunlun
      
      * update cast op unittest
      
      * update cast op unittest
      
      * update cast op unittest
      
      * update xpu cmake
      
      * update cast unittest
      50778ad6
    • Fix grid sampler while input size is [1] (#36183) · eff3ee5e
      whs committed
      eff3ee5e
    • add op: fused_feedforward(forward) (#35843) · b18cbfb2
      zhangkaihuo committed
      This PR contains only the forward code of fused_feedforward.
      
      Related kernel implementations: fused_dropout_act_bias, fused_residual_dropout_bias, fused_layernorm_residual_dropout_bias.
      
      fused_feedforward is a fused operator: it fuses and wraps the operators of the transformer feed-forward layer so that the frontend sees only a single interface; the fusion cuts part of the memory-access and kernel-launch time and thereby improves performance. (An unfused reference of the block follows this entry.)
      b18cbfb2
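      As noted in the entry, the whole feed-forward block is exposed as one op. A simplified, unfused numpy reference of the computation it covers (linear1 → activation → linear2 → residual add → layer norm; dropout omitted and post-layer-norm assumed) might look like the sketch below; it illustrates the fused region, not the op's actual signature.

```python
import numpy as np

# Unfused reference of the feed-forward block that fused_feedforward collapses
# into a single op (simplified: dropout omitted, post-layer-norm assumed).
def feedforward_reference(x, w1, b1, w2, b2, gamma, beta, eps=1e-5):
    h = np.maximum(x @ w1 + b1, 0.0)          # linear1 + bias + ReLU
    y = x + (h @ w2 + b2)                     # linear2 + bias + residual add
    mean = y.mean(-1, keepdims=True)
    var = y.var(-1, keepdims=True)
    return gamma * (y - mean) / np.sqrt(var + eps) + beta   # layer norm

x = np.random.rand(2, 8).astype("float32")
w1, b1 = np.random.rand(8, 32).astype("float32"), np.zeros(32, "float32")
w2, b2 = np.random.rand(32, 8).astype("float32"), np.zeros(8, "float32")
out = feedforward_reference(x, w1, b1, w2, b2, np.ones(8, "float32"), np.zeros(8, "float32"))
```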
  7. 23 October 2021, 1 commit
  8. 22 October 2021, 3 commits
    • add fp16 kernel for clip_op (#36577) · 1962d3af
      zhangbo9674 committed
      1962d3af
    • Fused attention op forward (#35905) · d4906214
      Li Min committed
      Purpose: this PR aims to improve the compute performance of the attention module.
      To reduce the framework-level op scheduling overhead, the PR implements the attention module by hand at the C++ level and exposes it as a single large attention op.
      To reduce memory-access overhead, the PR adopts two optimizations (a small check of the first follows this entry):
      (1) by sharing the input X across the q, k, v computation, the gemm, transpose and bias add there are reduced from three calls to one;
      (2) kernel-fusion techniques are used so that data is passed between different CUDA kernels through registers.
      d4906214
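      A small numpy check of optimization (1): since q, k and v all project the same input X, their three GEMMs can be replaced by one GEMM over the concatenated weights. This only illustrates the idea; the real op also folds the transpose and bias add into the same pass.

```python
import numpy as np

# q, k, v share the same input X, so three projection GEMMs can be merged into
# one by concatenating the weight matrices along the output dimension.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 16)).astype("float32")           # [seq, hidden]
Wq, Wk, Wv = (rng.standard_normal((16, 16)).astype("float32") for _ in range(3))

# unfused form: three separate GEMMs
q, k, v = X @ Wq, X @ Wk, X @ Wv

# fused form: one GEMM over the concatenated weight, then split the result
qkv = X @ np.concatenate([Wq, Wk, Wv], axis=1)                # [seq, 3*hidden]
q2, k2, v2 = np.split(qkv, 3, axis=1)
assert np.allclose(q, q2) and np.allclose(k, k2) and np.allclose(v, v2)
```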
    • [Bug Fixes] Elementwise_add triple grad, fixed an input uninitialized problem (#36618) · 6580ad16
      Weilong Wu committed
      * Support elementwise_add triple grad Kernel
      
      * Change code-format to follow CI std
      
      * Removed unreasonable code, and fixed an input uninitialized issue
      
      * Support elementwise_add triple grad Kernel
      
      * Change code-format to follow CI std
      
      * Removed unreasonable code, and fixed an input uninitialized issue
      6580ad16
  9. 21 October 2021, 4 commits