1. 25 Oct, 2021 8 commits
    • Z
      add op: fused_feedforward(backward) (#35611) · 2dd0a46a
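      zhangkaihuo committed
      This PR contains the backward code for fused_feedforward.
      
      Related kernel implementations: fused_dropout_act_bias, fused_residual_dropout_bias, fused_layernorm_residual_dropout_bias
      
      fused_feedforward is a fused operator: it fuses and wraps the operators of a transformer model's feed-forward layer so that the front end exposes a single interface. Fusion reduces some memory accesses and kernel launch time, which improves performance.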
      2dd0a46a
    • S
      Add bincount op (#36317) · 39f19127
      smallv0221 committed
      * Add bincount op
      
      * upload cpu version
      
      * fix unittest
      
      * fix unittest
      
      * fix unittest
      
      * fix en doc
      
      * add more test
      
      * fix en doc
      
      * add more test case
      
      * fix test
      
      * fix input validation
      
      * fix input check
      
      * fix unittest
      
      * fix test
      
      * fix en doc
      39f19127
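For readers unfamiliar with the operation, a bincount counts how often each non-negative integer occurs in an input. The sketch below is a plain-Python reference for that behaviour, not the Paddle kernel added in this commit. The name `bincount_reference` is hypothetical, and the `weights` and `minlength` parameters are assumptions modeled on NumPy's `numpy.bincount`.

```python
from typing import List, Optional, Sequence


def bincount_reference(
    x: Sequence[int],
    weights: Optional[Sequence[float]] = None,
    minlength: int = 0,
) -> List[float]:
    """Count occurrences (or sum weights) of each non-negative integer in x."""
    if any(v < 0 for v in x):
        raise ValueError("bincount expects non-negative integers")
    if weights is not None and len(weights) != len(x):
        raise ValueError("weights must have the same length as x")

    # The output has one bin per value in [0, max(x)], padded up to minlength.
    size = max(max(x, default=-1) + 1, minlength)
    counts: List[float] = [0] * size
    for i, v in enumerate(x):
        counts[v] += 1 if weights is None else weights[i]
    return counts


print(bincount_reference([1, 2, 2, 5]))
print(bincount_reference([0, 1, 1], weights=[0.5, 1.0, 2.0]))
```

The input checks mirror the kind of validation the commit messages mention ("fix input validation", "fix input check"), but the exact checks in the real operator may differ.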
    • T
      CI build PR and dev whl (#36532) · e16fe48d
      tianshuo78520a committed
      CI build PR and dev whl
      e16fe48d
    • Z
      Create CinnCompiler class for compiling subgraphs found by build_cinn_pass. (#36562) · 4c460378
      Zhen Wang committed
      * Init the functions of CinnCompiler.
      
      * Add the unit test for CinnCompiler.
      
      * Fix some compilation errors.
      
      * Update the UT of cinn_compiler.
      
      * Use Decomposer&OpFusion passes in CinnCompiler::CompileGraph.
      
      * Update some comments.
      
      * Uncomment some includes in build_cinn_pass.cc.
      
      * Use refs instead of ptrs as returned types of FindGraph & Compile in
      CinnCompiler.
      
      * Use the merged CinnGraphSymbolization functions in CinnCompiler.
      4c460378
    • T
      add some ops to train ssd on kunlun (#36407) · 50778ad6
      TTerror committed
      * add some ops to train ssd on kunlun
      
      * add some ops to train ssd on kunlun
      
      * add some ops to train ssd on kunlun
      
      * update cast op unittest
      
      * update cast op unittest
      
      * update cast op unittest
      
      * update xpu cmake
      
      * update cast unittest
      50778ad6
    • L
      [new-exec] Add events waiter (#36480) · cdb9bfa3
      liutiexing committed
      * add align for WorkQueue
      
      * add spinlock
      
      * merge develop
      
      * merge
      
      * Add EventsWaiter
      
      * update
      
      * update
      
      * update Error MSG
      
      * update EventsWaiter
      cdb9bfa3
    • W
      Fix grid sampler while input size is [1] (#36183) · eff3ee5e
      whs committed
      eff3ee5e
    • Z
      add op: fused_feedforward(forward) (#35843) · b18cbfb2
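      zhangkaihuo committed
      This PR contains only the forward code for fused_feedforward.
      
      Related kernel implementations: fused_dropout_act_bias, fused_residual_dropout_bias, fused_layernorm_residual_dropout_bias
      
      fused_feedforward is a fused operator: it fuses and wraps the operators of a transformer model's feed-forward layer so that the front end exposes a single interface. Fusion reduces some memory accesses and kernel launch time, which improves performance.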
      b18cbfb2
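To make the fusion concrete, here is a minimal unfused sketch of a transformer feed-forward block for a single token vector. It is an illustration only: the post-layer-norm ordering, the ReLU activation, the omission of learnable layer-norm scale and shift, and every function name below are assumptions, not the operator's actual implementation. The comments mark which steps the kernel names in the commit message suggest are grouped together.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, w: Matrix, b: Vector) -> Vector:
    """y[j] = sum_i x[i] * w[i][j] + b[j]."""
    return [sum(xi * wi[j] for xi, wi in zip(x, w)) + b[j] for j in range(len(b))]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def dropout(x: Vector, p: float, rng: random.Random) -> Vector:
    """Inverted dropout: zero with probability p, scale survivors by 1 / (1 - p)."""
    if p == 0.0:
        return list(x)
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in x]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize to zero mean and unit variance (no learnable scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def feedforward_reference(
    x: Vector, w1: Matrix, b1: Vector, w2: Matrix, b2: Vector,
    p: float = 0.0, seed: int = 0,
) -> Vector:
    rng = random.Random(seed)
    # Step 1: bias add + activation + dropout (cf. fused_dropout_act_bias).
    hidden = dropout(relu(linear(x, w1, b1)), p, rng)
    # Step 2: bias add + dropout + residual add (cf. fused_residual_dropout_bias).
    out = dropout(linear(hidden, w2, b2), p, rng)
    residual = [a + b for a, b in zip(x, out)]
    # Step 3: layer norm over the residual sum
    # (cf. fused_layernorm_residual_dropout_bias, which suggests steps 2 and 3 can be merged).
    return layer_norm(residual)


identity = [[1.0, 0.0], [0.0, 1.0]]
print(feedforward_reference([1.0, 2.0], identity, [0.0, 0.0], identity, [0.0, 0.0]))
```

Written this way, each helper reads and writes a full intermediate vector. A fused operator avoids materializing some of those intermediates and launches fewer kernels, which is the saving the commit message describes.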
  2. 24 Oct, 2021 1 commit
  3. 23 Oct, 2021 6 commits
    • J
      add cinn graph symbolization (#36417) · bbd4bd73
      jiangcheng committed
      * add cinn graph symbolization
      
      * fix some bug
      
      * add paddle scope to cinn scope
      
      * add paddle scope to CINN scope in Symbolization, and add feed op when build cinn pass
      
      * fix some bug
      
      * fix some bug by review advices
      
      * optimize code problem
      
      * revert build_cinn_pass and move the change to https://github.com/PaddlePaddle/Paddle/pull/36503
      
      * fix some bug after co-compilation
      
      * improve unit test script
      
      * remove scope and rename feed_target to input_tensor
      
      * using std::unordered_map instead of absl::flat_hash_map
      
      * fix single test bug
      
      * revert to previous version, since WITH_CINN was added in a later PR
      
      * complete error information for CI
      
      * complete enforce information to pass CI
      bbd4bd73
    • W
      disable padding if dynamic shape (#36648) · 99e396f8
      wenbin committed
      * disable padding if dynamic shape
      
      * add parentheses
      
      * correct
      99e396f8
    • B
      fix interpolate mkldnn op error (#36623) · f6d82526
      baoachun committed
      f6d82526
    • W
      add file exists check (#36628) · 425db7c8
      Wilber committed
      * add file check
      
      * add ut
      425db7c8
    • J
      Add transformer of paddle desc and cinn desc (#36100) · 3cb6f65e
      jiangcheng committed
      * add transformer of paddle desc and cinn desc
      
      * change LOG(FATAL) to PADDLE_THROW for ci
      
      * complete error information for ci
      
      * fix some problem as review advice
      
      * fix some bug
      
      * move var type utils to transform_desc header file
      
      * add if NOT WITH_CINN control whether compile
      
      * build_strategy check whether open WITH_CINN
      
      * add control WITH_CINN in cmake
      3cb6f65e
    • H
      New Paddle-CINN Compile PR (#36584) · ab732884
      Huihuang Zheng committed
      This PR added some changes to match the CINN change for compilation. It also tried to fix JiangCheng's Problem in PR: https://github.com/PaddlePaddle/Paddle/pull/36100
      
      These changes include:
      1. Set `CINN_GIT_TAG` to a newer tag
      2. CINN is now built with just `make cinnapi -j`
      3. We have to add `-DPY_VERSION=${PY_VERSION} -DWITH_TESTING=ON` to CINN cmake args
      4. For CINN's third party dependencies, we can just include headers without target_link_libraries
      5. Moved `cinn.cmake` from `paddle/cmake` to `paddle/cmake/external` to match the old style. The external folder contains `lite`, which is at the same level as `cinn`
      6. CINN added `-DNAMESPACE=cinn_gflags` in `gflags.cmake` so that CINN and Paddle use different gflags namespaces. This solved the redefinition problem.
      7. Change namespace of `::google::` in gflags to `::GFLAGS_NAMESPACE`
      ab732884
  4. 22 Oct, 2021 6 commits
  5. 21 Oct, 2021 12 commits
    • Z
      [NPU] Add p_norm_grad (#36497) · ed478a3e
      zhulei committed
      ed478a3e
    • R
      add swish_op for npu (#36579) · 7eab0fa6
      ronnywang committed
      7eab0fa6
    • J
      Added matmul_v2+transpose+reshape fuse pass (#36481) · 856cb9c5
      jakpiase committed
      * added base changes for matmul_v2+trans+resh fuse pass
      
      * added full matmul_v2+transpose+reshape pass
      
      * removed a file added by mistake
      
      * added reviewers suggestions
      
      * Changed ops type in checking compatibility version
      
      * Deleted one statement
      856cb9c5
    • F
      [NPU] Add sync_batch_norm and sync_batch_norm_grad NPU Kernel (#36320) · 0ca2807c
      furnace committed
      * add sync_batch_norm (support train, infer, and fp32, fp16, and NCHW, NHWC)
      
      * [NPU] Delete debug codes
      
      * [NPU] Remove FP16
      0ca2807c
    • J
      Add viterbi decode (#35778) · 6072aecb
      Jack Zhou committed
      * add viterbi decode cpu kernel
      
      * add viterbi decoder api in paddle.text
      
      * allocate one data buffer up front to avoid frequently creating many small data buffers
      
      * fix viterbi max_seq_length bug
      
      * fix seq_len=1 bug
      
      * fix device context
      
      * move split out of for loop
      
      * remove INVERSE_SUB
      
      * remove 2 GET_CAST_MASK
      
      * remove 1 loop
      
      * remove Functor
      
      * add to_static deploy code
      
      * use MAX_FUNC instead of ELE_MAX
      
      * add MaxFunctor
      
      * impl max_func
      
      * remove MaxFunctor
      
      * remove cast op
      
      * use REGISTER_OP_WITHOUT_GRADIENT
      
      * add viterbi cuda kernel
      
      * add FIX_BLOCKDIM_CASE macro
      
      * add MKL add, mul; add get data mask
      
      * add arange mkl impl
      
      * add CPU Argmax
      
      * add cpu gather
      
      * use EXECUTE_MKL_ELEMENT_BINARY_OP instead of some ADD, MUL
      
      * use SameDimsBinaryOP instead of EXECUTE_MKL_ELEMENT_BINARY_OP
      
      * use SAME_DIMS_ELEMENT_BINARY_OP
      
      * add SimpleBroadcastBinaryOP
      
      * use int instead of int64_t to accelerate
      
      * optimize SimpleBroadcastBinaryOP
      
      * optimize SimpleBroadcastBinaryOP
      
      * optimize performance in both single thread and multithread situation
      
      * remove useless line
      
      * remove useless code
      
      * add CREATE_TENSOR_BUFFER macro
      
      * add INIT_REQUIRED_TENSOR macro
      
      * add comment
      
      * fix windows ci
      
      * add viterbi unittest
      
      * remove cuda add functor
      
      * remove cuda equal
      
      * remove a template function
      
      * fix windows ci
      
      * fix windows dtype
      
      * remove some template instance
      
      * remove useless header file
      
      * remove some blockdim
      
      * remove transpose impl
      
      * accelerate cpu performance on single thread situation
      
      * viterbi_decode->crf_decode
      
      * rename crf params name
      
      * add viterbi api test
      
      * remove useless import
      
      * add enable_static
      
      * use viterbi decoder
      
      * fix viterbi len=1
      
      * fix viterbi unittest
      
      * remove useless comments
      
      * reconstruct viterbi decode
      
      * remove ADD,SUB,MUL structure
      
      * fix coverage
      
      * remove CREATE_TENSOR
      
      * add name args
      
      * crf.py->ops.py; with_start_stop_tag->include_start_end_tag
      
      * update crf_decode en docs
      
      * fix viterbi decode en docs
      
      * fix some review comments
      
      * add FIXED_BLOCK_DIM_CASE in cuda
      
      * push_back->emplace_back
      
      * crf_decode->viterbi_decode; include_start_end_tag->include_bos_eos_tag
      
      * paddle.text.ops.viterbi_decode->paddle.text.viterbi_decode
      
      * fix viterbi_decode en docs
      6072aecb
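As background for the commit above, Viterbi decoding finds the highest-scoring tag sequence given per-step emission scores and tag-to-tag transition scores. The following is a minimal pure-Python sketch for a single sequence. It is not the CPU or CUDA kernel from this commit and not the `paddle.text.viterbi_decode` signature: the name `viterbi_decode_reference` is hypothetical, and BOS/EOS tag handling and batching are deliberately left out.

```python
from typing import List, Sequence, Tuple


def viterbi_decode_reference(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
) -> Tuple[float, List[int]]:
    """Return (best score, best tag path) for one sequence.

    emissions[t][k]   : score of tag k at time step t
    transitions[i][j] : score of moving from tag i to tag j
    """
    if not emissions:
        return 0.0, []
    num_tags = len(transitions)
    scores = list(emissions[0])          # best score of any path ending in each tag
    backpointers: List[List[int]] = []   # best previous tag, per step and per tag

    for step in emissions[1:]:
        new_scores, pointers = [], []
        for cur in range(num_tags):
            best_prev = max(
                range(num_tags),
                key=lambda prev: scores[prev] + transitions[prev][cur],
            )
            pointers.append(best_prev)
            new_scores.append(scores[best_prev] + transitions[best_prev][cur] + step[cur])
        scores = new_scores
        backpointers.append(pointers)

    # Trace the best path backwards from the best final tag.
    last = max(range(num_tags), key=lambda tag: scores[tag])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return scores[last], path


emissions = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(viterbi_decode_reference(emissions, [[0.0, 0.0], [0.0, 0.0]]))
print(viterbi_decode_reference(emissions, [[0.0, -5.0], [-5.0, 0.0]]))
```

A sequence of length 1 skips the loop entirely and returns the arg-max of the first emission row, the edge case that the "fix seq_len=1 bug" and "fix viterbi len=1" messages refer to.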
    • T
      add fill_any_like/flatten ops to train ssd on kunlun (#36550) · 7bf2aa38
      TTerror committed
      * add some ops to train ssd on kunlun
      
      * update test_fill_any_like_op_xpu.py
      7bf2aa38
    • X
      User specified backend (#35745) · b6e7f8e9
      xiongkun committed
      b6e7f8e9
    • N
      Fix a bug in ReadData, ReadDataBc and ReadDataReduce when NX != 1 (#36373) · 921c0917
      niuliling123 committed
      * Update the implementation of reduceAnyKernel according to kernel primitive api
      * Fix a bug in ReadData, ReadDataBc and ReadDataReduce when NX != 1
      921c0917
    • S
      Graph engine4 (#36587) · 5eb640c6
      seemingwang committed
      5eb640c6
    • Z
      add ctr table depends (#36465) · d64f7b3b
      zhaocaibei123 committed
      * add ctr table depends
      
      * code style
      
      * fix
      
      * fix
      
      * fix naming
      
      * rename
      
      * rename
      d64f7b3b
    • L
      Fix flame graph (#36578) · 72533986
      liutiexing committed
      * add align for WorkQueue
      
      * add spinlock
      
      * merge develop
      
      * merge
      
      * Add EventsWaiter
      
      * Revert "Add EventsWaiter"
      
      This reverts commit e206173aa9be7401b83a53581627bfaf557c8fb2.
      
      * adjust multithread usage, fix flame graph
      
      * update
      72533986
    • A
      Support No DataTransform From GetKernelTypeForVar (#36571) · e82c3a5f
      Aurelius84 committed
      * Add kQueueSync.synchronize_run_ logic
      
      * Support No DataTransform From GetKernelTypeForVar
      e82c3a5f
  6. 20 Oct, 2021 7 commits