1. 22 10月, 2021 1 次提交
  2. 21 10月, 2021 7 次提交
    • Z
      [NPU] Add p_norm_grad (#36497) · ed478a3e
      zhulei 提交于
      ed478a3e
    • R
      add swish_op for npu (#36579) · 7eab0fa6
      ronnywang 提交于
      7eab0fa6
    • J
      Added matmul_v2+transpose+reshape fuse pass (#36481) · 856cb9c5
      jakpiase 提交于
      * added base changes for matmul_v2+trans+resh fuse pass
      
      * added full matmul_v2+transpose+reshape pass
      
      * removed a file added by mistake
      
      * added reviewers suggestions
      
      * Changed ops type in checking capatibility version
      
      * Deteled one statement
      856cb9c5
    • F
      [NPU] Add sync_batch_norm and sync_batch_norm_grad NPU Kernel (#36320) · 0ca2807c
      furnace 提交于
      * add sync_batch_norm (support train, infer, and fp32, fp16, and NCHW, NHWC)
      
      * [NPU] Delete debug codes
      
      * [NPU] Remove FP16
      0ca2807c
    • J
      Add viterbi decode (#35778) · 6072aecb
      Jack Zhou 提交于
      * add viterbi decode cpu kernel
      
      * add viterbi decoder api in paddle.text
      
      * add a data buffer once to avoid create many small pieces of data buffer frequently
      
      * fix viterbi max_seq_length bug
      
      * fix seq_len=1 bug
      
      * fix device context
      
      * move split out of for loop
      
      * remove INVERSE_SUB
      
      * remove 2 GET_CAST_MASK
      
      * remove 1 loop
      
      * remove Functor
      
      * add to_static deploy code
      
      * use MAX_FUNC instead of ELE_MAX
      
      * add MaxFunctor
      
      * impl max_func
      
      * remove MaxFunctor
      
      * remove cast op
      
      * use REGISTER_OP_WITHOUT_GRADIENT
      
      * add viterbi cuda kernel
      
      * add FIX_BLOCKDIM_CASE macro
      
      * add MKL add, mul; add get data mask
      
      * add arange mkl impl
      
      * add CPU Argmax
      
      * add cpu gather
      
      * use EXECUTE_MKL_ELEMENT_BINARY_OP instead of some ADD, MUL
      
      * use SameDimsBinaryOP instead of EXECUTE_MKL_ELEMENT_BINARY_OP
      
      * use SAME_DIMS_ELEMENT_BINARY_OP
      
      * add SimpleBroadcastBinaryOP
      
      * use int instead of int64_t to accelerate
      
      * optimize SimpleBroadcastBinaryOP
      
      * optimize SimpleBroadcastBinaryOP
      
      * optimize performance in both single thread and multithread situation
      
      * remove useless line
      
      * remove useless code
      
      * add CREATE_TENSOR_BUFFER macro
      
      * add INIT_REQUIRED_TENSOR macro
      
      * add comment
      
      * fix windows ci
      
      * add viterbi unittest
      
      * remove cuda add functor
      
      * remove cuda equal
      
      * remove a template function
      
      * fix windows ci
      
      * fix windows dtype
      
      * remove some template instance
      
      * remove useless header file
      
      * remove some blockdim
      
      * remove transpose impl
      
      * accelerate cpu performance on single thread situation
      
      * viterbi_decode->crf_decode
      
      * rename crf params name
      
      * add viterbi api test
      
      * remove useless import
      
      * add enable_static
      
      * use viterbi decoder
      
      * fix viterbi len=1
      
      * fix  viterbi unittest
      
      * remove useless comments
      
      * reconstruct viterbi decode
      
      * remove ADD,SUB,MUL structure
      
      * fix coverage
      
      * remove CREATE_TENSOR
      
      * add name args
      
      * crf.py->ops.py; with_start_stop_tag->include_start_end_tag
      
      * update crf_decode en docs
      
      * fix viterbi decode en docs
      
      * fix some review comments
      
      * add FIXED_BLOCK_DIM_CASE in cuda
      
      * push_back->emplace_back
      
      * crf_decode->viterbi_decode; include_start_end_tag->include_bos_eos_tag
      
      * paddle.text.ops.viterbi_decode->paddle.text.viterbi_decode
      
      * fix viterbi_decode en docs
      6072aecb
    • T
      add fill_any_like/flatten ops to train ssd on kunlun (#36550) · 7bf2aa38
      TTerror 提交于
      * add some ops to train ssd on kunlun
      
      * update test_fill_any_like_op_xpu.py
      7bf2aa38
    • N
      Fix a bug in ReadData, ReadDataBc and ReadDataReduce when NX != 1 (#36373) · 921c0917
      niuliling123 提交于
      * Update the implement of reduceAnyKernel according to kernel primitive api
      * Fix a bug in ReadData, ReadDataBc and ReadDataReduce when NX != 1
      921c0917
  3. 20 10月, 2021 5 次提交
  4. 19 10月, 2021 7 次提交
  5. 18 10月, 2021 4 次提交
  6. 17 10月, 2021 1 次提交
  7. 15 10月, 2021 5 次提交
  8. 14 10月, 2021 7 次提交
  9. 13 10月, 2021 3 次提交
    • Y
      [PaddlePaddle hackathon] + ADD CELU (#36088) · d7064f04
      yujun 提交于
      * update
      
      * update
      
      * update
      
      * try make CI pass
      
      * doc typo
      
      * update doc string
      d7064f04
    • L
      Merge lars op (#35476) · 0c31579c
      limingshu 提交于
      * A leap of try for cudaLaunchCooperativeKernel
      
      * fix bugs
      
      * Totally replace the lar cuda kernel
      
      * Fix bugs
      
      * a test for lars merge
      
      * Adding las_op_momentum infer_shape
      
      * Fix codes
      
      * use avg_numel instead of max_numel to acquire grid num
      
      * modify unittest files about lars op
      
      * Finally converge when merged-lars works
      
      * fix ctest files
      
      * add merged_operation kernel when cuda version is older than 11
      
      * Fix code style
      
      * fix ctest failure
      
      * fix error
      
      * fix all ctest error and change lars compute code of cpu
      
      * fix bugs on v100.
      
      * revert python modififation about lars
      
      * revert python modification codes
      0c31579c
    • J
      Implemented LRU based cache clearing (#36290) · bf748f24
      Jacek Czaja 提交于
      - Lint
      
      - Merge with develop
      
      - lint
      bf748f24