Commits · 05d7e2fd4a6e3637047e3d29b30efbc9916a669b · 机器未来 / Paddle

25 Oct, 2021 1 commit

Add fused_dropout wrapper to ease use. (#36185) (#36640) · 05d7e2fd

Committed by Li Min on Oct 25, 2021

In fused_attention op and fused_ffn op, the fused bias_add+dropout+residual+layernorm kernel or bias_add+dropout+residual kernel is used. To ease the use of this kernel, we provide a wrapper in this PR.
1. To reuse the increment-computing code, we extract it into the "GetSeedDataAndIncrement" routine in dropout_impl_util.h.
2. The fused_dropout_helper.h provides the fused dropout kernel wrapper.

Note: the test of this wrapper will be provided in the following fused_attention_op and fused_ffn PRs.
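
For readers unfamiliar with the fused pattern named above, the following is a minimal unfused reference of bias_add + dropout + residual + layernorm for a single row, written in plain Python. It is only a sketch: the function name, the argument names, the `eps` default and the "scale kept values by 1/(1-p)" dropout convention are assumptions made for illustration, not the CUDA kernel or the wrapper added by this commit.

```python
import math
import random


def bias_dropout_residual_layernorm(x, bias, residual, gamma, beta,
                                    dropout_prob, rng, eps=1e-5):
    """Unfused reference: layer_norm(residual + dropout(x + bias)) for one row."""
    keep_prob = 1.0 - dropout_prob
    summed = []
    for xi, bi, ri in zip(x, bias, residual):
        value = xi + bi                       # bias_add
        if dropout_prob > 0.0:                # dropout (assumed upscale-at-train-time)
            value = value / keep_prob if rng.random() < keep_prob else 0.0
        summed.append(ri + value)             # residual add
    mean = sum(summed) / len(summed)
    var = sum((v - mean) ** 2 for v in summed) / len(summed)
    inv_std = 1.0 / math.sqrt(var + eps)
    # layer norm with per-element scale (gamma) and shift (beta)
    return [g * (v - mean) * inv_std + b
            for v, g, b in zip(summed, gamma, beta)]


row = bias_dropout_residual_layernorm(
    x=[1.0, 2.0, 3.0, 4.0], bias=[0.0] * 4, residual=[0.0] * 4,
    gamma=[1.0] * 4, beta=[0.0] * 4, dropout_prob=0.0, rng=random.Random(0))
print(row)
```

A fused kernel computes the same result in one pass over the data, which avoids writing the intermediate tensors back to memory between the four steps.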

05d7e2fd

24 Oct, 2021 1 commit

Add viterbi decode (#35778) (#36615) · 1906c746

Committed by Jack Zhou on Oct 24, 2021

* add viterbi decode cpu kernel

* add viterbi decoder api in paddle.text

* allocate a data buffer once to avoid frequently creating many small data buffers

* fix viterbi max_seq_length bug

* fix seq_len=1 bug

* fix device context

* move split out of for loop

* remove INVERSE_SUB

* remove 2 GET_CAST_MASK

* remove 1 loop

* remove Functor

* add to_static deploy code

* use MAX_FUNC instead of ELE_MAX

* add MaxFunctor

* impl max_func

* remove MaxFunctor

* remove cast op

* use REGISTER_OP_WITHOUT_GRADIENT

* add viterbi cuda kernel

* add FIX_BLOCKDIM_CASE macro

* add MKL add, mul; add get data mask

* add arange mkl impl

* add CPU Argmax

* add cpu gather

* use EXECUTE_MKL_ELEMENT_BINARY_OP instead of some ADD, MUL

* use SameDimsBinaryOP instead of EXECUTE_MKL_ELEMENT_BINARY_OP

* use SAME_DIMS_ELEMENT_BINARY_OP

* add SimpleBroadcastBinaryOP

* use int instead of int64_t to accelerate

* optimize SimpleBroadcastBinaryOP

* optimize SimpleBroadcastBinaryOP

* optimize performance in both single thread and multithread situation

* remove useless line

* remove useless code

* add CREATE_TENSOR_BUFFER macro

* add INIT_REQUIRED_TENSOR macro

* add comment

* fix windows ci

* add viterbi unittest

* remove cuda add functor

* remove cuda equal

* remove a template function

* fix windows ci

* fix windows dtype

* remove some template instance

* remove useless header file

* remove some blockdim

* remove transpose impl

* accelerate cpu performance on single thread situation

* viterbi_decode->crf_decode

* rename crf params name

* add viterbi api test

* remove useless import

* add enable_static

* use viterbi decoder

* fix viterbi len=1

* fix viterbi unittest

* remove useless comments

* reconstruct viterbi decode

* remove ADD,SUB,MUL structure

* fix coverage

* remove CREATE_TENSOR

* add name args

* crf.py->ops.py; with_start_stop_tag->include_start_end_tag

* update crf_decode en docs

* fix viterbi decode en docs

* fix some review comments

* add FIXED_BLOCK_DIM_CASE in cuda

* push_back->emplace_back

* crf_decode->viterbi_decode; include_start_end_tag->include_bos_eos_tag

* paddle.text.ops.viterbi_decode->paddle.text.viterbi_decode

* fix viterbi_decode en docs
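
The operator added by this commit performs Viterbi decoding over emission and transition scores. As a rough illustration of the algorithm only, here is a plain-Python sketch for a single sequence. The function signature and the score layout (`transitions[i][j]` is the score of moving from tag `i` to tag `j`) are assumptions; the sketch ignores batching, per-sequence lengths and the `include_bos_eos_tag` option mentioned in the commit message, and it is not the CPU or CUDA kernel.

```python
def viterbi_decode(emissions, transitions):
    """Return (best_score, best_path) for one sequence.

    emissions:   list of seq_len rows, each with num_tags scores
    transitions: num_tags x num_tags matrix, transitions[i][j] = score i -> j
    """
    num_tags = len(transitions)
    score = list(emissions[0])        # best score of any path ending in each tag
    backpointers = []                 # best previous tag for every later step
    for emit in emissions[1:]:
        next_score, pointers = [], []
        for j in range(num_tags):
            best_prev = max(range(num_tags),
                            key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_prev)
            next_score.append(score[best_prev] + transitions[best_prev][j] + emit[j])
        score = next_score
        backpointers.append(pointers)
    # Trace back from the best final tag.
    last = max(range(num_tags), key=lambda i: score[i])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path


emissions = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
sticky = [[0.0, -5.0], [-5.0, 0.0]]   # switching tags is heavily penalised
print(viterbi_decode(emissions, sticky))
```

A sequence of length 1 skips the loop entirely and just returns the argmax of the first emission row, which is the edge case several of the bullets above ("fix seq_len=1 bug", "fix viterbi len=1") refer to.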

1906c746

22 Oct, 2021 1 commit

Fix a bug in ReadData, ReadDataBc and ReadDataReduce when NX != 1 (#36373) (#36616) · 6840cf55

Committed by niuliling123 on Oct 22, 2021

* Fix a bug in ReadData, ReadDataBc and ReadDataReduce when NX != 1
* Update the implementation of reduceAnyKernel according to kernel primitive api

6840cf55

21 Oct, 2021 2 commits
- [Cherry-pick] Add functor_primitives.h for kernel primitive api (#36418) · 30909889
  Committed by niuliling123 on Oct 21, 2021
```
* Add functor_primitives.h for kernel primitive api
```
  30909889
- improve replicate pad error information (#36531) · a201a691
  Committed by littletomatodonkey on Oct 21, 2021
```
* fix replicate pad when input size is 0

* add unit test
```
  a201a691
19 Oct, 2021 2 commits

[cherry-pick]Add sparse attention cherrypick (#36447) · 36edb0e1

Committed by Liu-xiandong on Oct 19, 2021

The code of this PR can only support CUDA 11.2. Currently, CI does not have a GPU with CUDA 11.2, so all tests will be skipped automatically.

The new OP is paddle._C_ops.sparse_attention. The Python API will be addressed in a follow-up PR.

This PR lacks tests on dynamic graphs and static graphs; they will be added in subsequent PRs.

36edb0e1

cherry-pick 36424 inference support bert when exists matmul_v2 (#36500) · d974dbd1

Committed by Wilber on Oct 19, 2021

d974dbd1

13 Oct, 2021 1 commit
- fix for matmul_v2 6D x 2D (#36379) · ce6a27d9
  Committed by jakpiase on Oct 13, 2021
  
  ce6a27d9
12 Oct, 2021 1 commit
- Fix stop_gradient in RunProgramOp (#36339) (#36353) · a6868c91
  Committed by Aurelius84 on Oct 12, 2021
```
* Fix stop_gradient in RunProgramOp

* fix reference
```
  a6868c91
30 Sep, 2021 2 commits
- fix bug of reduce_sum when src_dtype != dst_dtype and reduce_num == 1 (#36123) (#36193) · e8efba57
  Committed by Guoxia Wang on Sep 30, 2021
  
  e8efba57
- support fp16 (#35888) (#36191) · 87cc8d48
  Committed by Guoxia Wang on Sep 30, 2021
  
  87cc8d48
29 Sep, 2021 1 commit

add API paddle.linalg.eig (#35674) (#36188) · 4e2daa9a

Committed by Lijunhui on Sep 29, 2021

Add the eig operator to PaddlePaddle's linear algebra library; it computes the eigendecomposition of a general square matrix.
Cherry-picked from #35674.
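
To make "eigendecomposition of a general square matrix" concrete, the sketch below handles only the 2x2 case in closed form with the standard library. It is an illustration under stated assumptions (the function name and return layout are hypothetical, and a defective matrix yields a repeated eigenvector); it is not how `paddle.linalg.eig` is implemented. The point it shows is that a real matrix can have complex eigenvalues, so the results are complex numbers.

```python
import cmath


def eig_2x2(m):
    """Eigenvalues and (unnormalised) eigenvectors of a 2x2 matrix."""
    (a, b), (c, d) = m
    if b == 0 and c == 0:
        # Diagonal matrix: eigenvalues are the diagonal, vectors the basis.
        return [complex(a), complex(d)], [(1, 0), (0, 1)]
    trace, det = a + d, a * d - b * c
    # Roots of the characteristic polynomial x^2 - trace*x + det = 0.
    root = cmath.sqrt(trace * trace - 4 * det)
    values = [(trace + root) / 2, (trace - root) / 2]
    vectors = []
    for lam in values:
        if b != 0:
            vectors.append((b, lam - a))   # solves the first row of (A - lam*I)v = 0
        else:
            vectors.append((lam - d, c))   # b == 0, so c != 0 here
    return values, vectors


rotation = [[0.0, -1.0], [1.0, 0.0]]       # 90-degree rotation, no real eigenvalues
values, vectors = eig_2x2(rotation)
print(values)
print(vectors)
```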

4e2daa9a

28 Sep, 2021 1 commit
- [cherry-pick] [ROCM] bugfix for bilinear_interp_v2_grad (#36160) #36161 · c576169b
  Committed by ronnywang on Sep 28, 2021
```
ATT, cherry-pick #36160
```
  c576169b
27 Sep, 2021 5 commits
- cherry-pick #36021 fix unique/unstack zero tensor (#36163) · 749bc240
  Committed by Jiawei Wang on Sep 27, 2021
```
* fix unique unstack dim 0

* fix unique_op format
```
  749bc240
- bugfix reshape -1 (#36143) · 45b7627b
  Committed by JZ-LIANG on Sep 27, 2021
  
  45b7627b
- [ROCM] fixbug for arg_min_max (#36113) · 40a29186
  Committed by ronnywang on Sep 27, 2021
```
ATT, cherry-pick #36098
```
  40a29186
- [Cherry-pick] Add new func/class API psroi_pool and UT (#36111) · 81557da6
  Committed by JYChen on Sep 27, 2021
```
cherry-pick from #35352

Add new detection api paddle.vision.ops.psroi_pool and paddle.vision.ops.PSRoIPool
```
  81557da6
- [cherry-pick]Support fixed seed in Python for test (#36065) (#36094) · c3a0eaab
  Committed by YuanRisheng on Sep 27, 2021
```
When users use gumbel_softmax, they can use paddle.seed() in python for fixed seed.
```
  c3a0eaab
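
The last commit in this group (c3a0eaab) makes gumbel_softmax reproducible once a seed is fixed from Python. The idea can be sketched with the standard library alone: when the random generator is seeded identically, the Gumbel noise and therefore the sampled output are identical. Here `random.Random` merely stands in for Paddle's generator, and the function below is a hypothetical illustration, not Paddle's operator.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Draw one Gumbel-softmax sample using the supplied random generator."""
    perturbed = []
    for logit in logits:
        u = max(rng.random(), 1e-12)        # keep u > 0 so log(u) is defined
        gumbel = -math.log(-math.log(u))    # Gumbel(0, 1) noise
        perturbed.append((logit + gumbel) / temperature)
    peak = max(perturbed)                   # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in perturbed]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 2.0, 0.5]
first = gumbel_softmax(logits, 0.5, random.Random(2021))
second = gumbel_softmax(logits, 0.5, random.Random(2021))
print(first == second)   # same seed, same sample
```
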
26 Sep, 2021 5 commits
- [cherry-pick]CPU forward calculation replaces Eigen with Lapack (#35916) (#36091) · effb70f4
  Committed by crystal on Sep 26, 2021
```
cherry-pick #35916: replace Eigen with Lapack for the CPU forward computation and modify the linalg exposure rules
```
  effb70f4
- [cherry-pick] Add Det and Slogdet API to Release 2.2 (#36083) · ba2a1bb4
  Committed by Huihuang Zheng on Sep 26, 2021
```
This PR added det and slogdet API to release/2.2
It is cherry-pick from #34992 and #36013
```
  ba2a1bb4
- [cherry-pick] Add function comments and instructions to the Primitive API #36024 · 05621f7f
  Committed by niuliling123 on Sep 26, 2021
```
[cherry-pick] Add function comments and instructions to the Primitive API
```
  05621f7f
- [Cherry-Pick]Add paddle.linalg.solve OP (#35715) (#36056) · 6b4f2fbf
  Committed by Weilong Wu on Sep 26, 2021
```
This PR adds linalg.solve to Paddle's linear algebra module. One may call paddle.linalg.solve to use it.
```
  6b4f2fbf
- [NPU] add randperm_op_npu (#35763) (#36026) · df81915a
  Committed by ronnywang on Sep 26, 2021
```
* add randperm_op_npu

* fix test_set_value_op_npu
```
  df81915a
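
The paddle.linalg.solve commit in this group (6b4f2fbf) adds an operator for solving linear systems. As a reminder of what such an operator computes, here is a small plain-Python sketch that solves `A x = b` by Gaussian elimination with partial pivoting. The function name and list-based interface are assumptions for illustration only; Paddle's operator works on tensors and is not implemented this way.

```python
def solve(a, b):
    """Solve the square system a @ x = b, returning x as a list of floats."""
    n = len(a)
    # Augmented matrix [a | b], copied so the inputs are left untouched.
    aug = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x


print(solve([[3, 1], [1, 2]], [9, 8]))
```
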
25 Sep, 2021 1 commit
- temporarily fix the performance drop of recurrent op (#36053) · 33fbdafa
  Committed by baoachun on Sep 25, 2021
  
  33fbdafa
24 Sep, 2021 1 commit
- [cherry-pick] Replace Eigen with Lapack library for eigvals OP kernel (#35909) (#36038) · e9c04149
  Committed by From00 on Sep 24, 2021
```
This PR implements the kernel of "eigvals" OP with the Lapack library, which has a better performance than the previous Eigen library.
```
  e9c04149
23 Sep, 2021 3 commits
- [cherry-pick] FixEighOP; Unified MatrixEighFunctor function (#35812) (#35919) · 4629401e
  Committed by crystal on Sep 23, 2021
```
cherry-pick #35812: fix the Eigh OP
```
  4629401e
- add dilation check for conv (#35894) · 91f25ee3
  Committed by wangguanzhong on Sep 23, 2021
  
  91f25ee3
- op:transpose_op supports bool type (#35886) (#35926) · 95c100c1
  Committed by TeslaZhao on Sep 23, 2021
```
* Pass compat of conv_transpose_bias_mkldnn_fuse_pass

* Fix a bug of strided_slice op, about the axes parameter access memory out of bounds

* Fix a bug of transpose op, about accessing memory out of bounds of the perm param

* op:transpose_op supports bool type
```
  95c100c1
22 Sep, 2021 3 commits
- [Cherry-pick 2.2] Correct the return type of elementwise kernel to avoid many... · 0f344838
  Committed by Yiqun Liu on Sep 22, 2021
```
 [Cherry-pick 2.2] Correct the return type of elementwise kernel to avoid many compiling warnings. (#35839) (#35868)

Cherry-pick #35839
```
  0f344838
- [cherry-pick] [Inference] Support NNAdapter and ascend310 (#35882) · 2aaa417e
  Committed by Wilber on Sep 22, 2021
  
  2aaa417e
- [cherry-pick2.2]support extern third_party lapack API on Linux/Windows/Mac (#35897) · fb8be035
  Committed by zhouweiwei2014 on Sep 22, 2021
```
ATT, cherry-pick #35690
```
  fb8be035
18 Sep, 2021 3 commits

Committed by Feiyu Chan on Sep 18, 2021

* 1. add interface for fft;
2. add data type predicate;
3. fix paddle.roll.

* add fft c2c cufft kernel

* implement argument checking & op calling parts for fft_c2c and fftn_c2c

* add operator and opmaker definitions

* only register float and double for cpu.

* add common code for implementing FFT, add pocketfft as a dependency

* add fft c2c cufft kernel function

* fix bugs in python interface

* add support for c2r, r2c operators, op makers, kernels and kernel functors.

* test and fix bugs

* 1. fft_c2c function: add support for onesided=False;
2. add complex<float>, complex<double> support for concat and flip.

* 1. fft: fix python api bugs;
2. shape_op: add support for complex data types.

* fft c2c cufft kernel done with compile and link

* fix shape_op, add mkl placeholder

* remove mkl

* complete fft c2c in gpu

* 1. implement mkl-based fft, FFTC2CFunctor and common function exec_fft;
2. change the design, add input and output typename as template parameter for all FFTFunctors, update pocketfft-based implementation.

* complete fft c2c on gpu in ND

* complete fft c2c on gpu in ND

* complete fft c2c backward in ND

* fix MKL-based implementation

* Add frame op and CPU/GPU kernels.

* Add frame op forward unittest.

* Add frame op forward unittest.

* Remove axis parameter in FrameFunctor.

* Add frame op grad CPU/GPU kernels and unittest.

* Add frame op grad CPU/GPU kernels and unittest.

* Update doc string.

* Update after review and remove librosa requirement in unittest.

* Update grad kernel.

* add fft_c2r op

* Remove data allocation in TransCompute function.

* add fft r2c onesided with cpu(pocketfft/mkl) and gpu

* last fft c2r functor

* fix C2R and R2C for cufft, because the direction is not an option in these cases.

* add fft r2c onesided with cpu(pocketfft/mkl) and gpu

* fix bugs in python APIs

* fix fft_c2r grad kernel

* fix bugs in python APIs

* add cuda fft c2r grad kernel functor

* clean code

* fix fft_c2r python API

* fill fft r2c result with conjugate symmetry (#19)

fill fft r2c result with conjugate symmetry

* add placeholder for unittests (#24)

* simply parameterize test functions by auto-generating test cases from a param list (#25)

* miscellaneous fixes for python APIs (#26)

* add placeholder for unittests

* resize fft inputs before computation if n or s is provided.

* add complex kernels for pad and pad_grad

* simplify argument checking.

* add type promotion

* add int to float or complex promotion

* fix output data type for static mode

* fix fft's input dtype dispatch, import fft to paddle

* fix typos in axes checking (#27)

* fix typos in axes checking

* fix argument checking (#28)

* fix argument checking

* Add C2R Python layer normal and abnormal use cases (#29)

* documents and single case

* test c2r case

* New C2R Python layer normal and exception use cases

* complete rfft,rfft2,rfftn,ihfft,ihfft2,ihfftn unittest and doc string (#30)

* Documentation of the common interfaces of c2r and c2c (#31)

* Documentation of the common interfaces of c2r and c2c

* clean c++ code  (#32)

* clean code

* Add numpy-based implementation of spectral ops (#33)

* add numpy reference implementation of spectral ops

* Add fft_c2r numpy based implementation for unittest. (#34)

* add fft_c2r numpy implementation

* Add deframe op and stft/istft api. (#23)

* Add frame api

* Add deframe op and kernels.

* Add stft and istft apis.

* Add deframe api. Update stft and istft apis.

* Fix bug in frame_from_librosa function when input dims >= 3

* Rename deframe to overlap_add.

* Update istft.

* Update after code review.

* Add overlap_add op and stft/istft api unittest (#35)

* Add overlap_add op unittest.

* Register complex kernels of squeeze/unsqueeze op.

* Add stft/istft api unittest.

* Add unittest for fft helper functions (#36)

* add unittests for fft helper functions. add complex kernel for roll op.

* complete static graph unittest for all public api (#37)

* Unittest of op with FFT C2C, C2R and r2c added (#38)

* documents and single case

* test c2r case

* New C2R Python layer normal and exception use cases

* Documentation of the common interfaces of c2r and c2c

* Unittest of op with FFT C2C, C2R and r2c added
Co-authored-by: lijiaqi <lijiaqi0612@163.com>

* add fft related options to CMakeLists.txt

* fix typos and clean code (#39)

* fix invisible character in mkl branch and fix error in error message

* clean code: remove docstring from unittest for signal.py.

* always convert numpy array to paddle.Tensor to avoid comparing numpy dtype with paddle dtype. (#40)

* always convert numpy array to paddle.Tensor to avoid comparing numpy dtype with paddle dtype.

* fix CI Errors: numpy dtype comparison, thrust when cuda is not available (#41)

1. always convert numpy array to paddle.Tensor to avoid comparing numpy dtype with paddle dtype.
2. promote floating point tensor to complex tensor for fft_c2c and fft_c2r;
3. fix unittest to catch UnImplementedError and RuntimeError;
4. fix compile error by avoiding thrust when cuda is not available;
5. fix sample code, use paddle.fft instead of paddle.tensor.fft

* remove inclusion of thrust, add __all__ list for fft (#42)

* Add api doc and update unittest. (#43)

* Add doc strings.
* Update overlap_add op unittest

* fix MKL-based FFT implementation (#44)

* fix MKL-based FFT implementation, MKL CDFT's FORWARD DOMAIN is always REAL for R2C and C2R

* remove code for debug (#45)

* use dynload for cufft (#46)

* use std::ptrdiff_t as datatype of stride (instead of int64_t) to avoid argument mismatch on some platforms.

* add complex support for fill_zeros_like

* use dynload for cufft

* Update doc and unittest. (#47)

* Add doc of frame op and overlap_add op.

* Update unittest.

* use dynload for cufft (#48)

1. use dynload for cufft
2. fix unittest;
3. temporarily disable Rocm.

* fix conflicts and merge upstream (#49)

fix conflicts and merge upstream

* fix compile error: only link dynload_cuda when cuda is available (#50)

* fix compile error: only link dynload_cuda when cuda is available

* fix dynload for cufft on windows (#51)

1. fix dynload for cufft on windows;
2. fix unittests.

* add NOMINMAX to compile on windows (#52)

 add NOMINMAX to compile on windows

* explicitly specify capture mode for lambdas (#55)

 explicitly specify capture mode for lambdas

* fix fft sample (#53)

* fix fft sample

* update scipy and numpy version for unittests of fft (#56)

update scipy and numpy version for unittests of fft

* Add static graph unittests of frame and overlap_add api. (#57)

* Remove cache of cuFFT & Disable ONEMKL (#59)

1. replace numpy.fft with scipy.fft as numpy<1.20 does not support ortho norm;
2. remove cache of cufft plans;
3. enhance error checking.
4. default WITH_ONEMKL to OFF
Co-authored-by: jeff41404 <jeff41404@gmail.com>
Co-authored-by: root <root@bjyz-sys-gpu-kongming9.bjyz.baidu.com>
Co-authored-by: KP <109694228@qq.com>
Co-authored-by: lijiaqi <lijiaqi0612@163.com>
Co-authored-by: Xiaoxu Chen <chenxx_id@163.com>
Co-authored-by: lijiaqi0612 <33169170+lijiaqi0612@users.noreply.github.com>

11518a43
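
One step in the FFT commit above is "fill fft r2c result with conjugate symmetry": for real input, a one-sided real-to-complex transform keeps only the first n//2 + 1 bins, and the remaining bins follow from X[n-k] = conj(X[k]). The sketch below demonstrates that relation with a naive O(n^2) DFT from the standard library. The function names are hypothetical, and the code is unrelated to the pocketfft, MKL or cuFFT back ends the commit actually uses.

```python
import cmath


def dft_bins(x, bins):
    """Naive DFT of x, evaluated only for the requested frequency bins."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in bins]


def rfft_onesided(x):
    """One-sided spectrum of a real signal: bins 0 .. n//2."""
    return dft_bins(x, range(len(x) // 2 + 1))


def fill_conjugate_symmetry(half, n):
    """Rebuild all n bins from the one-sided result via X[n-k] = conj(X[k])."""
    full = list(half)
    for k in range(len(half), n):
        full.append(half[n - k].conjugate())
    return full


signal = [1.0, 2.0, 0.0, -1.0, 3.0]
half = rfft_onesided(signal)
full = fill_conjugate_symmetry(half, len(signal))
reference = dft_bins(signal, range(len(signal)))
print(max(abs(a - b) for a, b in zip(full, reference)))  # difference is rounding noise
```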

[oneDNN] Disable caching of Reorder operation (#35664) · e4c2a854

Committed by Jacek Czaja on Sep 18, 2021

* - Reorder disabling caching

* - compilation fix

* - another compilation fix

* - another compilation fix

* - compilation fix

* - Fix

* - yet another compilation fix

* - surprisingly another compilation fix

* - lint

* - fix after review

* - fix

e4c2a854

Add new API "eigvals" in linalg (#35720) · d411a038

Committed by From00 on Sep 18, 2021

* Add linalg.eigvals API

* pre-commit check

* Adjust code style

* Fix conflict

* Improve code style

* Modify the test code to ignore testing CUDA kernel

* Sort output data before checking in test code

* Set timeout value for UT

* Improve API example code to pass CI

* Fix bug for None fetch_list in Windows

* Delete grad Op

d411a038

17 Sep, 2021 6 commits
- Disabled oneDNN reshape1/2 and squeeze1/2 kernels (#35781) · 0eaab803
  Committed by jakpiase on Sep 17, 2021
```
* disabled matmul_v2 grad

* Revert "disabled matmul_v2 grad"

This reverts commit b569bcef162116ca9f7963f3975b4a412f9e8555.

* reverted disabling matmul_v2, disabled reshape and squeeze
```
  0eaab803
- Make flag adding easier (#35823) · 2c781455
  Committed by Zeng Jinle on Sep 17, 2021
```
* make flag setter easier

* update

* rename macro name

* fix bug of public/writable

* update to pass CI

* polish

* fix CPU link error
```
  2c781455
- broadcast qkv_op (#35780) · cf9eae4c
  Committed by feng_shuai on Sep 17, 2021
```
* broadcast qkv_op

* use PADDLE_ENFORCE_GT to replace assert
```
  cf9eae4c
- add a fusion op: fused_layernorm_residual_dropout_bias (#35151) · 7975dfcf
  Committed by zhangkaihuo on Sep 17, 2021
```
Fuse elementwise_add, dropout, elementwise_add and layer_norm into one operator; only the forward pass is supported.
No Python API changed.
```
  7975dfcf
- fix the memory leak for the static.auc · 0fd09fdf
  Committed by wawltor on Sep 17, 2021
```
fix the memory leak for the static.auc 
```
  0fd09fdf
- refine matrix_rank op code and doc (#35722) · 28fffef6
  Committed by 0x45f on Sep 17, 2021
  
  28fffef6
