提交 · 799f3861d17d8c1cfa6f1669d7bb47309dcfe77a · Crayon鑫 / Paddle

21 9月, 2021 1 次提交

Reuse OneDNN handler for SGD and SUM for SelectedRows input tensors. (#35510) · 799f3861

由 Adam Osewski 提交于 9月 20, 2021

* Create stateful OneDNNAXPYHandler object.

This makes it possible to call it multiple times without recreating the
oneDNN primitives every time.

* Prepare SGDOpKernel to reuse its implementation from OneDNN kernel.

* OneDNN SGD kernel.

* Update call to use new OneDNNAXPYHandler object api.

* Setup seed in proper place.

* Enable OneDNN kernel only for single case.

* For dense param and sparse grad.

* Small refactor.

* Enable oneDNN by op attr or by cmd line flag.

* Use int64_t type for number of elements.

* Support dense param and grad from OneDNN kernel.

* Enable SGD OneDNN kernel when use MP BF16 optimizer.

* Force non-copyable/movable OneDNNAXPYHandler.

* Reuse OneDNNAXPYHandler for spare tensors in SUM op.

* Fix SFINAE rules.

* Remove recording event inside AXPY.

* Get rid of internal primitive caching.

* Stop use PP cache mechanims to store mem and primitive obj.
* Handler obj store and reuse needed desc & prim

* Do not derive from MKLDNNHandlerT

799f3861

19 9月, 2021 1 次提交

Optimization of pool2d grad (#35389) · 86685190

由 limingshu 提交于 9月 19, 2021

* Optimization of pool2d grad, first commit.

* remove useless print codes

* refine codes

* refine codes

* seal more operation into template specialization

* fix template struct error in MaxPool2dGrad.

* Fix header including error

* refine code with comment

* Seal the param-preparation codes into function for common use.

* Seal the param-preparation codes into function for common use.

* Seal the param-preparation into funciton and make it common for other kernels

* polish code and erase useless template speicalization

* Rerun triger

* rerun trigger

86685190

18 9月, 2021 2 次提交

C

FixEighOP; Unified MatrixEighFunctor function (#35812) · da441363
由 crystal 提交于 9月 18, 2021

da441363

由 Feiyu Chan 提交于 9月 18, 2021

* 1. add interface for fft;
2. add data type predicate;
3. fix paddle.roll.

* add fft c2c cufft kernel

* implement argument checking & op calling parts for fft_c2c and fftn_c2c

* add operator and opmaker definitions

* only register float and double for cpu.

* add common code for implementing FFT, add pocketfft as a dependency

* add fft c2c cufft kernel function

* fix bugs in python interface

* add support for c2r, r2c operators, op makers, kernels and kernel functors.

* test and fix bugs

* 1. fft_c2c function: add support for onesided=False;
2. add complex<float>, complex<double> support for concat and flip.

* 1. fft: fix python api bugs;
2. shape_op: add support for complex data types.

* fft c2c cufft kernel done with complie and link

* fix shape_op, add mkl placeholder

* remove mkl

* complete fft c2c in gpu

* 1. implement mkl-based fft, FFTC2CFunctor and common function exec_fft;
2. change the design, add input and output typename as template parameter for all FFTFunctors, update pocketfft-based implementation.

* complete fft c2c on gpu in ND

* complete fft c2c on gpu in ND

* complete fft c2c backward in ND

* fix MKL-based implementation

* Add frame op and CPU/GPU kernels.

* Add frame op forward unittest.

* Add frame op forward unittest.

* Remove axis parameter in FrameFunctor.

* Add frame op grad CPU/GPU kernels and unittest.

* Add frame op grad CPU/GPU kernels and unittest.

* Update doc string.

* Update after review and remove librosa requirement in unittest.

* Update grad kernel.

* add fft_c2r op

* Remove data allocation in TransCompute function.

* add fft r2c onesided with cpu(pocketfft/mkl) and gpu

* last fft c2r functor

* fix C2R and R2C for cufft, becase the direction is not an option in these cases.

* add fft r2c onesided with cpu(pocketfft/mkl) and gpu

* fix bugs in python APIs

* fix fft_c2r grad kernal

* fix bugs in python APIs

* add cuda fft c2r grad kernal functor

* clean code

* fix fft_c2r python API

* fill fft r2c result with conjugate symmetry (#19)

fill fft r2c result with conjugate symmetry

* add placeholder for unittests (#24)

* simple parameterize test function by auto generate test case from parm list (#25)

* miscellaneous fixes for python APIs (#26)

* add placeholder for unittests

* resize fft inputs before computation is n or s is provided.

* add complex kernels for pad and pad_grad

* simplify argument checking.

* add type promotion

* add int to float or complex promotion

* fix output data type for static mode

* fix fft's input dtype dispatch, import fft to paddle

* fix typos in axes checking (#27)

* fix typos in axes checking

* fix argument checking (#28)

* fix argument checking

* Add C2R Python layer normal and abnormal use cases (#29)

* documents and single case

* test c2r case

* New C2R Python layer normal and exception use cases

* complete rfft,rfft2,rfftn,ihfft,ihfft2,ihfftn unittest and doc string (#30)

* Documentation of the common interfaces of c2r and c2c (#31)

* Documentation of the common interfaces of c2r and c2c

* clean c++ code  (#32)

* clean code

* Add numpy-based implementation of spectral ops (#33)

* add numpy reference implementation of spectral ops

* Add fft_c2r numpy based implementation for unittest. (#34)

* add fft_c2r numpy implementation

* Add deframe op and stft/istft api. (#23)

* Add frame api

* Add deframe op and kernels.

* Add stft and istft apis.

* Add deframe api. Update stft and istft apis.

* Fix bug in frame_from_librosa function when input dims >= 3

* Rename deframe to overlap_add.

* Update istft.

* Update after code review.

* Add overlap_add op and stft/istft api unittest (#35)

* Add overlap_add op unittest.

* Register complex kernels of squeeze/unsquuze op.

* Add stft/istft api unittest.

* Add unittest for fft helper functions (#36)

* add unittests for fft helper functions. add complex kernel for roll op.

* complete static graph unittest for all public api (#37)

* Unittest of op with FFT C2C, C2R and r2c added (#38)

* documents and single case

* test c2r case

* New C2R Python layer normal and exception use cases

* Documentation of the common interfaces of c2r and c2c

* Unittest of op with FFT C2C, C2R and r2c added
Co-authored-by: lijiaqi <lijiaqi0612@163.com>

* add fft related options to CMakeLists.txt

* fix typos and clean code (#39)

* fix invisible character in mkl branch and fix error in error message

* clean code: remove docstring from unittest for signal.py.

* always convert numpy array to paddle.Tensor to avoid comparing numpy dtype with paddle dtype. (#40)

* always convert numpy array to paddle.Tensor to avoid comparing numpy dtype with paddle dtype.

* fix CI Errors: numpy dtype comparison, thrust when cuda is not available (#41)

1. always convert numpy array to paddle.Tensor to avoid comparing numpy dtype with paddle dtype.
2. promote floating point tensor to complex tensor ior fft_c2c and fft_c2r;
3. fix unittest to catch UnImplementedError and RuntimeError;
4. fix compile error by avoid using thrust when cuda is not available.
5.  fix sample code, use paddle.fft instead of paddle.tensor.fft

* remove inclusion of thrust, add __all__ list for fft (#42)

* Add api doc and update unittest. (#43)

* Add doc strings.
* Update overlap_add op unittest

* fix MKL-based FFT implementation (#44)

* fix MKL-based FFT implementation, MKL CDFT's FORWARD DOMAIN is always REAL for R2C and C2R

* remove code for debug (#45)

* use dynload for cufft (#46)

* use std::ptrdiff_t as datatype of stride (instead of int64_t) to avoid argument mismatch on some platforms.

* add complex support for fill_zeros_like

* use dynload for cufft

* Update doc and unittest. (#47)

* Add doc of frame op and overlap_add op.

* Update unittest.

* use dynload for cufft (#48)

1. use dynload for cufft
2. fix unittest;
3. temporarily disable Rocm.

* fix conflicts and merge upstream (#49)

fix conflicts and merge upstream

* fix compile error: only link dyload_cuda when cuda is available (#50)

* fix compile error: only link dyload_cuda when cuda is available

* fix dynload for cufft on windows (#51)

1. fix dynload for cufft on windows;
2. fix unittests.

* add NOMINMAX to compile on windows (#52)

 add NOMINMAX to compile on windows

* explicitly specify capture mode for lambdas (#55)

 explicitly specify capture mode for lambdas

* fix fft sample (#53)

* fix fft sample

* update scipy and numpy version for unittests of fft (#56)

update scipy and numpy version for unittests of fft

* Add static graph unittests of frame and overlap_add api. (#57)

* Remove cache of cuFFT & Disable ONEMKL (#59)

1. replace numpy.fft with scipy.fft as numpy<1.20 not support ortho norm
2. remove cache of cufft plans;
3. enhance error checking.
4. default WITH_ONEMKL to OFF
Co-authored-by: Njeff41404 <jeff41404@gmail.com>
Co-authored-by: Nroot <root@bjyz-sys-gpu-kongming9.bjyz.baidu.com>
Co-authored-by: NKP <109694228@qq.com>
Co-authored-by: lijiaqi <lijiaqi0612@163.com>
Co-authored-by: NXiaoxu Chen <chenxx_id@163.com>
Co-authored-by: Nlijiaqi0612 <33169170+lijiaqi0612@users.noreply.github.com>

11518a43

16 9月, 2021 1 次提交
- C
  
  Add CPU and GPU eigh op implementation (#34990) · 07d0b834
  由 crystal 提交于 9月 16, 2021
  
  07d0b834
15 9月, 2021 1 次提交

[NPU] add beam_search npu op (#34860) · 3760be06

由 pangyoki 提交于 9月 15, 2021

* add beam_search npu op

* fix CMakeList and add unittest

* fix bug of beam search npu op

* fix unittest

* let input ids become int64

* set output ids to int64_t

* delete check_dygraph

* fix beam_width=1

3760be06

13 9月, 2021 2 次提交

Z

Support int16_t in fill_constant_op (#35619) · 4b6f8099
由 Zhang Zheng 提交于 9月 13, 2021

4b6f8099

Add searchsorted op (#35159) · 66223048

由 Yanxing Shi 提交于 9月 13, 2021

* fix github name

* fix CI error

* fix review and CI error

* fix inf,nan error and modify unittest samples

* add unittest samples

* add unittest samples

* fix unittest error

* test=document_fix

* test=document_fix

* modify doc and add unittest samples

* fix error newline in constant

* modify doc after mentor review

* modify __all__ and doc

* modify doc

66223048

10 9月, 2021 1 次提交

add cumprod op (#35185) · 4e509f46

由 hlygit66666 提交于 9月 10, 2021

* add test_cumprod_op

* Revert "add test_cumprod_op"

This reverts commit c96cf6dff5d09ae7d8cc72c1e8ae4369a153aa19.

* recommit

* add error message

* test input(x) initialize

* test use cpu

* update test code

* add test type

* add test case

* solve ci problem

* add complex case test

* add complex case test

* fix review problem

* fix conflict

* fix some docs

* change test case

* change test case

* fix review problems again

* fix docs

* fix inclusivescan bug

4e509f46

08 9月, 2021 2 次提交

C

Add FP16 PRelu (#35532) · 4e62af80
由 cc 提交于 9月 08, 2021

4e62af80

merge CMakeList.txt manual (#35378) · c4a3e8b4

由 feng_shuai 提交于 9月 08, 2021

* merge CMakeList.txt manual

* add platform for changethreadnum

* repair some bugs according to make error

* do nothing just flush CI

* forget change thread num

* add inplace_atol param for check_output_with_place

* Windows

* std:min and std::max should be change because of windows

c4a3e8b4

01 9月, 2021 1 次提交
- W
  Stablize depthwise conv (#35161) · 3c21f26b
  由 wangguanzhong 提交于 9月 01, 2021
```
* stablize depthwise conv

* clean commend
```
  3c21f26b
27 8月, 2021 1 次提交

Add unpool2d op & Expose max_unpool2d API (#35056) · ceee71a0

由 xiaoting 提交于 8月 27, 2021

* add maxunppol2d op, test=develop

* fix typo, test=develop

* fix unpool unitest, test=develop

* fix unpool code-example, test=develop

* fix for unpool_op_unittest,test=develop

* fix example code, test=develop

* add noqa:F401, test=develop

* fix converage, test=develop

* fix unitest for unpool, test=develop

* rename unpool2d to unpool, test=develop

* rename unpool2d to unpool, test=develop

ceee71a0

17 8月, 2021 1 次提交

Align CTC grad scale same with ESPNet (#34729) · 10f9644c

由 Hui Zhang 提交于 8月 16, 2021

* dygraph support more ctc grad scale

* scale for 1.x

* fix unitest

* fix unitest

* format code

* fix unittest

* fix log info

* unittest cov

* fix format;notest,test=cpu,coverage

* skip ctc_loss egs;test=cpu

* warpctc grad cov;test=coverage

* add dygraph test;test=coverage

* format;test=cpu,coverage

* format;test=cpu

* add api compat;test=cpu

* add cpu test

* rename

* rename

* fix

* fix test

* format

* eigen cpu

* eigen gpu grad pass

* cuda gpu pass

* format

* fix ci

10f9644c

12 8月, 2021 1 次提交

Fix safety-bug of functional.linear (#34696) · 0e28c8bb

由 zhulei 提交于 8月 12, 2021

* Fix safety-bug of functional.linear

* Fix safety-bug of functional.linear

* Fix safety-bug of functional.linear

* Fix safety-bug of functional.linear

0e28c8bb

11 8月, 2021 1 次提交
- W
  
  miss format (#34771) · addd5fce
  由 wenbin 提交于 8月 11, 2021
  
  addd5fce
09 8月, 2021 1 次提交
- L
  
  fix split on empty tensor (#34356) · 898acb1a
  由 Leo Chen 提交于 8月 09, 2021
  
  898acb1a
22 7月, 2021 2 次提交
- W
  
  fix concat bug (#34319) · c342651e
  由 wuhuachaocoding 提交于 7月 22, 2021
  
  c342651e
- C
  Add int16 kernel for lookup_talbe and dequantize_abs_max op (#34275) · 85e531a9
  由 cc 提交于 7月 22, 2021
```
* add int16 kernel for lookup_talbe and dequantize_abs_max op
```
  85e531a9
09 7月, 2021 1 次提交

Use CBLAS for SelectedRows elementwise add operation. (#34008) · 1412d3bc

由 arlesniak 提交于 7月 09, 2021

* Use CBLAS for SelectedRows elementwise add operation. It's faster.

* template compilation fix

* reverted template compilation fix

* slimmed template compilation fix
Co-authored-by: NAdam Osewski <adam.osewski@intel.com>

1412d3bc

07 7月, 2021 1 次提交
- X
  
  [HIP] 解决hipMemcpy无法overlap的问题，修改后AMD GPU性能提升大于10% (#33982) · 20da7703
  由 xiayanming 提交于 7月 07, 2021
  
  20da7703
06 7月, 2021 1 次提交
- L
  
  Optimize the forward of log_softmax for the case when axis is not the last dimention. (#32396) · 69ffb386
  由 Lijunhui 提交于 7月 06, 2021
  
  69ffb386
05 7月, 2021 1 次提交
- W
  
  Add fused elemwise gelu and optimize performance (#33480) · eae31856
  由 WangXi 提交于 7月 05, 2021
  
  eae31856
22 6月, 2021 1 次提交
- Z
  
  fix gpt2 train loss Nan problem (#33658) · 687571f2
  由 zhiboniu 提交于 6月 22, 2021
  
  687571f2
21 6月, 2021 1 次提交

Add AXPY oneDNN handler (#33632) · 773aabc7

由 lidanqing 提交于 6月 21, 2021

* Add oneDNN AXPY handler.

* Add fallback for small tensors.

* Fix ifdefs

* Remove unnecessary namespace prefixes and add missing headers.

* Guard handler_axpy with proper ifdefs.

* Compilation of this function is possible only when Paddle is not build
with CUDA nor HIP.

* Move AXPY handler code to separate files.

* Use oneDNN AXPY handler in SGD op.

* Use axpy handler only when Paddle is built with oneDNN.

* Add test for SUM BF16 with big rows.

* Fix SFINAE rules for elementwise_add_to.

* Add test case for SGD with big rows.

* update

* update
Co-authored-by: NAdam Osewski <adam.osewski@intel.com>

773aabc7

05 6月, 2021 1 次提交
- W
  
  [Paddle-TRT] Add gather_nd and reduce_sum trt op. (#33324) · d194bd3a
  由 Wilber 提交于 6月 05, 2021
  
  d194bd3a
02 6月, 2021 1 次提交
- W
  
  fix iScan C++ problems, test=develop (#33274) · 1b10ccdb
  由 wuhuanzhou 提交于 6月 02, 2021
  
  1b10ccdb
01 6月, 2021 1 次提交

replace and remove complex64/128 types in custom OP and other files (#33195) · 06c63ca0

由 chentianyu03 提交于 6月 01, 2021

* replace and remove complex64/128 types in custom OP and other files

* fix custom_tensor_test fail bug

* fix custom_conj_test fail bug

* fix dispatch_test_op build fail bug

06c63ca0

26 5月, 2021 2 次提交

C
modify matmul Op to complex template types (#33130) · 6c07cd7e
由 chentianyu03 提交于 5月 26, 2021
```
* modify matmul Op to complex template types

* remove complex64/128 head file
```
6c07cd7e

optimize OP's compilation time (#32617) · 78ecb668

由 wuhuanzhou 提交于 5月 26, 2021

* optimize OP's compilation time, test=develop

* add more op and run ci test, test=develop

* CUDA Kernel register in cc file, test=develop

* fix macros, test=develop

* fix undefined symbol error, test=develop

* fix compilation error and undefined symbol, test=develop

* fix compilation error on Windows, test=develop

* fix compilation error on Windows, test=develop

78ecb668

25 5月, 2021 1 次提交

modify Ops to complex template (#33041) · 5fa44c34

由 chentianyu03 提交于 5月 25, 2021

* modify conj, real, imag OP to complex template

* replace with complex template to dot Op

* replace with complex template to Abs Op

* add support for complex64 and complex128

5fa44c34

20 5月, 2021 1 次提交

Add complex template type (#32857) · 738bf20e

由 chentianyu03 提交于 5月 20, 2021

* add complex template file

* add numtraits for complex template

* add complex template type register

* modify specify template of complex

* modify specify template of complex

* modify specify template of complex

* modify specify template of complex

* make TensorCheckerVisitor support complex type

* fix operator= error

* add complex template

* add complex template type

* add complex template type to pyarray transform

* add complex template type to pyarray transform

* remove complex type for dlpack register

* set dlpack supprot complex type

* set dlpack supprot complex type

* set dlpack supprot complex type

* remove explict for complex constructor

* add complex unit test file

738bf20e

12 5月, 2021 1 次提交
- L
  
  [NPU] Support npu pinned allocator and manage Tensor on NPUPinnedPlace (#32840) · 6b3bb796
  由 liym27 提交于 5月 12, 2021
  
  6b3bb796
06 5月, 2021 2 次提交

[ROCM] bugfix for unittest (#32392) · 31392627

由 ronnywang 提交于 5月 06, 2021

* fix test_unpool_op

* fix test_inplace_addto_strategy

* fix test_conv2d_fusion_op

* fix test_imperative_lod_tensor_to_selected_rows, test_imperative_selected_rows_to_lod_tensor

* fix test_dot_op

* fix test_correlation_op

* fix tracer

* fix test_memcpy_op

31392627

A

Sum kernel for CPU supporting BF16 and SelectedRows (#32631) · 9599c3b3
由 Adam Osewski 提交于 5月 06, 2021

9599c3b3

27 4月, 2021 1 次提交
- Z
  [OPs] Bug fix, fix the segment mean for illegal syncthreads usage. (#32596) · 1afe1ac9
  由 Zhong Hui 提交于 4月 27, 2021
```
* [OPs] Bug fix, fix the segment mean for illegal syncthreads usage.
```
  1afe1ac9
19 4月, 2021 1 次提交
- J
  
  Add BF16 Constant Initializer and support for other initializer (#31935) · 76cb83e8
  由 joanna.wozna.intel 提交于 4月 19, 2021
  
  76cb83e8
14 4月, 2021 1 次提交

fix matrix_inverse_op with rocm (#32128) · 995b5f2c

由 zhulei 提交于 4月 14, 2021

* fix matrix_inverse_op with rocm

* fix matrix_inverse_op with rocm

* fix matrix_inverse_op with rocm

* fix matrix_inverse_op with rocm

995b5f2c

13 4月, 2021 1 次提交
- Q
  
  [ROCM] fix depth conv2d in rocm, test=develop (#32170) · 693c7629
  由 Qi Li 提交于 4月 13, 2021
  
  693c7629
09 4月, 2021 1 次提交
- N
  make high precision for avg_pool and adaptive_avg_pool when data_type is float16 (#31887) · ec2ffb68
  由 niuliling123 提交于 4月 09, 2021
```
* make high precision for avg_pool
```
  ec2ffb68

Crayon鑫 / Paddle 与 Fork 源项目一致

Crayon鑫 / Paddle
与 Fork 源项目一致