提交 · 616ce203e2a3d55d540cdff1ca929f6db499b88a · PaddlePaddle / Paddle

26 10月, 2021 3 次提交
- Y
  
  [Cherry-pick] Add the forward QR operator (#36627) · 616ce203
  由 Yulong Ao 提交于 10月 26, 2021
  
  616ce203
- Y
  add slot record support for GpuPS (#36723) · 53480c9c
  由 yaoxuefeng 提交于 10月 26, 2021
```
* add slotrecord datafeed (#36099)

* fix multi-node (#36329)
```
  53480c9c
- Y
  
  add slot record dataset (#36200) (#36710) · 3fbb6644
  由 yaoxuefeng 提交于 10月 26, 2021
  
  3fbb6644
19 10月, 2021 1 次提交

[cherry-pick]Add sparse attention cherrypick (#36447) · 36edb0e1

由 Liu-xiandong 提交于 10月 19, 2021

The code of this PR can only support CUDA 11.2. Currently, CI does not have GPU with CUDA 11.2 , and all tests will be skipped automatically.

The new OP is paddle._C_ops.sparse_attention. Regarding the work of the python API, it will be resolved in a follow-up PR.

The code of this PR lacks tests on dynamic graphs and static graphs, and will be added in subsequent PRs.

36edb0e1

27 9月, 2021 1 次提交

Add paddle.device.cuda.get_device_properties (#35875) · cea0bc26

由 Yanxing Shi 提交于 9月 27, 2021

* Initial Commit

* fix py2 error

* fix wrong words and doc

* test=document_fix

* fix _gpuDeviceProperties

cea0bc26

26 9月, 2021 2 次提交
- C
  [cherry-pick]CPU forward calculation replaces Eigen with Lapack (#35916) (#36091) · effb70f4
  由 crystal 提交于 9月 26, 2021
```
cherry-pick #35916，CPU前向计算将Eigen替换为Lapack，修改linalg暴露规则
```
  effb70f4
- W
  [Cherry-Pick]Add paddle.linalg.solve OP (#35715) (#36056) · 6b4f2fbf
  由 Weilong Wu 提交于 9月 26, 2021
```
This PR supports linalg.solve calculation for linear algorithm module of Paddle. One may call paddle.linalg.solve to use it.
```
  6b4f2fbf
24 9月, 2021 3 次提交
- F
  [cherry-pick] Replace Eigen with Lapack library for eigvals OP kernel (#35909) (#36038) · e9c04149
  由 From00 提交于 9月 24, 2021
```
This PR implements the kernel of "eigvals" OP with the Lapack library, which has a better performance than the previous Eigen library.
```
  e9c04149
- H
  Basic PR on Cost Model (#35774) (#35915) · efcd108d
  由 Huihuang Zheng 提交于 9月 24, 2021
```
Add basic Cost Model, it uses executor to run program and profile it to get op time.

This is an early basic version, we will add more functions in the future.
```
  efcd108d
- L
  [cherry-pick] fix cusparse compile bug in windows CUDA11.2, test=release/2.2 (#36015) · 0e19aeb9
  由 Liu-xiandong 提交于 9月 24, 2021
```
解决Windows中CUDA11.2编译出错的问题。
cherry-pick #35941
```
  0e19aeb9
22 9月, 2021 1 次提交
- [cherry-pick2.2]support extern third_party lapack API on Linux/Windows/Mac (#35897) · fb8be035
  由 zhouweiwei2014 提交于 9月 22, 2021
```
ATT, cherry-pick #35690
```
  fb8be035
18 9月, 2021 3 次提交

由 Feiyu Chan 提交于 9月 18, 2021

* 1. add interface for fft;
2. add data type predicate;
3. fix paddle.roll.

* add fft c2c cufft kernel

* implement argument checking & op calling parts for fft_c2c and fftn_c2c

* add operator and opmaker definitions

* only register float and double for cpu.

* add common code for implementing FFT, add pocketfft as a dependency

* add fft c2c cufft kernel function

* fix bugs in python interface

* add support for c2r, r2c operators, op makers, kernels and kernel functors.

* test and fix bugs

* 1. fft_c2c function: add support for onesided=False;
2. add complex<float>, complex<double> support for concat and flip.

* 1. fft: fix python api bugs;
2. shape_op: add support for complex data types.

* fft c2c cufft kernel done with complie and link

* fix shape_op, add mkl placeholder

* remove mkl

* complete fft c2c in gpu

* 1. implement mkl-based fft, FFTC2CFunctor and common function exec_fft;
2. change the design, add input and output typename as template parameter for all FFTFunctors, update pocketfft-based implementation.

* complete fft c2c on gpu in ND

* complete fft c2c on gpu in ND

* complete fft c2c backward in ND

* fix MKL-based implementation

* Add frame op and CPU/GPU kernels.

* Add frame op forward unittest.

* Add frame op forward unittest.

* Remove axis parameter in FrameFunctor.

* Add frame op grad CPU/GPU kernels and unittest.

* Add frame op grad CPU/GPU kernels and unittest.

* Update doc string.

* Update after review and remove librosa requirement in unittest.

* Update grad kernel.

* add fft_c2r op

* Remove data allocation in TransCompute function.

* add fft r2c onesided with cpu(pocketfft/mkl) and gpu

* last fft c2r functor

* fix C2R and R2C for cufft, becase the direction is not an option in these cases.

* add fft r2c onesided with cpu(pocketfft/mkl) and gpu

* fix bugs in python APIs

* fix fft_c2r grad kernal

* fix bugs in python APIs

* add cuda fft c2r grad kernal functor

* clean code

* fix fft_c2r python API

* fill fft r2c result with conjugate symmetry (#19)

fill fft r2c result with conjugate symmetry

* add placeholder for unittests (#24)

* simple parameterize test function by auto generate test case from parm list (#25)

* miscellaneous fixes for python APIs (#26)

* add placeholder for unittests

* resize fft inputs before computation is n or s is provided.

* add complex kernels for pad and pad_grad

* simplify argument checking.

* add type promotion

* add int to float or complex promotion

* fix output data type for static mode

* fix fft's input dtype dispatch, import fft to paddle

* fix typos in axes checking (#27)

* fix typos in axes checking

* fix argument checking (#28)

* fix argument checking

* Add C2R Python layer normal and abnormal use cases (#29)

* documents and single case

* test c2r case

* New C2R Python layer normal and exception use cases

* complete rfft,rfft2,rfftn,ihfft,ihfft2,ihfftn unittest and doc string (#30)

* Documentation of the common interfaces of c2r and c2c (#31)

* Documentation of the common interfaces of c2r and c2c

* clean c++ code  (#32)

* clean code

* Add numpy-based implementation of spectral ops (#33)

* add numpy reference implementation of spectral ops

* Add fft_c2r numpy based implementation for unittest. (#34)

* add fft_c2r numpy implementation

* Add deframe op and stft/istft api. (#23)

* Add frame api

* Add deframe op and kernels.

* Add stft and istft apis.

* Add deframe api. Update stft and istft apis.

* Fix bug in frame_from_librosa function when input dims >= 3

* Rename deframe to overlap_add.

* Update istft.

* Update after code review.

* Add overlap_add op and stft/istft api unittest (#35)

* Add overlap_add op unittest.

* Register complex kernels of squeeze/unsquuze op.

* Add stft/istft api unittest.

* Add unittest for fft helper functions (#36)

* add unittests for fft helper functions. add complex kernel for roll op.

* complete static graph unittest for all public api (#37)

* Unittest of op with FFT C2C, C2R and r2c added (#38)

* documents and single case

* test c2r case

* New C2R Python layer normal and exception use cases

* Documentation of the common interfaces of c2r and c2c

* Unittest of op with FFT C2C, C2R and r2c added
Co-authored-by: lijiaqi <lijiaqi0612@163.com>

* add fft related options to CMakeLists.txt

* fix typos and clean code (#39)

* fix invisible character in mkl branch and fix error in error message

* clean code: remove docstring from unittest for signal.py.

* always convert numpy array to paddle.Tensor to avoid comparing numpy dtype with paddle dtype. (#40)

* always convert numpy array to paddle.Tensor to avoid comparing numpy dtype with paddle dtype.

* fix CI Errors: numpy dtype comparison, thrust when cuda is not available (#41)

1. always convert numpy array to paddle.Tensor to avoid comparing numpy dtype with paddle dtype.
2. promote floating point tensor to complex tensor ior fft_c2c and fft_c2r;
3. fix unittest to catch UnImplementedError and RuntimeError;
4. fix compile error by avoid using thrust when cuda is not available.
5.  fix sample code, use paddle.fft instead of paddle.tensor.fft

* remove inclusion of thrust, add __all__ list for fft (#42)

* Add api doc and update unittest. (#43)

* Add doc strings.
* Update overlap_add op unittest

* fix MKL-based FFT implementation (#44)

* fix MKL-based FFT implementation, MKL CDFT's FORWARD DOMAIN is always REAL for R2C and C2R

* remove code for debug (#45)

* use dynload for cufft (#46)

* use std::ptrdiff_t as datatype of stride (instead of int64_t) to avoid argument mismatch on some platforms.

* add complex support for fill_zeros_like

* use dynload for cufft

* Update doc and unittest. (#47)

* Add doc of frame op and overlap_add op.

* Update unittest.

* use dynload for cufft (#48)

1. use dynload for cufft
2. fix unittest;
3. temporarily disable Rocm.

* fix conflicts and merge upstream (#49)

fix conflicts and merge upstream

* fix compile error: only link dyload_cuda when cuda is available (#50)

* fix compile error: only link dyload_cuda when cuda is available

* fix dynload for cufft on windows (#51)

1. fix dynload for cufft on windows;
2. fix unittests.

* add NOMINMAX to compile on windows (#52)

 add NOMINMAX to compile on windows

* explicitly specify capture mode for lambdas (#55)

 explicitly specify capture mode for lambdas

* fix fft sample (#53)

* fix fft sample

* update scipy and numpy version for unittests of fft (#56)

update scipy and numpy version for unittests of fft

* Add static graph unittests of frame and overlap_add api. (#57)

* Remove cache of cuFFT & Disable ONEMKL (#59)

1. replace numpy.fft with scipy.fft as numpy<1.20 not support ortho norm
2. remove cache of cufft plans;
3. enhance error checking.
4. default WITH_ONEMKL to OFF
Co-authored-by: Njeff41404 <jeff41404@gmail.com>
Co-authored-by: Nroot <root@bjyz-sys-gpu-kongming9.bjyz.baidu.com>
Co-authored-by: NKP <109694228@qq.com>
Co-authored-by: lijiaqi <lijiaqi0612@163.com>
Co-authored-by: NXiaoxu Chen <chenxx_id@163.com>
Co-authored-by: Nlijiaqi0612 <33169170+lijiaqi0612@users.noreply.github.com>

11518a43

A
split cuda_profiler into .h and .cc (#35821) · 01063218
由 Aurelius84 提交于 9月 18, 2021
```
* split cuda_profiler into .h and .cc

* fix cmake

* remove inline
```
01063218

[oneDNN] Disable caching of Reorder operation (#35664) · e4c2a854

由 Jacek Czaja 提交于 9月 18, 2021

* - REorder disabling caching

* - compilation fix

* - another compilation fix

* - another compilation fix

* - compilation fix

* - Fix

* - yet another compilation fix

* - suppresingly another compilation fix

* - lint

* - fix after review

* - fix

e4c2a854

17 9月, 2021 1 次提交

Make flag adding easier (#35823) · 2c781455

由 Zeng Jinle 提交于 9月 17, 2021

* make flag setter easier

* update

* rename macro name

* fix bug of public/writable

* update to pass CI

* polish

* fix CPU link error

2c781455

16 9月, 2021 1 次提交
- C
  
  Add CPU and GPU eigh op implementation (#34990) · 07d0b834
  由 crystal 提交于 9月 16, 2021
  
  07d0b834
15 9月, 2021 3 次提交
- J
  
  added fix for matmul and support for 6 rank tensor (#35740) · e80acff3
  由 jakpiase 提交于 9月 15, 2021
  
  e80acff3
- L
  add nvidia cusparse library, test=develop (#35675) · 7e18106a
  由 Liu-xiandong 提交于 9月 15, 2021
```
Put Nvidia's cusparse library into paddle.
```
  7e18106a
- S
  Add paddle.cuda.device.stream_guard API (#35623) · 3218075d
  由 Siming Dai 提交于 9月 15, 2021
```
Add paddle.cuda.device.stream_guard API 
```
  3218075d
14 9月, 2021 2 次提交

Y
Implement FunctionTraits to support two kinds of elementwise functor and... · 12bf0502
由 Yiqun Liu 提交于 9月 14, 2021
```
Implement FunctionTraits to support two kinds of elementwise functor and remove some old codes for broadcast. (#35688)
```
12bf0502

Add api paddle.device.cuda.empty_cache to release idle gpu memory hold by allocator。 (#35427) · 83932715

由 chenenquan 提交于 9月 14, 2021

* Add empty_cache api to release idle gpu memory hold by allocator,test=develop

* Add empty_cache api to release idle gpu memory hold by allocator,test=develop

* Add empty_cache api to release idle gpu memory hold by allocator,test=develop

* Fix test coverage problem for empty_cache

* delete redundant check for empty_cache

* fix the problem of empty_cache's doc

* delete the nvidia-smi comment in doc of empty_cache, test=document_fix

83932715

13 9月, 2021 4 次提交
- Y
  Revert "Implement FunctionTraits to support two kinds of elementwise functor... · 40d4a295
  由 Yiqun Liu 提交于 9月 13, 2021
```
Revert "Implement FunctionTraits to support two kinds of elementwise functor and remove some old codes for broadcast. (#35487)" (#35686)
```
  40d4a295
- Y
  Implement FunctionTraits to support two kinds of elementwise functor and... · d4f84d46
  由 Yiqun Liu 提交于 9月 13, 2021
```
Implement FunctionTraits to support two kinds of elementwise functor and remove some old codes for broadcast. (#35487)
```
  d4f84d46
- T
  
  add xpu_wait & new implementation replace memcpy in adam, adamw (#35437) · 86a6be1a
  由 taixiurong 提交于 9月 13, 2021
  
  86a6be1a
- J
  Added clip BF16/FP32 FWD/BWD kernels (#35601) · 4e233712
  由 jakpiase 提交于 9月 12, 2021
```
* implemented clip op bf16/fp32

* added skipping if not cpu or bf16

* CI rerun after bf16 package change

* added parentheses to ensure formatting
```
  4e233712
11 9月, 2021 1 次提交
- A
  
  Clear VLOG in DeviceEvent (#35633) · cd5115f7
  由 Aurelius84 提交于 9月 11, 2021
  
  cd5115f7
09 9月, 2021 1 次提交

Add matrix_rank Op and it's GPU and CPU kernel (#34823) · eb1fbf12

由 0x45f 提交于 9月 09, 2021

* init matrix_rank op, add matrix_rank CPU code and test

* add GPU kernel, remove svd_eigen.h

* add CPU kernel when tol is tensor

* add cpu and gpu code when tol is tensor

* fix CI-ROCM error

* add matrix_rank API describe, fix PR-CI-Py3 error

* fix PR-CI-Windows error, add matrix_rank API test

* delete useless comments

* fix review

* add my code in svd_helper.h

* update doc commets

* remove spaces

eb1fbf12

08 9月, 2021 2 次提交

Enable program passes on Fleet APIs (#34955) · 5f369881

由 Zeng Jinle 提交于 9月 08, 2021

* add fleet api for program pass

* turn on apply pass for CI test

* fix disable fuse_all_optimizer bug

* try to test ci

* fix CI

* fill unspecified op role

* fix fuse_allreduce

* add ut to improve coverage

* remove useless change

* improve c++ coverage

* follow some comments

* test ir pass pipeline

* update doc

* reduce ut time again

5f369881

merge CMakeList.txt manual (#35378) · c4a3e8b4

由 feng_shuai 提交于 9月 08, 2021

* merge CMakeList.txt manual

* add platform for changethreadnum

* repair some bugs according to make error

* do nothing just flush CI

* forget change thread num

* add inplace_atol param for check_output_with_place

* Windows

* std:min and std::max should be change because of windows

c4a3e8b4

07 9月, 2021 2 次提交
- Y
  
  support multi-node (#35396) · c6e0cedc
  由 yaoxuefeng 提交于 9月 07, 2021
  
  c6e0cedc
- A
  Fix DryRun unittest failed from test_standalon_executor.py (#35433) · 071e8156
  由 Aurelius84 提交于 9月 07, 2021
```
* fix commit

* Open unittest

* fix unittest on Windows

* fix constructor
```
  071e8156
06 9月, 2021 2 次提交
- A
  Support Reset for DeviceEvent (#35443) · 8c73c1b5
  由 Aurelius84 提交于 9月 06, 2021
```
* Support Reset for DeviceEvent

* fix code

* add more unittest
```
  8c73c1b5
- Y
  
  Revert hccl check nan (#35438) · c3ad7775
  由 Yuang Liu 提交于 9月 06, 2021
  
  c3ad7775
03 9月, 2021 3 次提交
- Y
  
  Unify the implementation of AlignedVector and simplify the codes of dropout and cast. (#35373) · c171eca2
  由 Yiqun Liu 提交于 9月 03, 2021
  
  c171eca2
- T
  
  fix bn_infer and optimize momentum for kunlun (#35250) · 8305ba37
  由 TTerror 提交于 9月 03, 2021
  
  8305ba37
- L
  
  [NPU] add 32 extra bytes for npu memory slot (#35347) · 668bfb35
  由 Leo Chen 提交于 9月 03, 2021
  
  668bfb35
02 9月, 2021 2 次提交

Add SVD Op and it's GPU and CPU kernel (#34953) · 7e5fb462

由 xiongkun 提交于 9月 02, 2021

* Add SVD Op and it's GPU and CPU kernel

* Remove CUDAPlace in test_svd_op, make the test available in CPU package

* modfity the file

* fix windows bug/ fix ROCM / fix test timeout

* for pass the CIs

* improve error report

* for code review

* some modification to test_svd_op

* change python code style

* expose the svd interface for document

7e5fb462

B

[npu] add update_loss_scaling npu min value (#35270) · 280d7421
由 Baibaifan 提交于 9月 02, 2021

280d7421

01 9月, 2021 2 次提交

Added slice BF16/FP32 FWD/BWD kernels (#34332) · 070cab11

由 jakpiase 提交于 9月 01, 2021

* aded slice FWD FP32

* added tests for slice FWD FP32

* added slice bwd

* added bf16 tests

* CI fix

* CI fix

* added reason to skip_if

* minor change

* temporary fix for failing test

* temporary fix

* changes after review

* CI rerun

070cab11

Q
support KL label smooth (#35177) · 7ca28bb6
由 QingshuChen 提交于 9月 01, 2021
```
* support KL label smooth

* update UT for KL label_smooth
```
7ca28bb6

PaddlePaddle / Paddle 1 年多 前同步成功

PaddlePaddle / Paddle
1 年多前同步成功