Commits · db633affe1b04a880935eeb20d405ff3466a0841 · wmsofts / Paddle

25 Oct, 2021 1 commit

add op: fused_feedforward(forward) (#35843) · b18cbfb2

Committed by zhangkaihuo on Oct 25, 2021

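This PR contains only the forward code of fused_feedforward.

Related kernel implementations: fused_dropout_act_bias, fused_residual_dropout_bias, fused_layernorm_residual_dropout_bias

fused_feedforward is a fused operator. It fuses and wraps the operators in the feed-forward layer of the transformer model so that the front end exposes a single interface. Fusion reduces part of the memory-access and kernel-launch time, which improves performance.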
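
For orientation, the unfused computation that a feed-forward block performs can be sketched in plain Python. This is only an illustrative reference, not the PR's kernel code: the post-layer-norm ordering, the ReLU activation, the absence of learnable layer-norm parameters, and the helper names (`linear`, `relu`, `layer_norm`, `feed_forward_reference`) are all assumptions, and dropout is left out so the result is deterministic.

```python
import math

def linear(x, w, b):
    """x: [n][d_in], w: [d_in][d_out], b: [d_out] -> [n][d_out]."""
    return [[sum(xi * w[i][j] for i, xi in enumerate(row)) + b[j]
             for j in range(len(b))] for row in x]

def relu(x):
    return [[max(v, 0.0) for v in row] for row in x]

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no scale/shift)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def feed_forward_reference(x, w1, b1, w2, b2):
    # Hypothetical unfused reference; dropout is omitted (rate 0).
    h = relu(linear(x, w1, b1))   # first projection: bias add + activation
    y = linear(h, w2, b2)         # second projection: bias add
    res = [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]  # residual add
    return layer_norm(res)        # final layer norm

x = [[1.0, 2.0], [0.5, -1.0]]
w1 = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.0, 0.0]
out = feed_forward_reference(x, w1, b1, w2, b2)
```

Written this way, each step reads and writes a full intermediate tensor. A fused operator is motivated by avoiding some of those round trips and the per-step launches.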

b18cbfb2

22 Oct, 2021 1 commit

Fused attention op forward (#35905) · d4906214

Committed by Li Min on Oct 22, 2021

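Purpose: this PR aims to improve the computational performance of the attention module.
To reduce the framework-level scheduling overhead for ops, this PR implements the attention module by hand in the C++ layer and exposes it as a single large attention op.
To reduce memory-access overhead, this PR applies two optimizations:
(1) When computing q, k, and v, the shared input X is used to reduce the gemm, transpose, and bias add from three calls to one.
(2) Kernel fusion is used so that data is passed between different CUDA kernels through registers.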
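
The idea behind optimization (1) can be illustrated with a small pure-Python sketch. It is not the PR's C++/CUDA code: the helper names (`matmul`, `add_bias`), the toy shapes, and the column-wise weight layout are assumptions chosen for illustration. Because q, k, and v are all projections of the same input X, their weight matrices can be concatenated so that one GEMM and one bias add produce all three.

```python
def matmul(a, b):
    """Multiply a [n][k] by b [k][m], both plain nested lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add_bias(y, bias):
    return [[v + bias[j] for j, v in enumerate(row)] for row in y]

x = [[1.0, 2.0], [3.0, 4.0]]      # shared input, [seq_len][d_model]
wq = [[1.0, 0.0], [0.0, 1.0]]
bq = [0.1, 0.2]
wk = [[2.0, 1.0], [1.0, 2.0]]
bk = [0.0, 0.0]
wv = [[0.0, 1.0], [1.0, 0.0]]
bv = [1.0, 1.0]

# Unfused: three GEMMs and three bias adds, each reading x again.
q = add_bias(matmul(x, wq), bq)
k = add_bias(matmul(x, wk), bk)
v = add_bias(matmul(x, wv), bv)

# Fused: concatenate the weights column-wise and the biases end to end,
# then run a single GEMM and a single bias add.
w_qkv = [rq + rk + rv for rq, rk, rv in zip(wq, wk, wv)]
b_qkv = bq + bk + bv
qkv = add_bias(matmul(x, w_qkv), b_qkv)

# Slice the fused result back into q, k, and v.
d = len(wq[0])
q2 = [row[:d] for row in qkv]
k2 = [row[d:2 * d] for row in qkv]
v2 = [row[2 * d:] for row in qkv]
```

The fused path reads X once instead of three times, which is where the memory-access saving comes from. The transpose step mentioned in the PR description is not modeled here.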

d4906214

21 Oct, 2021 1 commit
- X
  
  User specified backend (#35745) · b6e7f8e9
  Committed by xiongkun on Oct 21, 2021
  
  b6e7f8e9
20 Oct, 2021 2 commits

[FIX] Extend time for test_activation_nn_grad to avoid its timeout issue (#36527) · c285c719

Committed by Jiabin Yang on Oct 20, 2021

* native commit for triple grad of sigmoid

* Updated unittests files

* init functional jacobian api

* Updated trible_test func

* Updated gradient_checker & test_script

* finish test with dtype float32

* add float64 test case

* polish code

* use atol=1e-5 with dtype float64

* fix for ci

* set timeout for test_jacobian

* fix dygraph grad to support high differential

* polish API docstring

* Updated gradient checker and some related files

* fix double grad strip error for high differential

* fix double grad strip error for high differential

* Add Sigmoid triple grad tests

* fix dygraph double grad dtype error when calling for high differential scenario

* Updated triple grad tests func

* Use np.random to initialize ddx

* Updated triple_grad_check func

* add todo for gradient checker and refine some comments

* remove additional code

* add test for warning in backward.py

* add tanh triple grad

* format python code

* refine code

* make test_activation_nn_grad test time to 150s
Co-authored-by: veyron95 <veyron_wu@163.com>
Co-authored-by: levi131 <limaolin01@baidu.com>

c285c719

[Auto Parallel] Generalization for Partition and Completion (#35735) · 797bd40d

Committed by JZ-LIANG on Oct 20, 2021

* default dist op

* add dist_attr for dist op

* add unittest

* update inputname

* update function name

* add unittest

* update CMakeLists.txt for CI

* fix dis_matmul

* fix compile error

* update matmul to matmul_v2

* unify api

* unify api

* todo

* update distop forward func

* update distop forward func

* auto parallel backward

* update dist op

* autoparallel backward

* add backward for embedding

* temp1

* temp2

* temp3

* temp4

* backward done1

* backward done2

* backward done3

* dist embedding remove mp mode

* dist matmul remove mp mode

* update dist embedding

* dist op init1

* dist op init 2

* update unittest

* context remove parallel mode

* partitioner remove parallel mode

* update unittest

* a more general method to support varying mesh in pipeline parallel

* support varying mesh in pipeline parallel

* embedding support varying mesh in pipeline parallel

* matmul support varying mesh in pipeline parallel

* default dist op support varying mesh in pipeline parallel

* dist attribute for startup program

* default dist op support varying mesh in pipeline parallel 2

* partitioner support varying mesh in pipeline parallel

* revise logic for auto completion

* revise framework.py

* revise reshard unittest

* revise unittest for parallelize

* chmod

* fixed bug for dist embedding name mapping
Co-authored-by: zhaoyingli <zhaoyingli@baidu.com>

797bd40d

19 Oct, 2021 1 commit

Add auto parallel cost model and unittests (#36363) · a573a7ed

Committed by YipZLF on Oct 19, 2021

* Add auto parallel cost model and unittests

* Fixed code styles.

* Fixed bugs and codes style

* fixed typo

* Improved code style: object encapsulation.

* Fixed codes.

* Refactored estimate_cost

* Fixed typo

a573a7ed

13 Oct, 2021 2 commits
- G
  
  support auto parallel data shard (#36055) · 85bb1a85
  Committed by Guoxia Wang on Oct 13, 2021
  
  85bb1a85
- F
  
  Set NIGHTLY tag for 'tensordot' UT (#36354) · 90457d8c
  Committed by From00 on Oct 13, 2021
  
  90457d8c
11 Oct, 2021 2 commits

Add nn.functional.sparse_attention and some test cases, test=develop (#35757) · 85b77232

Committed by Liu-xiandong on Oct 11, 2021

Add paddle.nn.functional.sparse_attention API

This PR mainly wraps the sparse_attention functionality at the Python layer; for the main OP code, see #PR35676

In addition, unit tests were added for the wrapped Python interface.

85b77232

add reshard module (#35779) · c38b0488

Committed by caozhou on Oct 11, 2021

* add reshard module

* fix conflict

* update reshard module

* update and add unittest

* update reshard module and unittest

* add more unittests

c38b0488

09 Oct, 2021 1 commit

Add new API 'tensordot' (#36273) · 21dc7f40

Committed by From00 on Oct 09, 2021

* Add new API tensordot

* Set timeout value 400 for UT; Fix format for EN docs

* Set timeout value 1000 for UT; Fix format for EN docs

* Remove some input check

* Coding style improve: don't compare boolean values to True or False using ==

21dc7f40

30 Sep, 2021 1 commit

Fix raw optim (#36176) · 5e0f199a

Committed by 李季 on Sep 30, 2021

* fix raw optim

* pre-commit test file
Co-authored-by: sneaxiy <sneaxiy@126.com>

5e0f199a

27 Sep, 2021 1 commit

Add functional autograd API: jacobian (#35917) · ec2f68e8

Committed by levi131 on Sep 27, 2021

* init functional jacobian api

* finish test with dtype float32

* add float64 test case

* polish code

* use atol=1e-5 with dtype float64

* fix for ci

* set timeout for test_jacobian

* polish API docstring

* modify docstring

ec2f68e8

24 Sep, 2021 1 commit

Add paddle.linalg.solve OP (#35715) · 8caf951c

Committed by Weilong Wu on Sep 24, 2021

* Add linalg.solve op, test=develop

* Fix a bug caused by accidental deletion

* updated description and fix a bug: missing a comma

* Add linalg.solve op, test=develop

* updated solve op backward logic

* updated solve op backward logic again

* Add linalg.solve Op, test=develop

* Updated and modified to fit CI requirements

* Fix a bug

* 1)Add more test cases; 2)Fix a wrong usage in reduces operation; 3)Remove redundant code

* Remove redundant comments

* 1)Removed redundant code; 2)Updated to enhance code robustness

* Removed redundant code

* Updated API documents

8caf951c

22 Sep, 2021 1 commit
- F
  
  disable tests for fft on windows with gpu (#35872) · 5af6081a
  Committed by Feiyu Chan on Sep 22, 2021
  
  5af6081a
18 Sep, 2021 3 commits

increase test_imperative_auto_mixed_precision time PROPERTIES TIMEOUT (#35863) · e7617512
Committed by zhangbo9674 on Sep 18, 2021

e7617512

Committed by Feiyu Chan on Sep 18, 2021

* 1. add interface for fft;
2. add data type predicate;
3. fix paddle.roll.

* add fft c2c cufft kernel

* implement argument checking & op calling parts for fft_c2c and fftn_c2c

* add operator and opmaker definitions

* only register float and double for cpu.

* add common code for implementing FFT, add pocketfft as a dependency

* add fft c2c cufft kernel function

* fix bugs in python interface

* add support for c2r, r2c operators, op makers, kernels and kernel functors.

* test and fix bugs

* 1. fft_c2c function: add support for onesided=False;
2. add complex<float>, complex<double> support for concat and flip.

* 1. fft: fix python api bugs;
2. shape_op: add support for complex data types.

* fft c2c cufft kernel done with compile and link

* fix shape_op, add mkl placeholder

* remove mkl

* complete fft c2c in gpu

* 1. implement mkl-based fft, FFTC2CFunctor and common function exec_fft;
2. change the design, add input and output typename as template parameter for all FFTFunctors, update pocketfft-based implementation.

* complete fft c2c on gpu in ND

* complete fft c2c on gpu in ND

* complete fft c2c backward in ND

* fix MKL-based implementation

* Add frame op and CPU/GPU kernels.

* Add frame op forward unittest.

* Add frame op forward unittest.

* Remove axis parameter in FrameFunctor.

* Add frame op grad CPU/GPU kernels and unittest.

* Add frame op grad CPU/GPU kernels and unittest.

* Update doc string.

* Update after review and remove librosa requirement in unittest.

* Update grad kernel.

* add fft_c2r op

* Remove data allocation in TransCompute function.

* add fft r2c onesided with cpu(pocketfft/mkl) and gpu

* last fft c2r functor

* fix C2R and R2C for cufft, because the direction is not an option in these cases.

* add fft r2c onesided with cpu(pocketfft/mkl) and gpu

* fix bugs in python APIs

* fix fft_c2r grad kernel

* fix bugs in python APIs

* add cuda fft c2r grad kernel functor

* clean code

* fix fft_c2r python API

* fill fft r2c result with conjugate symmetry (#19)

fill fft r2c result with conjugate symmetry

* add placeholder for unittests (#24)

* simple parameterize test function by auto generate test case from parm list (#25)

* miscellaneous fixes for python APIs (#26)

* add placeholder for unittests

* resize fft inputs before computation if n or s is provided.

* add complex kernels for pad and pad_grad

* simplify argument checking.

* add type promotion

* add int to float or complex promotion

* fix output data type for static mode

* fix fft's input dtype dispatch, import fft to paddle

* fix typos in axes checking (#27)

* fix typos in axes checking

* fix argument checking (#28)

* fix argument checking

* Add C2R Python layer normal and abnormal use cases (#29)

* documents and single case

* test c2r case

* New C2R Python layer normal and exception use cases

* complete rfft,rfft2,rfftn,ihfft,ihfft2,ihfftn unittest and doc string (#30)

* Documentation of the common interfaces of c2r and c2c (#31)

* Documentation of the common interfaces of c2r and c2c

* clean c++ code  (#32)

* clean code

* Add numpy-based implementation of spectral ops (#33)

* add numpy reference implementation of spectral ops

* Add fft_c2r numpy based implementation for unittest. (#34)

* add fft_c2r numpy implementation

* Add deframe op and stft/istft api. (#23)

* Add frame api

* Add deframe op and kernels.

* Add stft and istft apis.

* Add deframe api. Update stft and istft apis.

* Fix bug in frame_from_librosa function when input dims >= 3

* Rename deframe to overlap_add.

* Update istft.

* Update after code review.

* Add overlap_add op and stft/istft api unittest (#35)

* Add overlap_add op unittest.

* Register complex kernels of squeeze/unsqueeze op.

* Add stft/istft api unittest.

* Add unittest for fft helper functions (#36)

* add unittests for fft helper functions. add complex kernel for roll op.

* complete static graph unittest for all public api (#37)

* Unittest of op with FFT C2C, C2R and r2c added (#38)

* documents and single case

* test c2r case

* New C2R Python layer normal and exception use cases

* Documentation of the common interfaces of c2r and c2c

* Unittest of op with FFT C2C, C2R and r2c added
Co-authored-by: lijiaqi <lijiaqi0612@163.com>

* add fft related options to CMakeLists.txt

* fix typos and clean code (#39)

* fix invisible character in mkl branch and fix error in error message

* clean code: remove docstring from unittest for signal.py.

* always convert numpy array to paddle.Tensor to avoid comparing numpy dtype with paddle dtype. (#40)

* always convert numpy array to paddle.Tensor to avoid comparing numpy dtype with paddle dtype.

* fix CI Errors: numpy dtype comparison, thrust when cuda is not available (#41)

1. always convert numpy array to paddle.Tensor to avoid comparing numpy dtype with paddle dtype.
2. promote floating point tensor to complex tensor for fft_c2c and fft_c2r;
3. fix unittest to catch UnImplementedError and RuntimeError;
4. fix compile error by avoid using thrust when cuda is not available.
5.  fix sample code, use paddle.fft instead of paddle.tensor.fft

* remove inclusion of thrust, add __all__ list for fft (#42)

* Add api doc and update unittest. (#43)

* Add doc strings.
* Update overlap_add op unittest

* fix MKL-based FFT implementation (#44)

* fix MKL-based FFT implementation, MKL CDFT's FORWARD DOMAIN is always REAL for R2C and C2R

* remove code for debug (#45)

* use dynload for cufft (#46)

* use std::ptrdiff_t as datatype of stride (instead of int64_t) to avoid argument mismatch on some platforms.

* add complex support for fill_zeros_like

* use dynload for cufft

* Update doc and unittest. (#47)

* Add doc of frame op and overlap_add op.

* Update unittest.

* use dynload for cufft (#48)

1. use dynload for cufft
2. fix unittest;
3. temporarily disable Rocm.

* fix conflicts and merge upstream (#49)

fix conflicts and merge upstream

* fix compile error: only link dyload_cuda when cuda is available (#50)

* fix compile error: only link dyload_cuda when cuda is available

* fix dynload for cufft on windows (#51)

1. fix dynload for cufft on windows;
2. fix unittests.

* add NOMINMAX to compile on windows (#52)

 add NOMINMAX to compile on windows

* explicitly specify capture mode for lambdas (#55)

 explicitly specify capture mode for lambdas

* fix fft sample (#53)

* fix fft sample

* update scipy and numpy version for unittests of fft (#56)

update scipy and numpy version for unittests of fft

* Add static graph unittests of frame and overlap_add api. (#57)

* Remove cache of cuFFT & Disable ONEMKL (#59)

1. replace numpy.fft with scipy.fft as numpy<1.20 not support ortho norm
2. remove cache of cufft plans;
3. enhance error checking.
4. default WITH_ONEMKL to OFF
Co-authored-by: jeff41404 <jeff41404@gmail.com>
Co-authored-by: root <root@bjyz-sys-gpu-kongming9.bjyz.baidu.com>
Co-authored-by: KP <109694228@qq.com>
Co-authored-by: lijiaqi <lijiaqi0612@163.com>
Co-authored-by: Xiaoxu Chen <chenxx_id@163.com>
Co-authored-by: lijiaqi0612 <33169170+lijiaqi0612@users.noreply.github.com>

11518a43

Add new API "eigvals" in linalg (#35720) · d411a038

Committed by From00 on Sep 18, 2021

* Add linalg.eigvals API

* pre-commit check

* Adjust code style

* Fix conflict

* Improve code style

* Modify the test code to ignore testing CUDA kernel

* Sort output data before checking in test code

* Set timeout value for UT

* Improve API example code to pass CI

* Fix bug for None fetch_list in Windows

* Delete grad Op

d411a038

17 Sep, 2021 1 commit
- L
  temporally disable the warnings (#35560) · 68ae6345
  Committed by Leo Chen on Sep 17, 2021
```
* temporally disable the warnings

* disable ut
```
  68ae6345
15 Sep, 2021 3 commits
- Z
  add dist_attr for dist op and var (#35585) · fc5fb2a1
  Committed by zhaoyingli on Sep 15, 2021
```
* add dist_attr for dist op

* add unittest

* update inputname

* update function name

* add unittest

* update CMakeLists.txt for CI

* fix dis_matmul

* fix compile error

* update matmul to matmul_v2
```
  fc5fb2a1
- W
  
  [hybrid] out data parallel as optimizer sharding parallel (#35593) · 78465703
  Committed by WangXi on Sep 15, 2021
  
  78465703
- Y
  
  add timeout value for uts (#35737) · 5fa9cf7c
  Committed by YUNSHEN XIE on Sep 15, 2021
  
  5fa9cf7c
14 Sep, 2021 1 commit
- Z
  Fix RawProgramOptimizer bug (#35704) · 0f741880
  Committed by Zeng Jinle on Sep 14, 2021
```
* fix raw optimizer gm

* update

* update ut
```
  0f741880
13 Sep, 2021 5 commits
- Y
  
  fix bug, test=document_fix (#35697) · 97a73e1d
  Committed by YUNSHEN XIE on Sep 13, 2021
  
  97a73e1d
- Y
  Change uts to nightly mode (#35541) · 2b0f9b51
  Committed by YUNSHEN XIE on Sep 13, 2021
```
* Change uts to nightly mode

* remove test_trt_pool_op from parallel_UT_rule.py,test=document_fix
```
  2b0f9b51
- X
  
  refine svd; unexpose tensor.svd; fix english document; set timeout=40 (#35635) · f521a30d
  Committed by xiongkun on Sep 13, 2021
  
  f521a30d
- 李
  upload global scatter and global gather operators related files (#35546) · ecfe8375
  Committed by 李季 on Sep 13, 2021
```
* upload global scatter and global gather operators related files
```
  ecfe8375
- G
  support hybrid parallel inference helper class (#35576) · dc3c845a
  Committed by Guoxia Wang on Sep 13, 2021
```
* support hybrid parallel inference helper class
```
  dc3c845a
10 Sep, 2021 1 commit

add cumprod op (#35185) · 4e509f46

Committed by hlygit66666 on Sep 10, 2021

* add test_cumprod_op

* Revert "add test_cumprod_op"

This reverts commit c96cf6dff5d09ae7d8cc72c1e8ae4369a153aa19.

* recommit

* add error message

* test input(x) initialize

* test use cpu

* update test code

* add test type

* add test case

* solve ci problem

* add complex case test

* add complex case test

* fix review problem

* fix conflict

* fix some docs

* change test case

* change test case

* fix review problems again

* fix docs

* fix inclusivescan bug

4e509f46

08 Sep, 2021 3 commits

Intergrate GLOOParallelContext to support Multi-CPU Core for Dygraph DataParallel (#35154) · 51cc73f0

Committed by xiongkun on Sep 08, 2021

* can pass the fake test

* add files

* modify cmake to pass windows-ci

* for ci pass

* WITH_GLOO=ON

* for pass coverage test

* add cpuonly testcase

* add

* disable nccl when compile with cuda

* change python version in cpuonly

* add backend argument

* add required gpu

* add required:gpu

51cc73f0

Enable program passes on Fleet APIs (#34955) · 5f369881

Committed by Zeng Jinle on Sep 08, 2021

* add fleet api for program pass

* turn on apply pass for CI test

* fix disable fuse_all_optimizer bug

* try to test ci

* fix CI

* fill unspecified op role

* fix fuse_allreduce

* add ut to improve coverage

* remove useless change

* improve c++ coverage

* follow some comments

* test ir pass pipeline

* update doc

* reduce ut time again

5f369881

merge CMakeList.txt manual (#35378) · c4a3e8b4

Committed by feng_shuai on Sep 08, 2021

* merge CMakeList.txt manual

* add platform for changethreadnum

* repair some bugs according to make error

* do nothing just flush CI

* forget change thread num

* add inplace_atol param for check_output_with_place

* Windows

* std::min and std::max should be changed because of Windows

c4a3e8b4

02 Sep, 2021 2 commits

Add SVD Op and it's GPU and CPU kernel (#34953) · 7e5fb462

Committed by xiongkun on Sep 02, 2021

* Add SVD Op and it's GPU and CPU kernel

* Remove CUDAPlace in test_svd_op, make the test available in CPU package

* modify the file

* fix windows bug/ fix ROCM / fix test timeout

* for pass the CIs

* improve error report

* for code review

* some modification to test_svd_op

* change python code style

* expose the svd interface for document

7e5fb462

[Auto Parallel] Logical Partition & Dist Op (#35117) · a622b701

Committed by JZ-LIANG on Sep 02, 2021

* support shard reader

* support shard reader

* add parallel mode

* update process mesh

* add method to compute comm_group

* implement dist_embedding forward func

* implement dist matmul forward func

* implement dist reshape forward func

* add transpiler framework

* add transpiler forward

* implement transpiler forward

* implement transpiler backward & update

* add process

* add unittest

* chmod

* chmod

* chmod

* update unittest

* add unittest for gpt

* remove unused print

* rename transpiler --> partitioner

* rename transpiler --> partitioner

* chmod

* chmod

* bug fixed

* remove amp function

* update case for dp mode

* update case for dp mode

a622b701

31 Aug, 2021 1 commit
- Q
  [NPU] fix cmake for ascend ci, test=develop (#35255) · f6004ab9
  Committed by Qi Li on Aug 31, 2021
```
* [NPU] fix cmake for ascend ci, test=develop

* update paddle_build.sh scripts, test=allcase
```
  f6004ab9
24 Aug, 2021 1 commit

Add no_sync in data parallel for dynamic graph (#34740) · b09f4d7f

Committed by Haohongxiang on Aug 24, 2021

* Add no_sync in data parallel for dynamic graph

* modify UT of no_sync

* delete test_parallel_dygraph_dataparallel_no_sync.py

* add test_parallel_dygraph_no_sync.py

* modify run_trainer_with_spawn in UTs

* Add UT of complex control flow in no_sync

* add specific descriptions and notes for no_sync

* check code style

* modify UT's TIMEOUT in CMakeLists.txt

b09f4d7f

23 Aug, 2021 1 commit
- B
  
  [CPU] Enable barrier op upon gloo (#34671) · e8f146a9
  Committed by Bo Liu on Aug 23, 2021
  
  e8f146a9
18 Aug, 2021 2 commits

Add function to disable paddle signal handler (#34577) · dd533dd3

Committed by Zhanlue Yang on Aug 18, 2021

* Add function to disable paddle signal handler

Paddle uses google::InstallFaultSignalHandler to handle selected system signals,
mainly for debugging and bug reporting.

However, this can conflict with other Python packages that capture similar signals,
such as tvm.

To resolve this issue, a function is provided to disable the signal handler.

* Remove signal test from WIN32 platform

* Remove redundant return from disable_signal_handler() function

* Add detailed messages to en_doc

dd533dd3

support class center sample of PartialFC (#34106) · 100db44f
Committed by Guoxia Wang on Aug 18, 2021
```
* support class center sample of PartialFC
```
100db44f

16 Aug, 2021 1 commit
- G
  support margin loss (arcface, cosface, sphereface) for single GPU and cross GPUs (#34247) · b0cb4148
  Committed by Guoxia Wang on Aug 16, 2021
```
* support margin loss (arcface, cosface, sphereface)
```
  b0cb4148

wmsofts / Paddle is in sync with the fork source project