提交 · 11938ae17f9942108ac79fe67b02853768e91f82 · PaddlePaddle / Paddle

10 1月, 2022 1 次提交

Add gpu kernel for new api : linalg.lstsq (#38621) · 405103d8

由 Haohongxiang 提交于 1月 10, 2022

* add lstsq gpu kernel

* update

* add docs_en

* modify ut

* fix bugs

* modify example in docs_en

* remove lstsq_op.cu from ROCM cmake

* modify docs_en

* modify docs_en

* modify docs_en

* remove unneccessary TensorCopy

405103d8

04 1月, 2022 1 次提交
- Z
  
  Modify macro definition to support arm (#38642) · 719f7419
  由 zhangkaihuo 提交于 1月 04, 2022
  
  719f7419
30 12月, 2021 3 次提交

Z
add OP lu forward (#38559) · 4e21457d
由 zhiboniu 提交于 12月 30, 2021
```
LGTM
```
4e21457d

Add cpu kernel of new api : lstsq (#38585) · ccf99b66

由 Haohongxiang 提交于 12月 30, 2021

* add cpu kernel of lstsq

* update

* modify code style

* modify unittest

* remove support for complex

ccf99b66

Add cusparse and unittest (#38431) · 667dc9f0

由 zhangkaihuo 提交于 12月 30, 2021

将cuSparse的handle与DeviceContext进行绑定，避免op中进行创建和销毁
添加对cuSparse中dense和sparse转换的API进行封装
添加对封装的API的单测

667dc9f0

29 12月, 2021 1 次提交
- S
  
  add nccl func of NCCL 2.11 (#38519) · 4853ab0a
  由 sneaxiy 提交于 12月 29, 2021
  
  4853ab0a
24 12月, 2021 1 次提交
- Z
  
  Add new API cholesky_solve (#38167) · 39f7c41f
  由 zhiboniu 提交于 12月 24, 2021
  
  39f7c41f
03 12月, 2021 1 次提交
- R
  refine structure for cuda and rocm (#37202) · a6d2fddb
  由 ronnywang 提交于 12月 03, 2021
```
* refine structure for cuda and rocm

* update

* update

* update

* update
```
  a6d2fddb
27 11月, 2021 1 次提交

[NPU] reorganization for device API abstraction (#37110) · 72241a6a

由 Aganlengzi 提交于 11月 27, 2021

* [NPU] reorganization for device API abstraction

* [NPU] delete old files

* [NPU] fix npu_collective_helper

* [NPU] fix collective_helper

* [NPU] fix ut

* [NPU] mod memory allocation and hccl_helper

* [NPU] fix place_type

* [NPU] split enfoce.h

* move acl* call into npu_info

* merge conflict

* fix merge

* merge conflict

* merge conflict

72241a6a

24 11月, 2021 1 次提交

[Paddle-Inference] Matmul_int8_convert: tensor*tensor (#37285) · 16590799

由 Wangzheee 提交于 11月 24, 2021

* matmul_convert_int8

* matmul_convert_int8

* matmulconvert_int8

* Matmul_int8_convert: tensor*tensor

* Matmul_int8_convert: tensor*tensor

* Matmul_int8_convert: tensor*tensor

16590799

08 11月, 2021 1 次提交

Use cuda virtual memory management and merge blocks (#36189) · a1ec1d5a

由 wanghuancoder 提交于 11月 08, 2021

* Use cuda virtual memory management and merge blocks, test=develop

* refine, test=develop

* refine, test=develop

* refine, test=develop

* refine, test=develop

* refine, test=develop

* refine, test=develop

* refine, test=develop

* window dll, test=develop

* fix cuda error of CUDA_ERROR_NOT_INITIALIZED, test=develop

* use autogrowthv2 for system allocator, test=develop

* remove ~CUDAVirtualMemAllocator(), test=develop

* refine, test=develop

* fix cuda error of CUDA_ERROR_NOT_INITIALIZED, test=develop

* fix cuda error of CUDA_ERROR_NOT_INITIALIZED, test=develop

* fix bug, test=develop

* revert system allocator, test =develop

* revert multiprocessing, test=develop

* fix AutoGrowthBestFitAllocatorV2 mutxt, test=develop

* catch cudaErrorInitializationError when create allocator, test=develop

* fix cuMemSetAccess use, test=develop

* refine cuda api use, test=develop

* refine, test=develop

* for test, test=develop

* for test, test=develop

* switch to v2, test=develop

* refine virtual allocator, test=develop

* Record cuMemCreate and cuMemRelease, test=develop

* refine, test=develop

* avoid out of bounds, test=develop

* rename allocator, test=develop

* refine, test=develop

* use PADDLE_ENFORCE_CUDA_SUCCESS, test=develop

* for test,test=develop

* refine, test=develop

* refine, test=develop

* refine, test=develop

* refine, test=develop

* refine, test=develop

* refine, test=develop

a1ec1d5a

02 11月, 2021 1 次提交
- L
  
  fix cusparse compile bug in CUDA11.2, test=develop (#36911) · dc08c187
  由 Liu-xiandong 提交于 11月 02, 2021
  
  dc08c187
29 10月, 2021 1 次提交

add new API/OP: paddle.linalg.triangular_solve (#36714) · 92d6a048

由 zhouweiwei2014 提交于 10月 29, 2021

* add new API: paddle.linalg.triangular_solve

* add new API/OP: paddle.linalg.triangular_solve

* add new API/OP: paddle.linalg.triangular_solve

* fix comment

92d6a048

19 10月, 2021 2 次提交

[paddle.linalg.qr] Add the Qr Operator (#35742) · 34d785c2

由 Yulong Ao 提交于 10月 19, 2021

* Add QR decomposition op

* Change codes to adapt to new svd_helper

* Update linalg.py

Restore the deleted comma

* Restore the deleted line

* Update linalg.py

* Update linalg.py

* Improve the qr code by reviews

* Update QR based on CI results

* Update qr doc, test=document_fix

* Change unsafe and ill-formed codes

34d785c2

X

add rocm support for fft api (#36415) · 1d5746bd
由 Xiaoxu Chen 提交于 10月 19, 2021

1d5746bd

15 10月, 2021 1 次提交
- F
  
  dynamic load mkl as a fft backend when it is avaialble and requested (#36414) · f45e6cf6
  由 Feiyu Chan 提交于 10月 15, 2021
  
  f45e6cf6
29 9月, 2021 1 次提交
- L
  fix cusparse compile problem, test=develop (#36199) · 3eb50715
  由 Liu-xiandong 提交于 9月 29, 2021
```
* fix cusparse compile problem, test=develop

* Modify file permissions
```
  3eb50715
26 9月, 2021 1 次提交
- C
  
  CPU forward calculation replaces Eigen with Lapack;Modify linalg exposure rules (#35916) · 7ff226f0
  由 crystal 提交于 9月 26, 2021
  
  7ff226f0
24 9月, 2021 2 次提交

Add paddle.linalg.solve OP (#35715) · 8caf951c

由 Weilong Wu 提交于 9月 24, 2021

* Add linalg.solve op, test=develop

* Fix a bug caused by accidental deletion

* updated description and fix a bug: missing a comma

* Add linalg.solve op, test=develop

* updated solve op backward logic

* updated solve op backward logic again

* Add linalg.solve Op, test=develop

* Updated and modified to fit CI requirements

* Fix a bug

* 1)Add more test cases; 2)Fix a wrong usage in reduces operation; 3)Remove redundant code

* Remove redundant comments

* 1)Removed redundant code; 2)Updated to enhance code robustness

* Removed redundant code

* Updated API documents

8caf951c

L

fix cusparse compile bug in windows CUDA11.2, test=develop (#35941) · 7c3567ea
由 Liu-xiandong 提交于 9月 24, 2021

7c3567ea

23 9月, 2021 1 次提交
- F
  
  Replace Eigen with Lapack library for eigvals OP kernel (#35909) · 9b8aafe5
  由 From00 提交于 9月 23, 2021
  
  9b8aafe5
22 9月, 2021 2 次提交
- Z
  
  ResnetUnitOp implemented by cuDNN fused op(backend code) (#35557) · 736a7388
  由 Zhang Zheng 提交于 9月 22, 2021
  
  736a7388
- [2.2]support extern third_party lapack API on Linux/Windows/Mac (#35690) · ae65257d
  由 zhouweiwei2014 提交于 9月 22, 2021
```
* support extern third_party lapack on Linux/Windows/Mac

* fix ci
```
  ae65257d
18 9月, 2021 1 次提交

由 Feiyu Chan 提交于 9月 18, 2021

* 1. add interface for fft;
2. add data type predicate;
3. fix paddle.roll.

* add fft c2c cufft kernel

* implement argument checking & op calling parts for fft_c2c and fftn_c2c

* add operator and opmaker definitions

* only register float and double for cpu.

* add common code for implementing FFT, add pocketfft as a dependency

* add fft c2c cufft kernel function

* fix bugs in python interface

* add support for c2r, r2c operators, op makers, kernels and kernel functors.

* test and fix bugs

* 1. fft_c2c function: add support for onesided=False;
2. add complex<float>, complex<double> support for concat and flip.

* 1. fft: fix python api bugs;
2. shape_op: add support for complex data types.

* fft c2c cufft kernel done with complie and link

* fix shape_op, add mkl placeholder

* remove mkl

* complete fft c2c in gpu

* 1. implement mkl-based fft, FFTC2CFunctor and common function exec_fft;
2. change the design, add input and output typename as template parameter for all FFTFunctors, update pocketfft-based implementation.

* complete fft c2c on gpu in ND

* complete fft c2c on gpu in ND

* complete fft c2c backward in ND

* fix MKL-based implementation

* Add frame op and CPU/GPU kernels.

* Add frame op forward unittest.

* Add frame op forward unittest.

* Remove axis parameter in FrameFunctor.

* Add frame op grad CPU/GPU kernels and unittest.

* Add frame op grad CPU/GPU kernels and unittest.

* Update doc string.

* Update after review and remove librosa requirement in unittest.

* Update grad kernel.

* add fft_c2r op

* Remove data allocation in TransCompute function.

* add fft r2c onesided with cpu(pocketfft/mkl) and gpu

* last fft c2r functor

* fix C2R and R2C for cufft, becase the direction is not an option in these cases.

* add fft r2c onesided with cpu(pocketfft/mkl) and gpu

* fix bugs in python APIs

* fix fft_c2r grad kernal

* fix bugs in python APIs

* add cuda fft c2r grad kernal functor

* clean code

* fix fft_c2r python API

* fill fft r2c result with conjugate symmetry (#19)

fill fft r2c result with conjugate symmetry

* add placeholder for unittests (#24)

* simple parameterize test function by auto generate test case from parm list (#25)

* miscellaneous fixes for python APIs (#26)

* add placeholder for unittests

* resize fft inputs before computation is n or s is provided.

* add complex kernels for pad and pad_grad

* simplify argument checking.

* add type promotion

* add int to float or complex promotion

* fix output data type for static mode

* fix fft's input dtype dispatch, import fft to paddle

* fix typos in axes checking (#27)

* fix typos in axes checking

* fix argument checking (#28)

* fix argument checking

* Add C2R Python layer normal and abnormal use cases (#29)

* documents and single case

* test c2r case

* New C2R Python layer normal and exception use cases

* complete rfft,rfft2,rfftn,ihfft,ihfft2,ihfftn unittest and doc string (#30)

* Documentation of the common interfaces of c2r and c2c (#31)

* Documentation of the common interfaces of c2r and c2c

* clean c++ code  (#32)

* clean code

* Add numpy-based implementation of spectral ops (#33)

* add numpy reference implementation of spectral ops

* Add fft_c2r numpy based implementation for unittest. (#34)

* add fft_c2r numpy implementation

* Add deframe op and stft/istft api. (#23)

* Add frame api

* Add deframe op and kernels.

* Add stft and istft apis.

* Add deframe api. Update stft and istft apis.

* Fix bug in frame_from_librosa function when input dims >= 3

* Rename deframe to overlap_add.

* Update istft.

* Update after code review.

* Add overlap_add op and stft/istft api unittest (#35)

* Add overlap_add op unittest.

* Register complex kernels of squeeze/unsquuze op.

* Add stft/istft api unittest.

* Add unittest for fft helper functions (#36)

* add unittests for fft helper functions. add complex kernel for roll op.

* complete static graph unittest for all public api (#37)

* Unittest of op with FFT C2C, C2R and r2c added (#38)

* documents and single case

* test c2r case

* New C2R Python layer normal and exception use cases

* Documentation of the common interfaces of c2r and c2c

* Unittest of op with FFT C2C, C2R and r2c added
Co-authored-by: lijiaqi <lijiaqi0612@163.com>

* add fft related options to CMakeLists.txt

* fix typos and clean code (#39)

* fix invisible character in mkl branch and fix error in error message

* clean code: remove docstring from unittest for signal.py.

* always convert numpy array to paddle.Tensor to avoid comparing numpy dtype with paddle dtype. (#40)

* always convert numpy array to paddle.Tensor to avoid comparing numpy dtype with paddle dtype.

* fix CI Errors: numpy dtype comparison, thrust when cuda is not available (#41)

1. always convert numpy array to paddle.Tensor to avoid comparing numpy dtype with paddle dtype.
2. promote floating point tensor to complex tensor ior fft_c2c and fft_c2r;
3. fix unittest to catch UnImplementedError and RuntimeError;
4. fix compile error by avoid using thrust when cuda is not available.
5.  fix sample code, use paddle.fft instead of paddle.tensor.fft

* remove inclusion of thrust, add __all__ list for fft (#42)

* Add api doc and update unittest. (#43)

* Add doc strings.
* Update overlap_add op unittest

* fix MKL-based FFT implementation (#44)

* fix MKL-based FFT implementation, MKL CDFT's FORWARD DOMAIN is always REAL for R2C and C2R

* remove code for debug (#45)

* use dynload for cufft (#46)

* use std::ptrdiff_t as datatype of stride (instead of int64_t) to avoid argument mismatch on some platforms.

* add complex support for fill_zeros_like

* use dynload for cufft

* Update doc and unittest. (#47)

* Add doc of frame op and overlap_add op.

* Update unittest.

* use dynload for cufft (#48)

1. use dynload for cufft
2. fix unittest;
3. temporarily disable Rocm.

* fix conflicts and merge upstream (#49)

fix conflicts and merge upstream

* fix compile error: only link dyload_cuda when cuda is available (#50)

* fix compile error: only link dyload_cuda when cuda is available

* fix dynload for cufft on windows (#51)

1. fix dynload for cufft on windows;
2. fix unittests.

* add NOMINMAX to compile on windows (#52)

 add NOMINMAX to compile on windows

* explicitly specify capture mode for lambdas (#55)

 explicitly specify capture mode for lambdas

* fix fft sample (#53)

* fix fft sample

* update scipy and numpy version for unittests of fft (#56)

update scipy and numpy version for unittests of fft

* Add static graph unittests of frame and overlap_add api. (#57)

* Remove cache of cuFFT & Disable ONEMKL (#59)

1. replace numpy.fft with scipy.fft as numpy<1.20 not support ortho norm
2. remove cache of cufft plans;
3. enhance error checking.
4. default WITH_ONEMKL to OFF
Co-authored-by: Njeff41404 <jeff41404@gmail.com>
Co-authored-by: Nroot <root@bjyz-sys-gpu-kongming9.bjyz.baidu.com>
Co-authored-by: NKP <109694228@qq.com>
Co-authored-by: lijiaqi <lijiaqi0612@163.com>
Co-authored-by: NXiaoxu Chen <chenxx_id@163.com>
Co-authored-by: Nlijiaqi0612 <33169170+lijiaqi0612@users.noreply.github.com>

11518a43

16 9月, 2021 1 次提交
- C
  
  Add CPU and GPU eigh op implementation (#34990) · 07d0b834
  由 crystal 提交于 9月 16, 2021
  
  07d0b834
15 9月, 2021 1 次提交
- L
  add nvidia cusparse library, test=develop (#35675) · 7e18106a
  由 Liu-xiandong 提交于 9月 15, 2021
```
Put Nvidia's cusparse library into paddle.
```
  7e18106a
09 9月, 2021 1 次提交

Add matrix_rank Op and it's GPU and CPU kernel (#34823) · eb1fbf12

由 0x45f 提交于 9月 09, 2021

* init matrix_rank op, add matrix_rank CPU code and test

* add GPU kernel, remove svd_eigen.h

* add CPU kernel when tol is tensor

* add cpu and gpu code when tol is tensor

* fix CI-ROCM error

* add matrix_rank API describe, fix PR-CI-Py3 error

* fix PR-CI-Windows error, add matrix_rank API test

* delete useless comments

* fix review

* add my code in svd_helper.h

* update doc commets

* remove spaces

eb1fbf12

02 9月, 2021 1 次提交

Add SVD Op and it's GPU and CPU kernel (#34953) · 7e5fb462

由 xiongkun 提交于 9月 02, 2021

* Add SVD Op and it's GPU and CPU kernel

* Remove CUDAPlace in test_svd_op, make the test available in CPU package

* modfity the file

* fix windows bug/ fix ROCM / fix test timeout

* for pass the CIs

* improve error report

* for code review

* some modification to test_svd_op

* change python code style

* expose the svd interface for document

7e5fb462

13 8月, 2021 1 次提交

New Einsum API (#33821) · 8c8667f0

由 Tongxin Bai 提交于 8月 13, 2021

* OP dot: refactor CPU kernels and get better loop performance.

* Minor fix on code format.

* Fixed minor errors.

* Add new API: einsum

* Update the Einsum unit test.

One case failed with matmul_v2, where the dtype is int64:

a = np.arange(2 * 3 * 1).reshape(2, 3, 1)
b = np.arange(1)
paddle.einsum("...i, ...i", a, b)

* Test cases in test_einsum test floating point dtypes only.

As of now Paddle only supports float/double dtypes in matmul, which is
one of building blocks of this Einsum implementation. We decide not to
test einsum against other dtypes.

* Polish format.

* More formatting.

* Format...

* Einsum: improve test coverage.

* Einsum: bug fixes and more testcases for testing error messages

* Einsum: fix format..

* Einsum: fixed typo and format.

* Einsum: format again...

* Einsum: applied suggested changes.

* Einsum API: improve API documentation.

* Einsum API: apply suggested changes.

* Einsum API: Add dygraph only note.

* Einsum API: Add dygraph only note.

* Einsum API: fixed unittest.

8c8667f0

29 7月, 2021 1 次提交

add fix op run order pass (#34427) · 79e758c6

由 Zeng Jinle 提交于 7月 29, 2021

* add fix op run order pass

* add ut for fix_op_run_order

* fix ci error

* improve coverage

* improve coverge again and fix cpu test case

* follow some comments

79e758c6

07 7月, 2021 1 次提交
- F
  
  add no tensorrt warning (#33874) · 758dd7bb
  由 feng_shuai 提交于 7月 07, 2021
  
  758dd7bb
24 6月, 2021 1 次提交
- Z
  Modify the search order of dynamic library (#33722) · 6801b6e2
  由 Zhou Wei 提交于 6月 24, 2021
```
* Modify the search order of dynamic library

* Modify the search order of dynamic library
```
  6801b6e2
02 6月, 2021 1 次提交
- Q
  
  [ROCM] update paddle inference cmake, test=develop (#33260) · e7541209
  由 Qi Li 提交于 6月 02, 2021
  
  e7541209
07 5月, 2021 1 次提交
- L
  Fix compile error on jetson platform (#32748) · 8ce6b393
  由 LielinJiang 提交于 5月 07, 2021
```
* fix compile error on jetson platform
```
  8ce6b393
06 5月, 2021 1 次提交

[ROCM] bugfix for unittest (#32392) · 31392627

由 ronnywang 提交于 5月 06, 2021

* fix test_unpool_op

* fix test_inplace_addto_strategy

* fix test_conv2d_fusion_op

* fix test_imperative_lod_tensor_to_selected_rows, test_imperative_selected_rows_to_lod_tensor

* fix test_dot_op

* fix test_correlation_op

* fix tracer

* fix test_memcpy_op

31392627

29 4月, 2021 1 次提交
- L
  Add op read_file and decode_jpeg (#32564) · b22f6d69
  由 LielinJiang 提交于 4月 29, 2021
```
* add op read_file and decode_jpeg
```
  b22f6d69
25 4月, 2021 1 次提交
- P
  [Paddle-TRT] Add trt runtime version check (#32443) · b0556764
  由 Pei Yang 提交于 4月 25, 2021
```
* add trt runtime version check

* use different wrap, and change to major version check
```
  b0556764
21 4月, 2021 1 次提交

【NPU】Merge NPU ccl code (#32381) · c3158527

由 zhang wenhui 提交于 4月 21, 2021

* add allreduce and broadcast without test (#31024)

add allreduce and broadcast without test

* Refactor HCCLCommContext to be compatible with Paddle (#31359)

Refactor HCCLCommContext to be compatible with Paddle (#31359)

* [NPU] add npu kernel for communication op (#31437)

* add allreduce and broadcast without test

* add c_broadcast_test case

* build c_comm_init and c_create_group operators

* make the whole thing compile

* add broadcast and init op test case but run failed

* make unit test compile

* fix broadcast test bug and change into hcom for ccl

* change c_comm_init and c_create_group ops accordingly

* make tests compile

* transfer code to 27

* compiled successfully in 28, but run failed

* test broadcast in 28, but failed

* make hcom primitives work

* change hccl data type for base.h

* fix broadcast bug

* make attributes work

* fix group name bug

* add allreduce but test failed

* allreduce bug for qiuliang

* allreduce finished

* add allgather and reducescatter

* merge all op code

* add allgather test

* finish run all ccl op test exclude send/recv

* all all op and test exclude send/recv

* send_v2_npu.cc recv_v2_npiu.cc compiled

* fix ccl core dump bug and test allgather, reducescatter, broadcast op

* fix allreduce bug just for test

* hcom send&recv test pass, without hcom_destroy

* for qiuliang test

* Ascend Send&Recv Test Pass

* all op (ex send/recv) ok

* fix bug

* merge all ccl op

* style merge to PaddlePaddle

* merge style

* new merge style

* merge style 2

* insert an empty at the end

* disable ctest for hcom to pass ci
Co-authored-by: Nvoid-main <voidmain1313113@gmail.com>
Co-authored-by: Nf2hkop <f2huestc@outlook.com>

* Add auto-increasing tag id for Hcom OPs (#31702)

* add c_reduce_sum op (#31793)

add c_reduce_sum op

* update Ascendrc hccl to 20.3 (#32126)

update Ascendrc hccl to 20.3 (#32126)

* fix merge code

* change cmake.txt1

* [NPU] Support npu kernel for c sync stream op (#31386)

* sync stream npu op

* add with_ascend_acl

* update c++ unittest

* compile all failed

* try to pre commit

* after pre commit

* merge&compile&test hccl successfully!

* fix code style

* fix code style

* fix bugs about hccl

* fix some bugs

* fix code style

* fix style

* fix style

* fix

* fixed

* merge develop
Co-authored-by: Nlw921014 <liuwei921014@yeah.net>
Co-authored-by: NVoid Main <voidmain1313113@gmail.com>
Co-authored-by: Nf2hkop <f2huestc@outlook.com>
Co-authored-by: Nxiayanming <41795079@qq.com>

c3158527

15 4月, 2021 1 次提交

[ROCM] bugfix for unit tests (#32258) · 90133d24

由 furnace 提交于 4月 15, 2021

* [ROCM] bugfix for test_conv_transpose_nn_grad

* [ROCM] bugfix for test_batch_norm_op_v2

* [ROCM] bugfix for test_empty_like_op

* [ROCM] bugfix for test_conv_transpose_nn_grad

90133d24

09 4月, 2021 1 次提交

[NPU] cherry-pick basic NPU components/allocator/operator/executor supports from ascendrc (#32144) · ccf5709d

由 Leo Chen 提交于 4月 09, 2021

* [feature] support npu allocator (#30840)

[feature] support npu allocator

* [feature] support npu operator (#30951)

[feature] support npu operator

* [feature] support npu allocator, part 2 (#30972)

* support npu allocator

* add npu device context

* fix some compile problem

* fix some compile problem

* add npu info

* compile ok

* fix include dir

* support naive_best_fit_allocator

* run ut ok, bug failed to exit

* call aclrtResetDevice before exit

* fix aclFinilize

* add system allocatot test

* add selected_gpus in gtest

* add tensor_test for npu

* support npu op, initial commit

* add npu stream

* add elementwise_add_op

* compile ok

* fix typo

* fix elementwise_add_op_npu_test

* support op run

* test can run but failed

* change aclopExecuteV2 to aclopCompileAndExecute

* support parsing ascend rank table file (#31000)

support parsing ascend rank table file

* Fix reshape on GE graph. (#31084)

Fix reshape on GE graph

* add npu kernel for elementwise_sub and elementwise_sub_grad (#30973)

* add npu sub op

* fix typo

* rename test

* fix bug

* fix bug

* add fp16 kernel

* fix typo

* support sub grad op

* support elementwise_sub_grad op
Co-authored-by: Nfrankwhzhang <frankwhzhang@126.com>

* Fix compilation problem (#31100)

Fix compilation problem (#31100)

* fix compile

* fix code stype

* remove const_cast

* support adding correct npu op in pybind.h (#31143)

* support adding correct npu op in pybind.h

* refine code

* [NPU] Support executor with NPU (#31057)

* [NPU] Support executor with NPU

* Fix code according to reviews

* Fix code

* Add unittest for sub op npu

* refactor npu device manager (#31154)

refactor npu device manager (#31154)

* fix selected npus

* fix compile

* fix reading flags from env

* format
Co-authored-by: Nxiayanming <41795079@qq.com>
Co-authored-by: Ngongweibao <weibao.gong@gmail.com>
Co-authored-by: Nfrankwhzhang <frankwhzhang@126.com>
Co-authored-by: Nliym27 <33742067+liym27@users.noreply.github.com>

ccf5709d

PaddlePaddle / Paddle 1 年多 前同步成功

PaddlePaddle / Paddle
1 年多前同步成功