Commits · 7a724ddb30c677b994b907e967b308a42ac8c7ad · 机器未来 / Paddle

11 Oct, 2021 · 3 commits

fix multi-node (#36329) · 7a724ddb
Committed by yaoxuefeng on Oct 11, 2021

7a724ddb

Committed by wangxinxin08 on Oct 11, 2021

* add mish trt plugin, compile & install success, run error. test=develop
* modify code according to review
* add TRT_NOEXCEPT for mish trt plugin
* add unittest for mish trt plugin
* remove unnecessary check of mish in op_teller.cc
* fix some problem of trt8
* add check and modify unittest while converting mish to trt plugin
Co-authored-by: dengkaipeng <dengkaipeng@baidu.com>

2b7b752a

Add use_cinn Flag and RunFromCinn in PE (#36107) · 5690666c

Committed by Huihuang Zheng on Oct 11, 2021

Add use_cinn flag and use it to control whether we run PaddlePaddle using CINN.

Also add:

* A pass that replaces the PaddlePaddle graph with a CINN graph
* A PE method to feed data and run the graph by CINN

5690666c

09 Oct, 2021 · 2 commits
- Add const for OpDesc::id() and VarDesc::id() (#36298) · cb620ca6
  Committed by Zeng Jinle on Oct 09, 2021
```
* add const OpDesc id()

* add const for VarDesc::id()
```
  cb620ca6
- C++ support register pass via PassDesc (#36095) · 2fd8deea
  Committed by wuhuanzhou on Oct 09, 2021
```
Support registering GeneratePass from C++, which simplifies development of subgraph optimizations such as fusion.
```
  2fd8deea
08 Oct, 2021 · 1 commit

Support CUDA Graph on ParallelExecutor (#36250) · f9591bb1

Committed by Zeng Jinle on Oct 08, 2021

* support CUDA Graph on PE

* add ut, fix CI compile

* reduce memory consumption

* fix CUDA 10 CI

* improve coverage

* improve python coverage

f9591bb1

30 Sep, 2021 · 1 commit
- add slotrecord datafeed (#36099) · 0a3dbe8a
  Committed by yaoxuefeng on Sep 30, 2021
  
  0a3dbe8a
29 Sep, 2021 · 6 commits
- Spinlock (#36030) · a9ea41c5
  Committed by liutiexing on Sep 29, 2021
```
* add align for WorkQueue

* add spinlock

* merge spinlock
```
  a9ea41c5
- add slot record dataset (#36200) · 79bd5f90
  Committed by yaoxuefeng on Sep 29, 2021
  
  79bd5f90
- Implement the grad and enhance the cache of norm_convolution fusion ops. (#36168) · 767050d9
  Committed by Yiqun Liu on Sep 29, 2021
  
  767050d9
- remove wait if no fetch (#36150) · b3d2dc7b
  Committed by Zeng Jinle on Sep 29, 2021
  
  b3d2dc7b
- fix nullptr block in op_teller (#36197) · 667bf188
  Committed by baoachun on Sep 29, 2021
  
  667bf188
- refine case when thread_num = 1 (#36201) · 7e60cc63
  Committed by Zeng Jinle on Sep 29, 2021
  
  7e60cc63
28 Sep, 2021 · 6 commits
- [HeterPs]ps gpu dump (#36157) · 97d30602
  Committed by Thunderbrook on Sep 28, 2021
```
* ps gpu dump

* remove log
```
  97d30602
- [Bug fix] Fix dygraph double grad dtype error (#36125) · af4f018a
  Committed by Jiabin Yang on Sep 28, 2021
```
* fix dygraph double grad dtype error in higher-order differentiation scenarios

* reinvoke ci

* add test for partial_engine.cc
```
  af4f018a
- reduce calls to SizeOfType (#36110) · c719add7
  Committed by Leo Chen on Sep 28, 2021
  
  c719add7
- rename scale loss grad (#36162) · ad128144
  Committed by Zeng Jinle on Sep 28, 2021
  
  ad128144
- Add Basic CINN Runner Class (#35978) · 6f18b041
  Committed by Huihuang Zheng on Sep 28, 2021
```
* Add Basic CINN Runner Class

* Add CinnCacheKey

* Add Cache logic and improve CinnCacheKey


* Modify as reviewer commented

* Implement hash_combine to fix MAC build.
```
  6f18b041
- dlpack fix (#35817) · 74ff59cf
  Committed by Siming Dai on Sep 28, 2021
  
  74ff59cf
27 Sep, 2021 · 2 commits

gloo hdfs set check & gloo connect retry (#35750) · ae382d1f

Committed by xiaoxiao-luomu on Sep 27, 2021

* gloo hdfs set check & gloo connect retry

* add vlog

* print gloo connect addr & add vlog

* .

* modify vlog

* modify vlog

* modify vlog

ae382d1f

Polish multi-thread schedule strategy and Keep one task in current thread (#35928) · 0e5d81c7
Committed by Aurelius84 on Sep 27, 2021
```
* Polish multi-thread schedule strategy

* fix atomic_deps

* modify into lambda function

* add and run
```
0e5d81c7

26 Sep, 2021 · 1 commit
- set file_num in one shard (#35835) · 991dc67d
  Committed by Thunderbrook on Sep 26, 2021
```
* set file_num in one shard

* format
```
  991dc67d
24 Sep, 2021 · 1 commit
- add multihead_matmul trt converter test case (#36023) · fcaa64b3
  Committed by baoachun on Sep 24, 2021
```
* add multihead_matmul trt converter test case

* move attribute check to op_teller
```
  fcaa64b3
23 Sep, 2021 · 1 commit

Optimize workqueue (#35931) · 4e7bd9c3

Committed by liutiexing on Sep 23, 2021

* add align for WorkQueue

* WorkQueue update

* Revert "WorkQueue update"

This reverts commit 14ce793dbb204f8ddec63c34b3b72a73c7cdb93a.

* optimize WorkQueue

4e7bd9c3

22 Sep, 2021 · 6 commits
- Fix copy elision warning (#35885) · 47d6bc86
  Committed by Tomasz Socha on Sep 22, 2021
```
* Fix copy elision warning

* Remove redundant code
```
  47d6bc86
- add no need buffer check, test=develop (#35790) · 7ebbcbbc
  Committed by wanghuancoder on Sep 22, 2021
  
  7ebbcbbc
- fix: delete_quant_dequant_filter_op_pass, delete_quant_dequant_op_pass (#35879) · 5cda6b2b
  Committed by Wangzheee on Sep 22, 2021
  
  5cda6b2b
- add timeline(recordevent) for new executor, test=develop (#35831) · 5574c8cf
  Committed by wanghuancoder on Sep 21, 2021
  
  5574c8cf
- refine gc for new_executor (#35764) · fab1a029
  Committed by wanghuancoder on Sep 21, 2021
```
* refine gc for new_executor, test=develop

* refine, test=develop

* refine, test=develop

* merge, test=develop
```
  fab1a029
- Modify H2D and D2H as kQueue::Sync and Polish Schedule logic (#35866) · fe35496b
  Committed by Aurelius84 on Sep 22, 2021
```
* Modify H2D and D2H as kQueue::Sync

* fix interface error
```
  fe35496b
18 Sep, 2021 · 6 commits

Basic PR on Cost Model (#35774) · 5ba9fe6e

Committed by Huihuang Zheng on Sep 18, 2021

Add a basic Cost Model. It uses the executor to run a program and profiles the run to get per-op time.

This is an early, basic version; we will add more functions in the future.

5ba9fe6e

trt engine dtor when the last predictor dtor. (#35842) · 8a239ae5
Committed by Wilber on Sep 18, 2021

8a239ae5

Committed by Feiyu Chan on Sep 18, 2021

* 1. add interface for fft;
2. add data type predicate;
3. fix paddle.roll.

* add fft c2c cufft kernel

* implement argument checking & op calling parts for fft_c2c and fftn_c2c

* add operator and opmaker definitions

* only register float and double for cpu.

* add common code for implementing FFT, add pocketfft as a dependency

* add fft c2c cufft kernel function

* fix bugs in python interface

* add support for c2r, r2c operators, op makers, kernels and kernel functors.

* test and fix bugs

* 1. fft_c2c function: add support for onesided=False;
2. add complex<float>, complex<double> support for concat and flip.

* 1. fft: fix python api bugs;
2. shape_op: add support for complex data types.

* fft c2c cufft kernel done with compile and link

* fix shape_op, add mkl placeholder

* remove mkl

* complete fft c2c in gpu

* 1. implement mkl-based fft, FFTC2CFunctor and common function exec_fft;
2. change the design, add input and output typename as template parameter for all FFTFunctors, update pocketfft-based implementation.

* complete fft c2c on gpu in ND

* complete fft c2c on gpu in ND

* complete fft c2c backward in ND

* fix MKL-based implementation

* Add frame op and CPU/GPU kernels.

* Add frame op forward unittest.

* Add frame op forward unittest.

* Remove axis parameter in FrameFunctor.

* Add frame op grad CPU/GPU kernels and unittest.

* Add frame op grad CPU/GPU kernels and unittest.

* Update doc string.

* Update after review and remove librosa requirement in unittest.

* Update grad kernel.

* add fft_c2r op

* Remove data allocation in TransCompute function.

* add fft r2c onesided with cpu(pocketfft/mkl) and gpu

* last fft c2r functor

* fix C2R and R2C for cufft, because the direction is not an option in these cases.

* add fft r2c onesided with cpu(pocketfft/mkl) and gpu

* fix bugs in python APIs

* fix fft_c2r grad kernel

* fix bugs in python APIs

* add cuda fft c2r grad kernel functor

* clean code

* fix fft_c2r python API

* fill fft r2c result with conjugate symmetry (#19)

fill fft r2c result with conjugate symmetry

* add placeholder for unittests (#24)

* simply parameterize test functions by auto-generating test cases from a param list (#25)

* miscellaneous fixes for python APIs (#26)

* add placeholder for unittests

* resize fft inputs before computation if n or s is provided.

* add complex kernels for pad and pad_grad

* simplify argument checking.

* add type promotion

* add int to float or complex promotion

* fix output data type for static mode

* fix fft's input dtype dispatch, import fft to paddle

* fix typos in axes checking (#27)

* fix typos in axes checking

* fix argument checking (#28)

* fix argument checking

* Add C2R Python layer normal and abnormal use cases (#29)

* documents and single case

* test c2r case

* New C2R Python layer normal and exception use cases

* complete rfft,rfft2,rfftn,ihfft,ihfft2,ihfftn unittest and doc string (#30)

* Documentation of the common interfaces of c2r and c2c (#31)

* Documentation of the common interfaces of c2r and c2c

* clean c++ code  (#32)

* clean code

* Add numpy-based implementation of spectral ops (#33)

* add numpy reference implementation of spectral ops

* Add fft_c2r numpy based implementation for unittest. (#34)

* add fft_c2r numpy implementation

* Add deframe op and stft/istft api. (#23)

* Add frame api

* Add deframe op and kernels.

* Add stft and istft apis.

* Add deframe api. Update stft and istft apis.

* Fix bug in frame_from_librosa function when input dims >= 3

* Rename deframe to overlap_add.

* Update istft.

* Update after code review.

* Add overlap_add op and stft/istft api unittest (#35)

* Add overlap_add op unittest.

* Register complex kernels of squeeze/unsqueeze op.

* Add stft/istft api unittest.

* Add unittest for fft helper functions (#36)

* add unittests for fft helper functions. add complex kernel for roll op.

* complete static graph unittest for all public api (#37)

* Unittest of op with FFT C2C, C2R and r2c added (#38)

* documents and single case

* test c2r case

* New C2R Python layer normal and exception use cases

* Documentation of the common interfaces of c2r and c2c

* Unittest of op with FFT C2C, C2R and r2c added
Co-authored-by: lijiaqi <lijiaqi0612@163.com>

* add fft related options to CMakeLists.txt

* fix typos and clean code (#39)

* fix invisible character in mkl branch and fix error in error message

* clean code: remove docstring from unittest for signal.py.

* always convert numpy array to paddle.Tensor to avoid comparing numpy dtype with paddle dtype. (#40)

* always convert numpy array to paddle.Tensor to avoid comparing numpy dtype with paddle dtype.

* fix CI Errors: numpy dtype comparison, thrust when cuda is not available (#41)

1. always convert numpy array to paddle.Tensor to avoid comparing numpy dtype with paddle dtype.
2. promote floating point tensor to complex tensor for fft_c2c and fft_c2r;
3. fix unittest to catch UnImplementedError and RuntimeError;
4. fix compile error by avoid using thrust when cuda is not available.
5.  fix sample code, use paddle.fft instead of paddle.tensor.fft

* remove inclusion of thrust, add __all__ list for fft (#42)

* Add api doc and update unittest. (#43)

* Add doc strings.
* Update overlap_add op unittest

* fix MKL-based FFT implementation (#44)

* fix MKL-based FFT implementation, MKL CDFT's FORWARD DOMAIN is always REAL for R2C and C2R

* remove code for debug (#45)

* use dynload for cufft (#46)

* use std::ptrdiff_t as datatype of stride (instead of int64_t) to avoid argument mismatch on some platforms.

* add complex support for fill_zeros_like

* use dynload for cufft

* Update doc and unittest. (#47)

* Add doc of frame op and overlap_add op.

* Update unittest.

* use dynload for cufft (#48)

1. use dynload for cufft
2. fix unittest;
3. temporarily disable Rocm.

* fix conflicts and merge upstream (#49)

fix conflicts and merge upstream

* fix compile error: only link dynload_cuda when cuda is available (#50)

* fix compile error: only link dynload_cuda when cuda is available

* fix dynload for cufft on windows (#51)

1. fix dynload for cufft on windows;
2. fix unittests.

* add NOMINMAX to compile on windows (#52)

 add NOMINMAX to compile on windows

* explicitly specify capture mode for lambdas (#55)

 explicitly specify capture mode for lambdas

* fix fft sample (#53)

* fix fft sample

* update scipy and numpy version for unittests of fft (#56)

update scipy and numpy version for unittests of fft

* Add static graph unittests of frame and overlap_add api. (#57)

* Remove cache of cuFFT & Disable ONEMKL (#59)

1. replace numpy.fft with scipy.fft, as numpy<1.20 does not support ortho norm;
2. remove cache of cufft plans;
3. enhance error checking.
4. default WITH_ONEMKL to OFF
Co-authored-by: jeff41404 <jeff41404@gmail.com>
Co-authored-by: root <root@bjyz-sys-gpu-kongming9.bjyz.baidu.com>
Co-authored-by: KP <109694228@qq.com>
Co-authored-by: lijiaqi <lijiaqi0612@163.com>
Co-authored-by: Xiaoxu Chen <chenxx_id@163.com>
Co-authored-by: lijiaqi0612 <33169170+lijiaqi0612@users.noreply.github.com>

11518a43

Clean ParseMemInfo and Fix unittest failed under multi-thread (#35840) · 2fff5a58
Committed by Aurelius84 on Sep 18, 2021
```
* Clean ParseMemInfo and fix unittest with multi-thread

* fix declare
```
2fff5a58

[oneDNN] Disable caching of Reorder operation (#35664) · e4c2a854

Committed by Jacek Czaja on Sep 18, 2021

* - Reorder disabling caching

* - compilation fix

* - another compilation fix

* - another compilation fix

* - compilation fix

* - Fix

* - yet another compilation fix

* - surprisingly another compilation fix

* - lint

* - fix after review

* - fix

e4c2a854

Add new API "eigvals" in linalg (#35720) · d411a038

Committed by From00 on Sep 18, 2021

* Add linalg.eigvals API

* pre-commit check

* Adjust code style

* Fix conflict

* Improve code style

* Modify the test code to ignore testing CUDA kernel

* Sort output data before checking in test code

* Set timeout value for UT

* Improve API example code to pass CI

* Fix bug for None fetch_list in Windows

* Delete grad Op

d411a038

17 Sep, 2021 · 4 commits

change to PADDLE_DEFINE_EXPORTED (#35841) · d22914fd
Committed by Zeng Jinle on Sep 17, 2021

d22914fd

add inplace op support to prune, scale_op is no longer needed in jit.save (#35730) · 21921936

Committed by Haipeng Wang on Sep 17, 2021

* adding scale_op in the model save step is not necessary; just fix the prune method to support static graph and inplace ops

* fix jit.save, no need to add scale_op to each outputvar anymore.
fix prune_with_input, now it supports inplace op

* temporarily disable test_trt_dynamic_shape.TRTDynamicShapeOutOfBound2Test

21921936

Integrate MultiThreadedWorkQueue to execute program ops (#35356) · a0871194

Committed by Aurelius84 on Sep 17, 2021

* format code

* format interface

* polish interface

* Remove std::memory_order

* modify into SpinLock

* remove fetch_context_pool_

* fix comment

* modify into WorkQueueGroup

* refine code

* fix pointer

* fix paddle_enforce

* split into AsyncWorkQueue

* polish code

* specify std::memory_relax

* fix atomic fetch_sub

* fix num_thread

a0871194

GeneratePass for Python Pass (#35708) · f6db9806

Committed by wuhuanzhou on Sep 17, 2021

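#### Background

#35602 provides a way to develop subgraph-replacement passes from Python:

- Use the Paddle Python API or helper types to define the subgraph programs used to match and replace parts of the graph;
- When a pass is registered from Python, the registration function is ultimately converted into a PassDesc, a data form defined in protobuf, which the C++ side parses to register the pass instance.

This PR parses the rules described by a PassDesc and generates the pass instance.

#### Design

##### Pass rule verification

In earlier pass development, operator iteration could cause a pattern to stop matching or to match incorrectly. This can be detected by scanning the attributes an operator supports and their types, to decide whether the pass should still be applied or to prompt the developer to update the pass code.

Pass development currently provides the operator-compatibility class OpCompatSensiblePass to address this problem. It has a shortcoming: because earlier passes only obtain their pattern information at run time, the check can only happen while the pass is executing.

A pass expressed as a PassDesc can be checked for these problems before it runs; this is done in VerifyDesc.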
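
As a rough illustration of checking a pass description before it runs, here is a minimal Python sketch. It is not Paddle's VerifyDesc: the function name `verify_desc`, the `OP_SCHEMAS` table, and its attribute entries are all hypothetical and exist only to show the idea of comparing a pattern's attribute constraints against what each operator supports.

```python
# Hypothetical operator schemas: attribute name -> expected Python type.
# These entries are illustrative, not Paddle's actual operator definitions.
OP_SCHEMAS = {
    "matmul": {"transpose_X": bool, "transpose_Y": bool, "alpha": float},
    "elementwise_add": {"axis": int},
}


def verify_desc(pattern_ops, op_schemas=OP_SCHEMAS):
    """Return a list of problems found in a pattern, without running it.

    pattern_ops is a list of (op_type, attrs) pairs, where attrs maps an
    attribute name to the value the pattern constrains it to.
    """
    problems = []
    for op_type, attrs in pattern_ops:
        schema = op_schemas.get(op_type)
        if schema is None:
            problems.append(f"unknown operator: {op_type}")
            continue
        for name, value in attrs.items():
            if name not in schema:
                problems.append(f"{op_type}: unsupported attribute {name!r}")
            elif not isinstance(value, schema[name]):
                problems.append(
                    f"{op_type}.{name}: expected {schema[name].__name__}, "
                    f"got {type(value).__name__}"
                )
    return problems


# A pattern that agrees with the schemas produces no problems.
ok = verify_desc([("matmul", {"transpose_X": False}),
                  ("elementwise_add", {"axis": -1})])

# A pattern written against an attribute name the operator does not have.
stale = verify_desc([("matmul", {"trans_x": False})])
```

Because the check only needs the description and the operator schemas, it can run at registration time rather than while the pass is executing.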

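##### Building the pattern from the match subgraph

GeneratePass uses GraphPatternDetector for graph matching and replacement. Building the match pattern amounts to adding PDNodes and edges to the detector's PDPattern member. This is done in the function `InitGeneratePattern`, which is deliberately not a member method of GeneratePass: new detectors may be developed later, and GeneratePass's own operations are independent of the detector.

Pattern initialization iterates over every operator of the match-subgraph program:

1. Add the PDNode for the current operator together with its constraints (operator type, attribute constraints, and so on);
2. Iterate over the operator's inputs and try to fetch each one's PDNode from the pattern:
   - A PDNode is found in the pattern and is an output node: it belongs to the interior of the match subgraph, so mark it as an intermediate node;
   - No PDNode is found in the pattern: add the input PDNode and mark it as an input node;
   - Add the edge from the input to the operator;
3. Iterate over the operator's outputs:
   - A PDNode is found in the pattern and is an input node: it belongs to the interior of the match subgraph, so mark it as an intermediate node;
   - No PDNode is found in the pattern: add the output PDNode and mark it as an output node;
   - Add the edge from the operator to the output;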
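
The three steps above can be sketched in plain Python. This is a toy model, not Paddle code: `Op`, `Pattern`, and `init_pattern` are hypothetical stand-ins for the program operators, PDPattern, and `InitGeneratePattern`, and variable roles are tracked as strings instead of PDNode flags.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    type: str
    inputs: list
    outputs: list


@dataclass
class Pattern:
    op_nodes: list = field(default_factory=list)   # one node per operator
    var_roles: dict = field(default_factory=dict)  # var name -> role
    edges: list = field(default_factory=list)      # (source, target) pairs


def init_pattern(ops):
    pattern = Pattern()
    for index, op in enumerate(ops):
        # Step 1: add the operator node (its type is the only constraint here).
        op_node = f"{op.type}#{index}"
        pattern.op_nodes.append(op_node)

        # Step 2: inputs. A variable already known as an output of an earlier
        # operator is interior to the subgraph, so it becomes intermediate.
        for name in op.inputs:
            role = pattern.var_roles.get(name)
            if role == "output":
                pattern.var_roles[name] = "intermediate"
            elif role is None:
                pattern.var_roles[name] = "input"
            pattern.edges.append((name, op_node))

        # Step 3: outputs, symmetric to step 2.
        for name in op.outputs:
            role = pattern.var_roles.get(name)
            if role == "input":
                pattern.var_roles[name] = "intermediate"
            elif role is None:
                pattern.var_roles[name] = "output"
            pattern.edges.append((op_node, name))
    return pattern


# Match subgraph for a matmul followed by a bias add.
ops = [
    Op("matmul", ["x", "w"], ["tmp"]),
    Op("elementwise_add", ["tmp", "b"], ["out"]),
]
pattern = init_pattern(ops)
```

In this example `tmp` is first recorded as an output of `matmul` and is then promoted to intermediate when `elementwise_add` consumes it, while `x`, `w`, and `b` stay inputs and `out` stays an output.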

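##### Rewriting the graph from the replacement subgraph

The replacement is performed in the function `GetGenerateRewrite`, which, like `InitGeneratePattern`, is not a member method of GeneratePass.

The rewrite is generated as follows:

1. Check for a redundant replacement subgraph;
2. Iterate over every operator of the replacement-subgraph program and add the replacement Nodes:
   1. Add the Node for the current operator and set its attributes;
   2. Iterate over the operator's inputs and add intermediate variable nodes;
   3. Iterate over the operator's outputs and add intermediate variable nodes;
   4. Add the edges between the input/output nodes and the operator node;
3. Delete the Nodes of the matched graph that are intermediate nodes;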
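
A minimal Python sketch of this rewrite flow follows, under simplifying assumptions: the graph is a flat list of operators, variables are plain names, and the redundancy check in step 1 is modelled as "the replacement is identical to the match", which is a guess at the intent rather than what `GetGenerateRewrite` does. `Op` and `rewrite` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    type: str
    inputs: tuple
    outputs: tuple


def rewrite(graph_ops, matched_ops, replace_ops):
    """Return (new operator list, names of removed intermediate variables)."""
    # Step 1 (assumed meaning): a replacement equal to the match changes nothing.
    if list(replace_ops) == list(matched_ops):
        return list(graph_ops), set()

    # Variables both produced and consumed inside the match are intermediates.
    produced = {name for op in matched_ops for name in op.outputs}
    consumed = {name for op in matched_ops for name in op.inputs}
    intermediates = produced & consumed

    # Step 2: put the replacement operators where the match started; they
    # connect to the surrounding graph through shared variable names.
    first = min(graph_ops.index(op) for op in matched_ops)
    kept = [op for op in graph_ops if op not in matched_ops]
    new_ops = kept[:first] + list(replace_ops) + kept[first:]

    # Step 3: intermediates that nothing references any more are dropped.
    live = {name for op in new_ops for name in op.inputs + op.outputs}
    return new_ops, intermediates - live


matmul = Op("matmul", ("x", "w"), ("tmp",))
add = Op("elementwise_add", ("tmp", "b"), ("out",))
relu = Op("relu", ("out",), ("y",))
fc = Op("fc", ("x", "w", "b"), ("out",))

new_ops, removed = rewrite([matmul, add, relu], [matmul, add], [fc])
```

Here the matched `matmul` and `elementwise_add` are replaced by a single `fc`, `relu` is untouched because it only depends on `out`, and `tmp` is reported as the removed intermediate.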

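##### Verifying the optimized subgraph

Whether the replacement subgraph, or the computation graph after replacement, can run correctly could be verified while the pass executes, which would prevent exceptions when the graph is run later.

Passes currently modify the computation graph in place, so a failed verification cannot be rolled back cleanly. For now subgraph verification defaults to success; this is left for future improvement.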

f6db9806

机器未来 / Paddle · in sync with the fork's source project
