Commits · e7842ba6670824efa5484b7ccfe9b364949a6fb7 · BaiXuePrincess / Paddle

28 Oct, 2021 · 2 commits
- Rewrite Softmax in Kernel Primitive API, test=develop (#36706) · ef76f664
  Committed by Liu-xiandong on Oct 28, 2021
- Fix fused_attention_op and fused_feedforward_op bug when pre_layer_norm is false. (#36793) · ff3018d7
  Committed by Li Min on Oct 28, 2021
```
* Fix bug when pre_layer_norm is false.
```
27 Oct, 2021 · 8 commits

add unittest (#36511) · 51a33962
Committed by pangyoki on Oct 27, 2021

Added fp32 / bf16 forward and backward elementwise_div_mkldnn operator (#36158) · e92e6b06

Committed by piotrekobiIntel on Oct 27, 2021

* Add WIP version of elementwise_div_mkldnn without working dy grad

* Add dy gradient calculation implementation, disable broadcast tests

* Readd removed tests from static_mode_white_list

* Add bfloat16 gradient tests, remove int8 and uint8 support

* Change the way dy grad is calculated to improve performance

* Refactor BinaryMKLDNNHandler to use a default parameter

* Change copyright year

* Refactor as suggested

* Attempt to bypass CI Approval not accepting max_relative_error

* Fix formatting issue

Add LRUCache for fft plans (#36646) · 737992eb

Committed by Feiyu Chan on Oct 27, 2021

* WIP: add cache

* delete move constructor and operator= for CuFFTHandle and FFTConfig

* remove log from CuFFTHandle and FFTConfig

* add lrucache for fft rocm backend

* disable LRUCache when CUFFT_VERSION >= 10200

* disable copy and move for hipFFTHandle; format code

* clean debug code
Co-authored-by: Xiaoxu Chen <chenxx_id@163.com>
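
The idea of an LRU (least recently used) cache for FFT plans can be sketched as follows. This is a minimal illustration in plain Python, not the C++ code from this commit: the class name `PlanCache`, the `make_plan` callback, and the tuple-shaped keys are all hypothetical stand-ins for whatever the real implementation uses.

```python
from collections import OrderedDict


class PlanCache:
    """Hypothetical LRU cache keyed by an FFT configuration tuple."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._plans = OrderedDict()  # oldest entry first, newest last

    def get_or_create(self, key, make_plan):
        # Cache hit: mark the plan as most recently used and return it.
        if key in self._plans:
            self._plans.move_to_end(key)
            return self._plans[key]
        # Cache miss: build the plan, then evict the least recently used
        # entry if the cache has grown past its capacity.
        plan = make_plan(key)
        self._plans[key] = plan
        if len(self._plans) > self.capacity:
            self._plans.popitem(last=False)
        return plan

    def keys(self):
        return list(self._plans)


cache = PlanCache(capacity=2)
built = []  # records which keys actually triggered plan construction


def make_plan(key):
    built.append(key)
    return ("plan", key)


cache.get_or_create((8, "c2c"), make_plan)
cache.get_or_create((16, "c2c"), make_plan)
cache.get_or_create((8, "c2c"), make_plan)   # hit, refreshes (8, "c2c")
cache.get_or_create((32, "r2c"), make_plan)  # evicts (16, "c2c")
```

Reusing a cached plan avoids rebuilding it on every call with the same configuration, and the capacity bound keeps the number of live plans limited.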

add matmul_v2 to v1 CPU pass and fix matmul dim error (#36731) · d5245a35
Committed by baoachun on Oct 27, 2021
```
* fix matmul dim error

* fix wrong dim check in matmul
```

fix fftshift/ifftshift on static mode (#36748) · 34b6860e

Committed by Feiyu Chan on Oct 27, 2021

* fix fftshift/ifftshift on static mode
* update roll_op version
* add more test cases for fftshift/ifftshift
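
For reference, the relationship between `fftshift`/`ifftshift` and a circular roll can be sketched in one dimension. This is a plain-Python illustration on lists, not Paddle code. That the Paddle operators are built on `roll_op` is an assumption drawn from the commit message above, and the helper names below are local to this sketch.

```python
def roll(values, shift):
    """Circularly shift a list to the right by `shift` positions."""
    n = len(values)
    if n == 0:
        return []
    shift %= n
    return values[-shift:] + values[:-shift] if shift else list(values)


def fftshift(values):
    # Move the zero-frequency entry to the center: roll right by n // 2.
    return roll(values, len(values) // 2)


def ifftshift(values):
    # Undo fftshift: roll left by n // 2 (differs from fftshift for odd n).
    return roll(values, -(len(values) // 2))


freqs = [0, 1, 2, -2, -1]          # FFT output order for n = 5
centered = fftshift(freqs)          # [-2, -1, 0, 1, 2]
restored = ifftshift(centered)      # [0, 1, 2, -2, -1]
```

For even lengths the two functions coincide; for odd lengths they differ by one position, which is why both exist.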

add fp16 unittests for kl2 (#36583) · 6838a187
Committed by taixiurong on Oct 27, 2021

add paddle.linalg.eigvalsh API (#35615) · 9f9ed3ae

Committed by huangjun12 on Oct 27, 2021

* add eigvalsh with is_test

* add eigvalsh op

* fix backward bug

* forward and backward, float and complex, unittest

* remove eigvalsh_helper.h

* remove changes of cusolver.h

* fix unittest

* fix unittest bug

* update code following eigh

* fix test

* update lapack

* pull develop

* update functor

* fix unittest bug

* fix details

* add tensor_method_func

* fix notes

Fix inverse in fake quant (#36762) · 542ba214
Committed by whs on Oct 27, 2021

26 Oct, 2021 · 7 commits

Add fused attention op backward and python layer. (#36498) · 5119428e

Committed by Li Min on Oct 26, 2021

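Purpose: this PR aims to improve the computational performance of the attention module.
To reduce the framework's op scheduling overhead, this PR implements the attention module by hand in C++ and exposes it as a single large attention op.
To reduce memory-access overhead, this PR applies two optimizations:
(1) When computing q, k, and v, the shared input X is used to reduce the gemm, transpose, and bias add at that point from three calls to one;
(2) Kernel fusion is used so that data is passed between different CUDA kernels through registers.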
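
Optimization (1) rests on a simple identity: projecting a shared input X with three weight matrices gives the same result as one projection with the weights concatenated column-wise. The sketch below checks this in plain Python on small integer matrices. It is not the CUDA implementation from this PR; the helper names and the tiny shapes are hypothetical, and the transpose step is omitted for brevity.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_bias(m, bias):
    """Add a bias vector to every row of a matrix."""
    return [[v + b for v, b in zip(row, bias)] for row in m]


def hconcat(*mats):
    """Concatenate matrices column-wise."""
    return [sum(rows, []) for rows in zip(*mats)]


x = [[1, 2], [3, 4], [5, 6]]              # 3 tokens, hidden size 2
w_q, w_k, w_v = [[1, 0], [0, 1]], [[2, 1], [0, 1]], [[1, 1], [1, -1]]
b_q, b_k, b_v = [0, 1], [1, 0], [2, 2]

# Unfused: three gemm + bias-add calls, each reading X again.
q = add_bias(matmul(x, w_q), b_q)
k = add_bias(matmul(x, w_k), b_k)
v = add_bias(matmul(x, w_v), b_v)

# Fused: one gemm + bias add against the concatenated weights.
w_qkv = hconcat(w_q, w_k, w_v)            # shape 2 x 6
b_qkv = b_q + b_k + b_v
qkv = add_bias(matmul(x, w_qkv), b_qkv)   # shape 3 x 6

# Split the fused result back into q, k, v (columns 0-1, 2-3, 4-5).
q2 = [row[0:2] for row in qkv]
k2 = [row[2:4] for row in qkv]
v2 = [row[4:6] for row in qkv]
```

The fused form reads X once instead of three times, which is where the memory-access saving comes from.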

roll_op: support Tensor as input for shifts (#36727) · 7b1e30fc
Committed by Feiyu Chan on Oct 26, 2021

Add roi_align grad (#36724) · 236ed94d
Committed by zhulei on Oct 26, 2021

Move fused_attention and fused_feedforward functional api path to incubate (#36704) · 9aeca2f1
Committed by Li Min on Oct 26, 2021
```
Move the Python API interfaces added in PRs #35905 and #35843 into the incubate directory.
```

[NPU] fix argsort op, test=develop (#36576) · 3523bbe8

Committed by Qi Li on Oct 26, 2021

* [NPU] fix argsort op, test=develop

* remove debug files, test=develop

* fix typo, test=develop

* address review comments, test=develop

Optimize FasterTokenizer (#36701) · 290ded7a
Committed by Jack Zhou on Oct 26, 2021
```
* optimize fast tokenizer
```

Pool3d 2.0 (#36545) · 229bae81
Committed by feng_shuai on Oct 26, 2021

25 Oct, 2021 · 6 commits

[NPU] modifications for model ernie-1.0 (#36642) · 19b02d95
Committed by Aganlengzi on Oct 25, 2021
```
* [NPU] modifications for model ernie-1.0

* rollback 503003 and change cast to dtype
```

add op: fused_feedforward(backward) (#35611) · 2dd0a46a

Committed by zhangkaihuo on Oct 25, 2021

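This PR contains the backward code for fused_feedforward.

Related kernel implementations: fused_dropout_act_bias, fused_residual_dropout_bias, fused_layernorm_residual_dropout_bias

fused_feedforward is a fused operator that fuses and encapsulates the operators of the transformer model's feed-forward layer, so that the front end exposes a single interface. The fusion reduces some memory accesses and kernel launch time, which improves performance.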

Add bincount op (#36317) · 39f19127

Committed by smallv0221 on Oct 25, 2021

* Add bincount op

* upload cpu version

* fix unittest

* fix unittest

* fix unittest

* fix en doc

* add more test

* fix en doc

* add more test case

* fix test

* fix input validation

* fix input check

* fix unittest

* fix test

* fix en doc

add some ops to train ssd on kunlun (#36407) · 50778ad6

Committed by TTerror on Oct 25, 2021

* add some ops to train ssd on kunlun

* add some ops to train ssd on kunlun

* add some ops to train ssd on kunlun

* update cast op unittest

* update cast op unittest

* update cast op unittest

* update xpu cmake

* update cast unittest

Fix grid sampler while input size is [1] (#36183) · eff3ee5e
Committed by whs on Oct 25, 2021

add op: fused_feedforward(forward) (#35843) · b18cbfb2

Committed by zhangkaihuo on Oct 25, 2021

This PR contains only the forward code for fused_feedforward.

Related kernel implementations: fused_dropout_act_bias, fused_residual_dropout_bias, fused_layernorm_residual_dropout_bias

23 Oct, 2021 · 1 commit
- fix interpolate mkldnn op error (#36623) · f6d82526
  Committed by baoachun on Oct 23, 2021
22 Oct, 2021 · 3 commits

add fp16 kernel for clip_op (#36577) · 1962d3af
Committed by zhangbo9674 on Oct 22, 2021

Fused attention op forward (#35905) · d4906214

Committed by Li Min on Oct 22, 2021

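Purpose: this PR aims to improve the computational performance of the attention module.
To reduce the framework's op scheduling overhead, this PR implements the attention module by hand in C++ and exposes it as a single large attention op.
To reduce memory-access overhead, this PR applies two optimizations:
(1) When computing q, k, and v, the shared input X is used to reduce the gemm, transpose, and bias add at that point from three calls to one;
(2) Kernel fusion is used so that data is passed between different CUDA kernels through registers.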

【Bug Fixes】Elementwise_add triple grad, fixed an input uninitialized problem (#36618) · 6580ad16

Committed by Weilong Wu on Oct 22, 2021

* Support elementwise_add triple grad Kernel

* Change code-format to follow CI std

* Removed unreasonable code, and fixed an input uninitialized issue

* Support elementwise_add triple grad Kernel

* Change code-format to follow CI std

* Removed unreasonable code, and fixed an input uninitialized issue

21 Oct, 2021 · 7 commits

[NPU] Add p_norm_grad (#36497) · ed478a3e
Committed by zhulei on Oct 21, 2021

add swish_op for npu (#36579) · 7eab0fa6
Committed by ronnywang on Oct 21, 2021

Added matmul_v2+transpose+reshape fuse pass (#36481) · 856cb9c5

Committed by jakpiase on Oct 21, 2021

* added base changes for matmul_v2+trans+resh fuse pass

* added full matmul_v2+transpose+reshape pass

* removed a file added by mistake

* added reviewers suggestions

* Changed ops type in checking compatibility version

* Deleted one statement

[NPU] Add sync_batch_norm and sync_batch_norm_grad NPU Kernel (#36320) · 0ca2807c

Committed by furnace on Oct 21, 2021

* add sync_batch_norm (support train, infer, and fp32, fp16, and NCHW, NHWC)

* [NPU] Delete debug codes

* [NPU] Remove FP16

Add viterbi decode (#35778) · 6072aecb

Committed by Jack Zhou on Oct 21, 2021

* add viterbi decode cpu kernel

* add viterbi decoder api in paddle.text

* allocate a data buffer once to avoid frequently creating many small data buffers

* fix viterbi max_seq_length bug

* fix seq_len=1 bug

* fix device context

* move split out of for loop

* remove INVERSE_SUB

* remove 2 GET_CAST_MASK

* remove 1 loop

* remove Functor

* add to_static deploy code

* use MAX_FUNC instead of ELE_MAX

* add MaxFunctor

* impl max_func

* remove MaxFunctor

* remove cast op

* use REGISTER_OP_WITHOUT_GRADIENT

* add viterbi cuda kernel

* add FIX_BLOCKDIM_CASE macro

* add MKL add, mul; add get data mask

* add arange mkl impl

* add CPU Argmax

* add cpu gather

* use EXECUTE_MKL_ELEMENT_BINARY_OP instead of some ADD, MUL

* use SameDimsBinaryOP instead of EXECUTE_MKL_ELEMENT_BINARY_OP

* use SAME_DIMS_ELEMENT_BINARY_OP

* add SimpleBroadcastBinaryOP

* use int instead of int64_t to accelerate

* optimize SimpleBroadcastBinaryOP

* optimize SimpleBroadcastBinaryOP

* optimize performance in both single thread and multithread situation

* remove useless line

* remove useless code

* add CREATE_TENSOR_BUFFER macro

* add INIT_REQUIRED_TENSOR macro

* add comment

* fix windows ci

* add viterbi unittest

* remove cuda add functor

* remove cuda equal

* remove a template function

* fix windows ci

* fix windows dtype

* remove some template instance

* remove useless header file

* remove some blockdim

* remove transpose impl

* accelerate cpu performance on single thread situation

* viterbi_decode->crf_decode

* rename crf params name

* add viterbi api test

* remove useless import

* add enable_static

* use viterbi decoder

* fix viterbi len=1

* fix viterbi unittest

* remove useless comments

* reconstruct viterbi decode

* remove ADD,SUB,MUL structure

* fix coverage

* remove CREATE_TENSOR

* add name args

* crf.py->ops.py; with_start_stop_tag->include_start_end_tag

* update crf_decode en docs

* fix viterbi decode en docs

* fix some review comments

* add FIXED_BLOCK_DIM_CASE in cuda

* push_back->emplace_back

* crf_decode->viterbi_decode; include_start_end_tag->include_bos_eos_tag

* paddle.text.ops.viterbi_decode->paddle.text.viterbi_decode

* fix viterbi_decode en docs
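
As background for the operator added here, the Viterbi algorithm itself can be sketched in a few lines of plain Python. This is a textbook single-sequence version, not the CPU or CUDA kernel from this commit: it ignores batching, sequence lengths, and the BOS/EOS tag option mentioned above, and the function signature is an assumption made for illustration rather than that of `paddle.text.viterbi_decode`.

```python
def viterbi_decode(emissions, transitions):
    """Return (best_score, best_path) for one sequence.

    emissions[t][j]   : score of tag j at step t
    transitions[i][j] : score of moving from tag i to tag j
    """
    num_tags = len(transitions)
    scores = list(emissions[0])   # best score of any path ending in each tag
    backpointers = []             # backpointers[t-1][j] = best previous tag

    for step in emissions[1:]:
        next_scores, pointers = [], []
        for j in range(num_tags):
            # Pick the previous tag that maximizes the path score into tag j.
            best_prev = max(range(num_tags),
                            key=lambda i: scores[i] + transitions[i][j])
            next_scores.append(scores[best_prev] + transitions[best_prev][j] + step[j])
            pointers.append(best_prev)
        scores = next_scores
        backpointers.append(pointers)

    # Backtrack from the best final tag to recover the full path.
    last = max(range(num_tags), key=lambda j: scores[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return scores[last], path


emissions = [[2, 0], [0, 3], [4, 0]]     # 3 steps, 2 tags
transitions = [[0, -1], [-1, 0]]         # switching tags costs 1
best_score, best_path = viterbi_decode(emissions, transitions)
```

The dynamic program keeps only the best score per tag at each step, so the cost grows with sequence length times the square of the number of tags rather than with the number of possible paths.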

add fill_any_like/flatten ops to train ssd on kunlun (#36550) · 7bf2aa38
Committed by TTerror on Oct 21, 2021
```
* add some ops to train ssd on kunlun

* update test_fill_any_like_op_xpu.py
```

Fix a bug in ReadData, ReadDataBc and ReadDataReduce when NX != 1 (#36373) · 921c0917

Committed by niuliling123 on Oct 21, 2021

* Update the implementation of reduceAnyKernel according to the Kernel Primitive API
* Fix a bug in ReadData, ReadDataBc and ReadDataReduce when NX != 1

20 Oct, 2021 · 5 commits

Fix global gather and global scatter operators (#36517) · 17b4dd70
Committed by 李季 on Oct 20, 2021
```
* fix global gather and global scatter operators
```

[NPU] Add kldiv_loss_op for npu (#36494) · 6a572a19
Committed by ronnywang on Oct 20, 2021

Add FasterTokenizer Operator (#34491) · 3f2d6a3f

Committed by Steffy-zxf on Oct 20, 2021

Add tokenizer-related functionality for Transformer models so that text processing is consistent between training and prediction.

* support the text string as an input Tensor
* support the "VOCAB" unordered_map<wstring, int> as an input Tensor to look up tokens
* Tokenizer used for BERT. This tokenizer applies an end-to-end, text string to wordpiece tokenization.
* It first applies basic tokenization, followed by wordpiece tokenization.
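
The two-stage pipeline described above (basic tokenization, then wordpiece tokenization) can be sketched in plain Python. This is a simplified illustration of the greedy longest-match-first scheme used by BERT-style tokenizers, not the C++ operator from this commit: the function names are hypothetical, the basic tokenizer skips steps such as accent stripping and CJK handling, and the vocabulary is a small made-up set rather than a token-to-id map.

```python
def basic_tokenize(text):
    """Lowercase, then split on whitespace and isolate punctuation."""
    tokens, word = [], []
    for ch in text.lower():
        if ch.isalnum():
            word.append(ch)
            continue
        if word:
            tokens.append("".join(word))
            word = []
        if not ch.isspace():
            tokens.append(ch)  # punctuation becomes its own token
    if word:
        tokens.append("".join(word))
    return tokens


def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def tokenize(text, vocab):
    return [p for w in basic_tokenize(text) for p in wordpiece_tokenize(w, vocab)]


vocab = {"un", "##aff", "##able", "play", "##ing", ",", "[UNK]"}
tokens = tokenize("Unaffable, playing xyz", vocab)
```

Here `tokens` is `["un", "##aff", "##able", ",", "play", "##ing", "[UNK]"]`: each word is split into the longest vocabulary pieces available, and a word with no matching piece maps to the unknown token.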

adapt to cann5.0.3_alpha3. (#36106) · 873ee4e3
Committed by wuhuachaocoding on Oct 20, 2021

fix pow2 decay (#36559) · 605e7f08
Committed by Zeng Jinle on Oct 20, 2021

19 Oct, 2021 · 1 commit
- Support elementwise_add triple grad Kernel (#36508) · 51c97d9f
  Committed by Weilong Wu on Oct 19, 2021
```
* Support elementwise_add triple grad Kernel

* Change code-format to follow CI std
```