Commits · 3523bbe86376878fcda52b2dcc152db76971db87 · 机器未来 / Paddle

26 Oct, 2021 · 8 commits

[NPU] fix argsort op, test=develop (#36576) · 3523bbe8

Committed by Qi Li on Oct 26, 2021

* [NPU] fix argsort op, test=develop

* remove debug files, test=develop

* fix typo, test=develop

* address review comments, test=develop

fix wrong trt dim when input dim is 2 (#36614) · 43dcf235

Committed by baoachun on Oct 26, 2021

* fix wrong trt dim when input dim is 2

* update leaky_relu and instance_norm converter unit test

* add instance_norm input dim check

Fix the null ptr bug in build_cinn_pass. (#36698) · 28bab073

Committed by Zhen Wang on Oct 26, 2021

* Fix the null ptr bug in build_cinn_pass.

* Add test for empty&ctrl var.

enable flags_benchmark for dygraph (#36686) · 21bece3f

Committed by Leo Chen on Oct 26, 2021

[Paddle-Inference]Add MatmulV2ToMatmul convert Pass, fix (matmul_v2, matmul,... · 93c591e2

Committed by Wangzheee on Oct 26, 2021

[Paddle-Inference]Add MatmulV2ToMatmul convert Pass, fix (matmul_v2, matmul, mul) convert pass, fix (matmul, mul) op_teller (#36652)

* new_Matmul2ToMatmulToMul

* new_Matmul2ToMatmulToMul

* fix paddle_pass_builder

* fix paddle_pass_builder

* fix paddle_pass_builder

* tem

* tem

* Add MatmulV2ToMatmul convert Pass; MatmulV2ToMul convert Pass

* Add MatmulV2ToMatmul convert Pass; MatmulV2ToMul convert Pass

* add matmul_broadcast_unitest

* fix op_teller

Optimize FasterTokenizer (#36701) · 290ded7a

Committed by Jack Zhou on Oct 26, 2021

* optimize fast tokenizer

Support various length support for SelectedRows in GLOO::AllGather (#36637) · eca78a9f

Committed by xiongkun on Oct 26, 2021

* In cpu parallel using gloo, add various length support for SelectedRows

* fix bug

* fix bugs

* fix by code review

* remove timeout

Pool3d 2.0 (#36545) · 229bae81

Committed by feng_shuai on Oct 26, 2021

25 Oct, 2021 · 10 commits

add ctr accessor (#36601) · cea1ba88

Committed by zhaocaibei123 on Oct 25, 2021

[NPU] modifications for model ernie-1.0 (#36642) · 19b02d95

Committed by Aganlengzi on Oct 25, 2021

* [NPU] modifications for model ernie-1.0

* rollback 503003 and change cast to dtype

add op: fused_feedforward(backward) (#35611) · 2dd0a46a

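Committed by zhangkaihuo on Oct 25, 2021

This PR contains the backward code for fused_feedforward.

Related kernel implementations: fused_dropout_act_bias, fused_residual_dropout_bias, fused_layernorm_residual_dropout_bias

fused_feedforward is a fused operator: it fuses and wraps the operators of a transformer model's feed-forward layer so that the front end exposes a single interface. Fusion cuts some memory accesses and kernel-launch time, which improves performance.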
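As a rough picture of what a feed-forward layer computes before fusion, the sketch below spells out each step as its own function in plain Python. It is not the Paddle kernel. The step order (linear, activation, linear, residual add, layer norm), the choice of ReLU, and the omission of dropout (inference mode) are all assumptions made for illustration, and every function name here is hypothetical.

```python
import math


def linear(x, w, b):
    # x: vector of length d_in, w: d_in x d_out matrix, b: vector of length d_out
    return [sum(x[i] * w[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]


def relu(x):
    return [max(0.0, v) for v in x]


def layer_norm(x, eps=1e-5):
    # Normalize to zero mean and unit variance (no learned scale or shift).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def feed_forward_reference(x, w1, b1, w2, b2):
    """Unfused reference: in a non-fused graph each line is a separate op."""
    h = relu(linear(x, w1, b1))               # linear + bias + activation
    y = linear(h, w2, b2)                     # second linear + bias
    residual = [a + b for a, b in zip(x, y)]  # residual connection
    return layer_norm(residual)               # assumed post-layer-norm


identity = [[1.0, 0.0], [0.0, 1.0]]
print(feed_forward_reference([1.0, 2.0], identity, [0.0, 0.0],
                             identity, [0.0, 0.0]))
```

A fused operator would produce the same result while keeping intermediates such as `h` and `residual` out of global memory between steps.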

Add bincount op (#36317) · 39f19127

Committed by smallv0221 on Oct 25, 2021

* Add bincount op

* upload cpu version

* fix unittest

* fix unittest

* fix unittest

* fix en doc

* add more test

* fix en doc

* add more test case

* fix test

* fix input validation

* fix input check

* fix unittest

* fix test

* fix en doc

CI build PR and dev whl (#36532) · e16fe48d

Committed by tianshuo78520a on Oct 25, 2021

CI build PR and dev whl

Create CinnCompiler class for compiling subgraphs found by build_cinn_pass. (#36562) · 4c460378

Committed by Zhen Wang on Oct 25, 2021

* Init the functions of CinnCompiler.

* Add the unit test for CinnCompiler.

* Fix some compilation errors.

* Update the UT of cinn_compiler.

* Use Decomposer&OpFusion passes in CinnCompiler::CompileGraph.

* Update some comments.

* Uncomment some includes in build_cinn_pass.cc.

* Use refs instead of ptrs as returned types of FindGraph & Compile in CinnCompiler.

* Use the merged CinnGraphSymbolization functions in CinnCompiler.

add some ops to train ssd on kunlun (#36407) · 50778ad6

Committed by TTerror on Oct 25, 2021

* add some ops to train ssd on kunlun

* add some ops to train ssd on kunlun

* add some ops to train ssd on kunlun

* update cast op unittest

* update cast op unittest

* update cast op unittest

* update xpu cmake

* update cast unittest

[new-exec] Add events waiter (#36480) · cdb9bfa3

Committed by liutiexing on Oct 25, 2021

* add align for WorkQueue

* add spinlock

* merge develop

* merge

* Add EventsWaiter

* update

* update

* update Error MSG

* update EventsWaiter

Fix grid sampler while input size is [1] (#36183) · eff3ee5e

Committed by whs on Oct 25, 2021

add op: fused_feedforward(forward) (#35843) · b18cbfb2

Committed by zhangkaihuo on Oct 25, 2021

This PR contains only the forward code for fused_feedforward.

Related kernel implementations: fused_dropout_act_bias, fused_residual_dropout_bias, fused_layernorm_residual_dropout_bias

24 Oct, 2021 · 1 commit

Add the macro `-DPADDLE_WITH_CINN`. (#36660) · e2173b68

Committed by Zhen Wang on Oct 24, 2021

23 Oct, 2021 · 6 commits

add cinn graph symbolization (#36417) · bbd4bd73

Committed by jiangcheng on Oct 23, 2021

* add cinn graph symbolization

* fix some bug

* add paddle scope to cinn scope

* add paddle scope to CINN scope in Symbolization, and add feed op when build cinn pass

* fix some bug

* fix some bug by review advices

* optimize code problem

* revert build_cinn_pass and move the change to https://github.com/PaddlePaddle/Paddle/pull/36503

* fix some bug after co-compilation

* perfect single test script

* remove scope and rename feed_target to input_tensor

* using std::unordered_map instead of absl::flat_hash_map

* fix single test bug

* revert to the previous version, since WITH_CINN is added in a later PR

* full error information for CI

* full enforce information for CI pass

disable padding if dynamic shape (#36648) · 99e396f8

Committed by wenbin on Oct 23, 2021

* disable padding if dynamic shape

* add parentheses

* correct

fix interpolate mkldnn op error (#36623) · f6d82526

Committed by baoachun on Oct 23, 2021

add file exists check (#36628) · 425db7c8

Committed by Wilber on Oct 23, 2021

* add file check

* add ut

Add transformer of paddle desc and cinn desc (#36100) · 3cb6f65e

Committed by jiangcheng on Oct 23, 2021

* add transformer of paddle desc and cinn desc

* change LOG(FATAL) to PADDLE_THROW for ci

* full error information for ci

* fix some problem as review advice

* fix some bug

* move var type utils to transform_desc header file

* add if NOT WITH_CINN control whether compile

* build_strategy check whether open WITH_CINN

* add control WITH_CINN in cmake

New Paddle-CINN Compile PR (#36584) · ab732884

Committed by Huihuang Zheng on Oct 23, 2021

This PR adds changes to match CINN's compilation changes. It also tries to fix JiangCheng's problem in PR: https://github.com/PaddlePaddle/Paddle/pull/36100

These changes include:
1. Set `CINN_GIT_TAG` to a newer tag
2. CINN is now built with just `make cinnapi -j`
3. We have to add `-DPY_VERSION=${PY_VERSION} -DWITH_TESTING=ON` to the CINN cmake args
4. For CINN's third-party dependencies, we can just include headers without `target_link_libraries`
5. Moved `cinn.cmake` from `paddle/cmake` to `paddle/cmake/external` to match the old style. The external folder contains `lite`, which sits at the same level as `cinn`
6. CINN added `-DNAMESPACE=cinn_gflags` in `gflags.cmake` so that CINN and Paddle use different gflags namespaces. This solves the redefinition problem.
7. Changed the namespace `::google::` in gflags to `::GFLAGS_NAMESPACE`

22 Oct, 2021 · 6 commits

correct slice serialize data (#36588) · 5e880840

Committed by wenbin on Oct 22, 2021

* slice

* add UT

add fp16 kernel for clip_op (#36577) · 1962d3af

Committed by zhangbo9674 on Oct 22, 2021

Fused attention op forward (#35905) · d4906214

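Committed by Li Min on Oct 22, 2021

Purpose: this PR aims to improve the compute performance of the attention module.
To reduce the framework's op-scheduling overhead, this PR implements the attention module by hand in C++ and exposes it as a single large attention op.
To reduce memory-access overhead, this PR applies two optimizations:
(1) when computing q, k, and v, the shared input X is used to cut the gemm, transpose, and bias add at that step from three calls to one;
(2) kernel fusion is used to pass data through registers between different CUDA kernels.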
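The first optimization, sharing the input X across the q, k, and v projections, can be illustrated with a small plain-Python sketch. It is only an assumption about the general idea (concatenate the three weight matrices and biases so that one gemm and one bias add replace three), not the CUDA code in the PR. The transpose step is not modeled, and all function names are hypothetical.

```python
def matmul(a, b):
    # a: n x k, b: k x m, both as lists of rows
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add_bias(m, bias):
    return [[v + bias[j] for j, v in enumerate(row)] for row in m]


def qkv_separate(x, wq, bq, wk, bk, wv, bv):
    # Three gemm calls and three bias adds, each reading x again.
    return (add_bias(matmul(x, wq), bq),
            add_bias(matmul(x, wk), bk),
            add_bias(matmul(x, wv), bv))


def qkv_fused(x, wq, bq, wk, bk, wv, bv):
    # Concatenate weights along the output dimension: d_in x (3 * d_out).
    w = [rq + rk + rv for rq, rk, rv in zip(wq, wk, wv)]
    b = bq + bk + bv
    out = add_bias(matmul(x, w), b)  # one gemm, one bias add
    d = len(bq)
    q = [row[:d] for row in out]
    k = [row[d:2 * d] for row in out]
    v = [row[2 * d:] for row in out]
    return q, k, v


x = [[1.0, 2.0], [3.0, 4.0]]
wq = [[1.0, 0.0], [0.0, 1.0]]
wk = [[2.0, 1.0], [0.0, 1.0]]
wv = [[0.0, 1.0], [1.0, 0.0]]
zero = [0.0, 0.0]
print(qkv_fused(x, wq, zero, wk, zero, wv, zero)
      == qkv_separate(x, wq, zero, wk, zero, wv, zero))
```

Both paths compute each output element with the same sequence of multiplies and adds, so the results match exactly; the fused path simply reads `x` once instead of three times.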

[hapi] support dygraph amp O2 (#36441) · 08248db0

Committed by Leo Chen on Oct 22, 2021

* [hapi] support dygraph amp O2

* fix problem of static pure fp16 in hapi

* fix bug

* fix format

* fix ut

* follow comments

* update ut

* update amp save/load

* fix ut

* refine code format

【Bug Fixes】Elementwise_add triple grad, fixed an input uninitialized problem (#36618) · 6580ad16

Committed by Weilong Wu on Oct 22, 2021

* Support elementwise_add triple grad Kernel

* Change code-format to follow CI std

* Removed unreasonable code, and fixed an input uninitialized issue

* Support elementwise_add triple grad Kernel

* Change code-format to follow CI std

* Removed unreasonable code, and fixed an input uninitialized issue

support lite xpu choose device id (#36610) · f46311b0

Committed by Wilber on Oct 22, 2021

21 Oct, 2021 · 9 commits

[NPU] Add p_norm_grad (#36497) · ed478a3e

Committed by zhulei on Oct 21, 2021

add swish_op for npu (#36579) · 7eab0fa6

Committed by ronnywang on Oct 21, 2021

Added matmul_v2+transpose+reshape fuse pass (#36481) · 856cb9c5

Committed by jakpiase on Oct 21, 2021

* added base changes for matmul_v2+trans+resh fuse pass

* added full matmul_v2+transpose+reshape pass

* removed a file added by mistake

* added reviewers suggestions

* Changed ops type in checking compatibility version

* Deleted one statement

[NPU] Add sync_batch_norm and sync_batch_norm_grad NPU Kernel (#36320) · 0ca2807c

Committed by furnace on Oct 21, 2021

* add sync_batch_norm (support train, infer, and fp32, fp16, and NCHW, NHWC)

* [NPU] Delete debug codes

* [NPU] Remove FP16

Add viterbi decode (#35778) · 6072aecb

Committed by Jack Zhou on Oct 21, 2021

* add viterbi decode cpu kernel

* add viterbi decoder api in paddle.text

* add a data buffer once to avoid create many small pieces of data buffer frequently

* fix viterbi max_seq_length bug

* fix seq_len=1 bug

* fix device context

* move split out of for loop

* remove INVERSE_SUB

* remove 2 GET_CAST_MASK

* remove 1 loop

* remove Functor

* add to_static deploy code

* use MAX_FUNC instead of ELE_MAX

* add MaxFunctor

* impl max_func

* remove MaxFunctor

* remove cast op

* use REGISTER_OP_WITHOUT_GRADIENT

* add viterbi cuda kernel

* add FIX_BLOCKDIM_CASE macro

* add MKL add, mul; add get data mask

* add arange mkl impl

* add CPU Argmax

* add cpu gather

* use EXECUTE_MKL_ELEMENT_BINARY_OP instead of some ADD, MUL

* use SameDimsBinaryOP instead of EXECUTE_MKL_ELEMENT_BINARY_OP

* use SAME_DIMS_ELEMENT_BINARY_OP

* add SimpleBroadcastBinaryOP

* use int instead of int64_t to accelerate

* optimize SimpleBroadcastBinaryOP

* optimize SimpleBroadcastBinaryOP

* optimize performance in both single thread and multithread situation

* remove useless line

* remove useless code

* add CREATE_TENSOR_BUFFER macro

* add INIT_REQUIRED_TENSOR macro

* add comment

* fix windows ci

* add viterbi unittest

* remove cuda add functor

* remove cuda equal

* remove a template function

* fix windows ci

* fix windows dtype

* remove some template instance

* remove useless header file

* remove some blockdim

* remove transpose impl

* accelerate cpu performance on single thread situation

* viterbi_decode->crf_decode

* rename crf params name

* add viterbi api test

* remove useless import

* add enable_static

* use viterbi decoder

* fix viterbi len=1

* fix  viterbi unittest

* remove useless comments

* reconstruct viterbi decode

* remove ADD,SUB,MUL structure

* fix coverage

* remove CREATE_TENSOR

* add name args

* crf.py->ops.py; with_start_stop_tag->include_start_end_tag

* update crf_decode en docs

* fix viterbi decode en docs

* fix some review comments

* add FIXED_BLOCK_DIM_CASE in cuda

* push_back->emplace_back

* crf_decode->viterbi_decode; include_start_end_tag->include_bos_eos_tag

* paddle.text.ops.viterbi_decode->paddle.text.viterbi_decode

* fix viterbi_decode en docs

add fill_any_like/flatten ops to train ssd on kunlun (#36550) · 7bf2aa38

Committed by TTerror on Oct 21, 2021

* add some ops to train ssd on kunlun

* update test_fill_any_like_op_xpu.py

User specified backend (#35745) · b6e7f8e9

Committed by xiongkun on Oct 21, 2021

Fix a bug in ReadData, ReadDataBc and ReadDataReduce when NX != 1 (#36373) · 921c0917

Committed by niuliling123 on Oct 21, 2021

* Update the implementation of reduceAnyKernel according to kernel primitive api
* Fix a bug in ReadData, ReadDataBc and ReadDataReduce when NX != 1

Graph engine4 (#36587) · 5eb640c6

Committed by seemingwang on Oct 21, 2021

机器未来 / Paddle: in sync with the fork source project
