- 25 Oct, 2021 2 commits
-
-
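Committed by Li Min
Purpose: this PR aims to improve the compute performance of the attention module. To reduce the framework's op scheduling overhead, this PR implements the attention module by hand in C++ and exposes it as a single large attention op. To reduce memory-access overhead, this PR applies two optimizations: (1) when computing q, k, v, the shared input X is used so that the gemm, transpose and bias add at that point are reduced from three calls to one; (2) kernel fusion is used so that data is passed through registers between different CUDA kernels.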
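The first optimization (one gemm instead of three, thanks to the shared input X) can be sketched in plain Python. This is only an illustration of the idea, not the PR's C++/CUDA code; the helper names `matmul`, `hconcat` and `split_cols` and the sample weights are assumptions made for the example, and bias add and transpose are left out.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def hconcat(*mats):
    """Concatenate matrices column-wise (all must have the same row count)."""
    return [sum(rows, []) for rows in zip(*mats)]


def split_cols(m, parts):
    """Split a matrix into `parts` equally wide column blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


# Hypothetical input and projection weights, chosen only for illustration.
x = [[1.0, 2.0], [3.0, 4.0]]
wq = [[1.0, 0.0], [0.0, 1.0]]
wk = [[0.5, 1.0], [1.5, 2.0]]
wv = [[2.0, 0.0], [1.0, 3.0]]

# Unfused: three separate matrix multiplications that all read x.
separate = [matmul(x, w) for w in (wq, wk, wv)]

# Fused: stack the weights once, run a single multiplication, then split.
fused = split_cols(matmul(x, hconcat(wq, wk, wv)), 3)
```

Both paths produce the same q, k and v; the fused path reads X once and issues one multiplication.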
-
Committed by Li Min
In the fused_attention op and fused_ffn op, the fused bias_add+dropout+residual+layernorm kernel or the bias_add+dropout+residual kernel is used. To make this kernel easier to use, this PR provides a wrapper. 1. To reuse the increment computing code, we extract the corresponding code into the "GetSeedDataAndIncrement" routine in dropout_impl_util.h. 2. fused_dropout_helper.h provides the fused dropout kernel wrapper. Note: tests for this wrapper will be provided in the following fused_attention_op and fused_ffn PRs.
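As a rough reference for what the bias_add+dropout+residual+layernorm chain computes on a single row, here is a plain Python sketch. It is not the kernel from this PR: the function name is hypothetical, dropout is assumed to be the inverted (rescale-the-survivors) variant, and the layer norm has no learnable scale or shift.

```python
import math
import random


def fused_bias_dropout_residual_ln(x, bias, residual, p, rng, eps=1e-5):
    """Reference for one row: layer_norm(residual + dropout(x + bias))."""
    keep = 1.0 - p
    out = []
    for xi, bi, ri in zip(x, bias, residual):
        v = xi + bi
        # Assumed inverted dropout: drop with probability p, rescale survivors.
        v = v / keep if rng.random() >= p else 0.0
        out.append(ri + v)
    mean = sum(out) / len(out)
    var = sum((o - mean) ** 2 for o in out) / len(out)
    return [(o - mean) / math.sqrt(var + eps) for o in out]


# With p = 0 nothing is dropped, so the result is deterministic.
y = fused_bias_dropout_residual_ln(
    [1.0, 2.0, 3.0], [0.5, 0.5, 0.5], [0.0, 0.0, 0.0], 0.0, random.Random(0)
)
```

A fused kernel performs these steps in one pass over the data rather than writing each intermediate result back to memory.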
-
- 24 Oct, 2021 1 commit
-
-
Committed by Jack Zhou
* add viterbi decode cpu kernel * add viterbi decoder api in paddle.text * add a data buffer once to avoid creating many small pieces of data buffer frequently * fix viterbi max_seq_length bug * fix seq_len=1 bug * fix device context * move split out of for loop * remove INVERSE_SUB * remove 2 GET_CAST_MASK * remove 1 loop * remove Functor * add to_static deploy code * use MAX_FUNC instead of ELE_MAX * add MaxFunctor * impl max_func * remove MaxFunctor * remove cast op * use REGISTER_OP_WITHOUT_GRADIENT * add viterbi cuda kernel * add FIX_BLOCKDIM_CASE macro * add MKL add, mul; add get data mask * add arange mkl impl * add CPU Argmax * add cpu gather * use EXECUTE_MKL_ELEMENT_BINARY_OP instead of some ADD, MUL * use SameDimsBinaryOP instead of EXECUTE_MKL_ELEMENT_BINARY_OP * use SAME_DIMS_ELEMENT_BINARY_OP * add SimpleBroadcastBinaryOP * use int instead of int64_t to accelerate * optimize SimpleBroadcastBinaryOP * optimize SimpleBroadcastBinaryOP * optimize performance in both single thread and multithread situation * remove useless line * remove useless code * add CREATE_TENSOR_BUFFER macro * add INIT_REQUIRED_TENSOR macro * add comment * fix windows ci * add viterbi unittest * remove cuda add functor * remove cuda equal * remove a template function * fix windows ci * fix windows dtype * remove some template instance * remove useless header file * remove some blockdim * remove transpose impl * accelerate cpu performance on single thread situation * viterbi_decode->crf_decode * rename crf params name * add viterbi api test * remove useless import * add enable_static * use viterbi decoder * fix viterbi len=1 * fix viterbi unittest * remove useless comments * reconstruct viterbi decode * remove ADD,SUB,MUL structure * fix coverage * remove CREATE_TENSOR * add name args * crf.py->ops.py; with_start_stop_tag->include_start_end_tag * update crf_decode en docs * fix viterbi decode en docs * fix some review comments * add FIXED_BLOCK_DIM_CASE in cuda * push_back->emplace_back * crf_decode->viterbi_decode; include_start_end_tag->include_bos_eos_tag * paddle.text.ops.viterbi_decode->paddle.text.viterbi_decode * fix viterbi_decode en docs
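For readers unfamiliar with the algorithm behind this op, the following is a minimal single-sequence Viterbi decoder in plain Python. It is a textbook sketch, not the CPU or CUDA kernel added here: it ignores batching, sequence-length masks and BOS/EOS tags, and the function signature is an assumption for illustration.

```python
def viterbi_decode(emissions, transitions):
    """Return (best_score, best_path) for one sequence.

    emissions[t][j]: score of tag j at step t.
    transitions[i][j]: score of moving from tag i to tag j.
    """
    num_tags = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for step in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(num_tags):
            # Best previous tag for reaching tag j at this step.
            best_i = max(range(num_tags), key=lambda i: scores[i] + transitions[i][j])
            pointers.append(best_i)
            new_scores.append(scores[best_i] + transitions[best_i][j] + step[j])
        scores = new_scores
        backpointers.append(pointers)
    # Walk the backpointers from the best final tag to recover the path.
    last = max(range(num_tags), key=lambda i: scores[i])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return scores[last], path


score, path = viterbi_decode(
    [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]],
    [[0.0, -0.5], [-0.5, 0.0]],
)
# A length-1 sequence needs no transitions at all.
single_score, single_path = viterbi_decode([[1.0, 3.0]], [[0.0, -0.5], [-0.5, 0.0]])
```

The length-1 call exercises the edge case that several of the fixes in this commit mention.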
-
- 22 Oct, 2021 1 commit
-
-
Committed by niuliling123
* Fix a bug in ReadData, ReadDataBc and ReadDataReduce when NX != 1 * Update the implementation of reduceAnyKernel according to the kernel primitive API
-
- 21 Oct, 2021 2 commits
-
-
Committed by niuliling123
* Add functor_primitives.h for kernel primitive api
-
Committed by littletomatodonkey
* fix replicate pad when input size is 0 * add unit test
-
- 20 Oct, 2021 1 commit
-
-
Committed by Wilber
-
- 19 Oct, 2021 3 commits
-
-
Committed by Liu-xiandong
The code in this PR supports only CUDA 11.2. CI currently has no GPU machine with CUDA 11.2, so all tests are skipped automatically. The new op is paddle._C_ops.sparse_attention. The Python API will be handled in a follow-up PR. Tests on dynamic graphs and static graphs are missing from this PR and will be added in subsequent PRs.
-
Committed by Wilber
-
Committed by Siming Dai
* fix async_read bug * change index place to cpu * add tensor size judge * add async_read & async_write test * fix bug in async_write * fix mac py3 ci * fix bug for cpu version paddle * fix windows ci bug * change input argument error type * change const_cast to mutable_data * add async_write out-of-bound check and consummate error hint * fix a small bug for dst_tensor * add docs and refine codes * refine docs * notest,test=windows_ci * fix windows ci * fix require * fix code-block * add core.is_compiled_with_cuda()
-
- 15 Oct, 2021 1 commit
-
-
Committed by wuhuanzhou
* [WIP] Verify the correctness of the graph rewritten by GeneratePass, test=develop * add delete subgraph and unittest, test=develop * check simple pass, test=develop * fix coverage, test=develop * limit with input_spec via Paddle API, test=develop
-
- 13 Oct, 2021 1 commit
-
-
Committed by jakpiase
-
- 12 Oct, 2021 2 commits
-
-
Committed by Aurelius84
* Fix stop_gradient in RunProgramOp * fix reference
-
Committed by wenbin
-
- 11 Oct, 2021 2 commits
-
-
Committed by Siming Dai
-
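Committed by wuhuanzhou
(cherry picked from PR #36095) Main purpose of this PR: support registering a GeneratePass from C++, which simplifies development for subgraph optimization scenarios such as fusion.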
-
- 30 Sep, 2021 2 commits
-
-
Committed by Guoxia Wang
-
Committed by Guoxia Wang
-
- 29 Sep, 2021 1 commit
-
-
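Committed by Lijunhui
Add the eig operator to the linear algebra library in PaddlePaddle. The operator computes the eigendecomposition of a general square matrix. Cherry-picked from #35674.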
-
- 28 Sep, 2021 1 commit
-
-
Committed by ronnywang
ATT, cherry-pick #36160
-
- 27 Sep, 2021 8 commits
-
-
Committed by Yanxing Shi
* Initial Commit * fix py2 error * fix wrong words and doc * test=document_fix * fix _gpuDeviceProperties
-
Committed by Jiawei Wang
* fix unique unstack dim 0 * fix unique_op format
-
Committed by JZ-LIANG
-
Committed by Wilber
-
Committed by ronnywang
ATT, cherry-pick #36098
-
Committed by JYChen
Cherry-pick from #35352. Add new detection APIs paddle.vision.ops.psroi_pool and paddle.vision.ops.PSRoIPool.
-
Committed by zhangbo9674
PR 35521 changed the op used by the AdamW optimizer from adamw to adam, which was an inappropriate change. This commit changes adam back to adamw in AdamW.
-
Committed by YuanRisheng
When using gumbel_softmax, users can call paddle.seed() in Python to fix the random seed.
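Why a fixed seed makes Gumbel-softmax output reproducible can be shown with a small standard-library sketch. This is not Paddle's implementation and does not call paddle.seed(); the function below and its `rng` parameter are assumptions for illustration, with `random.Random(seed)` standing in for the framework's seeded generator.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Sample a relaxed one-hot vector using Gumbel noise drawn from rng."""
    noisy = []
    for logit in logits:
        u = rng.random()
        # Gumbel(0, 1) sample; the small constants guard against log(0).
        g = -math.log(-math.log(u + 1e-20) + 1e-20)
        noisy.append((logit + g) / temperature)
    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Two generators created with the same seed yield identical samples.
first = gumbel_softmax([1.0, 2.0, 3.0], 0.5, random.Random(42))
second = gumbel_softmax([1.0, 2.0, 3.0], 0.5, random.Random(42))
```

The only source of randomness is the noise drawn from the generator, so seeding it fixes the result.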
-
- 26 Sep, 2021 5 commits
-
-
Committed by crystal
cherry-pick #35916: replace Eigen with Lapack in the CPU forward computation and modify the linalg exposure rules
-
Committed by Huihuang Zheng
This PR adds the det and slogdet APIs to release/2.2. It is a cherry-pick of #34992 and #36013.
-
Committed by niuliling123
[cherry-pick] Add function comments and instructions to the Primitive API
-
Committed by Weilong Wu
This PR adds linalg.solve support to the linear algebra module of Paddle. Call paddle.linalg.solve to use it.
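What a linear solve computes, namely the x that satisfies A x = b, can be illustrated with a small Gaussian elimination in plain Python. This sketch is not how paddle.linalg.solve is implemented; the `solve` function below is a stand-in written for this note and handles only a single square system with one right-hand side.

```python
def solve(a, b):
    """Solve a @ x = b for x using Gaussian elimination with partial pivoting."""
    n = len(a)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


# 2x + y = 3 and x + 3y = 5.
solution = solve([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0])
```

A singular matrix has no unique solution, which the sketch reports by raising ValueError.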
-
Committed by ronnywang
* add randperm_op_npu * fix test_set_value_op_npu
-
- 25 Sep, 2021 1 commit
-
-
Committed by baoachun
-
- 24 Sep, 2021 5 commits
-
-
Committed by From00
This PR implements the kernel of the "eigvals" OP with the Lapack library, which performs better than the previous Eigen-based implementation.
-
Committed by Huihuang Zheng
Add a basic Cost Model. It uses the executor to run a program and profiles it to get per-op time. This is an early basic version; more functions will be added in the future.
-
Committed by Wilber
* update xpu version
-
Committed by Liu-xiandong
Fix the compilation error with CUDA 11.2 on Windows. cherry-pick #35941
-
Committed by JingZhuangzhuang
-
- 23 Sep, 2021 1 commit
-
-
Committed by crystal
cherry-pick #35812: fix the Eigh OP
-