- 23 Nov, 2021 2 commits
-
-
Committed by zhupengyang
-
Committed by sneaxiy
* enhance scatter err msg check * fix ci error
-
- 22 Nov, 2021 2 commits
-
-
Committed by Siming Dai
* Add paddle.incubate.graph_send_recv API * fix bug in CudaAtomicMin and CudaAtomicMax * add empty line
-
Committed by Li Min
fix bug to support dropout eval grad computing. cherry-pick #37305.
-
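- 16 Nov, 2021 1 commit
-
-
Committed by zhangkaihuo
Fixed several issues found while fine-tuning fused_transformer_encoder_layer: added attn_mask=None support to fused_attention_op: PR; pre_layer_norm handling issue: PR; parameter handling and incorrect computation: PR; incorrect add_bias computation: PR; added pure fp16 support: PR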
-
- 15 Nov, 2021 1 commit
-
-
Committed by Zeng Jinle
* add mlperf optimization PRs * update
-
- 10 Nov, 2021 1 commit
-
-
Committed by Jack Zhou
* fix rnn grad bug when num_layers is set to 2 and dropout_prob is set to 0 * add more tests for rnn
-
- 08 Nov, 2021 1 commit
-
-
Committed by Weilong Wu
Renamed the variable and function; removed the original template function; removed the tests_properties in CMakeLists.txt
-
- 01 Nov, 2021 2 commits
-
-
Committed by Liu-xiandong
* fix cusparse compile bug in CUDA11.2, test=develop * fix bug
-
Committed by Feng Xing
-
- 28 Oct, 2021 4 commits
-
-
Committed by pangyoki
Cherry-pick PR #36511
-
Committed by Li Min
* Fix bug when pre_layer_norm is false.
-
Committed by Xiaoxu Chen
* update fft api path (#36219) * update fft api path * add sample code for ihfft2 Co-authored-by: chenfeiyu <chenfeiyu@baidu.com> * fix fft axis (#36321) fix: `-1` is used when fft's axis is `0` * use unified external error message for cufft api (#36114) * fft: modify sample code result (#36325) * dynamic load mkl as a fft backend when it is available and requested (#36414) * add rocm support for fft api (#36415) * move signal apis * move fft and signal API path (#2) * move signal apis * move fft.py and signal.py to paddle/, fix typos * fix relative imports from fft.py and signal.py * fix typos in signal.py (#3) * move signal apis * move fft.py and signal.py to paddle/, fix typos * fix relative imports from fft.py and signal.py * fix typos * disable Cache when CUFFT_VERSION >= 10200 (#4) * move signal apis * move fft.py and signal.py to paddle/, fix typos * fix relative imports from fft.py and signal.py * fix typos * Add LRUCache for fft plans * add LRUCache for cufft and hipfft (#5) * move signal apis * move fft.py and signal.py to paddle/, fix typos * fix relative imports from fft.py and signal.py * fix typos * WIP: add cache * delete move constructor and operator= for CuFFTHandle and FFTConfig * remove log from CuFFTHandle and FFTConfig * add lrucache for fft rocm backend * disable LRUCache when CUFFT_VERSION >= 10200 * disable copy and move for hipFFTHandle; format code Co-authored-by: Xiaoxu Chen <chenxx_id@163.com> * remove debug message of cufftHandler * roll_op: support Tensor as input for shifts (#36727) * fix fftshift/ifftshift on static mode * update roll_op version * add more test cases for fftshift/ifftshift Co-authored-by: zhiboniu <31800336+zhiboniu@users.noreply.github.com> Co-authored-by: chenfeiyu <chenfeiyu@baidu.com> Co-authored-by: LJQ❤️ <33169170+lijiaqi0612@users.noreply.github.com>
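The idea behind "Add LRUCache for fft plans" can be illustrated with a minimal Python sketch. This is not the C++ code from the commit: `PlanCache`, `make_plan`, and the tuple keys are hypothetical names chosen for illustration, and a string stands in for a real cuFFT or hipFFT plan.

```python
from collections import OrderedDict


class PlanCache:
    """Small LRU cache keyed by an FFT configuration (hypothetical sketch)."""

    def __init__(self, capacity, make_plan):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.make_plan = make_plan  # callable that builds a plan for a key
        self._plans = OrderedDict()

    def get(self, key):
        if key in self._plans:
            # Cache hit: mark the entry as most recently used.
            self._plans.move_to_end(key)
            return self._plans[key]
        # Cache miss: build the plan, then evict the oldest entry if needed.
        plan = self.make_plan(key)
        self._plans[key] = plan
        if len(self._plans) > self.capacity:
            self._plans.popitem(last=False)
        return plan

    def keys(self):
        """Cached keys, ordered from least to most recently used."""
        return list(self._plans)


built = []  # records every key for which a plan had to be built


def make_plan(key):
    built.append(key)
    return "plan for {}".format(key)


cache = PlanCache(capacity=2, make_plan=make_plan)
cache.get(("c2c", 8))
cache.get(("r2c", 16))
cache.get(("c2c", 8))   # hit: no new plan is built
cache.get(("c2r", 32))  # miss: evicts ("r2c", 16), the least recently used
```

Plan creation is usually the expensive step, so reusing a plan for a repeated configuration avoids rebuilding it, while the capacity bound keeps the number of live plans limited.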
-
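- 27 Oct, 2021 4 commits
-
-
Committed by baoachun
-
Committed by huangjun12
-
Committed by whs
-
Committed by Li Min
Purpose: this PR aims to improve the computational performance of the attention module. To reduce the framework's op scheduling overhead, this PR implements the attention module by hand in C++ and exposes it as a single large attention op. To reduce memory access overhead, this PR applies two optimizations: (1) when computing q, k and v, the input X is shared so that the gemm, transpose and bias add are reduced from three calls to one; (2) kernel fusion is used to pass data between different CUDA kernels through registers.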
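Optimization (1) can be sketched in plain Python. This is an illustration only, not the C++/CUDA code of the PR: the helper names are hypothetical, the transpose step is omitted, and q, k and v are assumed to share one output width. Concatenating the three weight matrices column-wise turns three GEMMs and three bias adds over the same input X into one of each.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def add_bias(m, bias):
    """Add a bias vector to every row of a matrix."""
    return [[v + b for v, b in zip(row, bias)] for row in m]


def qkv_separate(x, wq, bq, wk, bk, wv, bv):
    # Baseline: three GEMMs and three bias adds over the same input x.
    return (add_bias(matmul(x, wq), bq),
            add_bias(matmul(x, wk), bk),
            add_bias(matmul(x, wv), bv))


def qkv_fused(x, wq, bq, wk, bk, wv, bv):
    # Concatenate weights column-wise and biases end to end,
    # so a single GEMM and a single bias add produce q, k and v together.
    w = [rq + rk + rv for rq, rk, rv in zip(wq, wk, wv)]
    b = bq + bk + bv
    out = add_bias(matmul(x, w), b)
    d = len(bq)  # assumption: q, k and v all have width d
    q = [row[:d] for row in out]
    k = [row[d:2 * d] for row in out]
    v = [row[2 * d:] for row in out]
    return q, k, v


x = [[1, 2], [3, 4], [5, 6]]
wq, bq = [[1, 0], [0, 1]], [1, 1]
wk, bk = [[2, 1], [1, 2]], [0, -1]
wv, bv = [[0, 1], [1, 0]], [2, 3]

separate = qkv_separate(x, wq, bq, wk, bk, wv, bv)
fused = qkv_fused(x, wq, bq, wk, bk, wv, bv)
```

Both paths give identical results; the fused path reads X once instead of three times, which is where the memory access saving comes from.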
-
- 26 Oct, 2021 7 commits
-
-
Committed by Steffy-zxf
* Add FasterTokenizer Operator (#34491) Add tokenizer-related functionality for Transformer models so that the training and prediction processes are consistent. * support the text string as an input Tensor * support the "VOCAB" unordered_map<wstring, int> as an input Tensor to look up tokens * Tokenizer used for BERT. This tokenizer applies an end-to-end, text string to wordpiece tokenization. * It first applies basic tokenization, followed by wordpiece tokenization. * optimize fast tokenizer * remove const_cast Co-authored-by: zhoushunjie <zhoushunjie@baidu.com> Co-authored-by: wawltor <fangzeyang0904@hotmail.com>
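The two-stage flow described above (basic tokenization, then wordpiece tokenization) can be sketched in Python as follows. This is a simplified illustration, not the operator's C++ implementation: the function names and the toy vocabulary are hypothetical, and the basic stage only lowercases, splits on whitespace and separates punctuation.

```python
def basic_tokenize(text):
    """Lowercase, split on whitespace, and split off non-alphanumerics."""
    tokens = []
    for word in text.lower().split():
        current = ""
        for ch in word:
            if ch.isalnum():
                current += ch
            else:
                if current:
                    tokens.append(current)
                    current = ""
                tokens.append(ch)
        if current:
            tokens.append(current)
    return tokens


def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab):
    out = []
    for word in basic_tokenize(text):
        out.extend(wordpiece_tokenize(word, vocab))
    return out


# Toy vocabulary, made up for this example.
vocab = {"un", "##aff", "##able", "play", "##ing", "!", "[UNK]"}
tokens = tokenize("Unaffable playing!", vocab)
unknown = tokenize("xyz", vocab)
```

Running the same tokenization code in training and in prediction is what keeps the two paths consistent, which is the motivation stated in the commit message.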
-
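Committed by zhangkaihuo
* add op: fused_feedforward(backward) (#35611) This PR contains the backward code for fused_feedforward. Related kernel implementations: fused_dropout_act_bias, fused_residual_dropout_bias, fused_layernorm_residual_dropout_bias. fused_feedforward is a fused operator that fuses and wraps the operators of the transformer model's feed forward layer so that the front end exposes a single interface; the fusion reduces some memory accesses and kernel launch time, which improves performance. * Move fused_attention and fused_feedforward functional api path to incubate (#36704) Moves the Python API interfaces added in PRs #35905 and #35843 into the incubate directory.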
-
Committed by zhangkaihuo
This is a fusion operator to compute feed forward layer in transformer model architecture.
-
Committed by feng_shuai
-
Committed by smallv0221
* Add bincount op * upload cpu version * fix unittest * fix unittest * fix unittest * fix en doc * add more test * fix en doc * add more test case * fix test * fix input validation * fix input check * fix unittest * fix test * fix en doc cherry-pick
-
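Committed by Yulong Ao
-
Committed by Li Min
Purpose: this PR aims to improve the computational performance of the attention module. To reduce the framework's op scheduling overhead, this PR implements the attention module by hand in C++ and exposes it as a single large attention op. To reduce memory access overhead, this PR applies two optimizations: (1) when computing q, k and v, the input X is shared so that the gemm, transpose and bias add are reduced from three calls to one; (2) kernel fusion is used to pass data between different CUDA kernels through registers.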
-
- 25 Oct, 2021 8 commits
-
-
Committed by WangXi
* Revert "Add fused_dropout wrapper to ease use. (#36185) (#36640)" This reverts commit 05d7e2fd. * [hybrid] seed and dropout op support force-cpu (#35820) * [HIP] fix op not support AMD GPU bug, the flag PADDLE_WITH_ROCM is invalid * [HIP] fix op not support AMD GPU bug, the flag PADDLE_WITH_ROCM is invalid * [HIP] fix op not support AMD GPU bug * [hybrid] seed and dropout op support force-cpu * [hybrid] seed and dropout op support force-cpu * [hybrid] seed and dropout op support force-cpu * [hybrid] seed and dropout op support force-cpu * [hybrid] seed and dropout op support force-cpu * [hybrid] fix seed ci failed issue * add AsExtra for force_cpu of seed op * Add fused_dropout wrapper to ease use. (#36185) * [hybrid] static model parallel dropout support deterministic RandomSeedGenerator (#36228) Co-authored-by: xiayanming <41795079@qq.com> Co-authored-by: Li Min <11663212+limin2021@users.noreply.github.com>
-
Committed by whs
* Fix grid sampler * Fix code format
-
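Committed by Zeng Jinle
-
Committed by baoachun
-
Committed by Liu-xiandong
Add paddle.nn.functional.sparse_attention API. This PR mainly wraps the sparse_attention functionality at the Python layer; the main OP code is in #PR35676. In addition, unit tests are added for the wrapped Python interface.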
-
Committed by zhangbo9674
Add fp16 kernel for clip_op.
-
Committed by Li Min
Purpose: this PR aims to improve the computational performance of the attention module. To reduce the framework's op scheduling overhead, this PR implements the attention module by hand in C++ and exposes it as a single large attention op. To reduce memory access overhead, this PR applies two optimizations: (1) when computing q, k and v, the input X is shared so that the gemm, transpose and bias add are reduced from three calls to one; (2) kernel fusion is used to pass data between different CUDA kernels through registers.
-
Committed by Li Min
In the fused_attention op and fused_ffn op, the fused bias_add+dropout+residual+layernorm kernel or bias_add+dropout+residual kernel is used. To ease the use of this kernel, we provide a wrapper in this PR. 1. To reuse the increment computing code, we extract the corresponding code into the "GetSeedDataAndIncrement" routine in dropout_impl_util.h. 2. fused_dropout_helper.h provides the fused dropout kernel wrapper. Note: the test of this wrapper will be provided in the following fused_attention_op and fused_ffn PRs.
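As a reference for what the bias_add+dropout+residual+layernorm chain computes on a single row, here is a minimal pure-Python sketch. It is not the CUDA wrapper from this PR: the function names are hypothetical, the dropout mask is passed in explicitly instead of being drawn from a seeded generator, and the "upscale in train" scaling convention is an assumption.

```python
import math


def bias_dropout_residual(x, bias, residual, mask, dropout_prob):
    """out[i] = residual[i] + dropout(x[i] + bias[i]) for one row.

    mask[i] is 1 for kept elements and 0 for dropped ones.
    Requires dropout_prob < 1.
    """
    keep_scale = 1.0 / (1.0 - dropout_prob)  # assumed scaling convention
    return [r + (xi + b) * m * keep_scale
            for xi, b, r, m in zip(x, bias, residual, mask)]


def layer_norm(row, gamma, beta, eps=1e-5):
    """Normalize one row to zero mean and unit variance, then scale and shift."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * g + b
            for v, g, b in zip(row, gamma, beta)]


def bias_dropout_residual_layer_norm(x, bias, residual, mask, dropout_prob,
                                     gamma, beta):
    """Composition of the two steps above, done in one call."""
    summed = bias_dropout_residual(x, bias, residual, mask, dropout_prob)
    return layer_norm(summed, gamma, beta)


x = [1.0, 2.0, 3.0, 4.0]
bias = [0.5, 0.5, 0.5, 0.5]
residual = [1.0, 1.0, 1.0, 1.0]
mask = [1, 0, 1, 1]

pre_norm = bias_dropout_residual(x, bias, residual, mask, 0.5)
normed = bias_dropout_residual_layer_norm(
    x, bias, residual, mask, 0.5, [1.0] * 4, [0.0] * 4)
```

In an unfused implementation each step would write its intermediate result to memory and read it back; a fused kernel keeps the intermediate values on chip, which is the motivation for wrapping the chain as one call.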
-
- 24 Oct, 2021 1 commit
-
-
Committed by Jack Zhou
* add viterbi decode cpu kernel * add viterbi decoder api in paddle.text * add a data buffer once to avoid create many small pieces of data buffer frequently * fix viterbi max_seq_length bug * fix seq_len=1 bug * fix device context * move split out of for loop * remove INVERSE_SUB * remove 2 GET_CAST_MASK * remove 1 loop * remove Functor * add to_static deploy code * use MAX_FUNC instead of ELE_MAX * add MaxFunctor * impl max_func * remove MaxFunctor * remove cast op * use REGISTER_OP_WITHOUT_GRADIENT * add viterbi cuda kernel * add FIX_BLOCKDIM_CASE macro * add MKL add, mul; add get data mask * add arange mkl impl * add CPU Argmax * add cpu gather * use EXECUTE_MKL_ELEMENT_BINARY_OP instead of some ADD, MUL * use SameDimsBinaryOP instead of EXECUTE_MKL_ELEMENT_BINARY_OP * use SAME_DIMS_ELEMENT_BINARY_OP * add SimpleBroadcastBinaryOP * use int instead of int64_t to accelerate * optimize SimpleBroadcastBinaryOP * optimize SimpleBroadcastBinaryOP * optimize performance in both single thread and multithread situation * remove useless line * remove useless code * add CREATE_TENSOR_BUFFER macro * add INIT_REQUIRED_TENSOR macro * add comment * fix windows ci * add viterbi unittest * remove cuda add functor * remove cuda equal * remove a template function * fix windows ci * fix windows dtype * remove some template instance * remove useless header file * remove some blockdim * remove transpose impl * accelerate cpu performance on single thread situation * viterbi_decode->crf_decode * rename crf params name * add viterbi api test * remove useless import * add enable_static * use viterbi decoder * fix viterbi len=1 * fix viterbi unittest * remove useless comments * reconstruct viterbi decode * remove ADD,SUB,MUL structure * fix coverage * remove CREATE_TENSOR * add name args * crf.py->ops.py; with_start_stop_tag->include_start_end_tag * update crf_decode en docs * fix viterbi decode en docs * fix some review comments * add FIXED_BLOCK_DIM_CASE in cuda * 
push_back->emplace_back * crf_decode->viterbi_decode; include_start_end_tag->include_bos_eos_tag * paddle.text.ops.viterbi_decode->paddle.text.viterbi_decode * fix viterbi_decode en docs
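For readers unfamiliar with the algorithm behind viterbi_decode, a minimal single-sequence Python sketch follows. It illustrates the dynamic program only and is not the CPU or CUDA kernel from this commit: it ignores batching, per-sequence lengths and the include_bos_eos_tag option, and the function signature is hypothetical.

```python
def viterbi_decode(emissions, transitions):
    """Return (best_score, best_path) for one sequence.

    emissions[t][j]:   score of tag j at step t.
    transitions[i][j]: score of moving from tag i to tag j.
    """
    num_tags = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_scores = []
        pointers = []
        for j in range(num_tags):
            # Best previous tag for ending at tag j in this step.
            best_prev = max(range(num_tags),
                            key=lambda i: scores[i] + transitions[i][j])
            pointers.append(best_prev)
            new_scores.append(
                scores[best_prev] + transitions[best_prev][j] + emit[j])
        scores = new_scores
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag to recover the path.
    last = max(range(num_tags), key=lambda j: scores[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return scores[last], path


emissions = [[2, 0], [0, 3], [3, 0]]
transitions = [[0, -1], [-1, 0]]
best_score, best_path = viterbi_decode(emissions, transitions)
```

The path 0, 1, 0 scores 2 - 1 + 3 - 1 + 3 = 6, which beats staying in tag 0 throughout (score 5).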
-
- 22 Oct, 2021 1 commit
-
-
Committed by niuliling123
* Fix a bug in ReadData, ReadDataBc and ReadDataReduce when NX != 1 * Update the implementation of reduceAnyKernel according to kernel primitive api
-
- 21 Oct, 2021 2 commits
-
-
Committed by niuliling123
* Add functor_primitives.h for kernel primitive api
-
Committed by littletomatodonkey
* fix replicate pad when input size is 0 * add unit test
-
- 19 Oct, 2021 2 commits
-
-
Committed by Liu-xiandong
The code in this PR supports only CUDA 11.2. CI currently has no GPU with CUDA 11.2, so all tests are skipped automatically. The new OP is paddle._C_ops.sparse_attention. The Python API will be handled in a follow-up PR. This PR lacks tests on dynamic graphs and static graphs; they will be added in subsequent PRs.
-
Committed by Wilber
-
- 13 Oct, 2021 1 commit
-
-
Committed by jakpiase
-