- 21 Dec 2021, 4 commits
-
Committed by Chen Weihang
* rename cuda to gpu
* revert CMake change
* resolve conflict
* rename other cuda to gpu
* polish details
-
Committed by crystal
* relu forward opt
* add gelu functor
* optimize code
-
Committed by arlesniak
-
Committed by sneaxiy
* mean first version
* fix scalar mean
* add fp16 dtype for api
-
- 20 Dec 2021, 9 commits
-
Committed by chentianyu03
* add pten conj kernel
* modify conj_kernel file path
* add defined cuda macro to cuda/conj_kernel.h
-
Committed by baoachun
-
Committed by fwenguang
-
Committed by sneaxiy
* support FP16 for more ops
* add amp list tests
* refine reduce_mean_grad
* fix OP benchmark ci
* fix fp16 reduce_mean
* update ut, but still have some problems
* remove mean/reduce_mean fp16 kernel
-
Committed by Feng Xing
softmax_with_cross_entropy optimization with soft label. This PR optimizes:
* "SoftmaxWithCrossEntropySoftLabel": compute log_softmax and then compute the loss.
* "CrossEntropySoftLabel": compute the loss with softmax as input.
These optimizations use the following techniques: read data into a buffer with vectorized loads, compute max and sum within a warp, and fix the loop size with a macro (a sketch of this pattern follows below).
Performance (computation time):
* softmax_with_cross_entropy_0 (forward): -40.1%
* softmax_with_cross_entropy_0 (backward): -41%
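The three techniques above follow a common CUDA pattern. The block below is a minimal, hypothetical sketch of that pattern for the log_softmax pass only; it is not the actual Paddle kernel, and the names WarpReduceMax, WarpReduceSum, kVecSize, and LogSoftmaxRowSketch are illustrative. It assumes one warp (32 threads) per row and a column count that is a multiple of the vector width, with 16-byte-aligned rows.

```cuda
// Hypothetical sketch only (not Paddle's kernel): one warp computes log_softmax
// for one row using vectorized float4 loads and warp-level max/sum reductions.
#include <cfloat>
#include <cuda_runtime.h>

__device__ __forceinline__ float WarpReduceMax(float val) {
  // Butterfly shuffle: after 5 steps every lane holds the warp-wide max.
  for (int offset = 16; offset > 0; offset >>= 1) {
    val = fmaxf(val, __shfl_xor_sync(0xffffffff, val, offset));
  }
  return val;
}

__device__ __forceinline__ float WarpReduceSum(float val) {
  for (int offset = 16; offset > 0; offset >>= 1) {
    val += __shfl_xor_sync(0xffffffff, val, offset);
  }
  return val;
}

constexpr int kVecSize = 4;  // plays the role of the "fixed loop size" macro

__global__ void LogSoftmaxRowSketch(const float* __restrict__ x,
                                    float* __restrict__ out, int cols) {
  const int row = blockIdx.x;
  const int lane = threadIdx.x;  // assumes blockDim.x == 32
  const float4* x_vec = reinterpret_cast<const float4*>(x + row * cols);
  float4* out_vec = reinterpret_cast<float4*>(out + row * cols);
  const int num_vec = cols / kVecSize;

  // Pass 1: vectorized read into registers, warp-wide max.
  float max_val = -FLT_MAX;
  for (int i = lane; i < num_vec; i += 32) {
    float4 v = x_vec[i];
    max_val = fmaxf(max_val, fmaxf(fmaxf(v.x, v.y), fmaxf(v.z, v.w)));
  }
  max_val = WarpReduceMax(max_val);

  // Pass 2: warp-wide sum of exp(x - max).
  float sum_val = 0.f;
  for (int i = lane; i < num_vec; i += 32) {
    float4 v = x_vec[i];
    sum_val += expf(v.x - max_val) + expf(v.y - max_val) +
               expf(v.z - max_val) + expf(v.w - max_val);
  }
  const float log_sum = logf(WarpReduceSum(sum_val));

  // Pass 3: write log_softmax = x - max - log(sum).
  for (int i = lane; i < num_vec; i += 32) {
    float4 v = x_vec[i];
    out_vec[i] = make_float4(v.x - max_val - log_sum, v.y - max_val - log_sum,
                             v.z - max_val - log_sum, v.w - max_val - log_sum);
  }
}
```

A launch such as `LogSoftmaxRowSketch<<<num_rows, 32>>>(x, out, cols)` would then process one row per warp, with the max and sum reductions staying entirely in registers.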
-
Committed by 石晓伟
-
Committed by Feiyu Chan
-
Committed by Sylwester Fraczek
-
Committed by YuanRisheng
* fix bugs when running reshape
* fix ci bug
-
- 18 Dec 2021, 3 commits
-
Committed by Noel
-
Committed by Guoxia Wang
-
Committed by Feiyu Chan
* add complex op and `paddle.complex`.
-
- 17 Dec 2021, 6 commits
-
Committed by sneaxiy
* support multi precision update for LAMB
* hide some api
* fix ci uts
* fix lamb output of dygraph
* remove some changes to some PR
* try to fix Py3 CI compile error
* fix test_imperative_optimizer, add lars ut, add layer_norm ut
* fix ut, fix format
* fix ut
* fix windows ci
-
Committed by chentianyu03
* modify sum mean args
* add GetExpectedPtenKernelArgs for reduce_op
* modify kernel args number
* modify kernel args number
-
Committed by kuizhiqing
-
Committed by zlsh80826
According to --ptxas-options=-v, SegmentOpsKernel uses 66 registers per thread. There are two ways to resolve this problem:
* reduce the threads-per-block launch configuration, or
* add __launch_bounds__ to give the nvcc compiler information for reducing register usage.
This PR chooses the __launch_bounds__ solution because changing gpu_launch_config may affect other ops (see the sketch below).
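As a minimal illustration of this approach (the segment-sum body, the name SegmentSumSketch, and the bound values 256/2 are assumptions, not the actual SegmentOpsKernel), __launch_bounds__ is attached to the __global__ definition so that nvcc can cap per-thread register usage for that block-size ceiling:

```cuda
// Illustrative only: promise the compiler a maximum block size (and optionally
// a minimum number of resident blocks per SM) so it can limit register usage.
#include <cuda_runtime.h>

constexpr int kMaxThreadsPerBlock = 256;  // upper bound promised to the compiler
constexpr int kMinBlocksPerSM = 2;        // optional second argument

__global__ void __launch_bounds__(kMaxThreadsPerBlock, kMinBlocksPerSM)
    SegmentSumSketch(const float* __restrict__ in, float* __restrict__ out,
                     const int* __restrict__ segment_ids, int n) {
  const int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) {
    // Accumulate each input element into its segment's output slot.
    atomicAdd(&out[segment_ids[idx]], in[idx]);
  }
}
```

The trade-off is that launching the kernel with more threads per block than the promised bound fails at launch time, so the bound has to cover every configuration gpu_launch_config may produce for this op.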
-
Committed by niuliling123
-
Committed by limingshu
* fix_bugs_for_elementwise_branch_selection
* fix merge_dims bugs
* fix all influenced files
-
- 16 Dec 2021, 11 commits
-
Committed by Tomasz Socha
* Faster implementation of CPU kernel for ROI_ALIGN Operator
* Add missing variable to CUDA roi_align_op
* Style
* Fix boundaries
* Rename variables for indexes calculation
* Remove unnecessary emplace
* Revert "Remove unnecessary emplace" (reverts commit c10e87f7fb812f1a672fde32f2690a97d47e2f5a)
* Style
-
Committed by chentianyu03
-
Committed by xiaoting
* add activation
* update activation_op
* add unit test for activation
* fix acosh for init, test=develop
-
Committed by Chen Weihang
* add register_ctx_kernel and move scale kernel
* polish details by reviewer comment
* fix xpu compile failure
* fix cmake error
-
Committed by danleifeng
* trainer_device fix and checknan tool for psgpu; test=develop
* disable show_one_table; test=develop
-
Committed by LJQ❤️
Add elementwise_fmax and elementwise_fmin operators
-
Committed by Liu-xiandong
Add key_padding_mask and attn_mask to the sparse_attention API.
1. The key padding mask is a tensor with dimensions [batch_size, seq_len], and the attention mask is a tensor with dimensions [seq_len, seq_len]. The data types of the two masks are consistent with Q, K, and V: float32 or float64. If a value in a mask is 0, the corresponding position needs to be masked (see the sketch below).
2. The changed files are mainly paddle/fluid/operators/sparse_attention_op.cu and python/paddle/fluid/tests/unittests/test_sparse_attention_op.py. sparse_attention has three parts: sddmm, softmax, and dsd. Adding the mask operation only requires modifying the softmax; it has no effect on the other two parts. In addition, related tests have been added to cover the mask function.
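As a hedged illustration of the masking rule only (heads are omitted, the layout is dense rather than the op's sddmm/softmax/dsd pipeline, and ApplyAttentionMasks is a made-up name), the sketch below sets any logit whose key-padding or attention mask value is 0 to -FLT_MAX before the softmax:

```cuda
// Illustrative sketch, not Paddle's sparse softmax kernel: a mask value of 0
// turns the corresponding attention logit into (effectively) -inf.
#include <cfloat>
#include <cuda_runtime.h>

__global__ void ApplyAttentionMasks(float* logits,                  // [batch, seq_len, seq_len]
                                    const float* key_padding_mask,  // [batch, seq_len]
                                    const float* attn_mask,         // [seq_len, seq_len]
                                    int batch, int seq_len) {
  const int idx = blockIdx.x * blockDim.x + threadIdx.x;
  const int total = batch * seq_len * seq_len;
  if (idx >= total) return;

  const int b = idx / (seq_len * seq_len);
  const int key = idx % seq_len;             // column: the attended (key) position
  const int qk = idx % (seq_len * seq_len);  // offset inside the [seq_len, seq_len] tile

  // Mask value 0 means "this position must be masked out".
  if (key_padding_mask[b * seq_len + key] == 0.f || attn_mask[qk] == 0.f) {
    logits[idx] = -FLT_MAX;
  }
}
```

A standard softmax over each row afterwards then assigns effectively zero probability to the masked positions.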
-
Committed by niuliling123
* Add the transformop parameter in TensorReduceFunctorImpl
-
Committed by YuanRisheng
* Reduce reshape kernel functions in pten
* delete notes
* fix bugs when compiling
* modify register name
* fix compile bugs
-
Committed by Li Min
* Add float16 type for scatter op.
* Add fp16 test for scatter op.
* Add int and int64 support for scatter_grad on gpu.
* Add int and int64 for check_variable_and_dtype routine.
* Minors.
* Code format.
-
- 15 Dec 2021, 3 commits
-
Committed by Yiqun Liu
test=document_fix
-
Committed by Huihuang Zheng
As the title.
-
Committed by chentianyu03
* replace with pten kernel in cast cuda compute and remove unused codes
* rm unused header file
* replace CastCUDAOpKernel with CastOpKernel
-
- 14 Dec 2021, 4 commits
-
Committed by Sylwester Fraczek
* add map_matmul passes to quant2_int8_mkldnn_pass
* fix fc+act fuse (activation scale)
* ci fix, c++17 structured bindings not available
* fix ci static check
-
Committed by baoachun
* add conv_gelu_mkldnn_fuse_pass
* add post ops
-
Committed by weishengying
-
Committed by YuanRisheng
* Reduce reshape kernel functions in pten
* delete notes
* fix bugs when compiling
-