- 20 12月, 2021 10 次提交
-
-
由 sneaxiy 提交于
* support FP16 for more ops * add amp list tests * refine reduce_mean_grad * fix OP benchmark ci * fix fp16 reduce_mean * updat ut, but still have some problems * remove mean/reduce_mean fp16 kernel
-
由 Feng Xing 提交于
softmax_with_cross_entropy optimization with soft label. This PR includes optimization of "SoftmaxWithCrossEntropySoftLabel" : compute log_softmax and then compute loss. "CrossEntropySoftLabel" : compute loss with softmax as input. These optimization includes following technics: read data to buffer with vectorization compute max and sum in warp fixed loop size with macro Performance (computation time): softmax_with_cross_entropy_0 (forward) : -40.1% softmax_with_cross_entropy_0 (backward): -41%
-
由 石晓伟 提交于
-
由 Feiyu Chan 提交于
-
由 heliqi 提交于
* add matmul_scale matmul_v2_scale fuse pass * add scaletensor judge * modify var name * add timeout notest;test=coverag * fix error commit * fix use_mkldnn attr * fix use_mkldnn attr
-
由 Sylwester Fraczek 提交于
-
由 zhangbo9674 提交于
* add multi_tensor for momentum and clear_grads for optimizer * fix bug for dygraph * add unittest * refine comment * add param_group * refine regularizaiton logic * del clear_grads * add clear_grads * add dispensable check of None * refine clear_grad * fix build bug * refine code by comment * refine code * add multi tensor check * refine param_group update * add multi tensor for static mode * refine comments * delete useless comma for momentum * refine comment for momentum * refine code by commment
-
由 Yuang Liu 提交于
-
由 YuanRisheng 提交于
* fix bugs when run reshape * fix ci bug
-
由 zyfncg 提交于
-
- 18 12月, 2021 3 次提交
-
-
由 Noel 提交于
-
由 Guoxia Wang 提交于
-
由 Feiyu Chan 提交于
* add complex op and `paddle.complex`.
-
- 17 12月, 2021 17 次提交
-
-
由 Jiabin Yang 提交于
* support more eager tensor api * support multiple constructor for eager tensor * add place related code * polish code * specific randint with dtype of int64 * Support pure cpu test * refine test in pure cpu * refine test in pure cpu
-
由 Leo Chen 提交于
-
由 sneaxiy 提交于
* support multi precision update for LAMB * hide some api * fix ci uts * fix lamb output of dygraph * remove some changes to some PR * try to fix Py3 CI compile error * fix test_imperative_optimizer, add lars ut, add layer_norm ut * fix ut, fix format * fix ut * fix windows ci
-
由 feng_shuai 提交于
-
由 Sing_chan 提交于
-
由 chentianyu03 提交于
* modify sum mean args * add GetExpectedPtenKernelArgs for redcue_op * modify kernel args number * modify kernel args number
-
由 LiYuRio 提交于
-
由 Zhanlue Yang 提交于
* Rearranged Eager AutoCodeGen directory structure * Removed USE_OP in Eager AutoCodeGen * Enabled generation for Operators without Grad/Inputs/Outputs * Resolved operators without input * Fixed merge conflicts * Enabled Eager AutoCodeGen for 10+ more operators * Refactored Eager AutoCodeGen with more organized helper objects * Enabled Eager AutoCodeGen for operators with multiple OpBases * Adjusted Eager AutoCodeGen to Enable Passing Output Tensor as Input Argument * Handled Dispensable Inputs/Outputs in Eager AutoCodeGen * Adjusted function generation/call between Python-C API & Dygraph API * Synchronized auto-generated Python-C API with Dygraph Forward Functions * Generated CoreOpsInfos for potential use in append_op API * Fixed CI problem
-
由 kuizhiqing 提交于
-
由 Leo Chen 提交于
* Inspect the information inside a TRT engine. * Follow up the google code style. * Fix code error.
-
由 zlsh80826 提交于
From --ptxas-options=-v, SegmentOpsKernel uses 66 registers in a block. There are two ways to resolve this problem: Reduce the threads per block launch configuration add __launch_bound__ to give information to nvcc compiler for reducing registers usage this PR chooses __launch_bound__ solution because changing gpu_launch_config may affect other ops.
-
由 niuliling123 提交于
-
由 From00 提交于
* Get GPU BasePtr from CUDA allocation * Fix compile error for ROCm * Add BasePtr function for IPUPlace in naive_best_fit_allocator.cc * Add alignment for BuddyAllocator * Set address alignment of BuddyAllocator to 32 bytes * Fix CI error * Remove code for naive_best_fit strategy
-
由 From00 提交于
-
由 Yuang Liu 提交于
-
由 limingshu 提交于
* fix_bugs_for_elementwise_branch_selection * fix merge_dims bugs * fix all influenced file
-
由 houj04 提交于
-
- 16 12月, 2021 10 次提交
-
-
由 Leo Chen 提交于
* fix cmake * not check execution time
-
由 chentianyu03 提交于
-
由 Tomasz Socha 提交于
* Faster implementation of CPU kernel for ROI_ALIGN Operator * Add missing variable to CUDA roi_align_op * Style * Fix boundaries * Rename variables for indexes calculation * Remove unnecessary emplace * Revert "Remove unnecessary emplace" This reverts commit c10e87f7fb812f1a672fde32f2690a97d47e2f5a. * Style
-
由 chentianyu03 提交于
-
由 Zhanlue Yang 提交于
* Rearranged Eager AutoCodeGen directory structure * Removed USE_OP in Eager AutoCodeGen * Enabled generation for Operators without Grad/Inputs/Outputs * Resolved operators without input * Fixed merge conflicts * Enabled Eager AutoCodeGen for 10+ more operators * Refactored Eager AutoCodeGen with more organized helper objects * Enabled Eager AutoCodeGen for operators with multiple OpBases * Adjusted Eager AutoCodeGen to Enable Passing Output Tensor as Input Argument * Handled Dispensable Inputs/Outputs in Eager AutoCodeGen * Enabled Eager AutoCodeGen for All Existing Operators & Possible Future Operators * Fixed CI issues * Fixed LD_LIBRARY_PATH for eager_code_generator
-
由 xiaoting 提交于
* add activation * update activation_op * add unitest for activation * fix acosh for init, test=develop
-
由 feng_shuai 提交于
* conv_transpose_eltwiseadd_bn_fuse_pass * change timeout * add TIMEOUT * add random num for group and dilation * change PassCompat
-
由 yeliang2258 提交于
* add test for conv_elementwise_add2_act_fuse_pass and conv_elementwise_add_act_fuse_pass * Add conv_eltwiseadd_bn_fuse_pass test and fix test_conv_elementwise_addX_act_fuse_pass * add tests for conv_act_mkldnn_fuse_pass * add test for conv_bias_mkldnn_fuse_pass * update code * add conv_act_mkldnn_fuse_pass for relu, relu6, swish, leaky_relu * update test * update * update bug * update * update pattern_detector * fix test_conv_eltwiseadd_bn_fuse_pass * add diff display notest;test=windows_ci_inference * fix * remove test_conv_act_mkldnn_fuse_pass.py * ifix
-
由 Chen Weihang 提交于
* add register_ctx_kernel and move scale kernel * polish details by reviewer comment * fix xpu compile failed * fix cmake error
-