- 20 Dec, 2022 1 commit
-
-
Committed by ShenLiang
Co-authored-by: Ming-Xu Huang <mingh@nvidia.com>
-
- 28 Nov, 2022 1 commit
-
-
Committed by zlsh80826
* Reduce squeeze2_matmul_fuse_pass, flatten tests time (#47098)
* Add missing fp32 config and reduce the testing combination
* Reduce trt matmul pass test max examples
* Loosen TRT fp16 tests tolerance (#47100)
* Loosen TRT half test tolerance to 1e-3 (#47101)
* Loosen TRT half test tolerance to 1e-3 (#47106)
* Update distributed_strategy.proto (#46531)
* Close popen pipe after use (#47053)
* Add launch_bounds (#47285)
* Fix TRT UT failures (#47488)
* Format cherry-picked commits
* CudnnNormConvolution is no longer supported on NVIDIA Hopper GPUs (#48203)
* Skip tests that use fused_ops on H100
* Add error message to FusedOps on H100
Co-authored-by: Shijie <505749828@qq.com>
Co-authored-by: Leo Chen <39020268+leo0519@users.noreply.github.com>
Co-authored-by: Tian Zheng <tizheng@nvidia.com>
-
- 26 Oct, 2022 1 commit
-
-
Committed by sneaxiy
[Cherry-pick][Release/2.4] Refine the memory usage of fused_attention and fused_feedforward ops (#47235)
* fix fused_attention fused_feedforward
* fix ci
* fix ci
* fix ci PADDLE_GET_CONST
* fix ci ut
-
- 20 Oct, 2022 1 commit
-
-
Committed by sneaxiy
support pure bfloat16 for more ops
-
- 18 Oct, 2022 1 commit
-
-
Committed by Haohongxiang
* [Dygraph] Fix performance of pp+mp by using send/recv_calc_stream instead of send/recv (#46116)
* [Dygraph] Fix Perf of FusedFeedForward and FusedAttention with AllReduce (#46780)
* update
-
- 19 Sep, 2022 2 commits
-
-
Committed by minghaoBD
Co-authored-by: RichardWooSJTU <37864677+RichardWooSJTU@users.noreply.github.com>
-
Committed by xiaoxiaohehe001
-
- 09 Sep, 2022 1 commit
-
-
Committed by sneaxiy
-
- 08 Sep, 2022 2 commits
-
-
Committed by taixiurong
* add gemm_epilogue
* xpu-paddlepaddle-40 [Task] fused_gemm_epilogue support test=kunlun
-
Committed by sneaxiy
-
- 07 Sep, 2022 1 commit
-
-
Committed by Wilber
-
- 01 Sep, 2022 1 commit
-
-
Committed by Leo Chen
-
- 31 Aug, 2022 1 commit
-
-
Committed by Wilber
-
- 23 Aug, 2022 1 commit
-
-
Committed by niuliling123
-
- 17 Aug, 2022 1 commit
-
-
Committed by Wilber
* fix multi stream error.
-
- 16 Aug, 2022 1 commit
-
-
Committed by feng_shuai
* convert multihead to oss
* fix:bug
* fix:delete const cast
* fix:don't support bias_qk
* add vit pass
* fix:convert bug and add preln_residual_bias
* support length=-1
* add UT for convert
* add no_bias_qk support for gpu_multihead_op
* delete infer_shape depends on bias_qk
* oss just can be used in T4 and A*
* fix:change api for ROCM CI
-
- 15 Aug, 2022 2 commits
-
-
Committed by Yuanle Liu
-
Committed by Wilber
* convert_fp16 support multi block
* update
* update
-
- 09 Aug, 2022 1 commit
-
-
Committed by Allen Guo
-
- 05 Aug, 2022 1 commit
-
-
Committed by carryyu
* add fused_multi_transformer post_layer_norm
* add test post_layer_norm
-
- 02 Aug, 2022 1 commit
-
-
Committed by Wilber
* multihead matmul add fp16
* fix windows error
* fix rocm error
* fix rocm error
-
- 01 Aug, 2022 1 commit
-
-
Committed by Leo Chen
* remove cudaDeviceContext
* remove more template
* fix rocm compile
* remove alias name CUDADeviceContext
* fix compile
* fix tests
* revert changes
-
- 29 Jul, 2022 3 commits
-
-
Committed by Leo Chen
* remove cudaDeviceContext
* remove more template
* fix rocm compile
-
Committed by QingshuChen
* add some fp16 op for kunlun resnet50 model *test=kunlun
* tmp *test=kunlun
-
Committed by ming1753
* fused_fc_elementwise_layernorm support fp16
* fused_fc_elementwise_layernorm support double
-
- 26 Jul, 2022 1 commit
-
-
Committed by Wilber
* multi stream support handle lazy init.
* support eigen lazy init
* update
* fix ci problem
-
- 19 Jul, 2022 2 commits
-
-
Committed by Ruibiao Chen
* Rename BOOST_GET macros
* Fix conflicts
-
Committed by zhangyikun02
-
- 18 Jul, 2022 1 commit
-
-
Committed by QingshuChen
* add xpu resnet_unit *test=kunlun
* tmp *test=kunlun
-
- 13 Jul, 2022 1 commit
-
-
Committed by zhangyikun02
-
- 12 Jul, 2022 1 commit
-
-
Committed by Yuang Liu
-
- 08 Jul, 2022 2 commits
-
-
Committed by xiaoxiaohehe001
-
Committed by zhangyikun02
-
- 07 Jul, 2022 2 commits
-
-
Committed by Zhang Zheng
* Fix nan in fast_ln_fwd_kernel when cols > 1024
* delete blas
-
Committed by zhangyikun02
-
- 06 Jul, 2022 2 commits
-
-
Committed by LiYuRio
-
Committed by xiaoxiaohehe001
* conv_fusion
-
- 02 Jul, 2022 1 commit
-
-
Committed by Leo Chen
* fix init()
* delete test_device_context
* replace CPUDeviceContext with CPUContext
* fix test_scalar
* remove dot_op.cc
* fix compile
-
- 01 Jul, 2022 1 commit
-
-
Committed by limingshu
* 2nd part of transpose update
* add switch_auto_tune option.
* add some changes according to Ci
* refine the structure of auto_tune_base.
* merge develop changes
* reset the switch_set_range and change unittest of transpose auto-tune
* change the kernel auto-tune logits
-
- 30 Jun, 2022 1 commit
-
-
Committed by wanghuancoder
* fused_gate_attention manual code in eager
* refine
* refine
* refine
* refine
* refine
* refine
-