Add launch bounds to limit register usage on the Volta architecture (#38113)

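From --ptxas-options=-v, SegmentOpsKernel uses 66 registers per thread. With blocks of up to 1024 threads (the bound set in this change), a block would need 66 × 1024 = 67,584 registers, more than the 65,536 available per block on Volta. There are two ways to resolve this problem:

1. Reduce the threads per block in the launch configuration.
2. Add __launch_bounds__ to give the nvcc compiler the information it needs to reduce register usage.

This PR chooses the __launch_bounds__ solution because changing gpu_launch_config may affect other ops.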
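
As a rough illustration of the second option, the sketch below applies the same qualifier to a small stand-in kernel and asks the runtime what the compiler produced. `ScaleKernel`, `RegistersPerBlock`, and `kRegsPerBlockVolta` are hypothetical names introduced here, not part of Paddle; the 65,536-register figure is the per-block limit for compute capability 7.0. The actual register count depends on the compiler version and target architecture, so none is shown.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumption: per-block register limit on Volta (sm_70), in 32-bit registers.
constexpr int kRegsPerBlockVolta = 64 * 1024;

// Hypothetical helper: registers one block needs, ignoring allocation granularity.
constexpr int RegistersPerBlock(int regs_per_thread, int threads_per_block) {
  return regs_per_thread * threads_per_block;
}

// Illustrative kernel (not SegmentOpsKernel). __launch_bounds__(1024, 1) tells
// nvcc the kernel is launched with at most 1024 threads per block and that at
// least 1 resident block per SM is wanted, so it limits registers per thread
// accordingly.
template <typename T>
__global__ void __launch_bounds__(1024, 1)
    ScaleKernel(const T* input, T* output, T factor, int n) {
  // Grid-stride loop over the n elements.
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += blockDim.x * gridDim.x) {
    output[i] = input[i] * factor;
  }
}

int main() {
  // The arithmetic behind the problem: 66 registers x 1024 threads.
  std::printf("needed: %d, available: %d\n", RegistersPerBlock(66, 1024),
              kRegsPerBlockVolta);

  // Query what the compiler actually generated for the bounded kernel.
  cudaFuncAttributes attr;
  cudaError_t err = cudaFuncGetAttributes(&attr, ScaleKernel<float>);
  if (err == cudaSuccess) {
    std::printf("numRegs=%d maxThreadsPerBlock=%d\n", attr.numRegs,
                attr.maxThreadsPerBlock);
  } else {
    std::printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
  }
  return 0;
}
```

Compiling with `nvcc --ptxas-options=-v` prints the per-thread register count at build time, and `cudaFuncGetAttributes` reports the same figure at run time. The qualifier is a constraint on the compiler, so a launch with more than 1024 threads per block will fail for a kernel declared this way.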

18a59822 · zlsh80826 · GitHub · 76eb371e · 18a59822

Showing 1 changed file with 3 additions and 2 deletions

paddle/fluid/operators/math/segment_pooling.cu +3 -2

--- a/paddle/fluid/operators/math/segment_pooling.cu
+++ b/paddle/fluid/operators/math/segment_pooling.cu
@@ -120,8 +120,9 @@ __global__ void SegmentMeanKernel(const Index* segment_ids, const T* input,
 }
 template <typename T, typename Index, typename Helper, typename Pool>
-__global__ void SegmentOpsKernel(const Index* segment_ids, const T* input,
-                                 T* output, Helper h, Pool pool) {
+__global__ void __launch_bounds__(1024, 1)
+    SegmentOpsKernel(const Index* segment_ids, const T* input, T* output,
+                     Helper h, Pool pool) {
  CUDA_KERNEL_LOOP(stripe_index, h.total_stripe_count) {
    Index segment_offset, dim_index_base, actual_height;
    Index inner_dim_size = h.inner_dim_size;