Unverified commit 18a59822, author: zlsh80826, committer: GitHub

Add launch bounds to limit register usage on the Volta architecture (#38113)

According to --ptxas-options=-v, SegmentOpsKernel uses 66 registers per thread.
There are two ways to resolve this problem:
    Reduce the threads per block in the launch configuration.
    Add __launch_bounds__ so that the nvcc compiler has the information it needs to reduce register usage.
This PR chooses the __launch_bounds__ solution because changing gpu_launch_config may affect other ops.
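The register arithmetic behind this choice can be sketched as follows. This is a minimal host-side illustration, not code from the PR. It assumes that a block on Volta can use at most 65,536 32-bit registers, that the per-thread cap is 255, and that the kernel is launched with up to 1024 threads per block, as the bound in the diff suggests. The helper names (`RegsPerBlock`, `FitsRegisterFile`, `RegBudgetPerThread`) are hypothetical, and register allocation granularity is ignored.

```cpp
// Assumed limits for Volta (compute capability 7.0); verify them against the
// CUDA documentation for the actual target device.
constexpr int kMaxRegsPerBlock = 65536;  // 32-bit registers usable by one block
constexpr int kMaxRegsPerThread = 255;   // hardware cap per thread

// Registers a block needs if every thread uses `regs_per_thread`.
// Allocation granularity is ignored, so real demand can be slightly higher.
constexpr int RegsPerBlock(int regs_per_thread, int threads_per_block) {
  return regs_per_thread * threads_per_block;
}

// True if the configuration fits in the assumed per-block register limit.
constexpr bool FitsRegisterFile(int regs_per_thread, int threads_per_block) {
  return RegsPerBlock(regs_per_thread, threads_per_block) <= kMaxRegsPerBlock;
}

// Largest per-thread register count that still fits for a given block size,
// clamped to the per-thread hardware cap.
constexpr int RegBudgetPerThread(int threads_per_block) {
  return kMaxRegsPerBlock / threads_per_block < kMaxRegsPerThread
             ? kMaxRegsPerBlock / threads_per_block
             : kMaxRegsPerThread;
}

// 66 registers per thread at 1024 threads per block overshoots the limit.
static_assert(RegsPerBlock(66, 1024) == 67584, "66 * 1024 registers");
static_assert(!FitsRegisterFile(66, 1024), "does not fit at 1024 threads");

// A 1024-thread bound leaves the compiler a budget of 64 registers per thread.
static_assert(RegBudgetPerThread(1024) == 64, "budget under the bound");
static_assert(FitsRegisterFile(64, 1024), "64 registers per thread fits");
```

Under these assumptions, either option from the commit message closes the gap: a smaller block raises the per-thread budget, while a launch bound of 1024 tells the compiler to stay within 64 registers per thread, possibly at the cost of spilling.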
Parent 76eb371e
......@@ -120,8 +120,9 @@ __global__ void SegmentMeanKernel(const Index* segment_ids, const T* input,
 }
 template <typename T, typename Index, typename Helper, typename Pool>
-__global__ void SegmentOpsKernel(const Index* segment_ids, const T* input,
-                                 T* output, Helper h, Pool pool) {
+__global__ void __launch_bounds__(1024, 1)
+    SegmentOpsKernel(const Index* segment_ids, const T* input, T* output,
+                     Helper h, Pool pool) {
   CUDA_KERNEL_LOOP(stripe_index, h.total_stripe_count) {
     Index segment_offset, dim_index_base, actual_height;
     Index inner_dim_size = h.inner_dim_size;
......