慢慢CG / Mace (forked from Xiaomi / Mace, in sync with the fork source)
Commit fa36bf43
Authored Oct 13, 2020 by luxuhui

perf: opt `reduce` op's performance on GPU

N/A

Signed-off-by: Luxuhui <luxuhui@xiaomi.com>

Parent: 62d8ba37

Showing 2 changed files with 14 additions and 15 deletions (+14 -15)

mace/ops/opencl/cl/reduce.cl     +0 -2
mace/ops/opencl/image/reduce.cc  +14 -13
mace/ops/opencl/cl/reduce.cl
...
@@ -73,9 +73,7 @@ __kernel void reduce(OUT_OF_RANGE_PARAMS
 #endif
   local_buffer[index] = part_result;

-#ifdef NON_QUALCOMM_ADRENO
   barrier(CLK_LOCAL_MEM_FENCE);
-#endif

   if (w == 0 && h == 0) {
 #if REDUCE_TYPE == 1
...
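The hunk above sits between the two phases of the kernel: every work-item stores its `part_result` in `local_buffer`, and then the work-item with `w == 0 && h == 0` combines the stored values. The following is a minimal sequential C++ model of that pattern, written only to show where the barrier belongs. `TwoStepSum` and its slicing scheme are hypothetical and are not taken from MACE; a sum stands in for whichever `REDUCE_TYPE` the kernel was built with.

```cpp
#include <cstddef>
#include <vector>

// Sequential CPU model of a two-step reduction (hypothetical helper, not MACE
// code). `group_num` plays the role of the number of work-items in one group.
float TwoStepSum(const std::vector<float> &input, std::size_t group_num) {
  if (group_num == 0) {
    group_num = 1;
  }
  // Assumed slicing: each work-item handles ceil(size / group_num) elements.
  const std::size_t compute_size =
      (input.size() + group_num - 1) / group_num;
  std::vector<float> local_buffer(group_num, 0.0f);

  // Step 1: runs in parallel on a GPU; modelled here as a plain loop.
  for (std::size_t index = 0; index < group_num; ++index) {
    float part_result = 0.0f;
    const std::size_t begin = index * compute_size;
    for (std::size_t i = begin;
         i < begin + compute_size && i < input.size(); ++i) {
      part_result += input[i];
    }
    local_buffer[index] = part_result;
  }

  // On a GPU, barrier(CLK_LOCAL_MEM_FENCE) goes here: step 2 must not read
  // local_buffer until every work-item in the group has finished its store.

  // Step 2: a single work-item combines all partial results.
  float result = 0.0f;
  for (float part : local_buffer) {
    result += part;
  }
  return result;
}
```

In the sequential model the ordering is automatic. On a device, a work-group that spans more than one wave has no such guarantee without the barrier, which is presumably why the commit stops guarding it with `NON_QUALCOMM_ADRENO` now that the host code no longer caps the group at one wave on Adreno.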
mace/ops/opencl/image/reduce.cc
...
@@ -14,6 +14,8 @@

 #include "mace/ops/opencl/image/reduce.h"

+#include <algorithm>
+
 namespace mace {
 namespace ops {
 namespace opencl {
...
@@ -58,24 +60,23 @@ MaceStatus ReduceKernel::Compute(
                                               kernel_name,
                                               built_options,
                                               &kernel_));
     kwg_size_ =
         static_cast<uint32_t>(runtime->GetKernelMaxWorkGroupSize(kernel_));
   }

-  if (runtime->gpu_type() == GPUType::QUALCOMM_ADRENO) {
-    const uint32_t wave_size =
-        static_cast<uint32_t>(runtime->GetKernelWaveSize(kernel_));
-    gws = {4, (wave_size / 4), static_cast<uint32_t>(batch * channel_blocks)};
-  } else {
-    // Ensure each kernel has at least 4 input elements.
-    gws = {4, image_size / 16, static_cast<uint32_t>(batch * channel_blocks)};
-    if (gws[1] == 0) {
-      gws[1] = 1;
-    } else if (gws[1] > 16) {
-      gws[1] = 16;
-    }
+  // In the reduce.cl file, the computation is divided into two steps.
+  // The first step computes `compute_size` times parallelly, and the second
+  // step computes `group_num` times. In order to speed up the computation, we
+  // make the computation times of these two steps as uniform as possible.
+  uint32_t local_wg_size = static_cast<uint32_t>(sqrt(in_height * in_width));
+  // Increase the times of the second step for it's not parallel
+  local_wg_size *= 2;
+  local_wg_size = std::min(local_wg_size, kwg_size_);
+  gws = {4, local_wg_size / 4, static_cast<uint32_t>(batch * channel_blocks)};
+  if (gws[1] == 0) {
+    gws[1] = 1;
   }
   lws = {gws[0], gws[1], 1};
   const int group_num = lws[0] * lws[1] * lws[2];
   // Each kernel intends to compute compute_size elements.
...
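To make the new sizing rule concrete, here is a standalone C++ sketch that reproduces the arithmetic from the added lines outside of MACE. `ReducePlan` and `PlanReduce` are hypothetical names, and `kwg_size` stands in for the value returned by `GetKernelMaxWorkGroupSize`. The diff is cut off before `compute_size` is defined, so the ceiling division used for it below is an assumption, not something this page confirms.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstdint>

// Hypothetical container for the values the host code derives.
struct ReducePlan {
  std::array<uint32_t, 3> gws;
  std::array<uint32_t, 3> lws;
  uint32_t group_num;     // work-items per group; iterations of step 2
  uint32_t compute_size;  // ASSUMED: elements per work-item in step 1
};

// Mirrors the arithmetic added in this commit (hypothetical free function).
ReducePlan PlanReduce(uint32_t in_height, uint32_t in_width, uint32_t batch,
                      uint32_t channel_blocks, uint32_t kwg_size) {
  const uint32_t image_size = in_height * in_width;
  // Start from sqrt(H * W) so both steps would take about the same time.
  uint32_t local_wg_size =
      static_cast<uint32_t>(std::sqrt(static_cast<double>(image_size)));
  // Doubling the group size gives step 2 more iterations, as in the commit.
  local_wg_size *= 2;
  // Never exceed the device limit for this kernel.
  local_wg_size = std::min(local_wg_size, kwg_size);

  ReducePlan plan{};
  plan.gws = {4u, std::max<uint32_t>(local_wg_size / 4, 1u),
              batch * channel_blocks};
  plan.lws = {plan.gws[0], plan.gws[1], 1u};
  plan.group_num = plan.lws[0] * plan.lws[1] * plan.lws[2];
  // Assumption: ceiling division, so that every input element is covered.
  plan.compute_size = (image_size + plan.group_num - 1) / plan.group_num;
  return plan;
}
```

For a 56 x 56 input with a device limit of 1024, this yields `gws = {4, 28, batch * channel_blocks}`, `group_num = 112`, and, under the assumed definition, `compute_size = 28`. With a limit of 64 the group is clamped to 64 work-items and each one covers 49 elements.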