BaiXuePrincess / Paddle
Forked from PaddlePaddle / Paddle (in sync with the fork source)
Commit 34605d26
Authored Feb 26, 2018 by dzhwinter
Committed Feb 26, 2018 by GitHub
accelerate the cuda concat op, avoid many times copy (#8585)
* "try enhance concat op"
* "enhance the concat operator"
Parent: 087d8e7f
Showing 1 changed file with 41 additions and 6 deletions
paddle/fluid/operators/concat_op.h (+41 -6)
@@ -14,6 +14,7 @@ limitations under the License. */
 
 #pragma once
+#include <utility>
 #include <vector>
 #include "paddle/fluid/framework/op_registry.h"
 #include "paddle/fluid/operators/strided_memcpy.h"
 
@@ -34,12 +35,46 @@ class ConcatKernel : public framework::OpKernel<T> {
     auto out_stride = framework::stride_numel(out->dims());
 
     size_t output_offset = 0;
-    for (auto* in : ins) {
-      auto in_stride = framework::stride_numel(in->dims());
-      StridedNumelCopyWithAxis<T>(ctx.device_context(), axis,
-                                  out->data<T>() + output_offset, out_stride,
-                                  in->data<T>(), in_stride, in_stride[axis]);
-      output_offset += in_stride[axis];
+
+    // If axis >=1, copy to out immediately need to call many times
+    // of cuda memcpy. Copy the input to cpu and do the stride copy,
+    // then copy to gpu output.
+
+    if (platform::is_gpu_place(place) && axis >= 1) {
+      platform::CPUPlace copy_place;
+      auto& cpu_ctx = *platform::DeviceContextPool::Instance().Get(copy_place);
+      framework::Tensor cpu_out;
+      cpu_out.Resize(out->dims());
+      cpu_out.mutable_data<T>(copy_place);
+      auto& dev_ctx = ctx.device_context();
+      std::vector<std::unique_ptr<framework::Tensor>> cpu_ins;
+      for (auto* in : ins) {
+        std::unique_ptr<framework::Tensor> cpu_in(new framework::Tensor);
+        framework::TensorCopy(*in, copy_place, dev_ctx, cpu_in.get());
+        cpu_ins.emplace_back(std::move(cpu_in));
+      }
+      // TODO(dzhwinter): overlap copy and compute stream
+      // https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/
+      dev_ctx.Wait();
+
+      for (auto& in : cpu_ins) {
+        auto& cpu_in = *in.get();
+        auto in_stride = framework::stride_numel(cpu_in.dims());
+
+        StridedNumelCopyWithAxis<T>(
+            cpu_ctx, axis, cpu_out.data<T>() + output_offset, out_stride,
+            cpu_in.data<T>(), in_stride, in_stride[axis]);
+        output_offset += in_stride[axis];
+      }
+      framework::TensorCopy(cpu_out, place, dev_ctx, out);
+    } else {
+      for (auto* in : ins) {
+        auto in_stride = framework::stride_numel(in->dims());
+        StridedNumelCopyWithAxis<T>(ctx.device_context(), axis,
+                                    out->data<T>() + output_offset, out_stride,
+                                    in->data<T>(), in_stride, in_stride[axis]);
+        output_offset += in_stride[axis];
+      }
     }
   }
 };
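The code comment in this change says that concatenating along `axis >= 1` directly on the device needs many separate CUDA memcpy calls. The standalone CPU-only sketch below illustrates why, by counting the contiguous copies a strided concat issues. `StrideNumel`, `StridedCopyWithAxis`, and `ConcatWithAxis` are hypothetical stand-ins written for this illustration; they assume row-major layout and only approximate the behavior of `framework::stride_numel` and `StridedNumelCopyWithAxis`, whose real implementations are not shown in this diff.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for framework::stride_numel (assumed semantics):
// entry i is the number of elements spanned by dimensions i..end.
std::vector<size_t> StrideNumel(const std::vector<size_t>& dims) {
  std::vector<size_t> stride(dims.size(), 1);
  if (dims.empty()) return stride;
  stride.back() = dims.back();
  for (size_t i = dims.size() - 1; i-- > 0;) {
    stride[i] = stride[i + 1] * dims[i];
  }
  return stride;
}

// Hypothetical stand-in for StridedNumelCopyWithAxis. Copies one input into
// its slot of the output and returns how many contiguous copies it issued.
// On a GPU, each of these would be a separate device memcpy.
template <typename T>
size_t StridedCopyWithAxis(size_t axis, T* dst,
                           const std::vector<size_t>& dst_stride, const T* src,
                           const std::vector<size_t>& src_stride) {
  const size_t before = dst_stride[0] / dst_stride[axis];  // outer slices
  const size_t src_after = src_stride[axis];  // contiguous run in the input
  const size_t dst_after = dst_stride[axis];  // distance between output runs
  for (size_t i = 0; i < before; ++i) {
    std::memcpy(dst + i * dst_after, src + i * src_after,
                sizeof(T) * src_after);
  }
  return before;
}

// Concatenates row-major tensors along `axis`, mirroring the loop in the
// diff. *num_copies receives the total number of contiguous copies issued.
template <typename T>
std::vector<T> ConcatWithAxis(const std::vector<std::vector<T>>& ins,
                              const std::vector<std::vector<size_t>>& in_dims,
                              size_t axis, size_t* num_copies) {
  assert(!ins.empty() && ins.size() == in_dims.size());
  assert(axis < in_dims[0].size());

  // The output shape matches the first input except along `axis`.
  std::vector<size_t> out_dims = in_dims[0];
  for (size_t k = 1; k < in_dims.size(); ++k) {
    out_dims[axis] += in_dims[k][axis];
  }
  const std::vector<size_t> out_stride = StrideNumel(out_dims);

  std::vector<T> out(out_stride[0]);
  size_t output_offset = 0;
  *num_copies = 0;
  for (size_t k = 0; k < ins.size(); ++k) {
    const std::vector<size_t> in_stride = StrideNumel(in_dims[k]);
    *num_copies += StridedCopyWithAxis(axis, out.data() + output_offset,
                                       out_stride, ins[k].data(), in_stride);
    output_offset += in_stride[axis];
  }
  return out;
}
```

In this sketch, concatenating a 2x2 and a 2x3 tensor along axis 1 issues two copies per input (four in total), while concatenating along axis 0 issues one copy per input. The copy count per input is the product of the dimensions before `axis`, which is why the change stages the inputs on the CPU for `axis >= 1`: each input then costs one device-to-host transfer, and the output costs one host-to-device transfer.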