Crayon鑫 / Paddle — commit f52c4f8b (forked from PaddlePaddle / Paddle)
Commit f52c4f8b
Authored on Sep 21, 2020 by yaoxuefeng6
Commit message: fix conflict
Parents: cb602fce, 37f7414f
Showing 125 changed files with 4,740 additions and 912 deletions (+4740 / -912):
cmake/cuda.cmake (+3, -0)
paddle/fluid/framework/data_feed.cc (+58, -12)
paddle/fluid/framework/details/CMakeLists.txt (+1, -0)
paddle/fluid/framework/details/all_reduce_op_handle.cc (+63, -19)
paddle/fluid/framework/details/async_ssa_graph_executor.cc (+13, -2)
paddle/fluid/framework/details/build_strategy.h (+4, -0)
paddle/fluid/framework/details/fast_threaded_ssa_graph_executor.cc (+5, -1)
paddle/fluid/framework/details/fetch_op_handle.cc (+6, -2)
paddle/fluid/framework/details/op_handle_base.cc (+7, -0)
paddle/fluid/framework/details/op_handle_base.h (+6, -0)
paddle/fluid/framework/details/parallel_ssa_graph_executor.cc (+8, -1)
paddle/fluid/framework/details/scope_buffered_ssa_graph_executor.cc (+9, -1)
paddle/fluid/framework/details/share_tensor_buffer_functor.cc (+57, -13)
paddle/fluid/framework/details/share_tensor_buffer_functor.h (+9, -1)
paddle/fluid/framework/details/share_tensor_buffer_op_handle.cc (+20, -5)
paddle/fluid/framework/details/share_tensor_buffer_op_handle.h (+4, -1)
paddle/fluid/framework/details/ssa_graph_executor.cc (+4, -2)
paddle/fluid/framework/details/threaded_ssa_graph_executor.cc (+15, -6)
paddle/fluid/framework/details/threaded_ssa_graph_executor.h (+2, -2)
paddle/fluid/framework/details/var_handle.h (+8, -3)
paddle/fluid/framework/details/variable_visitor.cc (+44, -27)
paddle/fluid/framework/fleet/gloo_wrapper.cc (+22, -3)
paddle/fluid/framework/ir/conv_affine_channel_fuse_pass.cc (+12, -0)
paddle/fluid/framework/ir/conv_bn_fuse_pass.cc (+12, -0)
paddle/fluid/framework/ir/conv_elementwise_add2_act_fuse_pass.cc (+8, -1)
paddle/fluid/framework/ir/conv_elementwise_add_act_fuse_pass.cc (+8, -0)
paddle/fluid/framework/ir/conv_elementwise_add_fuse_pass.cc (+7, -2)
paddle/fluid/framework/ir/embedding_fc_lstm_fuse_pass.cc (+11, -1)
paddle/fluid/framework/ir/fc_fuse_pass.cc (+8, -0)
paddle/fluid/framework/ir/fc_gru_fuse_pass.cc (+21, -1)
paddle/fluid/framework/ir/fc_lstm_fuse_pass.cc (+15, -0)
paddle/fluid/framework/ir/memory_optimize_pass/CMakeLists.txt (+2, -0)
paddle/fluid/framework/ir/memory_optimize_pass/buffer_shared_inplace_op_pass.cc (+4, -2)
paddle/fluid/framework/ir/memory_optimize_pass/inplace_addto_op_pass.cc (+221, -0)
paddle/fluid/framework/ir/memory_optimize_pass/memory_reuse_pass.cc (+8, -3)
paddle/fluid/framework/ir/memory_optimize_pass/memory_reuse_pass.h (+7, -7)
paddle/fluid/framework/ir/repeated_fc_relu_fuse_pass.cc (+10, -0)
paddle/fluid/framework/ir/shuffle_channel_detect_pass.cc (+8, -0)
paddle/fluid/framework/ir/squared_mat_sub_fuse_pass.cc (+24, -6)
paddle/fluid/framework/ir/squared_mat_sub_fuse_pass.h (+1, -1)
paddle/fluid/framework/operator.h (+8, -0)
paddle/fluid/framework/parallel_executor.cc (+19, -0)
paddle/fluid/inference/api/paddle_pass_builder.cc (+2, -1)
paddle/fluid/inference/tensorrt/convert/emb_eltwise_layernorm.cc (+3, -3)
paddle/fluid/inference/tensorrt/plugin/emb_eltwise_layernorm_plugin.cu (+133, -81)
paddle/fluid/inference/tensorrt/plugin/emb_eltwise_layernorm_plugin.h (+144, -34)
paddle/fluid/inference/tests/api/trt_dynamic_shape_ernie_deserialize_test.cc (+4, -6)
paddle/fluid/operators/conv_cudnn_op.cu (+22, -5)
paddle/fluid/operators/conv_op.cc (+10, -0)
paddle/fluid/operators/cudnn_lstm_cache.h (+10, -0)
paddle/fluid/operators/elementwise/elementwise_add_op.cc (+18, -0)
paddle/fluid/operators/elementwise/elementwise_add_op.cu (+7, -0)
paddle/fluid/operators/fake_quantize_op.cc (+135, -0)
paddle/fluid/operators/fake_quantize_op.cu (+87, -2)
paddle/fluid/operators/fake_quantize_op.h (+31, -0)
paddle/fluid/operators/fused/fusion_gru_op.cc (+1, -0)
paddle/fluid/operators/optimizers/rmsprop_op.cc (+2, -1)
paddle/fluid/operators/optimizers/rmsprop_op.cu (+2, -1)
paddle/fluid/operators/top_k_v2_op.cc (+12, -3)
paddle/fluid/platform/cudnn_helper.h (+2, -0)
paddle/fluid/platform/dynload/cudnn.cc (+4, -0)
paddle/fluid/platform/dynload/cudnn.h (+13, -8)
paddle/fluid/platform/flags.cc (+15, -0)
paddle/fluid/pybind/global_value_getter_setter.cc (+2, -1)
paddle/fluid/pybind/op_function_generator.cc (+1, -0)
paddle/fluid/pybind/pybind.cc (+6, -0)
paddle/scripts/paddle_build.sh (+44, -1)
python/paddle/distributed/fleet/__init__.py (+1, -0)
python/paddle/distributed/fleet/base/fleet_base.py (+17, -27)
python/paddle/distributed/fleet/base/role_maker.py (+441, -253)
python/paddle/distributed/fleet/base/util_factory.py (+8, -37)
python/paddle/distributed/fleet/launch.py (+25, -1)
python/paddle/distributed/fleet/launch_utils.py (+8, -2)
python/paddle/distributed/fleet/meta_optimizers/common.py (+3, -3)
python/paddle/distributed/fleet/meta_optimizers/dgc_optimizer.py (+2, -2)
python/paddle/distributed/fleet/meta_optimizers/graph_execution_optimizer.py (+9, -9)
python/paddle/distributed/fleet/meta_optimizers/localsgd_optimizer.py (+5, -5)
python/paddle/distributed/fleet/meta_optimizers/parameter_server_graph_optimizer.py (+1, -1)
python/paddle/distributed/fleet/meta_optimizers/parameter_server_optimizer.py (+2, -2)
python/paddle/distributed/fleet/meta_optimizers/pipeline_optimizer.py (+4, -4)
python/paddle/distributed/fleet/runtime/parameter_server_runtime.py (+11, -10)
python/paddle/fluid/__init__.py (+1, -0)
python/paddle/fluid/backward.py (+77, -17)
python/paddle/fluid/contrib/slim/quantization/imperative/qat.py (+8, -3)
python/paddle/fluid/contrib/slim/quantization/imperative/quant_nn.py (+105, -7)
python/paddle/fluid/contrib/slim/tests/test_imperative_qat.py (+0, -1)
python/paddle/fluid/contrib/slim/tests/test_imperative_qat_channelwise.py (+428, -0)
python/paddle/fluid/incubate/fleet/parameter_server/ir/public.py (+24, -6)
python/paddle/fluid/layers/tensor.py (+2, -0)
python/paddle/fluid/tests/unittests/ir/inference/test_conv_affine_channel_fuse_pass.py (+228, -0)
python/paddle/fluid/tests/unittests/ir/inference/test_conv_bn_fuse_pass.py (+177, -0)
python/paddle/fluid/tests/unittests/ir/inference/test_conv_elementwise_add2_act_fuse_pass.py (+4, -0)
python/paddle/fluid/tests/unittests/ir/inference/test_conv_elementwise_add_act_fuse_pass.py (+4, -0)
python/paddle/fluid/tests/unittests/ir/inference/test_conv_elementwise_add_fuse_pass.py (+3, -0)
python/paddle/fluid/tests/unittests/ir/inference/test_fc_fuse_pass.py (+54, -0)
python/paddle/fluid/tests/unittests/ir/inference/test_fc_gru_fuse_pass.py (+86, -0)
python/paddle/fluid/tests/unittests/ir/inference/test_fc_lstm_fuse_pass.py (+52, -0)
python/paddle/fluid/tests/unittests/ir/inference/test_repeated_fc_relu_fuse_pass.py (+94, -0)
python/paddle/fluid/tests/unittests/ir/inference/test_squared_mat_sub_fuse_pass.py (+63, -0)
python/paddle/fluid/tests/unittests/ir/inference/test_transpose_flatten_concat_fuse_pass.py (+3, -1)
python/paddle/fluid/tests/unittests/ir/inference/test_trt_shuffle_channel_detect_pass.py (+51, -0)
python/paddle/fluid/tests/unittests/test_fake_quantize_op.py (+65, -0)
python/paddle/fluid/tests/unittests/test_fleet_base.py (+41, -25)
python/paddle/fluid/tests/unittests/test_fleet_rolemaker_2.py (+1, -1)
python/paddle/fluid/tests/unittests/test_fleet_rolemaker_new.py (+636, -43)
python/paddle/fluid/tests/unittests/test_fleet_util.py (+3, -94)
python/paddle/fluid/tests/unittests/test_inplace_addto_strategy.py (+114, -0)
python/paddle/fluid/tests/unittests/test_top_k_v2_op.py (+15, -4)
python/paddle/fluid/tests/unittests/test_transformer_api.py (+135, -0)
python/paddle/nn/layer/transformer.py (+82, -20)
python/paddle/optimizer/adam.py (+5, -6)
python/paddle/reader/decorator.py (+1, -1)
python/paddle/tests/test_dataset_cifar.py (+16, -8)
python/paddle/tests/test_datasets.py (+4, -2)
python/paddle/text/datasets/uci_housing.py (+5, -1)
python/paddle/utils/__init__.py (+1, -0)
python/paddle/utils/lazy_import.py (+34, -0)
python/paddle/vision/datasets/cifar.py (+1, -0)
python/paddle/vision/datasets/folder.py (+2, -1)
python/paddle/vision/datasets/mnist.py (+1, -10)
python/paddle/vision/transforms/functional.py (+35, -16)
python/paddle/vision/transforms/transforms.py (+49, -12)
python/requirements.txt (+0, -1)
python/setup.py.in (+0, -3)
tools/check_api_approvals.sh (+1, -1)
cmake/cuda.cmake
@@ -107,6 +107,9 @@ function(select_nvcc_arch_flags out_variable)
   elseif(${CUDA_ARCH_NAME} STREQUAL "Maxwell")
     set(cuda_arch_bin "50")
   elseif(${CUDA_ARCH_NAME} STREQUAL "Pascal")
+    if (NOT ${CMAKE_CUDA_COMPILER_VERSION} LESS 10.0)
+      add_definitions("-DSUPPORTS_CUDA_FP16")
+    endif()
     set(cuda_arch_bin "60 61")
   elseif(${CUDA_ARCH_NAME} STREQUAL "Volta")
     if (NOT ${CMAKE_CUDA_COMPILER_VERSION} LESS 10.0)
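For context, a compile definition such as -DSUPPORTS_CUDA_FP16 is usually consumed on the C++ side through a preprocessor guard. The sketch below is illustrative only and not part of this commit; RunKernel and the messages are made up.

#include <cstdio>

// Minimal sketch: pick an FP16 code path only when the build defined
// SUPPORTS_CUDA_FP16 (as the CMake change above does for Pascal+ with CUDA 10+).
void RunKernel() {
#ifdef SUPPORTS_CUDA_FP16
  std::printf("FP16 path enabled\n");   // would dispatch to a half-precision kernel
#else
  std::printf("FP32 fallback\n");       // older toolchains fall back to float
#endif
}

int main() { RunKernel(); }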
paddle/fluid/framework/data_feed.cc
@@ -527,6 +527,8 @@ bool MultiSlotDataFeed::CheckFile(const char* filename) {
       VLOG(0) << "error: the number of ids is a negative number: " << num;
       VLOG(0) << "please check line<" << instance_cout << "> in file<"
               << filename << ">";
+      VLOG(0) << "Error occured when parsing " << i
+              << " th slot with total slots number: " << all_slots_.size();
       return false;
     } else if (num == 0) {
       VLOG(0)
@@ -536,42 +538,66 @@ bool MultiSlotDataFeed::CheckFile(const char* filename) {
           "characters.";
       VLOG(0) << "please check line<" << instance_cout << "> in file<"
               << filename << ">";
+      VLOG(0) << "Error occured when parsing " << i
+              << " th slot with total slots number: " << all_slots_.size();
       return false;
     } else if (errno == ERANGE || num > INT_MAX) {
       VLOG(0) << "error: the number of ids greater than INT_MAX";
       VLOG(0) << "please check line<" << instance_cout << "> in file<"
               << filename << ">";
+      VLOG(0) << "Error occured when parsing " << i
+              << " th slot with total slots number: " << all_slots_.size();
       return false;
     }
     if (all_slots_type_[i] == "float") {
-      for (int i = 0; i < num; ++i) {
+      for (int j = 0; j < num; ++j) {
         strtof(endptr, &endptr);
         if (errno == ERANGE) {
           VLOG(0) << "error: the value is out of the range of "
                      "representable values for float";
           VLOG(0) << "please check line<" << instance_cout << "> in file<"
                   << filename << ">";
+          VLOG(0) << "Error occured when parsing " << i
+                  << " th slot with total slots number: " << all_slots_.size();
+          VLOG(0) << "and in this slot: " << j
+                  << " th id with total id number: " << num;
           return false;
         }
-        if (i + 1 != num && endptr - str == len) {
+        if (j + 1 != num && endptr - str == len) {
           VLOG(0) << "error: there is a wrong with the number of ids.";
+          VLOG(0) << "Error occured when parsing " << i
+                  << " th slot with total slots number: " << all_slots_.size();
+          VLOG(0) << "and in this slot: " << j
+                  << " th id with total id number: " << num;
           VLOG(0) << "please check line<" << instance_cout << "> in file<"
                   << filename << ">";
           return false;
         }
       }
     } else if (all_slots_type_[i] == "uint64") {
-      for (int i = 0; i < num; ++i) {
+      for (int j = 0; j < num; ++j) {
         strtoull(endptr, &endptr, 10);
         if (errno == ERANGE) {
           VLOG(0) << "error: the value is out of the range of "
                      "representable values for uint64_t";
+          VLOG(0) << "Error occured when parsing " << i
+                  << " th slot with total slots number: " << all_slots_.size();
+          VLOG(0) << "and in this slot: " << j
+                  << " th id with total id number: " << num;
          VLOG(0) << "please check line<" << instance_cout << "> in file<"
                  << filename << ">";
          return false;
        }
-        if (i + 1 != num && endptr - str == len) {
+        if (j + 1 != num && endptr - str == len) {
           VLOG(0) << "error: there is a wrong with the number of ids.";
+          VLOG(0) << "Error occured when parsing " << i
+                  << " th slot with total slots number: " << all_slots_.size();
+          VLOG(0) << "and in this slot: " << j
+                  << " th id with total id number: " << num;
           VLOG(0) << "please check line<" << instance_cout << "> in file<"
                   << filename << ">";
           return false;
@@ -632,8 +658,13 @@ bool MultiSlotDataFeed::ParseOneInstanceFromPipe(
             "The number of ids can not be zero, you need padding "
             "it in data generator; or if there is something wrong with "
             "the data, please check if the data contains unresolvable "
-            "characters.\nplease check this error line: %s",
-            str));
+            "characters.\nplease check this error line: %s,\n Specifically, "
+            "something wrong happened(the length of this slot's feasign is 0)"
+            "when we parse the %d th slots."
+            "Maybe something wrong around this slot",
+            "\n We detect the feasign number of this slot is %d, "
+            "which is illegal.",
+            str, i, num));
     if (idx != -1) {
       (*instance)[idx].Init(all_slots_type_[i]);
       if ((*instance)[idx].GetType()[0] == 'f') {  // float
@@ -683,8 +714,13 @@ bool MultiSlotDataFeed::ParseOneInstance(std::vector<MultiSlotType>* instance) {
             "The number of ids can not be zero, you need padding "
             "it in data generator; or if there is something wrong with "
             "the data, please check if the data contains unresolvable "
-            "characters.\nplease check this error line: %s.",
-            str));
+            "characters.\nplease check this error line: %s,\n Specifically, "
+            "something wrong happened(the length of this slot's feasign is 0)"
+            "when we parse the %d th slots."
+            "Maybe something wrong around this slot",
+            "\n We detect the feasign number of this slot is %d, "
+            "which is illegal.",
+            str, i, num));
     if (idx != -1) {
       (*instance)[idx].Init(all_slots_type_[i]);
@@ -916,8 +952,13 @@ bool MultiSlotInMemoryDataFeed::ParseOneInstanceFromPipe(Record* instance) {
             "The number of ids can not be zero, you need padding "
             "it in data generator; or if there is something wrong with "
             "the data, please check if the data contains unresolvable "
-            "characters.\nplease check this error line: %s.",
-            str));
+            "characters.\nplease check this error line: %s,\n Specifically, "
+            "something wrong happened(the length of this slot's feasign is 0)"
+            "when we parse the %d th slots."
+            "Maybe something wrong around this slot",
+            "\n We detect the feasign number of this slot is %d, "
+            "which is illegal.",
+            str, i, num));
     if (idx != -1) {
       if (all_slots_type_[i][0] == 'f') {  // float
         for (int j = 0; j < num; ++j) {
@@ -982,8 +1023,13 @@ bool MultiSlotInMemoryDataFeed::ParseOneInstance(Record* instance) {
             "The number of ids can not be zero, you need padding "
             "it in data generator; or if there is something wrong with "
             "the data, please check if the data contains unresolvable "
-            "characters.\nplease check this error line: %s.",
-            str));
+            "characters.\nplease check this error line: %s,\n Specifically, "
+            "something wrong happened(the length of this slot's feasign is 0)"
+            "when we parse the %d th slots."
+            "Maybe something wrong around this slot",
+            "\n We detect the feasign number of this slot is %d, "
+            "which is illegal.",
+            str, i, num));
     if (idx != -1) {
       if (all_slots_type_[i][0] == 'f') {  // float
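The CheckFile changes above rely on the errno == ERANGE convention of strtof/strtoull to detect out-of-range tokens. A minimal standalone sketch of that range check follows; it is not Paddle code, and ParseUint64 is a hypothetical helper used only for illustration.

#include <cerrno>
#include <cstdio>
#include <cstdlib>

// Returns false when the token is out of uint64_t range or contains no digits,
// mirroring the errno == ERANGE checks in MultiSlotDataFeed::CheckFile.
bool ParseUint64(const char* str, unsigned long long* out) {
  errno = 0;
  char* end = nullptr;
  unsigned long long v = std::strtoull(str, &end, 10);
  if (errno == ERANGE || end == str) {
    return false;  // overflow, or nothing was consumed
  }
  *out = v;
  return true;
}

int main() {
  unsigned long long v = 0;
  std::printf("%d\n", ParseUint64("18446744073709551616", &v));  // 0: out of range (2^64)
  std::printf("%d\n", ParseUint64("42", &v));                    // 1, v == 42
}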
paddle/fluid/framework/details/CMakeLists.txt
@@ -74,6 +74,7 @@ set(SSA_GRAPH_EXECUTOR_DEPS graph framework_proto
         eager_deletion_pass
         buffer_shared_inplace_op_pass
         buffer_shared_cross_op_memory_reuse_pass
+        inplace_addto_op_pass
         set_reader_device_info_utils
         add_reader_dependency_pass)
 cc_library(ssa_graph_executor SRCS ssa_graph_executor.cc DEPS ${SSA_GRAPH_EXECUTOR_DEPS})
paddle/fluid/framework/details/all_reduce_op_handle.cc
@@ -12,7 +12,9 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.
 #include "paddle/fluid/framework/details/all_reduce_op_handle.h"
+#include <algorithm>
 #include "paddle/fluid/framework/details/container_cast.h"
 #include "paddle/fluid/framework/details/reduce_and_gather.h"
 #include "paddle/fluid/framework/details/variable_visitor.h"
@@ -34,14 +36,24 @@ AllReduceOpHandle::AllReduceOpHandle(ir::Node *node,
                                      const std::vector<platform::Place> &places,
                                      const platform::NCCLCommunicator *ctxs)
     : NCCLOpHandleBase(node, places, ctxs), local_scopes_(local_scopes) {
-  PADDLE_ENFORCE_EQ(places_.size(), local_scopes_.size());
+  PADDLE_ENFORCE_EQ(places_.size(), local_scopes_.size(),
+                    platform::errors::InvalidArgument(
+                        "The number of places and the number of local scopes "
+                        "should be equal, but got number of places is %d and "
+                        "number of local scopes is %d.",
+                        places_.size(), local_scopes_.size()));
 }
 #else
 AllReduceOpHandle::AllReduceOpHandle(ir::Node *node,
                                      const std::vector<Scope *> &local_scopes,
                                      const std::vector<platform::Place> &places)
     : OpHandleBase(node), local_scopes_(local_scopes), places_(places) {
-  PADDLE_ENFORCE_EQ(places_.size(), local_scopes_.size());
+  PADDLE_ENFORCE_EQ(places_.size(), local_scopes_.size(),
+                    platform::errors::InvalidArgument(
+                        "The number of places and the number of local scopes "
+                        "should be equal, but got number of places is %d and "
+                        "number of local scopes is %d.",
+                        places_.size(), local_scopes_.size()));
 }
 #endif
@@ -60,13 +72,25 @@ void AllReduceOpHandle::AllReduceImpl(
     const std::vector<VarHandle *> &in_var_handles,
     const std::vector<VarHandle *> &out_var_handles) {
   size_t num_places = places_.size();
-  PADDLE_ENFORCE_EQ(
-      in_var_handles.size(), num_places,
-      "The NoDummyInputSize should be equal to the number of places.");
+  PADDLE_ENFORCE_EQ(
+      in_var_handles.size(), num_places,
+      platform::errors::InvalidArgument(
+          "The NoDummyInputSize should be equal "
+          "to the number of places, but got NoDummyInputSize is "
+          "%d and the number of place is %d.",
+          in_var_handles.size(), num_places));
   PADDLE_ENFORCE_EQ(
       in_var_handles.size(), out_var_handles.size(),
-      "The NoDummyInputSize and NoDummyOutputSize should be equal.");
-  PADDLE_ENFORCE_EQ(local_exec_scopes_.size(), num_places);
+      platform::errors::InvalidArgument(
+          "The NoDummyInputSize and NoDummyOutputSize should be "
+          "equal, but got NoDummyInputSize is %d and NoDummyOutputSize is %d.",
+          in_var_handles.size(), out_var_handles.size()));
+  PADDLE_ENFORCE_EQ(
+      local_exec_scopes_.size(), num_places,
+      platform::errors::InvalidArgument(
+          "The number of local scopes should be equal "
+          "to the number of places, but got the number of local scopes is "
+          "%d and the number of place is %d.",
+          in_var_handles.size(), num_places));
   std::vector<const void *> lod_tensor_data;
   std::vector<platform::Place> places;
@@ -78,23 +102,36 @@ void AllReduceOpHandle::AllReduceImpl(
   for (size_t i = 0; i < local_exec_scopes_.size(); ++i) {
     auto &local_scope = local_exec_scopes_[i];
     auto var = local_scope->FindVar(in_var_handles[i]->name());
-    PADDLE_ENFORCE_NOT_NULL(var, "%s is not found int scope.",
-                            in_var_handles[i]->name());
+    PADDLE_ENFORCE_NOT_NULL(
+        var, platform::errors::NotFound(
+                 "Variable %s is not found in local scope.",
+                 in_var_handles[i]->name()));
     auto &lod_tensor = var->Get<LoDTensor>();

     if (i == 0) {
       numel = static_cast<int64_t>(lod_tensor.numel());
       // only enforce place0, we will enforce other palce numel == place0 numel
       PADDLE_ENFORCE_GT(
-          numel, 0, platform::errors::InvalidArgument(
-                        "The numel of tensos=[%s] must > 0. But now numel=[%d]",
-                        in_var_handles[i]->name(), numel));
+          numel, 0,
+          platform::errors::PreconditionNotMet(
+              "The numel of tensor %s should be > 0, but got numel is %d.",
+              in_var_handles[i]->name(), numel));
       dtype = lod_tensor.type();
       is_gpu_place = platform::is_gpu_place(lod_tensor.place());
     }
-    PADDLE_ENFORCE_EQ(numel, static_cast<int64_t>(lod_tensor.numel()));
-    PADDLE_ENFORCE_EQ(dtype, lod_tensor.type());
-    PADDLE_ENFORCE_EQ(is_gpu_place, platform::is_gpu_place(lod_tensor.place()));
+    PADDLE_ENFORCE_EQ(
+        numel, static_cast<int64_t>(lod_tensor.numel()),
+        platform::errors::PreconditionNotMet(
+            "The size of tensors of the same variable in different local "
+            "scopes should be equal."));
+    PADDLE_ENFORCE_EQ(
+        dtype, lod_tensor.type(),
+        platform::errors::PreconditionNotMet(
+            "The dtype of tensors of the same variable in different local "
+            "scopes should be equal."));
+    PADDLE_ENFORCE_EQ(
+        is_gpu_place, platform::is_gpu_place(lod_tensor.place()),
+        platform::errors::PreconditionNotMet(
+            "The place type of tensors of the same variable "
+            "in different local scopes should be equal."));
     lod_tensor_data.emplace_back(lod_tensor.data<void>());
     places.emplace_back(lod_tensor.place());
@@ -102,8 +139,12 @@ void AllReduceOpHandle::AllReduceImpl(
     VLOG(10) << "place:" << i << ", input_name:" << in_var_handles[i]->name()
              << ", out_name:" << out_var_handles[i]->name();
-    PADDLE_ENFORCE_EQ(in_var_handles[i]->name(), out_var_handles[i]->name(),
-                      "The name of input and output should be equal.");
+    PADDLE_ENFORCE_EQ(
+        in_var_handles[i]->name(), out_var_handles[i]->name(),
+        platform::errors::InvalidArgument(
+            "The name of input and output of all_reduce op should be equal, "
+            "but got input is %s and output is %s.",
+            in_var_handles[i]->name(), out_var_handles[i]->name()));
   }

   std::vector<std::string> grad_var_names;
@@ -122,7 +163,9 @@ void AllReduceOpHandle::AllReduceFunc(
     const std::vector<std::string> &out_var_names) {
   if (is_gpu_place(places[0])) {
 #if defined(PADDLE_WITH_NCCL)
-    PADDLE_ENFORCE_NOT_NULL(nccl_ctxs_, "nccl_ctxs should not be nullptr.");
+    PADDLE_ENFORCE_NOT_NULL(nccl_ctxs_,
+                            platform::errors::InvalidArgument(
+                                "The nccl context should not be NULL."));
     ncclDataType_t nccl_dtype = platform::ToNCCLDataType(dtype);
     std::vector<std::function<void()>> all_reduce_calls;
     for (size_t i = 0; i < local_exec_scopes_.size(); ++i) {
@@ -134,7 +177,8 @@ void AllReduceOpHandle::AllReduceFunc(
     }
     NCCLAllReduceFunc(all_reduce_calls);
 #else
-    PADDLE_THROW("Not compiled with CUDA.");
+    PADDLE_THROW(
+        platform::errors::PreconditionNotMet("Not compiled with CUDA."));
 #endif
   } else {  // Special handle CPU only Operator's gradient. Like CRF
     auto &trg = *local_exec_scopes_[0]
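The pattern applied throughout this file (and most of the commit) is to replace bare PADDLE_ENFORCE_* checks with ones that carry a formatted platform::errors::* message. The self-contained sketch below only illustrates why a formatted message helps; MY_ENFORCE_EQ and FormatMessage are hypothetical stand-ins, not Paddle's actual enforce.h implementation.

#include <cstdio>
#include <stdexcept>
#include <string>

// Format a printf-style message into a std::string (illustrative helper).
template <typename... Args>
std::string FormatMessage(const char* fmt, Args... args) {
  char buf[512];
  std::snprintf(buf, sizeof(buf), fmt, args...);
  return std::string(buf);
}

// A toy equality check that throws with the supplied message on mismatch.
#define MY_ENFORCE_EQ(a, b, msg)                      \
  do {                                                \
    if ((a) != (b)) throw std::runtime_error(msg);    \
  } while (0)

int main() {
  size_t places = 2, scopes = 3;
  try {
    MY_ENFORCE_EQ(places, scopes,
                  FormatMessage("The number of places and the number of local "
                                "scopes should be equal, but got %zu and %zu.",
                                places, scopes));
  } catch (const std::runtime_error& e) {
    std::printf("%s\n", e.what());  // the message now says what went wrong and with which values
  }
}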
paddle/fluid/framework/details/async_ssa_graph_executor.cc
@@ -89,8 +89,19 @@ AsyncSSAGraphExecutor::AsyncSSAGraphExecutor(
       places_(std::move(places)),
       graphs_(std::move(graphs)) {
   VLOG(3) << "build AsyncSSAGraphExecutor";
-  PADDLE_ENFORCE_EQ(places_.size(), local_scopes_.size());
-  PADDLE_ENFORCE_EQ(local_scopes_.size(), local_exec_scopes_.size());
+  PADDLE_ENFORCE_EQ(places_.size(), local_scopes_.size(),
+                    platform::errors::InvalidArgument(
+                        "The number of places and the number of local scopes "
+                        "should be equal, but got number of places is %d and "
+                        "number of local scopes is %d.",
+                        places_.size(), local_scopes_.size()));
+  PADDLE_ENFORCE_EQ(
+      local_scopes_.size(), local_exec_scopes_.size(),
+      platform::errors::InvalidArgument(
+          "The number of local scopes and the number of local execution scopes "
+          "should be equal, but got number of local scopes is %d and "
+          "number of local execution scopes is %d.",
+          local_scopes_.size(), local_exec_scopes_.size()));

   // set the correct size of thread pool to each device.
   strategy_.num_threads_ = strategy_.num_threads_ < places_.size()
paddle/fluid/framework/details/build_strategy.h
@@ -19,6 +19,7 @@
 #include <unordered_set>
 #include <utility>
 #include <vector>
+#include "boost/optional.hpp"
 #include "paddle/fluid/framework/ir/pass_builder.h"
 #include "paddle/fluid/framework/program_desc.h"
@@ -119,6 +120,9 @@ struct BuildStrategy {
   // Turn on inplace by default.
   bool enable_inplace_{true};
+  // Turn off inplace addto by default.
+  bool enable_addto_{false};
   // FIXME(zcd): is_distribution_ is a temporary field, because in pserver mode,
   // num_trainers is 1, so the current fields of build_strategy doesn't tell if
   // it's distributed model.
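A minimal sketch (mock types, not Paddle's BuildStrategy) of the two flags shown above: ordinary in-place reuse stays on by default, while the new addto-style reuse is an opt-in that defaults to off.

#include <cstdio>

// Hypothetical mirror of the fields added to BuildStrategy in this commit.
struct BuildStrategyLike {
  bool enable_inplace_{true};   // on by default, as in the diff
  bool enable_addto_{false};    // off by default, as in the diff
};

int main() {
  BuildStrategyLike strategy;
  strategy.enable_addto_ = true;  // opt in before the executor graph is built
  std::printf("inplace=%d addto=%d\n", strategy.enable_inplace_, strategy.enable_addto_);
}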
paddle/fluid/framework/details/fast_threaded_ssa_graph_executor.cc
@@ -12,12 +12,14 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.
 #include "paddle/fluid/framework/details/fast_threaded_ssa_graph_executor.h"
 #include <deque>
 #include <memory>
 #include <string>
 #include <unordered_map>
 #include <unordered_set>
 #include <vector>
 #include "paddle/fluid/framework/details/computation_op_handle.h"
 #include "paddle/fluid/framework/details/fetch_async_op_handle.h"
 #include "paddle/fluid/framework/details/multi_devices_helper.h"
@@ -48,7 +50,9 @@ FastThreadedSSAGraphExecutor::FastThreadedSSAGraphExecutor(
       bootstrap_ops_.emplace_back(op);
     }
   }
-  PADDLE_ENFORCE_GT(op_deps_.size(), 0, "The graph doesn't have operators.");
+  PADDLE_ENFORCE_GT(op_deps_.size(), 0,
+                    platform::errors::PreconditionNotMet(
+                        "The graph doesn't have operators."));
   PrepareAtomicOpDeps();
 }
paddle/fluid/framework/details/fetch_op_handle.cc
@@ -13,9 +13,11 @@
 // limitations under the License.
 #include "paddle/fluid/framework/details/fetch_op_handle.h"
 #include <string>
 #include <utility>
 #include <vector>
 #include "paddle/fluid/platform/profiler.h"
 namespace paddle {
@@ -138,8 +140,10 @@ void FetchOpHandle::RunImpl() {
     auto *var_handle = static_cast<VarHandle *>(inputs_[i]);
     auto &scope = scopes.at(var_handle->scope_idx());
     auto *var = scope->FindVar(var_handle->name());
-    PADDLE_ENFORCE_NOT_NULL(var, "Cannot find variable %s in execution scope",
-                            var_handle->name());
+    PADDLE_ENFORCE_NOT_NULL(
+        var,
+        platform::errors::NotFound("Cannot find variable %s in execution scope.",
+                                   var_handle->name()));

     if (var->IsType<LoDTensor>()) {
       auto &t = var->Get<framework::LoDTensor>();
paddle/fluid/framework/details/op_handle_base.cc
@@ -12,6 +12,7 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.
 #include "paddle/fluid/framework/details/op_handle_base.h"
+#include <map>
 #include <unordered_set>
@@ -88,6 +89,12 @@ void OpHandleBase::Run(bool use_cuda) {
   PADDLE_ENFORCE(!use_cuda);
 #endif

+  // skip running current op, used with inplace_addto_op_pass
+  if (skip_running_) {
+    VLOG(4) << "skip running: " << Name();
+    return;
+  }
+
   RunImpl();
 }
paddle/fluid/framework/details/op_handle_base.h
@@ -18,6 +18,7 @@
 #include <unordered_map>
 #include <unordered_set>
 #include <vector>
 #include "paddle/fluid/framework/details/var_handle.h"
 #include "paddle/fluid/framework/ir/node.h"
 #include "paddle/fluid/platform/device_context.h"
@@ -52,6 +53,10 @@ class OpHandleBase {
   virtual Priority GetPriority() const { return kNormal; }

+  virtual bool GetSkipRunning() const { return skip_running_; }
+
+  virtual void SetSkipRunning(bool skip_runing) { skip_running_ = skip_runing; }
+
   virtual std::string Name() const = 0;

   void Run(bool use_cuda);
@@ -131,6 +136,7 @@ class OpHandleBase {
   std::map<platform::Place, platform::DeviceContext *> dev_ctxes_;
   std::vector<Scope *> local_exec_scopes_;
+  bool skip_running_ = false;

 #ifdef PADDLE_WITH_CUDA
   std::unordered_map<int, cudaEvent_t> events_;
paddle/fluid/framework/details/parallel_ssa_graph_executor.cc
@@ -13,9 +13,11 @@
 // limitations under the License.
 #include "paddle/fluid/framework/details/parallel_ssa_graph_executor.h"
 #include <algorithm>
 #include <memory>
 #include <utility>
 #include "paddle/fluid/framework/ir/graph_helper.h"
 namespace paddle {
@@ -104,7 +106,12 @@ ParallelSSAGraphExecutor::ParallelSSAGraphExecutor(
       places_(places),
       graphs_(std::move(graphs)),
       feed_status_(places.size(), FeedStatus::kNone) {
-  PADDLE_ENFORCE_EQ(places_.size(), local_scopes_.size());
+  PADDLE_ENFORCE_EQ(places_.size(), local_scopes_.size(),
+                    platform::errors::InvalidArgument(
+                        "The number of places and the number of local scopes "
+                        "should be equal, but got number of places is %d and "
+                        "number of local scopes is %d.",
+                        places_.size(), local_scopes_.size()));
   PADDLE_ENFORCE_EQ(places_.size(), graphs_.size(),
                     platform::errors::InvalidArgument(
paddle/fluid/framework/details/scope_buffered_ssa_graph_executor.cc
@@ -13,10 +13,12 @@
 // limitations under the License.
 #include "paddle/fluid/framework/details/scope_buffered_ssa_graph_executor.h"
 #include <stdexcept>
 #include <string>
 #include <utility>
 #include <vector>
 #include "paddle/fluid/framework/details/multi_devices_helper.h"
 #include "paddle/fluid/framework/op_registry.h"
 #include "paddle/fluid/framework/variable_helper.h"
@@ -37,7 +39,13 @@ ScopeBufferedSSAGraphExecutor::ScopeBufferedSSAGraphExecutor(
       var_infos_(std::move(var_infos)),
       places_(std::move(places)),
       scope_monitor_(places_, local_exec_scopes_) {
-  PADDLE_ENFORCE_EQ(local_scopes_.size(), local_exec_scopes_.size());
+  PADDLE_ENFORCE_EQ(
+      local_scopes_.size(), local_exec_scopes_.size(),
+      platform::errors::InvalidArgument(
+          "The number of local scopes and the number of local execution scopes "
+          "should be equal, but got number of local scopes is %d and "
+          "number of local execution scopes is %d.",
+          local_scopes_.size(), local_exec_scopes_.size()));
   PrepareLocalExeScopes();
 }
paddle/fluid/framework/details/share_tensor_buffer_functor.cc
@@ -13,9 +13,11 @@
 // limitations under the License.
 #include "paddle/fluid/framework/details/share_tensor_buffer_functor.h"
 #include <string>
 #include <unordered_map>
 #include <unordered_set>
 #include "paddle/fluid/framework/lod_tensor.h"
 #include "paddle/fluid/framework/selected_rows.h"
 #include "paddle/fluid/platform/enforce.h"
@@ -29,7 +31,8 @@ static inline const Tensor &GetTensorFromVar(const Variable *var) {
   if (var->IsType<LoDTensor>()) {
     return var->Get<LoDTensor>();
   } else {
-    PADDLE_THROW("Variable must be type of LoDTensor");
+    PADDLE_THROW(platform::errors::InvalidArgument(
+        "Variable must be type of LoDTensor."));
   }
 }
@@ -37,20 +40,27 @@ static inline Tensor *GetMutableTensorFromVar(Variable *var) {
   if (var->IsType<LoDTensor>()) {
     return var->GetMutable<LoDTensor>();
   } else {
-    PADDLE_THROW("Variable must be type of LoDTensor");
+    PADDLE_THROW(platform::errors::InvalidArgument(
+        "Variable must be type of LoDTensor."));
   }
 }

 ShareTensorBufferFunctor::ShareTensorBufferFunctor(
     Scope *scope, size_t scope_idx, const std::string &op_type,
     const std::vector<const ir::MemOptVarInfo *> &in_var_infos,
-    const std::vector<std::string> &out_var_names)
+    const std::vector<std::string> &out_var_names, bool share_dims)
     : scope_(scope),
       scope_idx_(scope_idx),
       op_type_(op_type),
       in_var_infos_(in_var_infos),
-      out_var_names_(out_var_names) {
-  PADDLE_ENFORCE_EQ(in_var_infos_.size(), out_var_names_.size());
+      out_var_names_(out_var_names),
+      share_dims_(share_dims) {
+  PADDLE_ENFORCE_EQ(in_var_infos_.size(), out_var_names_.size(),
+                    platform::errors::PreconditionNotMet(
+                        "The number of input variables and output variables "
+                        "should be equal, but got number of input variables is "
+                        "%d and number of output variables is %d.",
+                        in_var_infos_.size(), out_var_names_.size()));
   for (size_t i = 0; i < in_var_infos_.size(); ++i) {
     AddReuseVarPair(in_var_infos_[i], out_var_names_[i]);
   }
@@ -67,32 +77,59 @@ ShareTensorBufferFunctor::ReusedVars() const {
 void ShareTensorBufferFunctor::AddReuseVarPair(
     const ir::MemOptVarInfo *in_var_info, const std::string &out_var_name) {
-  PADDLE_ENFORCE_NOT_NULL(in_var_info, "in_var_info cannot be nullptr");
+  PADDLE_ENFORCE_NOT_NULL(
+      in_var_info,
+      platform::errors::InvalidArgument(
+          "The input variables to be inplaced should not be NULL."));
   PADDLE_ENFORCE_NE(
       in_var_info->Name(), out_var_name,
-      "in/out cannot have same name: %s", out_var_name);
+      platform::errors::InvalidArgument(
+          "The input variable and output variable to be inplaced "
+          "cannot have the same name: %s.",
+          out_var_name));
   in_var_infos_.emplace_back(in_var_info);
   out_var_names_.emplace_back(out_var_name);
 }

 void ShareTensorBufferFunctor::CallOnce() {
-  PADDLE_ENFORCE(in_out_vars_.empty(), "in_out_vars_ must be initialized here");
+  PADDLE_ENFORCE(in_out_vars_.empty(),
+                 platform::errors::InvalidArgument(
+                     "The input-output variable pairs to be "
+                     "inplaced should be initialized here."));
   for (size_t i = 0; i < in_var_infos_.size(); ++i) {
     auto *in_var = exec_scope_->FindVar(in_var_infos_[i]->Name());
     auto *out_var = exec_scope_->FindVar(out_var_names_[i]);
-    PADDLE_ENFORCE_NOT_NULL(in_var);
-    PADDLE_ENFORCE_NOT_NULL(out_var);
-    PADDLE_ENFORCE_NE(in_var, out_var);
+    PADDLE_ENFORCE_NOT_NULL(
+        in_var, platform::errors::NotFound(
+                    "The input variable(%s)to be inplaced should not be NULL.",
+                    in_var_infos_[i]->Name()));
+    PADDLE_ENFORCE_NOT_NULL(
+        out_var,
+        platform::errors::NotFound(
+            "The output variable(%s) to be inplaced should not be NULL.",
+            out_var_names_[i]));
+    PADDLE_ENFORCE_NE(
+        in_var, out_var,
+        platform::errors::PreconditionNotMet(
+            "The input variable and output variable to be inplaced "
+            "cannot be the same variable(%s).",
+            out_var_names_[i]));
     in_out_vars_.emplace_back(in_var, out_var);
   }
 }

 void ShareTensorBufferFunctor::operator()(Scope *exec_scope) {
   if (!exec_scope_) {
-    PADDLE_ENFORCE_NOT_NULL(exec_scope);
+    PADDLE_ENFORCE_NOT_NULL(exec_scope,
+                            platform::errors::InvalidArgument(
+                                "The given execution scope should not be NULL "
+                                "if the cached scope is NULL."));
     exec_scope_ = exec_scope;
     CallOnce();
   } else {
-    PADDLE_ENFORCE(exec_scope_ == exec_scope, "Scope must be the same");
+    PADDLE_ENFORCE_EQ(exec_scope_, exec_scope,
+                      platform::errors::InvalidArgument(
+                          "The given execution scope and the cached execution "
+                          "scope should be the same."));
   }

   for (size_t i = 0; i < in_var_infos_.size(); ++i) {
@@ -115,6 +152,13 @@ void ShareTensorBufferFunctor::operator()(Scope *exec_scope) {
     } else {
       out_tensor->ShareBufferWith(in_tensor);

+      // NOTE(zhiqiu): In the case of inplace addto, if the operator of
+      // the in_out_vars is skipped during running, we should set the dims of
+      // output as the same as input.
+      if (share_dims_) {
+        out_tensor->Resize(in_tensor.dims());
+      }
+
       VLOG(2) << "Share tensor buffer when running " << op_type_ << " : "
               << in_var_info->Name() << " -> " << out_var_names_[i];
     }
paddle/fluid/framework/details/share_tensor_buffer_functor.h
@@ -19,6 +19,7 @@
 #include <unordered_set>
 #include <utility>
 #include <vector>
 #include "paddle/fluid/framework/details/op_handle_base.h"
 #include "paddle/fluid/framework/ir/memory_optimize_pass/memory_optimization_var_info.h"
 #include "paddle/fluid/framework/scope.h"
@@ -40,11 +41,13 @@ class ShareTensorBufferFunctor {
   ShareTensorBufferFunctor(
       Scope *scope, size_t scope_idx, const std::string &op_type,
       const std::vector<const ir::MemOptVarInfo *> &in_var_infos,
-      const std::vector<std::string> &out_var_names);
+      const std::vector<std::string> &out_var_names, bool share_dims = false);

   void AddReuseVarPair(const ir::MemOptVarInfo *in_var_info,
                        const std::string &out_var_name);

+  void SetShareDims(bool share_dims) { share_dims_ = share_dims; }
+
   void operator()(Scope *exec_scope);

   std::unordered_map<std::string, std::string> ReusedVars() const;
@@ -66,6 +69,11 @@ class ShareTensorBufferFunctor {
   std::vector<std::string> out_var_names_;
   std::vector<std::pair<const Variable *, Variable *>> in_out_vars_;

+  // NOTE(zhiqiu): In the case of inplace addto, if the operator of
+  // the in_out_vars is skipped during running, we should set the dims of output
+  // as the same as input.
+  bool share_dims_{false};
 };

 }  // namespace details
paddle/fluid/framework/details/share_tensor_buffer_op_handle.cc
@@ -13,8 +13,10 @@
 // limitations under the License.
 #include "paddle/fluid/framework/details/share_tensor_buffer_op_handle.h"
 #include <string>
 #include <unordered_set>
 #include "paddle/fluid/framework/ir/memory_optimize_pass/memory_optimization_var_info.h"
 #include "paddle/fluid/framework/lod_tensor.h"
 #include "paddle/fluid/framework/scope.h"
@@ -32,26 +34,35 @@ ComputationOpHandle *GetUniquePendingComputationOpHandle(
     for (ir::Node *pending_op : out_var->outputs) {
       auto &op = pending_op->Wrapper<OpHandleBase>();
       auto *compute_op = dynamic_cast<ComputationOpHandle *>(&op);
-      PADDLE_ENFORCE_NOT_NULL(compute_op);
+      PADDLE_ENFORCE_NOT_NULL(
+          compute_op,
+          platform::errors::PreconditionNotMet(
+              "The pending OpHandle should be ComputationOpHandle."));

       if (result_op == nullptr) {
         result_op = compute_op;
       } else {
-        PADDLE_ENFORCE_EQ(result_op, compute_op);
+        PADDLE_ENFORCE_EQ(result_op, compute_op,
+                          platform::errors::PreconditionNotMet(
+                              "The pending OpHandle should be the unique one."));
       }
     }
   }

-  PADDLE_ENFORCE_NOT_NULL(result_op);
+  PADDLE_ENFORCE_NOT_NULL(result_op,
+                          platform::errors::PreconditionNotMet(
+                              "The pending OpHandle should not be NULL."));
   return result_op;
 }

 ShareTensorBufferOpHandle::ShareTensorBufferOpHandle(
     ir::Node *node, Scope *scope, size_t scope_idx, const std::string &op_type,
     const std::vector<const ir::MemOptVarInfo *> &in_var_infos,
-    const std::vector<std::string> &out_var_names)
+    const std::vector<std::string> &out_var_names, bool share_dims)
     : OpHandleBase(node),
-      functor_(scope, scope_idx, op_type, in_var_infos, out_var_names) {}
+      functor_(scope, scope_idx, op_type, in_var_infos, out_var_names,
+               share_dims) {}

 std::unordered_map<std::string, std::string>
 ShareTensorBufferOpHandle::ReusedVars() const {
@@ -63,6 +74,10 @@ void ShareTensorBufferOpHandle::AddReuseVarPair(
   functor_.AddReuseVarPair(in_var_info, out_var_name);
 }

+void ShareTensorBufferOpHandle::SetShareDims(bool share_dims) {
+  functor_.SetShareDims(share_dims);
+}
+
 void ShareTensorBufferOpHandle::InitCUDA() {
 #ifdef PADDLE_WITH_CUDA
   int dev_id =
paddle/fluid/framework/details/share_tensor_buffer_op_handle.h
@@ -17,6 +17,7 @@
 #include <unordered_map>
 #include <utility>
 #include <vector>
 #include "paddle/fluid/framework/details/computation_op_handle.h"
 #include "paddle/fluid/framework/details/op_handle_base.h"
 #include "paddle/fluid/framework/details/share_tensor_buffer_functor.h"
@@ -31,7 +32,7 @@ class ShareTensorBufferOpHandle : public OpHandleBase {
       ir::Node *node, Scope *scope, size_t scope_idx, const std::string &op_type,
       const std::vector<const ir::MemOptVarInfo *> &in_vars_infos,
-      const std::vector<std::string> &out_var_names);
+      const std::vector<std::string> &out_var_names, bool share_dims = false);

   std::unordered_map<std::string, std::string> ReusedVars() const;
@@ -42,6 +43,8 @@ class ShareTensorBufferOpHandle : public OpHandleBase {
   void AddReuseVarPair(const ir::MemOptVarInfo *in_var_info,
                        const std::string &out_var_name);

+  void SetShareDims(bool share_dims);
+
   const ShareTensorBufferFunctor &Functor() const { return functor_; }

  protected:
paddle/fluid/framework/details/ssa_graph_executor.cc
@@ -13,6 +13,7 @@
 // limitations under the License.
 #include "paddle/fluid/framework/details/ssa_graph_executor.h"
 #include "paddle/fluid/framework/details/fetch_async_op_handle.h"
 namespace paddle {
@@ -27,8 +28,9 @@ void ClearFetchOp(ir::Graph* graph, std::vector<OpHandleBase*>* fetch_ops) {
     PADDLE_ENFORCE_EQ(dynamic_cast<FetchOpHandle*>(op) != nullptr ||
                           dynamic_cast<FetchAsyncOpHandle*>(op) != nullptr,
                       true,
-                      "The input ops of ClearFetchOp function should be "
-                      "FetchOpHandle or FetchAsyncOpHandle.");
+                      platform::errors::PreconditionNotMet(
+                          "The input ops of ClearFetchOp function should be "
+                          "FetchOpHandle or FetchAsyncOpHandle."));
     for (auto& out_var : op->Node()->outputs) {
       graph->RemoveNode(out_var);
     }
paddle/fluid/framework/details/threaded_ssa_graph_executor.cc
@@ -13,6 +13,7 @@
 // limitations under the License.
 #include "paddle/fluid/framework/details/threaded_ssa_graph_executor.h"
 #include "paddle/fluid/framework/ir/graph_helper.h"
 #include "paddle/fluid/platform/profiler.h"
@@ -138,7 +139,10 @@ inline FetchResultType ThreadedSSAGraphExecutor::RunImpl(
       }
     }
   }
-  PADDLE_ENFORCE(ready_ops.empty());
+  PADDLE_ENFORCE_EQ(
+      ready_ops.empty(), true,
+      platform::errors::Fatal("After the execution of computation graph, "
+                              "there are unexecuted operators left."));
   }

   // Wait FetchOps.
@@ -165,9 +169,8 @@ void ThreadedSSAGraphExecutor::InsertFetchOps(
     FetchResultType *fetch_data, bool return_merged) {
   std::unordered_map<std::string, std::vector<VarHandleBase *>> fetched_vars;
   std::unordered_set<VarHandleBase *> local_ready_vars;
-  std::unordered_set<std::string> fetch_tensor_set(fetch_tensors.begin(),
-                                                   fetch_tensors.end());
-  for (auto &fetch_var_name : fetch_tensor_set) {
+  for (auto &fetch_var_name : fetch_tensors) {
     for (auto &var_map : graph_->Get<details::GraphVars>(details::kGraphVars)) {
       auto it = var_map.find(fetch_var_name);
       if (it != var_map.end()) {
@@ -231,7 +234,11 @@ void ThreadedSSAGraphExecutor::InsertFetchOps(
       ready_ops->insert(static_cast<OpHandleBase *>(op));
     }
   }
-  PADDLE_ENFORCE_EQ(local_ready_vars.size(), 0);
+  PADDLE_ENFORCE_EQ(
+      local_ready_vars.size(), 0,
+      platform::errors::Fatal(
+          "The number of ready variables should be 0, but got %d.",
+          local_ready_vars.size()));
 }

 void ThreadedSSAGraphExecutor::InsertPendingOp(
@@ -277,7 +284,9 @@ void ThreadedSSAGraphExecutor::PrepareOpDeps() {
     }
   }
   op_deps_->num_ops_ = ready_ops.size() + pending_ops.size();
-  PADDLE_ENFORCE_GT(op_deps_->num_ops_, 0, "The graph doesn't have operators.");
+  PADDLE_ENFORCE_GT(
+      op_deps_->num_ops_, 0,
+      platform::errors::InvalidArgument("The graph doesn't have operators."));
   for (auto ready_var : ready_vars) {
     pending_vars.erase(ready_var);
paddle/fluid/framework/details/threaded_ssa_graph_executor.h
@@ -14,6 +14,8 @@
 #pragma once
+#include <ThreadPool.h>  // ThreadPool in thrird party
+#include <deque>
 #include <functional>
 #include <list>
@@ -24,8 +26,6 @@
 #include <utility>
 #include <vector>
-#include <ThreadPool.h>  // ThreadPool in thrird party
 #include "paddle/fluid/framework/blocking_queue.h"
 #include "paddle/fluid/framework/details/exception_holder.h"
 #include "paddle/fluid/framework/details/execution_strategy.h"
paddle/fluid/framework/details/var_handle.h
@@ -54,8 +54,10 @@ struct VarHandleBase {
   void AddOutput(OpHandleBase* out, ir::Node* node) {
     if (pending_ops_.find(out) == pending_ops_.end()) {
-      PADDLE_ENFORCE(out != nullptr, "The output of %s should not be nullptr",
-                     this->Node()->Name());
+      PADDLE_ENFORCE_NOT_NULL(
+          out, platform::errors::InvalidArgument(
+                   "The output added to VarHandle %s is NULL.",
+                   this->Node()->Name()));
       pending_ops_.insert(out);
       node_->outputs.push_back(node);
     }
@@ -120,7 +122,10 @@ struct VarHandle : public VarHandleBase {
   bool HasEvent() { return has_event_; }

   const cudaEvent_t& GetEvent() {
-    PADDLE_ENFORCE(HasEvent(), "The event is not set.");
+    PADDLE_ENFORCE_EQ(
+        HasEvent(), true,
+        platform::errors::PreconditionNotMet(
+            "The cuda event is not set, maybe InitCUDA() is not called."));
     return event_;
   }
paddle/fluid/framework/details/variable_visitor.cc
@@ -13,6 +13,7 @@
 // limitations under the License.
 #include "paddle/fluid/framework/details/variable_visitor.h"
 #include "paddle/fluid/framework/selected_rows.h"
 namespace paddle {
 namespace framework {
@@ -24,7 +25,9 @@ static void VisitVariable(Variable* var, Func* func) {
   } else if (var->IsType<SelectedRows>()) {
     (*func)(var->GetMutable<SelectedRows>());
   } else {
-    PADDLE_THROW("Not supported type %s", ToTypeName(var->Type()));
+    PADDLE_THROW(platform::errors::Unimplemented(
+        "VisitVariable is not supported for type %s.",
+        ToTypeName(var->Type())));
   }
 }
@@ -35,7 +38,8 @@ static void VisitVariable(const Variable& var, Func* func) {
   } else if (var.IsType<SelectedRows>()) {
     (*func)(var.Get<SelectedRows>());
   } else {
-    PADDLE_THROW("Not supported type %s", ToTypeName(var.Type()));
+    PADDLE_THROW(platform::errors::Unimplemented(
+        "VisitVariable is not supported for type %s.", ToTypeName(var.Type())));
   }
 }
@@ -50,7 +54,8 @@ struct TensorVisitor {
   template <typename T>
   void operator()() {
-    PADDLE_THROW("Not Support to get LoDTensor from %s", typeid(T).name());
+    PADDLE_THROW(platform::errors::Unimplemented(
+        "Getting tensor from type %s is not supported.", typeid(T).name()));
   }
 };
@@ -78,8 +83,8 @@ struct ShareDimsAndLoDVisitor {
   template <typename T>
   void operator()(const T&) {
-    PADDLE_ENFORCE("ShareDimsAndLoD is not supported by type %s",
-                   typeid(T).name());
+    PADDLE_THROW(platform::errors::Unimplemented(
+        "ShareDimsAndLoD is not supported for type %s.", typeid(T).name()));
   }
 };
@@ -89,42 +94,54 @@ void VariableVisitor::ShareDimsAndLoD(const Variable& src, Variable* trg) {
 }

 struct EnforceShapeAndDTypeEQVisitor {
-  const Variable* trg_;
+  const Variable* dst_;

   void operator()(const LoDTensor& src) {
-    auto& tensor = trg_->Get<LoDTensor>();
-    PADDLE_ENFORCE_EQ(
-        src.place().which(), tensor.place().which(),
-        "The Places of the two Variable must be all on CPU or all on GPU.");
+    auto& tensor = dst_->Get<LoDTensor>();
+    PADDLE_ENFORCE_EQ(
+        src.place().which(), tensor.place().which(),
+        platform::errors::PreconditionNotMet(
+            "The place type of the two variables is not equal."));
-    PADDLE_ENFORCE_EQ(src.type(), tensor.type(),
-                      "The dtype of the two Variable is not equal.");
+    PADDLE_ENFORCE_EQ(src.type(), tensor.type(),
+                      platform::errors::PreconditionNotMet(
+                          "The dtype of the two variables is not equal."));
-    PADDLE_ENFORCE_EQ(src.dims(), tensor.dims(),
-                      "The dims of the two Variable is not equal.");
+    PADDLE_ENFORCE_EQ(
+        src.dims(), tensor.dims(),
+        platform::errors::PreconditionNotMet(
+            "The layout of the two variables' tensors is not equal."));
-    PADDLE_ENFORCE_EQ(src.lod(), tensor.lod(),
-                      "The lod of the two Variable is not equal.");
+    PADDLE_ENFORCE_EQ(src.lod(), tensor.lod(),
+                      platform::errors::PreconditionNotMet(
+                          "The lod of the two variable is not equal."));
-    PADDLE_ENFORCE_EQ(src.layout(), tensor.layout(),
-                      "The layout of the two Variable's tensor is not equal.");
+    PADDLE_ENFORCE_EQ(
+        src.layout(), tensor.layout(),
+        platform::errors::PreconditionNotMet(
+            "The layout of the two variables' tensors tensor is not equal."));
   }

   void operator()(const SelectedRows& src) {
-    auto& selected_rows = trg_->Get<SelectedRows>();
-    PADDLE_ENFORCE_EQ(
-        src.place().which(), selected_rows.place().which(),
-        "The Places of the two Variable must be all on CPU or all on GPU.");
+    auto& selected_rows = dst_->Get<SelectedRows>();
+    PADDLE_ENFORCE_EQ(
+        src.place().which(), selected_rows.place().which(),
+        platform::errors::PreconditionNotMet(
+            "The place type of the two variables is not equal."));
-    PADDLE_ENFORCE_EQ(src.value().type(), selected_rows.value().type(),
-                      "The dtype of the two Variable is not equal.");
+    PADDLE_ENFORCE_EQ(src.value().type(), selected_rows.value().type(),
+                      platform::errors::PreconditionNotMet(
+                          "The dtype of the two variables is not equal."));
-    PADDLE_ENFORCE_EQ(src.value().layout(), selected_rows.value().layout(),
-                      "The layout of the two Variable's tensor is not equal.");
+    PADDLE_ENFORCE_EQ(
+        src.value().layout(), selected_rows.value().layout(),
+        platform::errors::PreconditionNotMet(
+            "The layout of the two variables' tensors is not equal."));
-    PADDLE_ENFORCE_EQ(src.height(), selected_rows.height(),
-                      "The height of the two Variable is not equal.");
+    PADDLE_ENFORCE_EQ(src.height(), selected_rows.height(),
+                      platform::errors::PreconditionNotMet(
+                          "The height of the two variables is not equal."));
-    PADDLE_ENFORCE_EQ(src.GetCompleteDims(), selected_rows.GetCompleteDims(),
-                      "The dims of the two Variable is not equal.");
+    PADDLE_ENFORCE_EQ(
+        src.GetCompleteDims(), selected_rows.GetCompleteDims(),
+        platform::errors::PreconditionNotMet(
+            "The dims of the two variables is not equal."));
   }

   template <typename T>
   void operator()(const T&) {
-    PADDLE_ENFORCE("EnforceShapeAndDTypeEQ is not supported by type %s",
-                   typeid(T).name());
+    PADDLE_THROW(platform::errors::Unimplemented(
+        "EnforceShapeAndDTypeEQ is not supported for type %s.",
+        typeid(T).name()));
   }
 };
paddle/fluid/framework/fleet/gloo_wrapper.cc

...
@@ -19,6 +19,8 @@ limitations under the License. */
 namespace gloo {
 namespace rendezvous {
+constexpr int kNodeSize = 136;
 HdfsStore::HdfsStore(const std::string& path) {
   path_ = path;
   wait_sleep_ms_ = 10000;
...
@@ -213,12 +215,14 @@ void ParallelConnectContext::connectFullMesh(
   storeKey << rank;
   store.set(storeKey.str(), allBytes);
+  auto total_add_size = kNodeSize * (size - 1);
   std::vector<std::shared_ptr<std::thread>> connect_threads(thread_num_);
   // Connect every pair
   for (uint32_t i = 0; i < connect_threads.size(); ++i) {
     connect_threads[i].reset(new std::thread(
-        [&store, &transportContext, this](size_t thread_idx,
-                                          size_t thread_num) -> void {
+        [&store, &transportContext, total_add_size, this](
+            size_t thread_idx, size_t thread_num) -> void {
           for (int i = thread_idx; i < size; i += thread_num) {
             if (i == rank) {
               continue;
...
@@ -226,8 +230,23 @@ void ParallelConnectContext::connectFullMesh(
             // Wait for address of other side of this pair to become available
             std::string key = std::to_string(i);
             store.wait({key}, getTimeout());
+            std::vector<char> allAddrs;
+            auto max_retry_times = 5;
             // Connect to other side of this pair
-            auto allAddrs = store.get(key);
+            while (max_retry_times > 0) {
+              allAddrs = store.get(key);
+              VLOG(3) << "store get all address size: " << allAddrs.size()
+                      << " except: " << total_add_size;
+              if (allAddrs.size() == static_cast<size_t>(total_add_size)) {
+                break;
+              }
+              --max_retry_times;
+            }
             auto addr = extractAddress(allAddrs, i);
             transportContext->getPair(i)->connect(addr);
           }
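The added loop is a plain bounded retry: keep polling the store until the collected address bytes match the expected total (kNodeSize per peer) or the retry budget is exhausted, then proceed with whatever was fetched. A self-contained sketch of the same control flow (the fetch function and sizes below are invented for illustration):

#include <cstddef>
#include <vector>

// Illustrative stand-in for store.get(key): grows by one 136-byte node record per poll.
static std::vector<char> FakeFetch() {
  static std::vector<char> data;
  data.resize(data.size() + 136);
  return data;
}

int main() {
  const std::size_t expected = 136 * 3;  // e.g. size - 1 == 3 peers
  std::vector<char> all;
  int max_retry_times = 5;
  while (max_retry_times > 0) {
    all = FakeFetch();
    if (all.size() == expected) break;  // everything we were waiting for arrived
    --max_retry_times;                  // otherwise spend one retry and poll again
  }
  return all.size() == expected ? 0 : 1;
}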
...
...
paddle/fluid/framework/ir/conv_affine_channel_fuse_pass.cc

...
@@ -18,6 +18,7 @@
 #include <string>
 #include <vector>
 #include "paddle/fluid/framework/lod_tensor.h"
+#include "paddle/fluid/framework/op_version_registry.h"
 #include "paddle/fluid/operators/math/cpu_vec.h"
 #include "paddle/fluid/platform/enforce.h"
...
@@ -225,3 +226,14 @@ REGISTER_PASS(conv_affine_channel_fuse_pass,
               paddle::framework::ir::ConvAffineChannelFusePass);
 REGISTER_PASS(conv_eltwiseadd_affine_channel_fuse_pass,
               paddle::framework::ir::ConvEltwiseAddAffineChannelFusePass);
+REGISTER_PASS_CAPABILITY(conv_affine_channel_fuse_pass)
+    .AddCombination(
+        paddle::framework::compatible::OpVersionComparatorCombination()
+            .EQ("conv2d", 0)
+            .EQ("affine_channel", 0));
+REGISTER_PASS_CAPABILITY(conv_eltwiseadd_affine_channel_fuse_pass)
+    .AddCombination(
+        paddle::framework::compatible::OpVersionComparatorCombination()
+            .EQ("conv2d", 0)
+            .EQ("elementwise_add", 0)
+            .EQ("affine_channel", 0));
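This commit adds the same kind of capability declaration to every fuse pass it touches: the pass records which operator versions its rewrite was written against, and it is meant to fire only when the listed ops are still at those versions. A conceptual sketch of that check, outside the Paddle API (the function and maps below are made up for illustration):

#include <map>
#include <string>

// Conceptual model of OpVersionComparatorCombination().EQ(op, version):
// the pass is compatible only if every op it rewrites matches the pinned version.
static bool Compatible(const std::map<std::string, int>& required,
                       const std::map<std::string, int>& current_versions) {
  for (const auto& kv : required) {
    auto it = current_versions.find(kv.first);
    if (it == current_versions.end() || it->second != kv.second) return false;
  }
  return true;
}

int main() {
  std::map<std::string, int> required = {{"conv2d", 0}, {"affine_channel", 0}};
  std::map<std::string, int> current = {{"conv2d", 0}, {"affine_channel", 0}};
  return Compatible(required, current) ? 0 : 1;  // compatible, so the pass may run
}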
paddle/fluid/framework/ir/conv_bn_fuse_pass.cc

...
@@ -18,6 +18,7 @@
 #include <string>
 #include <vector>
 #include "paddle/fluid/framework/lod_tensor.h"
+#include "paddle/fluid/framework/op_version_registry.h"
 #include "paddle/fluid/operators/math/cpu_vec.h"
 #include "paddle/fluid/platform/enforce.h"
...
@@ -372,3 +373,14 @@ REGISTER_PASS(depthwise_conv_bn_fuse_pass,
               paddle::framework::ir::DepthwiseConvBNFusePass);
 REGISTER_PASS(depthwise_conv_eltwiseadd_bn_fuse_pass,
               paddle::framework::ir::DepthwiseConvEltwiseAddBNFusePass);
+REGISTER_PASS_CAPABILITY(conv_bn_fuse_pass)
+    .AddCombination(
+        paddle::framework::compatible::OpVersionComparatorCombination()
+            .EQ("conv2d", 0)
+            .EQ("batch_norm", 0));
+REGISTER_PASS_CAPABILITY(conv_eltwiseadd_bn_fuse_pass)
+    .AddCombination(
+        paddle::framework::compatible::OpVersionComparatorCombination()
+            .EQ("conv2d", 0)
+            .EQ("elementwise_add", 0)
+            .EQ("batch_norm", 0));
paddle/fluid/framework/ir/conv_elementwise_add2_act_fuse_pass.cc

...
@@ -11,9 +11,9 @@
 // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 // See the License for the specific language governing permissions and
 // limitations under the License.
 #include "paddle/fluid/framework/ir/conv_elementwise_add2_act_fuse_pass.h"
 #include <string>
+#include "paddle/fluid/framework/op_version_registry.h"
 namespace paddle {
 namespace framework {
...
@@ -116,3 +116,10 @@ void ConvElementwiseAdd2ActFusePass::ApplyImpl(ir::Graph* graph) const {
 REGISTER_PASS(conv_elementwise_add2_act_fuse_pass,
               paddle::framework::ir::ConvElementwiseAdd2ActFusePass);
+REGISTER_PASS_CAPABILITY(conv_elementwise_add2_act_fuse_pass)
+    .AddCombination(
+        paddle::framework::compatible::OpVersionComparatorCombination()
+            .EQ("conv2d", 0)
+            .EQ("elementwise_add", 0)
+            .EQ("relu", 0)
+            .EQ("identity", 0));
paddle/fluid/framework/ir/conv_elementwise_add_act_fuse_pass.cc

...
@@ -15,6 +15,7 @@
 #include "paddle/fluid/framework/ir/conv_elementwise_add_act_fuse_pass.h"
 #include <string>
 #include "paddle/fluid/framework/ir/graph_viz_pass.h"
+#include "paddle/fluid/framework/op_version_registry.h"
 namespace paddle {
 namespace framework {
...
@@ -102,3 +103,10 @@ void ConvElementwiseAddActFusePass::ApplyImpl(ir::Graph* graph) const {
 REGISTER_PASS(conv_elementwise_add_act_fuse_pass,
               paddle::framework::ir::ConvElementwiseAddActFusePass);
+REGISTER_PASS_CAPABILITY(conv_elementwise_add_act_fuse_pass)
+    .AddCombination(
+        paddle::framework::compatible::OpVersionComparatorCombination()
+            .EQ("conv2d", 0)
+            .EQ("elementwise_add", 0)
+            .EQ("relu", 0)
+            .EQ("identity", 0));
paddle/fluid/framework/ir/conv_elementwise_add_fuse_pass.cc

...
@@ -12,10 +12,10 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.
-#include <string>
 #include "paddle/fluid/framework/ir/conv_elementwise_add_fuse_pass.h"
+#include <string>
 #include "paddle/fluid/framework/ir/graph_viz_pass.h"
+#include "paddle/fluid/framework/op_version_registry.h"
 namespace paddle {
 namespace framework {
...
@@ -89,3 +89,8 @@ void ConvElementwiseAddFusePass::ApplyImpl(ir::Graph* graph) const {
 REGISTER_PASS(conv_elementwise_add_fuse_pass,
               paddle::framework::ir::ConvElementwiseAddFusePass);
+REGISTER_PASS_CAPABILITY(conv_elementwise_add_fuse_pass)
+    .AddCombination(
+        paddle::framework::compatible::OpVersionComparatorCombination()
+            .EQ("conv2d", 0)
+            .EQ("elementwise_add", 0));
paddle/fluid/framework/ir/embedding_fc_lstm_fuse_pass.cc

...
@@ -23,6 +23,8 @@
 #include "paddle/fluid/operators/math/cpu_vec.h"
 #include "paddle/fluid/platform/cpu_info.h"
+#include "paddle/fluid/framework/op_version_registry.h"
 namespace paddle {
 namespace framework {
 namespace ir {
...
@@ -34,7 +36,7 @@ static int BuildFusion(Graph* graph, const std::string& name_scope,
   // Build pattern
   PDNode* x = pattern->NewNode(patterns::PDNodeName(name_scope, "x"))
-                  ->assert_is_op_input("lookup_table")
+                  ->assert_is_op_input("lookup_table_v2")
                   ->assert_var_not_persistable();
   patterns::Embedding embedding_pattern(pattern, name_scope);
   // TODO(jczaja): Intermediate can only be for val that are not used anywhere
...
@@ -256,3 +258,11 @@ void EmbeddingFCLSTMFusePass::ApplyImpl(ir::Graph* graph) const {
 REGISTER_PASS(embedding_fc_lstm_fuse_pass,
               paddle::framework::ir::EmbeddingFCLSTMFusePass);
+REGISTER_PASS_CAPABILITY(embedding_fc_lstm_fuse_pass)
+    .AddCombination(
+        paddle::framework::compatible::OpVersionComparatorCombination()
+            .EQ("lookup_table_v2", 0)
+            .EQ("mul", 0)
+            .EQ("elementwise_add", 0)
+            .EQ("lstm", 0)
+            .EQ("fused_embedding_fc_lstm", 0));
paddle/fluid/framework/ir/fc_fuse_pass.cc

...
@@ -18,6 +18,7 @@
 #include <unordered_set>
 #include <vector>
 #include "paddle/fluid/framework/ir/graph_helper.h"
+#include "paddle/fluid/framework/op_version_registry.h"
 #include "paddle/fluid/platform/enforce.h"
 namespace paddle {
...
@@ -182,3 +183,10 @@ int FCFusePass::ApplyFCPattern(Graph* graph, bool with_relu) const {
 REGISTER_PASS(fc_fuse_pass, paddle::framework::ir::FCFusePass)
     .RequirePassAttr("use_gpu");
+REGISTER_PASS_CAPABILITY(fc_fuse_pass)
+    .AddCombination(
+        paddle::framework::compatible::OpVersionComparatorCombination()
+            .EQ("mul", 0)
+            .EQ("elementwise_add", 0)
+            .EQ("relu", 0)
+            .EQ("fc", 0));
paddle/fluid/framework/ir/fc_gru_fuse_pass.cc

...
@@ -16,6 +16,7 @@
 #include <string>
 #include <unordered_set>
 #include "paddle/fluid/framework/lod_tensor.h"
+#include "paddle/fluid/framework/op_version_registry.h"
 namespace paddle {
 namespace framework {
...
@@ -125,7 +126,6 @@ static int BuildFusion(Graph* graph, const std::string& name_scope,
     auto* x_n = subgraph.at(x);
     GET_IR_NODE_FROM_SUBGRAPH(w, w, fc_pattern);
     GET_IR_NODE_FROM_SUBGRAPH(mul, mul, fc_pattern);
-    GET_IR_NODE_FROM_SUBGRAPH(fc_out, elementwise_add_out, fc_pattern);
     GET_IR_NODE_FROM_SUBGRAPH(Weight, Weight, gru_pattern);
     GET_IR_NODE_FROM_SUBGRAPH(gru, gru, gru_pattern);
     GET_IR_NODE_FROM_SUBGRAPH(Bias, Bias, gru_pattern);
...
@@ -136,10 +136,17 @@ static int BuildFusion(Graph* graph, const std::string& name_scope,
                               gru_pattern);
     GET_IR_NODE_FROM_SUBGRAPH(BatchHidden, BatchHidden, gru_pattern);
+    // TODO(wilber): Support origin_mode=True.
+    if (gru->Op()->GetAttrIfExists<bool>("origin_mode") == true) {
+      LOG(INFO) << "fc_gru_fuse_pass not supported when origin_mode=True.";
+      return;
+    }
     if (with_fc_bias) {
       GET_IR_NODE_FROM_SUBGRAPH(mul_out, mul_out, fc_pattern);
       GET_IR_NODE_FROM_SUBGRAPH(fc_bias, bias, fc_pattern);
       GET_IR_NODE_FROM_SUBGRAPH(elementwise_add, elementwise_add, fc_pattern);
+      GET_IR_NODE_FROM_SUBGRAPH(fc_out, elementwise_add_out, fc_pattern);
       gru_creater(gru, x_n, w, Weight, Bias, Hidden, fc_bias);
       // Remove unneeded nodes.
...
@@ -188,3 +195,16 @@ void FCGRUFusePass::ApplyImpl(ir::Graph* graph) const {
 REGISTER_PASS(mul_gru_fuse_pass, paddle::framework::ir::MulGRUFusePass);
 REGISTER_PASS(fc_gru_fuse_pass, paddle::framework::ir::FCGRUFusePass);
+REGISTER_PASS_CAPABILITY(mul_gru_fuse_pass)
+    .AddCombination(
+        paddle::framework::compatible::OpVersionComparatorCombination()
+            .EQ("mul", 0)
+            .EQ("gru", 0)
+            .EQ("fusion_gru", 0));
+REGISTER_PASS_CAPABILITY(fc_gru_fuse_pass)
+    .AddCombination(
+        paddle::framework::compatible::OpVersionComparatorCombination()
+            .EQ("mul", 0)
+            .EQ("elementwise_add", 0)
+            .EQ("gru", 0)
+            .EQ("fusion_gru", 0));
paddle/fluid/framework/ir/fc_lstm_fuse_pass.cc

...
@@ -16,6 +16,7 @@
 #include <string>
 #include <unordered_set>
 #include "paddle/fluid/framework/lod_tensor.h"
+#include "paddle/fluid/framework/op_version_registry.h"
 namespace paddle {
 namespace framework {
...
@@ -196,3 +197,17 @@ void FCLstmFusePass::ApplyImpl(ir::Graph* graph) const {
 REGISTER_PASS(mul_lstm_fuse_pass, paddle::framework::ir::MulLstmFusePass);
 REGISTER_PASS(fc_lstm_fuse_pass, paddle::framework::ir::FCLstmFusePass);
+REGISTER_PASS_CAPABILITY(fc_lstm_fuse_pass)
+    .AddCombination(
+        paddle::framework::compatible::OpVersionComparatorCombination()
+            .EQ("mul", 0)
+            .EQ("elementwise_add", 0)
+            .EQ("lstm", 0)
+            .EQ("fusion_lstm", 0));
+REGISTER_PASS_CAPABILITY(mul_lstm_fuse_pass)
+    .AddCombination(
+        paddle::framework::compatible::OpVersionComparatorCombination()
+            .EQ("mul", 0)
+            .EQ("lstm", 0)
+            .EQ("fusion_lstm", 0));
paddle/fluid/framework/ir/memory_optimize_pass/CMakeLists.txt

...
@@ -13,4 +13,6 @@ cc_library(memory_reuse_pass SRCS memory_reuse_pass.cc DEPS computation_op_handl
 cc_library(buffer_shared_inplace_op_pass SRCS buffer_shared_inplace_op_pass.cc DEPS memory_reuse_pass)
 cc_library(buffer_shared_cross_op_memory_reuse_pass SRCS buffer_shared_cross_op_memory_reuse_pass.cc DEPS memory_reuse_pass)
+cc_library(inplace_addto_op_pass SRCS inplace_addto_op_pass.cc DEPS memory_reuse_pass)
 cc_test(test_reference_count_pass_last_lived_ops SRCS test_reference_count_pass_last_lived_ops.cc DEPS parallel_executor elementwise_mul_op elementwise_add_op scale_op)
paddle/fluid/framework/ir/memory_optimize_pass/buffer_shared_inplace_op_pass.cc

...
@@ -16,6 +16,7 @@
 #include <unordered_map>
 #include <unordered_set>
 #include <vector>
 #include "paddle/fluid/framework/details/computation_op_handle.h"
 #include "paddle/fluid/framework/details/multi_devices_helper.h"
 #include "paddle/fluid/framework/details/share_tensor_buffer_op_handle.h"
...
@@ -141,11 +142,12 @@ void BufferSharedInplaceOpPass::Run(Graph *graph) const {
         VLOG(4) << "Inplace performed in op " << op_type << ": "
                 << in_var_handle_ptr->Name() << " -> "
                 << out_var_handle_ptr->Name()
-                << ". Debug String is: " << op->GetOp()->DebugString();
+                << ". Debug String is: " << op->GetOp()->DebugString()
+                << ". ReuseType: " << ReuseType();
       } else {
         VLOG(3) << "Inplace failed in op " << op_type << ": "
                 << in_var_handle_ptr->Name() << " -> "
-                << out_var_handle_ptr->Name();
+                << out_var_handle_ptr->Name() << ". ReuseType: " << ReuseType();
       }
     }
   }
...
...
paddle/fluid/framework/ir/memory_optimize_pass/inplace_addto_op_pass.cc (new file)

// Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>
#include "paddle/fluid/framework/details/computation_op_handle.h"
#include "paddle/fluid/framework/details/multi_devices_helper.h"
#include "paddle/fluid/framework/details/share_tensor_buffer_op_handle.h"
#include "paddle/fluid/framework/ir/memory_optimize_pass/memory_optimization_var_info.h"
#include "paddle/fluid/framework/ir/memory_optimize_pass/memory_reuse_pass.h"
#include "paddle/fluid/framework/ir/memory_optimize_pass/reference_count_pass_helper.h"
#include "paddle/fluid/framework/ir/pass.h"

namespace paddle {
namespace framework {
namespace ir {

class InplaceAddToOpPass : public MemoryReusePass {
 protected:
  std::string ReuseType() const override { return "inplace_addto"; }

  void Run(Graph *graph) const override;

 private:
  // 1. Add last living op of in_var, add any last living op of out_var
  // 2. Set reference count of in_var to be 2
  void UpdateLastLiveOpOfVar(details::ComputationOpHandle *op,
                             details::VarHandle *in_var,
                             details::VarHandle *out_var) const override {
    size_t scope_idx = op->GetScopeIdx();
    auto *last_live_ops_of_vars_ =
        &Get<std::vector<LastLiveOpsOfVars>>(kLastLiveOpsOfVars);
    auto *var_infos_ = &(Get<MemOptVarInfoMapList>(kMemOptVarInfoMapList));
    auto out_var_op_iter =
        (*last_live_ops_of_vars_)[scope_idx].find(out_var->Name());

    // In Reduce mode, some output variable(gradient of parameter) does not have
    // last live ops
    details::ComputationOpHandle *last_live_op_of_in_var = nullptr;
    if (out_var_op_iter == (*last_live_ops_of_vars_)[scope_idx].end()) {
      last_live_op_of_in_var = op;
    } else {
      PADDLE_ENFORCE_EQ(
          out_var_op_iter->second.ops().empty(), false,
          platform::errors::InvalidArgument(
              "Var(%s)'s last live op should not empty.", out_var->Name()));
      last_live_op_of_in_var = *(out_var_op_iter->second.ops().begin());
    }

    auto *last_live_ops_of_in_var =
        (*last_live_ops_of_vars_)[scope_idx][in_var->Name()].mutable_ops();
    // last_live_ops_of_in_var->clear();
    last_live_ops_of_in_var->insert(last_live_op_of_in_var);

    auto in_var_info_iter = (*var_infos_)[scope_idx].find(in_var->Name());
    PADDLE_ENFORCE_NE(
        in_var_info_iter, (*var_infos_)[scope_idx].end(),
        platform::errors::NotFound("Cannot find variable %s.", in_var->Name()));

    in_var_info_iter->second->SetRefCnt(2);  // before inplace, it is 1
  }
};

void InplaceAddToOpPass::Run(Graph *graph) const {
  const auto &last_live_ops =
      Get<std::vector<LastLiveOpsOfVars>>(kLastLiveOpsOfVars);
  bool use_cuda = Get<bool>(kUseCuda);

  // Currently, only perform InplaceAddToOpPass on cuda place
  if (!use_cuda) {
    return;
  }

  // Step 1: Build a reverse map of last_live_ops
  // i.e.: op -> vars
  std::unordered_map<details::ComputationOpHandle *,
                     std::unordered_map<std::string, ir::Node *>>
      candidate_ops;
  for (auto &each_scope_ops : last_live_ops) {
    for (auto &pair : each_scope_ops) {
      // If variable has more than 1 last lived ops, this variable cannot
      // be inplaced.
      if (pair.second.ops().size() != 1) {
        continue;
      }

      auto *op = *(pair.second.ops().begin());
      const std::string &op_type = op->GetOp()->Type();
      const framework::OpDesc *op_desc = op->Node()->Op();
      PADDLE_ENFORCE_NOT_NULL(
          op_desc, platform::errors::NotFound("Op(%s) can not find opdesc.",
                                              op->Name()));

      // only grad op should be processed.
      if (op_type != "grad_add") {
        continue;
      }

      const std::string &var_name = pair.first;
      auto in_nodes = this->FindNodesByName(var_name, op->Node()->inputs);
      if (in_nodes.size() == 1) {
        candidate_ops[op][var_name] = *in_nodes.begin();
      }
      VLOG(4) << "Find op " << op_type << " with input(" << var_name
              << ") that can do inplace add to";
    }
  }

  // Step 2: Check which vars can be inplaced indeed
  for (auto &op_vars_pair : candidate_ops) {
    auto *op = op_vars_pair.first;

    // The original gradient accumulation is g = sum(g_0, g_1,..., g_n), and it
    // could be changed as follws if inplace addto is enabled:
    // g_sum_0 = g_0
    // g_sum_1 = grad_add(g_sum_0, g_1)
    // g_sum_2 = grad_add(g_sum_1, g_2)
    // ...
    // g_sum_n = grad_add(g_sum_n-1, g_n)

    // here we will add inplace for each grad_add, for example, for the first
    // grad_add, g_sum_0 -> g1, g_sum_1 -> g1, and set grad_add as skipped.

    const std::string &op_type = op->GetOp()->Type();

    PADDLE_ENFORCE_EQ(op->Node()->inputs.size(), 2,
                      platform::errors::InvalidArgument(
                          "The size of inputs of %s should be 2, but got %d",
                          op_type, op->Node()->inputs.size()));
    PADDLE_ENFORCE_EQ(op->Node()->outputs.size(), 1,
                      platform::errors::InvalidArgument(
                          "The size of outputs of %s should be 1, but got %d",
                          op_type, op->Node()->outputs.size()));

    auto *left_var_ptr = dynamic_cast<details::VarHandle *>(
        &(op->Node()->inputs[0]->Wrapper<details::VarHandleBase>()));
    auto *right_var_ptr = dynamic_cast<details::VarHandle *>(
        &(op->Node()->inputs[1]->Wrapper<details::VarHandleBase>()));
    auto *out_var_ptr = dynamic_cast<details::VarHandle *>(
        &(op->Node()->outputs[0]->Wrapper<details::VarHandleBase>()));

    if (left_var_ptr == nullptr || right_var_ptr == nullptr ||
        out_var_ptr == nullptr) {
      continue;
    }

    // auto *left_generated_op = dynamic_cast<details::ComputationOpHandle *>(
    //     left_var_ptr->GeneratedOp());

    auto *right_generated_op = dynamic_cast<details::ComputationOpHandle *>(
        right_var_ptr->GeneratedOp());
    auto *out_generated_op = dynamic_cast<details::ComputationOpHandle *>(
        out_var_ptr->GeneratedOp());

    // NOTE(zhiqiu): currently, only conv2d_grad supports addto strategy
    if (right_generated_op->Name() != "conv2d_grad") {
      continue;
    }

    // NOTE(zhiqiu): Normally, if we inplace a->b, we should let a generated
    // before b. However, in the situation of inplace addto, we do not care
    // the order, since a+b is equal to b+a. Is there any exception for that?

    // AddDependencyVar(right_generated_op, left_generated_op);
    // no need, as discussed above.

    // step (a): inplace right_var->left_var of grad_add
    this->AddReuseVar(right_generated_op, left_var_ptr, right_var_ptr);
    UpdateLastLiveOpOfVar(right_generated_op, left_var_ptr, right_var_ptr);

    VLOG(4) << "Inplace performed in op "
            << right_generated_op->GetOp()->Type() << ": "
            << left_var_ptr->Name() << " -> " << right_var_ptr->Name()
            << ". Debug String is: "
            << right_generated_op->GetOp()->DebugString()
            << ". ReuseType: " << ReuseType();

    // step (b): inplace out -> right_var of grad_add
    this->AddReuseVar(out_generated_op, right_var_ptr, out_var_ptr, true);

    VLOG(4) << "Inplace performed in op " << op_type << ": "
            << left_var_ptr->Name() << " -> " << out_var_ptr->Name()
            << ". Debug String is: " << op->GetOp()->DebugString()
            << ". ReuseType: " << ReuseType();

    // step (c): make right_var cannot inplace afterwards. canbe done
    // aotomatically since CollectReusedVars is called before any reuse.

    // step (d): make right_var's generated op use addto
    right_generated_op->GetOp()->SetAttr("use_addto", true);

    // step (e): make grad_add skip running
    op->SetSkipRunning(true);
  }
}

}  // namespace ir
}  // namespace framework
}  // namespace paddle

REGISTER_PASS(inplace_addto_op_pass, paddle::framework::ir::InplaceAddToOpPass)
    .RequirePassAttr(paddle::framework::ir::kMemOptVarInfoMapList)
    .RequirePassAttr(paddle::framework::ir::kLastLiveOpsOfVars)
    .RequirePassAttr(paddle::framework::ir::kUseCuda);
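The comments inside Run() describe the rewrite this pass performs: a chain of grad_add accumulations is collapsed so the conv2d_grad kernel adds straight into the existing accumulation buffer (use_addto) and the explicit grad_add op is skipped. A toy numeric sketch of why that is legal (plain C++, values invented):

#include <cassert>
#include <vector>

// g_sum = g_0 + g_1 computed two ways: explicit add vs. "addto" accumulation,
// where the second producer adds its result into the existing buffer in place.
int main() {
  std::vector<float> g0 = {1.f, 2.f}, g1 = {10.f, 20.f};

  // Explicit grad_add: out = g0 + g1.
  std::vector<float> out(2);
  for (int i = 0; i < 2; ++i) out[i] = g0[i] + g1[i];

  // Inplace addto: the buffer already holds g0; the g1 producer accumulates
  // into it (beta = 1 in the cuDNN call), so no separate add op is needed.
  std::vector<float> buf = g0;
  for (int i = 0; i < 2; ++i) buf[i] += g1[i];

  assert(buf == out);  // same result, one fewer kernel and one fewer buffer
  return 0;
}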
paddle/fluid/framework/ir/memory_optimize_pass/memory_reuse_pass.cc

...
@@ -13,6 +13,7 @@
 // limitations under the License.
 #include "paddle/fluid/framework/ir/memory_optimize_pass/memory_reuse_pass.h"
 #include <functional>
 #include <map>
 #include <string>
...
@@ -73,6 +74,7 @@ bool MemoryReusePass::TryReuseVar(details::VarHandle *in_var,
                         out_var->Name()));
   if (IsVarPairReusable(*in_var, *out_var)) {
     AddReuseVar(op, in_var, out_var);
+    UpdateLastLiveOpOfVar(op, in_var, out_var);
     return true;
   } else {
     return false;
...
@@ -324,7 +326,8 @@ bool MemoryReusePass::IsVarPairReusable(
 void MemoryReusePass::AddReuseVar(details::ComputationOpHandle *op,
                                   details::VarHandle *in_var,
-                                  details::VarHandle *out_var) const {
+                                  details::VarHandle *out_var,
+                                  bool share_dims) const {
   PADDLE_ENFORCE_GT(
       (*var_infos_)[op->GetScopeIdx()].count(in_var->Name()), 0,
       platform::errors::NotFound("Var(%s) does not in mem opt var infos.",
...
@@ -344,13 +347,15 @@ void MemoryReusePass::AddReuseVar(details::ComputationOpHandle *op,
     share_buffer_op->AddInput(in_var);
   }
+  if (share_dims) {
+    share_buffer_op->SetShareDims(true);
+  }
   share_buffer_op->AddReuseVarPair(
       (*var_infos_)[op->GetScopeIdx()].at(in_var->Name()).get(),
       out_var->Name());
   reused_in_var_names_[op->GetScopeIdx()].insert(in_var->Name());
   reused_out_var_names_[op->GetScopeIdx()].insert(out_var->Name());
-
-  UpdateLastLiveOpOfVar(op, in_var, out_var);
 }

 // 1. Set last living op of in_var to be any last living op of out_var
...
...
paddle/fluid/framework/ir/memory_optimize_pass/memory_reuse_pass.h

...
@@ -18,6 +18,7 @@
 #include <unordered_map>
 #include <unordered_set>
 #include <vector>
 #include "paddle/fluid/framework/details/computation_op_handle.h"
 #include "paddle/fluid/framework/details/multi_devices_helper.h"
 #include "paddle/fluid/framework/details/share_tensor_buffer_op_handle.h"
...
@@ -92,6 +93,12 @@ class MemoryReusePass : public Pass {
   int64_t GetMemorySize(const details::VarHandle &var) const;
+  void AddReuseVar(details::ComputationOpHandle *op, details::VarHandle *in_var,
+                   details::VarHandle *out_var, bool share_dims = false) const;
+  virtual void UpdateLastLiveOpOfVar(details::ComputationOpHandle *op,
+                                     details::VarHandle *in_var,
+                                     details::VarHandle *out_var) const;

  private:
   VarDesc *GetVarDesc(const details::VarHandle &var) const;
...
@@ -109,13 +116,6 @@ class MemoryReusePass : public Pass {
   void CollectReusedVars() const;
-  void AddReuseVar(details::ComputationOpHandle *op,
-                   details::VarHandle *in_var,
-                   details::VarHandle *out_var) const;
-  void UpdateLastLiveOpOfVar(details::ComputationOpHandle *op,
-                             details::VarHandle *in_var,
-                             details::VarHandle *out_var) const;

  private:
   mutable Graph *graph_;
   mutable bool use_cuda_;
...
...
paddle/fluid/framework/ir/repeated_fc_relu_fuse_pass.cc

...
@@ -18,6 +18,7 @@ limitations under the License. */
 #include <unordered_set>
 #include <vector>
 #include "paddle/fluid/framework/lod_tensor.h"
+#include "paddle/fluid/framework/op_version_registry.h"
 #define MAX_NUM_FC 10
...
@@ -174,6 +175,10 @@ void BuildRepeatedFCReluPattern(PDPattern* pattern,
     if (x->outputs.size() <= 0 || x->inputs.size() <= 0U) {
       return false;
     }
+    if (x->IsVar() && x->Var() && x->Var()->GetShape().size() > 2) {
+      LOG(WARNING) << "repeated fc relu only supports input dims = 2";
+      return false;
+    }
     int fc_idx = FindFCIdx(x);
     if (fc_idx < 0) {
       return false;
...
@@ -384,3 +389,8 @@ void RepeatedFCReluFusePass::ApplyImpl(ir::Graph* graph) const {
 REGISTER_PASS(repeated_fc_relu_fuse_pass,
               paddle::framework::ir::RepeatedFCReluFusePass);
+REGISTER_PASS_CAPABILITY(repeated_fc_relu_fuse_pass)
+    .AddCombination(
+        paddle::framework::compatible::OpVersionComparatorCombination()
+            .EQ("fc", 0)
+            .EQ("relu", 0));
paddle/fluid/framework/ir/shuffle_channel_detect_pass.cc

...
@@ -16,6 +16,7 @@
 #include "paddle/fluid/framework/ir/graph_viz_pass.h"
 #include "paddle/fluid/framework/ir/shuffle_channel_detect_pass.h"
+#include "paddle/fluid/framework/op_version_registry.h"
 namespace paddle {
 namespace framework {
...
@@ -34,6 +35,8 @@ void ShuffleChannelDetectPass::ApplyImpl(ir::Graph* graph) const {
   const std::string pattern_name = "shufflechannel_pattern";
   FusePassBase::Init(pattern_name, graph);
+  LOG(WARNING) << "There is fluid.layers.shuffle_channel API already, you can "
+                  "use it instead of (reshape + transpose +reshape)";
   GraphPatternDetector gpd;
   auto* x = gpd.mutable_pattern()
                 ->NewNode("x")
...
@@ -93,3 +96,8 @@ void ShuffleChannelDetectPass::ApplyImpl(ir::Graph* graph) const {
 REGISTER_PASS(shuffle_channel_detect_pass,
               paddle::framework::ir::ShuffleChannelDetectPass);
+REGISTER_PASS_CAPABILITY(shuffle_channel_detect_pass)
+    .AddCombination(
+        paddle::framework::compatible::OpVersionComparatorCombination()
+            .EQ("reshape2", 0)
+            .EQ("transpose2", 0));
paddle/fluid/framework/ir/squared_mat_sub_fuse_pass.cc

...
@@ -17,6 +17,7 @@
 #include <unordered_set>
 #include <vector>
 #include "paddle/fluid/framework/lod_tensor.h"
+#include "paddle/fluid/framework/op_version_registry.h"
 namespace paddle {
 namespace framework {
...
@@ -77,7 +78,8 @@ PDNode* BuildSquaredMatSubPattern(PDPattern* pattern,
   };
   auto is_fusion_input_var = [=](Node* x, const std::string& arg_name) {
-    bool basic = var_is_op_input(x, "matmul", arg_name) &&
+    bool basic = (var_is_op_input(x, "matmul_v2", arg_name) ||
+                  var_is_op_input(x, "matmul", arg_name)) &&
                  var_is_op_input(x, "square", "X");
     if (!basic) {
       return false;
...
@@ -88,7 +90,8 @@ PDNode* BuildSquaredMatSubPattern(PDPattern* pattern,
     }
     auto* squared_x = squared_x_op->outputs[0];
     bool next_is_matmul_from_arg =
-        var_is_op_input(squared_x, "matmul", arg_name) &&
+        (var_is_op_input(squared_x, "matmul_v2", arg_name) ||
+         var_is_op_input(squared_x, "matmul", arg_name)) &&
         squared_x->outputs.size() == 1 &&
        squared_x->outputs[0]->outputs.size() == 1;
     if (!next_is_matmul_from_arg) {
...
@@ -103,7 +106,8 @@ PDNode* BuildSquaredMatSubPattern(PDPattern* pattern,
   auto is_fusion_first_mul_out = [=](Node* x) -> bool {
     bool input_is_matmul_op = x && x->inputs.size() == 1 &&
                               x->inputs[0]->IsOp() &&
-                              x->inputs[0]->Op()->Type() == "matmul";
+                              (x->inputs[0]->Op()->Type() == "matmul_v2" ||
+                               x->inputs[0]->Op()->Type() == "matmul");
     if (!input_is_matmul_op) {
       return false;
     }
...
@@ -167,7 +171,8 @@ PDNode* BuildSquaredMatSubPattern(PDPattern* pattern,
   auto* matmul_xy_op = pattern->NewNode(
       [=](Node* x) {
-        return x && x->IsOp() && x->Op()->Type() == "matmul" &&
+        return x && x->IsOp() &&
+               (x->Op()->Type() == "matmul_v2" ||
+                x->Op()->Type() == "matmul") &&
               is_fusion_first_mul_out(x->outputs[0]);
       },
       name_scope + "/matmul_xy_op");
...
@@ -189,7 +194,9 @@ PDNode* BuildSquaredMatSubPattern(PDPattern* pattern,
   auto is_fusion_mat_squared_x_y_op_out = [=](Node* x) -> bool {
     bool basic = x && x->IsVar() && x->inputs.size() == 1 &&
                 x->inputs[0]->IsOp() &&
-                 x->inputs[0]->Op()->Type() == "matmul";
+                 (x->inputs[0]->Op()->Type() == "matmul_v2" ||
+                  x->inputs[0]->Op()->Type() == "matmul");
     if (!basic) {
       return false;
     }
...
@@ -206,7 +213,8 @@ PDNode* BuildSquaredMatSubPattern(PDPattern* pattern,
   auto* matmul_squared_x_y_op = pattern->NewNode(
       [=](Node* x) {
-        return x && x->IsOp() && x->Op()->Type() == "matmul" &&
+        return x && x->IsOp() &&
+               (x->Op()->Type() == "matmul_v2" ||
+                x->Op()->Type() == "matmul") &&
               is_fusion_mat_squared_x_y_op_out(x->outputs[0]);
       },
       name_scope + "/matmul_squared_x_y_op");
...
@@ -378,3 +386,13 @@ void SquaredMatSubFusePass::ApplyImpl(ir::Graph* graph) const {
 REGISTER_PASS(squared_mat_sub_fuse_pass,
               paddle::framework::ir::SquaredMatSubFusePass);
+REGISTER_PASS_CAPABILITY(squared_mat_sub_fuse_pass)
+    .AddCombination(
+        paddle::framework::compatible::OpVersionComparatorCombination()
+            .EQ("matmul", 0)
+            .EQ("matmul_v2", 0)
+            .EQ("square", 0)
+            .EQ("elementwise_mul", 0)
+            .EQ("elementwise_sub", 0)
+            .EQ("fill_constant", 0)
+            .EQ("fusion_squared_mat_sub", 0));
paddle/fluid/framework/ir/squared_mat_sub_fuse_pass.h

...
@@ -24,7 +24,7 @@ namespace framework {
 namespace ir {

 /**
- * Fuse ( (A.^2 * B.^2) - (A * B).^2 ) .* scalar
+ * Fuse ( (A * B).^2 - (A.^2 * B.^2) ) .* scalar
 */
 class SquaredMatSubFusePass : public FusePassBase {
  public:
...
...
paddle/fluid/framework/operator.h

...
@@ -157,6 +157,14 @@ class OperatorBase {
                       platform::errors::NotFound(
                           "(%s) is not found in AttributeMap.", name));
     return BOOST_GET_CONST(T, attrs_.at(name));
   }
+  void SetAttr(const std::string& name, const Attribute& v) {
+    PADDLE_ENFORCE_EQ(
+        HasAttr(name), true,
+        platform::errors::NotFound(
+            "The attribute %s is not found in operator %s", name, Type()));
+    attrs_[name] = v;
+  }
   const AttributeMap& Attrs() const { return attrs_; }
   const VariableNameMap& Inputs() const { return inputs_; }
...
...
paddle/fluid/framework/parallel_executor.cc

...
@@ -13,12 +13,14 @@ See the License for the specific language governing permissions and
 limitations under the License. */
 #include "paddle/fluid/framework/parallel_executor.h"
 #include <algorithm>
 #include <memory>
 #include <string>
 #include <tuple>
 #include <utility>
 #include <vector>
 #include "paddle/fluid/framework/details/async_ssa_graph_executor.h"
 #include "paddle/fluid/framework/details/fast_threaded_ssa_graph_executor.h"
 #include "paddle/fluid/framework/details/multi_devices_helper.h"
...
@@ -108,6 +110,11 @@ class ParallelExecutorPrivate {
    * them.
    */
   inline void SetSkipMemoryReuse(size_t scope_idx, const std::string &name) {
+    if (mem_opt_var_infos_.size() == 0) {
+      VLOG(4) << "The mem_opt_var_infos_ is empty, maybe no memory "
+                 "optimization strategy is enabled";
+      return;
+    }
     auto iter = mem_opt_var_infos_[scope_idx].find(name);
     if (iter != mem_opt_var_infos_[scope_idx].end()) {
       iter->second->SetSkipMemoryReuse(true);
...
@@ -308,6 +315,7 @@ ir::Graph *ParallelExecutorPrivate::ApplyMemoryOptimizePass(ir::Graph *graph) {
   }
   bool need_mem_opt = build_strategy_.enable_inplace_ ||
+                      build_strategy_.enable_addto_ ||
                       build_strategy_.memory_optimize_.get() || is_gc_enabled;
   if (!need_mem_opt) return graph;
...
@@ -320,6 +328,16 @@ ir::Graph *ParallelExecutorPrivate::ApplyMemoryOptimizePass(ir::Graph *graph) {
   graph = ref_cnt_pass->Apply(graph);
   VLOG(10) << "ReferenceCountPass Applied";
+  if (build_strategy_.enable_addto_) {
+    auto addto_pass = ir::PassRegistry::Instance().Get("inplace_addto_op_pass");
+    addto_pass->SetNotOwned(ir::kMemOptVarInfoMapList, &mem_opt_var_infos_);
+    addto_pass->SetNotOwned(ir::kLastLiveOpsOfVars, &last_live_ops_of_vars);
+    addto_pass->SetNotOwned(ir::kUseCuda, &use_cuda_);
+    VLOG(10) << "Start to apply inplace_addto_op_pass";
+    graph = addto_pass->Apply(graph);
+    VLOG(10) << "inplace_addto_op_pass Applied";
+  }
   if (build_strategy_.enable_inplace_) {
     auto inplace_pass =
         ir::PassRegistry::Instance().Get("buffer_shared_inplace_pass");
...
@@ -1068,3 +1086,4 @@ USE_PASS(reference_count_pass);
 USE_PASS(eager_deletion_pass);
 USE_PASS(buffer_shared_inplace_pass);
 USE_PASS(buffer_shared_cross_op_memory_reuse_pass);
+USE_PASS(inplace_addto_op_pass);
paddle/fluid/inference/api/paddle_pass_builder.cc

...
@@ -156,7 +156,8 @@ CpuPassStrategy::CpuPassStrategy() : PassStrategy({}) {
          // "seqpool_concat_fuse_pass",    //
          "seqpool_cvm_concat_fuse_pass",  //
          // "embedding_fc_lstm_fuse_pass", //
-         "fc_lstm_fuse_pass",             //
+         // TODO(wilber): fix correctness problem.
+         // "fc_lstm_fuse_pass",             //
          "mul_lstm_fuse_pass",            //
          "fc_gru_fuse_pass",              //
          "mul_gru_fuse_pass",             //
...
...
paddle/fluid/inference/tensorrt/convert/emb_eltwise_layernorm.cc

...
@@ -80,10 +80,10 @@ class EmbEltwiseLayerNormOpConverter : public OpConverter {
     nvinfer1::ILayer* layer = nullptr;

     if (engine_->with_dynamic_shape()) {
-      plugin::DynamicPluginTensorRT* plugin = nullptr;
-      plugin = new plugin::EmbEltwiseLayernormPluginDynamic<float>(
-          input_embs, bias, scale, emb_sizes, bias_size, scale_size, hidden,
-          eps);
+      auto use_fp16 = engine_->WithFp16();
+      auto plugin = new plugin::EmbEltwiseLayernormPluginDynamic(
+          input_embs, bias, scale, emb_sizes, bias_size, scale_size, hidden,
+          eps, use_fp16);
       layer = engine_->AddPluginV2(input_ids.data(), input_num, plugin);
     } else {
       PADDLE_THROW(platform::errors::Fatal(
...
...
paddle/fluid/inference/tensorrt/plugin/emb_eltwise_layernorm_plugin.cu

...
@@ -32,13 +32,34 @@ namespace plugin {
 #if IS_TRT_VERSION_GE(6000)

 template <typename T>
-int EmbEltwiseLayernormPluginDynamic<T>::initialize() {
+EmbEltwiseLayernormPluginDynamicImpl<T>::~EmbEltwiseLayernormPluginDynamicImpl() {
+  this->terminate();
+}
+
+inline half fp32tofp16(float x) { return static_cast<half>(x); }
+
+template <typename T>
+int EmbEltwiseLayernormPluginDynamicImpl<T>::initialize() {
   embs_gpu_.resize(embs_.size());
   for (int i = 0; i < embs_.size(); i++) {
     if (embs_[i]) {
-      cudaMalloc(&embs_gpu_[i], sizeof(float) * emb_sizes_[i]);
-      cudaMemcpy(embs_gpu_[i], embs_[i], emb_sizes_[i] * sizeof(float),
+      T *host_ptr;
+      auto size = emb_sizes_[i];
+      if (std::is_same<T, half>::value) {
+        host_ptr = new T[size];
+        std::transform(embs_[i], (embs_[i] + size), host_ptr, fp32tofp16);
+      } else {
+        host_ptr = reinterpret_cast<T *>(embs_[i]);
+      }
+      cudaMalloc(&embs_gpu_[i], sizeof(T) * size);
+      cudaMemcpy(embs_gpu_[i], host_ptr, size * sizeof(T),
                  cudaMemcpyHostToDevice);
+      if (std::is_same<T, half>::value) {
+        delete[] host_ptr;
+      }
     }
   }
...
@@ -53,11 +74,105 @@ int EmbEltwiseLayernormPluginDynamic<T>::initialize() {
                cudaMemcpyHostToDevice);
   }
+  int input_num = embs_.size();
+  in_ptr_tensor_.Resize({input_num});
+  emb_ptr_tensor_.Resize({input_num});
+  cudaGetDevice(&device_id_);
+  auto emb_ptr_gpu_d =
+      emb_ptr_tensor_.mutable_data<int64_t>(platform::CUDAPlace(device_id_));
+  cudaMemcpy(emb_ptr_gpu_d, embs_gpu_.data(), sizeof(uintptr_t) * input_num,
+             cudaMemcpyHostToDevice);
   return 0;
 }

 template <typename T>
-nvinfer1::DimsExprs EmbEltwiseLayernormPluginDynamic<T>::getOutputDimensions(
+void EmbEltwiseLayernormPluginDynamicImpl<T>::terminate() {
+  for (int i = 0; i < embs_gpu_.size(); ++i) {
+    if (embs_gpu_[i]) {
+      cudaFree(embs_gpu_[i]);
+      embs_gpu_[i] = nullptr;
+    }
+  }
+  if (bias_gpu_) {
+    cudaFree(bias_gpu_);
+    bias_gpu_ = nullptr;
+  }
+  if (scale_gpu_) {
+    cudaFree(scale_gpu_);
+    scale_gpu_ = nullptr;
+  }
+}
+
+template <typename T>
+int EmbEltwiseLayernormPluginDynamicImpl<T>::enqueue(
+    const nvinfer1::PluginTensorDesc *input_desc,
+    const nvinfer1::PluginTensorDesc *output_desc, const void *const *inputs,
+    void *const *outputs, void *workspace, cudaStream_t stream) {
+  auto id_dims = input_desc[0].dims;
+  int batch = id_dims.d[0];
+  int seq_len = id_dims.d[1];
+  int input_num = embs_.size();
+
+  auto in_ptr_gpu_d =
+      in_ptr_tensor_.mutable_data<int64_t>(platform::CUDAPlace(device_id_));
+  auto emb_ptr_gpu_d =
+      emb_ptr_tensor_.mutable_data<int64_t>(platform::CUDAPlace(device_id_));
+
+  auto new_input_ptr = reinterpret_cast<uintptr_t>(inputs[0]);
+  if (old_input_ptr_ != new_input_ptr) {
+    old_input_ptr_ = new_input_ptr;
+    cudaMemcpyAsync(in_ptr_gpu_d, reinterpret_cast<const void *>(inputs),
+                    sizeof(uintptr_t) * input_num, cudaMemcpyHostToDevice,
+                    stream);
+  }
+
+  auto out_type = output_desc[0].type;
+  if (std::is_same<T, float>::value) {
+    PADDLE_ENFORCE_EQ(
+        out_type == nvinfer1::DataType::kFLOAT, true,
+        platform::errors::InvalidArgument(
+            "The EmbEltwiseLayernorm Plugin only support fp32 input."));
+  } else if (std::is_same<T, half>::value) {
+    PADDLE_ENFORCE_EQ(
+        out_type == nvinfer1::DataType::kHALF, true,
+        platform::errors::InvalidArgument(
+            "The EmbEltwiseLayernorm Plugin only support fp16 input."));
+  } else {
+    PADDLE_THROW(platform::errors::Fatal(
+        "Unsupport data type, the out type of EmbEltwiseLayernorm should be "
+        "float or half."));
+  }
+
+  auto *output_d = reinterpret_cast<T *>(outputs[0]);
+  operators::math::EmbEltwiseLayerNormFunctor<T> emb_eltwise_layernorm_func;
+  emb_eltwise_layernorm_func(batch, seq_len, hidden_size_, in_ptr_gpu_d,
+                             scale_gpu_, bias_gpu_, emb_ptr_gpu_d, output_d,
+                             eps_, input_num, stream);
+  return cudaGetLastError() != cudaSuccess;
+}
+
+template class EmbEltwiseLayernormPluginDynamicImpl<float>;
+#ifdef SUPPORTS_CUDA_FP16
+template class EmbEltwiseLayernormPluginDynamicImpl<half>;
+#endif  // SUPPORTS_CUDA_FP16
+
+int EmbEltwiseLayernormPluginDynamic::initialize() {
+  impl_->initialize();
+  return 0;
+}
+
+void EmbEltwiseLayernormPluginDynamic::terminate() { impl_->terminate(); }
+
+nvinfer1::DimsExprs EmbEltwiseLayernormPluginDynamic::getOutputDimensions(
     int output_index, const nvinfer1::DimsExprs *inputs, int nb_inputs,
     nvinfer1::IExprBuilder &expr_builder) {  // NOLINT
   PADDLE_ENFORCE_EQ(output_index, 0,
...
@@ -76,18 +191,7 @@ nvinfer1::DimsExprs EmbEltwiseLayernormPluginDynamic<T>::getOutputDimensions(
   return ret;
 }

-template <typename T>
-void EmbEltwiseLayernormPluginDynamic<T>::terminate() {
-  for (auto ptr : embs_gpu_) {
-    if (ptr) cudaFree(ptr);
-  }
-  if (bias_gpu_) cudaFree(bias_gpu_);
-  if (scale_gpu_) cudaFree(scale_gpu_);
-}
-
-template <typename T>
-bool EmbEltwiseLayernormPluginDynamic<T>::supportsFormatCombination(
+bool EmbEltwiseLayernormPluginDynamic::supportsFormatCombination(
     int pos, const nvinfer1::PluginTensorDesc *in_out, int nb_inputs,
     int nb_outputs) {
   PADDLE_ENFORCE_NOT_NULL(
...
@@ -98,6 +202,11 @@ bool EmbEltwiseLayernormPluginDynamic<T>::supportsFormatCombination(
                         "The EmbEltwiseLayerNorm's output should be one"
                         "but it's (%d) outputs.",
                         nb_outputs));
+  PADDLE_ENFORCE_EQ(nb_outputs, 1,
+                    platform::errors::InvalidArgument(
+                        "The EmbEltwiseLayerNorm's output should be one"
+                        "but it's (%d) outputs.",
+                        nb_outputs));
   PADDLE_ENFORCE_LT(
       pos, nb_inputs + nb_outputs,
       platform::errors::InvalidArgument("The pos(%d) should be less than the "
...
@@ -122,7 +231,7 @@ bool EmbEltwiseLayernormPluginDynamic<T>::supportsFormatCombination(
   }

   if (pos == all_nums - 1) {
-    if (sizeof(T) == sizeof(float)) {
+    if (with_fp16_ == false) {
       return desc.type == nvinfer1::DataType::kFLOAT;
     } else {
       return desc.type == nvinfer1::DataType::kHALF;
...
@@ -131,84 +240,27 @@ bool EmbEltwiseLayernormPluginDynamic<T>::supportsFormatCombination(
   return false;
 }

-template <typename T>
-nvinfer1::DataType EmbEltwiseLayernormPluginDynamic<T>::getOutputDataType(
+nvinfer1::DataType EmbEltwiseLayernormPluginDynamic::getOutputDataType(
     int index, const nvinfer1::DataType *input_types, int nb_inputs) const {
   PADDLE_ENFORCE_EQ(
       index, 0,
       platform::errors::InvalidArgument(
           "The EmbEltwiseLayernorm Plugin only has one input, so the "
           "index value should be 0, but get %d.",
           index));
-  return nvinfer1::DataType::kFLOAT;
+  if (with_fp16_)
+    return nvinfer1::DataType::kHALF;
+  else
+    return nvinfer1::DataType::kFLOAT;
 }

-template <typename T>
-int EmbEltwiseLayernormPluginDynamic<T>::enqueue(
+int EmbEltwiseLayernormPluginDynamic::enqueue(
     const nvinfer1::PluginTensorDesc *input_desc,
     const nvinfer1::PluginTensorDesc *output_desc, const void *const *inputs,
     void *const *outputs, void *workspace, cudaStream_t stream) {
-  auto id_dims = input_desc[0].dims;
-  int batch = id_dims.d[0];
-  int seq_len = id_dims.d[1];
-  int input_num = embs_.size();
-
-  framework::Tensor in_ptr_tensor, emb_ptr_tensor;
-  int device_id;
-  cudaGetDevice(&device_id);
-  in_ptr_tensor.Resize({input_num});
-  emb_ptr_tensor.Resize({input_num});
-  int64_t *in_ptr_gpu_d =
-      in_ptr_tensor.mutable_data<int64_t>(platform::CUDAPlace(device_id));
-  int64_t *emb_ptr_gpu_d =
-      emb_ptr_tensor.mutable_data<int64_t>(platform::CUDAPlace(device_id));
-
-  std::vector<uintptr_t> in_ptr, emb_ptr;
-  for (int i = 0; i < input_num; i++) {
-    in_ptr.push_back(reinterpret_cast<uintptr_t>(inputs[i]));
-    emb_ptr.push_back(reinterpret_cast<uintptr_t>(embs_gpu_[i]));
-  }
-  cudaMemcpyAsync(in_ptr_gpu_d, in_ptr.data(), sizeof(int64_t) * input_num,
-                  cudaMemcpyHostToDevice, stream);
-  cudaMemcpyAsync(emb_ptr_gpu_d, emb_ptr.data(), sizeof(int64_t) * input_num,
-                  cudaMemcpyHostToDevice, stream);
-
-  auto out_type = output_desc[0].type;
-  const unsigned tpb = 256;
-  const dim3 grid(seq_len, batch, 1);
-  const dim3 block(tpb, 1, 1);
-  if (sizeof(T) == sizeof(float)) {
-    PADDLE_ENFORCE_EQ(
-        out_type == nvinfer1::DataType::kFLOAT, true,
-        platform::errors::InvalidArgument(
-            "The EmbEltwiseLayernorm Plugin only support fp32 input."));
-  } else if (sizeof(T) == sizeof(int16_t)) {
-    PADDLE_ENFORCE_EQ(
-        out_type == nvinfer1::DataType::kHALF, true,
-        platform::errors::InvalidArgument(
-            "The EmbEltwiseLayernorm Plugin only support fp16 input."));
-  } else {
-    PADDLE_THROW(platform::errors::Fatal(
-        "Unsupport data type, the out type of EmbEltwiseLayernorm should be "
-        "float or half."));
-  }
-
-  T *output_d = static_cast<T *>(outputs[0]);
-  operators::math::EmbEltwiseLayerNormFunctor<T> emb_eltwise_layernorm_func;
-  emb_eltwise_layernorm_func(batch, seq_len, hidden_size_, in_ptr_gpu_d,
-                             scale_gpu_, bias_gpu_, emb_ptr_gpu_d, output_d,
-                             eps_, input_num, stream);
+  impl_->enqueue(input_desc, output_desc, inputs, outputs, workspace, stream);
   return cudaGetLastError() != cudaSuccess;
 }

-template class EmbEltwiseLayernormPluginDynamic<float>;
-#ifdef SUPPORTS_CUDA_FP16
-template class EmbEltwiseLayernormPluginDynamic<half>;
-#endif  // SUPPORTS_CUDA_FP16
-
 #endif

 }  // namespace plugin
...
...
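The new Impl::initialize() converts the fp32 embedding tables to the plugin's compute type on the host before uploading them; for the half specialization that is a plain element-wise std::transform. A standalone sketch of just that conversion step (it uses a uint16-backed stand-in for half so it compiles without CUDA headers; the conversion itself is deliberately simplified):

#include <algorithm>
#include <cstdint>
#include <vector>

// Stand-in for the device half type so the sketch builds without CUDA.
struct FakeHalf { std::uint16_t bits; };
static FakeHalf fp32tofp16(float x) {
  // Real code performs a proper IEEE fp32 -> fp16 conversion; truncation is
  // enough to show the data flow here.
  return FakeHalf{static_cast<std::uint16_t>(x)};
}

int main() {
  std::vector<float> emb = {0.f, 1.f, 2.f, 3.f};   // one embedding table
  std::vector<FakeHalf> host(emb.size());
  // Same shape as the plugin's conversion: transform fp32 -> half on the host;
  // the converted buffer would then be cudaMemcpy'd to the device.
  std::transform(emb.begin(), emb.end(), host.begin(), fp32tofp16);
  return host.size() == emb.size() ? 0 : 1;
}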
paddle/fluid/inference/tensorrt/plugin/emb_eltwise_layernorm_plugin.h

...
@@ -27,14 +27,76 @@ namespace tensorrt {
 namespace plugin {

 #if IS_TRT_VERSION_GE(6000)
+class EmbEltwiseLayernormPluginDynamicImplBase {
+ public:
+  EmbEltwiseLayernormPluginDynamicImplBase() {}
+  virtual ~EmbEltwiseLayernormPluginDynamicImplBase() {}
+
+  virtual int initialize() = 0;
+  virtual void terminate() = 0;
+  virtual int enqueue(const nvinfer1::PluginTensorDesc* inputDesc,
+                      const nvinfer1::PluginTensorDesc* outputDesc,
+                      const void* const* inputs, void* const* outputs,
+                      void* workspace, cudaStream_t stream) = 0;
+};
+
+template <typename T>
+class EmbEltwiseLayernormPluginDynamicImpl
+    : public EmbEltwiseLayernormPluginDynamicImplBase {
+ public:
+  explicit EmbEltwiseLayernormPluginDynamicImpl(std::vector<float*> input_embs,
+                                                float* bias, float* scale,
+                                                std::vector<int> emb_sizes,
+                                                int bias_size, int scale_size,
+                                                int hidden_size, float eps)
+      : embs_(input_embs),
+        bias_(bias),
+        scale_(scale),
+        emb_sizes_(emb_sizes),
+        bias_size_(bias_size),
+        scale_size_(scale_size),
+        hidden_size_(hidden_size),
+        eps_(eps) {}
+
+  ~EmbEltwiseLayernormPluginDynamicImpl();
+
+  int initialize();
+  void terminate();
+  int enqueue(const nvinfer1::PluginTensorDesc* inputDesc,
+              const nvinfer1::PluginTensorDesc* outputDesc,
+              const void* const* inputs, void* const* outputs, void* workspace,
+              cudaStream_t stream);
+
+ private:
+  std::vector<float*> embs_;
+  float* bias_{nullptr};
+  float* scale_{nullptr};
+
+  // data on devices
+  float* bias_gpu_{nullptr};
+  float* scale_gpu_{nullptr};
+  std::vector<T*> embs_gpu_;
+
+  std::vector<int> emb_sizes_;
+  int bias_size_;
+  int scale_size_;
+  int hidden_size_;
+  float eps_;
+
+  framework::Tensor in_ptr_tensor_, emb_ptr_tensor_;
+  int device_id_{0};
+  uintptr_t old_input_ptr_{0};
+};
+
 class EmbEltwiseLayernormPluginDynamic : public DynamicPluginTensorRT {
  public:
   explicit EmbEltwiseLayernormPluginDynamic(std::vector<float*> input_embs,
                                             float* bias, float* scale,
                                             std::vector<int> emb_sizes,
                                             int bias_size, int scale_size,
-                                            int hidden_size, float eps)
+                                            int hidden_size, float eps,
+                                            bool with_fp16)
       : embs_(input_embs),
         bias_(bias),
         scale_(scale),
...
@@ -42,51 +104,81 @@ class EmbEltwiseLayernormPluginDynamic : public DynamicPluginTensorRT {
         bias_size_(bias_size),
         scale_size_(scale_size),
         hidden_size_(hidden_size),
-        eps_(eps) {}
+        eps_(eps),
+        with_fp16_(with_fp16),
+        own_host_buff_(false) {
+    if (with_fp16) {
+#ifdef SUPPORTS_CUDA_FP16
+      impl_ = new EmbEltwiseLayernormPluginDynamicImpl<half>(
+          embs_, bias_, scale_, emb_sizes_, bias_size_, scale_size_,
+          hidden_size_, eps_);
+#else
+      PADDLE_THROW(platform::errors::Fatal(
+          "Unsupported data type, current GPU doesn't support half."));
+#endif  // SUPPORTS_CUDA_FP16
+    } else {
+      impl_ = new EmbEltwiseLayernormPluginDynamicImpl<float>(
+          embs_, bias_, scale_, emb_sizes_, bias_size_, scale_size_,
+          hidden_size_, eps_);
+    }
+  }

   EmbEltwiseLayernormPluginDynamic(void const* serial_data,
-                                   size_t serial_length) {
+                                   size_t serial_length)
+      : own_host_buff_(true) {
     DeserializeValue(&serial_data, &serial_length, &emb_sizes_);

-    embs_gpu_.resize(emb_sizes_.size());
+    embs_.resize(emb_sizes_.size());
     for (size_t i = 0; i < emb_sizes_.size(); i++) {
-      cudaMalloc(&embs_gpu_[i], sizeof(float) * emb_sizes_[i]);
-      cudaMemcpy(embs_gpu_[i], serial_data, emb_sizes_[i] * sizeof(float),
-                 cudaMemcpyHostToDevice);
+      auto size = emb_sizes_[i];
+      auto ptr = new float[size];
+      memcpy(ptr, serial_data, sizeof(float) * size);
+      embs_[i] = ptr;
       reinterpret_cast<char const*&>(serial_data) +=
           emb_sizes_[i] * sizeof(float);
       serial_length -= emb_sizes_[i] * sizeof(float);
-      embs_[i] = nullptr;
     }
     DeserializeValue(&serial_data, &serial_length, &bias_size_);
     DeserializeValue(&serial_data, &serial_length, &scale_size_);

-    cudaMalloc(&bias_gpu_, sizeof(float) * bias_size_);
-    cudaMemcpy(bias_gpu_, serial_data, bias_size_ * sizeof(float),
-               cudaMemcpyHostToDevice);
-    bias_ = nullptr;
+    if (bias_size_) {
+      bias_ = new float[bias_size_];
+      memcpy(bias_, serial_data, sizeof(float) * bias_size_);
+    }
     reinterpret_cast<char const*&>(serial_data) += bias_size_ * sizeof(float);
     serial_length -= bias_size_ * sizeof(float);

-    cudaMalloc(&scale_gpu_, sizeof(float) * scale_size_);
-    cudaMemcpy(scale_gpu_, serial_data, scale_size_ * sizeof(float),
-               cudaMemcpyHostToDevice);
-    scale_ = nullptr;
+    if (scale_size_) {
+      scale_ = new float[scale_size_];
+      memcpy(scale_, serial_data, sizeof(float) * scale_size_);
+    }
     reinterpret_cast<char const*&>(serial_data) += scale_size_ * sizeof(float);
     serial_length -= scale_size_ * sizeof(float);

     DeserializeValue(&serial_data, &serial_length, &hidden_size_);
     DeserializeValue(&serial_data, &serial_length, &eps_);
+    DeserializeValue(&serial_data, &serial_length, &with_fp16_);
+
+    if (with_fp16_) {
+#ifdef SUPPORTS_CUDA_FP16
+      impl_ = new EmbEltwiseLayernormPluginDynamicImpl<half>(
+          embs_, bias_, scale_, emb_sizes_, bias_size_, scale_size_,
+          hidden_size_, eps_);
+#else
+      PADDLE_THROW(platform::errors::Fatal(
+          "Unsupported data type, current GPU doesn't support half."));
+#endif  // SUPPORTS_CUDA_FP16
+    } else {
+      impl_ = new EmbEltwiseLayernormPluginDynamicImpl<float>(
+          embs_, bias_, scale_, emb_sizes_, bias_size_, scale_size_,
+          hidden_size_, eps_);
+    }
   }

   nvinfer1::IPluginV2DynamicExt* clone() const override {
     auto ptr = new EmbEltwiseLayernormPluginDynamic(
         embs_, bias_, scale_, emb_sizes_, bias_size_, scale_size_,
-        hidden_size_, eps_);
-    ptr->embs_gpu_ = embs_gpu_;
-    ptr->bias_gpu_ = bias_gpu_;
-    ptr->scale_gpu_ = scale_gpu_;
+        hidden_size_, eps_, with_fp16_);
     return ptr;
   }
...
@@ -95,6 +187,7 @@ class EmbEltwiseLayernormPluginDynamic : public DynamicPluginTensorRT {
   }
   int getNbOutputs() const override { return 1; }
   int initialize() override;
+  void terminate() override;

   size_t getSerializationSize() const override {
     int sum_num = 0;
...
@@ -110,24 +203,32 @@ class EmbEltwiseLayernormPluginDynamic : public DynamicPluginTensorRT {
     sum_num += (bias_size_ + scale_size_) * sizeof(float);
     sum_num += SerializedSize(hidden_size_);
     sum_num += SerializedSize(eps_);
-    // sum_num += SerializedSize(with_fp16_);
+    sum_num += SerializedSize(with_fp16_);

     return sum_num;
   }

-  void terminate() override;
   void serialize(void* buffer) const override {
-    // SerializeValue(&buffer, with_fp16_);
     SerializeValue(&buffer, emb_sizes_);
     for (size_t i = 0; i < emb_sizes_.size(); i++) {
-      SerializeCudaPointer(&buffer, embs_gpu_[i], emb_sizes_[i]);
+      auto size = emb_sizes_[i];
+      for (int j = 0; j < size; ++j) {
+        SerializeValue(&buffer, embs_[i][j]);
+      }
     }
     SerializeValue(&buffer, bias_size_);
     SerializeValue(&buffer, scale_size_);
-    SerializeCudaPointer(&buffer, bias_gpu_, bias_size_);
-    SerializeCudaPointer(&buffer, scale_gpu_, scale_size_);
+    for (int i = 0; i < bias_size_; ++i) {
+      SerializeValue(&buffer, bias_[i]);
+    }
+    for (int i = 0; i < scale_size_; ++i) {
+      SerializeValue(&buffer, scale_[i]);
+    }
     SerializeValue(&buffer, hidden_size_);
     SerializeValue(&buffer, eps_);
+    SerializeValue(&buffer, with_fp16_);
   }

   nvinfer1::DimsExprs getOutputDimensions(
...
@@ -158,23 +259,33 @@ class EmbEltwiseLayernormPluginDynamic : public DynamicPluginTensorRT {
       const nvinfer1::DataType* input_types,
       int nb_inputs) const override;

-  void destroy() override { delete this; }
+  void destroy() override {
+    if (own_host_buff_) {
+      for (auto ptr : embs_) {
+        delete[] ptr;
+      }
+      delete[] bias_;
+      delete[] scale_;
+    }
+    delete impl_;
+    delete this;
+  }

  private:
   std::vector<float*> embs_;
   float* bias_;
   float* scale_;

-  // data on devices
-  float* bias_gpu_;
-  float* scale_gpu_;
-  std::vector<float*> embs_gpu_;
-
   std::vector<int> emb_sizes_;
   int bias_size_;
   int scale_size_;
   int hidden_size_;
   float eps_;
+
+  bool with_fp16_;
+  bool own_host_buff_{false};
+  EmbEltwiseLayernormPluginDynamicImplBase* impl_{nullptr};
 };

 class EmbEltwiseLayernormPluginV2Creator : public nvinfer1::IPluginCreator {
...
@@ -198,8 +309,7 @@ class EmbEltwiseLayernormPluginV2Creator : public nvinfer1::IPluginCreator {
   nvinfer1::IPluginV2* deserializePlugin(const char* name,
                                          const void* serial_data,
                                          size_t serial_length) override {
-    return new EmbEltwiseLayernormPluginDynamic<float>(serial_data,
-                                                       serial_length);
+    return new EmbEltwiseLayernormPluginDynamic(serial_data, serial_length);
   }

   void setPluginNamespace(const char* lib_namespace) override {
...
...
paddle/fluid/inference/tests/api/trt_dynamic_shape_ernie_deserialize_test.cc

...
@@ -151,7 +151,7 @@ void trt_ernie(bool with_fp16, std::vector<float> result) {
   run(config, &out_data);         // serialize
   run(*config_deser, &out_data);  // deserialize
   for (size_t i = 0; i < out_data.size(); i++) {
-    EXPECT_NEAR(result[i], out_data[i], 1e-6);
+    EXPECT_NEAR(result[i], out_data[i], 1e-2);
   }
 }
...
@@ -159,13 +159,11 @@ TEST(AnalysisPredictor, no_fp16) {
   std::vector<float> result = {0.597841, 0.219972, 0.182187};
   trt_ernie(false, result);
 }

-TEST(AnalysisPredictor, fp16) {
 #ifdef SUPPORTS_CUDA_FP16
-  std::vector<float> result = {0.598336, 0.219558, 0.182106};
+TEST(AnalysisPredictor, fp16) {
+  std::vector<float> result = {0.59923654, 0.21923761, 0.18152587};
   trt_ernie(true, result);
-#endif
 }
+#endif  // SUPPORTS_CUDA_FP16

 }  // namespace inference
 }  // namespace paddle
paddle/fluid/operators/conv_cudnn_op.cu

...
@@ -14,6 +14,7 @@ limitations under the License. */
 #include <utility>
 #include <vector>
 #include "paddle/fluid/framework/eigen.h"
 #include "paddle/fluid/framework/op_registry.h"
 #include "paddle/fluid/framework/tensor.h"
...
@@ -287,7 +288,9 @@ class CUDNNConvOpKernel : public framework::OpKernel<T> {
 #endif

     // ------------------- cudnn conv forward ---------------------
-    ScalingParamType<T> alpha = 1.0f, beta = 0.0f;
+    ScalingParamType<T> alpha = 1.0f;
+    ScalingParamType<T> beta = ctx.Attr<bool>("use_addto") ? 1.0f : 0.0f;
+    VLOG(4) << "Conv: use_addto = " << ctx.Attr<bool>("use_addto");
     for (int i = 0; i < groups; i++) {
       workspace_handle.RunFunc(
           [&](void* workspace_ptr) {
...
@@ -609,9 +612,13 @@ class CUDNNConvGradOpKernel : public framework::OpKernel<T> {
     }

     // ------------------- cudnn conv backward data ---------------------
-    ScalingParamType<T> alpha = 1.0f, beta = 0.0f;
+    ScalingParamType<T> alpha = 1.0f;
+    ScalingParamType<T> beta = ctx.Attr<bool>("use_addto") ? 1.0f : 0.0f;
+    VLOG(4) << "Conv_grad: use_addto = " << ctx.Attr<bool>("use_addto");
     if (input_grad) {
-      // Because beta is zero, it is unnecessary to reset input_grad.
+      // When beta is 0, it is unnecessary to reset input_grad.
+      // When beta is 1, the output cannot be reset since addt strategy used.
       for (int i = 0; i < groups; i++) {
         workspace_handle.RunFunc(
             [&](void* cudnn_workspace_ptr) {
...
@@ -653,6 +660,9 @@ class CUDNNConvGradOpKernel : public framework::OpKernel<T> {
             ctx, &transformed_input_grad_channel, input_grad);
       }
     }
+
+    // filter_grad do not use inplace addto.
+    ScalingParamType<T> beta_filter = 0.0f;
     // ------------------- cudnn conv backward filter ---------------------
     if (filter_grad) {
       // Because beta is zero, it is unnecessary to reset filter_grad.
...
@@ -665,7 +675,7 @@ class CUDNNConvGradOpKernel : public framework::OpKernel<T> {
                     input_data + i * group_offset_in, args2.odesc.desc(),
                     output_grad_data + i * group_offset_out,
                     args2.cdesc.desc(), filter_algo, cudnn_workspace_ptr,
-                    workspace_size, &beta, args2.wdesc.desc(),
+                    workspace_size, &beta_filter, args2.wdesc.desc(),
                     filter_grad_data + i * group_offset_filter));
           },
           workspace_size);
...
@@ -1017,7 +1027,14 @@ class CUDNNConvDoubleGradOpKernel : public framework::OpKernel<T> {
     int group_offset_out = o_c / groups * o_h * o_w * o_d;
     int group_offset_filter = W->numel() / groups;

-    ScalingParamType<T> alpha = 1.0f, beta = 0.0f;
+    ScalingParamType<T> alpha = 1.0f;
+    ScalingParamType<T> beta = 0.0f;
+
+    // NOTE(zhiqiu): inplace addto is not supportted in double grad yet.
+    // ScalingParamType<T> beta = ctx.Attr<bool>("use_addto") ? 1.0f : 0.0f;
+    // VLOG(4) << "Conv_grad_grad: use_addto = " << ctx.Attr<bool>("use_addto");
+
     auto wkspace_handle = dev_ctx.cudnn_workspace_handle();

     if (ddO) {
...
...
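The cuDNN convolution calls above follow the scaling contract y = alpha * conv(x, w) + beta * y, so flipping beta from 0 to 1 turns an overwrite into an in-place accumulation, which is exactly what the use_addto attribute relies on. A tiny scalar sketch of that contract (values are illustrative only):

#include <cassert>

// y = alpha * result + beta * y, the blending rule used by the cuDNN calls above.
float Blend(float result, float y, float alpha, float beta) {
  return alpha * result + beta * y;
}

int main() {
  float y = 7.f;                             // whatever already lives in the output buffer
  assert(Blend(3.f, y, 1.f, 0.f) == 3.f);    // beta = 0: overwrite (default)
  assert(Blend(3.f, y, 1.f, 1.f) == 10.f);   // beta = 1: accumulate (use_addto)
  return 0;
}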
paddle/fluid/operators/conv_op.cc

...
@@ -305,6 +305,11 @@ void Conv2DOpMaker::Make() {
       .SetDefault(0.0f);
   AddAttr<float>("fuse_beta", "(float, default 0.0) Only used in mkldnn kernel")
       .SetDefault(0.0f);
+  AddAttr<bool>(
+      "use_addto",
+      "(bool, default false) If use addto strategy or not, only used in "
+      "cudnn kernel")
+      .SetDefault(false);
   AddAttr<bool>("fuse_residual_connection",
                 "(bool, default false) Only used in mkldnn kernel. Used "
                 "whenever convolution output is as an input to residual "
...
@@ -460,6 +465,11 @@ void Conv3DOpMaker::Make() {
       .SetDefault(0.0f);
   AddAttr<float>("fuse_beta", "(float, default 0.0) Only used in mkldnn kernel")
       .SetDefault(0.0f);
+  AddAttr<bool>(
+      "use_addto",
+      "(bool, default false) If use addto strategy or not, only used in "
+      "cudnn kernel")
+      .SetDefault(false);
   AddAttr<bool>("fuse_residual_connection",
                 "(bool, default false) Only used in mkldnn kernel. Used "
                 "whenever convolution output is as an input to residual "
"whenever convolution output is as an input to residual "
...
...
paddle/fluid/operators/cudnn_lstm_cache.h
浏览文件 @
f52c4f8b
...
...
@@ -54,6 +54,8 @@ class ScopedRNNBase {
x_descs_
.
emplace_back
(
x_desc_
.
descriptor
<
T
>
(
dims_x
,
strides_x
));
y_descs_
.
emplace_back
(
y_desc_
.
descriptor
<
T
>
(
dims_y
,
strides_y
));
}
#if CUDNN_VERSION >= 7201
if
(
!
sequence_length
.
empty
())
{
x_seq_desc_
.
descriptor
<
T
>
(
seq_length_
,
batch_size_
,
input_size_
,
true
,
sequence_length
);
...
...
@@ -61,6 +63,7 @@ class ScopedRNNBase {
hidden_size_
*
numDirections
,
true
,
sequence_length
);
}
#endif
// ------------------- cudnn hx, hy, cx, cy descriptors----------
std
::
vector
<
int
>
dims_hx
=
{
num_layers_
*
numDirections
,
batch_size_
,
...
...
@@ -96,10 +99,13 @@ class ScopedRNNBase {
is_bidirec_
?
CUDNN_BIDIRECTIONAL
:
CUDNN_UNIDIRECTIONAL
,
CUDNN_LSTM
,
cudnn_type
));
#endif
#if CUDNN_VERSION >= 7201
if
(
!
sequence_length
.
empty
())
{
PADDLE_ENFORCE_CUDA_SUCCESS
(
platform
::
dynload
::
cudnnSetRNNPaddingMode
(
rnn_desc_
.
desc
(),
CUDNN_RNN_PADDED_IO_ENABLED
));
}
#endif
// ------------------- cudnn weights_size ---------------------
size_t
weights_size_
;
...
...
@@ -125,8 +131,10 @@ class ScopedRNNBase {
}
cudnnTensorDescriptor_t
*
x_descs
()
{
return
x_descs_
.
data
();
}
cudnnTensorDescriptor_t
*
y_descs
()
{
return
y_descs_
.
data
();
}
#if CUDNN_VERSION >= 7201
cudnnRNNDataDescriptor_t
x_seq_desc
()
{
return
x_seq_desc_
.
desc
();
}
cudnnRNNDataDescriptor_t
y_seq_desc
()
{
return
y_seq_desc_
.
desc
();
}
#endif
cudnnTensorDescriptor_t
init_h_desc
()
{
return
init_h_desc_
.
desc
();
}
cudnnTensorDescriptor_t
init_c_desc
()
{
return
init_c_desc_
.
desc
();
}
cudnnTensorDescriptor_t
last_h_desc
()
{
return
last_h_desc_
.
desc
();
}
...
...
@@ -151,8 +159,10 @@ class ScopedRNNBase {
platform
::
ScopedTensorDescriptor
x_desc_
;
platform
::
ScopedTensorDescriptor
y_desc_
;
#if CUDNN_VERSION >= 7201
platform
::
ScopedRNNTensorDescriptor
x_seq_desc_
;
platform
::
ScopedRNNTensorDescriptor
y_seq_desc_
;
#endif
platform
::
ScopedTensorDescriptor
init_h_desc_
;
platform
::
ScopedTensorDescriptor
init_c_desc_
;
platform
::
ScopedTensorDescriptor
last_h_desc_
;
...
...
paddle/fluid/operators/elementwise/elementwise_add_op.cc
...
...
@@ -13,8 +13,11 @@ See the License for the specific language governing permissions and
limitations under the License. */

#include "paddle/fluid/operators/elementwise/elementwise_add_op.h"
#include <memory>
#include <string>
#include "paddle/fluid/framework/op_version_registry.h"
#include "paddle/fluid/operators/elementwise/elementwise_op.h"

namespace paddle {
...
...
@@ -129,3 +132,18 @@ REGISTER_OP_CPU_KERNEL(
                                        int>,
    ops::ElementwiseAddDoubleGradKernel<paddle::platform::CPUDeviceContext,
                                        int64_t>);

// A specialized elementwise_add operator, used in gradient accumulation with
// inplace addto.
REGISTER_OPERATOR(
    grad_add, paddle::operators::ElementwiseOp,
    paddle::operators::ElementwiseAddOpMaker,
    paddle::framework::EmptyGradOpMaker<paddle::framework::OpDesc>,
    paddle::framework::EmptyGradOpMaker<paddle::imperative::OpBase>);

REGISTER_OP_CPU_KERNEL(
    grad_add,
    ops::ElementwiseAddKernel<paddle::platform::CPUDeviceContext, float>,
    ops::ElementwiseAddKernel<paddle::platform::CPUDeviceContext, double>,
    ops::ElementwiseAddKernel<paddle::platform::CPUDeviceContext, int>,
    ops::ElementwiseAddKernel<paddle::platform::CPUDeviceContext, int64_t>);
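grad_add reuses the ordinary elementwise-add kernels; it only exists as a separate op type (with no gradient of its own) so the backward pass can accumulate duplicated gradients through a chain of in-place adds instead of one large sum op. A small NumPy sketch of the equivalence the chain relies on (illustrative only):

import numpy as np

grads = [np.random.rand(4, 4) for _ in range(3)]

# one-shot accumulation, which is what a single "sum" op computes
summed = np.sum(grads, axis=0)

# chained in-place accumulation, which is what a sequence of grad_add ops computes
acc = grads[0].copy()
for g in grads[1:]:
    acc += g          # each step is just an elementwise add into the same buffer

assert np.allclose(summed, acc)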
paddle/fluid/operators/elementwise/elementwise_add_op.cu
...
...
@@ -111,3 +111,10 @@ REGISTER_OP_CUDA_KERNEL(
    ops::ElementwiseAddDoubleGradKernel<plat::CUDADeviceContext, int64_t>,
    ops::ElementwiseAddDoubleGradKernel<plat::CUDADeviceContext,
                                        plat::float16>);

REGISTER_OP_CUDA_KERNEL(
    grad_add, ops::ElementwiseAddKernel<plat::CUDADeviceContext, float>,
    ops::ElementwiseAddKernel<plat::CUDADeviceContext, double>,
    ops::ElementwiseAddKernel<plat::CUDADeviceContext, int>,
    ops::ElementwiseAddKernel<plat::CUDADeviceContext, int64_t>,
    ops::ElementwiseAddKernel<plat::CUDADeviceContext, plat::float16>);
paddle/fluid/operators/fake_quantize_op.cc
...
...
@@ -174,7 +174,64 @@ struct ChannelClipAndFakeQuantFunctor<platform::CPUDeviceContext, T> {
template struct ChannelClipAndFakeQuantFunctor<platform::CPUDeviceContext, float>;

template <typename T>
struct ChannelClipFakeQuantDequantFunctor<platform::CPUDeviceContext, T> {
  void operator()(const platform::CPUDeviceContext& ctx,
                  const framework::Tensor& in, const framework::Tensor& scale,
                  const int bin_cnt, const int quant_axis,
                  framework::Tensor* out) {
    PADDLE_ENFORCE_EQ(
        quant_axis == 0 || quant_axis == 1, true,
        platform::errors::InvalidArgument("'quant_axis' should be 0 or 1, but "
                                          "the received is %d",
                                          quant_axis));

    auto* scale_data = scale.data<T>();
    auto* in_data = in.data<T>();
    auto* out_data = out->mutable_data<T>(ctx.GetPlace());
    auto in_dims = in.dims();
    const int64_t channel = in_dims[quant_axis];
    platform::Transform<platform::CPUDeviceContext> trans;
    if (quant_axis == 0) {
      const int64_t channel_size = in.numel() / channel;
      for (int i = 0; i < channel; i++) {
        T s = scale_data[i];
        auto* start = in_data + i * channel_size;
        auto* end = in_data + (i + 1) * channel_size;
        trans(ctx, start, end, out_data + i * channel_size,
              ClipFunctor<T>(-s, s));
      }
      for (int i = 0; i < channel; i++) {
        T s = scale_data[i];
        T inv_s = inverse(s);
        framework::Tensor one_channel_out = out->Slice(i, i + 1);
        auto out_e = framework::EigenVector<T>::Flatten(one_channel_out);
        out_e.device(*ctx.eigen_device()) =
            (bin_cnt * inv_s * out_e).round() * s / static_cast<T>(bin_cnt);
      }
    } else if (quant_axis == 1) {
      const int64_t step_i = in.numel() / in_dims[0];
      const int64_t step_j = in.numel() / (in_dims[0] * in_dims[1]);
      for (int i = 0; i < in_dims[0]; i++) {
        for (int j = 0; j < in_dims[1]; j++) {
          T s = scale_data[j];
          T inv_s = inverse(s);
          auto* start = in_data + i * step_i + j * step_j;
          auto* end = in_data + i * step_i + (j + 1) * step_j;
          auto* cur_out_data = out_data + i * step_i + j * step_j;
          trans(ctx, start, end, cur_out_data, ClipFunctor<T>(-s, s));
          for (int k = 0; k < step_j; k++) {
            cur_out_data[k] = std::round(bin_cnt * inv_s * cur_out_data[k]) *
                              s / static_cast<T>(bin_cnt);
          }
        }
      }
    }
  }
};

template struct ChannelClipFakeQuantDequantFunctor<platform::CPUDeviceContext, float>;

template <typename T>
struct FindRangeAbsMaxFunctor<platform::CPUDeviceContext, T> {
  void operator()(const platform::CPUDeviceContext& ctx,
...
...
@@ -360,6 +417,75 @@ $$0 \leq c \lt \ the\ channel\ number\ of\ X$$
  }
};

class FakeChannelWiseQuantizeDequantizeAbsMaxOp
    : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;

  void InferShape(framework::InferShapeContext* ctx) const override {
    OP_INOUT_CHECK(ctx->HasInput("X"), "Input", "X",
                   "FakeChannelWiseQuantizeDequantizeAbsMax");
    OP_INOUT_CHECK(ctx->HasOutput("Out"), "Output", "Out",
                   "FakeChannelWiseQuantizeDequantizeAbsMax");
    OP_INOUT_CHECK(ctx->HasOutput("OutScale"), "Output", "OutScale",
                   "FakeChannelWiseQuantizeDequantizeAbsMax");
    int quant_axis = ctx->Attrs().Get<int>("quant_axis");
    ctx->SetOutputDim("Out", ctx->GetInputDim("X"));
    ctx->SetOutputDim("OutScale", {ctx->GetInputDim("X")[quant_axis]});
    ctx->ShareLoD("X", /*->*/ "Out");
  }

 protected:
  framework::OpKernelType GetExpectedKernelType(
      const framework::ExecutionContext& ctx) const override {
    return framework::OpKernelType(
        OperatorWithKernel::IndicateVarDataType(ctx, "X"), ctx.GetPlace());
  }
};

class FakeChannelWiseQuantizeDequantizeAbsMaxOpMaker
    : public framework::OpProtoAndCheckerMaker {
 public:
  void Make() override {
    AddInput("X", "(Tensor) Input is float data type.");
    AddOutput("Out",
              "(Tensor) Output of quantized and dequantized low level tensor, "
              "saved as float data type.");
    AddOutput("OutScale", "(Tensor) Current channel wise scale");
    AddAttr<int>("quant_axis",
                 "(int, default 0) The axis for quantization. "
                 "For conv2d, depthwise_conv2d, conv2d_transpose "
                 "and mul, the quant_axis is equal to the cout axis.")
        .SetDefault(0)
        .AddCustomChecker([](const int& quant_axis) {
          PADDLE_ENFORCE_EQ(quant_axis == 0 || quant_axis == 1, true,
                            platform::errors::InvalidArgument(
                                "'quant_axis' should be 0 or 1, but "
                                "the received is %d",
                                quant_axis));
        });
    AddAttr<int>("bit_length", "(int, default 8)")
        .SetDefault(8)
        .AddCustomChecker([](const int& bit_length) {
          PADDLE_ENFORCE_EQ(bit_length >= 1 && bit_length <= 16, true,
                            platform::errors::InvalidArgument(
                                "'bit_length' should be between 1 and 16, but "
                                "the received is %d",
                                bit_length));
        });
    AddComment(R"DOC(
The scale of FakeChannelWiseQuantize operator is a vector.
In detail, each channel of the input X has a scale value.

$$scale_c = max(abs(X_c))$$
$$range = 2^{bit\_length - 1} - 1$$
$$Out_c = round(\frac{X_c * range} {scale_c}) * \frac{scale_c} {range}$$
In above three formulas, the range value of c is as follow:
$$0 \leq c \lt \ the\ channel\ number\ of\ X$$
)DOC");
  }
};

class FakeQuantizeRangeAbsMaxOp : public framework::OperatorWithKernel {
 public:
  FakeQuantizeRangeAbsMaxOp(const std::string& type,
...
...
@@ -666,3 +792,12 @@ REGISTER_OP_CPU_KERNEL(moving_average_abs_max_scale,
REGISTER_OPERATOR(fake_quantize_dequantize_grad, ops::FakeQuantDequantGradOp);
REGISTER_OP_CPU_KERNEL(fake_quantize_dequantize_grad,
                       ops::FakeQuantDequantGradKernel<CPU, float>);

REGISTER_OPERATOR(
    fake_channel_wise_quantize_dequantize_abs_max,
    ops::FakeChannelWiseQuantizeDequantizeAbsMaxOp,
    ops::FakeChannelWiseQuantizeDequantizeAbsMaxOpMaker,
    ops::FakeQuantDequantGradMaker<paddle::framework::OpDesc>,
    ops::FakeQuantDequantGradMaker<paddle::imperative::OpBase>);
REGISTER_OP_CPU_KERNEL(
    fake_channel_wise_quantize_dequantize_abs_max,
    ops::FakeChannelWiseQuantizeDequantizeAbsMaxKernel<CPU, float>);
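The new op applies exactly the formulas in the DOC block: per channel c along quant_axis, scale_c = max(abs(X_c)), values are clipped to [-scale_c, scale_c], quantized to a signed range of 2^(bit_length-1) - 1 levels, and immediately dequantized. A minimal NumPy sketch of that math (names are illustrative, not Paddle APIs):

import numpy as np

def channel_fake_quant_dequant(x, bit_length=8, quant_axis=0):
    bin_cnt = 2 ** (bit_length - 1) - 1                 # e.g. 127 for 8 bits
    # per-channel abs-max scale along quant_axis
    reduce_axes = tuple(a for a in range(x.ndim) if a != quant_axis)
    scale = np.abs(x).max(axis=reduce_axes, keepdims=True)
    clipped = np.clip(x, -scale, scale)
    # quantize then immediately dequantize ("fake" quantization)
    return np.round(clipped / scale * bin_cnt) * scale / bin_cnt

w = np.random.randn(4, 3, 3, 3).astype(np.float32)      # e.g. a conv weight, cout first
w_qdq = channel_fake_quant_dequant(w, bit_length=8, quant_axis=0)
print(np.abs(w - w_qdq).max())   # error stays within half a quantization step per channel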
paddle/fluid/operators/fake_quantize_op.cu
...
...
@@ -417,8 +417,90 @@ struct FindMovingAverageAbsMaxFunctor<platform::CUDADeviceContext, T> {
  }
};

template struct FindMovingAverageAbsMaxFunctor<platform::CUDADeviceContext, float>;

// ChannelClipAndQuantDequantKernel for quant_axis is 0
template <typename T>
__global__ void ChannelClipAndQuantDequantKernelQuantAxis0(
    const T* in, const T* scale, const int bin_cnt, const int n, const int c,
    T* out) {
  int tid = threadIdx.x;

  int channel_size = n / c;
  const T* in_c = in + blockIdx.x * channel_size;
  T* out_c = out + blockIdx.x * channel_size;

  T s = scale[blockIdx.x];
  T inv_s = inverse(s);

  for (int i = tid; i < channel_size; i += blockDim.x) {
    T x = in_c[i];
    T v = x > s ? s : x;
    v = v < -s ? -s : v;
    v = bin_cnt * inv_s * v;
    out_c[i] = round(v) * s / bin_cnt;
  }
}

// ChannelClipAndQuantDequantKernel for quant_axis is 1
template <typename T>
__global__ void ChannelClipAndQuantDequantKernelQuantAxis1(
    const T* in, const T* scale, const int bin_cnt, const int n, const int cin,
    const int cout, T* out) {
  T s = scale[blockIdx.x % cout];
  T inv_s = inverse(s);

  int wh_size = n / (cin * cout);
  const T* in_c = in + blockIdx.x * wh_size;
  T* out_c = out + blockIdx.x * wh_size;

  for (int i = threadIdx.x; i < wh_size; i += blockDim.x) {
    T x = in_c[i];
    T v = x > s ? s : x;
    v = v < -s ? -s : v;
    v = bin_cnt * inv_s * v;
    out_c[i] = round(v) * s / bin_cnt;
  }
}

template <typename T>
struct ChannelClipFakeQuantDequantFunctor<platform::CUDADeviceContext, T> {
  void operator()(const platform::CUDADeviceContext& ctx,
                  const framework::Tensor& in, const framework::Tensor& scale,
                  const int bin_cnt, const int quant_axis,
                  framework::Tensor* out) {
    // At present, channelwise quantization supports conv2d, depthwise_conv2d
    // conv2d_transpose and mul
    PADDLE_ENFORCE_EQ(
        quant_axis == 0 || quant_axis == 1, true,
        platform::errors::InvalidArgument("'quant_axis' should be 0 or 1, but "
                                          "the received is %d",
                                          quant_axis));

    int num = in.numel();
    auto in_dims = in.dims();

    const T* in_data = in.data<T>();
    const T* scale_data = scale.data<T>();
    T* out_data = out->mutable_data<T>(ctx.GetPlace());

    if (quant_axis == 0) {
      int grid = in_dims[0];
      int block = 1024;
      ChannelClipAndQuantDequantKernelQuantAxis0<
          T><<<grid, block, 0, ctx.stream()>>>(in_data, scale_data, bin_cnt,
                                               num, in_dims[0], out_data);
    } else if (quant_axis == 1) {
      int grid = in_dims[0] * in_dims[1];
      int block = 1024;
      ChannelClipAndQuantDequantKernelQuantAxis1<
          T><<<grid, block, 0, ctx.stream()>>>(in_data, scale_data, bin_cnt,
                                               num, in_dims[0], in_dims[1],
                                               out_data);
    }
  }
};

template struct ChannelClipFakeQuantDequantFunctor<platform::CUDADeviceContext, float>;

}  // namespace operators
}  // namespace paddle
...
...
@@ -443,3 +525,6 @@ REGISTER_OP_CUDA_KERNEL(
    ops::FakeQuantizeDequantizeMovingAverageAbsMaxKernel<CUDA, float>);
REGISTER_OP_CUDA_KERNEL(fake_quantize_dequantize_grad,
                        ops::FakeQuantDequantGradKernel<CUDA, float>);
REGISTER_OP_CUDA_KERNEL(
    fake_channel_wise_quantize_dequantize_abs_max,
    ops::FakeChannelWiseQuantizeDequantizeAbsMaxKernel<CUDA, float>);
paddle/fluid/operators/fake_quantize_op.h
...
...
@@ -72,6 +72,13 @@ struct ChannelClipAndFakeQuantFunctor {
                  const int quant_axis, framework::Tensor* out);
};

template <typename DeviceContext, typename T>
struct ChannelClipFakeQuantDequantFunctor {
  void operator()(const DeviceContext& ctx, const framework::Tensor& in,
                  const framework::Tensor& scale, const int bin_cnt,
                  const int quant_axis, framework::Tensor* out);
};

template <typename DeviceContext, typename T>
struct FindMovingAverageAbsMaxFunctor {
  void operator()(const DeviceContext& ctx, const framework::Tensor& in_accum,
...
...
@@ -154,6 +161,30 @@ class FakeChannelWiseQuantizeAbsMaxKernel : public framework::OpKernel<T> {
  }
};

template <typename DeviceContext, typename T>
class FakeChannelWiseQuantizeDequantizeAbsMaxKernel
    : public framework::OpKernel<T> {
 public:
  void Compute(const framework::ExecutionContext& context) const override {
    auto* in = context.Input<framework::Tensor>("X");
    auto* out = context.Output<framework::Tensor>("Out");
    auto* out_scale = context.Output<framework::Tensor>("OutScale");
    T* out_scale_data = out_scale->mutable_data<T>(context.GetPlace());

    auto& dev_ctx = context.template device_context<DeviceContext>();
    out->mutable_data<T>(dev_ctx.GetPlace());

    int bit_length = context.Attr<int>("bit_length");
    int bin_cnt = std::pow(2, bit_length - 1) - 1;
    int quant_axis = context.Attr<int>("quant_axis");

    FindChannelAbsMaxFunctor<DeviceContext, T>()(dev_ctx, *in, quant_axis,
                                                 out_scale_data);

    ChannelClipFakeQuantDequantFunctor<DeviceContext, T>()(
        dev_ctx, *in, *out_scale, bin_cnt, quant_axis, out);
  }
};

template <typename DeviceContext, typename T>
class FakeQuantizeRangeAbsMaxKernel : public framework::OpKernel<T> {
 public:
...
...
paddle/fluid/operators/fused/fusion_gru_op.cc
...
...
@@ -15,6 +15,7 @@ limitations under the License. */
#include "paddle/fluid/operators/fused/fusion_gru_op.h"
#include <cstring>  // for memcpy
#include <string>
#include <vector>
#include "paddle/fluid/operators/jit/kernels.h"
#include "paddle/fluid/operators/math/blas.h"
#include "paddle/fluid/operators/math/fc.h"
...
...
paddle/fluid/operators/optimizers/rmsprop_op.cc
...
...
@@ -143,4 +143,5 @@ http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf)
namespace ops = paddle::operators;
REGISTER_OP_WITHOUT_GRADIENT(rmsprop, ops::RmspropOp, ops::RmspropOpMaker);
REGISTER_OP_CPU_KERNEL(
-   rmsprop, ops::RmspropOpKernel<paddle::platform::CPUDeviceContext, float>);
+   rmsprop, ops::RmspropOpKernel<paddle::platform::CPUDeviceContext, float>,
+   ops::RmspropOpKernel<paddle::platform::CPUDeviceContext, double>);
paddle/fluid/operators/optimizers/rmsprop_op.cu
...
...
@@ -15,4 +15,5 @@ limitations under the License. */
namespace ops = paddle::operators;
REGISTER_OP_CUDA_KERNEL(
-   rmsprop, ops::RmspropOpKernel<paddle::platform::CUDADeviceContext, float>);
+   rmsprop, ops::RmspropOpKernel<paddle::platform::CUDADeviceContext, float>,
+   ops::RmspropOpKernel<paddle::platform::CUDADeviceContext, double>);
paddle/fluid/operators/top_k_v2_op.cc
...
...
@@ -32,7 +32,6 @@ class TopkV2Op : public framework::OperatorWithKernel {
    auto input_dims = ctx->GetInputDim("X");
    const int& dim_size = input_dims.size();
-   const int k = static_cast<int>(ctx->Attrs().Get<int>("k"));
    int axis = static_cast<int>(ctx->Attrs().Get<int>("axis"));
    PADDLE_ENFORCE_EQ((axis < dim_size) && (axis >= (-1 * dim_size)), true,
                      "the axis of topk"
...
...
@@ -41,8 +40,18 @@ class TopkV2Op : public framework::OperatorWithKernel {
    if (axis < 0) axis += dim_size;
-   PADDLE_ENFORCE_GE(
-       k, 1, "the attribute of k in the topk must >= 1, but received %d .",
-       k);
+   int k;
+   auto k_is_tensor = ctx->HasInput("K");
+   if (k_is_tensor) {
+     k = -1;
+   } else {
+     k = static_cast<int>(ctx->Attrs().Get<int>("k"));
+     PADDLE_ENFORCE_EQ(k >= 1, true,
+                       "the attribute of k in the topk must >= 1 or be a "
+                       "Tensor, but received %d .",
+                       k);
+   }

    PADDLE_ENFORCE_GE(input_dims.size(), 1,
                      "input of topk must have >= 1d shape");
...
...
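The revised InferShape accepts k either as a compile-time attribute or as a runtime "K" input tensor; when the tensor form is used, the static shape cannot be resolved, so k is recorded as -1 and only the attribute path is validated. A tiny Python sketch of that decision (purely illustrative):

def infer_topk_k(has_k_tensor, k_attr=None):
    # Mirrors the revised TopkV2Op::InferShape branching.
    if has_k_tensor:
        return -1                  # k comes from a runtime Tensor, unknown statically
    if k_attr is None or k_attr < 1:
        raise ValueError("the attribute of k in the topk must >= 1 or be a Tensor")
    return k_attr

print(infer_topk_k(has_k_tensor=True))              # -1, resolved at run time
print(infer_topk_k(has_k_tensor=False, k_attr=5))   # 5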
paddle/fluid/platform/cudnn_helper.h
...
...
@@ -294,6 +294,7 @@ class ScopedTensorDescriptor {
  DISABLE_COPY_AND_ASSIGN(ScopedTensorDescriptor);
};

#if CUDNN_VERSION >= 7201
class ScopedRNNTensorDescriptor {
 public:
  ScopedRNNTensorDescriptor() {
...
...
@@ -337,6 +338,7 @@ class ScopedRNNTensorDescriptor {
  cudnnRNNDataDescriptor_t desc_;
  DISABLE_COPY_AND_ASSIGN(ScopedRNNTensorDescriptor);
};
#endif

class ScopedDropoutDescriptor {
 public:
...
...
paddle/fluid/platform/dynload/cudnn.cc
...
...
@@ -46,6 +46,10 @@ CUDNN_DNN_ROUTINE_EACH_R6(DEFINE_WRAP);
CUDNN_DNN_ROUTINE_EACH_R7(DEFINE_WRAP);
#endif

#ifdef CUDNN_DNN_ROUTINE_EACH_AFTER_TWO_R7
CUDNN_DNN_ROUTINE_EACH_AFTER_TWO_R7(DEFINE_WRAP);
#endif

#ifdef CUDNN_DNN_ROUTINE_EACH_AFTER_R7
CUDNN_DNN_ROUTINE_EACH_AFTER_R7(DEFINE_WRAP);
#endif
...
...
paddle/fluid/platform/dynload/cudnn.h
...
...
@@ -101,9 +101,6 @@ extern void EnforceCUDNNLoaded(const char* fn_name);
  __macro(cudnnDropoutGetStatesSize);                \
  __macro(cudnnSetDropoutDescriptor);                \
  __macro(cudnnRestoreDropoutDescriptor);            \
- __macro(cudnnCreateRNNDataDescriptor);             \
- __macro(cudnnDestroyRNNDataDescriptor);            \
- __macro(cudnnSetRNNDataDescriptor);                \
  __macro(cudnnCreateRNNDescriptor);                 \
  __macro(cudnnGetRNNParamsSize);                    \
  __macro(cudnnGetRNNWorkspaceSize);                 \
...
...
@@ -112,11 +109,6 @@ extern void EnforceCUDNNLoaded(const char* fn_name);
  __macro(cudnnRNNBackwardData);                     \
  __macro(cudnnRNNBackwardWeights);                  \
  __macro(cudnnRNNForwardInference);                 \
- __macro(cudnnRNNForwardTrainingEx);                \
- __macro(cudnnSetRNNPaddingMode);                   \
- __macro(cudnnRNNBackwardDataEx);                   \
- __macro(cudnnRNNBackwardWeightsEx);                \
- __macro(cudnnRNNForwardInferenceEx);               \
  __macro(cudnnDestroyDropoutDescriptor);            \
  __macro(cudnnDestroyRNNDescriptor);                \
  __macro(cudnnSetTensorNdDescriptorEx);
...
...
@@ -188,6 +180,19 @@ CUDNN_DNN_ROUTINE_EACH_R6(DECLARE_DYNAMIC_LOAD_CUDNN_WRAP)
CUDNN_DNN_ROUTINE_EACH_R7(DECLARE_DYNAMIC_LOAD_CUDNN_WRAP)
#endif

#if CUDNN_VERSION >= 7201
#define CUDNN_DNN_ROUTINE_EACH_AFTER_TWO_R7(__macro) \
  __macro(cudnnCreateRNNDataDescriptor);             \
  __macro(cudnnDestroyRNNDataDescriptor);            \
  __macro(cudnnSetRNNDataDescriptor);                \
  __macro(cudnnSetRNNPaddingMode);                   \
  __macro(cudnnRNNForwardTrainingEx);                \
  __macro(cudnnRNNBackwardDataEx);                   \
  __macro(cudnnRNNBackwardWeightsEx);                \
  __macro(cudnnRNNForwardInferenceEx);
CUDNN_DNN_ROUTINE_EACH_AFTER_TWO_R7(DECLARE_DYNAMIC_LOAD_CUDNN_WRAP)
#endif

#if CUDNN_VERSION >= 7401
#define CUDNN_DNN_ROUTINE_EACH_AFTER_R7(__macro)                     \
  __macro(cudnnGetBatchNormalizationForwardTrainingExWorkspaceSize); \
...
...
paddle/fluid/platform/flags.cc
...
...
@@ -521,3 +521,18 @@ DEFINE_int32(
DEFINE_bool(sort_sum_gradient, false,
            "Sum gradients by the reverse order of "
            "the forward execution sequence.");

/**
 * Performance related FLAG
 * Name: max_inplace_grad_add
 * Since Version: 2.0.0
 * Value Range: int32, default=0
 * Example:
 * Note: The maximum number of inplace grad_add.
 */
DEFINE_int32(max_inplace_grad_add, 0,
             "The maximum number of inplace grad_add. When doing "
             "gradient accumulation, if the number of gradients to be added "
             "is no more than FLAGS_max_inplace_grad_add, several grad_add "
             "ops are used instead of one sum op. Default is 0.");
paddle/fluid/pybind/global_value_getter_setter.cc
...
...
@@ -62,6 +62,7 @@ DECLARE_bool(use_system_allocator);
// others
DECLARE_bool(benchmark);
DECLARE_int32(inner_op_parallelism);
DECLARE_int32(max_inplace_grad_add);
DECLARE_string(tracer_profile_fname);
#ifdef PADDLE_WITH_CUDA
// cudnn
...
...
@@ -348,7 +349,7 @@ static void RegisterGlobalVarGetterSetter() {
      FLAGS_init_allocated_mem, FLAGS_initial_cpu_memory_in_mb,
      FLAGS_memory_fraction_of_eager_deletion, FLAGS_use_pinned_memory,
      FLAGS_benchmark, FLAGS_inner_op_parallelism, FLAGS_tracer_profile_fname,
-     FLAGS_paddle_num_threads, FLAGS_use_mkldnn);
+     FLAGS_paddle_num_threads, FLAGS_use_mkldnn, FLAGS_max_inplace_grad_add);

#ifdef PADDLE_WITH_CUDA
  REGISTER_PUBLIC_GLOBAL_VAR(
...
...
paddle/fluid/pybind/op_function_generator.cc
...
...
@@ -111,6 +111,7 @@ std::map<std::string, std::set<std::string>> op_passing_outs_map = {
    {"fake_quantize_dequantize_moving_average_abs_max",
     {"Out", "OutScale", "OutAccum", "OutState"}},
    {"fake_quantize_dequantize_abs_max", {"Out", "OutScale"}},
    {"fake_channel_wise_quantize_dequantize_abs_max", {"Out", "OutScale"}},
    {"check_finite_and_unscale", {"Out", "FoundInfinite"}},
    {"update_loss_scaling",
     {"Out", "LossScaling", "OutGoodSteps", "OutBadSteps"}},
...
...
paddle/fluid/pybind/pybind.cc
...
...
@@ -12,6 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */

#include <Python.h>
#include <algorithm>
#include <cstdlib>
#include <map>
...
...
@@ -22,6 +23,7 @@ limitations under the License. */
#include <unordered_set>
#include <utility>
#include <vector>

#include "paddle/fluid/framework/executor.h"
#include "paddle/fluid/framework/feed_fetch_method.h"
#include "paddle/fluid/framework/feed_fetch_type.h"
...
...
@@ -2528,6 +2530,10 @@ All parameter, weight, gradient are variables in Paddle.
          "enable_inplace",
          [](const BuildStrategy &self) { return self.enable_inplace_; },
          [](BuildStrategy &self, bool b) { self.enable_inplace_ = b; })
      .def_property(
          "enable_addto",
          [](const BuildStrategy &self) { return self.enable_addto_; },
          [](BuildStrategy &self, bool b) { self.enable_addto_ = b; })
      .def_property(
          "fuse_all_reduce_ops",
          [](const BuildStrategy &self) {
...
...
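The new def_property makes the addto path reachable from Python through BuildStrategy. A hedged usage sketch (only the two strategy flags come from this diff; the compiled-program lines are the usual fluid workflow and are assumed):

import paddle.fluid as fluid

build_strategy = fluid.BuildStrategy()
build_strategy.enable_inplace = True
build_strategy.enable_addto = True   # newly exposed switch for inplace addto gradient accumulation

# Typically passed when compiling a program for parallel execution, e.g.:
# compiled = fluid.CompiledProgram(main_program).with_data_parallel(
#     loss_name=loss.name, build_strategy=build_strategy)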
paddle/scripts/paddle_build.sh
...
...
@@ -121,6 +121,18 @@ function cmake_base() {
            else
                exit 1
            fi
        elif [ "$1" == "cp38-cp38" ]; then
            if [ -d "/Library/Frameworks/Python.framework/Versions/3.8" ]; then
                export LD_LIBRARY_PATH=/Library/Frameworks/Python.framework/Versions/3.8/lib/
                export DYLD_LIBRARY_PATH=/Library/Frameworks/Python.framework/Versions/3.8/lib/
                export PATH=/Library/Frameworks/Python.framework/Versions/3.8/bin/:${PATH}
                PYTHON_FLAGS="-DPYTHON_EXECUTABLE:FILEPATH=/Library/Frameworks/Python.framework/Versions/3.8/bin/python3
                -DPYTHON_INCLUDE_DIR:PATH=/Library/Frameworks/Python.framework/Versions/3.8/include/python3.8/
                -DPYTHON_LIBRARY:FILEPATH=/Library/Frameworks/Python.framework/Versions/3.8/lib/libpython3.8.dylib"
                pip3.8 install --user -r ${PADDLE_ROOT}/python/requirements.txt
            else
                exit 1
            fi
        fi
        # delete `gym` to avoid modifying requirements.txt in *.whl
        sed -i .bak "/^gym$/d" ${PADDLE_ROOT}/python/requirements.txt
...
...
@@ -176,6 +188,13 @@ function cmake_base() {
                -DPYTHON_INCLUDE_DIR:PATH=/opt/_internal/cpython-3.7.0/include/python3.7m
                -DPYTHON_LIBRARIES:FILEPATH=/opt/_internal/cpython-3.7.0/lib/libpython3.so"
                pip3.7 install -r ${PADDLE_ROOT}/python/requirements.txt
            elif [ "$1" == "cp38-cp38" ]; then
                export LD_LIBRARY_PATH=/opt/_internal/cpython-3.8.0/lib/:${LD_LIBRARY_PATH}
                export PATH=/opt/_internal/cpython-3.8.0/bin/:${PATH}
                export PYTHON_FLAGS="-DPYTHON_EXECUTABLE:FILEPATH=/opt/_internal/cpython-3.8.0/bin/python3.8
                -DPYTHON_INCLUDE_DIR:PATH=/opt/_internal/cpython-3.8.0/include/python3.8
                -DPYTHON_LIBRARIES:FILEPATH=/opt/_internal/cpython-3.8.0/lib/libpython3.so"
                pip3.8 install -r ${PADDLE_ROOT}/python/requirements.txt
            fi
        else
            pip install -r ${PADDLE_ROOT}/python/requirements.txt
...
...
@@ -514,6 +533,8 @@ EOF
            pip3.6 uninstall -y paddlepaddle
        elif [ "$1" == "cp37-cp37m" ]; then
            pip3.7 uninstall -y paddlepaddle
        elif [ "$1" == "cp38-cp38" ]; then
            pip3.8 uninstall -y paddlepaddle
        fi
        set -ex
...
...
@@ -527,6 +548,8 @@ EOF
            pip3.6 install --user ${INSTALL_PREFIX:-/paddle/build}/opt/paddle/share/wheels/*.whl
        elif [ "$1" == "cp37-cp37m" ]; then
            pip3.7 install --user ${INSTALL_PREFIX:-/paddle/build}/opt/paddle/share/wheels/*.whl
        elif [ "$1" == "cp38-cp38" ]; then
            pip3.8 install --user ${INSTALL_PREFIX:-/paddle/build}/opt/paddle/share/wheels/*.whl
        fi
        tmpfile_rand=`date +%s%N`
        tmpfile=$tmp_dir/$tmpfile_rand
...
...
@@ -666,7 +689,7 @@ function generate_api_spec() {
    awk -F '(' '{print $NF}' $spec_path >${spec_path}.doc
    awk -F '(' '{$NF="";print $0}' $spec_path >${spec_path}.api
-   if [ "$1" == "cp35-cp35m" ] || [ "$1" == "cp36-cp36m" ] || [ "$1" == "cp37-cp37m" ]; then
+   if [ "$1" == "cp35-cp35m" ] || [ "$1" == "cp36-cp36m" ] || [ "$1" == "cp37-cp37m" ] || [ "$1" == "cp38-cp38" ]; then
        # Use sed to make python2 and python3 sepc keeps the same
        sed -i 's/arg0: str/arg0: unicode/g' $spec_path
        sed -i "s/\(.*Transpiler.*\).__init__ (ArgSpec(args=\['self'].*/\1.__init__ /g" $spec_path
...
...
@@ -1244,21 +1267,25 @@ EOF
    ref_paddle35=paddlepaddle${install_gpu}-${PADDLE_BRANCH}-cp35-cp35m-linux_x86_64.whl
    ref_paddle36=paddlepaddle${install_gpu}-${PADDLE_BRANCH}-cp36-cp36m-linux_x86_64.whl
    ref_paddle37=paddlepaddle${install_gpu}-${PADDLE_BRANCH}-cp37-cp37m-linux_x86_64.whl
    ref_paddle38=paddlepaddle${install_gpu}-${PADDLE_BRANCH}-cp38-cp38-linux_x86_64.whl
    ref_paddle2_whl=paddlepaddle${install_gpu}-${PADDLE_BRANCH}-cp27-cp27mu-linux_x86_64.whl
    ref_paddle35_whl=paddlepaddle${install_gpu}-${PADDLE_BRANCH}-cp35-cp35m-linux_x86_64.whl
    ref_paddle36_whl=paddlepaddle${install_gpu}-${PADDLE_BRANCH}-cp36-cp36m-linux_x86_64.whl
    ref_paddle37_whl=paddlepaddle${install_gpu}-${PADDLE_BRANCH}-cp37-cp37m-linux_x86_64.whl
    ref_paddle38_whl=paddlepaddle${install_gpu}-${PADDLE_BRANCH}-cp38-cp38-linux_x86_64.whl

    if [[ ${PADDLE_BRANCH} != "0.0.0" && ${WITH_MKL} == "ON" && ${WITH_GPU} == "ON" ]]; then
        ref_paddle2=paddlepaddle${install_gpu}-${PADDLE_BRANCH}.post${ref_CUDA_MAJOR}${CUDNN_MAJOR}-cp27-cp27mu-linux_x86_64.whl
        ref_paddle35=paddlepaddle${install_gpu}-${PADDLE_BRANCH}.post${ref_CUDA_MAJOR}${CUDNN_MAJOR}-cp35-cp35m-linux_x86_64.whl
        ref_paddle36=paddlepaddle${install_gpu}-${PADDLE_BRANCH}.post${ref_CUDA_MAJOR}${CUDNN_MAJOR}-cp36-cp36m-linux_x86_64.whl
        ref_paddle37=paddlepaddle${install_gpu}-${PADDLE_BRANCH}.post${ref_CUDA_MAJOR}${CUDNN_MAJOR}-cp37-cp37m-linux_x86_64.whl
        ref_paddle38=paddlepaddle${install_gpu}-${PADDLE_BRANCH}.post${ref_CUDA_MAJOR}${CUDNN_MAJOR}-cp38-cp38-linux_x86_64.whl
        ref_paddle2_whl=paddlepaddle${install_gpu}-${PADDLE_BRANCH}.post${ref_CUDA_MAJOR}${CUDNN_MAJOR}-cp27-cp27mu-linux_x86_64.whl
        ref_paddle35_whl=paddlepaddle${install_gpu}-${PADDLE_BRANCH}.post${ref_CUDA_MAJOR}${CUDNN_MAJOR}-cp35-cp35m-linux_x86_64.whl
        ref_paddle36_whl=paddlepaddle${install_gpu}-${PADDLE_BRANCH}.post${ref_CUDA_MAJOR}${CUDNN_MAJOR}-cp36-cp36m-linux_x86_64.whl
        ref_paddle37_whl=paddlepaddle${install_gpu}-${PADDLE_BRANCH}.post${ref_CUDA_MAJOR}${CUDNN_MAJOR}-cp37-cp37m-linux_x86_64.whl
        ref_paddle38_whl=paddlepaddle${install_gpu}-${PADDLE_BRANCH}.post${ref_CUDA_MAJOR}${CUDNN_MAJOR}-cp38-cp38-linux_x86_64.whl
    fi

    #ref_paddle2_mv1=""
...
...
@@ -1363,6 +1390,22 @@ EOF
    apt-get clean -y && \
    rm -f ${ref_paddle37} && \
    ldconfig
EOF
    cat >> ${PADDLE_ROOT}/build/Dockerfile <<EOF
    # run paddle version to install python packages first
    RUN apt-get update && ${NCCL_DEPS}
    RUN apt-get install -y make build-essential libssl-dev zlib1g-dev libbz2-dev \
        libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev libncursesw5-dev \
        xz-utils tk-dev libffi-dev liblzma-dev
    RUN wget -q https://www.python.org/ftp/python/3.8.0/Python-3.8.0.tgz && \
        tar -xzf Python-3.8.0.tgz && cd Python-3.8.0 && \
        CFLAGS="-Wformat" ./configure --prefix=/usr/local/ --enable-shared > /dev/null && \
        make -j8 > /dev/null && make altinstall > /dev/null && cd ../ && rm Python-3.8.0.tgz
    RUN apt-get install -y libgtk2.0-dev dmidecode python3-tk && ldconfig && \
        pip3.8 install opencv-python && wget ${ref_web}/${ref_paddle38} && pip3.8 install ${ref_paddle38_whl}; apt-get install -f -y && \
        apt-get clean -y && \
        rm -f ${ref_paddle38} && \
        ldconfig
EOF
    cat >> ${PADDLE_ROOT}/build/Dockerfile <<EOF
    # run paddle version to install python packages first
...
...
python/paddle/distributed/fleet/__init__.py
...
...
@@ -42,6 +42,7 @@ server_num = fleet.server_num
server_index = fleet.server_index
server_endpoints = fleet.server_endpoints
is_server = fleet.is_server
set_util = fleet.set_util
util = fleet.util
barrier_worker = fleet.barrier_worker
init_worker = fleet.init_worker
...
...
python/paddle/distributed/fleet/base/fleet_base.py
...
...
@@ -180,6 +180,8 @@ class Fleet(object):
                raise ValueError(
                    "`role_maker` should be subclass of `RoleMakerBase`, but got {}".
                    format(type(role_maker)))
        self._role_maker._generate_role()

        self.strategy_compiler = StrategyCompiler()
        if paddle.fluid.framework.in_dygraph_mode():
            if parallel_helper._is_parallel_ctx_initialized():
...
...
@@ -187,7 +189,6 @@ class Fleet(object):
                    "The dygraph parallel environment has been initialized.")
            else:
                paddle.distributed.init_parallel_env()
-       return None

    def is_first_worker(self):
        """
...
...
@@ -206,7 +207,7 @@ class Fleet(object):
            fleet.is_first_worker()

        """
-       return self._role_maker.is_first_worker()
+       return self._role_maker._is_first_worker()

    def worker_index(self):
        """
...
...
@@ -223,7 +224,7 @@ class Fleet(object):
            fleet.worker_index()

        """
-       return self._role_maker.worker_index()
+       return self._role_maker._worker_index()

    def worker_num(self):
        """
...
...
@@ -240,7 +241,7 @@ class Fleet(object):
            fleet.worker_num()

        """
-       return self._role_maker.worker_num()
+       return self._role_maker._worker_num()

    def is_worker(self):
        """
...
...
@@ -258,7 +259,7 @@ class Fleet(object):
            fleet.is_worker()

        """
-       return self._role_maker.is_worker()
+       return self._role_maker._is_worker()

    def worker_endpoints(self, to_string=False):
        """
...
...
@@ -275,13 +276,10 @@ class Fleet(object):
            fleet.worker_endpoints()

        """
-       '''
        if to_string:
-           return ",".join(self._role_maker.get_trainer_endpoints())
+           return ",".join(self._role_maker._get_trainer_endpoints())
        else:
-           return self._role_maker.get_trainer_endpoints()
-       '''
-       return ["127.0.0.1:1001", "127.0.0.1:1002"]
+           return self._role_maker._get_trainer_endpoints()

    def server_num(self):
        """
...
...
@@ -296,7 +294,7 @@ class Fleet(object):
            fleet.init()
            fleet.server_num()
        """
-       return len(self._role_maker.get_pserver_endpoints())
+       return len(self._role_maker._get_pserver_endpoints())

    def server_index(self):
        """
...
...
@@ -313,7 +311,7 @@ class Fleet(object):
            fleet.server_index()

        """
-       return self._role_maker.server_index()
+       return self._role_maker._server_index()

    def server_endpoints(self, to_string=False):
        """
...
...
@@ -332,9 +330,9 @@ class Fleet(object):
        """
        if to_string:
-           return ",".join(self._role_maker.get_pserver_endpoints())
+           return ",".join(self._role_maker._get_pserver_endpoints())
        else:
-           return self._role_maker.get_pserver_endpoints()
+           return self._role_maker._get_pserver_endpoints()

    def is_server(self):
        """
...
...
@@ -352,10 +350,12 @@ class Fleet(object):
            fleet.is_server()

        """
-       return self._role_maker.is_server(
+       return self._role_maker._is_server(
        ) or self._role_maker._is_heter_worker()

    @property
    def set_util(self, util):
        self._util = util

    def util(self):
        """
        Utility functions that can be used under certain runtime
...
...
@@ -376,16 +376,6 @@ class Fleet(object):
        """
        return self._util

-   @util.setter
-   def util(self, util):
-       """
-       Set Utility functions for userd-defined runtime
-
-       Returns:
-           None
-       """
-       self._util = util

    def barrier_worker(self):
        """
        barrier all workers
...
...
@@ -393,7 +383,7 @@ class Fleet(object):
        Returns:
            None
        """
-       self._role_maker.barrier_worker()
+       self._role_maker._barrier("worker")

    @is_non_distributed_check
    @inited_runtime_handler
...
...
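The public Fleet surface keeps the same names; only the underlying role_maker calls gain a leading underscore. A minimal usage sketch of the unchanged public API, drawn from the docstring examples above (assumes a properly configured distributed environment):

import paddle.distributed.fleet as fleet

fleet.init()                              # picks up the role from the environment
print("worker_num:", fleet.worker_num())
print("worker_index:", fleet.worker_index())
if fleet.is_first_worker():
    print("this process is worker 0")
if fleet.is_server():
    print("this process is a parameter server")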
python/paddle/distributed/fleet/base/role_maker.py
(diff collapsed in the original view, not shown)
python/paddle/distributed/fleet/base/util_factory.py
...
...
@@ -57,34 +57,7 @@ class UtilBase(object):
        ), "fs_client must be the instance of paddle.distributed.fleet.utils.FS"
        self.fs_client = fs_client

-   def __check_comm_world(self, comm_world="worker"):
-       if not self.role_maker._role_is_generated:
-           self.role_maker.generate_role()
-
-       _comm_world = None
-       comm_world_upper = comm_world.upper()
-       if comm_world_upper == "WORKER":
-           if not self.role_maker.is_worker():
-               print(
-                   "warning: current role is not worker in collective_func(comm_world=\"worker\")"
-               )
-           _comm_world = self.role_maker._node_type_comm
-       elif comm_world_upper == "SERVER":
-           if not self.role_maker.is_server():
-               print(
-                   "warning: current role is not server in collective_func(comm_world=\"server\")"
-               )
-           _comm_world = self.role_maker._node_type_comm
-       elif comm_world_upper == "ALL":
-           _comm_world = self.role_maker._all_comm
-       else:
-           raise ValueError(
-               "not support comm_world, please choose one from [worker, server, all]"
-           )
-
-       return _comm_world
-
-   def all_reduce(self, input, mode, comm_world="worker"):
+   def all_reduce(self, input, mode="sum", comm_world="worker"):
        """
        All reduce `input` between specified collection. This is a distributed API.
...
...
@@ -130,8 +103,7 @@ class UtilBase(object):
            if __name__ == "__main__":
                train()
        """
-       _comm_world = self.__check_comm_world(comm_world)
-       return self.role_maker._all_reduce(_comm_world, input, mode)
+       return self.role_maker._all_reduce(input, mode, comm_world)

    def barrier(self, comm_world="worker"):
        """
...
...
@@ -170,8 +142,7 @@ class UtilBase(object):
            if __name__ == "__main__":
                train()
        """
-       _comm_world = self.__check_comm_world(comm_world)
-       self.role_maker._barrier(_comm_world)
+       self.role_maker._barrier(comm_world)

    def all_gather(self, input, comm_world="worker"):
        """
...
...
@@ -219,8 +190,8 @@ class UtilBase(object):
            if __name__ == "__main__":
                train()
        """
-       _comm_world = self.__check_comm_world(comm_world)
-       return self.role_maker._all_gather(_comm_world, input)
+       return self.role_maker._all_gather(input, comm_world)

    def _broadcast(self):
        pass
...
...
@@ -266,8 +237,8 @@ class UtilBase(object):
        if not isinstance(files, list):
            raise TypeError("files should be a list of file need to be read.")

-       trainer_id = self.role_maker.worker_index()
-       trainers = self.role_maker.worker_num()
+       trainer_id = self.role_maker._worker_index()
+       trainers = self.role_maker._worker_num()

        remainder = len(files) % trainers
        blocksize = int(len(files) / trainers)
...
...
@@ -309,7 +280,7 @@ class UtilBase(object):
            fleet_util._set_role_maker(role)
            fleet_util.print_on_rank("I'm worker 0", 0)
        """
-       if self.role_maker.worker_index() != rank_id:
+       if self.role_maker._worker_index() != rank_id:
            return
        print(message)
...
...
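After this change the comm_world string is forwarded straight to the role maker, and mode defaults to "sum". A hedged usage sketch, assuming `fleet.util` resolves to the UtilBase instance (which is how the module-level alias in __init__.py above is wired):

import paddle.distributed.fleet as fleet

fleet.init()
util = fleet.util                          # UtilBase instance exposed by fleet

total = util.all_reduce(1, comm_world="worker")   # mode now defaults to "sum"
util.barrier(comm_world="worker")                 # synchronize workers
gathered = util.all_gather(total, comm_world="worker")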
python/paddle/distributed/fleet/launch.py
...
...
@@ -55,7 +55,10 @@ launch a process on each of the given gpu card or cpu machine.
"""

from __future__ import print_function
import shutil
import sys
import tempfile
from sys import version
import subprocess
import os
...
...
@@ -213,12 +216,20 @@ def launch_collective(args):
    cluster, pod = get_cluster_from_args(args, gpus)
    logger.debug("get cluster from args:{}".format(cluster))

    global_envs = copy.copy(os.environ.copy())
    gloo_rendezvous_dir = tempfile.mkdtemp()
    # add gloo env
    global_envs["PADDLE_WITH_GLOO"] = "1"
    global_envs["PADDLE_GLOO_RENDEZVOUS"] = "2"
    global_envs["PADDLE_GLOO_FS_PATH"] = gloo_rendezvous_dir

    procs = start_local_trainers(
        cluster,
        pod,
        training_script=args.training_script,
        training_script_args=args.training_script_args,
-       log_dir=args.log_dir)
+       log_dir=args.log_dir,
+       envs=global_envs)

    while True:
        alive = watch_local_trainers(procs, cluster.trainers_nranks())
...
...
@@ -230,6 +241,9 @@ def launch_collective(args):
        time.sleep(3)

    if os.path.exists(gloo_rendezvous_dir):
        shutil.rmtree(gloo_rendezvous_dir)


def launch_ps(args):
    ports = None
...
...
@@ -315,6 +329,13 @@ def launch_ps(args):
    default_env = os.environ.copy()
    current_env = copy.copy(default_env)

    gloo_rendezvous_dir = tempfile.mkdtemp()
    # add gloo env
    current_env["PADDLE_WITH_GLOO"] = "1"
    current_env["PADDLE_GLOO_RENDEZVOUS"] = "2"
    current_env["PADDLE_GLOO_FS_PATH"] = gloo_rendezvous_dir

    current_env.pop("http_proxy", None)
    current_env.pop("https_proxy", None)

    procs = []
...
...
@@ -419,6 +440,9 @@ def launch_ps(args):
            procs[i].proc.terminate()
        print("all parameter server are killed", file=sys.stderr)

    if os.path.exists(gloo_rendezvous_dir):
        shutil.rmtree(gloo_rendezvous_dir)


def launch():
    args = _parse_args()
...
...
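The launcher now creates a throwaway file-system rendezvous directory for Gloo, injects it into every trainer's environment, and removes it when the job ends. A self-contained sketch of that bookkeeping with the standard library only (the child command here is a stand-in for a real trainer process):

import os, shutil, subprocess, tempfile

envs = os.environ.copy()
gloo_rendezvous_dir = tempfile.mkdtemp()
envs["PADDLE_WITH_GLOO"] = "1"
envs["PADDLE_GLOO_RENDEZVOUS"] = "2"          # file-system rendezvous mode
envs["PADDLE_GLOO_FS_PATH"] = gloo_rendezvous_dir

try:
    # each local trainer would be started with this environment
    subprocess.check_call(
        ["python", "-c", "import os; print(os.environ['PADDLE_GLOO_FS_PATH'])"],
        env=envs)
finally:
    if os.path.exists(gloo_rendezvous_dir):
        shutil.rmtree(gloo_rendezvous_dir)    # same cleanup the launcher performs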
python/paddle/distributed/fleet/launch_utils.py
...
...
@@ -398,8 +398,14 @@ def start_local_trainers(cluster,
                         pod,
                         training_script,
                         training_script_args,
-                        log_dir=None):
-   current_env = copy.copy(os.environ.copy())
+                        log_dir=None,
+                        envs=None):
+
+   if envs is None:
+       current_env = copy.copy(os.environ.copy())
+   else:
+       current_env = copy.copy(envs)

    #paddle broadcast ncclUniqueId use socket, and
    #proxy maybe make trainers unreachable, so delete them.
    #if we set them to "", grpc will log error message "bad uri"
...
...
python/paddle/distributed/fleet/meta_optimizers/common.py
...
...
@@ -57,12 +57,12 @@ class CollectiveHelper(object):
        if startup_program is None:
            self.startup_program = fluid.default_startup_program()

-       endpoints = self.role_maker.get_trainer_endpoints()
-       current_endpoint = endpoints[self.role_maker.worker_index()]
+       endpoints = self.role_maker._get_trainer_endpoints()
+       current_endpoint = endpoints[self.role_maker._worker_index()]
        for ring_id in range(self.nrings):
            self._init_communicator(
                self.startup_program, current_endpoint, endpoints,
-               self.role_maker.worker_index(), ring_id, self.wait_port)
+               self.role_maker._worker_index(), ring_id, self.wait_port)
        self._broadcast_params()

    def _init_communicator(self, program, current_endpoint, endpoints, rank,
...
...
python/paddle/distributed/fleet/meta_optimizers/dgc_optimizer.py
...
...
@@ -47,7 +47,7 @@ class DGCOptimizer(MetaOptimizerBase):
            sparsity=configs['sparsity'],
            parameter_list=opt._parameter_list,
            use_nesterov=opt._use_nesterov,
-           num_trainers=self.role_maker.worker_num(),
+           num_trainers=self.role_maker._worker_num(),
            regularization=opt.regularization,
            grad_clip=opt._grad_clip,
            name=opt._name)
...
...
@@ -60,7 +60,7 @@ class DGCOptimizer(MetaOptimizerBase):
        if not isinstance(self.inner_opt, Momentum):
            logging.warn("dgc only works on Momentum optimizer")
            return False
-       if self.role_maker.worker_num() <= 1:
+       if self.role_maker._worker_num() <= 1:
            logging.warn("dgc only works on multi cards")
            return False
...
...
python/paddle/distributed/fleet/meta_optimizers/graph_execution_optimizer.py
...
...
@@ -50,12 +50,12 @@ class GraphExecutionOptimizer(MetaOptimizerBase):
    # should fix the variable
    def _setup_nccl_op(self, startup_program, main_program, build_strategy):
-       trainer_endpoints = self.role_maker.get_trainer_endpoints()
+       trainer_endpoints = self.role_maker._get_trainer_endpoints()
        trainers = trainer_endpoints
-       trainer_id = self.role_maker.worker_index()
-       current_endpoint = self.role_maker.get_trainer_endpoints()[trainer_id]
+       trainer_id = self.role_maker._worker_index()
+       current_endpoint = self.role_maker._get_trainer_endpoints()[trainer_id]
        trainer_endpoints_env = ",".join(trainer_endpoints)
-       trainers_num = self.role_maker.worker_num()
+       trainers_num = self.role_maker._worker_num()
        nccl_id_var = startup_program.global_block().create_var(
            name="NCCLID", persistable=True, type=core.VarDesc.VarType.RAW)
        for i in range(1, build_strategy.nccl_comm_num):
...
...
@@ -127,8 +127,8 @@ class GraphExecutionOptimizer(MetaOptimizerBase):
            local_build_strategy.enable_sequential_execution = True

        exe_strategy = self.user_defined_strategy.execution_strategy
-       worker_num = self.role_maker.worker_num()
-       node_num = self.role_maker.node_num()
+       worker_num = self.role_maker._worker_num()
+       node_num = self.role_maker._node_num()
        if self.role_maker._is_collective:
            assert worker_num >= 1, "nccl2 worker_num must >= 1, now:{}" % worker_num
...
...
@@ -170,9 +170,9 @@ class GraphExecutionOptimizer(MetaOptimizerBase):
        # TODO(guru4elephant): should be an independent optimizer
        self._setup_nccl_op(startup_program, main_program, local_build_strategy)

-       local_build_strategy.num_trainers = self.role_maker.worker_num()
-       local_build_strategy.trainer_id = self.role_maker.worker_index()
-       local_build_strategy.trainers_endpoints = self.role_maker.get_trainer_endpoints(
+       local_build_strategy.num_trainers = self.role_maker._worker_num()
+       local_build_strategy.trainer_id = self.role_maker._worker_index()
+       local_build_strategy.trainers_endpoints = self.role_maker._get_trainer_endpoints(
        )
        local_build_strategy.enable_backward_optimizer_op_deps = True
...
...
python/paddle/distributed/fleet/meta_optimizers/localsgd_optimizer.py
...
...
@@ -38,7 +38,7 @@ class LocalSGDOptimizer(MetaOptimizerBase):
        if not self.user_defined_strategy.localsgd:
            return False

-       if self.role_maker.worker_num() <= 1:
+       if self.role_maker._worker_num() <= 1:
            return False

        return isinstance(self.inner_opt, paddle.optimizer.momentum.Momentum) \
...
...
@@ -168,7 +168,7 @@ class LocalSGDOptimizer(MetaOptimizerBase):
                    inputs={'X': [param]},
                    outputs={'Out': [param]},
                    attrs={
-                       'scale': 1.0 / self.role_maker.worker_num(),
+                       'scale': 1.0 / self.role_maker._worker_num(),
                        OP_ROLE_KEY: OpRole.Optimize
                    })
                sub_block.append_op(
...
...
@@ -208,7 +208,7 @@ class AdaptiveLocalSGDOptimizer(MetaOptimizerBase):
        if not self.user_defined_strategy.adaptive_localsgd:
            return False

-       if self.role_maker.worker_num() <= 1:
+       if self.role_maker._worker_num() <= 1:
            return False

        return isinstance(self.inner_opt, paddle.optimizer.momentum.Momentum) \
...
...
@@ -275,7 +275,7 @@ class AdaptiveLocalSGDOptimizer(MetaOptimizerBase):
                inputs={'X': [avg_loss]},
                outputs={'Out': [avg_loss]},
                attrs={
-                   'scale': 1.0 / self.role_maker.worker_num(),
+                   'scale': 1.0 / self.role_maker._worker_num(),
                    OP_ROLE_KEY: OpRole.Optimize
                })
...
...
@@ -398,7 +398,7 @@ class AdaptiveLocalSGDOptimizer(MetaOptimizerBase):
                    inputs={'X': [param]},
                    outputs={'Out': [param]},
                    attrs={
-                       'scale': 1.0 / self.role_maker.worker_num(),
+                       'scale': 1.0 / self.role_maker._worker_num(),
                        OP_ROLE_KEY: OpRole.Optimize
                    })
                sub_block.append_op(
...
...
python/paddle/distributed/fleet/meta_optimizers/parameter_server_graph_optimizer.py
...
...
@@ -31,7 +31,7 @@ class ParameterServerGraphOptimizer(ParameterServerOptimizer):
        if k_steps < 0:
            return False

-       if self.role_maker.is_server():
+       if self.role_maker._is_server():
            return False

        if self.role_maker._is_heter_parameter_server_mode:
...
...
python/paddle/distributed/fleet/meta_optimizers/parameter_server_optimizer.py
...
...
@@ -239,10 +239,10 @@ class ParameterServerOptimizer(MetaOptimizerBase):
            strategy, self.role_maker)
        compiled_config.strategy = strategy

-       if self.role_maker.is_worker() or self.role_maker._is_heter_worker():
+       if self.role_maker._is_worker() or self.role_maker._is_heter_worker():
            main_program, startup_program = self._build_trainer_programs(
                compiled_config)
-       elif self.role_maker.is_server():
+       elif self.role_maker._is_server():
            main_program, startup_program = self._build_pserver_programs(
                compiled_config)
...
...
python/paddle/distributed/fleet/meta_optimizers/pipeline_optimizer.py
...
...
@@ -126,11 +126,11 @@ class PipelineOptimizer(MetaOptimizerBase):
        optimize_ops, params_grads, prog_list = \
            self.wrapped_opt.minimize(loss, startup_program, parameter_list,
                                      no_grad_set)

-       if self.role_maker.worker_num() == 1:
+       if self.role_maker._worker_num() == 1:
            return optimize_ops, params_grads

-       endpoints = self.role_maker.get_trainer_endpoints()
-       current_endpoint = endpoints[self.role_maker.worker_index()]
+       endpoints = self.role_maker._get_trainer_endpoints()
+       current_endpoint = endpoints[self.role_maker._worker_index()]
        self.startup_program = startup_program
        if startup_program is None:
            self.startup_program = fluid.default_startup_program()
...
...
@@ -142,7 +142,7 @@ class PipelineOptimizer(MetaOptimizerBase):
        self.nranks = nranks
        self.nrings = len(self.main_program_list)

-       self.rank = self.role_maker.worker_index()
+       self.rank = self.role_maker._worker_index()
        self.endpoints = endpoints
        self.current_endpoint = current_endpoint
...
...
python/paddle/distributed/fleet/runtime/parameter_server_runtime.py
...
...
@@ -104,9 +104,9 @@ class ParameterServerRuntime(RuntimeBase):
    def _init_worker(self):
        def sync_strategy_envs():
            kwargs = {}
-           kwargs["pserver_endpoints"] = self.role_maker.get_pserver_endpoints()
-           kwargs["trainer_id"] = self.role_maker.worker_index()
+           kwargs["pserver_endpoints"] = self.role_maker._get_pserver_endpoints()
+           kwargs["trainer_id"] = self.role_maker._worker_index()
            return kwargs

        def geo_strategy_envs():
...
...
@@ -150,7 +150,7 @@ class ParameterServerRuntime(RuntimeBase):
            return "#".join(init_attrs)

        kwargs = {}
-       kwargs["trainers"] = self.role_maker.worker_num()
+       kwargs["trainers"] = self.role_maker._worker_num()
        kwargs["sparse_attrs"] = get_sparse_attrs()
        return kwargs
...
...
@@ -338,7 +338,7 @@ class ParameterServerRuntime(RuntimeBase):
        block.append_op(
            type='recv_save',
            attrs={
-               "trainer_id": self.role_maker.worker_index(),
+               "trainer_id": self.role_maker._worker_index(),
                "shape": var.shape,
                "slice_shapes": [",".join([str(i) for i in var.shape])],
...
...
@@ -378,14 +378,15 @@ class ParameterServerRuntime(RuntimeBase):
        block.append_op(
            type='recv_save',
            attrs={
-               "trainer_id": self.role_maker.worker_index(),
+               "trainer_id": self.role_maker._worker_index(),
                "shape": var.shape,
                "slice_shapes": slice_shapes,
                "slice_varnames": var_ctx.split_varnames(),
                "remote_varnames": var_ctx.split_varnames(),
                "is_sparse": True,
                "endpoints": var_ctx.split_endpoints(),
-               "pserver_num": len(self.role_maker.get_pserver_endpoints()),
+               "pserver_num": len(self.role_maker._get_pserver_endpoints()),
                "file_path": os.path.join(dirname, var.name)
            })
...
...
@@ -403,7 +404,7 @@ class ParameterServerRuntime(RuntimeBase):
        block.append_op(
            type='recv_save',
            attrs={
-               "trainer_id": self.role_maker.worker_index(),
+               "trainer_id": self.role_maker._worker_index(),
                "shape": var.shape,
                "slice_shapes": slice_shapes,
                "slice_varnames": slice_varnames,
...
...
@@ -411,7 +412,7 @@ class ParameterServerRuntime(RuntimeBase):
                "is_sparse": True,
                "endpoints": var_ctx.split_endpoints(),
                "pserver_num":
-               len(self.role_maker.get_pserver_endpoints()),
+               len(self.role_maker._get_pserver_endpoints()),
                "file_path": os.path.join(dirname, var.name)
            })
...
...
@@ -422,7 +423,7 @@ class ParameterServerRuntime(RuntimeBase):
        block.append_op(
            type='recv_save',
            attrs={
-               "trainer_id": self.role_maker.worker_index(),
+               "trainer_id": self.role_maker._worker_index(),
                "shape": var.shape,
                "slice_shapes": [",".join([str(i) for i in var.shape])],
...
...
python/paddle/fluid/__init__.py
...
...
@@ -197,6 +197,7 @@ def __bootstrap__():
        'free_when_no_cache_hit',
        'call_stack_level',
        'sort_sum_gradient',
        'max_inplace_grad_add',
    ]
    if 'Darwin' not in sysstr:
        read_env_flags.append('use_pinned_memory')
...
...
python/paddle/fluid/backward.py
...
...
@@ -251,12 +251,19 @@ def _rename_arg_(op_descs, old_name, new_name, begin_idx=None, end_idx=None):
        begin_idx = 0
    if end_idx is None:
        end_idx = len(op_descs)
-   for i in range(begin_idx, end_idx):
-       op_desc = op_descs[i]
-       if isinstance(op_desc, tuple):
-           op_desc = op_desc[0]
-       op_desc._rename_input(old_name, new_name)
-       op_desc._rename_output(old_name, new_name)
+   if isinstance(op_descs, (list, tuple)):
+       for i in range(begin_idx, end_idx):
+           op_desc = op_descs[i]
+           if isinstance(op_desc, tuple):
+               op_desc = op_desc[0]
+           op_desc._rename_input(old_name, new_name)
+           op_desc._rename_output(old_name, new_name)
+   if isinstance(op_descs, collections.OrderedDict):
+       for key, value in op_descs.items():
+           if isinstance(value, (list, tuple)):
+               for op_desc in value:
+                   op_desc._rename_input(old_name, new_name)
+                   op_desc._rename_output(old_name, new_name)


def _create_op_desc_(op_type, inputs, outputs, attrs):
...
...
@@ -369,6 +376,41 @@ def _append_grad_suffix_(name):
    return cpt.to_text(name) + core.grad_var_suffix()


def _accumulate_gradients_by_sum_op_(var_name, renamed_vars, pending_sum_ops,
                                     op_idx):
    """
    Use a sum op to accumulate gradients; the gradients are stored in renamed_vars.
    """
    if op_idx not in pending_sum_ops.keys():
        pending_sum_ops[op_idx] = []
    pending_sum_ops[op_idx].append(
        _create_op_desc_("sum", {"X": renamed_vars[var_name]},
                         {"Out": [var_name]}, {"use_mkldnn": False}))
    renamed_vars[var_name] = [var_name]


def _accumulate_gradients_by_add_ops_(var_name, renamed_vars, pending_sum_ops,
                                      op_idx):
    """
    Use several inplace add ops to accumulate gradients; the gradients are stored in renamed_vars.
    """
    if op_idx not in pending_sum_ops.keys():
        pending_sum_ops[op_idx] = []
    out_name = renamed_vars[var_name][0]
    for i in range(1, len(renamed_vars[var_name])):
        x_name = out_name
        y_name = renamed_vars[var_name][i]
        if i != len(renamed_vars[var_name]) - 1:
            out_name = var_name + '@ADD@' + str(i)
        else:
            out_name = var_name
        pending_sum_ops[op_idx].append(
            _create_op_desc_("grad_add", {"X": [x_name],
                                          "Y": [y_name]}, {"Out": [out_name]},
                             {"use_mkldnn": False}))
    renamed_vars[var_name] = [var_name]


def _addup_repetitive_outputs_(op_descs, block_idx):
    """
    In the backward part, a variable may be the output of more than one op.
...
...
@@ -376,7 +418,9 @@ def _addup_repetitive_outputs_(op_descs, block_idx):
    In these cases, the variable should be the accumulation of all the outputs.
    `sum_op`s are added to implement the accumulate.
    """
-   pending_sum_ops = []
+   _MAX_ADD_NUM_ = core.globals()['FLAGS_max_inplace_grad_add']
+   #pending_sum_ops = []
+   pending_sum_ops = collections.OrderedDict()
    var_rename_count = collections.defaultdict(int)
    renamed_vars = collections.defaultdict(list)
    renamed_var_start_idx = collections.defaultdict(list)
...
...
@@ -385,10 +429,13 @@ def _addup_repetitive_outputs_(op_descs, block_idx):
            if "@GRAD" not in var_name:
                continue
            if len(renamed_vars[var_name]) > 1:
-               pending_sum_ops.append((_create_op_desc_(
-                   "sum", {"X": renamed_vars[var_name]}, {"Out": [var_name]},
-                   {"use_mkldnn": False}), idx))
-               renamed_vars[var_name] = [var_name]
+               if len(renamed_vars[var_name]) > _MAX_ADD_NUM_:
+                   _accumulate_gradients_by_sum_op_(var_name, renamed_vars,
+                                                    pending_sum_ops, idx)
+               else:
+                   _accumulate_gradients_by_add_ops_(var_name, renamed_vars,
+                                                     pending_sum_ops, idx)

        for param_idx, param_name in enumerate(op_desc.output_names()):
            arg_names = op_desc.output(param_name)
            for arg_idx, var_name in enumerate(arg_names):
...
...
@@ -440,13 +487,26 @@ def _addup_repetitive_outputs_(op_descs, block_idx):
                        renamed_vars[var_name].append(new_name)

    for var_name, inputs in six.iteritems(renamed_vars):
-       if len(inputs) > 1:
-           pending_sum_ops.append(
-               (_create_op_desc_("sum", {"X": inputs}, {"Out": [var_name]},
-                                 {"use_mkldnn": False}), len(op_descs)))
+       if len(renamed_vars[var_name]) > 1:
+           if len(renamed_vars[var_name]) > _MAX_ADD_NUM_:
+               _accumulate_gradients_by_sum_op_(var_name, renamed_vars,
+                                                pending_sum_ops, len(op_descs))
+           else:
+               _accumulate_gradients_by_add_ops_(var_name, renamed_vars,
+                                                 pending_sum_ops, len(op_descs))

    # sum_op descs are sorted according to their insert position
-   for p in reversed(pending_sum_ops):
-       op_descs.insert(p[1], p[0])
+   for key, value in collections.OrderedDict(
+           reversed(list(pending_sum_ops.items()))).items():
+       # NOTE(zhiqiu): Since reversed, the idx of op_descs to be inserted will remain correct.
+       # For example, [0, 1, 2], and we want to insert 'a' at idx 1, 'b' at idx 2, and the expected result is [0, 1, 'a', 2, 'b'].
+       # If reversed, we first insert 'b' at idx 2, it becomes [0, 1, 2, 'b'], and then insert 'a' at idx 1, it becomes [0, 1, 'a', 2, 'b'].
+       # If not reversed, we first insert 'a' at idx 1, it becomes [0, 1, 'a', 2], and then insert 'b' at idx 2, it becomes [0, 1, 'a', 'b', 2].
+       idx = key
+       for i, op in enumerate(value):
+           op_descs.insert(idx + i, op)

    return op_descs
...
...
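The choice between one big sum op and a chain of grad_add ops is driven by FLAGS_max_inplace_grad_add: the chain writes intermediate results into temporaries named `<var>@ADD@<i>` and the final add lands back on the gradient variable itself. A standalone sketch of that bookkeeping (plain Python, no Paddle ops; the duplicate names in the example list are illustrative):

def plan_accumulation(var_name, duplicates, max_inplace_grad_add):
    """Return the op plan that _addup_repetitive_outputs_ would queue for one gradient."""
    if len(duplicates) <= 1:
        return []
    if len(duplicates) > max_inplace_grad_add:
        return [("sum", duplicates, var_name)]            # one sum op over all copies
    ops, out = [], duplicates[0]
    for i in range(1, len(duplicates)):
        dst = var_name if i == len(duplicates) - 1 else var_name + "@ADD@" + str(i)
        ops.append(("grad_add", [out, duplicates[i]], dst))
        out = dst
    return ops

dups = ["x@GRAD@RENAME@0", "x@GRAD@RENAME@1", "x@GRAD@RENAME@2"]
print(plan_accumulation("x@GRAD", dups, max_inplace_grad_add=0))   # falls back to a single sum
print(plan_accumulation("x@GRAD", dups, max_inplace_grad_add=8))   # chain of grad_add ops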
python/paddle/fluid/contrib/slim/quantization/imperative/qat.py
...
...
@@ -99,7 +99,12 @@ class ImperativeQuantAware(object):
        self._activation_bits = activation_bits
        self._moving_rate = moving_rate

-       quant_type = {'abs_max', 'moving_average_abs_max'}
+       quant_type = {'abs_max', 'moving_average_abs_max', 'channel_wise_abs_max'}
+       assert activation_quantize_type != 'channel_wise_abs_max', \
+           "The activation quantization type does not support 'channel_wise_abs_max'."
        if activation_quantize_type not in quant_type:
            raise ValueError(
                "Unknown activation_quantize_type : '%s'. It can only be "
...
...
@@ -108,8 +113,8 @@ class ImperativeQuantAware(object):
        if weight_quantize_type not in quant_type:
            raise ValueError(
                "Unknown weight_quantize_type: '%s'. It can only be "
-               "'abs_max' or 'moving_average_abs_max' now." %
-               (str(weight_quantize_type)))
+               "'abs_max' or 'moving_average_abs_max' or 'channel_wise_abs_max' now."
+               % (str(weight_quantize_type)))

        self._activation_quantize_type = activation_quantize_type
        self._weight_quantize_type = weight_quantize_type
...
...
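With this change the imperative QAT helper accepts 'channel_wise_abs_max' for weights, while activations must still use a per-tensor scheme. A hedged construction sketch; the import path and the weight_bits/activation_bits keyword names are assumptions based on the constructor fields touched in this diff, and the surrounding model/optimizer setup is omitted:

from paddle.fluid.contrib.slim.quantization import ImperativeQuantAware

imperative_qat = ImperativeQuantAware(
    weight_quantize_type='channel_wise_abs_max',        # per-output-channel weight scales
    activation_quantize_type='moving_average_abs_max',  # channel-wise is rejected for activations
    weight_bits=8,
    activation_bits=8)

# imperative_qat.quantize(model)   # wrap an existing dygraph model before training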
python/paddle/fluid/contrib/slim/quantization/imperative/quant_nn.py
浏览文件 @
f52c4f8b
...
...
@@ -24,7 +24,7 @@ from paddle.fluid.data_feeder import check_variable_and_dtype
 __all__ = [
     'FakeQuantMovingAverage', 'FakeQuantAbsMax', 'QuantizedConv2D',
-    'QuantizedLinear'
+    'QuantizedLinear', 'FakeChannelWiseQuantDequantAbsMax'
 ]
...
...
@@ -209,6 +209,89 @@ class FakeQuantAbsMax(layers.Layer):
        return quant_out


class FakeChannelWiseQuantDequantAbsMax(layers.Layer):
    def __init__(self,
                 name=None,
                 channel_num=None,
                 quant_bits=8,
                 quant_axis=0,
                 dtype='float32',
                 quant_on_weight=False):
        assert quant_on_weight == True, "Channel_wise only can be used on weight quantization."
        super(FakeChannelWiseQuantDequantAbsMax, self).__init__()
        self._quant_bits = quant_bits
        self._quant_axis = quant_axis
        self._dtype = dtype
        self._name = name
        self._channel_num = channel_num
        scale_prefix = "{}.scale".format(
            name) if name else 'quant_dequant.scale'
        self._scale_name = unique_name.generate(scale_prefix)
        if quant_on_weight:
            scale_attr = ParamAttr(
                name=self._scale_name,
                initializer=Constant(0.0),
                trainable=False)
            self._scale = self.create_parameter(
                shape=[self._channel_num], attr=scale_attr, dtype=self._dtype)
            self._scale.stop_gradient = True
        else:
            self._scale = None

    def forward(self, input):
        if in_dygraph_mode():
            attrs = ('bit_length', self._quant_bits, 'quant_axis',
                     self._quant_axis)
            quant_out = _varbase_creator(
                type=input.type,
                name="{}.quantized.dequantized".format(input.name),
                shape=input.shape,
                dtype=input.dtype,
                persistable=False)
            out_scale = self._scale
            if out_scale is None:
                out_scale = _varbase_creator(
                    type=core.VarDesc.VarType.LOD_TENSOR,
                    name=self._scale_name,
                    shape=[self._channel_num],
                    dtype=self._dtype,
                    persistable=False)
                out_scale.stop_gradient = True
            out, _, = core.ops.fake_channel_wise_quantize_dequantize_abs_max(
                input, quant_out, out_scale, *attrs)
            return out

        check_variable_and_dtype(input, 'input', ['float32'],
                                 "FakeChannelWiseQuantDequantAbsMax")
        attrs = {'bit_length': self._quant_bits, 'quant_axis': self._quant_axis}
        inputs = {"X": [input]}
        quant_out = self._helper.create_variable(
            name="{}.quantized.dequantized".format(input.name),
            dtype=input.dtype,
            type=core.VarDesc.VarType.LOD_TENSOR,
            persistable=False,
            stop_gradient=False)
        out_scale = self._scale
        if not out_scale:
            out_scale = self._helper.create_variable(
                name=self._scale_name,
                dtype=self._dtype,
                type=core.VarDesc.VarType.LOD_TENSOR,
                persistable=False,
                stop_gradient=True)
        outputs = {"Out": [quant_out], "OutScale": [out_scale]}

        self._helper.append_op(
            type="fake_channel_wise_quantize_dequantize_abs_max",
            inputs=inputs,
            outputs=outputs,
            attrs=attrs)

        return quant_out
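For readers who want the math rather than the op plumbing, here is a rough NumPy sketch of what per-channel abs-max fake quantize-dequantize computes. The actual fake_channel_wise_quantize_dequantize_abs_max kernel may differ in edge cases (zero scales, rounding mode), and the helper name below is hypothetical.

import numpy as np

def fake_channel_wise_quant_dequant(x, quant_bits=8, quant_axis=0):
    # One scale per channel along quant_axis: scale_c = max(|x_c|).
    bnt = (1 << (quant_bits - 1)) - 1                      # 127 for 8 bits
    x_t = np.moveaxis(x, quant_axis, 0)
    scales = np.abs(x_t).reshape(x_t.shape[0], -1).max(axis=1)
    scales = np.maximum(scales, 1e-8)                      # avoid division by zero
    s = scales.reshape((-1,) + (1,) * (x_t.ndim - 1))
    y_t = np.round(x_t / s * bnt) * s / bnt                # quantize, then dequantize
    return np.moveaxis(y_t, 0, quant_axis), scales

# Example: a conv weight of shape [out_channels, in_channels, kh, kw], quant_axis=0.
w = np.random.randn(4, 3, 3, 3).astype('float32')
w_qdq, w_scales = fake_channel_wise_quant_dequant(w, quant_bits=8, quant_axis=0)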
def _get_fake_quant_type(quant_type, **kwargs):
    call_args = {
        "name": kwargs.get("name", None),
...
...
@@ -220,10 +303,17 @@ def _get_fake_quant_type(quant_type, **kwargs):
         call_args["quant_on_weight"] = kwargs.get("quant_on_weight", False)
     elif quant_type == 'moving_average_abs_max':
         call_args["moving_rate"] = kwargs.get("moving_rate", 0.9)
+    elif quant_type == 'channel_wise_abs_max':
+        call_args["quant_on_weight"] = kwargs.get("quant_on_weight", False)
+        call_args["channel_num"] = kwargs.get("channel_num", None)
+        call_args["quant_axis"] = kwargs.get("quant_axis", 0)
+        assert call_args["channel_num"] is not None, (
+            "You need to input channel_num "
+            "when you use channel_wise_abs_max strategy.")
     fake_quant_map = {
         'abs_max': FakeQuantAbsMax,
-        'moving_average_abs_max': FakeQuantMovingAverage
+        'moving_average_abs_max': FakeQuantMovingAverage,
+        'channel_wise_abs_max': FakeChannelWiseQuantDequantAbsMax
     }

     return fake_quant_map[quant_type](**call_args)
...
...
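As a concrete dispatch example, a hypothetical call mirroring how QuantizedConv2D wires the weight quantizer in the hunk below might look like this; argument values here are illustrative, and channel_num is mandatory for the channel-wise strategy because of the assert above.

# Hypothetical call; QuantizedConv2D below passes
# channel_num=self.weight.shape[0] and quant_axis=0 for conv weights.
fake_quant_weight = _get_fake_quant_type(
    'channel_wise_abs_max',
    name='conv2d_0.w_0',
    quant_bits=8,
    dtype='float32',
    quant_on_weight=True,
    channel_num=32,
    quant_axis=0)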
@@ -255,19 +345,23 @@ class QuantizedConv2D(layers.Layer):
         self.weight = getattr(layer, 'weight')
         self.bias = getattr(layer, 'bias')
         # For FakeQuant
+        self._conv2d_quant_axis = 0
         self._fake_quant_weight = _get_fake_quant_type(
             weight_quantize_type,
             name=self.weight.name,
             moving_rate=moving_rate,
             quant_bits=weight_bits,
             dtype=self._dtype,
-            quant_on_weight=True)
+            quant_on_weight=True,
+            channel_num=self.weight.shape[self._conv2d_quant_axis],
+            quant_axis=self._conv2d_quant_axis)
         self._fake_quant_input = _get_fake_quant_type(
             activation_quantize_type,
             name=layer.full_name(),
             moving_rate=moving_rate,
             quant_bits=activation_bits,
-            dtype=self._dtype)
+            dtype=self._dtype,
+            quant_on_weight=False)

     def forward(self, input):
         quant_input = self._fake_quant_input(input)
...
...
@@ -341,19 +435,23 @@ class QuantizedLinear(layers.Layer):
         self.weight = getattr(layer, 'weight')
         self.bias = getattr(layer, 'bias')
         # For FakeQuant
+        self._linear_quant_axis = 1
         self._fake_quant_weight = _get_fake_quant_type(
             weight_quantize_type,
             name=self.weight.name,
             moving_rate=moving_rate,
             quant_bits=weight_bits,
             dtype=self._dtype,
-            quant_on_weight=True)
+            quant_on_weight=True,
+            channel_num=self.weight.shape[self._linear_quant_axis],
+            quant_axis=self._linear_quant_axis)
         self._fake_quant_input = _get_fake_quant_type(
             activation_quantize_type,
             name=layer.full_name(),
             moving_rate=moving_rate,
             quant_bits=activation_bits,
-            dtype=self._dtype)
+            dtype=self._dtype,
+            quant_on_weight=False)

     def forward(self, input):
         quant_input = self._fake_quant_input(input)
...
...
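One detail worth calling out: the two wrappers pick different quant_axis values because the per-output-channel dimension sits in different positions of the weight tensors. The layouts below are the standard fluid dygraph ones and are an assumption here, not something stated in this diff.

import numpy as np

# Assumed weight layouts:
#   Conv2D.weight : [out_channels, in_channels, kh, kw] -> per-channel axis 0
#   Linear.weight : [in_features, out_features]         -> per-channel axis 1
conv_w = np.zeros((32, 3, 3, 3))        # stand-in for a Conv2D weight
linear_w = np.zeros((128, 10))          # stand-in for a Linear weight
channel_num_conv = conv_w.shape[0]      # what QuantizedConv2D passes with quant_axis=0
channel_num_linear = linear_w.shape[1]  # what QuantizedLinear passes with quant_axis=1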
python/paddle/fluid/contrib/slim/tests/test_imperative_qat.py
...
...
@@ -181,7 +181,6 @@ class TestImperativeQat(unittest.TestCase):
                img = fluid.dygraph.to_variable(x_data)
                label = fluid.dygraph.to_variable(y_data)
                out = lenet(img)
                acc = fluid.layers.accuracy(out, label)
                loss = fluid.layers.cross_entropy(out, label)
...
...
python/paddle/fluid/contrib/slim/tests/test_imperative_qat_channelwise.py
0 → 100644 (new file; diff collapsed)
python/paddle/fluid/incubate/fleet/parameter_server/ir/public.py
...
...
@@ -170,22 +170,40 @@ class CompileTimeStrategy(object):
         return trainer.mode == DistributedMode.ASYNC

     def get_role_id(self):
-        return self.role_maker.role_id()
+        try:
+            return self.role_maker._role_id()
+        except Exception:
+            return self.role_maker.role_id()

     def get_trainers(self):
-        return self.role_maker.worker_num()
+        try:
+            return self.role_maker._worker_num()
+        except Exception:
+            return self.role_maker.worker_num()

     def get_ps_endpoint(self):
-        return self.role_maker.get_pserver_endpoints()[self.get_role_id()]
+        try:
+            return self.role_maker._get_pserver_endpoints()[self.get_role_id()]
+        except Exception:
+            return self.role_maker.get_pserver_endpoints()[self.get_role_id()]

     def get_ps_endpoints(self):
-        return self.role_maker.get_pserver_endpoints()
+        try:
+            return self.role_maker._get_pserver_endpoints()
+        except Exception:
+            return self.role_maker.get_pserver_endpoints()

     def get_heter_worker_endpoints(self):
-        return self.role_maker._get_heter_worker_endpoints()
+        try:
+            return self.role_maker._get_heter_worker_endpoints()
+        except Exception:
+            return self.role_maker.get_heter_worker_endpoints()

     def get_heter_worker_endpoint(self):
-        return self.role_maker._get_heter_worker_endpoint()
+        try:
+            return self.role_maker._get_heter_worker_endpoint()
+        except Exception:
+            return self.role_maker.get_heter_worker_endpoint()

     def get_origin_programs(self):
         return self.origin_main_program, self.origin_startup_program
...
...
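Each accessor above repeats the same try/except fallback from the new private RoleMaker API to the legacy public one. A possible way to express that pattern once is sketched below; this helper is hypothetical and not part of the diff, shown only to make the intent of the repeated blocks explicit.

# Hypothetical helper, not in the diff: prefer the new private RoleMaker
# method and fall back to the legacy public one.
def _role_maker_call(role_maker, new_name, old_name, *args):
    try:
        return getattr(role_maker, new_name)(*args)
    except Exception:
        return getattr(role_maker, old_name)(*args)

# e.g. get_trainers() would reduce to:
#     return _role_maker_call(self.role_maker, '_worker_num', 'worker_num')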
python/paddle/fluid/layers/tensor.py
...
...
@@ -680,8 +680,10 @@ def fill_constant(shape, dtype, value, force_cpu=False, out=None, name=None):
    if not isinstance(value, Variable):
        if dtype in ['int64', 'int32']:
            attrs['str_value'] = str(int(value))
            attrs['value'] = int(value)
        else:
            attrs['str_value'] = str(float(value))
            attrs['value'] = float(value)

    if in_dygraph_mode():
        shape = utils.convert_shape_to_list(shape)
...
...
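The effect of the two added attrs['value'] lines is that the constant is now also stored as a typed attribute: as an int for integer dtypes and as a float otherwise. A small usage example of the public API, using nothing beyond the signature shown above:

import paddle.fluid as fluid

# Integer dtype: the value is cast with int() before being stored in attrs.
x = fluid.layers.fill_constant(shape=[2, 3], dtype='int64', value=5)
# Floating dtype: the value is cast with float().
y = fluid.layers.fill_constant(shape=[2, 3], dtype='float32', value=1.5)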
python/paddle/fluid/tests/unittests/ir/inference/test_conv_affine_channel_fuse_pass.py
0 → 100644 (new file; diff collapsed)
python/paddle/fluid/tests/unittests/ir/inference/test_conv_bn_fuse_pass.py
0 → 100644 (new file; diff collapsed)
python/paddle/fluid/tests/unittests/ir/inference/test_conv_elementwise_add2_act_fuse_pass.py
...
...
@@ -19,6 +19,7 @@ import numpy as np
from inference_pass_test import InferencePassTest
import paddle.fluid as fluid
import paddle.fluid.core as core
from paddle.fluid.core import PassVersionChecker
from paddle.fluid.core import AnalysisConfig

"""Test for fusion of conv, elementwise_add and 2 act."""
...
...
@@ -46,6 +47,9 @@ class ConvElementwiseAdd2ActFusePassTest(InferencePassTest):
        if core.is_compiled_with_cuda():
            use_gpu = True
        self.check_output_with_option(use_gpu)
        self.assertTrue(
            PassVersionChecker.IsCompatible(
                'conv_elementwise_add2_act_fuse_pass'))


if __name__ == "__main__":
...
...
python/paddle/fluid/tests/unittests/ir/inference/test_conv_elementwise_add_act_fuse_pass.py
...
...
@@ -19,6 +19,7 @@ import numpy as np
from inference_pass_test import InferencePassTest
import paddle.fluid as fluid
import paddle.fluid.core as core
from paddle.fluid.core import PassVersionChecker
from paddle.fluid.core import AnalysisConfig

"""Test for fusion of conv, elementwise_add and act."""
...
...
@@ -48,6 +49,9 @@ class ConvElementwiseAddActFusePassTest(InferencePassTest):
        if core.is_compiled_with_cuda():
            use_gpu = True
        self.check_output_with_option(use_gpu)
        self.assertTrue(
            PassVersionChecker.IsCompatible(
                'conv_elementwise_add_act_fuse_pass'))


if __name__ == "__main__":
...
...
python/paddle/fluid/tests/unittests/ir/inference/test_conv_elementwise_add_fuse_pass.py
...
...
@@ -19,6 +19,7 @@ import numpy as np
from inference_pass_test import InferencePassTest
import paddle.fluid as fluid
import paddle.fluid.core as core
from paddle.fluid.core import PassVersionChecker
from paddle.fluid.core import AnalysisConfig

"""Test for fusion of conv and elementwise_add."""
...
...
@@ -44,6 +45,8 @@ class ConvElementwiseAddFusePassTest(InferencePassTest):
        if core.is_compiled_with_cuda():
            use_gpu = True
        self.check_output_with_option(use_gpu)
        self.assertTrue(
            PassVersionChecker.IsCompatible('conv_elementwise_add_fuse_pass'))


if __name__ == "__main__":
...
...
python/paddle/fluid/tests/unittests/ir/inference/test_fc_fuse_pass.py
0 → 100644 (new file; diff collapsed)
python/paddle/fluid/tests/unittests/ir/inference/test_fc_gru_fuse_pass.py
0 → 100644 (new file; diff collapsed)
python/paddle/fluid/tests/unittests/ir/inference/test_fc_lstm_fuse_pass.py
0 → 100644
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
import numpy as np
from inference_pass_test import InferencePassTest
import paddle.fluid as fluid
import paddle.fluid.core as core
from paddle.fluid.core import PassVersionChecker


class MulLstmFusePassTest(InferencePassTest):
    def setUp(self):
        with fluid.program_guard(self.main_program, self.startup_program):
            dict_dim, emb_dim = 128, 64
            hidden_dim = 512
            data = fluid.data(
                name='data', shape=[1], dtype='int64', lod_level=1)
            emb = fluid.embedding(input=data, size=[dict_dim, emb_dim])
            x = fluid.layers.fc(input=emb, size=hidden_dim * 4, bias_attr=False)
            forward, cell = fluid.layers.dynamic_lstm(
                input=x, size=hidden_dim * 4)

        batch = 16
        lod_tensor = fluid.LoDTensor()
        lod_tensor.set(
            np.random.randint(
                0, dict_dim, size=[batch]).astype("int64"),
            fluid.CPUPlace())
        lod_tensor.set_lod([[0, batch]])
        self.feeds = {"data": lod_tensor}
        self.fetch_list = [forward, cell]

    def test_check_output(self):
        use_gpu = False
        self.check_output_with_option(use_gpu)
        self.assertTrue(PassVersionChecker.IsCompatible('mul_lstm_fuse_pass'))


if __name__ == "__main__":
    unittest.main()
python/paddle/fluid/tests/unittests/ir/inference/test_repeated_fc_relu_fuse_pass.py (0 → 100644, new file; diff collapsed)
python/paddle/fluid/tests/unittests/ir/inference/test_squared_mat_sub_fuse_pass.py (0 → 100644, new file; diff collapsed)
python/paddle/fluid/tests/unittests/ir/inference/test_transpose_flatten_concat_fuse_pass.py (diff collapsed)
python/paddle/fluid/tests/unittests/ir/inference/test_trt_shuffle_channel_detect_pass.py (0 → 100644, new file; diff collapsed)
python/paddle/fluid/tests/unittests/test_fake_quantize_op.py (diff collapsed)
python/paddle/fluid/tests/unittests/test_fleet_base.py (diff collapsed)
python/paddle/fluid/tests/unittests/test_fleet_rolemaker_2.py (diff collapsed)
python/paddle/fluid/tests/unittests/test_fleet_rolemaker_new.py (diff collapsed)
python/paddle/fluid/tests/unittests/test_fleet_util.py (diff collapsed)
python/paddle/fluid/tests/unittests/test_inplace_addto_strategy.py (0 → 100644, new file; diff collapsed)
python/paddle/fluid/tests/unittests/test_top_k_v2_op.py (diff collapsed)
python/paddle/fluid/tests/unittests/test_transformer_api.py (diff collapsed)
python/paddle/nn/layer/transformer.py (diff collapsed)
python/paddle/optimizer/adam.py (diff collapsed)
python/paddle/reader/decorator.py (diff collapsed)
python/paddle/tests/test_dataset_cifar.py (diff collapsed)
python/paddle/tests/test_datasets.py (diff collapsed)
python/paddle/text/datasets/uci_housing.py (diff collapsed)
python/paddle/utils/__init__.py (diff collapsed)
python/paddle/utils/lazy_import.py (0 → 100644, new file; diff collapsed)
python/paddle/vision/datasets/cifar.py (diff collapsed)
python/paddle/vision/datasets/folder.py (diff collapsed)
python/paddle/vision/datasets/mnist.py (diff collapsed)
python/paddle/vision/transforms/functional.py (diff collapsed)
python/paddle/vision/transforms/transforms.py (diff collapsed)
python/requirements.txt (diff collapsed)
python/setup.py.in (diff collapsed)
tools/check_api_approvals.sh (diff collapsed)