squeeze_op causes an error in inference mode
Created by: Meiyim
Test environment: CUDA 9, cuDNN 7.0.3. Forward prediction is run on the inference_model from C++ code. The relevant config is as follows:
paddle::AnalysisConfig config;
config.SetModel(FLAGS_model_dir);
config.EnableUseGpu(100, 0);
config.SwitchSpecifyInputNames(true);
config.EnableCUDNN();
config.SwitchIrOptim(true);
config.EnableMemoryOptim();
With fluid_inference 1.6.0, the process core dumps immediately without any message.
With fluid_inference develop, the version info is as follows:
GIT COMMIT ID: 0fe16539ef3651966080d5ae96850da4557751e0
WITH_MKL: ON
WITH_MKLDNN: ON
WITH_GPU: ON
CUDA version: 9.0
CUDNN version: v7
The run log is as follows:
--- Running analysis [ir_graph_build_pass]
--- Running analysis [ir_graph_clean_pass]
--- Running analysis [ir_analysis_pass]
--- Running IR pass [cudnn_placement_pass]
--- Running IR pass [is_test_pass]
--- Running IR pass [simplify_with_basic_ops_pass]
--- Running IR pass [conv_affine_channel_fuse_pass]
--- Running IR pass [conv_eltwiseadd_affine_channel_fuse_pass]
--- Running IR pass [conv_bn_fuse_pass]
--- Running IR pass [conv_eltwiseadd_bn_fuse_pass]
--- Running IR pass [multihead_matmul_fuse_pass]
--- Running IR pass [fc_fuse_pass]
I1219 10:18:14.103123 63015 graph_pattern_detector.cc:101] --- detected 12 subgraphs
I1219 10:18:14.138665 63015 graph_pattern_detector.cc:101] --- detected 62 subgraphs
--- Running IR pass [fc_elementwise_layernorm_fuse_pass]
I1219 10:18:14.181339 63015 graph_pattern_detector.cc:101] --- detected 24 subgraphs
--- Running IR pass [conv_elementwise_add_act_fuse_pass]
--- Running IR pass [conv_elementwise_add2_act_fuse_pass]
--- Running IR pass [conv_elementwise_add_fuse_pass]
--- Running IR pass [transpose_flatten_concat_fuse_pass]
--- Running IR pass [runtime_context_cache_pass]
--- Running analysis [ir_params_sync_among_devices_pass]
I1219 10:18:14.225345 63015 ir_params_sync_among_devices_pass.cc:41] Sync params from CPU to GPU
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [memory_optimize_pass]
I1219 10:18:14.448055 63015 memory_optimize_pass.cc:223] Cluster name : expand_1.tmp_0 size: 1572864
I1219 10:18:14.448089 63015 memory_optimize_pass.cc:223] Cluster name : cast_6.tmp_0 size: 786432
I1219 10:18:14.448096 63015 memory_optimize_pass.cc:223] Cluster name : where_0.tmp_0 size: 16
I1219 10:18:14.448108 63015 memory_optimize_pass.cc:223] Cluster name : fc_25.tmp_1 size: 3072
I1219 10:18:14.448114 63015 memory_optimize_pass.cc:223] Cluster name : layer_norm_4.tmp_2 size: 3072
I1219 10:18:14.448122 63015 memory_optimize_pass.cc:223] Cluster name : scatter_nd_add_22.tmp_0 size: 3072
I1219 10:18:14.448128 63015 memory_optimize_pass.cc:223] Cluster name : scatter_nd_add_23.tmp_0 size: 3072
I1219 10:18:14.448134 63015 memory_optimize_pass.cc:223] Cluster name : layer_norm_14.tmp_2 size: 3072
I1219 10:18:14.448140 63015 memory_optimize_pass.cc:223] Cluster name : shape_1.tmp_0 size: 12
I1219 10:18:14.448146 63015 memory_optimize_pass.cc:223] Cluster name : layer_norm_0.tmp_2 size: 393216
I1219 10:18:14.448158 63015 memory_optimize_pass.cc:223] Cluster name : eval_placeholder_1 size: 1024
--- Running analysis [ir_graph_to_program_pass]
I1219 10:18:14.516865 63015 analysis_predictor.cc:471] ======= optimize end =======
W1219 10:18:15.083250 63015 device_context.cc:236] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 9.2, Runtime API Version: 9.0
W1219 10:18:15.088192 63015 device_context.cc:244] device: 0, cuDNN Version: 7.3.
terminate called after throwing an instance of 'paddle::platform::EnforceNotMet'
what():
--------------------------------------------
C++ Call Stacks (More useful to developers):
--------------------------------------------
0 std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2 paddle::platform::CUDADeviceContext::Wait() const
3 paddle::framework::TransDataDevice(paddle::framework::Tensor const&, paddle::platform::Place const&, paddle::framework::Tensor*)
4 paddle::framework::TransformData(paddle::framework::OpKernelType const&, paddle::framework::OpKernelType const&, paddle::framework::Tensor const&, paddle::framework::Tensor*)
5 paddle::framework::OperatorWithKernel::PrepareData(paddle::framework::Scope const&, paddle::framework::OpKernelType const&, std::vector<std::string, std::allocator<std::string> >*, paddle::framework::RuntimeContext*) const
6 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
7 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
8 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
9 paddle::framework::NaiveExecutor::Run()
10 paddle::AnalysisPredictor::Run(std::vector<paddle::PaddleTensor, std::allocator<paddle::PaddleTensor> > const&, std::vector<paddle::PaddleTensor, std::allocator<paddle::PaddleTensor> >*, int)
------------------------------------------
Python Call Stacks (More useful to users):
------------------------------------------
File "/home/work/chenxuyi/dis/pp/fine/mergeWeb/app/lib/python3.6/site-packages/paddle/fluid/framework.py", line 2488, in append_op
attrs=kwargs.get("attrs", None))
File "/home/work/chenxuyi/dis/pp/fine/mergeWeb/app/lib/python3.6/site-packages/paddle/fluid/layer_helper.py", line 43, in append_op
return self.main_program.current_block().append_op(*args, **kwargs)
File "/home/work/chenxuyi/dis/pp/fine/mergeWeb/app/lib/python3.6/site-packages/paddle/fluid/layers/nn.py", line 9105, in squeeze
"XShape": x_shape})
File "/home/work/chenxuyi/gitlab/paddle-models/model/transformer_encoder.py", line 372, in encoder
pad_idx = L.where(L.cast(L.squeeze(input_mask, axes=[2]), 'bool'))
File "/home/work/chenxuyi/gitlab/paddle-models/model/ernie.py", line 187, in _build_model
name='encoder')
File "/home/work/chenxuyi/gitlab/paddle-models/model/ernie.py", line 124, in __init__
input_mask)
File "./ernie/xnli.py", line 57, in forward
use_fp16=self.hparam['use_fp16']
File "/home/work/chenxuyi/dis/pp/fine/mergeWeb/paddle-estimator/propeller/paddle/train/trainer.py", line 147, in _model_fn
pred = model.forward(fea)
File "/home/work/chenxuyi/dis/pp/fine/mergeWeb/paddle-estimator/propeller/paddle/train/trainer.py", line 83, in _build_net
features=features, mode=mode, params=params, run_config=run_config)
File "/home/work/chenxuyi/dis/pp/fine/mergeWeb/paddle-estimator/propeller/paddle/train/trainer.py", line 230, in _build_for_eval
self.params, self.run_config)
File "/home/work/chenxuyi/dis/pp/fine/mergeWeb/paddle-estimator/propeller/paddle/train/trainer.py", line 482, in __init__
0]) #eval_datasets must have same output shapes
File "/home/work/chenxuyi/dis/pp/fine/mergeWeb/paddle-estimator/propeller/paddle/train/trainer.py", line 530, in train_and_eval
train_hooks.append(_EvalHookOnTrainLoop())
File "./ernie/xnli.py", line 221, in <module>
exporters=[best_exporter])
----------------------
Error Message Summary:
----------------------
FatalError: cudaStreamSynchronize raises error: unspecified launch failure, errono: 4: unspecified launch failure at (/work/paddle/fluid/platform/device_context.cc:330)
[operator < squeeze2 > error]
./gpu.sh: line 13: 63015 Aborted (core dumped) ./build/inference --logtostderr --model_dir $2 --data $1 --repeat 1 --output_prediction true --use_gpu true --device 0
An excerpt of the network-building code is pasted below:
d_shape = L.shape(L.cast(enc_input, 'float32'))
input_hidden_dim = enc_input.shape[-1]
pad_idx = L.where(L.cast(L.squeeze(input_mask, axes=[2]), 'bool')) #!!!!!!!!!!!!!
attn_bias = L.matmul(input_mask, input_mask, transpose_y=True)
attn_bias = (1. - attn_bias) * -10000.
attn_bias = L.unsqueeze(attn_bias, axes=[1])
attn_bias = L.expand(attn_bias, [1, n_head, 1, 1])
if attn_bias.dtype != enc_input.dtype:
    attn_bias = L.cast(attn_bias, enc_input.dtype)
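
For anyone reproducing this without the model, here is a rough pure-Python sketch of what the marked line and the attn_bias lines compute, shape-wise. It assumes input_mask has shape [batch, seq_len, 1] with 1.0 for real tokens and 0.0 for padding (an assumption based on the `axes=[2]` squeeze and the matmul with `transpose_y=True`). Nested lists stand in for tensors, and the helper names `squeeze_last_axis`, `where_true`, and `attention_bias` are made up for illustration; they are not Paddle APIs.

```python
# Pure-Python illustration of the tensor shapes in the excerpt above.
# Nested lists stand in for Paddle tensors; all helper names are hypothetical.


def squeeze_last_axis(mask):
    """[batch, seq_len, 1] -> [batch, seq_len], mirroring L.squeeze(mask, axes=[2])."""
    for row in mask:
        for cell in row:
            if len(cell) != 1:
                raise ValueError("axis 2 must have size 1 to be squeezed")
    return [[cell[0] for cell in row] for row in mask]


def where_true(mask2d):
    """Coordinates [batch_idx, position] of non-zero entries,
    mirroring L.where(L.cast(x, 'bool'))."""
    return [
        [b, t]
        for b, row in enumerate(mask2d)
        for t, value in enumerate(row)
        if bool(value)
    ]


def attention_bias(mask):
    """Mirror matmul(mask, mask, transpose_y=True) followed by (1 - x) * -10000.

    With a trailing axis of size 1 the matmul is an outer product per batch
    item, so the result has shape [batch, seq_len, seq_len].
    """
    bias = []
    for row in mask:
        flat = [cell[0] for cell in row]
        bias.append([[(1.0 - a * b) * -10000.0 for b in flat] for a in flat])
    return bias


# Two sequences padded to length 3: 1.0 marks a real token, 0.0 marks padding.
input_mask = [
    [[1.0], [1.0], [0.0]],
    [[1.0], [0.0], [0.0]],
]

pad_idx = where_true(squeeze_last_axis(input_mask))
attn_bias = attention_bias(input_mask)
print(pad_idx)
```

Under this assumption, pad_idx holds the coordinates of the entries where the mask is non-zero, and attn_bias is zero for pairs of real tokens and -10000 wherever either position is padding.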