reshape in the multi_box_head layer randomly fails in ValidateShape (#16235) · Issue · PaddlePaddle / Paddle

Created by: wangguibao

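Version and environment information:
1) PaddlePaddle version: paddle-GPU-py2.7post85-1.3.0
2) CPU:
3) GPU: Nvidia K40m, CUDA 9.0; reproducible with both cuDNN 5.0 and cuDNN 7.0
4) System environment: Linux 3.10.0_3-0-0-15, Python 2.7.13

Training information:
1) Single machine, multiple GPUs
2) GPU memory: 11439 MB per card
3) Failing op: Reshape2

Reproduction information:
Reproduction environment: a virtual machine inside the company; contact @wangguibao for the account and password.
Reproduction steps: run the user's training script. The failure usually reproduces after a number of iterations.

Problem description:
Error log: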

Traceback (most recent call last):
  File "train_v2.py", line 246, in <module>
    pretrained_model=args.pretrained_model)
  File "train_v2.py", line 191, in parallel_exe
    feed=feeder.feed(data))
  File "/home/vis/zhongdonghong/app/python-build/lib/python2.7/site-packages/paddle/fluid/parallel_executor.py", line 297, in run
    self.executor.run(fetch_list, fetch_var_name)
paddle.fluid.core.EnforceNotMet: Enforce failed. Expected output_shape[unk_dim_idx] * capacity == -in_size, but received output_shape[unk_dim_idx] * capacity:0 != -in_size:-960.
Invalid shape is given. at [/home/dongdaxiang/paddle-fork/Paddle/paddle/fluid/operators/reshape_op.cc:104]
PaddlePaddle Call Stacks:
0       0x7f6385f3e76dp void paddle::platform::EnforceNotMet::Init<std::string>(std::string, char const*, int) + 365
1       0x7f6385f3eab7p paddle::platform::EnforceNotMet::EnforceNotMet(std::string const&, char const*, int) + 87
2       0x7f63863f1c02p paddle::operators::ReshapeOp::ValidateShape(std::vector<int, std::allocator<int> >, paddle::framework::DDim const&) + 2226
3       0x7f63863f24ecp paddle::operators::ReshapeKernel::operator()(paddle::framework::ExecutionContext const&) const + 1180
4       0x7f63879c217cp paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) const + 844
5       0x7f63879c0384p paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) + 340
6       0x7f638784034ap paddle::framework::details::ComputationOpHandle::RunImpl() + 250
7       0x7f6387839a26p paddle::framework::details::OpHandleBase::Run(bool) + 118
8       0x7f63877d442dp
9       0x7f6386b90663p std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, void> >::_M_invoke(std::_Any_data const&) + 35
10      0x7f6386b56f17p std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>&, bool&) + 39
11        0x38c040cbe0p pthread_once + 80
12      0x7f63877d3142p
13      0x7f6386b58334p ThreadPool::ThreadPool(unsigned long)::{lambda()#1}::operator()() const + 404
14      0x7f63be8522d0p
15        0x38c0407df3p
16        0x38bfcf62cdp clone + 109
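
To make the enforce message above easier to read, here is a minimal Python sketch of the kind of shape inference that message implies. It is not the code from reshape_op.cc: the names `trunc_div` and `infer_reshape` are hypothetical, and the rules (0 copies the input dimension, one -1 is inferred, division truncates as in C++) are assumptions reconstructed from the message text and the documented reshape semantics.

```python
def trunc_div(a, b):
    """Integer division that truncates toward zero, as C++ integer division does."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def infer_reshape(in_dims, shape):
    """Resolve a reshape target: 0 copies the input dim, a single -1 is inferred."""
    in_size = 1
    for d in in_dims:
        in_size *= d

    out = []
    capacity = 1  # product of the target dims, including the -1 placeholder
    unk_idx = None
    for i, s in enumerate(shape):
        if s == -1:
            if unk_idx is not None:
                raise ValueError("only one dimension may be -1")
            unk_idx = i
            dim = -1
        elif s == 0:
            if i >= len(in_dims):
                raise ValueError("0 at index %d has no input dim to copy" % i)
            dim = in_dims[i]
        elif s > 0:
            dim = s
        else:
            raise ValueError("dimension must be -1, 0 or positive")
        capacity *= dim
        out.append(dim)

    if unk_idx is None:
        if capacity != in_size:
            raise ValueError("Invalid shape is given: %d != %d" % (capacity, in_size))
        return out
    if capacity == 0:
        raise ValueError("cannot infer -1 when another dimension is 0-sized")

    # capacity is negative here because it still contains the -1 placeholder.
    out[unk_idx] = trunc_div(-in_size, capacity)
    product = out[unk_idx] * capacity
    if product != -in_size:
        raise ValueError("Invalid shape is given: %d != %d" % (product, -in_size))
    return out


# A consistent case: 2 * 5 * 5 * 24 = 1200 elements regrouped into rows of 4.
print(infer_reshape([2, 5, 5, 24], [0, -1, 4]))  # [2, 150, 4]

# Hypothetical numbers chosen only to reproduce the arithmetic in the log:
# 960 input elements against a target whose fixed dims multiply to more than 960.
try:
    infer_reshape([1, 960], [-1, 1024])
except ValueError as err:
    print(err)
```

Under these assumptions, a left-hand side of 0 together with in_size 960 means the inferred dimension was truncated to 0, which happens when the fixed target dimensions multiply to more than the input's element count. The sketch does not say why the Reshape2 op in this report would see such a mismatch only on some iterations.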

Reproducible code snippet: the user's network is MobileNetV2 and uses the multi_box_head layer. The relevant part of the network configuration is shown below:

def mobile_net(num_classes, img, img_shape, scale=1.0):
    tmp = conv_bn(img, 3, int(32 * scale), 2, 1, 3)
    # 150x150 64x64
    tmp = depthwise_separable(tmp, 32, 64, 32, 1, scale)
    tmp = depthwise_separable(tmp, 64, 128, 64, 2, scale)
    # 75x75 32x32
    tmp = depthwise_separable(tmp, 128, 128, 128, 1, scale)
    tmp = depthwise_separable(tmp, 128, 256, 128, 2, scale)
    # 38x38 16x16
    module10 = depthwise_separable(tmp, 256, 256, 256, 1, scale)
    tmp = depthwise_separable(tmp, 256, 512, 256, 2, scale)

    # 19x19 8x8
    for i in range(5):
        tmp = depthwise_separable(tmp, 512, 512, 512, 1, scale)
    module11 = tmp #19x19
    tmp = depthwise_separable(tmp, 512, 1024, 512, 2, scale)

    # 10x10 4x4
    module13 = depthwise_separable(tmp, 1024, 1024, 1024, 1, scale)#10x10
    module14 = extra_block(module13, 256, 512, 1, 2, scale)#5x5
    # 5x5 2x2
    module15 = extra_block(module14, 128, 256, 1, 2, scale)#3x3
    # 3x3 1x1
    module16 = extra_block(module15, 128, 256, 1, 2, scale)#2x2
#    # 2x2 1x1

    mbox_locs, mbox_confs, box, box_var = fluid.layers.multi_box_head(
        inputs=[module14, module15, module16],
        image=img,
        num_classes=num_classes,
        min_ratio=80,
        max_ratio=115,
        min_sizes=[80.0, 100.0, 115.0],
        max_sizes=[[], 115.0, 128.0],
        aspect_ratios=[[2., 3.], [2., 3.], [2., 3.]],
        base_size=img_shape[2],
        offset=0.5,
        flip=True)

    return mbox_locs, mbox_confs, box, box_var
