    nbd: fix the co_queue multi-adding bug · 141cabe6
    Bin Wu committed
    When we tested VM migration between different hosts with NBD
    devices, we found that if we sent a cancel command just after the
    drive_mirror was started, a coroutine re-enter error would occur.
    The stack was as follows:
    
    (gdb) bt
    00)  0x00007fdfc744d885 in raise () from /lib64/libc.so.6
    01)  0x00007fdfc744ee61 in abort () from /lib64/libc.so.6
    02)  0x00007fdfca467cc5 in qemu_coroutine_enter (co=0x7fdfcaedb400, opaque=0x0)
    at qemu-coroutine.c:118
    03)  0x00007fdfca467f6c in qemu_co_queue_run_restart (co=0x7fdfcaedb400) at
    qemu-coroutine-lock.c:59
    04)  0x00007fdfca467be5 in coroutine_swap (from=0x7fdfcaf3c4e8,
    to=0x7fdfcaedb400) at qemu-coroutine.c:96
    05)  0x00007fdfca467cea in qemu_coroutine_enter (co=0x7fdfcaedb400, opaque=0x0)
    at qemu-coroutine.c:123
    06)  0x00007fdfca467f6c in qemu_co_queue_run_restart (co=0x7fdfcaedbdc0) at
    qemu-coroutine-lock.c:59
    07)  0x00007fdfca467be5 in coroutine_swap (from=0x7fdfcaf3c4e8,
    to=0x7fdfcaedbdc0) at qemu-coroutine.c:96
    08)  0x00007fdfca467cea in qemu_coroutine_enter (co=0x7fdfcaedbdc0, opaque=0x0)
    at qemu-coroutine.c:123
    09)  0x00007fdfca4a1fa4 in nbd_recv_coroutines_enter_all (s=0x7fdfcaef7dd0) at
    block/nbd-client.c:41
    10) 0x00007fdfca4a1ff9 in nbd_teardown_connection (client=0x7fdfcaef7dd0) at
    block/nbd-client.c:50
    11) 0x00007fdfca4a20f0 in nbd_reply_ready (opaque=0x7fdfcaef7dd0) at
    block/nbd-client.c:92
    12) 0x00007fdfca45ed80 in aio_dispatch (ctx=0x7fdfcae15e90) at aio-posix.c:144
    13) 0x00007fdfca45ef1b in aio_poll (ctx=0x7fdfcae15e90, blocking=false) at
    aio-posix.c:222
    14) 0x00007fdfca448c34 in aio_ctx_dispatch (source=0x7fdfcae15e90, callback=0x0,
    user_data=0x0) at async.c:212
    15) 0x00007fdfc8f2f69a in g_main_context_dispatch () from
    /usr/lib64/libglib-2.0.so.0
    16) 0x00007fdfca45c391 in glib_pollfds_poll () at main-loop.c:190
    17) 0x00007fdfca45c489 in os_host_main_loop_wait (timeout=1483677098) at
    main-loop.c:235
    18) 0x00007fdfca45c57b in main_loop_wait (nonblocking=0) at main-loop.c:484
    19) 0x00007fdfca25f403 in main_loop () at vl.c:2249
    20) 0x00007fdfca266fc2 in main (argc=42, argv=0x7ffff517d638,
    envp=0x7ffff517d790) at vl.c:4814
    
    We found that the nbd_recv_coroutines_enter_all function (triggered by a
    cancel command or by the network connection breaking down) enters a
    coroutine that is waiting for the send lock. If the lock is still held by
    another coroutine, the entered coroutine is added to the co_queue again.
    Later, when the lock is released, a coroutine re-enter error occurs.
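    To make the ordering concrete, the sketch below models the pre-patch
    situation. It is a simplified illustration only: the helper names ending
    in _sketch and the NbdClientSession fields shown here (recv_coroutine[],
    send_mutex, MAX_NBD_REQUESTS) are assumptions inferred from the
    description above, while nbd_recv_coroutines_enter_all,
    qemu_co_mutex_lock, qemu_coroutine_self and qemu_coroutine_enter are the
    real QEMU identifiers visible in the backtrace. It is not the verbatim
    block/nbd-client.c code.

        /* Pre-patch ordering (sketch): the request coroutine publishes
         * itself in recv_coroutine[] before it owns the send lock.
         */
        static int nbd_co_send_request_sketch(NbdClientSession *s,
                                              struct nbd_request *request)
        {
            int i;

            /* 1. Advertise ourselves as a receiver... */
            for (i = 0; i < MAX_NBD_REQUESTS; i++) {
                if (s->recv_coroutine[i] == NULL) {
                    s->recv_coroutine[i] = qemu_coroutine_self();
                    break;
                }
            }

            /* 2. ...and only then wait for the send lock.  While blocked
             * here, the coroutine sits in send_mutex's wait queue.
             */
            qemu_co_mutex_lock(&s->send_mutex);

            /* send the request, then unlock send_mutex (elided) */
            return 0;
        }

        /* On cancel or connection loss (frames 09-11 above) every
         * advertised receiver is entered.  If one of them is still waiting
         * for send_mutex, it ends up in the lock's co_queue a second time;
         * when the lock is finally released it is entered again and
         * qemu_coroutine_enter() aborts (frames 02-08 above).
         */
        static void nbd_recv_coroutines_enter_all_sketch(NbdClientSession *s)
        {
            int i;

            for (i = 0; i < MAX_NBD_REQUESTS; i++) {
                if (s->recv_coroutine[i]) {
                    qemu_coroutine_enter(s->recv_coroutine[i], NULL);
                }
            }
        }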
    
    This bug can be fixed simply by delaying the setting of recv_coroutine,
    as suggested by Paolo. After applying this patch, we tested the cancel
    operation during the mirror phase in a loop for more than 5 hours and
    everything was fine. Without this patch, a coroutine re-enter error
    occurs within 5 minutes.
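    The fix described above amounts to recording the coroutine in
    recv_coroutine[] only after the send lock has been acquired, so a
    coroutine that is still blocked on send_mutex is never visible to
    nbd_recv_coroutines_enter_all. A hedged sketch of that shape, under the
    same assumptions as the previous snippet (not the verbatim patch):

        /* Post-patch ordering (sketch): take the send lock first, then
         * record the coroutine as a receiver.  A coroutine parked in
         * send_mutex's wait queue never appears in recv_coroutine[], so
         * it cannot be entered while it is already queued for a wakeup.
         */
        static int nbd_co_send_request_sketch(NbdClientSession *s,
                                              struct nbd_request *request)
        {
            int i;

            qemu_co_mutex_lock(&s->send_mutex);       /* wait for the lock first */

            for (i = 0; i < MAX_NBD_REQUESTS; i++) {  /* ...then publish the slot */
                if (s->recv_coroutine[i] == NULL) {
                    s->recv_coroutine[i] = qemu_coroutine_self();
                    break;
                }
            }

            /* send the request, then unlock send_mutex (elided) */
            return 0;
        }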
    Signed-off-by: Bin Wu <wu.wubin@huawei.com>
    Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    Message-id: 1423552846-3896-1-git-send-email-wu.wubin@huawei.com
    Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>