1. 31 8月, 2017 3 次提交
    • V
      block/nbd-client: rename nbd_recv_coroutines_enter_all · 07b1b99c
      Vladimir Sementsov-Ogievskiy 提交于
      Rename nbd_recv_coroutines_enter_all to nbd_recv_coroutines_wake_all,
      as it most probably just adds all recv coroutines into co_queue_wakeup,
      rather than directly enter them.
      Signed-off-by: NVladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
      Message-Id: <20170804151440.320927-9-vsementsov@virtuozzo.com>
      [eblake: tweak commit message]
      Signed-off-by: NEric Blake <eblake@redhat.com>
      07b1b99c
    • V
      block/nbd-client: get rid of ssize_t · 6faa0777
      Vladimir Sementsov-Ogievskiy 提交于
      Use int variable for nbd_co_send_request return value (as
      nbd_co_send_request returns int).
      Signed-off-by: NVladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
      Message-Id: <20170804151440.320927-6-vsementsov@virtuozzo.com>
      Signed-off-by: NEric Blake <eblake@redhat.com>
      6faa0777
    • S
      nbd-client: avoid read_reply_co entry if send failed · 3c2d5183
      Stefan Hajnoczi 提交于
      The following segfault is encountered if the NBD server closes the UNIX
      domain socket immediately after negotiation:
      
        Program terminated with signal SIGSEGV, Segmentation fault.
        #0  aio_co_schedule (ctx=0x0, co=0xd3c0ff2ef0) at util/async.c:441
        441       QSLIST_INSERT_HEAD_ATOMIC(&ctx->scheduled_coroutines,
        (gdb) bt
        #0  0x000000d3c01a50f8 in aio_co_schedule (ctx=0x0, co=0xd3c0ff2ef0) at util/async.c:441
        #1  0x000000d3c012fa90 in nbd_coroutine_end (bs=bs@entry=0xd3c0fec650, request=<optimized out>) at block/nbd-client.c:207
        #2  0x000000d3c012fb58 in nbd_client_co_preadv (bs=0xd3c0fec650, offset=0, bytes=<optimized out>, qiov=0x7ffc10a91b20, flags=0) at block/nbd-client.c:237
        #3  0x000000d3c0128e63 in bdrv_driver_preadv (bs=bs@entry=0xd3c0fec650, offset=offset@entry=0, bytes=bytes@entry=512, qiov=qiov@entry=0x7ffc10a91b20, flags=0) at block/io.c:836
        #4  0x000000d3c012c3e0 in bdrv_aligned_preadv (child=child@entry=0xd3c0ff51d0, req=req@entry=0x7f31885d6e90, offset=offset@entry=0, bytes=bytes@entry=512, align=align@entry=1, qiov=qiov@entry=0x7ffc10a91b20, f
      +lags=0) at block/io.c:1086
        #5  0x000000d3c012c6b8 in bdrv_co_preadv (child=0xd3c0ff51d0, offset=offset@entry=0, bytes=bytes@entry=512, qiov=qiov@entry=0x7ffc10a91b20, flags=flags@entry=0) at block/io.c:1182
        #6  0x000000d3c011cc17 in blk_co_preadv (blk=0xd3c0ff4f80, offset=0, bytes=512, qiov=0x7ffc10a91b20, flags=0) at block/block-backend.c:1032
        #7  0x000000d3c011ccec in blk_read_entry (opaque=0x7ffc10a91b40) at block/block-backend.c:1079
        #8  0x000000d3c01bbb96 in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at util/coroutine-ucontext.c:79
        #9  0x00007f3196cb8600 in __start_context () at /lib64/libc.so.6
      
      The problem is that nbd_client_init() uses
      nbd_client_attach_aio_context() -> aio_co_schedule(new_context,
      client->read_reply_co).  Execution of read_reply_co is deferred to a BH
      which doesn't run until later.
      
      In the mean time blk_co_preadv() can be called and nbd_coroutine_end()
      calls aio_wake() on read_reply_co.  At this point in time
      read_reply_co's ctx isn't set because it has never been entered yet.
      
      This patch simplifies the nbd_co_send_request() ->
      nbd_co_receive_reply() -> nbd_coroutine_end() lifecycle to just
      nbd_co_send_request() -> nbd_co_receive_reply().  The request is "ended"
      if an error occurs at any point.  Callers no longer have to invoke
      nbd_coroutine_end().
      
      This cleanup also eliminates the segfault because we don't call
      aio_co_schedule() to wake up s->read_reply_co if sending the request
      failed.  It is only necessary to wake up s->read_reply_co if a reply was
      received.
      
      Note this only happens with UNIX domain sockets on Linux.  It doesn't
      seem possible to reproduce this with TCP sockets.
      Suggested-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NStefan Hajnoczi <stefanha@redhat.com>
      Message-Id: <20170829122745.14309-2-stefanha@redhat.com>
      Signed-off-by: NEric Blake <eblake@redhat.com>
      3c2d5183
  2. 24 8月, 2017 1 次提交
  3. 23 8月, 2017 1 次提交
  4. 15 8月, 2017 1 次提交
    • E
      nbd-client: Fix regression when server sends garbage · 72b6ffc7
      Eric Blake 提交于
      When we switched NBD to use coroutines for qemu 2.9 (in particular,
      commit a12a712a), we introduced a regression: if a server sends us
      garbage (such as a corrupted magic number), we quit the read loop
      but do not stop sending further queued commands, resulting in the
      client hanging when it never reads the response to those additional
      commands.  In qemu 2.8, we properly detected that the server is no
      longer reliable, and cancelled all existing pending commands with
      EIO, then tore down the socket so that all further command attempts
      get EPIPE.
      
      Restore the proper behavior of quitting (almost) all communication
      with a broken server: Once we know we are out of sync or otherwise
      can't trust the server, we must assume that any further incoming
      data is unreliable and therefore end all pending commands with EIO,
      and quit trying to send any further commands.  As an exception, we
      still (try to) send NBD_CMD_DISC to let the server know we are going
      away (in part, because it is easier to do that than to further
      refactor nbd_teardown_connection, and in part because it is the
      only command where we do not have to wait for a reply).
      
      Based on a patch by Vladimir Sementsov-Ogievskiy.
      
      A malicious server can be created with the following hack,
      followed by setting NBD_SERVER_DEBUG to a non-zero value in the
      environment when running qemu-nbd:
      
      | --- a/nbd/server.c
      | +++ b/nbd/server.c
      | @@ -919,6 +919,17 @@ static int nbd_send_reply(QIOChannel *ioc, NBDReply *reply, Error **errp)
      |      stl_be_p(buf + 4, reply->error);
      |      stq_be_p(buf + 8, reply->handle);
      |
      | +    static int debug;
      | +    static int count;
      | +    if (!count++) {
      | +        const char *str = getenv("NBD_SERVER_DEBUG");
      | +        if (str) {
      | +            debug = atoi(str);
      | +        }
      | +    }
      | +    if (debug && !(count % debug)) {
      | +        buf[0] = 0;
      | +    }
      |      return nbd_write(ioc, buf, sizeof(buf), errp);
      |  }
      Reported-by: NVladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
      Signed-off-by: NEric Blake <eblake@redhat.com>
      Message-Id: <20170814213426.24681-1-eblake@redhat.com>
      Reviewed-by: NStefan Hajnoczi <stefanha@redhat.com>
      72b6ffc7
  5. 14 7月, 2017 2 次提交
    • E
      nbd: Implement NBD_INFO_BLOCK_SIZE on client · 081dd1fe
      Eric Blake 提交于
      The upstream NBD Protocol has defined a new extension to allow
      the server to advertise block sizes to the client, as well as
      a way for the client to inform the server whether it intends to
      obey block sizes.
      
      When using the block layer as the client, we will obey block
      sizes; but when used as 'qemu-nbd -c' to hand off to the
      kernel nbd module as the client, we are still waiting for the
      kernel to implement a way for us to learn if it will honor
      block sizes (perhaps by an addition to sysfs, rather than an
      ioctl), as well as any way to tell the kernel what additional
      block sizes to obey (NBD_SET_BLKSIZE appears to be accurate
      for the minimum size, but preferred and maximum sizes would
      probably be new ioctl()s), so until then, we need to make our
      request for block sizes conditional.
      
      When using ioctl(NBD_SET_BLKSIZE) to hand off to the kernel,
      use the minimum block size as the sector size if it is larger
      than 512, which also has the nice effect of cooperating with
      (non-qemu) servers that don't do read-modify-write when
      exposing a block device with 4k sectors; it might also allow
      us to visit a file larger than 2T on a 32-bit kernel.
      Signed-off-by: NEric Blake <eblake@redhat.com>
      Message-Id: <20170707203049.534-10-eblake@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      081dd1fe
    • E
      nbd: Create struct for tracking export info · 004a89fc
      Eric Blake 提交于
      The NBD Protocol is introducing some additional information
      about exports, such as minimum request size and alignment, as
      well as an advertised maximum request size.  It will be easier
      to feed this information back to the block layer if we gather
      all the information into a struct, rather than adding yet more
      pointer parameters during negotiation.
      Signed-off-by: NEric Blake <eblake@redhat.com>
      Message-Id: <20170707203049.534-2-eblake@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      004a89fc
  6. 04 7月, 2017 1 次提交
  7. 26 6月, 2017 1 次提交
  8. 15 6月, 2017 1 次提交
    • V
      nbd: rename read_sync and friends · d1fdf257
      Vladimir Sementsov-Ogievskiy 提交于
      Rename
        nbd_wr_syncv -> nbd_rwv
        read_sync -> nbd_read
        read_sync_eof -> nbd_read_eof
        write_sync -> nbd_write
        drop_sync -> nbd_drop
      
      1. nbd_ prefix
         read_sync and write_sync are already shared, so it is good to have a
         namespace prefix. drop_sync will be shared, and read_sync_eof is
         related to read_sync, so let's rename them all.
      
      2. _sync suffix
         _sync is related to the fact that nbd_wr_syncv doesn't return if a
         write to socket returns EAGAIN. The first implementation of
         nbd_wr_syncv (was wr_sync in 7a5ca864) just loops while getting
         EAGAIN, the current implementation yields in this case.
         Why we want to get rid of it:
         - it is normal for r/w functions to be synchronous, so having an
           additional suffix for it looks redundant (contrariwise, we have
           _aio suffix for async functions)
         - _sync suffix in block layer is used when function does flush (so
           using it for other thing is confusing a bit)
         - keep function names short after adding nbd_ prefix
      
      3. for nbd_wr_syncv let's use more common notation 'rw'
      Reviewed-by: NEric Blake <eblake@redhat.com>
      Signed-off-by: NVladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
      Message-Id: <20170602150150.258222-2-vsementsov@virtuozzo.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d1fdf257
  9. 08 6月, 2017 1 次提交
    • P
      nbd: make it thread-safe, fix qcow2 over nbd · 6bdcc018
      Paolo Bonzini 提交于
      NBD is not thread safe, because it accesses s->in_flight without
      a CoMutex.  Fixing this will be required for multiqueue.
      CoQueue doesn't have spurious wakeups but, when another coroutine can
      run between qemu_co_queue_next's wakeup and qemu_co_queue_wait's
      re-locking of the mutex, the wait condition can become false and
      a loop is necessary.
      
      In fact, it turns out that the loop is necessary even without this
      multi-threaded scenario.  A particular sequence of coroutine wakeups
      is happening ~80% of the time when starting a guest with qcow2 image
      served over NBD (i.e. qemu-nbd --format=raw, and QEMU's -drive option
      has -format=qcow2).  This patch fixes that issue too.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      6bdcc018
  10. 07 6月, 2017 2 次提交
  11. 27 3月, 2017 1 次提交
    • P
      nbd-client: fix handling of hungup connections · a12a712a
      Paolo Bonzini 提交于
      After the switch to reading replies in a coroutine, nothing is
      reentering pending receive coroutines if the connection hangs.
      Move nbd_recv_coroutines_enter_all to the reply read coroutine,
      which is the place where hangups are detected.  nbd_teardown_connection
      can simply wait for the reply read coroutine to detect the hangup
      and clean up after itself.
      
      This wouldn't be enough though because nbd_receive_reply returns 0
      (rather than -EPIPE or similar) when reading from a hung connection.
      Fix the return value check in nbd_read_reply_entry.
      
      This fixes qemu-iotests 083.
      Reported-by: NMax Reitz <mreitz@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Message-id: 20170314111157.14464-1-pbonzini@redhat.com
      Reviewed-by: NMax Reitz <mreitz@redhat.com>
      Signed-off-by: NMax Reitz <mreitz@redhat.com>
      a12a712a
  12. 21 2月, 2017 2 次提交
  13. 04 1月, 2017 1 次提交
  14. 23 11月, 2016 1 次提交
    • E
      nbd: Allow unmap and fua during write zeroes · 169407e1
      Eric Blake 提交于
      Commit fa778fff wired up support to send the NBD_CMD_WRITE_ZEROES,
      but forgot to inform the block layer that FUA unmapping of zeroes is
      supported.  Without BDRV_REQ_MAY_UNMAP listed as a supported flag,
      the block layer will always insist on the NBD layer passing
      NBD_CMD_FLAG_NO_HOLE, resulting in the server always allocating
      things even when it was desired to let the server punch holes.
      Similarly, failing to set BDRV_REQ_FUA means that the client may
      send unnecessary NBD_CMD_FLUSH when it could have instead used the
      NBD_CMD_FLAG_FUA bit.
      
      CC: qemu-stable@nongnu.org
      Signed-off-by: NEric Blake <eblake@redhat.com>
      Message-Id: <1479413642-22463-2-git-send-email-eblake@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      169407e1
  15. 02 11月, 2016 4 次提交
  16. 01 11月, 2016 1 次提交
    • C
      nbd: Use CoQueue for free_sema instead of CoMutex · 9bc9732f
      Changlong Xie 提交于
      NBD is using the CoMutex in a way that wasn't anticipated. For example, if there are
      N(N=26, MAX_NBD_REQUESTS=16) nbd write requests, so we will invoke nbd_client_co_pwritev
      N times.
      ----------------------------------------------------------------------------------------
      time request Actions
      1    1       in_flight=1, Coroutine=C1
      2    2       in_flight=2, Coroutine=C2
      ...
      15   15      in_flight=15, Coroutine=C15
      16   16      in_flight=16, Coroutine=C16, free_sema->holder=C16, mutex->locked=true
      17   17      in_flight=16, Coroutine=C17, queue C17 into free_sema->queue
      18   18      in_flight=16, Coroutine=C18, queue C18 into free_sema->queue
      ...
      26   N       in_flight=16, Coroutine=C26, queue C26 into free_sema->queue
      ----------------------------------------------------------------------------------------
      
      Once nbd client recieves request No.16' reply, we will re-enter C16. It's ok, because
      it's equal to 'free_sema->holder'.
      ----------------------------------------------------------------------------------------
      time request Actions
      27   16      in_flight=15, Coroutine=C16, free_sema->holder=C16, mutex->locked=false
      ----------------------------------------------------------------------------------------
      
      Then nbd_coroutine_end invokes qemu_co_mutex_unlock what will pop coroutines from
      free_sema->queue's head and enter C17. More free_sema->holder is C17 now.
      ----------------------------------------------------------------------------------------
      time request Actions
      28   17      in_flight=16, Coroutine=C17, free_sema->holder=C17, mutex->locked=true
      ----------------------------------------------------------------------------------------
      
      In above scenario, we only recieves request No.16' reply. As time goes by, nbd client will
      almostly recieves replies from requests 1 to 15 rather than request 17 who owns C17. In this
      case, we will encounter assert "mutex->holder == self" failed since Kevin's commit 0e438cdc
      "coroutine: Let CoMutex remember who holds it". For example, if nbd client recieves request
      No.15' reply, qemu will stop unexpectedly:
      ----------------------------------------------------------------------------------------
      time request       Actions
      29   15(most case) in_flight=15, Coroutine=C15, free_sema->holder=C17, mutex->locked=false
      ----------------------------------------------------------------------------------------
      
      Per Paolo's suggestion "The simplest fix is to change it to CoQueue, which is like a condition
      variable", this patch replaces CoMutex with CoQueue.
      
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Reported-by: Nzhanghailiang <zhang.zhanghailiang@huawei.com>
      Suggested-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NChanglong Xie <xiecl.fnst@cn.fujitsu.com>
      Message-Id: <1476267508-19499-1-git-send-email-xiecl.fnst@cn.fujitsu.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      9bc9732f
  17. 20 7月, 2016 4 次提交
  18. 13 7月, 2016 1 次提交
    • P
      coroutine: move entry argument to qemu_coroutine_create · 0b8b8753
      Paolo Bonzini 提交于
      In practice the entry argument is always known at creation time, and
      it is confusing that sometimes qemu_coroutine_enter is used with a
      non-NULL argument to re-enter a coroutine (this happens in
      block/sheepdog.c and tests/test-coroutine.c).  So pass the opaque value
      at creation time, for consistency with e.g. aio_bh_new.
      
      Mostly done with the following semantic patch:
      
      @ entry1 @
      expression entry, arg, co;
      @@
      - co = qemu_coroutine_create(entry);
      + co = qemu_coroutine_create(entry, arg);
        ...
      - qemu_coroutine_enter(co, arg);
      + qemu_coroutine_enter(co);
      
      @ entry2 @
      expression entry, arg;
      identifier co;
      @@
      - Coroutine *co = qemu_coroutine_create(entry);
      + Coroutine *co = qemu_coroutine_create(entry, arg);
        ...
      - qemu_coroutine_enter(co, arg);
      + qemu_coroutine_enter(co);
      
      @ entry3 @
      expression entry, arg;
      @@
      - qemu_coroutine_enter(qemu_coroutine_create(entry), arg);
      + qemu_coroutine_enter(qemu_coroutine_create(entry, arg));
      
      @ reentry @
      expression co;
      @@
      - qemu_coroutine_enter(co, NULL);
      + qemu_coroutine_enter(co);
      
      except for the aforementioned few places where the semantic patch
      stumbled (as expected) and for test_co_queue, which would otherwise
      produce an uninitialized variable warning.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: NFam Zheng <famz@redhat.com>
      Signed-off-by: NKevin Wolf <kwolf@redhat.com>
      0b8b8753
  19. 05 7月, 2016 1 次提交
  20. 12 5月, 2016 2 次提交
    • E
      nbd: Simplify client FUA handling · 52a46505
      Eric Blake 提交于
      Now that the block layer honors per-bds FUA support, we don't
      have to duplicate the fallback flush at the NBD layer.  The
      static function nbd_co_writev_flags() is no longer needed, and
      the driver can just directly use nbd_client_co_writev().
      Signed-off-by: NEric Blake <eblake@redhat.com>
      Reviewed-by: NFam Zheng <famz@redhat.com>
      Acked-by: NStefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: NKevin Wolf <kwolf@redhat.com>
      52a46505
    • E
      block: Make supported_write_flags a per-bds property · 4df863f3
      Eric Blake 提交于
      Pre-patch, .supported_write_flags lives at the driver level, which
      means we are blindly declaring that all block devices using a
      given driver will either equally support FUA, or that we need a
      fallback at the block layer.  But there are drivers where FUA
      support is a per-block decision: the NBD block driver is dependent
      on the remote server advertising NBD_FLAG_SEND_FUA (and has
      fallback code to duplicate the flush that the block layer would do
      if NBD had not set .supported_write_flags); and the iscsi block
      driver is dependent on the mode sense bits advertised by the
      underlying device (and is currently silently ignoring FUA requests
      if the underlying device does not support FUA).
      
      The fix is to make supported flags as a per-BDS option, set during
      .bdrv_open().  This patch moves the variable and fixes NBD and iscsi
      to set it only conditionally; later patches will then further
      simplify the NBD driver to quit duplicating work done at the block
      layer, as well as tackle the fact that SCSI does not support FUA
      semantics on WRITESAME(10/16) but only on WRITE(10/16).
      Signed-off-by: NEric Blake <eblake@redhat.com>
      Reviewed-by: NFam Zheng <famz@redhat.com>
      Acked-by: NStefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: NKevin Wolf <kwolf@redhat.com>
      4df863f3
  21. 05 4月, 2016 1 次提交
    • E
      nbd: don't request FUA on FLUSH · a89ef0c3
      Eric Blake 提交于
      The NBD protocol does not clearly document what will happen
      if a client sends NBD_CMD_FLAG_FUA on NBD_CMD_FLUSH.
      Historically, both the qemu and upstream NBD servers silently
      ignored that flag, but that feels a bit risky.  Meanwhile, the
      qemu NBD client unconditionally sends the flag (without even
      bothering to check whether the caller cares; at least with
      NBD_CMD_WRITE the client only sends FUA if requested by a
      higher layer).
      
      There is ongoing discussion on the NBD list to fix the
      protocol documentation to require that the server MUST ignore
      the flag (unless the kernel folks can better explain what FUA
      means for a flush), but until those doc improvements land, the
      current nbd.git master was recently changed to reject the flag
      with EINVAL (see nbd commit ab22e082), which now makes it
      impossible for a qemu client to use FLUSH with an upstream NBD
      server.
      
      We should not send FUA with flush unless the upstream protocol
      documents what it will do, and even then, it should be something
      that the caller can opt into, rather than being unconditional.
      Signed-off-by: NEric Blake <eblake@redhat.com>
      Message-Id: <1459526902-32561-1-git-send-email-eblake@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a89ef0c3
  22. 30 3月, 2016 1 次提交
    • K
      nbd: Support BDRV_REQ_FUA · 2b556518
      Kevin Wolf 提交于
      The NBD server already used to send a FUA flag when the writethrough
      mode was set. This code was a remnant from the times where protocol
      drivers actually had to implement writethrough modes. Since nowadays the
      block layer sends flushes in writethrough mode and non-root nodes are
      always writeback, this was mostly dead code - only mostly because if NBD
      was configured to be used without a format, we sent _both_ FUA and an
      explicit flush afterwards, which makes the code not technically dead,
      but useless overhead.
      
      This patch changes the code so that the block layer's FUA flag is
      recognised and translated into a NBD FUA flag. The additional flush is
      avoided now.
      Signed-off-by: NKevin Wolf <kwolf@redhat.com>
      Reviewed-by: NMax Reitz <mreitz@redhat.com>
      2b556518
  23. 17 2月, 2016 4 次提交
  24. 20 1月, 2016 1 次提交
  25. 24 10月, 2015 1 次提交