  1. 13 Oct 2017, 2 commits
  2. 06 Oct 2017, 5 commits
    • block: Perform copy-on-read in loop · cb2e2878
      Committed by Eric Blake
      Improve our braindead copy-on-read implementation.  Pre-patch,
      we have multiple issues:
      - we create a bounce buffer and perform a write for the entire
      request, even if the active image already has 99% of the
      clusters occupied, and really only needs to copy-on-read the
      remaining 1% of the clusters
      - our bounce buffer is as large as the read request, and can
      needlessly exhaust memory by doubling the footprint of the
      request (the original request plus our bounce buffer),
      rather than adding a capped maximum overhead beyond the original
      - if a driver has a max_transfer limit, we are bypassing the
      normal code in bdrv_aligned_preadv() that fragments to that
      limit, and instead attempt to read the entire buffer from the
      driver in one go, which some drivers may assert on
      - a client can issue a request of nearly 2G such that
      rounding the request out to cluster boundaries results in a
      byte count larger than 2G.  While this cannot exceed 32 bits,
      it DOES have some follow-on problems:
      -- the call to bdrv_driver_pread() can assert for exceeding
      BDRV_REQUEST_MAX_BYTES, if the driver is old and lacks
      .bdrv_co_preadv
      -- if the buffer is all zeroes, the subsequent call to
      bdrv_co_do_pwrite_zeroes is a no-op due to a negative size,
      which means we did not actually copy on read
      
      Fix all of these issues by breaking up the action into a loop,
      where each iteration is capped to sane limits.  Also, querying
      the allocation status allows us to optimize: when data is
      already present in the active layer, we don't need to bounce
      (see the sketch after this entry).
      
      Note that the code has a telling comment that copy-on-read
      should probably be a filter driver rather than a bolt-on hack
      in io.c; but that remains a task for another day.
      
      CC: qemu-stable@nongnu.org
      Signed-off-by: Eric Blake <eblake@redhat.com>
      Reviewed-by: Kevin Wolf <kwolf@redhat.com>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
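
      A minimal standalone sketch of the capped-loop shape described above.
      Every name in it (MAX_BOUNCE_BUFFER, is_allocated, read_chunk,
      write_chunk, and the stub allocation map) is a hypothetical stand-in
      for the QEMU internals, not the actual implementation:

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>

        #define MAX_BOUNCE_BUFFER (32 * 1024 * 1024) /* cap per-iteration overhead */
        #define MIN(a, b) ((a) < (b) ? (a) : (b))

        /* Stub stand-ins for bdrv_is_allocated() and the driver read/write
         * calls; for the demo, everything below 1 MiB is "already allocated". */
        static bool is_allocated(int64_t offset, int64_t bytes, int64_t *pnum)
        {
            const int64_t boundary = 1024 * 1024;
            if (offset < boundary) {
                *pnum = MIN(bytes, boundary - offset);
                return true;
            }
            *pnum = bytes;
            return false;
        }
        static int read_chunk(int64_t off, int64_t n, void *buf)
        {
            (void)off; (void)n; (void)buf; return 0;
        }
        static int write_chunk(int64_t off, int64_t n, const void *buf)
        {
            (void)off; (void)n; (void)buf; return 0;
        }

        static int copy_on_read(int64_t offset, int64_t bytes)
        {
            void *bounce = NULL;
            int ret = 0;

            while (bytes > 0) {
                int64_t pnum;
                if (is_allocated(offset, bytes, &pnum)) {
                    /* Active layer already has this range: nothing to bounce. */
                    offset += pnum;
                    bytes -= pnum;
                    continue;
                }
                /* Cap each iteration so the bounce buffer stays a bounded
                 * overhead and no single driver call sees a huge request. */
                int64_t chunk = MIN(pnum, MAX_BOUNCE_BUFFER);
                if (!bounce) {
                    bounce = malloc(MAX_BOUNCE_BUFFER);
                    if (!bounce) {
                        return -1;
                    }
                }
                ret = read_chunk(offset, chunk, bounce);
                if (ret == 0) {
                    ret = write_chunk(offset, chunk, bounce);
                }
                if (ret < 0) {
                    break;
                }
                offset += chunk;
                bytes -= chunk;
            }
            free(bounce);
            return ret;
        }

        int main(void)
        {
            printf("copy_on_read: %d\n", copy_on_read(0, 64 * 1024 * 1024));
            return 0;
        }
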
    • block: Add blkdebug hook for copy-on-read · d855ebcd
      Committed by Eric Blake
      Make it possible to inject errors on writes performed during a
      read operation due to copy-on-read semantics.
      Signed-off-by: Eric Blake <eblake@redhat.com>
      Reviewed-by: Jeff Cody <jcody@redhat.com>
      Reviewed-by: Kevin Wolf <kwolf@redhat.com>
      Reviewed-by: John Snow <jsnow@redhat.com>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
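
      A sketch of how the new hook could be exercised, assuming the blkdebug
      event added by this patch is named cor_write and that qemu-io's -C flag
      enables copy-on-read (the file names are made up):

        # blkdebug.conf: fail the first write triggered by copy-on-read
        [inject-error]
        event = "cor_write"
        errno = "5"
        once = "on"

        # A read of an unallocated cluster then forces the COR write to fail:
        #   qemu-io -C -c 'read 0 64k' blkdebug:blkdebug.conf:test.qcow2
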
    • block: Uniform handling of 0-length bdrv_get_block_status() · 9cdcfd9f
      Committed by Eric Blake
      Handle a 0-length block status request up front, with a uniform
      return value claiming the area is not allocated.
      
      Most callers don't pass a length of 0 to bdrv_get_block_status()
      and friends; but it definitely happens with a 0-length read when
      copy-on-read is enabled.  While we could audit all callers to
      ensure that they never make a 0-length request, and then assert
      that fact, it was just as easy to fix things to always report
      success (as long as the callers are careful to not go into an
      infinite loop).  However, we had inconsistent behavior on whether
      the status was reported as allocated or deferred to the backing
      layer, depending on which callbacks the driver implements, and
      possibly wasted quite a few CPU cycles to get to that answer.
      Consistently reporting unallocated up front doesn't really hurt
      anything, and makes it easier both for callers (0-length requests
      now have well-defined behavior) and for drivers (drivers don't
      have to deal with 0-length requests).
      Signed-off-by: Eric Blake <eblake@redhat.com>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
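
      A minimal sketch of the up-front check, with an invented flag constant
      standing in for the real BDRV_BLOCK_* values:

        #include <assert.h>
        #include <stdint.h>

        #define BLOCK_ALLOCATED 0x1   /* stand-in for BDRV_BLOCK_ALLOCATED */

        /* Stand-in for bdrv_get_block_status(): answer 0-length queries
         * uniformly, before any driver callback is consulted. */
        static int get_block_status(int64_t offset, int64_t bytes, int64_t *pnum)
        {
            if (bytes == 0) {
                *pnum = 0;
                return 0;             /* success, uniformly "not allocated" */
            }
            (void)offset;             /* ... normal driver query goes here ... */
            *pnum = bytes;
            return BLOCK_ALLOCATED;
        }

        int main(void)
        {
            int64_t pnum;
            int ret = get_block_status(4096, 0, &pnum);
            assert(ret == 0 && pnum == 0);   /* well-defined, driver never asked */
            return 0;
        }
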
    • dirty-bitmap: Switch bdrv_set_dirty() to bytes · 0fdf1a4f
      Committed by Eric Blake
      Both callers already had bytes available, but were scaling to
      sectors.  Move the scaling to internal code.  In the case of
      bdrv_aligned_pwritev(), we are now passing the exact offset
      rather than a rounded sector-aligned value, but that's okay
      as long as the dirty-bitmap code widens start/bytes to granularity
      boundaries.
      Signed-off-by: Eric Blake <eblake@redhat.com>
      Reviewed-by: John Snow <jsnow@redhat.com>
      Reviewed-by: Kevin Wolf <kwolf@redhat.com>
      Reviewed-by: Fam Zheng <famz@redhat.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
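
      A sketch of the widening described above, with hypothetical helper
      names (the real code sets bits in an HBitmap rather than returning
      the widened range):

        #include <stdint.h>
        #include <stdio.h>

        #define ALIGN_DOWN(x, a) ((x) / (a) * (a))
        #define ALIGN_UP(x, a)   ALIGN_DOWN((x) + (a) - 1, (a))

        /* Stand-in for bdrv_set_dirty(): take an exact byte range and widen
         * it to granularity boundaries, so an unrounded offset is safe. */
        static void set_dirty(int64_t offset, int64_t bytes, int64_t granularity,
                              int64_t *first, int64_t *last)
        {
            *first = ALIGN_DOWN(offset, granularity);
            *last = ALIGN_UP(offset + bytes, granularity);
            /* ... mark [*first, *last) in the dirty bitmap ... */
        }

        int main(void)
        {
            int64_t first, last;
            set_dirty(1000, 100, 512, &first, &last);   /* unaligned request */
            /* Prints [512, 1536): widened out to granularity boundaries. */
            printf("dirtied [%lld, %lld)\n", (long long)first, (long long)last);
            return 0;
        }
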
    • block: Typo fix in copy_on_readv() · 765d9df9
      Committed by Eric Blake
      Signed-off-by: Eric Blake <eblake@redhat.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
  3. 05 Sep 2017, 1 commit
  4. 07 Aug 2017, 1 commit
  5. 17 Jul 2017, 2 commits
  6. 11 Jul 2017, 1 commit
  7. 10 Jul 2017, 5 commits
    • block: Make bdrv_is_allocated_above() byte-based · 51b0a488
      Committed by Eric Blake
      We are gradually moving away from sector-based interfaces, towards
      byte-based.  In the common case, allocation is unlikely to ever use
      values that are not naturally sector-aligned, but it is possible
      that byte-based values will let us be more precise about allocation
      at the end of an unaligned file that can do byte-based access.
      
      Changing the signature of the function to use int64_t *pnum ensures
      that the compiler enforces that all callers are updated.  For now,
      the io.c layer still assert()s that all callers are sector-aligned,
      but that can be relaxed when a later patch implements byte-based
      block status.  Therefore, for the most part this patch is just the
      addition of scaling at the callers followed by inverse scaling at
      bdrv_is_allocated().  But some code, particularly stream_run(),
      gets a lot simpler because it no longer has to mess with sectors.
      Leave comments where we can further simplify by switching to
      byte-based iterations, once later patches eliminate the need for
      sector-aligned operations.
      
      For ease of review, bdrv_is_allocated() was tackled separately.
      Signed-off-by: Eric Blake <eblake@redhat.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
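
      A sketch of the signature change (paraphrased from the description
      above; the _old name is invented for contrast).  Switching *pnum to
      int64_t makes every unconverted caller fail to compile:

        #include <stdint.h>

        typedef struct BlockDriverState BlockDriverState;

        /* Before: sector-based, with an int *pnum. */
        int bdrv_is_allocated_above_old(BlockDriverState *top,
                                        BlockDriverState *base,
                                        int64_t sector_num, int nb_sectors,
                                        int *pnum);

        /* After: byte-based.  The int64_t *pnum forces a compile error at
         * every caller that was not updated, so none can be missed. */
        int bdrv_is_allocated_above(BlockDriverState *top,
                                    BlockDriverState *base,
                                    int64_t offset, int64_t bytes,
                                    int64_t *pnum);
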
    • block: Minimize raw use of bds->total_sectors · c00716be
      Committed by Eric Blake
      bdrv_is_allocated_above() was relying on intermediate->total_sectors,
      which is a field that can have stale contents depending on the value
      of intermediate->has_variable_length.  An audit shows that we are safe
      (we were first calling through bdrv_co_get_block_status() which in
      turn calls bdrv_nb_sectors() and therefore just refreshed the current
      length), but it's nicer to favor our accessor functions to avoid having
      to repeat such an audit, even if it means refresh_total_sectors() is
      called more frequently.
      Suggested-by: John Snow <jsnow@redhat.com>
      Signed-off-by: Eric Blake <eblake@redhat.com>
      Reviewed-by: Manos Pitsidianakis <el13635@mail.ntua.gr>
      Reviewed-by: Jeff Cody <jcody@redhat.com>
      Reviewed-by: John Snow <jsnow@redhat.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
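
      A sketch of the pattern, with simplified types standing in for the
      QEMU structures:

        #include <stdint.h>

        typedef struct BDS {
            int64_t total_sectors;    /* cached; stale when length can vary */
            int has_variable_length;
        } BDS;

        /* Stand-in for bdrv_nb_sectors(): an accessor that refreshes the
         * cache for variable-length devices before answering, so callers
         * never have to audit whether the raw field is current. */
        static int64_t nb_sectors(BDS *bs)
        {
            if (bs->has_variable_length) {
                /* ... refresh_total_sectors(): re-query the driver ... */
            }
            return bs->total_sectors;
        }

        int main(void)
        {
            BDS bs = { 2048, 1 };
            /* Callers go through the accessor, never bs->total_sectors raw. */
            return nb_sectors(&bs) == 2048 ? 0 : 1;
        }
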
    • block: Make bdrv_is_allocated() byte-based · d6a644bb
      Committed by Eric Blake
      We are gradually moving away from sector-based interfaces, towards
      byte-based.  In the common case, allocation is unlikely to ever use
      values that are not naturally sector-aligned, but it is possible
      that byte-based values will let us be more precise about allocation
      at the end of an unaligned file that can do byte-based access.
      
      Changing the signature of the function to use int64_t *pnum ensures
      that the compiler enforces that all callers are updated.  For now,
      the io.c layer still assert()s that all callers are sector-aligned
      on input and that *pnum is sector-aligned on return to the caller,
      but that can be relaxed when a later patch implements byte-based
      block status.  Therefore, this code adds usages like
      DIV_ROUND_UP(,BDRV_SECTOR_SIZE) to callers that still want aligned
      values, where the call might reasonably give non-aligned results
      in the future; on the other hand, no rounding is needed for callers
      that should just continue to work with byte alignment.
      
      For the most part this patch is just the addition of scaling at the
      callers followed by inverse scaling at bdrv_is_allocated().  But
      some code, particularly bdrv_commit(), gets a lot simpler because it
      no longer has to mess with sectors; also, it is now possible to pass
      NULL if the caller does not care how much of the image is allocated
      beyond the initial offset.  Leave comments where we can further
      simplify once a later patch eliminates the need for sector-aligned
      requests through bdrv_is_allocated().
      
      For ease of review, bdrv_is_allocated_above() will be tackled
      separately.
      Signed-off-by: Eric Blake <eblake@redhat.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
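
      A sketch of the caller-side scaling: query in bytes, then round up to
      sectors only where a sector count is still needed (is_allocated is a
      stand-in for the byte-based bdrv_is_allocated(); DIV_ROUND_UP is as
      in QEMU's osdep.h):

        #include <stdint.h>
        #include <stdio.h>

        #define BDRV_SECTOR_SIZE 512
        #define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

        /* Stand-in: pretend the whole queried range is allocated. */
        static int is_allocated(int64_t offset, int64_t bytes, int64_t *pnum)
        {
            (void)offset;
            *pnum = bytes;
            return 1;
        }

        int main(void)
        {
            int64_t count;
            is_allocated(0, 1000, &count);        /* byte-based query */
            /* A caller still thinking in sectors rounds up, so a partial
             * sector at end-of-file counts as allocated instead of lost: */
            int nr_sectors = DIV_ROUND_UP(count, BDRV_SECTOR_SIZE);
            printf("%d sectors\n", nr_sectors);   /* 1000 bytes -> 2 sectors */
            return 0;
        }
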
    • block: Drop unused bdrv_round_sectors_to_clusters() · e8a81e9c
      Committed by Eric Blake
      Now that the last user [mirror_iteration()] has converted to using
      bytes, we no longer need a function to round sectors to clusters.
      Signed-off-by: Eric Blake <eblake@redhat.com>
      Reviewed-by: John Snow <jsnow@redhat.com>
      Reviewed-by: Jeff Cody <jcody@redhat.com>
      Reviewed-by: Kevin Wolf <kwolf@redhat.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
    • block: Guarantee that *file is set on bdrv_get_block_status() · 81c219ac
      Committed by Eric Blake
      We document that *file is valid if the return is not an error and
      includes BDRV_BLOCK_OFFSET_VALID, but forgot to obey this contract
      when a driver (such as blkdebug) lacks a callback.  Messed up in
      commit 67a0fd2a (v2.6), when we added the file parameter.
      
      Enhance qemu-iotest 177 to cover this, using a sequence that would
      print garbage or even SEGV, because it was dereferencing through
      uninitialized memory.  [The resulting test output shows that we
      have less-than-ideal block status from the blkdebug driver, but
      that's a separate fix coming up soon.]
      
      Setting *file on all paths that return BDRV_BLOCK_OFFSET_VALID is
      enough to fix the crash, but we can go one step further: always
      setting *file, even on error, means that a broken caller that
      blindly dereferences file without checking for error is now more
      likely to get a reliable SEGV instead of randomly acting on garbage,
      making it easier to diagnose such buggy callers.  Adding an
      assertion that file is set where expected doesn't hurt either.
      
      CC: qemu-stable@nongnu.org
      Signed-off-by: Eric Blake <eblake@redhat.com>
      Reviewed-by: Fam Zheng <famz@redhat.com>
      Reviewed-by: Max Reitz <mreitz@redhat.com>
      Reviewed-by: John Snow <jsnow@redhat.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
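
      A sketch of the contract fix with simplified flags and types:
      initialize *file up front so even error paths leave it NULL, and
      assert it is set whenever OFFSET_VALID is returned:

        #include <assert.h>
        #include <stddef.h>
        #include <stdint.h>

        #define BLOCK_OFFSET_VALID 0x2  /* stand-in for BDRV_BLOCK_OFFSET_VALID */

        typedef struct BDS { struct BDS *file; } BDS;

        static int get_block_status(BDS *bs, int64_t offset,
                                    int *status, BDS **file)
        {
            /* Always initialize: a buggy caller that dereferences *file
             * without checking now gets a reliable NULL crash, not garbage. */
            *file = NULL;
            (void)offset;

            /* Default path for drivers without a status callback: the
             * data visibly lives in bs->file. */
            *status = BLOCK_OFFSET_VALID;
            *file = bs->file;

            assert(!(*status & BLOCK_OFFSET_VALID) || *file);   /* contract */
            return 0;
        }

        int main(void)
        {
            BDS child = { NULL }, bs = { &child };
            int status;
            BDS *file;
            get_block_status(&bs, 0, &status, &file);
            assert(file == &child);
            return 0;
        }
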
  8. 30 Jun 2017, 2 commits
    • block: Exploit BDRV_BLOCK_EOF for larger zero blocks · c61e684e
      Committed by Eric Blake
      When we have a BDS with unallocated clusters, but asking the status
      of its underlying bs->file or backing layer encounters an end-of-file
      condition, we know that the rest of the unallocated area will read as
      zeroes.  However, pre-patch, this required two separate calls to
      bdrv_get_block_status(), as the first call stops at the point where
      the underlying file ends.  Thanks to BDRV_BLOCK_EOF, we can now widen
      the results of the primary status if the secondary status already
      includes BDRV_BLOCK_ZERO.
      
      In turn, this fixes a TODO mentioned in iotest 154, where we can now
      see that all sectors in a partial cluster at the end of a file read
      as zero when coupling the shorter backing file's status along with our
      knowledge that the remaining sectors came from an unallocated cluster.
      
      Also, note that the loop in bdrv_co_get_block_status_above() had an
      inefficient exit: in cases where the active layer sets BDRV_BLOCK_ZERO
      but does NOT set BDRV_BLOCK_ALLOCATED (namely, where we know we read
      zeroes merely because our unallocated clusters lie beyond the backing
      file's shorter length), we still ended up probing the backing layer
      even though we already had a good answer.
      Signed-off-by: Eric Blake <eblake@redhat.com>
      Message-Id: <20170505021500.19315-3-eblake@redhat.com>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Fam Zheng <famz@redhat.com>
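
      A sketch of the widening step with invented flag values: when the
      backing query hits EOF and reports zeroes, the primary result can
      claim zeroes for the whole remaining range without a second query:

        #include <stdint.h>
        #include <stdio.h>

        #define BLOCK_ZERO 0x1   /* stand-ins for BDRV_BLOCK_ZERO / _EOF */
        #define BLOCK_EOF  0x4

        /* Combine an unallocated primary status with the backing file's
         * status: past the backing file's EOF, unallocated reads are zero. */
        static int widen_status(int ret, int64_t *pnum, int64_t want,
                                int ret2, int64_t pnum2)
        {
            if ((ret2 & BLOCK_EOF) && (ret2 & BLOCK_ZERO) && pnum2 < want) {
                *pnum = want;             /* rest of the range reads as zero */
                return ret | BLOCK_ZERO;
            }
            *pnum = pnum2;
            return ret | (ret2 & BLOCK_ZERO);
        }

        int main(void)
        {
            int64_t pnum;
            /* Backing file ends after 512 of the 4096 bytes asked about: */
            int ret = widen_status(0, &pnum, 4096, BLOCK_ZERO | BLOCK_EOF, 512);
            printf("zero=%d pnum=%lld\n", !!(ret & BLOCK_ZERO), (long long)pnum);
            return 0;   /* whole 4096 bytes known zero from one query */
        }
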
    • block: Add BDRV_BLOCK_EOF to bdrv_get_block_status() · fb0d8654
      Committed by Eric Blake
      Just as the block layer already sets BDRV_BLOCK_ALLOCATED as a
      shortcut for subsequent operations, there are also some optimizations
      that are made easier if we can quickly tell that *pnum will advance
      us to the end of a file, via a new BDRV_BLOCK_EOF which gets set
      by the block layer.
      
      This just plumbs up the new bit; subsequent patches will make use
      of it.
      Signed-off-by: Eric Blake <eblake@redhat.com>
      Message-Id: <20170505021500.19315-2-eblake@redhat.com>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Fam Zheng <famz@redhat.com>
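
      A sketch of the plumbing with an invented flag value: the block layer
      tags the returned status whenever *pnum runs to end-of-file:

        #include <stdint.h>
        #include <stdio.h>

        #define BLOCK_EOF 0x4   /* stand-in for BDRV_BLOCK_EOF */

        /* After computing status for [offset, offset + pnum), set the new
         * bit when the range reaches the end of the file, so callers can
         * skip further queries. */
        static int tag_eof(int ret, int64_t offset, int64_t pnum,
                           int64_t total_size)
        {
            if (offset + pnum == total_size) {
                ret |= BLOCK_EOF;
            }
            return ret;
        }

        int main(void)
        {
            printf("%d\n", tag_eof(0, 512, 512, 1024));   /* prints 4: EOF set */
            return 0;
        }
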
  9. 26 Jun 2017, 4 commits
  10. 16 Jun 2017, 8 commits
  11. 25 May 2017, 1 commit
  12. 12 May 2017, 1 commit
  13. 27 Apr 2017, 3 commits
  14. 18 Apr 2017, 1 commit
    • block: Walk bs->children carefully in bdrv_drain_recurse · 178bd438
      Committed by Fam Zheng
      The recursive bdrv_drain_recurse may run a block job completion BH that
      drops nodes. The coming changes will make that more likely, and
      use-after-free would happen without this patch.
      
      Stash the bs pointer and use bdrv_ref/bdrv_unref in addition to
      QLIST_FOREACH_SAFE to prevent such a case from happening.
      
      Since bdrv_unref accesses global state that is not protected by the AioContext
      lock, we cannot use bdrv_ref/bdrv_unref unconditionally.  Fortunately the
      protection is not needed in an IOThread, because only the main loop can
      modify the graph with the AioContext lock held.
      Signed-off-by: Fam Zheng <famz@redhat.com>
      Message-Id: <20170418143044.12187-2-famz@redhat.com>
      Reviewed-by: Jeff Cody <jcody@redhat.com>
      Tested-by: Jeff Cody <jcody@redhat.com>
      Signed-off-by: Fam Zheng <famz@redhat.com>
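
      A sketch of the safe walk, with QLIST_FOREACH_SAFE modeled by a plain
      next-pointer loop and in_main_loop standing in for the AioContext
      check in the real patch (all types simplified):

        #include <stdbool.h>
        #include <stddef.h>

        typedef struct BDS BDS;

        typedef struct Child {
            BDS *bs;
            struct Child *next;
        } Child;

        struct BDS {
            int refcnt;
            Child *children;
        };

        static void ref(BDS *bs)   { bs->refcnt++; }
        static void unref(BDS *bs) { if (--bs->refcnt == 0) { /* free */ } }
        static void drain(BDS *bs);

        /* A completion BH run inside drain() may unlink nodes, so grab the
         * next pointer first and pin each child with a reference while it
         * is being drained (taking references is only legal in the main
         * loop, hence the conditional). */
        static void drain_recurse(BDS *bs, bool in_main_loop)
        {
            Child *child = bs->children;
            while (child) {
                Child *next = child->next;   /* survives child being unlinked */
                BDS *child_bs = child->bs;   /* stash: child may go away */
                if (in_main_loop) {
                    ref(child_bs);           /* pin against use-after-free */
                }
                drain(child_bs);
                if (in_main_loop) {
                    unref(child_bs);
                }
                child = next;
            }
        }

        static void drain(BDS *bs)
        {
            drain_recurse(bs, false);
        }

        int main(void)
        {
            BDS leaf = { 1, NULL };
            Child c = { &leaf, NULL };
            BDS root = { 1, &c };
            drain_recurse(&root, true);
            return 0;
        }
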
  15. 11 Apr 2017, 3 commits
    • block/io: Comment out permission assertions · e3e0003a
      Committed by Max Reitz
      In case of block migration, there may be writes to BlockBackends that do
      not have the write permission taken. Until this issue is fixed (which
      is not going to happen in 2.9), we cannot assert that the permission
      is held.
      Suggested-by: Kevin Wolf <kwolf@redhat.com>
      Signed-off-by: Max Reitz <mreitz@redhat.com>
      Reviewed-by: Kevin Wolf <kwolf@redhat.com>
      Tested-by: Kevin Wolf <kwolf@redhat.com>
      Message-id: 20170411145050.31290-1-mreitz@redhat.com
      Tested-by: Laurent Vivier <lvivier@redhat.com>
      Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
    • block: Fix bdrv_co_flush early return · 49ca6259
      Committed by Fam Zheng
      bdrv_inc_in_flight and bdrv_dec_in_flight are mandatory for
      BDRV_POLL_WHILE to work, even for the shortcut case where flush is
      unnecessary. Move the if block to below bdrv_inc_in_flight, and while
      at it fix the variable declaration position.
      Signed-off-by: Fam Zheng <famz@redhat.com>
      Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
      Reviewed-by: Kevin Wolf <kwolf@redhat.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
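
      A sketch of the fix's shape with hypothetical helper names: the
      in_flight counter must bracket even the "nothing to flush" path, so
      the shortcut moves below the increment and leaves through a label
      that decrements:

        #include <stdbool.h>

        typedef struct BDS {
            int in_flight;
            bool has_drv;
            bool read_only;
        } BDS;

        static void inc_in_flight(BDS *bs) { bs->in_flight++; }
        static void dec_in_flight(BDS *bs) { bs->in_flight--; }

        /* BDRV_POLL_WHILE relies on in_flight being non-zero while a
         * request is pending, so even the shortcut takes the inc/dec. */
        static int co_flush(BDS *bs)
        {
            int ret = 0;

            inc_in_flight(bs);
            if (!bs->has_drv || bs->read_only) {
                goto early_exit;          /* flush unnecessary, but balanced */
            }
            /* ... actual flush to the driver and to bs->file ... */

        early_exit:
            dec_in_flight(bs);
            return ret;
        }

        int main(void)
        {
            BDS bs = { 0, false, false };
            return co_flush(&bs);
        }
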
    • block: Use bdrv_coroutine_enter to start I/O coroutines · e92f0e19
      Committed by Fam Zheng
      BDRV_POLL_WHILE waits for the started I/O by releasing bs's ctx then polling
      the main context, which relies on the yielded coroutine continuing on bs->ctx
      before notifying qemu_aio_context with bdrv_wakeup().
      
      Thus, using qemu_coroutine_enter to start I/O is wrong: if the coroutine
      is entered from the main loop, co->ctx will be qemu_aio_context, and as
      a result of the "release, poll, acquire" loop of BDRV_POLL_WHILE, race
      conditions occur when both the main thread and the iothread access the
      same BDS:
      
        main loop                                iothread
      -----------------------------------------------------------------------
        blockdev_snapshot
          aio_context_acquire(bs->ctx)
                                                 virtio_scsi_data_plane_handle_cmd
          bdrv_drained_begin(bs->ctx)
          bdrv_flush(bs)
            bdrv_co_flush(bs)                      aio_context_acquire(bs->ctx).enter
              ...
              qemu_coroutine_yield(co)
            BDRV_POLL_WHILE()
              aio_context_release(bs->ctx)
                                                   aio_context_acquire(bs->ctx).return
                                                     ...
                                                       aio_co_wake(co)
              aio_poll(qemu_aio_context)               ...
                co_schedule_bh_cb()                    ...
                  qemu_coroutine_enter(co)             ...
      
                    /* (A) bdrv_co_flush(bs)           /* (B) I/O on bs */
                            continues... */
                                                   aio_context_release(bs->ctx)
              aio_context_acquire(bs->ctx)
      
      Note that in above case, bdrv_drained_begin() doesn't do the "release,
      poll, acquire" in BDRV_POLL_WHILE, because bs->in_flight == 0.
      
      Fix this by using bdrv_coroutine_enter to enter the coroutine in the
      right context.
      
      iotests 109 output is updated because the coroutine re-entry flow during
      mirror job completion is different (now through co_queue_wakeup, instead
      of the unconditional qemu_coroutine_switch before), making the final job
      len different.
      Signed-off-by: Fam Zheng <famz@redhat.com>
      Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
      Reviewed-by: Kevin Wolf <kwolf@redhat.com>
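
      A sketch of the helper the callers switch to, paraphrased from the
      QEMU sources of this era (declarations abbreviated to opaque types;
      an illustration, not a buildable unit outside the tree):

        typedef struct AioContext AioContext;
        typedef struct Coroutine Coroutine;
        typedef struct BlockDriverState BlockDriverState;

        AioContext *bdrv_get_aio_context(BlockDriverState *bs);
        void aio_co_enter(AioContext *ctx, Coroutine *co);

        /* Enter the coroutine in bs's own context, so co->ctx is bs->ctx
         * and a later aio_co_wake() continues it there instead of in
         * qemu_aio_context: */
        void bdrv_coroutine_enter(BlockDriverState *bs, Coroutine *co)
        {
            aio_co_enter(bdrv_get_aio_context(bs), co);
        }
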