1. 04 Jan, 2017 (1 commit)
  2. 29 Sep, 2016 (1 commit)
    • linux-aio: fix re-entrant completion processing · fe121b9d
      Authored by Stefan Hajnoczi
      Commit 0ed93d84 ("linux-aio: process
      completions from ioq_submit()") added an optimization that processes
      completions each time ioq_submit() returns with requests in flight.
      That commit introduced a "Co-routine re-entered recursively" error that
      can be triggered with -drive format=qcow2,aio=native.
      
      Fam Zheng <famz@redhat.com>, Kevin Wolf <kwolf@redhat.com>, and I
      debugged the following backtrace:
      
        (gdb) bt
        #0  0x00007ffff0a046f5 in raise () at /lib64/libc.so.6
        #1  0x00007ffff0a062fa in abort () at /lib64/libc.so.6
        #2  0x0000555555ac0013 in qemu_coroutine_enter (co=0x5555583464d0) at util/qemu-coroutine.c:113
        #3  0x0000555555a4b663 in qemu_laio_process_completions (s=s@entry=0x555557e2f7f0) at block/linux-aio.c:218
        #4  0x0000555555a4b874 in ioq_submit (s=s@entry=0x555557e2f7f0) at block/linux-aio.c:331
        #5  0x0000555555a4ba12 in laio_do_submit (fd=fd@entry=13, laiocb=laiocb@entry=0x555559d38ae0, offset=offset@entry=2932727808, type=type@entry=1) at block/linux-aio.c:383
        #6  0x0000555555a4bbd3 in laio_co_submit (bs=<optimized out>, s=0x555557e2f7f0, fd=13, offset=2932727808, qiov=0x555559d38e20, type=1) at block/linux-aio.c:402
        #7  0x0000555555a4fd23 in bdrv_driver_preadv (bs=bs@entry=0x55555663bcb0, offset=offset@entry=2932727808, bytes=bytes@entry=8192, qiov=qiov@entry=0x555559d38e20, flags=0) at block/io.c:804
        #8  0x0000555555a52b34 in bdrv_aligned_preadv (bs=bs@entry=0x55555663bcb0, req=req@entry=0x555559d38d20, offset=offset@entry=2932727808, bytes=bytes@entry=8192, align=align@entry=512, qiov=qiov@entry=0x555559d38e20, flags=0) at block/io.c:1041
        #9  0x0000555555a52db8 in bdrv_co_preadv (child=<optimized out>, offset=2932727808, bytes=8192, qiov=qiov@entry=0x555559d38e20, flags=flags@entry=0) at block/io.c:1133
        #10 0x0000555555a29629 in qcow2_co_preadv (bs=0x555556635890, offset=6178725888, bytes=8192, qiov=0x555557527840, flags=<optimized out>) at block/qcow2.c:1509
        #11 0x0000555555a4fd23 in bdrv_driver_preadv (bs=bs@entry=0x555556635890, offset=offset@entry=6178725888, bytes=bytes@entry=8192, qiov=qiov@entry=0x555557527840, flags=0) at block/io.c:804
        #12 0x0000555555a52b34 in bdrv_aligned_preadv (bs=bs@entry=0x555556635890, req=req@entry=0x555559d39000, offset=offset@entry=6178725888, bytes=bytes@entry=8192, align=align@entry=1, qiov=qiov@entry=0x555557527840, flags=0) at block/io.c:1041
        #13 0x0000555555a52db8 in bdrv_co_preadv (child=<optimized out>, offset=offset@entry=6178725888, bytes=bytes@entry=8192, qiov=qiov@entry=0x555557527840, flags=flags@entry=0) at block/io.c:1133
        #14 0x0000555555a4515a in blk_co_preadv (blk=0x5555566356d0, offset=6178725888, bytes=8192, qiov=0x555557527840, flags=0) at block/block-backend.c:783
        #15 0x0000555555a45266 in blk_aio_read_entry (opaque=0x5555577025e0) at block/block-backend.c:991
        #16 0x0000555555ac0cfa in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at util/coroutine-ucontext.c:78
      
      It turned out that re-entrant calls to ioq_submit() and completion
      processing across three requests caused this error.  The following
      check is not sufficient to prevent recursively entering coroutines:
      
        if (laiocb->co != qemu_coroutine_self()) {
            qemu_coroutine_enter(laiocb->co);
        }
      
      As the following coroutine backtrace shows, it is not just the current
      coroutine (self) that can be entered.  There may also be other
      coroutines that are currently entered and have transferred control due
      to the qcow2 lock (CoMutex):
      
        (gdb) qemu coroutine 0x5555583464d0
        #0  0x0000555555ac0c90 in qemu_coroutine_switch (from_=from_@entry=0x5555583464d0, to_=to_@entry=0x5555572f9890, action=action@entry=COROUTINE_ENTER) at util/coroutine-ucontext.c:175
        #1  0x0000555555abfe54 in qemu_coroutine_enter (co=0x5555572f9890) at util/qemu-coroutine.c:117
        #2  0x0000555555ac031c in qemu_co_queue_run_restart (co=co@entry=0x5555583462c0) at util/qemu-coroutine-lock.c:60
        #3  0x0000555555abfe5e in qemu_coroutine_enter (co=0x5555583462c0) at util/qemu-coroutine.c:119
        #4  0x0000555555a4b663 in qemu_laio_process_completions (s=s@entry=0x555557e2f7f0) at block/linux-aio.c:218
        #5  0x0000555555a4b874 in ioq_submit (s=s@entry=0x555557e2f7f0) at block/linux-aio.c:331
        #6  0x0000555555a4ba12 in laio_do_submit (fd=fd@entry=13, laiocb=laiocb@entry=0x55555a338b40, offset=offset@entry=2911477760, type=type@entry=1) at block/linux-aio.c:383
        #7  0x0000555555a4bbd3 in laio_co_submit (bs=<optimized out>, s=0x555557e2f7f0, fd=13, offset=2911477760, qiov=0x55555a338e80, type=1) at block/linux-aio.c:402
        #8  0x0000555555a4fd23 in bdrv_driver_preadv (bs=bs@entry=0x55555663bcb0, offset=offset@entry=2911477760, bytes=bytes@entry=8192, qiov=qiov@entry=0x55555a338e80, flags=0) at block/io.c:804
        #9  0x0000555555a52b34 in bdrv_aligned_preadv (bs=bs@entry=0x55555663bcb0, req=req@entry=0x55555a338d80, offset=offset@entry=2911477760, bytes=bytes@entry=8192, align=align@entry=512, qiov=qiov@entry=0x55555a338e80, flags=0) at block/io.c:1041
        #10 0x0000555555a52db8 in bdrv_co_preadv (child=<optimized out>, offset=2911477760, bytes=8192, qiov=qiov@entry=0x55555a338e80, flags=flags@entry=0) at block/io.c:1133
        #11 0x0000555555a29629 in qcow2_co_preadv (bs=0x555556635890, offset=6157475840, bytes=8192, qiov=0x5555575df720, flags=<optimized out>) at block/qcow2.c:1509
        #12 0x0000555555a4fd23 in bdrv_driver_preadv (bs=bs@entry=0x555556635890, offset=offset@entry=6157475840, bytes=bytes@entry=8192, qiov=qiov@entry=0x5555575df720, flags=0) at block/io.c:804
        #13 0x0000555555a52b34 in bdrv_aligned_preadv (bs=bs@entry=0x555556635890, req=req@entry=0x55555a339060, offset=offset@entry=6157475840, bytes=bytes@entry=8192, align=align@entry=1, qiov=qiov@entry=0x5555575df720, flags=0) at block/io.c:1041
        #14 0x0000555555a52db8 in bdrv_co_preadv (child=<optimized out>, offset=offset@entry=6157475840, bytes=bytes@entry=8192, qiov=qiov@entry=0x5555575df720, flags=flags@entry=0) at block/io.c:1133
        #15 0x0000555555a4515a in blk_co_preadv (blk=0x5555566356d0, offset=6157475840, bytes=8192, qiov=0x5555575df720, flags=0) at block/block-backend.c:783
        #16 0x0000555555a45266 in blk_aio_read_entry (opaque=0x555557231aa0) at block/block-backend.c:991
        #17 0x0000555555ac0cfa in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at util/coroutine-ucontext.c:78
      
      Use the new qemu_coroutine_entered() function instead of comparing
      against qemu_coroutine_self(), as sketched after the list below.
      This is correct because:
      
      1. If a coroutine is not entered then it must have yielded to wait for
         I/O completion.  It is therefore safe to enter.
      
      2. If a coroutine is entered then it must be in
         ioq_submit()/qemu_laio_process_completions() because otherwise it
         would be yielded while waiting for I/O completion.  Therefore it will
         check laiocb->ret and return from ioq_submit() instead of yielding,
         i.e. it's guaranteed not to hang.
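      A minimal sketch of the resulting completion path, assuming the
      laiocb naming from the backtraces above (the exact surrounding code
      in block/linux-aio.c may differ):

        /* Enter the request's coroutine only if it is not already
         * entered somewhere up the call stack. */
        if (!qemu_coroutine_entered(laiocb->co)) {
            qemu_coroutine_enter(laiocb->co);
        }
        /* If it was entered, the coroutine is inside ioq_submit()/
         * qemu_laio_process_completions(); it will pick up the stored
         * return value and return instead of yielding, so nothing hangs. */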
      Reported-by: Fam Zheng <famz@redhat.com>
      Tested-by: Fam Zheng <famz@redhat.com>
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
      Reviewed-by: Fam Zheng <famz@redhat.com>
      Message-id: 1474989516-18255-4-git-send-email-stefanha@redhat.com
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
  3. 13 Sep, 2016 (3 commits)
    • linux-aio: process completions from ioq_submit() · 0ed93d84
      Authored by Roman Pen
      In order to reduce completion latency it makes sense to harvest
      completed requests as soon as possible.  A very fast backend device
      can complete requests just after submission, so it is worth checking
      the ring buffer for completed requests directly after io_submit()
      has been called.
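
      As a rough illustration of the submit-then-reap idea in plain libaio
      terms (this is not QEMU's actual ioq_submit(); ctx, nr, and iocbs
      stand for an already prepared submission, and 128 matches the
      MAX_EVENTS ring size used elsewhere in this log):

        struct io_event events[128];
        struct timespec ts = { 0, 0 };  /* zero timeout: peek, don't block */
        int nr_done, i;

        /* Submit, then immediately reap whatever has already completed
         * instead of waiting for the eventfd notification. */
        if (io_submit(ctx, nr, iocbs) >= 0) {
            nr_done = io_getevents(ctx, 0, 128, events, &ts);
            for (i = 0; i < nr_done; i++) {
                /* complete the request attached to events[i].data here */
            }
        }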
      
      Indeed, this patch reduces the completions latencies and increases the
      overall throughput, e.g. the following is the percentiles of number of
      completed requests at once:
      
              1st 10th  20th  30th  40th  50th  60th  70th  80th  90th  99.99th
      Before    2    4    42   112   128   128   128   128   128   128    128
       After    1    1     4    14    33    45    47    48    50    51    108
      
      That means that before this patch is applied, the ring buffer is
      observed as full (128 requests consumed at once) in 60% of the calls.

      After the patch is applied, the distribution of the number of
      completed requests is "smoother" and the queue (requests in flight)
      is almost never full.
      
      The fio read results are the following (write results are almost the
      same and are not shown here):
      
        Before
        ------
      job: (groupid=0, jobs=8): err= 0: pid=2227: Tue Jul 19 11:29:50 2016
        Description  : [Emulation of Storage Server Access Pattern]
        read : io=54681MB, bw=1822.7MB/s, iops=179779, runt= 30001msec
          slat (usec): min=172, max=16883, avg=338.35, stdev=109.66
          clat (usec): min=1, max=21977, avg=1051.45, stdev=299.29
           lat (usec): min=317, max=22521, avg=1389.83, stdev=300.73
          clat percentiles (usec):
           |  1.00th=[  346],  5.00th=[  596], 10.00th=[  708], 20.00th=[  852],
           | 30.00th=[  932], 40.00th=[  996], 50.00th=[ 1048], 60.00th=[ 1112],
           | 70.00th=[ 1176], 80.00th=[ 1256], 90.00th=[ 1384], 95.00th=[ 1496],
           | 99.00th=[ 1800], 99.50th=[ 1928], 99.90th=[ 2320], 99.95th=[ 2672],
           | 99.99th=[ 4704]
          bw (KB  /s): min=205229, max=553181, per=12.50%, avg=233278.26, stdev=18383.51
      
        After
        ------
      job: (groupid=0, jobs=8): err= 0: pid=2220: Tue Jul 19 11:31:51 2016
        Description  : [Emulation of Storage Server Access Pattern]
        read : io=57637MB, bw=1921.2MB/s, iops=189529, runt= 30002msec
          slat (usec): min=169, max=20636, avg=329.61, stdev=124.18
          clat (usec): min=2, max=19592, avg=988.78, stdev=251.04
           lat (usec): min=381, max=21067, avg=1318.42, stdev=243.58
          clat percentiles (usec):
           |  1.00th=[  310],  5.00th=[  580], 10.00th=[  748], 20.00th=[  876],
           | 30.00th=[  908], 40.00th=[  948], 50.00th=[ 1012], 60.00th=[ 1064],
           | 70.00th=[ 1080], 80.00th=[ 1128], 90.00th=[ 1224], 95.00th=[ 1288],
           | 99.00th=[ 1496], 99.50th=[ 1608], 99.90th=[ 1960], 99.95th=[ 2256],
           | 99.99th=[ 5408]
          bw (KB  /s): min=212149, max=390160, per=12.49%, avg=245746.04, stdev=11606.75
      
      Throughput increased from 1822MB/s to 1921MB/s, and the average
      completion latency decreased from 1051us to 988us.
      Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
      Message-id: 1468931263-32667-4-git-send-email-roman.penyaev@profitbricks.com
      Cc: Stefan Hajnoczi <stefanha@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: qemu-devel@nongnu.org
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
    • linux-aio: split processing events function · 3407de57
      Authored by Roman Pen
      Prepare the event processing function to be called from ioq_submit()
      by splitting it into two parts: the first harvests completed I/O
      requests, the second submits pending requests.
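      A hedged sketch of the resulting split; the function names match the
      backtraces quoted above, while the io_q fields are assumptions based
      on block/linux-aio.c of that era:

        /* Part 1: harvest completed requests from the completion ring. */
        static void qemu_laio_process_completions(LinuxAioState *s);

        /* Part 2: submit requests that are still pending. */
        static void ioq_submit(LinuxAioState *s);

        /* Combined helper for the event-notifier path. */
        static void qemu_laio_process_completions_and_submit(LinuxAioState *s)
        {
            qemu_laio_process_completions(s);
            if (!s->io_q.plugged && !QSIMPLEQ_EMPTY(&s->io_q.pending)) {
                ioq_submit(s);
            }
        }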
      Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
      Message-id: 1468931263-32667-3-git-send-email-roman.penyaev@profitbricks.com
      Cc: Stefan Hajnoczi <stefanha@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: qemu-devel@nongnu.org
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
    • linux-aio: consume events in userspace instead of calling io_getevents · 9e909a58
      Authored by Roman Pen
      The AIO context in userspace is represented as a simple ring buffer
      that can be consumed directly without entering the kernel, which
      obviously brings some performance gain.  QEMU does not use a timeout
      value when waiting for event completions, so we can consume all
      events from userspace.
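      A minimal sketch of the technique, assuming the kernel's aio_ring ABI
      (the struct mirrors the kernel's ring header; treat the details as
      illustrative rather than QEMU's exact code):

        /* io_context_t actually points at a ring mapped into userspace;
         * completed events can be read from it without a syscall. */
        struct aio_ring {
            unsigned id;             /* kernel internal index */
            unsigned nr;             /* number of io_event slots */
            unsigned head;           /* consumer index, written by userspace */
            unsigned tail;           /* producer index, written by the kernel */
            unsigned magic;
            unsigned compat_features;
            unsigned incompat_features;
            unsigned header_length;  /* size of this header */
            struct io_event io_events[];
        };

        /* Peek at completed events currently in the ring, without blocking. */
        static unsigned int io_getevents_peek(io_context_t ctx,
                                              struct io_event **events)
        {
            struct aio_ring *ring = (struct aio_ring *)ctx;
            unsigned int head = ring->head, tail = ring->tail;

            if (head == tail) {
                return 0;
            }
            __sync_synchronize();    /* read barrier before reading events */
            *events = ring->io_events + head;
            return tail >= head ? tail - head : ring->nr - head;
        }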
      Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
      Message-id: 1468931263-32667-2-git-send-email-roman.penyaev@profitbricks.com
      Cc: Stefan Hajnoczi <stefanha@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: qemu-devel@nongnu.org
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
  4. 11 Aug, 2016 (1 commit)
  5. 18 Jul, 2016 (2 commits)
    • linux-aio: prevent submitting more than MAX_EVENTS · 5e1b34a3
      Authored by Roman Pen
      When invoking io_setup(MAX_EVENTS) we ask the kernel to create a ring
      buffer for us with the specified number of events.  But the kernel's
      ring buffer allocation logic is a bit tricky (the ring buffer is
      page-size aligned and some per-CPU allocations are required), so
      eventually more events than requested are allocated.

      From the userspace side we have to follow the convention: either do
      not io_submit() more than MAX_EVENTS requests, or change the logic
      that consumes completed events accordingly.  The pitfall is in the
      following sequence:
      
          MAX_EVENTS = 128
          io_setup(MAX_EVENTS)
      
          io_submit(MAX_EVENTS)
          io_submit(MAX_EVENTS)
      
          /* now 256 events are in-flight */
      
          io_getevents(MAX_EVENTS) = 128
      
          /* we can handle only 128 events at once, to be sure
           * that nothing is pended the io_getevents(MAX_EVENTS)
           * call must be invoked once more or hang will happen. */
      
      To prevent the hang or a repeated io_getevents() call, this patch
      restricts the number of in-flight requests, which is now limited to
      MAX_EVENTS.
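      A hedged sketch of the cap, assuming an in_flight counter in the
      submission queue state (the field name is illustrative):

        #define MAX_EVENTS 128

        /* Clamp the batch so that in-flight requests never exceed the
         * ring size requested from io_setup(). */
        if (s->io_q.in_flight + len > MAX_EVENTS) {
            len = MAX_EVENTS - s->io_q.in_flight;
        }
        /* ... io_submit() at most len iocbs, then io_q.in_flight += len */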
      Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
      Reviewed-by: Fam Zheng <famz@redhat.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      Message-id: 1468415004-31755-1-git-send-email-roman.penyaev@profitbricks.com
      Cc: Stefan Hajnoczi <stefanha@redhat.com>
      Cc: qemu-devel@nongnu.org
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
    • linux-aio: share one LinuxAioState within an AioContext · 0187f5c9
      Authored by Paolo Bonzini
      This has better performance because it executes fewer system calls
      and does not use a bottom half per disk.
      
      Originally proposed by Ming Lei.
      
      [Changed #include "raw-aio.h" to "block/raw-aio.h" in win32-aio.c to fix
      build error as reported by Peter Maydell <peter.maydell@linaro.org>.
      --Stefan]
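      A hedged sketch of the per-context accessor this change implies; the
      function and field names follow the commit title and later QEMU
      sources, but treat the body as illustrative:

        /* Lazily create one LinuxAioState shared by every disk that uses
         * aio=native in this AioContext. */
        LinuxAioState *aio_get_linux_aio(AioContext *ctx)
        {
            if (!ctx->linux_aio) {
                ctx->linux_aio = laio_init();
                laio_attach_aio_context(ctx->linux_aio, ctx);
            }
            return ctx->linux_aio;
        }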
      Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-id: 1467650000-51385-1-git-send-email-pbonzini@redhat.com
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
      
  6. 13 Jul, 2016 (1 commit)
    • coroutine: move entry argument to qemu_coroutine_create · 0b8b8753
      Authored by Paolo Bonzini
      In practice the entry argument is always known at creation time, and
      it is confusing that sometimes qemu_coroutine_enter is used with a
      non-NULL argument to re-enter a coroutine (this happens in
      block/sheepdog.c and tests/test-coroutine.c).  So pass the opaque value
      at creation time, for consistency with e.g. aio_bh_new.
      
      Mostly done with the following semantic patch:
      
      @ entry1 @
      expression entry, arg, co;
      @@
      - co = qemu_coroutine_create(entry);
      + co = qemu_coroutine_create(entry, arg);
        ...
      - qemu_coroutine_enter(co, arg);
      + qemu_coroutine_enter(co);
      
      @ entry2 @
      expression entry, arg;
      identifier co;
      @@
      - Coroutine *co = qemu_coroutine_create(entry);
      + Coroutine *co = qemu_coroutine_create(entry, arg);
        ...
      - qemu_coroutine_enter(co, arg);
      + qemu_coroutine_enter(co);
      
      @ entry3 @
      expression entry, arg;
      @@
      - qemu_coroutine_enter(qemu_coroutine_create(entry), arg);
      + qemu_coroutine_enter(qemu_coroutine_create(entry, arg));
      
      @ reentry @
      expression co;
      @@
      - qemu_coroutine_enter(co, NULL);
      + qemu_coroutine_enter(co);
      
      except for the aforementioned few places where the semantic patch
      stumbled (as expected) and for test_co_queue, which would otherwise
      produce an uninitialized variable warning.
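      A hedged before/after example of the call pattern the semantic patch
      rewrites (my_entry and my_opaque are hypothetical):

        /* Before: the opaque argument was passed when entering. */
        Coroutine *old_style = qemu_coroutine_create(my_entry);
        qemu_coroutine_enter(old_style, my_opaque);

        /* After: the opaque argument is bound at creation time. */
        Coroutine *new_style = qemu_coroutine_create(my_entry, my_opaque);
        qemu_coroutine_enter(new_style);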
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Fam Zheng <famz@redhat.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
  7. 05 Jul, 2016 (1 commit)
    • block: fix return code for partial write for Linux AIO · 1c42f149
      Authored by Denis V. Lunev
      A partial write most likely means that there is no space left, rather
      than that "something wrong happened".  Thus it is more natural to
      return ENOSPC rather than EINVAL.

      The problem actually shows up with the NBD server, which reported
      EINVAL rather than ENOSPC as the first error over its protocol,
      making the report to the user wrong.
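      A hedged sketch of the check in the completion path (laiocb->nbytes
      follows block/linux-aio.c naming; the surrounding code may differ):

        /* ret is the byte count, or a negative errno, from the io_event. */
        if (ret == laiocb->nbytes) {
            ret = 0;            /* full transfer */
        } else if (ret >= 0) {
            ret = -ENOSPC;      /* partial write: no space, not EINVAL */
        }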
      Signed-off-by: Denis V. Lunev <den@openvz.org>
      CC: Pavel Borzenkov <pborzenkov@virtuozzo.com>
      CC: Kevin Wolf <kwolf@redhat.com>
      CC: Max Reitz <mreitz@redhat.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
  8. 16 Jun, 2016 (3 commits)
    • linux-aio: Cancel BH if not needed · ccb9dc10
      Authored by Kevin Wolf
      linux-aio uses a BH to make sure that the remaining completions are
      processed even in nested event loops of completion callbacks, in
      order to avoid deadlocks.
      
      There is no need, however, to have the BH overhead for the first call
      into qemu_laio_completion_bh() or after all pending completions have
      already been processed. Therefore, this patch calls directly into
      qemu_laio_completion_bh() in qemu_laio_completion_cb() and cancels
      the BH after qemu_laio_completion_bh() has processed all pending
      completions.
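      A hedged sketch of the direct-call path; the names follow the commit
      message, while the LinuxAioState fields are assumptions based on
      block/linux-aio.c:

        /* eventfd callback: process completions directly rather than only
         * scheduling the BH. */
        static void qemu_laio_completion_cb(EventNotifier *e)
        {
            LinuxAioState *s = container_of(e, LinuxAioState, e);

            if (event_notifier_test_and_clear(&s->e)) {
                qemu_laio_completion_bh(s);
            }
        }

        /* ...and once qemu_laio_completion_bh() has drained everything: */
        qemu_bh_cancel(s->completion_bh);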
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
    • raw-posix: Implement .bdrv_co_preadv/pwritev · 9d52aa3c
      Authored by Kevin Wolf
      The raw-posix block driver now actually supports byte-aligned requests
      on non-O_DIRECT images, as it already (and previously incorrectly)
      claimed in bs->request_alignment.
      
      For some block drivers this means that a RMW cycle can be avoided when
      they write sub-sector metadata e.g. for cluster allocation.
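      For reference, a hedged sketch of the byte-based callbacks being
      implemented, with signatures as they looked in that era of the block
      layer:

        static int coroutine_fn raw_co_preadv(BlockDriverState *bs,
                                              uint64_t offset, uint64_t bytes,
                                              QEMUIOVector *qiov, int flags);
        static int coroutine_fn raw_co_pwritev(BlockDriverState *bs,
                                               uint64_t offset, uint64_t bytes,
                                               QEMUIOVector *qiov, int flags);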
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
      Reviewed-by: Eric Blake <eblake@redhat.com>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
    • raw-posix: Switch to bdrv_co_* interfaces · 2174f12b
      Authored by Kevin Wolf
      In order to use the modern byte-based .bdrv_co_preadv/pwritev()
      interface, this patch switches raw-posix to coroutine-based interfaces
      as a first step.  In terms of semantics and performance, it makes no
      difference to the existing code whether we go from a coroutine to a
      callback-based interface already in block/io.c or only in linux-aio.c.
      
      As there have been concerns in the past that this change may be a step
      in the wrong direction with respect to a possible AIO fast path, the
      old callback-based interface for linux-aio is left around and can be
      reactivated when a fast path (e.g. directly from virtio-blk dataplane,
      bypassing the whole block layer) is implemented.
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
      Reviewed-by: Eric Blake <eblake@redhat.com>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
  9. 12 May, 2016 (2 commits)
  10. 20 Jan, 2016 (1 commit)
  11. 24 Oct, 2015 (1 commit)
  12. 13 Dec, 2014 (5 commits)
  13. 20 Oct, 2014 (2 commits)
  14. 22 Sep, 2014 (2 commits)
  15. 29 Aug, 2014 (1 commit)
    • linux-aio: avoid deadlock in nested aio_poll() calls · 2cdff7f6
      Authored by Stefan Hajnoczi
      If two Linux AIO request completions are fetched in the same
      io_getevents() call, QEMU will deadlock if request A's callback waits
      for request B to complete using an aio_poll() loop.  This was reported
      to happen with the mirror blockjob.
      
      This patch moves completion processing into a BH and makes it resumable.
      Nested event loops can resume completion processing so that request B
      will complete and the deadlock will not occur.
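      A hedged sketch of the resumable BH; the event_idx/event_max cursor
      is an assumption about how the saved position could look, not
      necessarily QEMU's exact code:

        static void qemu_laio_completion_bh(void *opaque)
        {
            struct qemu_laio_state *s = opaque;

            /* Refill the event buffer once the previous batch is done. */
            if (s->event_idx == s->event_max) {
                s->event_max = io_getevents(s->ctx, 0, MAX_EVENTS,
                                            s->events, NULL);
                s->event_idx = 0;
            }

            /* The cursor lives in s, so a nested aio_poll() that re-runs
             * this BH resumes here instead of re-processing requests. */
            while (s->event_idx < s->event_max) {
                struct iocb *iocb = s->events[s->event_idx].obj;
                struct qemu_laiocb *laiocb =
                    container_of(iocb, struct qemu_laiocb, iocb);
                s->event_idx++;        /* advance before any recursion */
                qemu_laio_process_completion(s, laiocb);
            }
        }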
      
      Cc: Kevin Wolf <kwolf@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Marcin Gibuła <m.gibula@beyond.pl>
      Reported-by: Marcin Gibuła <m.gibula@beyond.pl>
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
      Tested-by: Marcin Gibuła <m.gibula@beyond.pl>
  16. 15 Jul, 2014 (1 commit)
  17. 07 Jul, 2014 (1 commit)
  18. 04 Jun, 2014 (2 commits)
  19. 19 Aug, 2013 (2 commits)
  20. 19 Dec, 2012 (2 commits)
  21. 15 Nov, 2012 (1 commit)
  22. 31 Oct, 2012 (2 commits)
  23. 30 Oct, 2012 (2 commits)