    aio: strengthen memory barriers for bottom half scheduling · e8d3b1a2
    Committed by Paolo Bonzini
    There are two problems with memory barriers in async.c.  The fix is
    to use atomic_xchg in order to achieve sequential consistency between
    the scheduling of a bottom half and the corresponding execution.
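    
    A rough sketch of that pattern (not the patch itself; C11
    atomic_exchange stands in here for QEMU's atomic_xchg, and the
    event-loop wakeup is stubbed out):
    
        #include <stdatomic.h>
        #include <stdbool.h>
    
        typedef struct QEMUBH {
            void (*cb)(void *opaque);
            void *opaque;
            atomic_int scheduled;       /* 0 = idle, 1 = scheduled */
        } QEMUBH;
    
        static void notify_event_loop(void) { /* stand-in for aio_notify() */ }
    
        /* Scheduling side: the exchange is sequentially consistent, so all
         * writes made before this call are visible to whoever later sees
         * scheduled == 1, even when the bottom half was already scheduled. */
        static void bh_schedule(QEMUBH *bh)
        {
            if (atomic_exchange(&bh->scheduled, 1) == 0) {
                notify_event_loop();    /* only the first scheduler notifies */
            }
        }
    
        /* Polling side: the exchange publishes scheduled = 0 before the
         * callback runs, so a concurrent bh_schedule during the callback
         * reads 0 and notifies again instead of losing the wakeup. */
        static bool bh_poll_one(QEMUBH *bh)
        {
            if (atomic_exchange(&bh->scheduled, 0) == 1) {
                bh->cb(bh->opaque);
                return true;
            }
            return false;
        }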
    
    First, if bh->scheduled is already 1 in qemu_bh_schedule, QEMU does
    not execute a memory barrier to order any writes needed by the callback
    before the read of bh->scheduled.  If the other side sees req->state as
    THREAD_ACTIVE, the callback is not invoked and you get a deadlock.
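    
    A simplified reconstruction of the pre-patch scheduling path (not the
    literal async.c code; the structure and the notify call are reduced to
    the minimum needed to show the early-return problem):
    
        #include <stdatomic.h>
    
        typedef struct OldBH {
            int scheduled;
            int idle;
            void (*cb)(void *opaque);
            void *opaque;
        } OldBH;
    
        static void notify_event_loop_old(void) { /* stand-in for aio_notify() */ }
    
        static void old_bh_schedule(OldBH *bh)
        {
            /* The caller has already stored e.g. req->state = THREAD_DONE. */
            if (bh->scheduled) {
                /* Plain load and early return: no barrier runs on this
                 * path, so nothing guarantees the caller's store above is
                 * visible to a callback that is executing right now; the
                 * callback can still read THREAD_ACTIVE, skip the request,
                 * and the request is never completed. */
                return;
            }
            bh->idle = 0;
            atomic_thread_fence(memory_order_seq_cst);  /* the old smp_mb() */
            bh->scheduled = 1;
            notify_event_loop_old();
        }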
    
    Second, the memory barrier in aio_bh_poll is too weak.  Without this
    patch, it is possible that bh->scheduled = 0 is not "published" until
    after the callback has returned.  Another thread wants to schedule the
    bottom half, but it sees bh->scheduled = 1 and does nothing.  This causes
    a lost wakeup.  The memory barrier should have been changed to smp_mb()
    in commit 924fe129 (aio: fix qemu_bh_schedule() bh->ctx race condition,
    2014-06-03) together with qemu_bh_schedule()'s.  Guess who reviewed
    that patch?
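    
    The polling side, again as a simplified reconstruction (reusing the
    OldBH type from the sketch above):
    
        #include <stdatomic.h>
        #include <stdbool.h>
    
        static bool old_bh_poll_one(OldBH *bh)
        {
            if (bh->scheduled) {
                bh->scheduled = 0;
                /* Pre-patch this was only a read barrier (smp_rmb).  It
                 * does not force the scheduled = 0 store to be published
                 * before the callback runs, so a thread that calls
                 * bh_schedule while the callback executes can still see
                 * scheduled == 1, skip the notification, and the wakeup
                 * is lost. */
                atomic_thread_fence(memory_order_acquire);
                bh->idle = 0;
                bh->cb(bh->opaque);
                return true;
            }
            return false;
        }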
    
    Both of these involve a store followed by a load, which is the one
    reordering that even x86's strong memory model permits, so they are
    reproducible on x86_64 as well.  It is however much easier on aarch64,
    where the
    libguestfs test suite triggers the bug fairly easily.  Even there the
    failure can go away or appear depending on compiler optimization level,
    tracing options, or even kernel debugging options.
    
    Paul Leveille however reported how to trigger the problem within 15
    minutes on x86_64 as well.  His (untested) recipe, reproduced here
    for reference, is the following:
    
       1) Qcow2 (or 3) is critical – raw files alone seem to avoid the problem.
    
       2) Use “cache=directsync” rather than the default of “cache=none”
       to make it happen more easily (see the sample drive option after
       this list).
    
       3) Use a server with a write-back RAID controller to allow for rapid
       IO rates.
    
       4) Run a random-access load that (mostly) writes chunks to various
       files on the virtual block device.
    
          a. I use ‘diskload.exe c:25’, a Microsoft HCT load
             generator, on Windows VMs.
    
          b. Iometer can probably be configured to generate a similar load.
    
       5) Run multiple VMs in parallel, against the same storage device,
       to shake the failure out sooner.
    
       6) IvyBridge and Haswell processors for certain; not sure about others.
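    
    For reference, steps 1 and 2 would correspond to a drive option along
    the lines of the following (image name hypothetical):
    
        -drive file=test.qcow2,format=qcow2,cache=directsync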
    
    A similar patch survived over 12 hours of testing, where an unpatched
    QEMU would fail within 15 minutes.
    
    This bug is, most likely, also the cause of failures in the libguestfs
    testsuite on AArch64.
    
    Thanks to Laszlo Ersek for initially reporting this bug, to Stefan
    Hajnoczi for suggesting closer examination of qemu_bh_schedule, and to
    Paul for providing test input and a prototype patch.
    Reported-by: Laszlo Ersek <lersek@redhat.com>
    Reported-by: Paul Leveille <Paul.Leveille@stratus.com>
    Reported-by: John Snow <jsnow@redhat.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    Message-id: 1428419779-26062-1-git-send-email-pbonzini@redhat.com
    Suggested-by: Paul Leveille <Paul.Leveille@stratus.com>
    Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>