1. 06 1月, 2018 11 次提交
    • P
      block, bfq: let a queue be merged only shortly after starting I/O · 7b8fa3b9
      Paolo Valente 提交于
      In BFQ and CFQ, two processes are said to be cooperating if they do
      I/O in such a way that the union of their I/O requests yields a
      sequential I/O pattern. To get such a sequential I/O pattern out of
      the non-sequential pattern of each cooperating process, BFQ and CFQ
      merge the queues associated with these processes. In more detail,
      cooperating processes, and thus their associated queues, usually
      start, or restart, to do I/O shortly after each other. This is the
      case, e.g., for the I/O threads of KVM/QEMU and of the dump
      utility. Basing on this assumption, this commit allows a bfq_queue to
      be merged only during a short time interval (100ms) after it starts,
      or re-starts, to do I/O.  This filtering provides two important
      benefits.
      
      First, it greatly reduces the probability that two non-cooperating
      processes have their queues merged by mistake, if they just happen to
      do I/O close to each other for a short time interval. These spurious
      merges cause loss of service guarantees. A low-weight bfq_queue may
      unjustly get more than its expected share of the throughput: if such a
      low-weight queue is merged with a high-weight queue, then the I/O for
      the low-weight queue is served as if the queue had a high weight. This
      may damage other high-weight queues unexpectedly.  For instance,
      because of this issue, lxterminal occasionally took 7.5 seconds to
      start, instead of 6.5 seconds, when some sequential readers and
      writers did I/O in the background on a FUJITSU MHX2300BT HDD.  The
      reason is that the bfq_queues associated with some of the readers or
      the writers were merged with the high-weight queues of some processes
      that had to do some urgent but little I/O. The readers then exploited
      the inherited high weight for all or most of their I/O, during the
      start-up of terminal. The filtering introduced by this commit
      eliminated any outlier caused by spurious queue merges in our start-up
      time tests.
      
      This filtering also provides a little boost of the throughput
      sustainable by BFQ: 3-4%, depending on the CPU. The reason is that,
      once a bfq_queue cannot be merged any longer, this commit makes BFQ
      stop updating the data needed to handle merging for the queue.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NAngelo Ruocco <angeloruocco90@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7b8fa3b9
    • A
      block, bfq: check low_latency flag in bfq_bfqq_save_state() · 1be6e8a9
      Angelo Ruocco 提交于
      A just-created bfq_queue will certainly be deemed as interactive on
      the arrival of its first I/O request, if the low_latency flag is
      set. Yet, if the queue is merged with another queue on the arrival of
      its first I/O request, it will not have the chance to be flagged as
      interactive. Nevertheless, if the queue is then split soon enough, it
      has to be flagged as interactive after the split.
      
      To handle this early-merge scenario correctly, BFQ saves the state of
      the queue, on the merge, as if the latter had already been deemed
      interactive. So, if the queue is split soon, it will get
      weight-raised, because the previous state of the queue is resumed on
      the split.
      
      Unfortunately, in the act of saving the state of the newly-created
      queue, BFQ doesn't check whether the low_latency flag is set, and this
      causes early-merged queues to be then weight-raised, on queue splits,
      even if low_latency is off. This commit addresses this problem by
      adding the missing check.
      Signed-off-by: NAngelo Ruocco <angeloruocco90@gmail.com>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      1be6e8a9
    • P
      block, bfq: add missing rq_pos_tree update on rq removal · 05e90283
      Paolo Valente 提交于
      If two processes do I/O close to each other, then BFQ merges the
      bfq_queues associated with these processes, to get a more sequential
      I/O, and thus a higher throughput.  In this respect, to detect whether
      two processes are doing I/O close to each other, BFQ keeps a list of
      the head-of-line I/O requests of all active bfq_queues.  The list is
      ordered by initial sectors, and implemented through a red-black tree
      (rq_pos_tree).
      
      Unfortunately, the update of the rq_pos_tree was incomplete, because
      the tree was not updated on the removal of the head-of-line I/O
      request of a bfq_queue, in case the queue did not remain empty. This
      commit adds the missing update.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NAngelo Ruocco <angeloruocco90@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      05e90283
    • P
      block, bfq: increase threshold to deem I/O as random · f0ba5ea2
      Paolo Valente 提交于
      If two processes do I/O close to each other, i.e., are cooperating
      processes in BFQ (and CFQ'S) nomenclature, then BFQ merges their
      associated bfq_queues, so as to get sequential I/O from the union of
      the I/O requests of the processes, and thus reach a higher
      throughput. A merged queue is then split if its I/O stops being
      sequential. In this respect, BFQ deems the I/O of a bfq_queue as
      (mostly) sequential only if less than 4 I/O requests are random, out
      of the last 32 requests inserted into the queue.
      
      Unfortunately, extensive testing (with the interleaved_io benchmark of
      the S suite [1], and with real applications spawning cooperating
      processes) has clearly shown that, with such a low threshold, only a
      rather low I/O throughput may be reached when several cooperating
      processes do I/O. In particular, the outcome of each test run was
      bimodal: if queue merging occurred and was stable during the test,
      then the throughput was close to the peak rate of the storage device,
      otherwise the throughput was arbitrarily low (usually around 1/10 of
      the peak rate with a rotational device). The probability to get the
      unlucky outcomes grew with the number of cooperating processes: it was
      already significant with 5 processes, and close to one with 7 or more
      processes.
      
      The cause of the low throughput in the unlucky runs was that the
      merged queues containing the I/O of these cooperating processes were
      soon split, because they contained more random I/O requests than those
      tolerated by the 4/32 threshold, but
      - that I/O would have however allowed the storage device to reach
        peak throughput or almost peak throughput;
      - in contrast, the I/O of these processes, if served individually
        (from separate queues) yielded a rather low throughput.
      
      So we repeated our tests with increasing values of the threshold,
      until we found the minimum value (19) for which we obtained maximum
      throughput, reliably, with at least up to 9 cooperating
      processes. Then we checked that the use of that higher threshold value
      did not cause any regression for any other benchmark in the suite [1].
      This commit raises the threshold to such a higher value.
      
      [1] https://github.com/Algodev-github/SSigned-off-by: NAngelo Ruocco <angeloruocco90@gmail.com>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f0ba5ea2
    • D
      deadline-iosched: Introduce zone locking support · 8dc8146f
      Damien Le Moal 提交于
      Introduce zone write locking to avoid write request reordering with
      zoned block devices. This is achieved using a finer selection of the
      next request to dispatch:
      1) Any non-write request is always allowed to proceed.
      2) Any write to a conventional zone is always allowed to proceed.
      3) For a write to a sequential zone, the zone lock is first checked.
         a) If the zone is not locked, the write is allowed to proceed after
            its target zone is locked.
         b) If the zone is locked, the write request is skipped and the next
            request in the dispatch queue tested (back to step 1).
      
      For a write request that has locked its target zone, the zone is
      unlocked either when the request completes and the method
      deadline_request_completed() is called, or when the request is requeued
      using the method deadline_add_request().
      
      Requests targeting a locked zone are always left in the scheduler queue
      to preserve the initial write order. If no write request can be
      dispatched, allow reads to be dispatched even if the write batch is not
      done.
      
      If the device used is not a zoned block device, or if zoned block device
      support is disabled, this patch does not modify deadline behavior.
      Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      8dc8146f
    • D
      deadline-iosched: Introduce dispatch helpers · c117bac7
      Damien Le Moal 提交于
      Avoid directly referencing the next_rq and fifo_list arrays using the
      helper functions deadline_next_request() and deadline_fifo_request() to
      facilitate changes in the dispatch request selection in
      deadline_dispatch_requests() for zoned block devices.
      
      While at it, also remove the unnecessary forward declaration of the
      function deadline_move_request().
      Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c117bac7
    • D
      mq-deadline: Introduce zone locking support · 5700f691
      Damien Le Moal 提交于
      Introduce zone write locking to avoid write request reordering with
      zoned block devices. This is achieved using a finer selection of the
      next request to dispatch:
      1) Any non-write request is always allowed to proceed.
      2) Any write to a conventional zone is always allowed to proceed.
      3) For a write to a sequential zone, the zone lock is first checked.
         a) If the zone is not locked, the write is allowed to proceed after
            its target zone is locked.
         b) If the zone is locked, the write request is skipped and the next
            request in the dispatch queue tested (back to step 1).
      
      For a write request that has locked its target zone, the zone is
      unlocked either when the request completes with a call to the method
      deadline_request_completed() or when the request is requeued using
      dd_insert_request().
      
      Requests targeting a locked zone are always left in the scheduler queue
      to preserve the lba ordering for write requests. If no write request
      can be dispatched, allow reads to be dispatched even if the write batch
      is not done.
      
      If the device used is not a zoned block device, or if zoned block device
      support is disabled, this patch does not modify mq-deadline behavior.
      Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5700f691
    • D
      mq-deadline: Introduce dispatch helpers · bf09ce56
      Damien Le Moal 提交于
      Avoid directly referencing the next_rq and fifo_list arrays using the
      helper functions deadline_next_request() and deadline_fifo_request() to
      facilitate changes in the dispatch request selection in
      __dd_dispatch_request() for zoned block devices.
      Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: NBart Van Assche <Bart.VanAssche@wdc.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      bf09ce56
    • C
      block: introduce zoned block devices zone write locking · 6cc77e9c
      Christoph Hellwig 提交于
      Components relying only on the request_queue structure for accessing
      block devices (e.g. I/O schedulers) have a limited knowledged of the
      device characteristics. In particular, the device capacity cannot be
      easily discovered, which for a zoned block device also result in the
      inability to easily know the number of zones of the device (the zone
      size is indicated by the chunk_sectors field of the queue limits).
      
      Introduce the nr_zones field to the request_queue structure to simplify
      access to this information. Also, add the bitmap seq_zone_bitmap which
      indicates which zones of the device are sequential zones (write
      preferred or write required) and the bitmap seq_zones_wlock which
      indicates if a zone is write locked, that is, if a write request
      targeting a zone was dispatched to the device. These fields are
      initialized by the low level block device driver (sd.c for ZBC/ZAC
      disks). They are not initialized by stacking drivers (device mappers)
      handling zoned block devices (e.g. dm-linear).
      
      Using this, I/O schedulers can introduce zone write locking to control
      request dispatching to a zoned block device and avoid write request
      reordering by limiting to at most a single write request per zone
      outside of the scheduler at any time.
      
      Based on previous patches from Damien Le Moal.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      [Damien]
      * Fixed comments and identation in blkdev.h
      * Changed helper functions
      * Fixed this commit message
      Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6cc77e9c
    • B
      pktcdvd: Fix a recently introduced NULL pointer dereference · 882d4171
      Bart Van Assche 提交于
      Call bdev_get_queue(bdev) after bdev->bd_disk has been initialized
      instead of just before that pointer has been initialized. This patch
      avoids that the following command
      
      pktsetup 1 /dev/sr0
      
      triggers the following kernel crash:
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000548
      IP: pkt_setup_dev+0x2db/0x670 [pktcdvd]
      CPU: 2 PID: 724 Comm: pktsetup Not tainted 4.15.0-rc4-dbg+ #1
      Call Trace:
       pkt_ctl_ioctl+0xce/0x1c0 [pktcdvd]
       do_vfs_ioctl+0x8e/0x670
       SyS_ioctl+0x3c/0x70
       entry_SYSCALL_64_fastpath+0x23/0x9a
      Reported-by: NMaciej S. Szmigiero <mail@maciej.szmigiero.name>
      Fixes: commit ca18d6f7 ("block: Make most scsi_req_init() calls implicit")
      Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
      Tested-by: NMaciej S. Szmigiero <mail@maciej.szmigiero.name>
      Cc: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
      Cc: <stable@vger.kernel.org> # v4.13
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      882d4171
    • B
      pktcdvd: Fix pkt_setup_dev() error path · 5a0ec388
      Bart Van Assche 提交于
      Commit 523e1d39 ("block: make gendisk hold a reference to its queue")
      modified add_disk() and disk_release() but did not update any of the
      error paths that trigger a put_disk() call after disk->queue has been
      assigned. That introduced the following behavior in the pktcdvd driver
      if pkt_new_dev() fails:
      
      Kernel BUG at 00000000e98fd882 [verbose debug info unavailable]
      
      Since disk_release() calls blk_put_queue() anyway if disk->queue != NULL,
      fix this by removing the blk_cleanup_queue() call from the pkt_setup_dev()
      error path.
      
      Fixes: commit 523e1d39 ("block: make gendisk hold a reference to its queue")
      Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
      Cc: <stable@vger.kernel.org> # v3.2
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5a0ec388
  2. 05 1月, 2018 26 次提交
  3. 23 12月, 2017 1 次提交
    • J
      blk-mq: improve heavily contended tag case · 4e5dff41
      Jens Axboe 提交于
      Even with a number of waitqueues, we can get into a situation where we
      are heavily contended on the waitqueue lock. I got a report on spc1
      where we're spending seconds doing this. Arguably the use case is nasty,
      I reproduce it with one device and 1000 threads banging on the device.
      But that doesn't mean we shouldn't be handling it better.
      
      What ends up happening is that a thread will fail to get a tag, add
      itself to the waitqueue, and subsequently get woken up when a tag is
      freed - only to find itself going back to sleep on the waitqueue.
      
      Instead of waking all threads, use an exclusive wait and wake up our
      sbitmap batch count instead. This seems to work well for me (massive
      improvement for this use case), and it survives basic testing. But I
      haven't fully verified it yet.
      
      An additional improvement is running the queue and checking for a new
      tag BEFORE needing to add ourselves to the waitqueue.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      4e5dff41
  4. 18 12月, 2017 2 次提交