1. 15 April 2017: 4 commits
    • blk-mq: introduce Kyber multiqueue I/O scheduler · 00e04393
      Omar Sandoval committed
      The Kyber I/O scheduler is an I/O scheduler for fast devices designed to
      scale to multiple queues. Users configure only two knobs, the target
      read and synchronous write latencies, and the scheduler tunes itself to
      achieve that latency goal.
      
      The implementation is based on "tokens", built on top of the scalable
      bitmap library. Tokens serve as a mechanism for limiting requests. There
      are two tiers of tokens: queueing tokens and dispatch tokens.
      
      A queueing token is required to allocate a request. In fact, these
      tokens are actually the blk-mq internal scheduler tags, but the
      scheduler manages the allocation directly in order to implement its
      policy.
      
      Dispatch tokens are device-wide and split up into two scheduling
      domains: reads vs. writes. Each hardware queue dispatches batches
      round-robin between the scheduling domains as long as tokens are
      available for that domain.
      
      These tokens can be used as the mechanism to enable various policies.
      The policy Kyber uses is inspired by active queue management techniques
      for network routing, similar to blk-wbt. The scheduler monitors
      latencies and scales the number of dispatch tokens accordingly. Queueing
      tokens are used to prevent starvation of synchronous requests by
      asynchronous requests.
      
      Various extensions are possible, including better heuristics and ionice
      support. The new scheduler isn't set as the default yet.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
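      As a rough illustration of the dispatch-token tier described above (not the kernel implementation, which builds its tokens on the scalable bitmap library and resizes them from latency measurements), the sketch below models device-wide dispatch tokens per scheduling domain and the round-robin dispatch between domains. All names and pool sizes are made up.

        /* Toy model of Kyber-style dispatch tokens; illustrative only. */
        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdio.h>

        enum domain { DOM_READ, DOM_WRITE, DOM_COUNT };

        /* One device-wide dispatch-token pool per scheduling domain. */
        static atomic_int tokens[DOM_COUNT] = { 4, 2 };

        static bool get_dispatch_token(int d)
        {
            if (atomic_fetch_sub(&tokens[d], 1) > 0)
                return true;
            atomic_fetch_add(&tokens[d], 1);   /* none left: undo and back off */
            return false;
        }

        static void put_dispatch_token(int d)
        {
            /* Called on completion; this is where latency would be sampled
             * to grow or shrink the pool. */
            atomic_fetch_add(&tokens[d], 1);
        }

        int main(void)
        {
            /* A hardware queue dispatches round-robin across the domains as
             * long as the domain still has tokens available. */
            for (int round = 0; round < 3; round++) {
                for (int d = 0; d < DOM_COUNT; d++) {
                    if (get_dispatch_token(d)) {
                        printf("round %d: dispatch from domain %d\n", round, d);
                        put_dispatch_token(d);   /* completion */
                    }
                }
            }
            return 0;
        }

      The queueing-token tier (the blk-mq scheduler tags used to keep async requests from starving sync ones) is not modelled here.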
    • blk-mq-sched: make completed_request() callback more useful · c05f8525
      Omar Sandoval committed
      Currently, this callback is called right after put_request() and has no
      distinguishable purpose. Instead, let's call it before put_request() as
      soon as I/O has completed on the request, before we account it in
      blk-stat. With this, Kyber can enable stats when it sees a latency
      outlier and make sure the outlier gets accounted.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
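      A hedged sketch of the ordering change: the scheduler's completion hook now runs while the request is still live, before it is accounted and put. The structures and function names below are illustrative, not the blk-mq ones.

        #include <stdio.h>

        struct request { unsigned long duration_ns; };

        struct sched_ops {
            void (*completed_request)(struct request *rq);
        };

        static void account_and_put(struct request *rq)
        {
            printf("accounted %lu ns, request freed\n", rq->duration_ns);
        }

        /* New ordering: the scheduler hook sees the still-live request before
         * it is accounted and put, so a scheduler can, for example, switch on
         * stats collection as soon as it spots a latency outlier. */
        static void io_completed(const struct sched_ops *ops, struct request *rq)
        {
            if (ops && ops->completed_request)
                ops->completed_request(rq);
            account_and_put(rq);   /* previously the hook ran after this point */
        }

        static void kyberish_hook(struct request *rq)   /* hypothetical hook */
        {
            if (rq->duration_ns > 1000000)
                printf("latency outlier seen, enable stats\n");
        }

        int main(void)
        {
            struct sched_ops ops = { .completed_request = kyberish_hook };
            struct request rq = { .duration_ns = 5000000 };

            io_completed(&ops, &rq);
            return 0;
        }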
    • blk-mq: export helpers · 5b727272
      Omar Sandoval committed
      blk_mq_finish_request() is required for schedulers that define their own
      put_request(). blk_mq_run_hw_queue() is required for schedulers that
      hold back requests to be run later.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
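      A minimal sketch of why a scheduler needs these two entry points, using illustrative stand-in names rather than the real blk_mq_finish_request() and blk_mq_run_hw_queue(): a scheduler-private put_request still has to hand the request back to the core, and a scheduler that held requests back needs a way to restart dispatch later.

        #include <stdbool.h>
        #include <stdio.h>

        struct request  { int tag; };
        struct hw_queue { bool throttled; };

        /* Stand-ins for the exported core helpers; names are illustrative. */
        static void core_finish_request(struct request *rq)
        {
            printf("request %d handed back to the core\n", rq->tag);
        }

        static void core_run_hw_queue(struct hw_queue *hctx)
        {
            printf("hardware queue kicked (throttled=%d)\n", hctx->throttled);
        }

        /* A scheduler-private put_request must still release core resources. */
        static void sched_put_request(struct request *rq)
        {
            /* ...scheduler bookkeeping: return tokens, update stats... */
            core_finish_request(rq);
        }

        /* A scheduler that held requests back must be able to run them later. */
        static void sched_unthrottle(struct hw_queue *hctx)
        {
            hctx->throttled = false;
            core_run_hw_queue(hctx);
        }

        int main(void)
        {
            struct request rq = { .tag = 7 };
            struct hw_queue hctx = { .throttled = true };

            sched_put_request(&rq);
            sched_unthrottle(&hctx);
            return 0;
        }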
    • blk-mq: add shallow depth option for blk_mq_get_tag() · 229a9287
      Omar Sandoval committed
      Wire up the sbitmap_get_shallow() operation to the tag code so that a
      caller can limit the number of tags available to it.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
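      A toy model of the shallow-depth idea (the kernel wires this up through sbitmap_get_shallow(); the plain array below is just for illustration): a caller may only allocate from the first shallow_depth tags, so another class of requests always has headroom.

        #include <stdbool.h>
        #include <stdio.h>

        #define NR_TAGS 64

        static bool tag_used[NR_TAGS];

        static int get_tag_shallow(unsigned int shallow_depth)
        {
            unsigned int limit = shallow_depth < NR_TAGS ? shallow_depth : NR_TAGS;

            for (unsigned int i = 0; i < limit; i++) {
                if (!tag_used[i]) {
                    tag_used[i] = true;
                    return (int)i;
                }
            }
            return -1;   /* no tag within the shallow limit; caller must wait */
        }

        int main(void)
        {
            /* Example: one request class restricted to a quarter of the tags. */
            for (int i = 0; i < 20; i++)
                if (get_tag_shallow(NR_TAGS / 4) < 0)
                    printf("allocation %d blocked by shallow depth\n", i);
            return 0;
        }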
  2. 11 April 2017: 3 commits
  3. 09 April 2017: 7 commits
  4. 08 April 2017: 8 commits
  5. 07 April 2017: 6 commits
    • block: trace completion of all bios. · fbbaf700
      NeilBrown committed
      Currently only dm and md/raid5 bios trigger
      trace_block_bio_complete().  Now that we have bio_chain() and
      bio_inc_remaining(), it is not possible, in general, for a driver to
      know when the bio is really complete.  Only bio_endio() knows that.
      
      So move the trace_block_bio_complete() call to bio_endio().
      
      Now trace_block_bio_complete() pairs with trace_block_bio_queue().
      Any bio for which a 'queue' event is traced will subsequently
      generate a 'complete' event.
      
      There are a few cases where completion tracing is not wanted.
      1/ If blk_update_request() has already generated a completion
         trace event at the 'request' level, there is no point generating
         one at the bio level too.  In this case the bi_sector and bi_size
      will have changed, so the bio level event would be wrong.
      
      2/ If the bio hasn't actually been queued yet, but is being aborted
         early, then a trace event could be confusing.  Some filesystems
         call bio_endio() but do not want tracing.
      
      3/ The bio_integrity code interposes itself by replacing bi_end_io,
         then restoring it and calling bio_endio() again.  This would produce
         two identical trace events if left like that.
      
      To handle these, we introduce a flag BIO_TRACE_COMPLETION and only
      produce the trace event when this is set.
      We address point 1 above by clearing the flag in blk_update_request().
      We address point 2 above by only setting the flag when
      generic_make_request() is called.
      We address point 3 above by clearing the flag after generating a
      completion event.
      
      When bio_split() is used on a bio, particularly in blk_queue_split(),
      there is an extra complication.  A new bio is split off the front, and
      may be handled directly without going through generic_make_request().
      The old bio, which has been advanced, is passed to
      generic_make_request(), so it will trigger a trace event a second
      time.
      Probably the best result when a split happens is to see a single
      'queue' event for the whole bio, then multiple 'complete' events - one
      for each component.  To achieve this we can:
      - copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split()
      - avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set.
      This way, the split-off bio won't create a queue event, the original
      won't either, even if it is re-submitted to generic_make_request(),
      but both will produce completion events, each for their own range.
      
      So if generic_make_request() is called (which generates a QUEUED
      event), then bio_endio() will create a single COMPLETE event for each
      range that the bio is split into, unless the driver has explicitly
      requested it not to.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
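      The sketch below models the flag lifecycle described above with illustrative structures (not the kernel's bio): the flag is set on real submission, suppresses a second 'queue' event for a re-submitted remainder, is copied to the split-off bio, and is cleared after the 'complete' event so an interposed completion handler cannot emit a duplicate.

        #include <stdbool.h>
        #include <stdio.h>

        struct bio {
            unsigned long sector, size;
            bool trace_completion;   /* stands in for BIO_TRACE_COMPLETION */
        };

        static void submit(struct bio *bio)
        {
            /* Only flag (and trace 'queue' for) bios that are really submitted,
             * and never twice: a re-submitted remainder of a split already
             * carries the flag, so it does not get a second 'queue' event. */
            if (!bio->trace_completion) {
                printf("trace: queue    sector=%lu size=%lu\n", bio->sector, bio->size);
                bio->trace_completion = true;
            }
        }

        static void split(struct bio *whole, struct bio *front, unsigned long bytes)
        {
            /* The split-off front inherits the flag so each part still gets
             * its own 'complete' event covering its own range. */
            front->sector = whole->sector;
            front->size = bytes;
            front->trace_completion = whole->trace_completion;
            whole->sector += bytes >> 9;   /* advance the remainder */
            whole->size -= bytes;
        }

        static void end_io(struct bio *bio)
        {
            /* Clear the flag after emitting so an interposed end_io handler
             * that calls the completion path again cannot emit a duplicate. */
            if (bio->trace_completion) {
                printf("trace: complete sector=%lu size=%lu\n", bio->sector, bio->size);
                bio->trace_completion = false;
            }
        }

        int main(void)
        {
            struct bio whole = { .sector = 2048, .size = 8192 }, front;

            submit(&whole);               /* one 'queue' event for the whole bio */
            split(&whole, &front, 4096);
            submit(&whole);               /* re-submission: no second 'queue' event */
            end_io(&front);               /* 'complete' for the first range */
            end_io(&whole);               /* 'complete' for the remainder */
            return 0;
        }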
    • blk-mq: remap queues when adding/removing hardware queues · ebe8bddb
      Omar Sandoval committed
      blk_mq_update_nr_hw_queues() used to remap hardware queues, which is the
      behavior that drivers expect. However, commit 4e68a011 changed
      blk_mq_queue_reinit() to not remap queues for the case of CPU
      hotplugging, inadvertently making blk_mq_update_nr_hw_queues() not remap
      queues as well. This breaks, for example, NBD's multi-connection mode,
      leaving the added hardware queues unused. Fix it by making
      blk_mq_update_nr_hw_queues() explicitly remap the queues.
      
      Fixes: 4e68a011 ("blk-mq: don't redistribute hardware queues on a CPU hotplug event")
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
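      A toy illustration of why the remap matters, assuming a simple round-robin CPU-to-queue spread (the real mapping is more involved): if the map is not rebuilt after the number of hardware queues changes, CPUs keep pointing at the old queues and the added ones sit unused.

        #include <stdio.h>

        #define NR_CPUS 8

        static int cpu_to_queue[NR_CPUS];

        static void remap_queues(int nr_hw_queues)
        {
            for (int cpu = 0; cpu < NR_CPUS; cpu++)
                cpu_to_queue[cpu] = cpu % nr_hw_queues;   /* simple spread */
        }

        int main(void)
        {
            remap_queues(2);   /* initial setup: 2 hardware queues */
            remap_queues(4);   /* e.g. NBD multi-connection adds queues... */
            for (int cpu = 0; cpu < NR_CPUS; cpu++)   /* ...and CPUs now reach all 4 */
                printf("cpu %d -> hw queue %d\n", cpu, cpu_to_queue[cpu]);
            return 0;
        }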
    • blk-mq-sched: fix crash in switch error path · 54d5329d
      Omar Sandoval committed
      In elevator_switch(), if blk_mq_init_sched() fails, we attempt to fall
      back to the original scheduler. However, at this point, we've already
      torn down the original scheduler's tags, so this causes a crash. Doing
      the fallback the way the legacy elevator path does is much harder for
      mq, so fix it by just falling back to none instead.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq-sched: set up scheduler tags when bringing up new queues · 93252632
      Omar Sandoval committed
      If a new hardware queue is added at runtime, we don't allocate scheduler
      tags for it, leading to a crash. This hooks up the scheduler framework
      to blk_mq_{init,exit}_hctx() to make sure everything gets properly
      initialized/freed.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
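      A minimal sketch of the pairing this fix establishes, with illustrative names only: when a hardware queue is initialized at runtime, its per-queue scheduler tag set is allocated along with it, and freed again on exit.

        #include <stdlib.h>

        struct sched_tags { unsigned int depth; };

        struct hw_queue {
            struct sched_tags *sched_tags;
        };

        static int hctx_init(struct hw_queue *hctx, unsigned int depth)
        {
            hctx->sched_tags = malloc(sizeof(*hctx->sched_tags));
            if (!hctx->sched_tags)
                return -1;   /* without this allocation, dispatch would crash */
            hctx->sched_tags->depth = depth;
            return 0;
        }

        static void hctx_exit(struct hw_queue *hctx)
        {
            free(hctx->sched_tags);
            hctx->sched_tags = NULL;
        }

        int main(void)
        {
            struct hw_queue hctx = { 0 };

            if (hctx_init(&hctx, 64) == 0)   /* e.g. a queue added at runtime */
                hctx_exit(&hctx);
            return 0;
        }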
    • blk-mq-sched: refactor scheduler initialization · 6917ff0b
      Omar Sandoval committed
      Preparation cleanup for the next couple of fixes: push
      blk_mq_sched_setup() and e->ops.mq.init_sched() into a helper.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: use the right hctx when getting a driver tag fails · 81380ca1
      Omar Sandoval committed
      While dispatching requests, if we fail to get a driver tag, we mark the
      hardware queue as waiting for a tag and put the requests on a
      hctx->dispatch list to be run later when a driver tag is freed. However,
      blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
      queues if using a single-queue scheduler with a multiqueue device. If
      blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
      are processing. This means we end up using the hardware queue of the
      previous request, which may or may not be the same as that of the
      current request. If it isn't, the wrong hardware queue will end up
      waiting for a tag, and the requests will be on the wrong dispatch list,
      leading to a hang.
      
      The fix is twofold:
      
      1. Make sure we save which hardware queue we were trying to get a
         request for in blk_mq_get_driver_tag() regardless of whether it
         succeeds or not.
      2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
         blk_mq_hw_queue to make it clear that it must handle multiple
         hardware queues, since I've already messed this up on a couple of
         occasions.
      
      This didn't appear in testing with nvme and mq-deadline because nvme has
      more driver tags than the default number of scheduler tags. However,
      with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
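      A hedged sketch of the first half of the fix, with illustrative types rather than the blk-mq ones: the tag-get helper records which hardware queue it was working on whether or not it succeeds, so the dispatch loop marks the correct queue as waiting.

        #include <stdbool.h>
        #include <stddef.h>

        struct hw_queue { bool wait_for_tag; };
        struct request  { struct hw_queue *hctx; };

        static bool tags_available;   /* stand-in for the device's driver-tag pool */

        static bool get_driver_tag(struct request *rq, struct hw_queue **hctx_out)
        {
            *hctx_out = rq->hctx;   /* updated on success AND on failure */
            return tags_available;
        }

        static void dispatch_list(struct request *reqs, size_t n)
        {
            struct hw_queue *hctx = NULL;

            for (size_t i = 0; i < n; i++) {
                if (!get_driver_tag(&reqs[i], &hctx)) {
                    /* hctx is the queue of the *current* request, not of
                     * whichever request happened to come before it. */
                    hctx->wait_for_tag = true;
                    break;
                }
                /* ...issue reqs[i] to the driver... */
            }
        }

        int main(void)
        {
            struct hw_queue q0 = { false }, q1 = { false };
            struct request reqs[2] = { { .hctx = &q0 }, { .hctx = &q1 } };

            tags_available = false;   /* simulate the driver running out of tags */
            dispatch_list(reqs, 2);
            /* q0, the queue of the request that failed, is now waiting. */
            return 0;
        }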
  6. 06 April 2017: 1 commit
  7. 05 April 2017: 3 commits
    • blk-mq: Remove blk_mq_queue_data.list · f2fbc9dd
      Bart Van Assche committed
      The block layer core sets blk_mq_queue_data.list but no block
      drivers read that member. Hence remove it and also the code that
      is used to set this member.
      Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • cfq: Disable writeback throttling by default · 142bbdfc
      Jan Kara committed
      Writeback throttling does not play well with CFQ since that also tries
      to throttle async writes. As a result, async writeback can get starved in
      the presence of readers. As an example, take a benchmark simulating a
      PostgreSQL database running over a standard rotating SATA drive. There
      are 16 processes doing random reads from a huge file (2*machine memory),
      1 process doing random writes to the huge file and calling fsync once
      per 50000 writes and 1 process doing sequential 8k writes to a
      relatively small file wrapping around at the end of the file and calling
      fsync every 5 writes. Under this load read latency easily exceeds the
      target latency of 75 ms (just because there are so many reads happening
      against a relatively slow disk) and thus writeback is throttled to a
      point where only 1 write request is allowed at a time. Blktrace data
      then looks like:
      
        8,0    1        0     8.347751764     0  m   N cfq workload slice:40000000
        8,0    1        0     8.347755256     0  m   N cfq293A  / set_active wl_class: 0 wl_type:0
        8,0    1        0     8.347784100     0  m   N cfq293A  / Not idling.  st->count:1
        8,0    1     3814     8.347763916  5839 UT   N [kworker/u9:2] 1
        8,0    0        0     8.347777605     0  m   N cfq293A  / Not idling.  st->count:1
        8,0    1        0     8.347784100     0  m   N cfq293A  / Not idling.  st->count:1
        8,0    3     1596     8.354364057     0  C   R 156109528 + 8 (6906954) [0]
        8,0    3        0     8.354383193     0  m   N cfq6196SN / complete rqnoidle 0
        8,0    3        0     8.354386476     0  m   N cfq schedule dispatch
        8,0    3        0     8.354399397     0  m   N cfq293A  / Not idling.  st->count:1
        8,0    3        0     8.354404705     0  m   N cfq293A  / dispatch_insert
        8,0    3        0     8.354409454     0  m   N cfq293A  / dispatched a request
        8,0    3        0     8.354412527     0  m   N cfq293A  / activate rq, drv=1
        8,0    3     1597     8.354414692     0  D   W 145961400 + 24 (6718452) [swapper/0]
        8,0    3        0     8.354484184     0  m   N cfq293A  / Not idling.  st->count:1
        8,0    3        0     8.354487536     0  m   N cfq293A  / slice expired t=0
        8,0    3        0     8.354498013     0  m   N / served: vt=5888102466265088 min_vt=5888074869387264
        8,0    3        0     8.354502692     0  m   N cfq293A  / sl_used=6737519 disp=1 charge=6737519 iops=0 sect=24
        8,0    3        0     8.354505695     0  m   N cfq293A  / del_from_rr
      ...
        8,0    0     1810     8.354728768     0  C   W 145961400 + 24 (314076) [0]
        8,0    0        0     8.354746927     0  m   N cfq293A  / complete rqnoidle 0
      ...
        8,0    1     3829     8.389886102  5839  G   W 145962968 + 24 [kworker/u9:2]
        8,0    1     3830     8.389888127  5839  P   N [kworker/u9:2]
        8,0    1     3831     8.389908102  5839  A   W 145978336 + 24 <- (8,4) 44000
        8,0    1     3832     8.389910477  5839  Q   W 145978336 + 24 [kworker/u9:2]
        8,0    1     3833     8.389914248  5839  I   W 145962968 + 24 (28146) [kworker/u9:2]
        8,0    1        0     8.389919137     0  m   N cfq293A  / insert_request
        8,0    1        0     8.389924305     0  m   N cfq293A  / add_to_rr
        8,0    1     3834     8.389933175  5839 UT   N [kworker/u9:2] 1
      ...
        8,0    0        0     9.455290997     0  m   N cfq workload slice:40000000
        8,0    0        0     9.455294769     0  m   N cfq293A  / set_active wl_class:0 wl_type:0
        8,0    0        0     9.455303499     0  m   N cfq293A  / fifo=ffff880003166090
        8,0    0        0     9.455306851     0  m   N cfq293A  / dispatch_insert
        8,0    0        0     9.455311251     0  m   N cfq293A  / dispatched a request
        8,0    0        0     9.455314324     0  m   N cfq293A  / activate rq, drv=1
        8,0    0     2043     9.455316210  6204  D   W 145962968 + 24 (1065401962) [pgioperf]
        8,0    0        0     9.455392407     0  m   N cfq293A  / Not idling.  st->count:1
        8,0    0        0     9.455395969     0  m   N cfq293A  / slice expired t=0
        8,0    0        0     9.455404210     0  m   N / served: vt=5888958194597888 min_vt=5888941810597888
        8,0    0        0     9.455410077     0  m   N cfq293A  / sl_used=4000000 disp=1 charge=4000000 iops=0 sect=24
        8,0    0        0     9.455416851     0  m   N cfq293A  / del_from_rr
      ...
        8,0    0     2045     9.455648515     0  C   W 145962968 + 24 (332305) [0]
        8,0    0        0     9.455668350     0  m   N cfq293A  / complete rqnoidle 0
      ...
        8,0    1     4371     9.455710115  5839  G   W 145978336 + 24 [kworker/u9:2]
        8,0    1     4372     9.455712350  5839  P   N [kworker/u9:2]
        8,0    1     4373     9.455730159  5839  A   W 145986616 + 24 <- (8,4) 52280
        8,0    1     4374     9.455732674  5839  Q   W 145986616 + 24 [kworker/u9:2]
        8,0    1     4375     9.455737563  5839  I   W 145978336 + 24 (27448) [kworker/u9:2]
        8,0    1        0     9.455742871     0  m   N cfq293A  / insert_request
        8,0    1        0     9.455747550     0  m   N cfq293A  / add_to_rr
        8,0    1     4376     9.455756629  5839 UT   N [kworker/u9:2] 1
      
      So we can see a Q event for a write request, then IO is blocked by
      writeback throttling and the G and I events for the request happen only
      once other writeback IO has completed. Thus CFQ always sees only one
      write request. When it sees it, it queues the async queue behind all the
      read queues and the async queue gets scheduled after about one second.
      When it is scheduled, that one request gets dispatched and the async
      queue is expired as it has no more requests to submit. Overall we submit
      about one write request per second.
      
      Although this scheduling is beneficial for read latency, writes are
      heavily starved and this causes large delays all over the system (due to
      processes blocking on page lock, transaction starts, etc.). When
      writeback throttling is disabled, write throughput is about one fifth of
      the read throughput, which roughly matches the readers/writers ratio, and
      overall the system stalls are much shorter.
      
      Mixing writeback throttling logic with CFQ throttling logic is always a
      recipe for surprises, as CFQ assumes it sees the big picture, which is
      not necessarily true when writeback throttling is blocking requests. So
      disable writeback throttling logic by default when CFQ is used as the IO
      scheduler.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
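      The resulting policy can be summarized by the sketch below (illustrative code, not the kernel's): writeback throttling stays off by default when CFQ is the chosen elevator, so two write throttlers are not stacked on top of each other.

        #include <stdbool.h>
        #include <stdio.h>
        #include <string.h>

        /* Illustrative default-policy check. */
        static bool wbt_default_enabled(const char *elevator)
        {
            /* CFQ already throttles async writes itself; don't stack a
             * second write throttler on top of it by default. */
            return strcmp(elevator, "cfq") != 0;
        }

        int main(void)
        {
            printf("wbt default with cfq:      %d\n", wbt_default_enabled("cfq"));
            printf("wbt default with deadline: %d\n", wbt_default_enabled("deadline"));
            return 0;
        }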
    • block: fix inheriting request priority from bio · 85003a44
      Adam Manzanares committed
      In 4.10 I introduced a patch that associates the ioc priority with
      each request in the block layer. This work was done in the single queue
      block layer code. This patch unifies the ioc-priority-to-request mapping
      across the single-queue and multi-queue block layers.
      
      I have tested this patch with the null block device driver with the following
      parameters.
      
      null_blk queue_mode=2 irqmode=0 use_per_node_hctx=1 nr_devices=1
      
      I have not seen a performance regression with this patch and I would appreciate
      any feedback or additional testing.
      
      I have also verified that io priorities are passed to the device when using
      the SQ and MQ path to a SATA HDD that supports io priorities.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
      Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
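      A hedged sketch of the unified mapping, with illustrative structures and names: the request inherits its I/O priority from the submitter's io context when it is set up, and the same helper would be used on both the single-queue and multi-queue submission paths.

        #include <stdio.h>

        struct io_context { unsigned short ioprio; };
        struct request    { unsigned short ioprio; };

        /* One helper shared by both submission paths (illustrative names). */
        static void init_request_prio(struct request *rq,
                                      const struct io_context *ioc)
        {
            rq->ioprio = ioc ? ioc->ioprio : 0;   /* 0: no explicit priority */
        }

        int main(void)
        {
            struct io_context ioc = { .ioprio = 3 };
            struct request sq_rq, mq_rq;

            init_request_prio(&sq_rq, &ioc);   /* legacy (single-queue) path */
            init_request_prio(&mq_rq, &ioc);   /* blk-mq (multi-queue) path  */
            printf("sq=%u mq=%u\n", sq_rq.ioprio, mq_rq.ioprio);
            return 0;
        }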
  8. 31 March 2017: 1 commit
  9. 30 March 2017: 7 commits