1. 08 4月, 2017 6 次提交
    • O
      blk-mq-sched: provide hooks for initializing hardware queue data · ee056f98
      Omar Sandoval 提交于
      Schedulers need to be informed when a hardware queue is added or removed
      at runtime so they can allocate/free per-hardware queue data. So,
      replace the blk_mq_sched_init_hctx_data() helper, which only makes sense
      at init time, with .init_hctx() and .exit_hctx() hooks.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      ee056f98
    • J
      Merge branch 'for-linus' into for-4.12/block · 65f619d2
      Jens Axboe 提交于
      We've added a considerable amount of fixes for stalls and issues
      with the blk-mq scheduling in the 4.11 series since forking
      off the for-4.12/block branch. We need to do improvements on
      top of that for 4.12, so pull in the previous fixes to make
      our lives easier going forward.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      65f619d2
    • B
      blk-mq: Restart a single queue if tag sets are shared · 6d8c6c0f
      Bart Van Assche 提交于
      To improve scalability, if hardware queues are shared, restart
      a single hardware queue in round-robin fashion. Rename
      blk_mq_sched_restart_queues() to reflect the new semantics.
      Remove blk_mq_sched_mark_restart_queue() because this function
      has no callers. Remove flag QUEUE_FLAG_RESTART because this
      patch removes the code that uses this flag.
      Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      6d8c6c0f
    • B
      dm rq: Avoid that request processing stalls sporadically · 6077c2d7
      Bart Van Assche 提交于
      While running the srp-test software I noticed that request
      processing stalls sporadically at the beginning of a test, namely
      when mkfs is run against a dm-mpath device. Every time when that
      happened the following command was sufficient to resume request
      processing:
      
          echo run >/sys/kernel/debug/block/dm-0/state
      
      This patch avoids that such request processing stalls occur. The
      test I ran is as follows:
      
          while srp-test/run_tests -d -r 30 -t 02-mq; do :; done
      Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Signed-off-by: NJens Axboe <axboe@fb.com>
      6077c2d7
    • B
      scsi: Avoid that SCSI queues get stuck · 36e3cf27
      Bart Van Assche 提交于
      If a .queue_rq() function returns BLK_MQ_RQ_QUEUE_BUSY then the block
      driver that implements that function is responsible for rerunning the
      hardware queue once requests can be queued again successfully.
      
      commit 52d7f1b5 ("blk-mq: Avoid that requeueing starts stopped
      queues") removed the blk_mq_stop_hw_queue() call from scsi_queue_rq()
      for the BLK_MQ_RQ_QUEUE_BUSY case. Hence change all calls to functions
      that are intended to rerun a busy queue such that these examine all
      hardware queues instead of only stopped queues.
      
      Since no other functions than scsi_internal_device_block() and
      scsi_internal_device_unblock() should ever stop or restart a SCSI
      queue, change the blk_mq_delay_queue() call into a
      blk_mq_delay_run_hw_queue() call.
      
      Fixes: commit 52d7f1b5 ("blk-mq: Avoid that requeueing starts stopped queues")
      Fixes: commit 7e79dadc ("blk-mq: stop hardware queue in blk_mq_delay_queue()")
      Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Long Li <longli@microsoft.com>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      36e3cf27
    • B
      blk-mq: Introduce blk_mq_delay_run_hw_queue() · 7587a5ae
      Bart Van Assche 提交于
      Introduce a function that runs a hardware queue unconditionally
      after a delay. Note: there is already a function that stops and
      restarts a hardware queue after a delay, namely blk_mq_delay_queue().
      
      This function will be used in the next patch in this series.
      Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Long Li <longli@microsoft.com>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      7587a5ae
  2. 07 4月, 2017 7 次提交
    • N
      block: trace completion of all bios. · fbbaf700
      NeilBrown 提交于
      Currently only dm and md/raid5 bios trigger
      trace_block_bio_complete().  Now that we have bio_chain() and
      bio_inc_remaining(), it is not possible, in general, for a driver to
      know when the bio is really complete.  Only bio_endio() knows that.
      
      So move the trace_block_bio_complete() call to bio_endio().
      
      Now trace_block_bio_complete() pairs with trace_block_bio_queue().
      Any bio for which a 'queue' event is traced, will subsequently
      generate a 'complete' event.
      
      There are a few cases where completion tracing is not wanted.
      1/ If blk_update_request() has already generated a completion
         trace event at the 'request' level, there is no point generating
         one at the bio level too.  In this case the bi_sector and bi_size
         will have changed, so the bio level event would be wrong
      
      2/ If the bio hasn't actually been queued yet, but is being aborted
         early, then a trace event could be confusing.  Some filesystems
         call bio_endio() but do not want tracing.
      
      3/ The bio_integrity code interposes itself by replacing bi_end_io,
         then restoring it and calling bio_endio() again.  This would produce
         two identical trace events if left like that.
      
      To handle these, we introduce a flag BIO_TRACE_COMPLETION and only
      produce the trace event when this is set.
      We address point 1 above by clearing the flag in blk_update_request().
      We address point 2 above by only setting the flag when
      generic_make_request() is called.
      We address point 3 above by clearing the flag after generating a
      completion event.
      
      When bio_split() is used on a bio, particularly in blk_queue_split(),
      there is an extra complication.  A new bio is split off the front, and
      may be handle directly without going through generic_make_request().
      The old bio, which has been advanced, is passed to
      generic_make_request(), so it will trigger a trace event a second
      time.
      Probably the best result when a split happens is to see a single
      'queue' event for the whole bio, then multiple 'complete' events - one
      for each component.  To achieve this was can:
      - copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split()
      - avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set.
      This way, the split-off bio won't create a queue event, the original
      won't either even if it re-submitted to generic_make_request(),
      but both will produce completion events, each for their own range.
      
      So if generic_make_request() is called (which generates a QUEUED
      event), then bi_endio() will create a single COMPLETE event for each
      range that the bio is split into, unless the driver has explicitly
      requested it not to.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      fbbaf700
    • N
      block: simple improvements for bio->flags · dbde775c
      NeilBrown 提交于
      The comment for the 'flags' field of 'bio' mentions
      "command" which is no longer stored there, and doesn't
      mention the bvec pool number, which is.
      
      BIO_RESET_BITS is set in such a way that it would need to be
      updated if new bits were added, which is easy to miss.
      
      BVEC_POOL_BITS is larger than needed.  The BVEC_POOL_IDX()
      ranges from 0 to 6, so 3 bits are sufficient.
      
      This patch make improvements in each of these areas.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      dbde775c
    • O
      blk-mq: remap queues when adding/removing hardware queues · ebe8bddb
      Omar Sandoval 提交于
      blk_mq_update_nr_hw_queues() used to remap hardware queues, which is the
      behavior that drivers expect. However, commit 4e68a011 changed
      blk_mq_queue_reinit() to not remap queues for the case of CPU
      hotplugging, inadvertently making blk_mq_update_nr_hw_queues() not remap
      queues as well. This breaks, for example, NBD's multi-connection mode,
      leaving the added hardware queues unused. Fix it by making
      blk_mq_update_nr_hw_queues() explicitly remap the queues.
      
      Fixes: 4e68a011 ("blk-mq: don't redistribute hardware queues on a CPU hotplug event")
      Reviewed-by: NKeith Busch <keith.busch@intel.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      ebe8bddb
    • O
      blk-mq-sched: fix crash in switch error path · 54d5329d
      Omar Sandoval 提交于
      In elevator_switch(), if blk_mq_init_sched() fails, we attempt to fall
      back to the original scheduler. However, at this point, we've already
      torn down the original scheduler's tags, so this causes a crash. Doing
      the fallback like the legacy elevator path is much harder for mq, so fix
      it by just falling back to none, instead.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      54d5329d
    • O
      blk-mq-sched: set up scheduler tags when bringing up new queues · 93252632
      Omar Sandoval 提交于
      If a new hardware queue is added at runtime, we don't allocate scheduler
      tags for it, leading to a crash. This hooks up the scheduler framework
      to blk_mq_{init,exit}_hctx() to make sure everything gets properly
      initialized/freed.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      93252632
    • O
      blk-mq-sched: refactor scheduler initialization · 6917ff0b
      Omar Sandoval 提交于
      Preparation cleanup for the next couple of fixes, push
      blk_mq_sched_setup() and e->ops.mq.init_sched() into a helper.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      6917ff0b
    • O
      blk-mq: use the right hctx when getting a driver tag fails · 81380ca1
      Omar Sandoval 提交于
      While dispatching requests, if we fail to get a driver tag, we mark the
      hardware queue as waiting for a tag and put the requests on a
      hctx->dispatch list to be run later when a driver tag is freed. However,
      blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
      queues if using a single-queue scheduler with a multiqueue device. If
      blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
      are processing. This means we end up using the hardware queue of the
      previous request, which may or may not be the same as that of the
      current request. If it isn't, the wrong hardware queue will end up
      waiting for a tag, and the requests will be on the wrong dispatch list,
      leading to a hang.
      
      The fix is twofold:
      
      1. Make sure we save which hardware queue we were trying to get a
         request for in blk_mq_get_driver_tag() regardless of whether it
         succeeds or not.
      2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
         blk_mq_hw_queue to make it clear that it must handle multiple
         hardware queues, since I've already messed this up on a couple of
         occasions.
      
      This didn't appear in testing with nvme and mq-deadline because nvme has
      more driver tags than the default number of scheduler tags. However,
      with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      81380ca1
  3. 06 4月, 2017 6 次提交
  4. 05 4月, 2017 4 次提交
    • C
      remove the obsolete hd driver · 8e14be53
      Christoph Hellwig 提交于
      This driver is for pre-IDE hardisk that are only found in PC from the
      stoneage of personal computing, and which we don't support elsewhere
      in the kernel these days.
      
      It's also been marked broken forever.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      8e14be53
    • B
      blk-mq: Remove blk_mq_queue_data.list · f2fbc9dd
      Bart Van Assche 提交于
      The block layer core sets blk_mq_queue_data.list but no block
      drivers read that member. Hence remove it and also the code that
      is used to set this member.
      Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      f2fbc9dd
    • J
      cfq: Disable writeback throttling by default · 142bbdfc
      Jan Kara 提交于
      Writeback throttling does not play well with CFQ since that also tries
      to throttle async writes. As a result async writeback can get starved in
      presence of readers. As an example take a benchmark simulating
      postgreSQL database running over a standard rotating SATA drive. There
      are 16 processes doing random reads from a huge file (2*machine memory),
      1 process doing random writes to the huge file and calling fsync once
      per 50000 writes and 1 process doing sequential 8k writes to a
      relatively small file wrapping around at the end of the file and calling
      fsync every 5 writes. Under this load read latency easily exceeds the
      target latency of 75 ms (just because there are so many reads happening
      against a relatively slow disk) and thus writeback is throttled to a
      point where only 1 write request is allowed at a time. Blktrace data
      then looks like:
      
        8,0    1        0     8.347751764     0  m   N cfq workload slice:40000000
        8,0    1        0     8.347755256     0  m   N cfq293A  / set_active wl_class: 0 wl_type:0
        8,0    1        0     8.347784100     0  m   N cfq293A  / Not idling.  st->count:1
        8,0    1     3814     8.347763916  5839 UT   N [kworker/u9:2] 1
        8,0    0        0     8.347777605     0  m   N cfq293A  / Not idling.  st->count:1
        8,0    1        0     8.347784100     0  m   N cfq293A  / Not idling.  st->count:1
        8,0    3     1596     8.354364057     0  C   R 156109528 + 8 (6906954) [0]
        8,0    3        0     8.354383193     0  m   N cfq6196SN / complete rqnoidle 0
        8,0    3        0     8.354386476     0  m   N cfq schedule dispatch
        8,0    3        0     8.354399397     0  m   N cfq293A  / Not idling.  st->count:1
        8,0    3        0     8.354404705     0  m   N cfq293A  / dispatch_insert
        8,0    3        0     8.354409454     0  m   N cfq293A  / dispatched a request
        8,0    3        0     8.354412527     0  m   N cfq293A  / activate rq, drv=1
        8,0    3     1597     8.354414692     0  D   W 145961400 + 24 (6718452) [swapper/0]
        8,0    3        0     8.354484184     0  m   N cfq293A  / Not idling.  st->count:1
        8,0    3        0     8.354487536     0  m   N cfq293A  / slice expired t=0
        8,0    3        0     8.354498013     0  m   N / served: vt=5888102466265088 min_vt=5888074869387264
        8,0    3        0     8.354502692     0  m   N cfq293A  / sl_used=6737519 disp=1 charge=6737519 iops=0 sect=24
        8,0    3        0     8.354505695     0  m   N cfq293A  / del_from_rr
      ...
        8,0    0     1810     8.354728768     0  C   W 145961400 + 24 (314076) [0]
        8,0    0        0     8.354746927     0  m   N cfq293A  / complete rqnoidle 0
      ...
        8,0    1     3829     8.389886102  5839  G   W 145962968 + 24 [kworker/u9:2]
        8,0    1     3830     8.389888127  5839  P   N [kworker/u9:2]
        8,0    1     3831     8.389908102  5839  A   W 145978336 + 24 <- (8,4) 44000
        8,0    1     3832     8.389910477  5839  Q   W 145978336 + 24 [kworker/u9:2]
        8,0    1     3833     8.389914248  5839  I   W 145962968 + 24 (28146) [kworker/u9:2]
        8,0    1        0     8.389919137     0  m   N cfq293A  / insert_request
        8,0    1        0     8.389924305     0  m   N cfq293A  / add_to_rr
        8,0    1     3834     8.389933175  5839 UT   N [kworker/u9:2] 1
      ...
        8,0    0        0     9.455290997     0  m   N cfq workload slice:40000000
        8,0    0        0     9.455294769     0  m   N cfq293A  / set_active wl_class:0 wl_type:0
        8,0    0        0     9.455303499     0  m   N cfq293A  / fifo=ffff880003166090
        8,0    0        0     9.455306851     0  m   N cfq293A  / dispatch_insert
        8,0    0        0     9.455311251     0  m   N cfq293A  / dispatched a request
        8,0    0        0     9.455314324     0  m   N cfq293A  / activate rq, drv=1
        8,0    0     2043     9.455316210  6204  D   W 145962968 + 24 (1065401962) [pgioperf]
        8,0    0        0     9.455392407     0  m   N cfq293A  / Not idling.  st->count:1
        8,0    0        0     9.455395969     0  m   N cfq293A  / slice expired t=0
        8,0    0        0     9.455404210     0  m   N / served: vt=5888958194597888 min_vt=5888941810597888
        8,0    0        0     9.455410077     0  m   N cfq293A  / sl_used=4000000 disp=1 charge=4000000 iops=0 sect=24
        8,0    0        0     9.455416851     0  m   N cfq293A  / del_from_rr
      ...
        8,0    0     2045     9.455648515     0  C   W 145962968 + 24 (332305) [0]
        8,0    0        0     9.455668350     0  m   N cfq293A  / complete rqnoidle 0
      ...
        8,0    1     4371     9.455710115  5839  G   W 145978336 + 24 [kworker/u9:2]
        8,0    1     4372     9.455712350  5839  P   N [kworker/u9:2]
        8,0    1     4373     9.455730159  5839  A   W 145986616 + 24 <- (8,4) 52280
        8,0    1     4374     9.455732674  5839  Q   W 145986616 + 24 [kworker/u9:2]
        8,0    1     4375     9.455737563  5839  I   W 145978336 + 24 (27448) [kworker/u9:2]
        8,0    1        0     9.455742871     0  m   N cfq293A  / insert_request
        8,0    1        0     9.455747550     0  m   N cfq293A  / add_to_rr
        8,0    1     4376     9.455756629  5839 UT   N [kworker/u9:2] 1
      
      So we can see a Q event for a write request, then IO is blocked by
      writeback throttling and G and I events for the request happen only once
      other writeback IO is completed. Thus CFQ always sees only one write
      request. When it sees it, it queues the async queue behind all the read
      queues and the async queue gets scheduled after about one second. When
      it is scheduled, that one request gets dispatched and async queue is
      expired as it has no more requests to submit. Overall we submit about
      one write request per second.
      
      Although this scheduling is beneficial for read latency, writes are
      heavily starved and this causes large delays all over the system (due to
      processes blocking on page lock, transaction starts, etc.). When
      writeback throttling is disabled, write throughput is about one fifth of
      a read throughput which roughly matches readers/writers ratio and
      overall the system stalls are much shorter.
      
      Mixing writeback throttling logic with CFQ throttling logic is always a
      recipe for surprises as CFQ assumes it sees the big part of the picture
      which is not necessarily true when writeback throttling is blocking
      requests. So disable writeback throttling logic by default when CFQ is
      used as an IO scheduler.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      142bbdfc
    • A
      block: fix inheriting request priority from bio · 85003a44
      Adam Manzanares 提交于
      In 4.10 I introduced a patch that associates the ioc priority with
      each request in the block layer. This work was done in the single queue
      block layer code. This patch unifies ioc priority to request mapping across
      the single/multi queue block layers.
      
      I have tested this patch with the null block device driver with the following
      parameters.
      
      null_blk queue_mode=2 irqmode=0 use_per_node_hctx=1 nr_devices=1
      
      I have not seen a performance regression with this patch and I would appreciate
      any feedback or additional testing.
      
      I have also verified that io priorities are passed to the device when using
      the SQ and MQ path to a SATA HDD that supports io priorities.
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAdam Manzanares <adam.manzanares@wdc.com>
      Reviewed-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      85003a44
  5. 04 4月, 2017 17 次提交