1. 22 3月, 2017 6 次提交
    • J
      block: fix stacked driver stats init and free · a83b576c
      Jens Axboe 提交于
      If a driver allocates a queue for stacked usage, then it does
      not currently get stats allocated. This causes the later init
      of, eg, writeback throttling to blow up. Move the init to the
      queue allocation instead.
      
      Additionally, allow a NULL callback unregistration. This avoids
      having the caller check for that, fixing another oops on
      removal of a block device that doesn't have poll stats allocated.
      
      Fixes: 34dbad5d ("blk-stat: convert to callback-based statistics reporting")
      Signed-off-by: NJens Axboe <axboe@fb.com>
      a83b576c
    • O
      blk-stat: convert to callback-based statistics reporting · 34dbad5d
      Omar Sandoval 提交于
      Currently, statistics are gathered in ~0.13s windows, and users grab the
      statistics whenever they need them. This is not ideal for both in-tree
      users:
      
      1. Writeback throttling wants its own dynamically sized window of
         statistics. Since the blk-stats statistics are reset after every
         window and the wbt windows don't line up with the blk-stats windows,
         wbt doesn't see every I/O.
      2. Polling currently grabs the statistics on every I/O. Again, depending
         on how the window lines up, we may miss some I/Os. It's also
         unnecessary overhead to get the statistics on every I/O; the hybrid
         polling heuristic would be just as happy with the statistics from the
         previous full window.
      
      This reworks the blk-stats infrastructure to be callback-based: users
      register a callback that they want called at a given time with all of
      the statistics from the window during which the callback was active.
      Users can dynamically bucketize the statistics. wbt and polling both
      currently use read vs. write, but polling can be extended to further
      subdivide based on request size.
      
      The callbacks are kept on an RCU list, and each callback has percpu
      stats buffers. There will only be a few users, so the overhead on the
      I/O completion side is low. The stats flushing is also simplified
      considerably: since the timer function is responsible for clearing the
      statistics, we don't have to worry about stale statistics.
      
      wbt is a trivial conversion. After the conversion, the windowing problem
      mentioned above is fixed.
      
      For polling, we register an extra callback that caches the previous
      window's statistics in the struct request_queue for the hybrid polling
      heuristic to use.
      
      Since we no longer have a single stats buffer for the request queue,
      this also removes the sysfs and debugfs stats entries. To replace those,
      we add a debugfs entry for the poll statistics.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      34dbad5d
    • O
      blk-stat: move BLK_RQ_STAT_BATCH definition to blk-stat.c · 4875253f
      Omar Sandoval 提交于
      This is an implementation detail that no-one outside of blk-stat.c uses.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      4875253f
    • O
      blk-stat: use READ and WRITE instead of BLK_STAT_{READ,WRITE} · fa2e39cb
      Omar Sandoval 提交于
      The stats buckets will become generic soon, so make the existing users
      use the common READ and WRITE definitions instead of one internal to
      blk-stat.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      fa2e39cb
    • O
      block: remove extra calls to wbt_exit() · 0315b159
      Omar Sandoval 提交于
      We always call wbt_exit() from blk_release_queue(), so these are
      unnecessary.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      0315b159
    • O
      blk-stat: fix blk_stat_sum() if all samples are batched · 7d8d0014
      Omar Sandoval 提交于
      We need to flush the batch _before_ we check the number of samples,
      otherwise we'll miss all of the batched samples.
      
      Fixes: cf43e6be ("block: add scalable completion tracking of requests")
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      7d8d0014
  2. 15 3月, 2017 1 次提交
  3. 13 3月, 2017 1 次提交
  4. 12 3月, 2017 1 次提交
    • N
      blk: Ensure users for current->bio_list can see the full list. · f5fe1b51
      NeilBrown 提交于
      Commit 79bd9959 ("blk: improve order of bio handling in generic_make_request()")
      changed current->bio_list so that it did not contain *all* of the
      queued bios, but only those submitted by the currently running
      make_request_fn.
      
      There are two places which walk the list and requeue selected bios,
      and others that check if the list is empty.  These are no longer
      correct.
      
      So redefine current->bio_list to point to an array of two lists, which
      contain all queued bios, and adjust various code to test or walk both
      lists.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Fixes: 79bd9959 ("blk: improve order of bio handling in generic_make_request()")
      Signed-off-by: NJens Axboe <axboe@fb.com>
      f5fe1b51
  5. 09 3月, 2017 8 次提交
    • N
      blk: improve order of bio handling in generic_make_request() · 79bd9959
      NeilBrown 提交于
      To avoid recursion on the kernel stack when stacked block devices
      are in use, generic_make_request() will, when called recursively,
      queue new requests for later handling.  They will be handled when the
      make_request_fn for the current bio completes.
      
      If any bios are submitted by a make_request_fn, these will ultimately
      be handled seqeuntially.  If the handling of one of those generates
      further requests, they will be added to the end of the queue.
      
      This strict first-in-first-out behaviour can lead to deadlocks in
      various ways, normally because a request might need to wait for a
      previous request to the same device to complete.  This can happen when
      they share a mempool, and can happen due to interdependencies
      particular to the device.  Both md and dm have examples where this happens.
      
      These deadlocks can be erradicated by more selective ordering of bios.
      Specifically by handling them in depth-first order.  That is: when the
      handling of one bio generates one or more further bios, they are
      handled immediately after the parent, before any siblings of the
      parent.  That way, when generic_make_request() calls make_request_fn
      for some particular device, we can be certain that all previously
      submited requests for that device have been completely handled and are
      not waiting for anything in the queue of requests maintained in
      generic_make_request().
      
      An easy way to achieve this would be to use a last-in-first-out stack
      instead of a queue.  However this will change the order of consecutive
      bios submitted by a make_request_fn, which could have unexpected consequences.
      Instead we take a slightly more complex approach.
      A fresh queue is created for each call to a make_request_fn.  After it completes,
      any bios for a different device are placed on the front of the main queue, followed
      by any bios for the same device, followed by all bios that were already on
      the queue before the make_request_fn was called.
      This provides the depth-first approach without reordering bios on the same level.
      
      This, by itself, it not enough to remove all deadlocks.  It just makes
      it possible for drivers to take the extra step required themselves.
      
      To avoid deadlocks, drivers must never risk waiting for a request
      after submitting one to generic_make_request.  This includes never
      allocing from a mempool twice in the one call to a make_request_fn.
      
      A common pattern in drivers is to call bio_split() in a loop, handling
      the first part and then looping around to possibly split the next part.
      Instead, a driver that finds it needs to split a bio should queue
      (with generic_make_request) the second part, handle the first part,
      and then return.  The new code in generic_make_request will ensure the
      requests to underlying bios are processed first, then the second bio
      that was split off.  If it splits again, the same process happens.  In
      each case one bio will be completely handled before the next one is attempted.
      
      With this is place, it should be possible to disable the
      punt_bios_to_recover() recovery thread for many block devices, and
      eventually it may be possible to remove it completely.
      
      Ref: http://www.spinics.net/lists/raid/msg54680.htmlTested-by: NJinpu Wang <jinpu.wang@profitbricks.com>
      Inspired-by: NLars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      79bd9959
    • J
      Revert "scsi, block: fix duplicate bdi name registration crashes" · c01228db
      Jan Kara 提交于
      This reverts commit 0dba1314. It causes
      leaking of device numbers for SCSI when SCSI registers multiple gendisks
      for one request_queue in succession. It can be easily reproduced using
      Omar's script [1] on kernel with CONFIG_DEBUG_TEST_DRIVER_REMOVE.
      Furthermore the protection provided by this commit is not needed anymore
      as the problem it was fixing got also fixed by commit 165a5e22
      "block: Move bdi_unregister() to del_gendisk()".
      
      [1]: http://marc.info/?l=linux-block&m=148554717109098&w=2Signed-off-by: NJan Kara <jack@suse.cz>
      Acked-by: NDan Williams <dan.j.williams@intel.com>
      Tested-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      c01228db
    • J
      block: Make del_gendisk() safer for disks without queues · 90f16fdd
      Jan Kara 提交于
      Commit 165a5e22 "block: Move bdi_unregister() to del_gendisk()"
      added disk->queue dereference to del_gendisk(). Although del_gendisk()
      is not supposed to be called without disk->queue valid and
      blk_unregister_queue() warns in that case, this change will make it oops
      instead. Return to the old more robust behavior of just warning when
      del_gendisk() gets called for gendisk with disk->queue being NULL.
      Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Tested-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      90f16fdd
    • J
      block/sed: Fix opal user range check and unused variables · b0bfdfc2
      Jon Derrick 提交于
      Fixes check that the opal user is within the range, and cleans up unused
      method variables.
      Signed-off-by: NJon Derrick <jonathan.derrick@intel.com>
      Reviewed-by: NScott Bauer <scott.bauer@intel.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      b0bfdfc2
    • M
      blk-mq: free hctx->cpumask in release handler of hctx's kobject · 01388df3
      Ming Lei 提交于
      It is obviously that hctx->cpumask is per hctx, and both
      share same lifetime, so this patch moves freeing of hctx->cpumask
      into release handler of hctx's kobject.
      Signed-off-by: NMing Lei <tom.leiming@gmail.com>
      Tested-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      01388df3
    • M
      blk-mq: make lifetime consistent between hctx and its kobject · 6c8b232e
      Ming Lei 提交于
      This patch removes kobject_put() over hctx in __blk_mq_unregister_dev(),
      and trys to keep lifetime consistent between hctx and hctx's kobject.
      
      Now blk_mq_sysfs_register() and blk_mq_sysfs_unregister() become
      totally symmetrical, and kobject's refcounter drops to zero just
      when the hctx is freed.
      Signed-off-by: NMing Lei <tom.leiming@gmail.com>
      Tested-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      6c8b232e
    • M
      blk-mq: make lifetime consitent between q/ctx and its kobject · 7ea5fe31
      Ming Lei 提交于
      Currently from kobject view, both q->mq_kobj and ctx->kobj can
      be released during one cycle of blk_mq_register_dev() and
      blk_mq_unregister_dev(). Actually, sw queue's lifetime is
      same with its request queue's, which is covered by request_queue->kobj.
      
      So we don't need to call kobject_put() for the two kinds of
      kobject in __blk_mq_unregister_dev(), instead we do that
      in release handler of request queue.
      Signed-off-by: NMing Lei <tom.leiming@gmail.com>
      Tested-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      7ea5fe31
    • M
      blk-mq: initialize mq kobjects in blk_mq_init_allocated_queue() · 737f98cf
      Ming Lei 提交于
      Both q->mq_kobj and sw queues' kobjects should have been initialized
      once, instead of doing that each add_disk context.
      
      Also this patch removes clearing of ctx in blk_mq_init_cpu_queues()
      because percpu allocator fills zero to allocated variable.
      
      This patch fixes one issue[1] reported from Omar.
      
      [1] kernel wearning when doing unbind/bind on one scsi-mq device
      
      [   19.347924] kobject (ffff8800791ea0b8): tried to init an initialized object, something is seriously wrong.
      [   19.349781] CPU: 1 PID: 84 Comm: kworker/u8:1 Not tainted 4.10.0-rc7-00210-g53f39eeaa263 #34
      [   19.350686] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-20161122_114906-anatol 04/01/2014
      [   19.350920] Workqueue: events_unbound async_run_entry_fn
      [   19.350920] Call Trace:
      [   19.350920]  dump_stack+0x63/0x83
      [   19.350920]  kobject_init+0x77/0x90
      [   19.350920]  blk_mq_register_dev+0x40/0x130
      [   19.350920]  blk_register_queue+0xb6/0x190
      [   19.350920]  device_add_disk+0x1ec/0x4b0
      [   19.350920]  sd_probe_async+0x10d/0x1c0 [sd_mod]
      [   19.350920]  async_run_entry_fn+0x48/0x150
      [   19.350920]  process_one_work+0x1d0/0x480
      [   19.350920]  worker_thread+0x48/0x4e0
      [   19.350920]  kthread+0x101/0x140
      [   19.350920]  ? process_one_work+0x480/0x480
      [   19.350920]  ? kthread_create_on_node+0x60/0x60
      [   19.350920]  ret_from_fork+0x2c/0x40
      
      Cc: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: NMing Lei <tom.leiming@gmail.com>
      Tested-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      737f98cf
  6. 03 3月, 2017 3 次提交
  7. 02 3月, 2017 14 次提交
  8. 28 2月, 2017 3 次提交
  9. 24 2月, 2017 3 次提交
    • O
      blk-mq-sched: separate mark hctx and queue restart operations · d38d3515
      Omar Sandoval 提交于
      In blk_mq_sched_dispatch_requests(), we call blk_mq_sched_mark_restart()
      after we dispatch requests left over on our hardware queue dispatch
      list. This is so we'll go back and dispatch requests from the scheduler.
      In this case, it's only necessary to restart the hardware queue that we
      are running; there's no reason to run other hardware queues just because
      we are using shared tags.
      
      So, split out blk_mq_sched_mark_restart() into two operations, one for
      just the hardware queue and one for the whole request queue. The core
      code only needs the hctx variant, but I/O schedulers will want to use
      both.
      
      This also requires adjusting blk_mq_sched_restart_queues() to always
      check the queue restart flag, not just when using shared tags.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      d38d3515
    • O
      blk-mq: use sbq wait queues instead of restart for driver tags · da55f2cc
      Omar Sandoval 提交于
      Commit 50e1dab8 ("blk-mq-sched: fix starvation for multiple hardware
      queues and shared tags") fixed one starvation issue for shared tags.
      However, we can still get into a situation where we fail to allocate a
      tag because all tags are allocated but we don't have any pending
      requests on any hardware queue.
      
      One solution for this would be to restart all queues that share a tag
      map, but that really sucks. Ideally, we could just block and wait for a
      tag, but that isn't always possible from blk_mq_dispatch_rq_list().
      
      However, we can still use the struct sbitmap_queue wait queues with a
      custom callback instead of blocking. This has a few benefits:
      
      1. It avoids iterating over all hardware queues when completing an I/O,
         which the current restart code has to do.
      2. It benefits from the existing rolling wakeup code.
      3. It avoids punting to another thread just to have it block.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      da55f2cc
    • S
      block/sed-opal: Propagate original error message to userland. · 2d19020b
      Scott Bauer 提交于
      During an error on a comannd, ex: user provides wrong pw to unlock
      range, we will gracefully terminate the opal session. We want to
      propagate the original error to userland instead of the result of
      the session termination, which is almost always a success.
      Signed-off-by: NScott Bauer <scott.bauer@intel.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      2d19020b