1. 22 March 2017, 3 commits
    • block: fix stacked driver stats init and free · a83b576c
      Committed by Jens Axboe
      If a driver allocates a queue for stacked usage, then it does
      not currently get stats allocated. This causes the later init
      of, e.g., writeback throttling to blow up. Move the init to the
      queue allocation instead.
      
      Additionally, allow a NULL callback unregistration. This avoids
      having the caller check for that, fixing another oops on
      removal of a block device that doesn't have poll stats allocated.
      
      Fixes: 34dbad5d ("blk-stat: convert to callback-based statistics reporting")
      Signed-off-by: Jens Axboe <axboe@fb.com>
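      A minimal sketch of the NULL-tolerant unregistration described above;
      the function name and signature are illustrative, not the exact blk-stat
      helpers:

        static void my_stat_callback_unregister(struct request_queue *q,
                                                struct blk_stat_callback *cb)
        {
                /* A queue that never allocated poll stats (e.g. a stacked
                 * device before this fix) passes cb == NULL; just return. */
                if (!cb)
                        return;

                /* ... unlink the callback and free its percpu buffers ... */
        }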
    • blk-stat: convert to callback-based statistics reporting · 34dbad5d
      Committed by Omar Sandoval
      Currently, statistics are gathered in ~0.13s windows, and users grab the
      statistics whenever they need them. This is not ideal for either of the
      two in-tree users:
      
      1. Writeback throttling wants its own dynamically sized window of
         statistics. Since the blk-stats statistics are reset after every
         window and the wbt windows don't line up with the blk-stats windows,
         wbt doesn't see every I/O.
      2. Polling currently grabs the statistics on every I/O. Again, depending
         on how the window lines up, we may miss some I/Os. It's also
         unnecessary overhead to get the statistics on every I/O; the hybrid
         polling heuristic would be just as happy with the statistics from the
         previous full window.
      
      This reworks the blk-stats infrastructure to be callback-based: users
      register a callback that they want called at a given time with all of
      the statistics from the window during which the callback was active.
      Users can dynamically bucketize the statistics. wbt and polling both
      currently use read vs. write, but polling can be extended to further
      subdivide based on request size.
      
      The callbacks are kept on an RCU list, and each callback has percpu
      stats buffers. There will only be a few users, so the overhead on the
      I/O completion side is low. The stats flushing is also simplified
      considerably: since the timer function is responsible for clearing the
      statistics, we don't have to worry about stale statistics.
      
      wbt is a trivial conversion. After the conversion, the windowing problem
      mentioned above is fixed.
      
      For polling, we register an extra callback that caches the previous
      window's statistics in the struct request_queue for the hybrid polling
      heuristic to use.
      
      Since we no longer have a single stats buffer for the request queue,
      this also removes the sysfs and debugfs stats entries. To replace those,
      we add a debugfs entry for the poll statistics.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
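      A hedged sketch of the callback-based usage described above; the helper
      names and signatures are assumptions drawn from the commit text rather
      than a verified copy of the blk-stat API:

        /* Bucket function: callers bucketize however they like; wbt and
         * polling both split read vs. write. */
        static int my_stat_bucket(const struct request *rq)
        {
                return rq_data_dir(rq);         /* 0 = read, 1 = write */
        }

        /* Timer function: called with the per-bucket stats for the window
         * that just ended; the timer side owns clearing the statistics. */
        static void my_stat_timer(struct blk_stat_callback *cb)
        {
                /* consume cb->stat[0] (reads) and cb->stat[1] (writes),
                 * then re-arm the callback if another window is wanted */
        }

        static int my_stats_init(struct request_queue *q, void *data)
        {
                struct blk_stat_callback *cb;

                cb = blk_stat_alloc_callback(my_stat_timer, my_stat_bucket,
                                             2, data);
                if (!cb)
                        return -ENOMEM;
                blk_stat_add_callback(q, cb);
                return 0;
        }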
    • block: remove extra calls to wbt_exit() · 0315b159
      Committed by Omar Sandoval
      We always call wbt_exit() from blk_release_queue(), so these are
      unnecessary.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  2. 12 March 2017, 1 commit
    • blk: Ensure users for current->bio_list can see the full list. · f5fe1b51
      Committed by NeilBrown
      Commit 79bd9959 ("blk: improve order of bio handling in generic_make_request()")
      changed current->bio_list so that it did not contain *all* of the
      queued bios, but only those submitted by the currently running
      make_request_fn.
      
      There are two places which walk the list and requeue selected bios,
      and others that check if the list is empty.  These are no longer
      correct.
      
      So redefine current->bio_list to point to an array of two lists, which
      contain all queued bios, and adjust various code to test or walk both
      lists.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Fixes: 79bd9959 ("blk: improve order of bio handling in generic_make_request()")
      Signed-off-by: Jens Axboe <axboe@fb.com>
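      A hedged sketch of the new convention described above (the helper name is
      illustrative): current->bio_list now points at an array of two struct
      bio_list, and code that used to test a single list must look at both:

        static bool current_has_queued_bios(void)
        {
                /* [0] holds bios queued by the make_request_fn that is
                 * currently running, [1] holds bios that were already queued
                 * before it was called. */
                return current->bio_list &&
                       (!bio_list_empty(&current->bio_list[0]) ||
                        !bio_list_empty(&current->bio_list[1]));
        }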
  3. 09 March 2017, 2 commits
    • blk: improve order of bio handling in generic_make_request() · 79bd9959
      Committed by NeilBrown
      To avoid recursion on the kernel stack when stacked block devices
      are in use, generic_make_request() will, when called recursively,
      queue new requests for later handling.  They will be handled when the
      make_request_fn for the current bio completes.
      
      If any bios are submitted by a make_request_fn, these will ultimately
      be handled sequentially.  If the handling of one of those generates
      further requests, they will be added to the end of the queue.
      
      This strict first-in-first-out behaviour can lead to deadlocks in
      various ways, normally because a request might need to wait for a
      previous request to the same device to complete.  This can happen when
      they share a mempool, and can happen due to interdependencies
      particular to the device.  Both md and dm have examples where this happens.
      
      These deadlocks can be eradicated by more selective ordering of bios.
      Specifically by handling them in depth-first order.  That is: when the
      handling of one bio generates one or more further bios, they are
      handled immediately after the parent, before any siblings of the
      parent.  That way, when generic_make_request() calls make_request_fn
      for some particular device, we can be certain that all previously
      submitted requests for that device have been completely handled and are
      not waiting for anything in the queue of requests maintained in
      generic_make_request().
      
      An easy way to achieve this would be to use a last-in-first-out stack
      instead of a queue.  However this will change the order of consecutive
      bios submitted by a make_request_fn, which could have unexpected consequences.
      Instead we take a slightly more complex approach.
      A fresh queue is created for each call to a make_request_fn.  After it completes,
      any bios for a different device are placed on the front of the main queue, followed
      by any bios for the same device, followed by all bios that were already on
      the queue before the make_request_fn was called.
      This provides the depth-first approach without reordering bios on the same level.
      
      This, by itself, is not enough to remove all deadlocks.  It just makes
      it possible for drivers to take the extra step required themselves.
      
      To avoid deadlocks, drivers must never risk waiting for a request
      after submitting one to generic_make_request.  This includes never
      allocating from a mempool twice in a single call to a make_request_fn.
      
      A common pattern in drivers is to call bio_split() in a loop, handling
      the first part and then looping around to possibly split the next part.
      Instead, a driver that finds it needs to split a bio should queue
      (with generic_make_request) the second part, handle the first part,
      and then return.  The new code in generic_make_request will ensure the
      requests to underlying bios are processed first, then the second bio
      that was split off.  If it splits again, the same process happens.  In
      each case one bio will be completely handled before the next one is attempted.
      
      With this in place, it should be possible to disable the
      punt_bios_to_recover() recovery thread for many block devices, and
      eventually it may be possible to remove it completely.
      
      Ref: http://www.spinics.net/lists/raid/msg54680.html
      Tested-by: Jinpu Wang <jinpu.wang@profitbricks.com>
      Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
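      A hedged sketch of the driver-side pattern the commit recommends (the
      driver and my_max_sectors() are illustrative): split once, queue the
      remainder with generic_make_request(), and keep handling only the first
      part:

        static blk_qc_t my_make_request(struct request_queue *q, struct bio *bio)
        {
                if (bio_sectors(bio) > my_max_sectors(q)) {
                        struct bio *split;

                        /* bio_split() returns the first my_max_sectors(q)
                         * sectors and leaves the remainder in 'bio'. */
                        split = bio_split(bio, my_max_sectors(q), GFP_NOIO,
                                          q->bio_split);
                        bio_chain(split, bio);
                        /* Queue the remainder; generic_make_request() will
                         * only process it after the first part is handled. */
                        generic_make_request(bio);
                        bio = split;
                }

                /* ... handle 'bio', now at most my_max_sectors(q) sectors ... */
                return BLK_QC_T_NONE;
        }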
    • Revert "scsi, block: fix duplicate bdi name registration crashes" · c01228db
      Committed by Jan Kara
      This reverts commit 0dba1314. It causes
      leaking of device numbers for SCSI when SCSI registers multiple gendisks
      for one request_queue in succession. It can be easily reproduced using
      Omar's script [1] on kernel with CONFIG_DEBUG_TEST_DRIVER_REMOVE.
      Furthermore, the protection provided by this commit is no longer needed,
      as the problem it was fixing was also fixed by commit 165a5e22
      ("block: Move bdi_unregister() to del_gendisk()").
      
      [1]: http://marc.info/?l=linux-block&m=148554717109098&w=2
      Signed-off-by: Jan Kara <jack@suse.cz>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Tested-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  4. 03 March 2017, 1 commit
  5. 09 February 2017, 2 commits
  6. 04 February 2017, 1 commit
    • block: free merged request in the caller · e4d750c9
      Committed by Jens Axboe
      If we end up doing a request-to-request merge when we have completed
      a bio-to-request merge, we free the request from deep down in that
      path. For blk-mq-sched, the merge path has to hold the appropriate
      lock, but we don't need it for freeing the request. And in fact
      holding the lock is problematic, since we are now calling the
      mq sched put_rq_private() hook with the lock held. Other call paths
      do not hold this lock.
      
      Fix this inconsistency by ensuring that the caller frees a merged
      request. Then we can do it outside of the lock, making it both more
      efficient and fixing the blk-mq-sched problem of invoking parts of
      the scheduler with an unknown lock state.
      Reported-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Reviewed-by: Omar Sandoval <osandov@fb.com>
  7. 03 February 2017, 1 commit
  8. 02 February 2017, 6 commits
    • scsi, block: fix duplicate bdi name registration crashes · 0dba1314
      Committed by Dan Williams
      Warnings of the following form occur because scsi reuses a devt number
      while the block layer still has it referenced as the name of the bdi
      [1]:
      
       WARNING: CPU: 1 PID: 93 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x62/0x80
       sysfs: cannot create duplicate filename '/devices/virtual/bdi/8:192'
       [..]
       Call Trace:
        dump_stack+0x86/0xc3
        __warn+0xcb/0xf0
        warn_slowpath_fmt+0x5f/0x80
        ? kernfs_path_from_node+0x4f/0x60
        sysfs_warn_dup+0x62/0x80
        sysfs_create_dir_ns+0x77/0x90
        kobject_add_internal+0xb2/0x350
        kobject_add+0x75/0xd0
        device_add+0x15a/0x650
        device_create_groups_vargs+0xe0/0xf0
        device_create_vargs+0x1c/0x20
        bdi_register+0x90/0x240
        ? lockdep_init_map+0x57/0x200
        bdi_register_owner+0x36/0x60
        device_add_disk+0x1bb/0x4e0
        ? __pm_runtime_use_autosuspend+0x5c/0x70
        sd_probe_async+0x10d/0x1c0
        async_run_entry_fn+0x39/0x170
      
      This is a brute-force fix to pass the devt release information from
      sd_probe() to the locations where we register the bdi,
      device_add_disk(), and unregister the bdi, blk_cleanup_queue().
      
      Thanks to Omar for the quick reproducer script [2]. This patch survives
      where an unmodified kernel fails in a few seconds.
      
      [1]: https://marc.info/?l=linux-scsi&m=147116857810716&w=4
      [2]: http://marc.info/?l=linux-block&m=148554717109098&w=2
      
      Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
      Cc: Bart Van Assche <bart.vanassche@sandisk.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Jan Kara <jack@suse.cz>
      Reported-by: Omar Sandoval <osandov@osandov.com>
      Tested-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: Get rid of blk_get_backing_dev_info() · efa7c9f9
      Committed by Jan Kara
      blk_get_backing_dev_info() is now a simple dereference. Remove that
      function and simplify some code around that.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: Make blk_get_backing_dev_info() safe without open bdev · b1d2dc56
      Committed by Jan Kara
      Currently blk_get_backing_dev_info() is not safe to call when the
      block device is not open, as bdev->bd_disk is NULL in that case. However,
      inode_to_bdi() uses this function and may be called from the flusher
      worker or other writeback-related functions without the bdev being open,
      which leads to crashes such as:
      
      [113031.075540] Unable to handle kernel paging request for data at address 0x00000000
      [113031.075614] Faulting instruction address: 0xc0000000003692e0
      0:mon> t
      [c0000000fb65f900] c00000000036cb6c writeback_sb_inodes+0x30c/0x590
      [c0000000fb65fa10] c00000000036ced4 __writeback_inodes_wb+0xe4/0x150
      [c0000000fb65fa70] c00000000036d33c wb_writeback+0x30c/0x450
      [c0000000fb65fb40] c00000000036e198 wb_workfn+0x268/0x580
      [c0000000fb65fc50] c0000000000f3470 process_one_work+0x1e0/0x590
      [c0000000fb65fce0] c0000000000f38c8 worker_thread+0xa8/0x660
      [c0000000fb65fd80] c0000000000fc4b0 kthread+0x110/0x130
      [c0000000fb65fe30] c0000000000098f0 ret_from_kernel_thread+0x5c/0x6c
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: Dynamically allocate and refcount backing_dev_info · d03f6cdc
      Committed by Jan Kara
      Instead of storing backing_dev_info inside struct request_queue,
      allocate it dynamically, reference count it, and free it when the last
      reference is dropped. Currently only request_queue holds the reference
      but in the following patch we add other users referencing
      backing_dev_info.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: Use pointer to backing_dev_info from request_queue · dc3b17cc
      Committed by Jan Kara
      We will want to have struct backing_dev_info allocated separately from
      struct request_queue. As the first step add pointer to backing_dev_info
      to request_queue and convert all users touching it. No functional
      changes in this patch.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
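      A hedged illustration of the mechanical conversion described above (the
      readahead field is just an example of a touched user):

        /* before: backing_dev_info embedded in struct request_queue */
        q->backing_dev_info.ra_pages = VM_MAX_READAHEAD * 1024 / PAGE_SIZE;

        /* after: request_queue only carries a pointer to it */
        q->backing_dev_info->ra_pages = VM_MAX_READAHEAD * 1024 / PAGE_SIZE;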
    • block: queue lock must be acquired when iterating over rls · bbfc3c5d
      Committed by Tahsin Erdogan
      blk_set_queue_dying() does not acquire queue lock before it calls
      blk_queue_for_each_rl(). This allows a racing blkg_destroy() to
      remove blkg->q_node from the linked list and have
      blk_queue_for_each_rl() loop infinitely over the removed blkg->q_node
      list node.
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
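      A hedged sketch of the rule the fix enforces (the loop body is elided;
      'rl' is a struct request_list iterator): hold the queue lock around the
      walk so a racing blkg_destroy() cannot unlink the node being iterated:

        spin_lock_irq(q->queue_lock);
        blk_queue_for_each_rl(rl, q) {
                /* e.g. mark the request list dying and wake up waiters */
        }
        spin_unlock_irq(q->queue_lock);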
  9. 01 February 2017, 2 commits
  10. 28 January 2017, 8 commits
  11. 18 January 2017, 2 commits
  12. 09 December 2016, 1 commit
    • block: improve handling of the magic discard payload · f9d03f96
      Committed by Christoph Hellwig
      Instead of allocating a single unused biovec for discard requests, send
      them down without any payload.  Instead we allow the driver to add a
      "special" payload using a biovec embedded into struct request (unioned
      over other fields never used while in the driver), and overloading
      the number of segments for this case.
      
      This has a couple of advantages:
      
       - we don't have to allocate the bio_vec
       - the amount of special casing for discard requests in the block
         layer is significantly reduced
       - using this same scheme for other request types is trivial,
         which will be important for implementing the new WRITE_ZEROES
         op on devices where it actually requires a payload (e.g. SCSI)
       - we can get rid of playing games with the request length, as
         we'll never touch it and completions will work just fine
       - it will allow us to support ranged discard operations in the
         future by merging non-contiguous discard bios into a single
         request
       - last but not least it removes a lot of code
      
      This patch is the common base for my WIP series for ranged discards and to
      remove discard_zeroes_data in favor of always using REQ_OP_WRITE_ZEROES,
      so it would be good to get it in quickly.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
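      A hedged sketch of how a driver attaches the "special" payload described
      above; the field and flag names follow the commit description, but treat
      the details as approximate rather than a verified diff:

        /* Attach a single-segment payload to an otherwise payload-free
         * request and mark it, instead of allocating a bio_vec. */
        rq->special_vec.bv_page = page;
        rq->special_vec.bv_offset = 0;
        rq->special_vec.bv_len = data_len;
        rq->rq_flags |= RQF_SPECIAL_PAYLOAD;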
  13. 05 December 2016, 1 commit
    • block: fix unintended fallthrough in generic_make_request_checks() · 58886785
      Committed by Nicolai Stange
      Since commit e73c23ff ("block: add async variant of
      blkdev_issue_zeroout") messages like the following show up:
      
        EXT4-fs (dm-1): Delayed block allocation failed for inode 2368848 at
                        logical offset 0 with max blocks 1 with error 95
        EXT4-fs (dm-1): This should not happen!! Data will be lost
      
      Due to the following fallthrough introduced with
      commit 2d253440 ("block: Define zoned block device operations"),
      generic_make_request_checks() would accept a REQ_OP_WRITE_SAME bio only
      if the block device supports "write same" *and* is a zoned one:
      
        switch (bio_op(bio)) {
        [...]
        case REQ_OP_WRITE_SAME:
              if (!bdev_write_same(bio->bi_bdev))
                      goto not_supported;
        case REQ_OP_ZONE_REPORT:
        case REQ_OP_ZONE_RESET:
              if (!bdev_is_zoned(bio->bi_bdev))
                      goto not_supported;
              break;
        [...]
        }
      
      Thus, although the bio setup as done by __blkdev_issue_write_same() from
      commit e73c23ff ("block: add async variant of blkdev_issue_zeroout")
      would succeed, its actual submission would not, resulting in the
      EOPNOTSUPP == 95.
      
      Fix this by removing the fallthrough which, due to the lack of an explicit
      comment, seems to be unintended anyway.
      
      Fixes: e73c23ff ("block: add async variant of blkdev_issue_zeroout")
      Fixes: 2d253440 ("block: Define zoned block device operations")
      Signed-off-by: Nicolai Stange <nicstange@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
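      Based on the description above, the fix presumably amounts to terminating
      the REQ_OP_WRITE_SAME case so it no longer falls through into the
      zoned-only checks, roughly:

        case REQ_OP_WRITE_SAME:
              if (!bdev_write_same(bio->bi_bdev))
                      goto not_supported;
              break;
        case REQ_OP_ZONE_REPORT:
        case REQ_OP_ZONE_RESET:
              if (!bdev_is_zoned(bio->bi_bdev))
                      goto not_supported;
              break;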
  14. 01 December 2016, 1 commit
  15. 22 November 2016, 1 commit
  16. 16 November 2016, 1 commit
    • block: deal with stale req count of plug list · 0a6219a9
      Committed by Ming Lei
      In both the legacy and mq paths, the req count of the plug list is
      computed before allocating a request, so the number can be stale if the
      allocation falls back to a sleeping allocation; the newly introduced wbt
      can sleep too.
      
      This patch deals with the case by checking whether the plug list has
      become empty, and fixes the 'BUG: KASAN: stack-out-of-bounds' report
      introduced by Shaohua's patches for dispatching big requests.
      
      Fixes: 600271d9 ("blk-mq: immediately dispatch big size request")
      Fixes: 50d24c34 ("block: immediately dispatch big size request")
      Cc: Shaohua Li <shli@fb.com>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
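      A hedged sketch of the check described above (illustrative, not the exact
      blk-core hunk): after an allocation that may have slept and flushed the
      plug, only trust the earlier count if the plug list is still non-empty:

        if (plug && list_empty(&plug->list))
                request_count = 0;      /* plug was flushed while we slept */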
  17. 12 November 2016, 1 commit
  18. 11 November 2016, 2 commits
    • block: hook up writeback throttling · 87760e5e
      Committed by Jens Axboe
      Enable throttling of buffered writeback to make it a lot smoother and
      to have far less impact on other system activity. Background writeback
      should be, by definition, background activity. The fact that we flush
      huge bundles of it at a time means that it can have a heavy impact on
      foreground workloads, which isn't ideal. We can't easily limit the sizes
      of the writes that we do, since that would impact file system layout in
      the presence of delayed allocation. So just throttle back buffered
      writeback, unless someone is waiting for it.
      
      The algorithm for when to throttle takes its inspiration from the
      CoDel network scheduling algorithm. Like CoDel, blk-wb monitors
      the minimum latencies of requests over a window of time. In that
      window of time, if the minimum latency of any request exceeds a
      given target, then a scale count is incremented and the queue depth
      is shrunk. The next monitoring window is shrunk accordingly. Unlike
      CoDel, if we hit a window that exhibits good behavior, then we
      simply increment the scale count and re-calculate the limits for that
      scale value. This prevents us from oscillating between a
      close-to-ideal value and max all the time, instead remaining in the
      windows where we get good behavior.
      
      Unlike CoDel, blk-wb allows the scale count to go negative. This
      happens if we primarily have writes going on. Unlike positive
      scale counts, this doesn't change the size of the monitoring window.
      When the heavy writers finish, blk-wb quickly snaps back to its
      stable state of a zero scale count.
      
      The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
      target to be met. It defaults to 2 msec for non-rotational storage, and
      75 msec for rotational storage. Setting this value to '0' disables
      blk-wb. Generally, a user would not have to touch this setting.
      
      We don't enable WBT on devices that are managed with CFQ, and have
      a non-root block cgroup attached. If we have a proportional share setup
      on this particular disk, then the wbt throttling will interfere with
      that. We don't have a strong need for wbt for that case, since we will
      rely on CFQ doing that for us.
      Signed-off-by: Jens Axboe <axboe@fb.com>
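      A hedged sketch of the default target selection described above (the
      helper name is illustrative; the values come straight from the commit
      text, in usecs to match 'wb_lat_usec'):

        static u64 my_default_wb_lat_usec(struct request_queue *q)
        {
                /* 2 msec for non-rotational storage, 75 msec for rotational */
                return blk_queue_nonrot(q) ? 2000 : 75000;
        }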
    • block: add scalable completion tracking of requests · cf43e6be
      Committed by Jens Axboe
      For legacy block, we simply track them in the request queue. For
      blk-mq, we track them on a per-sw queue basis, which we can then
      sum up through the hardware queues and finally to a per device
      state.
      
      The stats are tracked in, roughly, 0.1s interval windows.
      
      Add sysfs files to display the stats.
      
      The feature is off by default, to avoid any extra overhead. In-kernel
      users of it can turn it on by setting QUEUE_FLAG_STATS in the queue
      flags. We currently don't turn it on if someone just reads any of
      the stats files; that is something we could add as well.
      Signed-off-by: Jens Axboe <axboe@fb.com>
  19. 04 November 2016, 1 commit
    • block: immediately dispatch big size request · 50d24c34
      Committed by Shaohua Li
      Currently the block plug holds up to 16 non-mergeable requests. This
      makes sense if the request size is small, e.g., to reduce lock
      contention. But if the request size is big enough, we don't need to
      worry about lock contention. Holding such a request makes no sense and
      it lowers disk utilization.
      
      In practice, this improves 10% throughput for my raid5 sequential write
      workload.
      
      The size (128k) is arbitrary right now, but it makes sure lock
      contention stays small. This could probably be more intelligent, e.g.,
      checking the average size of the requests held. Since this is mainly for
      sequential IO, it's probably not worth it.
      
      V2: check the last request instead of the first request, so as long as
      there is one big size request we flush the plug.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
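      A hedged sketch of the heuristic described above (the constant name is
      illustrative): when plugging, flush immediately once the most recently
      queued request is already large:

        #define MY_PLUG_FLUSH_SIZE      (128 * 1024)    /* 128k, per the text */

        last = list_entry_rq(plug->list.prev);
        if (request_count >= BLK_MAX_REQUEST_COUNT ||
            blk_rq_bytes(last) >= MY_PLUG_FLUSH_SIZE)
                blk_flush_plug_list(plug, false);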
  20. 28 October 2016, 2 commits
    • block: better op and flags encoding · ef295ecf
      Committed by Christoph Hellwig
      Now that we don't need the common flags to overflow outside the range
      of a 32-bit type we can encode them the same way for both the bio and
      request fields.  This in addition allows us to place the operation
      first (and make some room for more ops while we're at it) and to
      stop having to shift around the operation values.
      
      In addition this allows passing around only one value in the block layer
      instead of two (and eventually also in the file systems, but we can do
      that later) and thus cleans up a lot of code.
      
      Last but not least this allows decreasing the size of the cmd_flags
      field in struct request to 32-bits.  Various functions passing this
      value could also be updated, but I'd like to avoid the churn for now.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
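      A hedged sketch of the encoding described above (the exact bit split is
      an assumption): the operation lives in the low bits of a single 32-bit
      value and the flags sit above it, so bio->bi_opf and the request's
      cmd_flags can be built and decoded the same way:

        #define MY_REQ_OP_BITS  8
        #define MY_REQ_OP_MASK  ((1 << MY_REQ_OP_BITS) - 1)

        static inline unsigned int my_op(unsigned int opf)
        {
                return opf & MY_REQ_OP_MASK;    /* REQ_OP_READ, REQ_OP_WRITE, ... */
        }

        static inline unsigned int my_op_flags(unsigned int opf)
        {
                return opf & ~MY_REQ_OP_MASK;   /* REQ_SYNC, REQ_META, ... */
        }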
    • block: split out request-only flags into a new namespace · e8064021
      Committed by Christoph Hellwig
      A lot of the REQ_* flags are only used on struct requests, and only of
      use to the block layer and a few drivers that dig into struct request
      internals.
      
      This patch adds a new req_flags_t rq_flags field to struct request for
      them, and thus dramatically shrinks the number of common request flags.  It
      also removes the unfortunate situation where we have to fit the fields
      from the same enum into 32 bits for struct bio and 64 bits for
      struct request.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
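      A hedged sketch of the split described above (the typedef is simplified
      and the RQF_* names are examples of request-only state): struct bio keeps
      the shared op/flags word, while struct request gains a separate field for
      flags only the block layer and a few drivers care about:

        typedef unsigned int req_flags_t;

        struct request {
                /* ... */
                unsigned int    cmd_flags;      /* op plus common REQ_* flags,
                                                   shared with struct bio */
                req_flags_t     rq_flags;       /* request-only RQF_* flags,
                                                   e.g. RQF_SORTED, RQF_STARTED */
                /* ... */
        };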