1. 11 11月, 2016 2 次提交
    • J
      block: hook up writeback throttling · 87760e5e
      Jens Axboe 提交于
      Enable throttling of buffered writeback to make it a lot
      more smooth, and has way less impact on other system activity.
      Background writeback should be, by definition, background
      activity. The fact that we flush huge bundles of it at the time
      means that it potentially has heavy impacts on foreground workloads,
      which isn't ideal. We can't easily limit the sizes of writes that
      we do, since that would impact file system layout in the presence
      of delayed allocation. So just throttle back buffered writeback,
      unless someone is waiting for it.
      
      The algorithm for when to throttle takes its inspiration in the
      CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
      the minimum latencies of requests over a window of time. In that
      window of time, if the minimum latency of any request exceeds a
      given target, then a scale count is incremented and the queue depth
      is shrunk. The next monitoring window is shrunk accordingly. Unlike
      CoDel, if we hit a window that exhibits good behavior, then we
      simply increment the scale count and re-calculate the limits for that
      scale value. This prevents us from oscillating between a
      close-to-ideal value and max all the time, instead remaining in the
      windows where we get good behavior.
      
      Unlike CoDel, blk-wb allows the scale count to to negative. This
      happens if we primarily have writes going on. Unlike positive
      scale counts, this doesn't change the size of the monitoring window.
      When the heavy writers finish, blk-bw quickly snaps back to it's
      stable state of a zero scale count.
      
      The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
      target to me met. It defaults to 2 msec for non-rotational storage, and
      75 msec for rotational storage. Setting this value to '0' disables
      blk-wb. Generally, a user would not have to touch this setting.
      
      We don't enable WBT on devices that are managed with CFQ, and have
      a non-root block cgroup attached. If we have a proportional share setup
      on this particular disk, then the wbt throttling will interfere with
      that. We don't have a strong need for wbt for that case, since we will
      rely on CFQ doing that for us.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      87760e5e
    • J
      block: add scalable completion tracking of requests · cf43e6be
      Jens Axboe 提交于
      For legacy block, we simply track them in the request queue. For
      blk-mq, we track them on a per-sw queue basis, which we can then
      sum up through the hardware queues and finally to a per device
      state.
      
      The stats are tracked in, roughly, 0.1s interval windows.
      
      Add sysfs files to display the stats.
      
      The feature is off by default, to avoid any extra overhead. In-kernel
      users of it can turn it on by setting QUEUE_FLAG_STATS in the queue
      flags. We currently don't turn it on if someone just reads any of
      the stats files, that is something we could add as well.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      cf43e6be
  2. 06 11月, 2016 1 次提交
    • J
      block: add code to track actual device queue depth · d278d4a8
      Jens Axboe 提交于
      For blk-mq, ->nr_requests does track queue depth, at least at init
      time. But for the older queue paths, it's simply a soft setting.
      On top of that, it's generally larger than the hardware setting
      on purpose, to allow backup of requests for merging.
      
      Fill a hole in struct request with a 'queue_depth' member, that
      drivers can call to more closely inform the block layer of the
      real queue depth.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      d278d4a8
  3. 04 11月, 2016 1 次提交
    • S
      block: immediately dispatch big size request · 50d24c34
      Shaohua Li 提交于
      Currently block plug holds up to 16 non-mergeable requests. This makes
      sense if the request size is small, eg, reduce lock contention. But if
      request size is big enough, we don't need to worry about lock
      contention. Holding such request makes no sense and it lows the disk
      utilization.
      
      In practice, this improves 10% throughput for my raid5 sequential write
      workload.
      
      The size (128k) is arbitrary right now, but it makes sure lock
      contention is small. This probably could be more intelligent, eg, check
      average request size holded. Since this is mainly for sequential IO,
      probably not worthy.
      
      V2: check the last request instead of the first request, so as long as
      there is one big size request we flush the plug.
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      50d24c34
  4. 03 11月, 2016 1 次提交
    • B
      blk-mq: Introduce blk_mq_quiesce_queue() · 6a83e74d
      Bart Van Assche 提交于
      blk_mq_quiesce_queue() waits until ongoing .queue_rq() invocations
      have finished. This function does *not* wait until all outstanding
      requests have finished (this means invocation of request.end_io()).
      The algorithm used by blk_mq_quiesce_queue() is as follows:
      * Hold either an RCU read lock or an SRCU read lock around
        .queue_rq() calls. The former is used if .queue_rq() does not
        block and the latter if .queue_rq() may block.
      * blk_mq_quiesce_queue() first calls blk_mq_stop_hw_queues()
        followed by synchronize_srcu() or synchronize_rcu(). The latter
        call waits for .queue_rq() invocations that started before
        blk_mq_quiesce_queue() was called.
      * The blk_mq_hctx_stopped() calls that control whether or not
        .queue_rq() will be called are called with the (S)RCU read lock
        held. This is necessary to avoid race conditions against
        blk_mq_quiesce_queue().
      Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
      Reviewed-by: NMing Lei <tom.leiming@gmail.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      6a83e74d
  5. 28 10月, 2016 2 次提交
    • C
      block: better op and flags encoding · ef295ecf
      Christoph Hellwig 提交于
      Now that we don't need the common flags to overflow outside the range
      of a 32-bit type we can encode them the same way for both the bio and
      request fields.  This in addition allows us to place the operation
      first (and make some room for more ops while we're at it) and to
      stop having to shift around the operation values.
      
      In addition this allows passing around only one value in the block layer
      instead of two (and eventuall also in the file systems, but we can do
      that later) and thus clean up a lot of code.
      
      Last but not least this allows decreasing the size of the cmd_flags
      field in struct request to 32-bits.  Various functions passing this
      value could also be updated, but I'd like to avoid the churn for now.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      ef295ecf
    • C
      block: split out request-only flags into a new namespace · e8064021
      Christoph Hellwig 提交于
      A lot of the REQ_* flags are only used on struct requests, and only of
      use to the block layer and a few drivers that dig into struct request
      internals.
      
      This patch adds a new req_flags_t rq_flags field to struct request for
      them, and thus dramatically shrinks the number of common requests.  It
      also removes the unfortunate situation where we have to fit the fields
      from the same enum into 32 bits for struct bio and 64 bits for
      struct request.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NShaun Tancheff <shaun.tancheff@seagate.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      e8064021
  6. 19 10月, 2016 3 次提交
  7. 15 9月, 2016 1 次提交
    • M
      blk-mq: introduce blk_mq_delay_kick_requeue_list() · 2849450a
      Mike Snitzer 提交于
      blk_mq_delay_kick_requeue_list() provides the ability to kick the
      q->requeue_list after a specified time.  To do this the request_queue's
      'requeue_work' member was changed to a delayed_work.
      
      blk_mq_delay_kick_requeue_list() allows DM to defer processing requeued
      requests while it doesn't make sense to immediately requeue them
      (e.g. when all paths in a DM multipath have failed).
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      2849450a
  8. 29 8月, 2016 1 次提交
  9. 16 8月, 2016 1 次提交
  10. 08 8月, 2016 1 次提交
  11. 05 8月, 2016 2 次提交
  12. 21 7月, 2016 5 次提交
  13. 13 7月, 2016 1 次提交
    • D
      pmem: kill __pmem address space · 7a9eb206
      Dan Williams 提交于
      The __pmem address space was meant to annotate codepaths that touch
      persistent memory and need to coordinate a call to wmb_pmem().  Now that
      wmb_pmem() is gone, there is little need to keep this annotation.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      7a9eb206
  14. 28 6月, 2016 1 次提交
    • J
      block: Convert fifo_time from ulong to u64 · 9828c2c6
      Jan Kara 提交于
      Currently rq->fifo_time is unsigned long but CFQ stores nanosecond
      timestamp in it which would overflow on 32-bit archs. Convert it to u64
      to avoid the overflow. Since the rq->fifo_time is unioned with struct
      call_single_data(), this does not change the size of struct request in
      any way.
      
      We have to slightly fixup block/deadline-iosched.c so that comparison
      happens in the right types.
      
      Fixes: 9a7f38c4Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      9828c2c6
  15. 09 6月, 2016 2 次提交
  16. 08 6月, 2016 6 次提交
  17. 19 5月, 2016 1 次提交
  18. 17 5月, 2016 3 次提交
    • T
      block: Update blkdev_dax_capable() for consistency · a8078b1f
      Toshi Kani 提交于
      blkdev_dax_capable() is similar to bdev_dax_supported(), but needs
      to remain as a separate interface for checking dax capability of
      a raw block device.
      
      Rename and relocate blkdev_dax_capable() to keep them maintained
      consistently, and call bdev_direct_access() for the dax capability
      check.
      
      There is no change in the behavior.
      
      Link: https://lkml.org/lkml/2016/5/9/950Signed-off-by: NToshi Kani <toshi.kani@hpe.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Signed-off-by: NVishal Verma <vishal.l.verma@intel.com>
      a8078b1f
    • T
      block: Add bdev_dax_supported() for dax mount checks · 2d96afc8
      Toshi Kani 提交于
      DAX imposes additional requirements to a device.  Add
      bdev_dax_supported() which performs all the precondition checks
      necessary for filesystem to mount the device with dax option.
      
      Also add a new check to verify if a partition is aligned by 4KB.
      When a partition is unaligned, any dax read/write access fails,
      except for metadata update.
      Signed-off-by: NToshi Kani <toshi.kani@hpe.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Signed-off-by: NVishal Verma <vishal.l.verma@intel.com>
      2d96afc8
    • T
      block: Add vfs_msg() interface · 2af3a815
      Toshi Kani 提交于
      In preparation of moving DAX capability checks to the block layer
      from filesystem code, add a VFS message interface that aligns with
      filesystem's message format.
      
      For instance, a vfs_msg() message followed by XFS messages in case
      of a dax mount error may look like:
      
        VFS (pmem0p1): error: unaligned partition for dax
        XFS (pmem0p1): DAX unsupported by block device. Turning off DAX.
        XFS (pmem0p1): Mounting V5 Filesystem
         :
      
      vfs_msg() is largely based on ext4_msg().
      Signed-off-by: NToshi Kani <toshi.kani@hpe.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Signed-off-by: NVishal Verma <vishal.l.verma@intel.com>
      2af3a815
  19. 02 5月, 2016 1 次提交
  20. 14 4月, 2016 1 次提交
  21. 13 4月, 2016 3 次提交