1. 18 Dec, 2016 1 commit
  2. 09 Dec, 2016 1 commit
    • C
      block: improve handling of the magic discard payload · f9d03f96
      Committed by Christoph Hellwig
      Instead of allocating a single unused biovec for discard requests, send
      them down without any payload.  Instead we allow the driver to add a
      "special" payload using a biovec embedded into struct request (unioned
      over other fields never used while in the driver), and overloading
      the number of segments for this case.
      
      This has a couple of advantages:
      
       - we don't have to allocate the bio_vec
       - the amount of special casing for discard requests in the block
         layer is significantly reduced
       - using this same scheme for other request types is trivial,
         which will be important for implementing the new WRITE_ZEROES
         op on devices where it actually requires a payload (e.g. SCSI)
       - we can get rid of playing games with the request length, as
         we'll never touch it and completions will work just fine
       - it will allow us to support ranged discard operations in the
         future by merging non-contiguous discard bios into a single
         request
       - last but not least it removes a lot of code
      
      This patch is the common base for my WIP series for ranged discards and to
      remove discard_zeroes_data in favor of always using REQ_OP_WRITE_ZEROES,
      so it would be good to get it in quickly.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      f9d03f96
  3. 01 Dec, 2016 2 commits
  4. 18 Nov, 2016 3 commits
    • T
      block: Change extern inline to static inline · 9a05e754
      Committed by Tobias Klauser
      With compilers which follow the C99 standard (like modern versions of
      gcc and clang), "extern inline" does the opposite thing from older
      versions of gcc (emits code for an externally linkable version of the
      inline function).
      
      "static inline" does the intended behavior in all cases instead.
      
      Description taken from commit 6d91857d ("staging, rtl8192e,
      LLVMLinux: Change extern inline to static inline").
      
      This also fixes the following GCC warning when building with CONFIG_PM
      disabled:
      
        ./include/linux/blkdev.h:1143:20: warning: no previous prototype for 'blk_set_runtime_active' [-Wmissing-prototypes]
      
      Fixes: d07ab6d1 ("block: Add blk_set_runtime_active()")
      Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
      Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      9a05e754
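The stub pattern described in this commit can be sketched as follows. This is a minimal illustration, not the kernel's code: `CONFIG_DEMO_PM`, `struct demo_queue`, and `demo_set_runtime_active()` are hypothetical names invented for the example.

```c
/* Stand-in for a request queue; hypothetical, for illustration only. */
struct demo_queue {
	int runtime_active;
};

#ifdef CONFIG_DEMO_PM
/* Feature enabled: only a prototype here, the definition lives in a .c file. */
void demo_set_runtime_active(struct demo_queue *q);
#else
/*
 * Feature disabled: an empty stub. "static inline" gives each translation
 * unit its own private copy, so no externally linkable symbol is emitted
 * and -Wmissing-prototypes has nothing to warn about.
 */
static inline void demo_set_runtime_active(struct demo_queue *q)
{
	(void)q; /* intentionally a no-op */
}
#endif
```

With the switch undefined, callers compile against the no-op stub and need no `#ifdef` of their own.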
    • J
      blk-mq: make the polling code adaptive · 64f1c21e
      Committed by Jens Axboe
      The previous commit introduced the hybrid sleep/poll mode. Take
      that one step further, and use the completion latencies to
      automatically sleep for half the mean completion time. This is
      a good approximation.
      
      This changes the 'io_poll_delay' sysfs file a bit to expose the
      various options. Depending on the value, the polling code will
      behave differently:
      
      -1	Never enter hybrid sleep mode
       0	Use half of the completion mean for the sleep delay
      >0	Use this specific value as the sleep delay
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Tested-By: Stephen Bates <sbates@raithlin.com>
      Reviewed-By: Stephen Bates <sbates@raithlin.com>
      64f1c21e
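The three `io_poll_delay` settings listed in this commit can be sketched as a small selection function. This is an illustration under stated assumptions, not the kernel's implementation: `demo_poll_sleep_usec()` is a hypothetical helper, and treating both the setting and the mean as microseconds is an assumption of the example.

```c
/*
 * Hypothetical helper: choose how long to sleep before starting to poll.
 *   delay < 0  -> never enter hybrid sleep mode (returns 0, poll right away)
 *   delay == 0 -> sleep for half of the mean completion time
 *   delay > 0  -> sleep for exactly 'delay'
 * Units are assumed to be microseconds throughout.
 */
static unsigned long demo_poll_sleep_usec(long delay, unsigned long mean_usec)
{
	if (delay < 0)
		return 0;
	if (delay == 0)
		return mean_usec / 2;
	return (unsigned long)delay;
}
```

With a mean completion time of 8 usecs and the setting at 0, the helper returns 4, which matches the "sleep for 4 usecs, then poll" example in the hybrid poll commit.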
    • J
      blk-mq: implement hybrid poll mode for sync O_DIRECT · 06426adf
      Committed by Jens Axboe
      This patch enables a hybrid polling mode. Instead of polling after IO
      submission, we can induce an artificial delay, and then poll after that.
      For example, if the IO is presumed to complete in 8 usecs from now, we
      can sleep for 4 usecs, wake up, and then do our polling. This still puts
      a sleep/wakeup cycle in the IO path, but instead of the wakeup happening
      after the IO has completed, it'll happen before. With this hybrid
      scheme, we can achieve big latency reductions while still using the same
      (or less) amount of CPU.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Tested-By: Stephen Bates <sbates@raithlin.com>
      Reviewed-By: Stephen Bates <sbates@raithlin.com>
      06426adf
  5. 12 Nov, 2016 1 commit
  6. 11 Nov, 2016 2 commits
    • J
      block: hook up writeback throttling · 87760e5e
      Committed by Jens Axboe
      Enable throttling of buffered writeback to make it a lot
      smoother, with far less impact on other system activity.
      Background writeback should be, by definition, background
      activity. The fact that we flush huge bundles of it at a time
      means that it potentially has heavy impacts on foreground workloads,
      which isn't ideal. We can't easily limit the sizes of writes that
      we do, since that would impact file system layout in the presence
      of delayed allocation. So just throttle back buffered writeback,
      unless someone is waiting for it.
      
      The algorithm for when to throttle takes its inspiration from the
      CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
      the minimum latencies of requests over a window of time. In that
      window of time, if the minimum latency of any request exceeds a
      given target, then a scale count is incremented and the queue depth
      is shrunk. The next monitoring window is shrunk accordingly. Unlike
      CoDel, if we hit a window that exhibits good behavior, then we
      simply increment the scale count and re-calculate the limits for that
      scale value. This prevents us from oscillating between a
      close-to-ideal value and max all the time, instead remaining in the
      windows where we get good behavior.
      
      Unlike CoDel, blk-wb allows the scale count to go negative. This
      happens if we primarily have writes going on. Unlike positive
      scale counts, this doesn't change the size of the monitoring window.
      When the heavy writers finish, blk-wb quickly snaps back to its
      stable state of a zero scale count.
      
      The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
      target to be met. It defaults to 2 msec for non-rotational storage, and
      75 msec for rotational storage. Setting this value to '0' disables
      blk-wb. Generally, a user would not have to touch this setting.
      
      We don't enable WBT on devices that are managed with CFQ, and have
      a non-root block cgroup attached. If we have a proportional share setup
      on this particular disk, then the wbt throttling will interfere with
      that. We don't have a strong need for wbt for that case, since we will
      rely on CFQ doing that for us.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      87760e5e
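Two pieces of the description above lend themselves to a small sketch: the default latency targets, and the per-window check of the minimum latency against the target. All names below are hypothetical, and the scale-count and queue-depth adjustments are deliberately left out because the commit message does not spell out their arithmetic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Defaults quoted in the commit message: 2 msec and 75 msec. */
#define DEMO_LAT_NONROT_USEC  2000u
#define DEMO_LAT_ROT_USEC    75000u

/* Pick the default latency target for a device type. */
static uint32_t demo_default_wb_lat_usec(bool rotational)
{
	return rotational ? DEMO_LAT_ROT_USEC : DEMO_LAT_NONROT_USEC;
}

/* A target of 0 disables throttling. */
static bool demo_wbt_enabled(uint32_t wb_lat_usec)
{
	return wb_lat_usec != 0;
}

/*
 * One monitoring window: returns true when even the fastest request in
 * the window was slower than the target, i.e. the minimum latency
 * exceeded it. An empty window never counts as exceeded.
 */
static bool demo_window_exceeded(const uint32_t *lat_usec, int n,
				 uint32_t target_usec)
{
	uint32_t min;
	int i;

	if (n <= 0)
		return false;
	min = lat_usec[0];
	for (i = 1; i < n; i++)
		if (lat_usec[i] < min)
			min = lat_usec[i];
	return min > target_usec;
}
```

Checking the minimum rather than the mean is the CoDel-style part: a single fast request in the window is enough to show that the device is still keeping up.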
    • J
      block: add scalable completion tracking of requests · cf43e6be
      Committed by Jens Axboe
      For legacy block, we simply track them in the request queue. For
      blk-mq, we track them on a per-sw queue basis, which we can then
      sum up through the hardware queues and finally to a per device
      state.
      
      The stats are tracked in, roughly, 0.1s interval windows.
      
      Add sysfs files to display the stats.
      
      The feature is off by default, to avoid any extra overhead. In-kernel
      users of it can turn it on by setting QUEUE_FLAG_STATS in the queue
      flags. We currently don't turn it on if someone just reads any of
      the stats files; that is something we could add as well.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      cf43e6be
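The "track per software queue, then sum up to a per-device state" idea can be sketched with a small mergeable statistics record. The struct and function names are hypothetical and the fields are an assumption about what such a record might hold; the 0.1s windowing is not modelled.

```c
#include <stdint.h>

/* Hypothetical per-queue completion statistics. */
struct demo_stat {
	uint64_t nr;   /* number of samples */
	uint64_t sum;  /* sum of all latencies */
	uint64_t min;  /* smallest latency seen */
	uint64_t max;  /* largest latency seen */
};

static void demo_stat_init(struct demo_stat *s)
{
	s->nr = 0;
	s->sum = 0;
	s->min = UINT64_MAX;
	s->max = 0;
}

/* Record one completion latency in a per-queue record. */
static void demo_stat_add(struct demo_stat *s, uint64_t lat)
{
	s->nr++;
	s->sum += lat;
	if (lat < s->min)
		s->min = lat;
	if (lat > s->max)
		s->max = lat;
}

/* Fold one per-queue record into an aggregate (e.g. per-device) record. */
static void demo_stat_merge(struct demo_stat *dst, const struct demo_stat *src)
{
	if (!src->nr)
		return; /* nothing recorded, keep dst untouched */
	dst->nr += src->nr;
	dst->sum += src->sum;
	if (src->min < dst->min)
		dst->min = src->min;
	if (src->max > dst->max)
		dst->max = src->max;
}

static uint64_t demo_stat_mean(const struct demo_stat *s)
{
	return s->nr ? s->sum / s->nr : 0;
}
```

Because each queue only ever touches its own record, the hot path needs no shared counters; the merge runs only when someone asks for the aggregate.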
  7. 06 Nov, 2016 1 commit
    • J
      block: add code to track actual device queue depth · d278d4a8
      Committed by Jens Axboe
      For blk-mq, ->nr_requests does track queue depth, at least at init
      time. But for the older queue paths, it's simply a soft setting.
      On top of that, it's generally larger than the hardware setting
      on purpose, to allow backup of requests for merging.
      
      Fill a hole in struct request with a 'queue_depth' member, that
      drivers can call to more closely inform the block layer of the
      real queue depth.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      d278d4a8
  8. 04 Nov, 2016 1 commit
    • S
      block: immediately dispatch big size request · 50d24c34
      Committed by Shaohua Li
      Currently block plug holds up to 16 non-mergeable requests. This makes
      sense if the request size is small, e.g. to reduce lock contention. But if
      the request size is big enough, we don't need to worry about lock
      contention. Holding such a request makes no sense and it lowers disk
      utilization.
      
      In practice, this improves throughput by 10% for my raid5 sequential write
      workload.
      
      The size (128k) is arbitrary right now, but it makes sure lock
      contention is small. This probably could be more intelligent, e.g. by checking
      the average size of the requests held. Since this is mainly for sequential IO,
      it is probably not worth it.
      
      V2: check the last request instead of the first request, so as long as
      there is one big size request we flush the plug.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      50d24c34
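The flush decision described in this commit might look roughly like the sketch below. The constant and function names are hypothetical, and whether the comparisons are inclusive is an assumption of the example; only the limits themselves (16 requests, 128k) come from the commit message.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_PLUG_MAX_REQUESTS 16u               /* plug capacity from the message */
#define DEMO_PLUG_FLUSH_BYTES  (128u * 1024u)    /* the "arbitrary" 128k threshold */

/*
 * Decide whether the plug should be flushed now. Per the V2 note, the
 * size check looks at the most recently added request, so a single big
 * request is enough to trigger a flush.
 */
static bool demo_plug_should_flush(unsigned int held, size_t last_rq_bytes)
{
	return held >= DEMO_PLUG_MAX_REQUESTS ||
	       last_rq_bytes >= DEMO_PLUG_FLUSH_BYTES;
}
```

Small requests keep batching until the plug fills up, while one large request is dispatched immediately.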
  9. 03 Nov, 2016 1 commit
    • B
      blk-mq: Introduce blk_mq_quiesce_queue() · 6a83e74d
      Committed by Bart Van Assche
      blk_mq_quiesce_queue() waits until ongoing .queue_rq() invocations
      have finished. This function does *not* wait until all outstanding
      requests have finished (this means invocation of request.end_io()).
      The algorithm used by blk_mq_quiesce_queue() is as follows:
      * Hold either an RCU read lock or an SRCU read lock around
        .queue_rq() calls. The former is used if .queue_rq() does not
        block and the latter if .queue_rq() may block.
      * blk_mq_quiesce_queue() first calls blk_mq_stop_hw_queues()
        followed by synchronize_srcu() or synchronize_rcu(). The latter
        call waits for .queue_rq() invocations that started before
        blk_mq_quiesce_queue() was called.
      * The blk_mq_hctx_stopped() calls that control whether or not
        .queue_rq() will be called are called with the (S)RCU read lock
        held. This is necessary to avoid race conditions against
        blk_mq_quiesce_queue().
      Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Ming Lei <tom.leiming@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      6a83e74d
  10. 28 Oct, 2016 2 commits
    • C
      block: better op and flags encoding · ef295ecf
      Committed by Christoph Hellwig
      Now that we don't need the common flags to overflow outside the range
      of a 32-bit type we can encode them the same way for both the bio and
      request fields.  This in addition allows us to place the operation
      first (and make some room for more ops while we're at it) and to
      stop having to shift around the operation values.
      
      In addition this allows passing around only one value in the block layer
      instead of two (and eventually also in the file systems, but we can do
      that later) and thus clean up a lot of code.
      
      Last but not least this allows decreasing the size of the cmd_flags
      field in struct request to 32-bits.  Various functions passing this
      value could also be updated, but I'd like to avoid the churn for now.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      ef295ecf
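Placing the operation first and the flags above it in one 32-bit value can be sketched as follows. The bit widths, operation numbers, and every `DEMO_*` name are assumptions made for illustration, not the kernel's definitions.

```c
#include <stdint.h>

/* Assumed layout: the low 8 bits hold the operation, the rest hold flags. */
#define DEMO_OP_BITS  8u
#define DEMO_OP_MASK  ((1u << DEMO_OP_BITS) - 1u)
#define DEMO_FLAG(n)  (1u << (DEMO_OP_BITS + (n)))

/* Hypothetical operations; they are plain small integers, no shifting needed. */
enum demo_op {
	DEMO_OP_READ    = 0,
	DEMO_OP_WRITE   = 1,
	DEMO_OP_DISCARD = 3,
};

/* Hypothetical flags, each a single bit above the operation field. */
#define DEMO_SYNC  DEMO_FLAG(0)
#define DEMO_META  DEMO_FLAG(1)

/* Extract the operation from a combined value. */
static uint32_t demo_op(uint32_t v)
{
	return v & DEMO_OP_MASK;
}

/* Extract only the flag bits from a combined value. */
static uint32_t demo_flags(uint32_t v)
{
	return v & ~DEMO_OP_MASK;
}
```

A caller builds one value such as `DEMO_OP_WRITE | DEMO_SYNC` and passes it around as a single argument; the operation is recovered with a mask rather than a shift.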
    • C
      block: split out request-only flags into a new namespace · e8064021
      Committed by Christoph Hellwig
      A lot of the REQ_* flags are only used on struct requests, and only of
      use to the block layer and a few drivers that dig into struct request
      internals.
      
      This patch adds a new req_flags_t rq_flags field to struct request for
      them, and thus dramatically shrinks the number of common request flags.  It
      also removes the unfortunate situation where we have to fit the fields
      from the same enum into 32 bits for struct bio and 64 bits for
      struct request.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      e8064021
  11. 20 Oct, 2016 1 commit
    • A
      block: Add iocontext priority to request · 5dc8b362
      Committed by Adam Manzanares
      This patch adds an association between the iocontext ioprio and the ioprio of a
      request. This is done to give request-based drivers the ability to
      act on priority information stored in the request. An example is
      ATA devices that support command priorities. If the ATA driver discovers
      that the device supports command priorities and the request has valid
      priority information indicating the request is high priority, then a high
      priority command can be sent to the device. This should improve tail
      latencies for high priority IO on any device that queues requests
      internally and can make use of the priority information stored in the
      request.
      
      The ioprio of the request is set in blk_rq_set_prio which takes the
      request and the ioc as arguments. If the ioc is valid in blk_rq_set_prio
      then the iopriority of the request is set as the iopriority of the ioc.
      In init_request_from_bio a check is made to see if the ioprio of the bio
      is valid and if so then the request prio comes from the bio.
      Signed-off-by: Adam Manzananares <adam.manzanares@wdc.com>
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      5dc8b362
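The two priority sources described in this commit can be sketched as below. The `demo_*` types and functions are hypothetical stand-ins named after the functions in the message; treating "the ioc is valid" as a non-NULL pointer and using 0 for "no valid priority" are assumptions of the example.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_IOPRIO_NONE 0 /* assumption: 0 means "no valid priority" */

struct demo_ioc     { unsigned short ioprio; };
struct demo_bio     { unsigned short ioprio; };
struct demo_request { unsigned short ioprio; };

static bool demo_ioprio_valid(unsigned short prio)
{
	return prio != DEMO_IOPRIO_NONE;
}

/* If an io context is present, the request inherits its priority. */
static void demo_rq_set_prio(struct demo_request *rq,
			     const struct demo_ioc *ioc)
{
	if (ioc)
		rq->ioprio = ioc->ioprio;
}

/* If the bio carries a valid priority, the request takes it from the bio. */
static void demo_init_request_from_bio(struct demo_request *rq,
				       const struct demo_bio *bio)
{
	if (demo_ioprio_valid(bio->ioprio))
		rq->ioprio = bio->ioprio;
}
```

A driver that supports command priorities would then only need to read `rq->ioprio`, regardless of which source set it.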
  12. 19 Oct, 2016 3 commits
  13. 15 Sep, 2016 1 commit
    • M
      blk-mq: introduce blk_mq_delay_kick_requeue_list() · 2849450a
      Committed by Mike Snitzer
      blk_mq_delay_kick_requeue_list() provides the ability to kick the
      q->requeue_list after a specified time.  To do this the request_queue's
      'requeue_work' member was changed to a delayed_work.
      
      blk_mq_delay_kick_requeue_list() allows DM to defer processing requeued
      requests while it doesn't make sense to immediately requeue them
      (e.g. when all paths in a DM multipath have failed).
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      2849450a
  14. 29 Aug, 2016 1 commit
  15. 16 Aug, 2016 1 commit
  16. 08 Aug, 2016 1 commit
  17. 05 Aug, 2016 2 commits
  18. 21 Jul, 2016 5 commits
  19. 13 Jul, 2016 1 commit
    • D
      pmem: kill __pmem address space · 7a9eb206
      Committed by Dan Williams
      The __pmem address space was meant to annotate codepaths that touch
      persistent memory and need to coordinate a call to wmb_pmem().  Now that
      wmb_pmem() is gone, there is little need to keep this annotation.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      7a9eb206
  20. 28 Jun, 2016 1 commit
    • J
      block: Convert fifo_time from ulong to u64 · 9828c2c6
      Committed by Jan Kara
      Currently rq->fifo_time is unsigned long but CFQ stores a nanosecond
      timestamp in it, which would overflow on 32-bit archs. Convert it to u64
      to avoid the overflow. Since rq->fifo_time is unioned with struct
      call_single_data, this does not change the size of struct request in
      any way.
      
      We have to slightly fixup block/deadline-iosched.c so that comparison
      happens in the right types.
      
      Fixes: 9a7f38c4
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      9828c2c6
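The overflow this commit fixes is easy to reproduce in isolation. The sketch below uses `uint32_t` as a stand-in for `unsigned long` on a 32-bit architecture; the helper names are hypothetical.

```c
#include <stdint.h>

#define DEMO_NSEC_PER_SEC 1000000000ull

/* What a 32-bit unsigned long keeps of a nanosecond timestamp. */
static uint32_t demo_store_u32(uint64_t ns)
{
	return (uint32_t)ns; /* silently drops the upper 32 bits */
}

/* What a u64 field keeps: the full value. */
static uint64_t demo_store_u64(uint64_t ns)
{
	return ns;
}
```

A 32-bit field wraps once the timestamp passes 2^32 nanoseconds, which is roughly 4.29 seconds, so a timestamp of 5 seconds comes back as 705032704 ns from `demo_store_u32()` while `demo_store_u64()` preserves it.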
  21. 09 Jun, 2016 2 commits
  22. 08 Jun, 2016 6 commits