1. 20 4月, 2017 1 次提交
  2. 19 4月, 2017 1 次提交
    • A
      block, bfq: add full hierarchical scheduling and cgroups support · e21b7a0b
      Arianna Avanzini 提交于
      Add complete support for full hierarchical scheduling, with a cgroups
      interface. Full hierarchical scheduling is implemented through the
      'entity' abstraction: both bfq_queues, i.e., the internal BFQ queues
      associated with processes, and groups are represented in general by
      entities. Given the bfq_queues associated with the processes belonging
      to a given group, the entities representing these queues are sons of
      the entity representing the group. At higher levels, if a group, say
      G, contains other groups, then the entity representing G is the parent
      entity of the entities representing the groups in G.
      
      Hierarchical scheduling is performed as follows: if the timestamps of
      a leaf entity (i.e., of a bfq_queue) change, and such a change lets
      the entity become the next-to-serve entity for its parent entity, then
      the timestamps of the parent entity are recomputed as a function of
      the budget of its new next-to-serve leaf entity. If the parent entity
      belongs, in its turn, to a group, and its new timestamps let it become
      the next-to-serve for its parent entity, then the timestamps of the
      latter parent entity are recomputed as well, and so on. When a new
      bfq_queue must be set in service, the reverse path is followed: the
      next-to-serve highest-level entity is chosen, then its next-to-serve
      child entity, and so on, until the next-to-serve leaf entity is
      reached, and the bfq_queue that this entity represents is set in
      service.
      
      Writeback is accounted for on a per-group basis, i.e., for each group,
      the async I/O requests of the processes of the group are enqueued in a
      distinct bfq_queue, and the entity associated with this queue is a
      child of the entity associated with the group.
      
      Weights can be assigned explicitly to groups and processes through the
      cgroups interface, differently from what happens, for single
      processes, if the cgroups interface is not used (as explained in the
      description of the previous patch). In particular, since each node has
      a full scheduler, each group can be assigned its own weight.
      Signed-off-by: NFabio Checconi <fchecconi@gmail.com>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NArianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      e21b7a0b
  3. 09 4月, 2017 3 次提交
  4. 08 4月, 2017 1 次提交
  5. 06 4月, 2017 2 次提交
  6. 29 3月, 2017 1 次提交
  7. 22 3月, 2017 1 次提交
    • O
      blk-stat: convert to callback-based statistics reporting · 34dbad5d
      Omar Sandoval 提交于
      Currently, statistics are gathered in ~0.13s windows, and users grab the
      statistics whenever they need them. This is not ideal for both in-tree
      users:
      
      1. Writeback throttling wants its own dynamically sized window of
         statistics. Since the blk-stats statistics are reset after every
         window and the wbt windows don't line up with the blk-stats windows,
         wbt doesn't see every I/O.
      2. Polling currently grabs the statistics on every I/O. Again, depending
         on how the window lines up, we may miss some I/Os. It's also
         unnecessary overhead to get the statistics on every I/O; the hybrid
         polling heuristic would be just as happy with the statistics from the
         previous full window.
      
      This reworks the blk-stats infrastructure to be callback-based: users
      register a callback that they want called at a given time with all of
      the statistics from the window during which the callback was active.
      Users can dynamically bucketize the statistics. wbt and polling both
      currently use read vs. write, but polling can be extended to further
      subdivide based on request size.
      
      The callbacks are kept on an RCU list, and each callback has percpu
      stats buffers. There will only be a few users, so the overhead on the
      I/O completion side is low. The stats flushing is also simplified
      considerably: since the timer function is responsible for clearing the
      statistics, we don't have to worry about stale statistics.
      
      wbt is a trivial conversion. After the conversion, the windowing problem
      mentioned above is fixed.
      
      For polling, we register an extra callback that caches the previous
      window's statistics in the struct request_queue for the hybrid polling
      heuristic to use.
      
      Since we no longer have a single stats buffer for the request queue,
      this also removes the sysfs and debugfs stats entries. To replace those,
      we add a debugfs entry for the poll statistics.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      34dbad5d
  8. 09 3月, 2017 1 次提交
  9. 02 3月, 2017 1 次提交
  10. 09 2月, 2017 1 次提交
  11. 03 2月, 2017 1 次提交
  12. 02 2月, 2017 4 次提交
  13. 01 2月, 2017 3 次提交
  14. 28 1月, 2017 3 次提交
  15. 27 1月, 2017 2 次提交
  16. 18 1月, 2017 1 次提交
  17. 14 1月, 2017 1 次提交
  18. 12 1月, 2017 3 次提交
  19. 18 12月, 2016 1 次提交
  20. 09 12月, 2016 1 次提交
    • C
      block: improve handling of the magic discard payload · f9d03f96
      Christoph Hellwig 提交于
      Instead of allocating a single unused biovec for discard requests, send
      them down without any payload.  Instead we allow the driver to add a
      "special" payload using a biovec embedded into struct request (unioned
      over other fields never used while in the driver), and overloading
      the number of segments for this case.
      
      This has a couple of advantages:
      
       - we don't have to allocate the bio_vec
       - the amount of special casing for discard requests in the block
         layer is significantly reduced
       - using this same scheme for other request types is trivial,
         which will be important for implementing the new WRITE_ZEROES
         op on devices where it actually requires a payload (e.g. SCSI)
       - we can get rid of playing games with the request length, as
         we'll never touch it and completions will work just fine
       - it will allow us to support ranged discard operations in the
         future by merging non-contiguous discard bios into a single
         request
       - last but not least it removes a lot of code
      
      This patch is the common base for my WIP series for ranges discards and to
      remove discard_zeroes_data in favor of always using REQ_OP_WRITE_ZEROES,
      so it would be good to get it in quickly.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      f9d03f96
  21. 01 12月, 2016 2 次提交
  22. 18 11月, 2016 3 次提交
    • T
      block: Change extern inline to static inline · 9a05e754
      Tobias Klauser 提交于
      With compilers which follow the C99 standard (like modern versions of
      gcc and clang), "extern inline" does the opposite thing from older
      versions of gcc (emits code for an externally linkable version of the
      inline function).
      
      "static inline" does the intended behavior in all cases instead.
      
      Description taken from commit 6d91857d ("staging, rtl8192e,
      LLVMLinux: Change extern inline to static inline").
      
      This also fixes the following GCC warning when building with CONFIG_PM
      disabled:
      
        ./include/linux/blkdev.h:1143:20: warning: no previous prototype for 'blk_set_runtime_active' [-Wmissing-prototypes]
      
      Fixes: d07ab6d1 ("block: Add blk_set_runtime_active()")
      Reviewed-by: NMika Westerberg <mika.westerberg@linux.intel.com>
      Signed-off-by: NTobias Klauser <tklauser@distanz.ch>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      9a05e754
    • J
      blk-mq: make the polling code adaptive · 64f1c21e
      Jens Axboe 提交于
      The previous commit introduced the hybrid sleep/poll mode. Take
      that one step further, and use the completion latencies to
      automatically sleep for half the mean completion time. This is
      a good approximation.
      
      This changes the 'io_poll_delay' sysfs file a bit to expose the
      various options. Depending on the value, the polling code will
      behave differently:
      
      -1	Never enter hybrid sleep mode
       0	Use half of the completion mean for the sleep delay
      >0	Use this specific value as the sleep delay
      Signed-off-by: NJens Axboe <axboe@fb.com>
      Tested-By: NStephen Bates <sbates@raithlin.com>
      Reviewed-By: NStephen Bates <sbates@raithlin.com>
      64f1c21e
    • J
      blk-mq: implement hybrid poll mode for sync O_DIRECT · 06426adf
      Jens Axboe 提交于
      This patch enables a hybrid polling mode. Instead of polling after IO
      submission, we can induce an artificial delay, and then poll after that.
      For example, if the IO is presumed to complete in 8 usecs from now, we
      can sleep for 4 usecs, wake up, and then do our polling. This still puts
      a sleep/wakeup cycle in the IO path, but instead of the wakeup happening
      after the IO has completed, it'll happen before. With this hybrid
      scheme, we can achieve big latency reductions while still using the same
      (or less) amount of CPU.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      Tested-By: NStephen Bates <sbates@raithlin.com>
      Reviewed-By: NStephen Bates <sbates@raithlin.com>
      06426adf
  23. 12 11月, 2016 1 次提交
  24. 11 11月, 2016 1 次提交
    • J
      block: hook up writeback throttling · 87760e5e
      Jens Axboe 提交于
      Enable throttling of buffered writeback to make it a lot
      more smooth, and has way less impact on other system activity.
      Background writeback should be, by definition, background
      activity. The fact that we flush huge bundles of it at the time
      means that it potentially has heavy impacts on foreground workloads,
      which isn't ideal. We can't easily limit the sizes of writes that
      we do, since that would impact file system layout in the presence
      of delayed allocation. So just throttle back buffered writeback,
      unless someone is waiting for it.
      
      The algorithm for when to throttle takes its inspiration in the
      CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
      the minimum latencies of requests over a window of time. In that
      window of time, if the minimum latency of any request exceeds a
      given target, then a scale count is incremented and the queue depth
      is shrunk. The next monitoring window is shrunk accordingly. Unlike
      CoDel, if we hit a window that exhibits good behavior, then we
      simply increment the scale count and re-calculate the limits for that
      scale value. This prevents us from oscillating between a
      close-to-ideal value and max all the time, instead remaining in the
      windows where we get good behavior.
      
      Unlike CoDel, blk-wb allows the scale count to to negative. This
      happens if we primarily have writes going on. Unlike positive
      scale counts, this doesn't change the size of the monitoring window.
      When the heavy writers finish, blk-bw quickly snaps back to it's
      stable state of a zero scale count.
      
      The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
      target to me met. It defaults to 2 msec for non-rotational storage, and
      75 msec for rotational storage. Setting this value to '0' disables
      blk-wb. Generally, a user would not have to touch this setting.
      
      We don't enable WBT on devices that are managed with CFQ, and have
      a non-root block cgroup attached. If we have a proportional share setup
      on this particular disk, then the wbt throttling will interfere with
      that. We don't have a strong need for wbt for that case, since we will
      rely on CFQ doing that for us.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      87760e5e