提交 · cb365b9675fda026caba4cb5df83292cb7c0811a · openanolis / cloud-kernel

09 4月, 2017 2 次提交

block: add a new BLKDEV_ZERO_NOFALLBACK flag · cb365b96

由 Christoph Hellwig 提交于 4月 05, 2017

This avoids fallbacks to explicit zeroing in (__)blkdev_issue_zeroout if
the caller doesn't want them.

Also clean up the convoluted check for the return condition that this
new flag is added to.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

cb365b96

block: add a flags argument to (__)blkdev_issue_zeroout · ee472d83

由 Christoph Hellwig 提交于 4月 05, 2017

Turn the existing discard flag into a new BLKDEV_ZERO_UNMAP flag with
similar semantics, but without referring to diѕcard.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

ee472d83

08 4月, 2017 1 次提交

blk-mq: Restart a single queue if tag sets are shared · 6d8c6c0f

由 Bart Van Assche 提交于 4月 07, 2017

To improve scalability, if hardware queues are shared, restart
a single hardware queue in round-robin fashion. Rename
blk_mq_sched_restart_queues() to reflect the new semantics.
Remove blk_mq_sched_mark_restart_queue() because this function
has no callers. Remove flag QUEUE_FLAG_RESTART because this
patch removes the code that uses this flag.
Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

6d8c6c0f

06 4月, 2017 2 次提交

block: move timeout field in struct request to pack better · 1dd5198b

由 Jens Axboe 提交于 4月 05, 2017

After commit 64c7f1d1, we went from 1 to 2 holes in my
test setup. If we move the timeout field a bit, we remove
both of those holes and shrink struct request by 8 bytes.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

1dd5198b

block, scsi: move the retries field to struct scsi_request · 64c7f1d1

由 Christoph Hellwig 提交于 4月 05, 2017

Instead of bloating the generic struct request with it.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NJens Axboe <axboe@fb.com>

64c7f1d1

29 3月, 2017 1 次提交

block: warn if sharing request queue across gendisks · 334335d2

由 Omar Sandoval 提交于 3月 28, 2017

Now that the remaining drivers have been converted to one request queue
per gendisk, let's warn if a request queue gets registered more than
once. This will catch future drivers which might do it inadvertently or
any old drivers that I may have missed.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

334335d2

22 3月, 2017 1 次提交

blk-stat: convert to callback-based statistics reporting · 34dbad5d

由 Omar Sandoval 提交于 3月 21, 2017

Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:

1. Writeback throttling wants its own dynamically sized window of
   statistics. Since the blk-stats statistics are reset after every
   window and the wbt windows don't line up with the blk-stats windows,
   wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
   on how the window lines up, we may miss some I/Os. It's also
   unnecessary overhead to get the statistics on every I/O; the hybrid
   polling heuristic would be just as happy with the statistics from the
   previous full window.

This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.

The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.

wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.

For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.

Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

34dbad5d

09 3月, 2017 1 次提交

Revert "scsi, block: fix duplicate bdi name registration crashes" · c01228db

由 Jan Kara 提交于 3月 08, 2017

This reverts commit 0dba1314. It causes
leaking of device numbers for SCSI when SCSI registers multiple gendisks
for one request_queue in succession. It can be easily reproduced using
Omar's script [1] on kernel with CONFIG_DEBUG_TEST_DRIVER_REMOVE.
Furthermore the protection provided by this commit is not needed anymore
as the problem it was fixing got also fixed by commit 165a5e22
"block: Move bdi_unregister() to del_gendisk()".

[1]: http://marc.info/?l=linux-block&m=148554717109098&w=2Signed-off-by: NJan Kara <jack@suse.cz>
Acked-by: NDan Williams <dan.j.williams@intel.com>
Tested-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

c01228db

02 3月, 2017 1 次提交

sched/headers: Prepare for new header dependencies before moving code to <linux/sched/clock.h> · e6017571

由 Ingo Molnar 提交于 2月 01, 2017

We are going to split <linux/sched/clock.h> out of <linux/sched.h>, which
will have to be picked up from other headers and .c files.

Create a trivial placeholder <linux/sched/clock.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.
Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: NIngo Molnar <mingo@kernel.org>

e6017571

09 2月, 2017 1 次提交

block: optionally merge discontiguous discard bios into a single request · 1e739730

由 Christoph Hellwig 提交于 2月 08, 2017

Add a new merge strategy that merges discard bios into a request until the
maximum number of discard ranges (or the maximum discard size) is reached
from the plug merging code.  I/O scheduler merging is not wired up yet
but might also be useful, although not for fast devices like NVMe which
are the only user for now.

Note that for now we don't support limiting the size of each discard range,
but if needed that can be added later.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

1e739730

03 2月, 2017 1 次提交

block: fix debugfs config conditional in struct request_queue · 03796c14

由 Omar Sandoval 提交于 1月 31, 2017

The debugfs dentries are only used for CONFIG_BLK_DEBUG_FS, so make them
conditional on that instead of CONFIG_DEBUG_FS.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

03796c14

02 2月, 2017 4 次提交

scsi, block: fix duplicate bdi name registration crashes · 0dba1314

由 Dan Williams 提交于 2月 01, 2017

Warnings of the following form occur because scsi reuses a devt number
while the block layer still has it referenced as the name of the bdi
[1]:

 WARNING: CPU: 1 PID: 93 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x62/0x80
 sysfs: cannot create duplicate filename '/devices/virtual/bdi/8:192'
 [..]
 Call Trace:
  dump_stack+0x86/0xc3
  __warn+0xcb/0xf0
  warn_slowpath_fmt+0x5f/0x80
  ? kernfs_path_from_node+0x4f/0x60
  sysfs_warn_dup+0x62/0x80
  sysfs_create_dir_ns+0x77/0x90
  kobject_add_internal+0xb2/0x350
  kobject_add+0x75/0xd0
  device_add+0x15a/0x650
  device_create_groups_vargs+0xe0/0xf0
  device_create_vargs+0x1c/0x20
  bdi_register+0x90/0x240
  ? lockdep_init_map+0x57/0x200
  bdi_register_owner+0x36/0x60
  device_add_disk+0x1bb/0x4e0
  ? __pm_runtime_use_autosuspend+0x5c/0x70
  sd_probe_async+0x10d/0x1c0
  async_run_entry_fn+0x39/0x170

This is a brute-force fix to pass the devt release information from
sd_probe() to the locations where we register the bdi,
device_add_disk(), and unregister the bdi, blk_cleanup_queue().

Thanks to Omar for the quick reproducer script [2]. This patch survives
where an unmodified kernel fails in a few seconds.

[1]: https://marc.info/?l=linux-scsi&m=147116857810716&w=4
[2]: http://marc.info/?l=linux-block&m=148554717109098&w=2

Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Reported-by: NOmar Sandoval <osandov@osandov.com>
Tested-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NBart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

0dba1314

block: Get rid of blk_get_backing_dev_info() · efa7c9f9

由 Jan Kara 提交于 2月 02, 2017

blk_get_backing_dev_info() is now a simple dereference. Remove that
function and simplify some code around that.
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@fb.com>

efa7c9f9

block: Dynamically allocate and refcount backing_dev_info · d03f6cdc

由 Jan Kara 提交于 2月 02, 2017

Instead of storing backing_dev_info inside struct request_queue,
allocate it dynamically, reference count it, and free it when the last
reference is dropped. Currently only request_queue holds the reference
but in the following patch we add other users referencing
backing_dev_info.
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@fb.com>

d03f6cdc

block: Use pointer to backing_dev_info from request_queue · dc3b17cc

由 Jan Kara 提交于 2月 02, 2017

We will want to have struct backing_dev_info allocated separately from
struct request_queue. As the first step add pointer to backing_dev_info
to request_queue and convert all users touching it. No functional
changes in this patch.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@fb.com>

dc3b17cc

01 2月, 2017 3 次提交

block: move internal_tag to same cache line as tag · d486f1f2

由 Jens Axboe 提交于 1月 31, 2017

Since we removed cmd_type, we now have a hole in the struct. Move
the internal_tag member to the same cacheline as tag, since we
use them at the same time.

This doesn't fix the hole, just moves it elsewhere.
Signed-off-by: NJens Axboe <axboe@fb.com>

d486f1f2

block: fold cmd_type into the REQ_OP_ space · aebf526b

由 Christoph Hellwig 提交于 1月 31, 2017

Instead of keeping two levels of indirection for requests types, fold it
all into the operations.  The little caveat here is that previously
cmd_type only applied to struct request, while the request and bio op
fields were set to plain REQ_OP_READ/WRITE even for passthrough
operations.

Instead this patch adds new REQ_OP_* for SCSI passthrough and driver
private requests, althought it has to add two for each so that we
can communicate the data in/out nature of the request.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

aebf526b

block: introduce blk_rq_is_passthrough · 57292b58

由 Christoph Hellwig 提交于 1月 31, 2017

This can be used to check for fs vs non-fs requests and basically
removes all knowledge of BLOCK_PC specific from the block layer,
as well as preparing for removing the cmd_type field in struct request.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

57292b58

28 1月, 2017 3 次提交

block: split scsi_request out of struct request · 82ed4db4

由 Christoph Hellwig 提交于 1月 27, 2017

And require all drivers that want to support BLOCK_PC to allocate it
as the first thing of their private data.  To support this the legacy
IDE and BSG code is switched to set cmd_size on their queues to let
the block layer allocate the additional space.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

82ed4db4

block: allow specifying size for extra command data · 6d247d7f

由 Christoph Hellwig 提交于 1月 27, 2017

This mirrors the blk-mq capabilities to allocate extra drivers-specific
data behind struct request by setting a cmd_size field, as well as having
a constructor / destructor for it.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

6d247d7f

block: simplify blk_init_allocated_queue · 5ea708d1

由 Christoph Hellwig 提交于 1月 03, 2017

Return an errno value instead of the passed in queue so that the callers
don't have to keep track of two queues, and move the assignment of the
request_fn and lock to the caller as passing them as argument doesn't
simplify anything.  While we're at it also remove two pointless NULL
assignments, given that the request structure is zeroed on allocation.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NBart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

5ea708d1

27 1月, 2017 2 次提交

blk-mq-sched: fix starvation for multiple hardware queues and shared tags · 50e1dab8

由 Jens Axboe 提交于 1月 26, 2017

If we have both multiple hardware queues and shared tag map between
devices, we need to ensure that we propagate the hardware queue
restart bit higher up. This is because we can get into a situation
where we don't have any IO pending on a hardware queue, yet we fail
getting a tag to start new IO. If that happens, it's not enough to
mark the hardware queue as needing a restart, we need to bubble
that up to the higher level queue as well.
Signed-off-by: NJens Axboe <axboe@fb.com>
Reviewed-by: NOmar Sandoval <osandov@fb.com>
Tested-by: NHannes Reinecke <hare@suse.com>

50e1dab8

blk-mq: create debugfs directory tree · 07e4fead

由 Omar Sandoval 提交于 1月 25, 2017

In preparation for putting blk-mq debugging information in debugfs,
create a directory tree mirroring the one in sysfs:

    # tree -d /sys/kernel/debug/block
    /sys/kernel/debug/block
    |-- nvme0n1
    |   `-- mq
    |       |-- 0
    |       |   `-- cpu0
    |       |-- 1
    |       |   `-- cpu1
    |       |-- 2
    |       |   `-- cpu2
    |       `-- 3
    |           `-- cpu3
    `-- vda
        `-- mq
            `-- 0
                |-- cpu0
                |-- cpu1
                |-- cpu2
                `-- cpu3

Also add the scaffolding for the actual files that will go in here,
either under the hardware queue or software queue directories.
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

07e4fead

18 1月, 2017 1 次提交

blk-mq-sched: add framework for MQ capable IO schedulers · bd166ef1

由 Jens Axboe 提交于 1月 17, 2017

This adds a set of hooks that intercepts the blk-mq path of
allocating/inserting/issuing/completing requests, allowing
us to develop a scheduler within that framework.

We reuse the existing elevator scheduler API on the registration
side, but augment that with the scheduler flagging support for
the blk-mq interfce, and with a separate set of ops hooks for MQ
devices.

We split driver and scheduler tags, so we can run the scheduling
independently of device queue depth.
Signed-off-by: NJens Axboe <axboe@fb.com>
Reviewed-by: NBart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: NOmar Sandoval <osandov@fb.com>

bd166ef1

14 1月, 2017 1 次提交

block: add blk_rq_payload_bytes · 2e3258ec

由 Christoph Hellwig 提交于 1月 13, 2017

Add a helper to calculate the actual data transfer size for special
payload requests.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

2e3258ec

12 1月, 2017 3 次提交

block: Rename blk_queue_zone_size and bdev_zone_size · f99e8648

由 Damien Le Moal 提交于 1月 12, 2017

All block device data fields and functions returning a number of 512B
sectors are by convention named xxx_sectors while names in the form
xxx_size are generally used for a number of bytes. The blk_queue_zone_size
and bdev_zone_size functions were not following this convention so rename
them.

No functional change is introduced by this patch.
Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>

Collapsed the two patches, they were nonsensically split and broke
bisection.
Signed-off-by: NJens Axboe <axboe@fb.com>

f99e8648

blk-mq: make mq_ops a const pointer · f8a5b122

由 Jens Axboe 提交于 12月 13, 2016

We never change it, make that clear.
Signed-off-by: NJens Axboe <axboe@fb.com>
Reviewed-by: NBart Van Assche <bart.vanassche@sandisk.com>

f8a5b122

block: relax check on sg gap · 729204ef

由 Ming Lei 提交于 12月 17, 2016

If the last bvec of the 1st bio and the 1st bvec of the next
bio are physically contigious, and the latter can be merged
to last segment of the 1st bio, we should think they don't
violate sg gap(or virt boundary) limit.

Both Vitaly and Dexuan reported lots of unmergeable small bios
are observed when running mkfs on Hyper-V virtual storage, and
performance becomes quite low. This patch fixes that performance
issue.

The same issue should exist on NVMe, since it sets virt boundary too.
Reported-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Reported-by: NDexuan Cui <decui@microsoft.com>
Tested-by: NDexuan Cui <decui@microsoft.com>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: NMing Lei <ming.lei@canonical.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

729204ef

18 12月, 2016 1 次提交

block: Remove unused member (busy) from struct blk_queue_tag · e8465447

由 Ritesh Harjani 提交于 12月 16, 2016

Signed-off-by: NRitesh Harjani <riteshh@codeaurora.org>
Reviewed-by: NBart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

e8465447

09 12月, 2016 1 次提交

block: improve handling of the magic discard payload · f9d03f96

由 Christoph Hellwig 提交于 12月 08, 2016

Instead of allocating a single unused biovec for discard requests, send
them down without any payload.  Instead we allow the driver to add a
"special" payload using a biovec embedded into struct request (unioned
over other fields never used while in the driver), and overloading
the number of segments for this case.

This has a couple of advantages:

 - we don't have to allocate the bio_vec
 - the amount of special casing for discard requests in the block
   layer is significantly reduced
 - using this same scheme for other request types is trivial,
   which will be important for implementing the new WRITE_ZEROES
   op on devices where it actually requires a payload (e.g. SCSI)
 - we can get rid of playing games with the request length, as
   we'll never touch it and completions will work just fine
 - it will allow us to support ranged discard operations in the
   future by merging non-contiguous discard bios into a single
   request
 - last but not least it removes a lot of code

This patch is the common base for my WIP series for ranges discards and to
remove discard_zeroes_data in favor of always using REQ_OP_WRITE_ZEROES,
so it would be good to get it in quickly.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

f9d03f96

01 12月, 2016 2 次提交

block: add support for REQ_OP_WRITE_ZEROES · a6f0788e

由 Chaitanya Kulkarni 提交于 11月 30, 2016

This adds a new block layer operation to zero out a range of
LBAs. This allows to implement zeroing for devices that don't use
either discard with a predictable zero pattern or WRITE SAME of zeroes.
The prominent example of that is NVMe with the Write Zeroes command,
but in the future, this should also help with improving the way
zeroing discards work. For this operation, suitable entry is exported in
sysfs which indicate the number of maximum bytes allowed in one
write zeroes operation by the device.
Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@hgst.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

a6f0788e

block: add async variant of blkdev_issue_zeroout · e73c23ff

由 Chaitanya Kulkarni 提交于 11月 30, 2016

Similar to __blkdev_issue_discard this variant allows submitting
the final bio asynchronously and chaining multiple ranges
into a single completion.
Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@hgst.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

e73c23ff

18 11月, 2016 3 次提交

block: Change extern inline to static inline · 9a05e754

由 Tobias Klauser 提交于 11月 18, 2016

With compilers which follow the C99 standard (like modern versions of
gcc and clang), "extern inline" does the opposite thing from older
versions of gcc (emits code for an externally linkable version of the
inline function).

"static inline" does the intended behavior in all cases instead.

Description taken from commit 6d91857d ("staging, rtl8192e,
LLVMLinux: Change extern inline to static inline").

This also fixes the following GCC warning when building with CONFIG_PM
disabled:

  ./include/linux/blkdev.h:1143:20: warning: no previous prototype for 'blk_set_runtime_active' [-Wmissing-prototypes]

Fixes: d07ab6d1 ("block: Add blk_set_runtime_active()")
Reviewed-by: NMika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: NTobias Klauser <tklauser@distanz.ch>
Signed-off-by: NJens Axboe <axboe@fb.com>

9a05e754

blk-mq: make the polling code adaptive · 64f1c21e

由 Jens Axboe 提交于 11月 14, 2016

The previous commit introduced the hybrid sleep/poll mode. Take
that one step further, and use the completion latencies to
automatically sleep for half the mean completion time. This is
a good approximation.

This changes the 'io_poll_delay' sysfs file a bit to expose the
various options. Depending on the value, the polling code will
behave differently:

-1	Never enter hybrid sleep mode
 0	Use half of the completion mean for the sleep delay
>0	Use this specific value as the sleep delay
Signed-off-by: NJens Axboe <axboe@fb.com>
Tested-By: NStephen Bates <sbates@raithlin.com>
Reviewed-By: NStephen Bates <sbates@raithlin.com>

64f1c21e

blk-mq: implement hybrid poll mode for sync O_DIRECT · 06426adf

由 Jens Axboe 提交于 11月 14, 2016

This patch enables a hybrid polling mode. Instead of polling after IO
submission, we can induce an artificial delay, and then poll after that.
For example, if the IO is presumed to complete in 8 usecs from now, we
can sleep for 4 usecs, wake up, and then do our polling. This still puts
a sleep/wakeup cycle in the IO path, but instead of the wakeup happening
after the IO has completed, it'll happen before. With this hybrid
scheme, we can achieve big latency reductions while still using the same
(or less) amount of CPU.
Signed-off-by: NJens Axboe <axboe@fb.com>
Tested-By: NStephen Bates <sbates@raithlin.com>
Reviewed-By: NStephen Bates <sbates@raithlin.com>

06426adf

12 11月, 2016 1 次提交

block: move poll code to blk-mq · bbd7bb70

由 Jens Axboe 提交于 11月 04, 2016

The poll code is blk-mq specific, let's move it to blk-mq.c. This
is a prep patch for improving the polling code.
Signed-off-by: NJens Axboe <axboe@fb.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

bbd7bb70

11 11月, 2016 2 次提交

block: hook up writeback throttling · 87760e5e

由 Jens Axboe 提交于 11月 09, 2016

Enable throttling of buffered writeback to make it a lot
more smooth, and has way less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at the time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.

The algorithm for when to throttle takes its inspiration in the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.

Unlike CoDel, blk-wb allows the scale count to to negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-bw quickly snaps back to it's
stable state of a zero scale count.

The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
target to me met. It defaults to 2 msec for non-rotational storage, and
75 msec for rotational storage. Setting this value to '0' disables
blk-wb. Generally, a user would not have to touch this setting.

We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.
Signed-off-by: NJens Axboe <axboe@fb.com>

87760e5e

block: add scalable completion tracking of requests · cf43e6be

由 Jens Axboe 提交于 11月 07, 2016

For legacy block, we simply track them in the request queue. For
blk-mq, we track them on a per-sw queue basis, which we can then
sum up through the hardware queues and finally to a per device
state.

The stats are tracked in, roughly, 0.1s interval windows.

Add sysfs files to display the stats.

The feature is off by default, to avoid any extra overhead. In-kernel
users of it can turn it on by setting QUEUE_FLAG_STATS in the queue
flags. We currently don't turn it on if someone just reads any of
the stats files, that is something we could add as well.
Signed-off-by: NJens Axboe <axboe@fb.com>

cf43e6be

06 11月, 2016 1 次提交

block: add code to track actual device queue depth · d278d4a8

由 Jens Axboe 提交于 3月 30, 2016

For blk-mq, ->nr_requests does track queue depth, at least at init
time. But for the older queue paths, it's simply a soft setting.
On top of that, it's generally larger than the hardware setting
on purpose, to allow backup of requests for merging.

Fill a hole in struct request with a 'queue_depth' member, that
drivers can call to more closely inform the block layer of the
real queue depth.
Signed-off-by: NJens Axboe <axboe@fb.com>
Reviewed-by: NJan Kara <jack@suse.cz>

d278d4a8

04 11月, 2016 1 次提交

block: immediately dispatch big size request · 50d24c34

由 Shaohua Li 提交于 11月 03, 2016

Currently block plug holds up to 16 non-mergeable requests. This makes
sense if the request size is small, eg, reduce lock contention. But if
request size is big enough, we don't need to worry about lock
contention. Holding such request makes no sense and it lows the disk
utilization.

In practice, this improves 10% throughput for my raid5 sequential write
workload.

The size (128k) is arbitrary right now, but it makes sure lock
contention is small. This probably could be more intelligent, eg, check
average request size holded. Since this is mainly for sequential IO,
probably not worthy.

V2: check the last request instead of the first request, so as long as
there is one big size request we flush the plug.
Signed-off-by: NShaohua Li <shli@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

50d24c34

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功