提交 · 0fe80347fd701a7261bea278cab692de79e83af7 · openeuler / Kernel

19 10月, 2021 18 次提交

md: use bdev_nr_sectors instead of open coding it · 0fe80347

由 Christoph Hellwig 提交于 10月 18, 2021

Use the proper helper to read the block device size.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKees Cook <keescook@chromium.org>
Acked-by: NSong Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20211018101130.1838532-7-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

0fe80347

dm: use bdev_nr_sectors and bdev_nr_bytes instead of open coding them · 6dcbb52c

由 Christoph Hellwig 提交于 10月 18, 2021

Use the proper helpers to read the block device size.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKees Cook <keescook@chromium.org>
Acked-by: NMike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20211018101130.1838532-6-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

6dcbb52c

drbd: use bdev_nr_sectors instead of open coding it · da7b3924

由 Christoph Hellwig 提交于 10月 18, 2021

Use the proper helper to read the block device size.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKees Cook <keescook@chromium.org>
Reviewed-by: NLee Duncan <lduncan@suse.com>
Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20211018101130.1838532-5-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

da7b3924

bcache: remove bdev_sectors · cda25b82

由 Christoph Hellwig 提交于 10月 18, 2021

Use the equivalent block layer helper instead.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKees Cook <keescook@chromium.org>
Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>
Acked-by: NColy Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20211018101130.1838532-4-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

cda25b82

block: add a bdev_nr_bytes helper · 6436bd90

由 Christoph Hellwig 提交于 10月 18, 2021

Add a helper to query the size of a block device in bytes. This
will be used to remove open coded access to ->bd_inode.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20211018101130.1838532-3-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

6436bd90

block: move the SECTOR_SIZE related definitions to blk_types.h · 99457db8

由 Christoph Hellwig 提交于 10月 18, 2021

Ensure these are always available for inlines in the various block layer
headers.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKees Cook <keescook@chromium.org>
Reviewed-by: NJan Kara <jack@suse.cz>
Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20211018101130.1838532-2-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

99457db8

nvme: wire up completion batching for the IRQ path · 4f502245

由 Jens Axboe 提交于 10月 18, 2021

Trivial to do now, just need our own io_comp_batch on the stack and pass
that in to the usual command completion handling.

I pondered making this dependent on how many entries we had to process,
but even for a single entry there's no discernable difference in
performance or latency. Running a sync workload over io_uring:

t/io_uring -b512 -d1 -s1 -c1 -p0 -F1 -B1 -n2 /dev/nvme1n1 /dev/nvme2n1

yields the below performance before the patch:

IOPS=254820, BW=124MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=251174, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=250806, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)

and the following after:

IOPS=255972, BW=124MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=251920, BW=123MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=251794, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)

which definitely isn't slower, about the same if you factor in a bit of
variance. For peak performance workloads, benchmarking shows a 2%
improvement.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

4f502245

io_uring: utilize the io batching infrastructure for more efficient polled IO · b688f11e

由 Jens Axboe 提交于 10月 12, 2021

Wire up using an io_comp_batch for f_op->iopoll(). If the lower stack
supports it, we can handle high rates of polled IO more efficiently.

This raises the single core efficiency on my system from ~6.1M IOPS to
~6.6M IOPS running a random read workload at depth 128 on two gen2
Optane drives.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b688f11e

nvme: add support for batched completion of polled IO · c234a653

由 Jens Axboe 提交于 10月 08, 2021

Take advantage of struct io_comp_batch, if passed in to the nvme poll
handler. If it's set, rather than complete each request individually
inline, store them in the io_comp_batch list. We only do so for requests
that will complete successfully, anything else will be completed inline as
before.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

c234a653

block: add support for blk_mq_end_request_batch() · f794f335

由 Jens Axboe 提交于 10月 08, 2021

Instead of calling blk_mq_end_request() on a single request, add a helper
that takes the new struct io_comp_batch and completes any request stored
in there.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

f794f335

sbitmap: add helper to clear a batch of tags · 1aec5e4a

由 Jens Axboe 提交于 10月 08, 2021

sbitmap currently only supports clearing tags one-by-one, add a helper
that allows the caller to pass in an array of tags to clear.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

1aec5e4a

block: add a struct io_comp_batch argument to fops->iopoll() · 5a72e899

由 Jens Axboe 提交于 10月 12, 2021

struct io_comp_batch contains a list head and a completion handler, which
will allow completions to more effciently completed batches of IO.

For now, no functional changes in this patch, we just define the
io_comp_batch structure and add the argument to the file_operations iopoll
handler.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

5a72e899

block: provide helpers for rq_list manipulation · 013a7f95

由 Jens Axboe 提交于 10月 13, 2021

Instead of open-coding the list additions, traversal, and removal,
provide a basic set of helpers.
Suggested-by: NChristoph Hellwig <hch@infradead.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

013a7f95

block: remove some blk_mq_hw_ctx debugfs entries · afd7de03

由 Jens Axboe 提交于 10月 18, 2021

Just like the blk_mq_ctx counterparts, we've got a bunch of counters
in here that are only for debugfs and are of questionnable value. They
are:

- dispatched, index of how many requests were dispatched in one go

- poll_{considered,invoked,success}, which track poll sucess rates. We're
  confident in the iopoll implementation at this point, don't bother
  tracking these.

As a bonus, this shrinks each hardware queue from 576 bytes to 512 bytes,
dropping a whole cacheline.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

afd7de03

block: remove debugfs blk_mq_ctx dispatched/merged/completed attributes · 9a14d6ce

由 Jens Axboe 提交于 10月 16, 2021

These were added as part of early days debugging for blk-mq, and they
are not really useful anymore. Rather than spend cycles updating them,
just get rid of them.

As a bonus, this shrinks the per-cpu software queue size from 256b
to 192b. That's a whole cacheline less.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

9a14d6ce

block: cache rq_flags inside blk_mq_rq_ctx_init() · 12845906

由 Pavel Begunkov 提交于 10月 18, 2021

Add a local variable for rq_flags, it helps to compile out some of
rq_flags reloads.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

12845906

block: blk_mq_rq_ctx_init cache ctx/q/hctx · 605f784e

由 Pavel Begunkov 提交于 10月 18, 2021

We should have enough of registers in blk_mq_rq_ctx_init(), store them
in local vars, so we don't keep reloading them.

note: keeping q->elevator may look unnecessary, but it's also used
inside inlined blk_mq_tags_from_data().
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

605f784e

block: skip elevator fields init for non-elv queue · 4f266f2b

由 Pavel Begunkov 提交于 10月 18, 2021

Don't init rq->hash and rq->rb_node in blk_mq_rq_ctx_init() if there is
no elevator. Also, move some other initialisers that imply barriers to
the end, so the compiler is free to rearrange and optimise other the
rest of them.

note: fold in a change from Jens leaving queue_list unconditional, as
it might lead to problems otherwise.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

4f266f2b

18 10月, 2021 22 次提交

block: store elevator state in request · 2ff0682d

由 Jens Axboe 提交于 10月 15, 2021

Add an rq private RQF_ELV flag, which tells the block layer that this
request was initialized on a queue that has an IO scheduler attached.
This allows for faster checking in the fast path, rather than having to
deference rq->q later on.

Elevator switching does full quiesce of the queue before detaching an
IO scheduler, so it's safe to cache this in the request itself.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

2ff0682d

block: only mark bio as tracked if it really is tracked · 90b8faa0

由 Jens Axboe 提交于 10月 15, 2021

We set BIO_TRACKED unconditionally when rq_qos_throttle() is called, even
though we may not even have an rq_qos handler. Only mark it as TRACKED if
it really is potentially tracked.

This saves considerable time for the case where the bio isn't tracked:

     2.64%     -1.65%  [kernel.vmlinux]  [k] bio_endio
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

90b8faa0

block: improve layout of struct request · b6087629

由 Jens Axboe 提交于 10月 15, 2021

It's been a while since this was analyzed, move some members around to
better flow with the use case. Initial state up top, and queued state
after that. This improves my peak case by about 1.5%, from 7750K to
7900K IOPS.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b6087629

block: move update request helpers into blk-mq.c · 9be3e06f

由 Jens Axboe 提交于 10月 14, 2021

For some reason we still have them in blk-core, with the rest of the
request completion being in blk-mq. That causes and out-of-line call
for each completion.

Move them into blk-mq.c instead, where they belong.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

9be3e06f

block: remove useless caller argument to print_req_error() · c477b797

由 Jens Axboe 提交于 10月 14, 2021

We have exactly one caller of this, just get rid of adding the useless
function name to the output.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

c477b797

block: don't bother iter advancing a fully done bio · d4aa57a1

由 Jens Axboe 提交于 10月 13, 2021

If we're completing nbytes and nbytes is the size of the bio, don't bother
with calling into the iterator increment helpers. Just clear the bio
size and we're done.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

d4aa57a1

block: convert the rest of block to bdev_get_queue · ed6cddef

由 Pavel Begunkov 提交于 10月 14, 2021

Convert bdev->bd_disk->queue to bdev_get_queue(), it's uses a cached
queue pointer and so is faster.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/addf6ea988c04213697ba3684c853e4ed7642a39.1634219547.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

ed6cddef

block: use bdev_get_queue() in blk-core.c · eab4e027

由 Pavel Begunkov 提交于 10月 14, 2021

Convert bdev->bd_disk->queue to bdev_get_queue(), it's uses a cached
queue pointer and so is faster.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/efc41f880262517c8dc32f932f1b23112f21b255.1634219547.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

eab4e027

block: use bdev_get_queue() in bio.c · 3caee463

由 Pavel Begunkov 提交于 10月 14, 2021

Convert bdev->bd_disk->queue to bdev_get_queue(), it's uses a cached
queue pointer and so is faster.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/85c36ea784d285a5075baa10049e6b59e15fb484.1634219547.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

3caee463

block: use bdev_get_queue() in bdev.c · 025a3865

由 Pavel Begunkov 提交于 10月 14, 2021

Convert bdev->bd_disk->queue to bdev_get_queue(), it's uses a cached
queue pointer and so is faster.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a352936ce5d9ac719645b1e29b173d931ebcdc02.1634219547.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

025a3865

block: cache request queue in bdev · 17220ca5

由 Pavel Begunkov 提交于 10月 14, 2021

There are tons of places where we need to get a request_queue only
having bdev, which turns into bdev->bd_disk->queue. There are probably a
hundred of such places considering inline helpers, and enough of them
are in hot paths.

Cache queue pointer in struct block_device and make use of it in
bdev_get_queue().
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a3bfaecdd28956f03629d0ca5c63ebc096e1c809.1634219547.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

17220ca5

block: handle fast path of bio splitting inline · abd45c15

由 Jens Axboe 提交于 10月 13, 2021

The fast path is no splitting needed. Separate the handling into a
check part we can inline, and an out-of-line handling path if we do
need to split.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

abd45c15

block: use flags instead of bit fields for blkdev_dio · 09ce8744

由 Jens Axboe 提交于 10月 14, 2021

This generates a lot better code for me, and bumps performance from
7650K IOPS to 7750K IOPS. Looking at profiles for the run and running
perf diff, it confirms that we're now sending a lot less time there:

     6.38%     -2.80%  [kernel.vmlinux]  [k] blkdev_direct_IO

Taking it from the 2nd most cycle consumer to only the 9th most at
3.35% of the CPU time.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

09ce8744

block: cache bdev in struct file for raw bdev IO · fac7c6d5

由 Pavel Begunkov 提交于 10月 13, 2021

bdev = &BDEV_I(file->f_mapping->host)->bdev

Getting struct block_device from a file requires 2 memory dereferences
as illustrated above, that takes a toll on performance, so cache it in
yet unused file->private_data. That gives a noticeable peak performance
improvement.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/8415f9fe12e544b9da89593dfbca8de2b52efe03.1634115360.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

fac7c6d5

nvme-multipath: enable polled I/O · c712dccc

由 Christoph Hellwig 提交于 10月 12, 2021

Set the poll queue flag to enable polling, given that the multipath
node just dispatches the bios to a lower queue.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Tested-by: NMark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-17-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

c712dccc

block: don't allow writing to the poll queue attribute · a614dd22

由 Christoph Hellwig 提交于 10月 12, 2021

The poll attribute is a historic artefact from before when we had
explicit poll queues that require driver specific configuration.
Just print a warning when writing to the attribute.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Tested-by: NMark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-16-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

a614dd22

block: switch polling to be bio based · 3e08773c

由 Christoph Hellwig 提交于 10月 12, 2021

Replace the blk_poll interface that requires the caller to keep a queue
and cookie from the submissions with polling based on the bio.

Polling for the bio itself leads to a few advantages:

 - the cookie construction can made entirely private in blk-mq.c
 - the caller does not need to remember the request_queue and cookie
   separately and thus sidesteps their lifetime issues
 - keeping the device and the cookie inside the bio allows to trivially
   support polling BIOs remapping by stacking drivers
 - a lot of code to propagate the cookie back up the submission path can
   be removed entirely.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Tested-by: NMark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

3e08773c

block: define 'struct bvec_iter' as packed · 19416123

由 Ming Lei 提交于 10月 12, 2021

'struct bvec_iter' is embedded into 'struct bio', define it as packed
so that we can get one extra 4bytes for other uses without expanding
bio.

'struct bvec_iter' is often allocated on stack, so making it packed
doesn't affect performance. Also I have run io_uring on both
nvme/null_blk, and not observe performance effect in this way.
Suggested-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NHannes Reinecke <hare@suse.de>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Tested-by: NMark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-14-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

19416123

block: use SLAB_TYPESAFE_BY_RCU for the bio slab · 1a7e76e4

由 Christoph Hellwig 提交于 10月 12, 2021

This flags ensures that the pages will not be reused for non-bio
allocations before the end of an RCU grace period. With that we can
safely use a RCU lookup for bio polling as long as we are fine with
occasionally polling the wrong device.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Tested-by: NMark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-13-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

1a7e76e4

block: rename REQ_HIPRI to REQ_POLLED · 6ce913fe

由 Christoph Hellwig 提交于 10月 12, 2021

Unlike the RWF_HIPRI userspace ABI which is intentionally kept vague,
the bio flag is specific to the polling implementation, so rename and
document it properly.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Tested-by: NMark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-12-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

6ce913fe

io_uring: don't sleep when polling for I/O · d729cf9a

由 Christoph Hellwig 提交于 10月 12, 2021

There is no point in sleeping for the expected I/O completion timeout
in the io_uring async polling model as we never poll for a specific
I/O.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Tested-by: NMark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-11-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

d729cf9a

block: replace the spin argument to blk_iopoll with a flags argument · ef99b2d3

由 Christoph Hellwig 提交于 10月 12, 2021

Switch the boolean spin argument to blk_poll to passing a set of flags
instead.  This will allow to control polling behavior in a more fine
grained way.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Tested-by: NMark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-10-hch@lst.de
[axboe: adapt to changed io_uring iopoll]
Signed-off-by: NJens Axboe <axboe@kernel.dk>

ef99b2d3

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功