- 18 October 2021, 1 commit
-
-
Committed by Christoph Hellwig
There is no need to pull blk-cgroup.h and thus blkdev.h in here, so break the include chain.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 16 October 2021, 1 commit
-
-
Committed by Christoph Hellwig
Don't switch back to percpu mode to avoid the double RCU grace period when tearing down SCSI devices. After removing the disk only passthrough commands can be sent anyway.

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20210929071241.934472-6-hch@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 08 September 2021, 1 commit
-
-
Committed by Song Liu
Limiting the number of requests to BLK_MAX_REQUEST_COUNT at blk_plug hurts performance for large md arrays. [1] shows that the resync speed of an md array drops for arrays with more than 16 HDDs. Fix this by allowing more requests in the plug queue. The multiple_queues flag is used to apply the higher limit only to multi-queue cases.

[1] https://lore.kernel.org/linux-raid/CAFDAVznS71BXW8Jxv6k9dXc2iR3ysX3iZRBww_rzA8WifBFxGg@mail.gmail.com/

Tested-by: Marcin Wanat <marcin.wanat@gmail.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
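A minimal sketch of the approach, assuming a helper along these lines (the name blk_plug_max_rq_count and the 4x multiplier are taken as illustrative):

    /* Illustrative helper: raise the plug limit only when the plug spans
     * multiple queues (e.g. an md resync across many member disks). */
    static inline unsigned int blk_plug_max_rq_count(struct blk_plug *plug)
    {
            if (plug->multiple_queues)
                    return BLK_MAX_REQUEST_COUNT * 4;
            return BLK_MAX_REQUEST_COUNT;
    }

The submission path would then compare the plugged request count against this helper instead of the bare BLK_MAX_REQUEST_COUNT constant.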
-
- 24 August 2021, 4 commits
-
-
Committed by Christoph Hellwig
Replace the magic lookup through the kobject tree with an explicit backpointer, given that the device model links are set up and torn down at times when I/O is still possible, leading to potential NULL or invalid pointer dereferences.

Fixes: edb0872f ("block: move the bdi from the request_queue to the gendisk")
Reported-by: syzbot <syzbot+aa0801b6b32dca9dda82@syzkaller.appspotmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Link: https://lore.kernel.org/r/20210816134624.GA24234@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Christoph Hellwig
Pass in a request_queue and assign disk->queue in __blk_alloc_disk to ensure struct gendisk always has a valid ->queue pointer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210816131910.615153-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Christoph Hellwig
This was a leftover from the legacy alloc_disk interface. Switch the scsi ULPs and dasd to set ->minors directly like all other drivers and remove the argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com> [dasd]
Link: https://lore.kernel.org/r/20210816131910.615153-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Christoph Hellwig
Pass the lockdep name to the low-level __blk_alloc_disk helper and hardcode the name for it, given that the number of minors or node_id are not very useful information. While this passes a pointless argument for non-lockdep builds, that is not really an issue, as disk allocation is a probe-time-only slow path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210816131910.615153-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
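This ends up looking roughly like the usual lockdep-key macro pattern, sketched here for illustration (macro and helper names are assumptions, not the exact diff):

    /* Sketch: capture a per-call-site lockdep class for the disk's queue
     * by wrapping the low-level allocator in a macro. */
    #define blk_alloc_disk(node_id)                                     \
    ({                                                                  \
            static struct lock_class_key __key;                        \
                                                                        \
            __blk_alloc_disk(node_id, &__key);                          \
    })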
-
- 18 August 2021, 1 commit
-
-
Committed by Ming Lei
is_flush_rq() is called from bt_iter()/bt_tags_iter(), and runs the following check:

    hctx->fq->flush_rq == req

but the passed hctx from bt_iter()/bt_tags_iter() may be NULL because of:

1) memory re-ordering in blk_mq_rq_ctx_init():

    rq->mq_hctx = data->hctx;
    ...
    refcount_set(&rq->ref, 1);

OR

2) tag re-use and ->rqs[] isn't updated with the new request.

Fix the issue by re-writing is_flush_rq() as:

    return rq->end_io == flush_end_io;

which turns out simpler to follow and immune to the data race, since we have ordered the WRITE of rq->end_io and refcount_set(&rq->ref, 1).

Fixes: 2e315dc0 ("blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter")
Cc: "Blank-Burian, Markus, Dr." <blankburian@uni-muenster.de>
Cc: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210818010925.607383-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
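For context, the old and new forms of the helper look roughly like this (the old signature is reconstructed from the description above):

    /* Old: needs a valid hctx, which the tag iterators cannot guarantee. */
    static bool is_flush_rq(struct request *req, struct blk_mq_hw_ctx *hctx)
    {
            return hctx->fq->flush_rq == req;
    }

    /* New: only inspects the request itself, so no hctx ordering issues. */
    bool is_flush_rq(struct request *rq)
    {
            return rq->end_io == flush_end_io;
    }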
-
- 17 August 2021, 1 commit
-
-
Committed by Ming Lei
Inside blk_mq_queue_tag_busy_iter() we already grab the request's refcount before calling ->fn(), so there is no need to grab it one more time in blk_mq_check_expired(). Also remove the extra request-expire check in blk_mq_check_expired().

Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/20210811155202.629575-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 13 August 2021, 1 commit
-
-
Committed by Yu Kuai
We run a test that deletes and recovers devices frequently (two devices on the same host), and we found that 'active_queues' becomes very large after a period of time.

If device a and device b share a tag set, and a is deleted, then blk_mq_exit_queue() will clear BLK_MQ_F_TAG_QUEUE_SHARED because there is only one queue still using the tag set. However, if b is still active, the active_queues of b might never be cleared, even after b is deleted.

Thus clear active_queues before BLK_MQ_F_TAG_QUEUE_SHARED is cleared.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210731062130.1533893-1-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 11 August 2021, 1 commit
-
-
Committed by Tanner Love
With CONFIG_IRQ_FORCED_THREADING=y, testing the boolean force_irqthreads could incur a cache line miss in invoke_softirq() and other places. Replace the test with a static key to avoid the potential cache miss.

[ tglx: Dropped the IDE part, removed the export and updated blk-mq ]

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Tanner Love <tannerlove@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210602180338.3324213-1-tannerlove.kernel@gmail.com
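A minimal sketch of the static-key pattern described here (the key and accessor names are assumptions, not the exact diff):

    /* Default-off key; flipped once at boot if "threadirqs" is set. */
    DEFINE_STATIC_KEY_FALSE(force_irqthreads_key);

    static int __init setup_forced_irqthreads(char *arg)
    {
            static_branch_enable(&force_irqthreads_key);
            return 0;
    }
    early_param("threadirqs", setup_forced_irqthreads);

    /* Hot paths such as invoke_softirq() then test a patched branch
     * instead of loading a bool from memory. */
    #define force_irqthreads()  static_branch_unlikely(&force_irqthreads_key)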
-
- 10 August 2021, 1 commit
-
-
Committed by Christoph Hellwig
The backing device information only makes sense for file system I/O, and thus belongs into the gendisk and not the lower level request_queue structure. Move it there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210809141744.1203023-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 31 July 2021, 1 commit
-
-
Committed by Christoph Hellwig
Move the sg_timeout and sg_reserved_size fields into the bsg_device and scsi_device structures, as they have nothing to do with generic block I/O. Note that these values are now separate for bsg vs. SCSI device node access, but that just matches how /dev/sg vs. the other nodes has always behaved.

Link: https://lore.kernel.org/r/20210729064845.1044147-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
-
- 01 July 2021, 1 commit
-
-
Committed by Christoph Hellwig
All driver uses are gone now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210624081012.256464-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 25 June 2021, 1 commit
-
-
Committed by Ming Lei
Commit 6e6fcbc2 ("blk-mq: support batching dispatch in case of io") started to support batched I/O submission by using hctx->dispatch_busy. However, blk_mq_update_dispatch_busy() wasn't changed to update hctx->dispatch_busy in that commit, so fix the issue by updating hctx->dispatch_busy in the real-scheduler case.

Reported-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Fixes: 6e6fcbc2 ("blk-mq: support batching dispatch in case of io")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210625020248.1630497-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 18 June 2021, 3 commits
-
-
Committed by Peter Zijlstra
Change the type and name of task_struct::state. Drop the volatile and shrink it to an 'unsigned int'. Rename it in order to find all uses such that we can use READ_ONCE/WRITE_ONCE as appropriate.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org
-
Committed by Peter Zijlstra
Remove yet another few p->state accesses.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210611082838.347475156@infradead.org
-
Committed by Peter Zijlstra
Replace a bunch of 'p->state == TASK_RUNNING' with a new helper: task_is_running(p).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210611082838.222401495@infradead.org
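A one-line sketch of the helper, assuming it builds on the renamed __state field from the change above:

    /* Open-coded checks such as "p->state == TASK_RUNNING" become: */
    #define task_is_running(task)  (READ_ONCE((task)->__state) == TASK_RUNNING)

Callers then test task_is_running(p) instead of reading the raw field.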
-
- 12 June 2021, 4 commits
-
-
Committed by Christoph Hellwig
All users are gone now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Link: https://lore.kernel.org/r/20210602065345.355274-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Christoph Hellwig
Add a new API to allocate a gendisk including the request_queue for use with blk-mq based drivers. This is to avoid boilerplate code in drivers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Link: https://lore.kernel.org/r/20210602065345.355274-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
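A hypothetical driver probe path using the new helper might look like this sketch (the driver name, fields and fops are made up for illustration):

    static int mydrv_probe(struct mydrv_dev *dev)
    {
            struct gendisk *disk;

            /* Allocates the gendisk and its request_queue in one call. */
            disk = blk_mq_alloc_disk(&dev->tag_set, dev);
            if (IS_ERR(disk))
                    return PTR_ERR(disk);

            disk->fops = &mydrv_fops;
            dev->queue = disk->queue;   /* the queue comes with the disk */

            add_disk(disk);
            return 0;
    }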
-
Committed by Christoph Hellwig
Don't return the passed in request_queue but a normal error code, and drop the elevator_init argument in favor of just calling elevator_init_mq directly from dm-rq.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Link: https://lore.kernel.org/r/20210602065345.355274-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Christoph Hellwig
Factor out a helper to initialize a simple single hw queue tag_set from blk_mq_init_sq_queue. This will allow phasing out blk_mq_init_sq_queue in favor of a more symmetric and general API.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Link: https://lore.kernel.org/r/20210602065345.355274-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
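Such a helper would plausibly look like the following sketch (the name blk_mq_alloc_sq_tag_set and the exact fields set are assumptions based on the description):

    /* Set up a tag_set for a driver with a single hardware queue. */
    int blk_mq_alloc_sq_tag_set(struct blk_mq_tag_set *set,
                                const struct blk_mq_ops *ops,
                                unsigned int queue_depth,
                                unsigned int set_flags)
    {
            memset(set, 0, sizeof(*set));
            set->ops = ops;
            set->nr_hw_queues = 1;
            set->nr_maps = 1;
            set->queue_depth = queue_depth;
            set->numa_node = NUMA_NO_NODE;
            set->flags = set_flags;
            return blk_mq_alloc_tag_set(set);
    }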
-
- 04 June 2021, 1 commit
-
-
Committed by Jan Kara
Provided the device driver does not implement dispatch budget accounting (which only SCSI does), the loop in __blk_mq_do_dispatch_sched() pulls requests from the IO scheduler as long as it is willing to give out any. That defeats scheduling heuristics inside the scheduler by creating a false impression that the device can take more IO when it in fact cannot.

For example, with the BFQ IO scheduler on top of a virtio-blk device, setting blkio cgroup weight has barely any impact on the observed throughput of async IO, because __blk_mq_do_dispatch_sched() always sucks out all the IO queued in BFQ. BFQ first submits IO from higher weight cgroups, but when that is all dispatched, it will give out IO of lower weight cgroups as well. And then we have to wait for all this IO to be dispatched to the disk (which means a lot of it actually has to complete) before the IO scheduler is queried again for dispatching more requests. This completely destroys any service differentiation.

So grab the request tag for a request pulled out of the IO scheduler already in __blk_mq_do_dispatch_sched(), and do not pull any more requests if we cannot get it, because we are unlikely to be able to dispatch it. That way only a single request is going to wait in the dispatch list for some tag to free up.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210603104721.6309-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 24 May 2021, 5 commits
-
-
Committed by John Garry
The tags used for an IO scheduler are currently per hctx. As such, when q->nr_hw_queues grows, so does the request queue's total IO scheduler tag depth. This may cause problems for SCSI MQ HBAs whose total driver depth is fixed.

Ming and Yanhui report higher CPU usage and lower throughput in scenarios where the fixed total driver tag depth is appreciably lower than the total scheduler tag depth:
https://lore.kernel.org/linux-block/440dfcfc-1a2c-bd98-1161-cec4d78c6dfc@huawei.com/T/#mc0d6d4f95275a2743d1c8c3e4dc9ff6c9aa3a76b

In that scenario, since the scheduler tag is acquired first, much contention is introduced, because a driver tag may not be available after we have got the sched tag.

Improve this scenario by introducing request queue-wide tags for when a tagset-wide sbitmap is used. The static sched requests are still allocated per hctx, as requests are initialised per hctx, as in blk_mq_init_request(..., hctx_idx, ...) -> set->ops->init_request(.., hctx_idx, ...).

For simplicity of resizing the request queue sbitmap when updating the request queue depth, just init at the max possible size, so we don't need to deal with swapping out a new sbitmap for the old one if we need to grow.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1620907258-30910-3-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by John Garry
The tag allocation code that allocates the sbitmap pairs is common to regular bitmap tags and the shared sbitmap, so refactor it into a common function. Also remove the superfluous "flags" argument from blk_mq_init_shared_sbitmap().

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1620907258-30910-2-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Ming Lei
Before we free the request queue, clear the flush request reference in tags->rqs[] so that a potential UAF can be avoided.

Based on one patch written by David Jeffery.

Tested-by: John Garry <john.garry@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210511152236.763464-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Ming Lei
refcount_inc_not_zero() in bt_tags_iter() may still read a freed request.

Fix the issue with the following approach:

1) hold a per-tags spinlock when reading ->rqs[tag] and calling refcount_inc_not_zero() in bt_tags_iter();

2) clear stale requests referred to via ->rqs[tag] before freeing the request pool; the per-tags spinlock is held while clearing stale ->rqs[tag].

So after we clear stale requests, bt_tags_iter() won't observe a freed request any more; also, the clearing will wait for pending request references.

The idea of clearing ->rqs[] is borrowed from John Garry's previous patch and a recent patch from David.

Tested-by: John Garry <john.garry@huawei.com>
Reviewed-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210511152236.763464-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
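Point 1) amounts to a small locked-lookup helper; a sketch, assuming a helper named blk_mq_find_and_get_req() and a tags->lock spinlock:

    /* Look up ->rqs[bitnr] and take a reference under tags->lock, so the
     * lookup cannot race with the request pool being cleared and freed. */
    static struct request *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
                                                   unsigned int bitnr)
    {
            struct request *rq;
            unsigned long flags;

            spin_lock_irqsave(&tags->lock, flags);
            rq = tags->rqs[bitnr];
            if (!rq || !refcount_inc_not_zero(&rq->ref))
                    rq = NULL;
            spin_unlock_irqrestore(&tags->lock, flags);
            return rq;
    }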
-
Committed by Ming Lei
Grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter(); this prevents the request from being re-used while ->fn is running. The approach is the same as what we do when handling timeouts.

Fix request use-after-free (UAF) related to completion races or queue releasing:

- If one rq is referred to before rq->q is frozen, then the queue won't be frozen before the request is released during iteration.

- If one rq is referred to after rq->q is frozen, refcount_inc_not_zero() will return false, and we won't iterate over this request.

However, one request UAF is still not covered: refcount_inc_not_zero() may read a freed request, and that will be handled in the next patch.

Tested-by: John Garry <john.garry@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210511152236.763464-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
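The iterator side then follows the familiar take-a-reference / call / drop pattern, sketched below as a simplified fragment of the tag iterator (blk_mq_put_rq_ref is assumed as the release helper):

    /* Inside the tag iterator callback (simplified): */
    rq = tags->rqs[bitnr];
    if (!rq || !refcount_inc_not_zero(&rq->ref))
            return true;            /* freed or being freed: skip it */

    ret = iter_data->fn(rq, iter_data->data, reserved);
    blk_mq_put_rq_ref(rq);          /* frees the request if this was the last ref */
    return ret;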
-
- 14 May 2021, 2 commits
-
-
Committed by Bart Van Assche
If a tag set is shared across request queues (e.g. SCSI LUNs) then the block layer core keeps track of the number of active request queues in tags->active_queues. blk_mq_tag_busy() and blk_mq_tag_idle() update that atomic counter if the hctx flag BLK_MQ_F_TAG_QUEUE_SHARED is set. Make sure that blk_mq_exit_queue() calls blk_mq_tag_idle() before that flag is cleared by blk_mq_del_queue_tag_set().

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Fixes: 0d2602ca ("blk-mq: improve support for shared tags maps")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210513171529.7977-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Ming Lei
In the case of a shared sbitmap, requests are no longer held in the plug list since commit 32bc15af ("blk-mq: Facilitate a shared sbitmap per tagset"). This makes request merging from the flush plug list and batched submission impossible, and so causes a performance regression.

Yanhui reports a performance regression when running a sequential IO test (libaio, 16 jobs, depth 8 for each job) in a VM, where the VM disk is emulated with an image stored on xfs/megaraid_sas.

Fix the issue by restoring the original behaviour of allowing requests to be held in the plug list.

Cc: Yanhui Ma <yama@redhat.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: kashyap.desai@broadcom.com
Fixes: 32bc15af ("blk-mq: Facilitate a shared sbitmap per tagset")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210514022052.1047665-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 16 April 2021, 1 commit
-
-
Committed by Lin Feng
Commit 01e99aec ("blk-mq: insert passthrough request into hctx->dispatch directly") gives high priority to passthrough requests and bypasses the underlying IO scheduler. But when we allocate a tag for such a request, it still runs the io-scheduler's limit_depth callback, while what we really want is to give such requests the full sbitmap depth when acquiring an available tag.

blktrace shows PC requests (dmraid -s -c -i) hitting bfq's limit_depth:

    8,0    2    0    0.000000000 39952  1,0  m   N bfq [bfq_limit_depth] wr_busy 0 sync 0 depth 8
    8,0    2    1    0.000008134 39952  D   R 4 [dmraid]
    8,0    2    2    0.000021538    24  C   R [0]
    8,0    2    0    0.000035442 39952  1,0  m   N bfq [bfq_limit_depth] wr_busy 0 sync 0 depth 8
    8,0    2    3    0.000038813 39952  D   R 24 [dmraid]
    8,0    2    4    0.000044356    24  C   R [0]

This patch introduces a new wrapper to make the code less ugly.

Signed-off-by: Lin Feng <linf@wangsu.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210415033920.213963-1-linf@wangsu.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
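The wrapper in question is presumably a passthrough check along these lines (a sketch; the name blk_op_is_passthrough and the exact op list are assumptions):

    /* True for driver/SCSI passthrough operations, which should see the
     * whole tag space rather than the scheduler's per-op depth limit. */
    static inline bool blk_op_is_passthrough(unsigned int op)
    {
            op &= REQ_OP_MASK;
            return op == REQ_OP_SCSI_IN || op == REQ_OP_SCSI_OUT ||
                   op == REQ_OP_DRV_IN  || op == REQ_OP_DRV_OUT;
    }

Tag allocation would then skip the scheduler's limit_depth callback whenever this returns true.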
-
- 09 April 2021, 1 commit
-
-
Committed by Sami Tolvanen
list_sort() internally casts the comparison function passed to it to a different type with constant struct list_head pointers, and uses this pointer to call the functions, which trips indirect call Control-Flow Integrity (CFI) checking. Instead of removing the consts, this change defines the list_cmp_func_t type and changes the comparison function types of all list_sort() callers to use const pointers, thus avoiding type mismatches.

Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210408182843.1754385-10-samitolvanen@google.com
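The essential shape of the new type and the prototype callers now conform to (function attributes omitted in this sketch):

    /* Comparison functions take const list_head pointers directly,
     * so no cast is needed at the indirect call site. */
    typedef int (*list_cmp_func_t)(void *priv,
                                   const struct list_head *a,
                                   const struct list_head *b);

    void list_sort(void *priv, struct list_head *head, list_cmp_func_t cmp);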
-
- 05 March 2021, 3 commits
-
-
Committed by Ming Lei
SCSI uses a global atomic variable to track queue depth for each LUN/request queue. This doesn't scale well when there are lots of CPU cores and the disk is very fast. It has been observed that IOPS are affected a lot by tracking queue depth via sdev->device_busy in the I/O path.

Return a budget token from the .get_budget callback. The budget token can be passed to the driver so that we can replace the atomic variable with an sbitmap_queue and alleviate the scaling problems that way.

Link: https://lore.kernel.org/r/20210122023317.687987-9-ming.lei@redhat.com
Cc: Omar Sandoval <osandov@fb.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Tested-by: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
-
Committed by Ming Lei
The allocation hint should have belonged to sbitmap. Also, when the sbitmap's depth is high and there is no need to use multiple wakeup queues, users can benefit from a percpu allocation hint too.

Move the allocation hint into sbitmap, so that the SCSI device queue can benefit from the allocation hint when converting to a plain sbitmap.

Convert vhost/scsi.c to use sbitmap allocation with a percpu alloc hint. This is more efficient than the previous approach.

Link: https://lore.kernel.org/r/20210122023317.687987-5-ming.lei@redhat.com
Cc: Omar Sandoval <osandov@fb.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Mike Christie <michael.christie@oracle.com>
Cc: virtualization@lists.linux-foundation.org
Tested-by: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
-
Committed by Ming Lei
Currently the allocation round_robin info is maintained by sbitmap_queue. However, bit allocation really belongs to sbitmap. Move it there.

Link: https://lore.kernel.org/r/20210122023317.687987-3-ming.lei@redhat.com
Cc: Omar Sandoval <osandov@fb.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: virtualization@lists.linux-foundation.org
Tested-by: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
-
- 12 February 2021, 2 commits
-
-
With llist_head it is possible to avoid the locking (the irq-off region) when items are added. This makes it possible to add items on a remote CPU without additional locking.

llist_add() returns true if the list was previously empty. This can be used to invoke the SMP function call / raise the softirq only if the first item was added (otherwise it is already pending). This simplifies the code a little and reduces the IRQ-off regions.

blk_mq_raise_softirq() needs a preempt-disable section to ensure the request is enqueued on the same CPU as the softirq is raised. Some callers (USB-storage) invoke this path in preemptible context.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
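A sketch of the softirq-raising side described above (the per-CPU list name and the request's llist field are assumed to match the surrounding code):

    static DEFINE_PER_CPU(struct llist_head, blk_cpu_done);

    static void blk_mq_raise_softirq(struct request *rq)
    {
            struct llist_head *list;

            /* Stay on this CPU so the raised softirq sees our list entry. */
            preempt_disable();
            list = this_cpu_ptr(&blk_cpu_done);
            if (llist_add(&rq->ipi_list, list))
                    raise_softirq(BLOCK_SOFTIRQ);   /* first entry: not yet pending */
            preempt_enable();
    }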
-
Controllers with multiple queues have their IRQ handlers pinned to a CPU. The core shouldn't need to complete the request on a remote CPU. Remove this case and always raise the softirq to complete the request.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 25 January 2021, 3 commits
-
-
Committed by Jan Kara
Currently, when a non-mq aware IO scheduler (BFQ, mq-deadline) is used for a queue with multiple HW queues, the performance is rather bad. The problem is that these IO schedulers use queue-wide locking and their dispatch function does not respect the hctx it is passed in and returns any request it finds appropriate. Thus locality of request access is broken and dispatch from multiple CPUs just contends on IO scheduler locks. For these IO schedulers there's little point in dispatching from multiple CPUs. Instead dispatch always only from a single CPU to limit contention.

Below is a comparison of dbench runs on an XFS filesystem where the storage is a raid card with 64 HW queues and a single rotating disk attached to it. BFQ is used as the IO scheduler:

    clients            MQ                    SQ              MQ-Patched
    Amean   1      39.12 (0.00%)      43.29 * -10.67%*      36.09 *   7.74%*
    Amean   2     128.58 (0.00%)     101.30 *  21.22%*      96.14 *  25.23%*
    Amean   4     577.42 (0.00%)     494.47 *  14.37%*     508.49 *  11.94%*
    Amean   8     610.95 (0.00%)     363.86 *  40.44%*     362.12 *  40.73%*
    Amean  16     391.78 (0.00%)     261.49 *  33.25%*     282.94 *  27.78%*
    Amean  32     324.64 (0.00%)     267.71 *  17.54%*     233.00 *  28.23%*
    Amean  64     295.04 (0.00%)     253.02 *  14.24%*     242.37 *  17.85%*
    Amean 512   10281.61 (0.00%)  10211.16 *   0.69%*   10447.53 *  -1.61%*

Numbers are times, so lower is better. MQ is the stock 5.10-rc6 kernel. SQ is the same kernel with megaraid_sas.host_tagset_enable=0 so that the card advertises just a single HW queue. MQ-Patched is a kernel with this patch applied.

You can see that multiple hardware queues heavily hurt performance in combination with BFQ. The patch restores the performance.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jan Kara
This reverts commit b445547e.

Since both mq-deadline and BFQ completely ignore the hctx they are passed to their dispatch function and dispatch whatever request they deem fit, checking whether any request for a particular hctx is queued is just pointless, since we'll very likely get a request from a different hctx anyway. In the following commit we'll deal with lock contention in these IO schedulers in the presence of multiple HW queues in a different way.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Christoph Hellwig
Replace the gendisk pointer in struct bio with a pointer to the newly improved struct block device. From that the gendisk can be trivially accessed with an extra indirection, but it also allows to directly look up all information related to partition remapping.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
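In practice this means code that used the bio's disk pointer now goes through the block device, roughly as below (the helper names are purely illustrative):

    /* The disk is now one indirection away from the bio ... */
    static inline struct gendisk *bio_disk(struct bio *bio)
    {
            return bio->bi_bdev->bd_disk;
    }

    /* ... while partition remapping information is available directly. */
    static inline sector_t bio_part_offset(struct bio *bio)
    {
            return bio->bi_bdev->bd_start_sect;
    }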
-