1. 04 Jun 2021, 1 commit
• block: Do not pull requests from the scheduler when we cannot dispatch them · 61347154
  Authored by Jan Kara
Provided the device driver does not implement dispatch budget accounting
(which only SCSI does), the loop in __blk_mq_do_dispatch_sched() pulls
requests from the IO scheduler as long as it is willing to give out any.
That defeats scheduling heuristics inside the scheduler by creating a
false impression that the device can take more IO when it in fact
cannot.
      
For example, with the BFQ IO scheduler on top of a virtio-blk device,
setting the blkio cgroup weight has barely any impact on the observed
throughput of async IO, because __blk_mq_do_dispatch_sched() always
sucks out all the IO queued in BFQ. BFQ first submits IO from higher
weight cgroups, but when that is all dispatched, it gives out IO of
lower weight cgroups as well. And then we have to wait for all this IO
to be dispatched to the disk (which means a lot of it actually has to
complete) before the IO scheduler is queried again for more requests.
This completely destroys any service differentiation.
      
So grab the request tag for a request pulled out of the IO scheduler
already in __blk_mq_do_dispatch_sched(), and do not pull any more
requests if we cannot get it, because we are unlikely to be able to
dispatch them either. That way only a single request waits in the
dispatch list for some tag to free up (a sketch of this pattern follows
the sign-offs below).
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210603104721.6309-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
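A minimal C sketch of the dispatch-loop pattern described above. All
names here (sched_dispatch_request, get_driver_tag, queue_for_dispatch)
are hypothetical stand-ins, not the real blk-mq interfaces; the sketch
only models the control flow of grabbing a tag before pulling the next
request.

    /*
     * Illustrative sketch only: all types and helpers are hypothetical
     * stand-ins for the real blk-mq interfaces.
     */
    #include <stdbool.h>
    #include <stddef.h>

    struct request;

    struct request *sched_dispatch_request(void); /* ask scheduler for one rq */
    bool get_driver_tag(struct request *rq);      /* false: device queue full */
    void queue_for_dispatch(struct request *rq);

    static void do_dispatch_sched(void)
    {
        struct request *rq;

        while ((rq = sched_dispatch_request()) != NULL) {
            /*
             * Grab the driver tag immediately. If none is available,
             * the device cannot take more IO, so stop pulling requests
             * out of the scheduler.
             */
            bool got_tag = get_driver_tag(rq);

            queue_for_dispatch(rq);
            if (!got_tag)
                break;  /* only this single rq waits for a free tag */
        }
    }

Because the tag is grabbed before the next scheduler pull, at most one
tagless request sits in the dispatch list at a time, which is exactly
the property the patch relies on to preserve BFQ's service
differentiation.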
2. 24 May 2021, 1 commit
3. 05 Mar 2021, 1 commit
4. 25 Jan 2021, 1 commit
5. 13 Dec 2020, 1 commit
6. 02 Dec 2020, 1 commit
7. 04 Sep 2020, 5 commits
8. 02 Jul 2020, 1 commit
9. 01 Jul 2020, 1 commit
10. 30 Jun 2020, 3 commits
11. 29 Jun 2020, 1 commit
12. 07 Jun 2020, 1 commit
13. 30 May 2020, 1 commit
14. 27 Feb 2020, 1 commit
15. 25 Feb 2020, 1 commit
• blk-mq: insert passthrough request into hctx->dispatch directly · 01e99aec
  Authored by Ming Lei
For some reason, a device may get into a state where it can't handle
FS requests, so STS_RESOURCE is always returned and the FS requests
are added to hctx->dispatch. However, a passthrough request may be
required at that time to fix the problem. If the passthrough request
is added to the scheduler queue, blk-mq has no chance to dispatch it,
given that requests in hctx->dispatch are prioritized. The FS IO
requests may then never complete, and an IO hang results.
      
So the passthrough request has to be added to hctx->dispatch directly
to fix the IO hang.
      
Fix this issue by inserting passthrough requests into hctx->dispatch
directly, while adding FS requests to the tail of hctx->dispatch in
blk_mq_dispatch_rq_list(). FS requests are in fact added to the tail
of hctx->dispatch by default, see blk_mq_request_bypass_insert().

This makes the behavior consistent with the original legacy IO request
path, in which passthrough requests were always added to q->queue_head
(see the sketch after this entry).
      
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
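A rough sketch of the insertion policy described above, using
hypothetical helper names (rq_is_passthrough, add_to_dispatch,
add_to_scheduler) rather than the actual blk-mq functions:

    /* Illustrative sketch: hypothetical types and helpers. */
    #include <stdbool.h>

    struct request;

    bool rq_is_passthrough(struct request *rq);             /* hypothetical */
    void add_to_dispatch(struct request *rq, bool at_head); /* hypothetical */
    void add_to_scheduler(struct request *rq);              /* hypothetical */

    static void insert_request(struct request *rq)
    {
        if (rq_is_passthrough(rq)) {
            /*
             * Bypass the scheduler queue entirely so the passthrough
             * request can still be issued while the device is failing
             * FS requests with STS_RESOURCE.
             */
            add_to_dispatch(rq, true);
        } else {
            /* FS requests go through the IO scheduler as usual. */
            add_to_scheduler(rq);
        }
    }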
16. 07 Oct 2019, 1 commit
17. 11 Jul 2019, 1 commit
• block: Disable write plugging for zoned block devices · b49773e7
  Authored by Damien Le Moal
Simultaneously writing to a sequential zone of a zoned block device
from multiple contexts requires mutual exclusion for BIO issuing to
ensure that writes happen sequentially. However, even for a well
behaved user correctly implementing such synchronization, BIO plugging
may interfere and result in BIOs from the different contexts being
reordered if plugging is done outside of the mutual exclusion section,
e.g. when the plug was started by a function higher in the call chain
than the function issuing the BIOs.
      
               Context A                     Context B
      
         | blk_start_plug()
         | ...
         | seq_write_zone()
           | mutex_lock(zone)
           | bio-0->bi_iter.bi_sector = zone->wp
           | zone->wp += bio_sectors(bio-0)
           | submit_bio(bio-0)
           | bio-1->bi_iter.bi_sector = zone->wp
           | zone->wp += bio_sectors(bio-1)
           | submit_bio(bio-1)
           | mutex_unlock(zone)
           | return
         | -----------------------> | seq_write_zone()
                             | mutex_lock(zone)
                             | bio-2->bi_iter.bi_sector = zone->wp
                             | zone->wp += bio_sectors(bio-2)
                             | submit_bio(bio-2)
                             | mutex_unlock(zone)
         | <------------------------- |
         | blk_finish_plug()
      
      In the above example, despite the mutex synchronization ensuring the
      correct BIO issuing order 0, 1, 2, context A BIOs 0 and 1 end up being
      issued after BIO 2 of context B, when the plug is released with
      blk_finish_plug().
      
While this problem can be addressed using the blk_flush_plug_list()
function (in the above example, the call must be inserted before the
zone mutex lock is released), a simple generic solution in the block
layer avoids this additional code in all zoned block device user code.
The simple generic solution implemented with this patch is to introduce
the internal helper function blk_mq_plug() to access the current
context plug on BIO submission. This helper returns the current plug
only if the target device is not a zoned block device or if the BIO to
be plugged is not a write operation. Otherwise, the caller context plug
is ignored and NULL is returned, so that writes to zoned block devices
are never plugged (see the sketch after this entry).
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
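The helper's decision logic amounts to the following sketch; the types
and the current_context_plug() accessor are simplified, hypothetical
stand-ins for the real struct request_queue, struct bio, and plug
handling:

    /* Illustrative sketch with simplified, hypothetical types. */
    #include <stdbool.h>
    #include <stddef.h>

    struct blk_plug;

    struct queue_sketch { bool zoned; };
    struct bio_sketch   { bool is_write; };

    struct blk_plug *current_context_plug(void);  /* hypothetical accessor */

    static struct blk_plug *plug_for_bio(struct queue_sketch *q,
                                         struct bio_sketch *bio)
    {
        /*
         * Writes to a zoned device must never be plugged, or BIOs
         * issued under a mutex could be reordered when the plug is
         * flushed outside the mutual exclusion section.
         */
        if (q->zoned && bio->is_write)
            return NULL;

        return current_context_plug();
    }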
18. 03 Jul 2019, 1 commit
19. 04 May 2019, 1 commit
• blk-mq: free hw queue's resource in hctx's release handler · c7e2d94b
  Authored by Ming Lei
Once blk_cleanup_queue() returns, tags shouldn't be used any more,
because blk_mq_free_tag_set() may be called. Commit 45a9c9d9
("blk-mq: Fix a use-after-free") fixes exactly this issue.

However, that commit introduces another issue. Before 45a9c9d9,
we were allowed to run the queue during queue cleanup as long as the
queue's kobj refcount was held. After that commit, the queue can't be
run during cleanup, otherwise an oops is easily triggered because
some fields of the hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().
      
We have devised ways to address this kind of issue before, such as:
      
      	8dc765d4 ("SCSI: fix queue cleanup race before queue initialization is done")
      	c2856ae2 ("blk-mq: quiesce queue before freeing queue")
      
But they still can't cover all cases; recently James reported another
issue of this kind:
      
      	https://marc.info/?l=linux-scsi&m=155389088124782&w=2
      
This issue is quite hard to address with the previous approaches, given
that scsi_run_queue() may run requeues for other LUNs.

Fix the above issue by freeing the hctx's resources in its release
handler; this is safe because tags aren't needed for freeing such hctx
resources.

This approach follows the typical design pattern for kobject release
handlers (see the sketch after this entry).
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E. J. Bottomley <jejb@linux.vnet.ibm.com>
Reported-by: James Smart <james.smart@broadcom.com>
Fixes: 45a9c9d9 ("blk-mq: Fix a use-after-free")
Cc: stable@vger.kernel.org
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
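A hedged sketch of the release-handler pattern the commit refers to,
using the real kref API but a hypothetical struct hctx_sketch and
free_hw_resources() helper; it is not the actual blk-mq code:

    /*
     * Sketch of the kobject/kref release pattern (kernel-style C).
     * struct hctx_sketch and free_hw_resources() are hypothetical.
     */
    #include <linux/kref.h>
    #include <linux/slab.h>

    struct hctx_sketch {
        struct kref ref;
        void *hw_resources;
    };

    void free_hw_resources(void *res);  /* hypothetical */

    static void hctx_release(struct kref *kref)
    {
        struct hctx_sketch *hctx =
            container_of(kref, struct hctx_sketch, ref);

        /*
         * Runs only after the last reference is dropped, so no queue
         * run can still be touching these resources; tags are not
         * needed to free them.
         */
        free_hw_resources(hctx->hw_resources);
        kfree(hctx);
    }

    /* Callers drop their reference with: kref_put(&hctx->ref, hctx_release); */

The design choice is the standard one for refcounted kernel objects:
deferring the free to the release handler turns "is it safe to free
yet?" into plain reference counting.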
20. 05 Apr 2019, 1 commit
• block: Revert v5.0 blk_mq_request_issue_directly() changes · fd9c40f6
  Authored by Bart Van Assche
blk_mq_try_issue_directly() can return BLK_STS*_RESOURCE for requests that
have been queued. If that happens when blk_mq_try_issue_directly() is called
by the dm-mpath driver, then dm-mpath will try to resubmit a request that is
already queued, and a kernel crash follows. Since it is nontrivial to fix
blk_mq_request_issue_directly(), revert the blk_mq_request_issue_directly()
changes that went into kernel v5.0.
      
      This patch reverts the following commits:
      * d6a51a97 ("blk-mq: replace and kill blk_mq_request_issue_directly") # v5.0.
      * 5b7a6f12 ("blk-mq: issue directly with bypass 'false' in blk_mq_sched_insert_requests") # v5.0.
      * 7f556a44 ("blk-mq: refactor the code of issue request directly") # v5.0.
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: Laurence Oberman <loberman@redhat.com>
      Cc: <stable@vger.kernel.org>
Reported-by: Laurence Oberman <loberman@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Fixes: 7f556a44 ("blk-mq: refactor the code of issue request directly") # v5.0.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
21. 25 Mar 2019, 1 commit
22. 21 Mar 2019, 1 commit
23. 09 Feb 2019, 1 commit
24. 01 Feb 2019, 2 commits
25. 18 Dec 2018, 1 commit
• blk-mq: fix dispatch from sw queue · c16d6b5a
  Authored by Ming Lei
When a request is added to the rq list of a sw queue (ctx), the rq may
come from a different type of hctx, especially after multiple queue
mapping was introduced.

So when dispatching requests from a sw queue via blk_mq_flush_busy_ctxs()
or blk_mq_dequeue_from_ctx(), a request belonging to a different hctx
type can be dispatched to the current hctx when the read or poll queues
are enabled.
      
This patch fixes this issue by introducing per-queue-type lists (see
the sketch after this entry).
      
      Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
      
Changed by me to not use separate cacheline-aligned lists; just place
them all in the same cacheline where we previously had the one list
and lock.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
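A hedged sketch of the per-type layout, assuming hypothetical names
(the real fields live in struct blk_mq_ctx): each sw queue keeps one
request list per hctx type, and a flush for a given hctx splices only
the matching list.

    /* Illustrative kernel-style sketch; names are hypothetical. */
    #include <linux/list.h>
    #include <linux/spinlock.h>

    enum hctx_type_sketch {
        SKETCH_TYPE_DEFAULT,
        SKETCH_TYPE_READ,
        SKETCH_TYPE_POLL,
        SKETCH_TYPE_NR,
    };

    struct ctx_sketch {
        spinlock_t lock;
        /* One sw-queue list per hctx type, sharing one lock/cacheline. */
        struct list_head rq_lists[SKETCH_TYPE_NR];
    };

    /* Dispatch for a given hctx type only touches the matching list. */
    static void flush_ctx_type(struct ctx_sketch *ctx,
                               enum hctx_type_sketch type,
                               struct list_head *out)
    {
        spin_lock(&ctx->lock);
        list_splice_tail_init(&ctx->rq_lists[type], out);
        spin_unlock(&ctx->lock);
    }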
26. 17 Dec 2018, 1 commit
27. 16 Dec 2018, 1 commit
28. 10 Dec 2018, 1 commit
29. 05 Dec 2018, 1 commit
30. 30 Nov 2018, 1 commit
31. 21 Nov 2018, 1 commit
32. 08 Nov 2018, 2 commits