- March 29, 2023 (1 commit)

Submitted by Zhong Jinghua
hulk inclusion
category: bugfix
bugzilla: 187268, https://gitee.com/openeuler/kernel/issues/I5N162

----------------------------------------

This reverts commit 51e35e67.

There is a newer fix for this problem in mainline, so this patch should
return to the mainline solution.

Mainline patch: d36a9ea5 ("block: fix use-after-free of
q->q_usage_counter")

Fixes: 51e35e67 ("block: fix null-deref in percpu_ref_put")
Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>

- December 12, 2022 (1 commit)

Submitted by Yu Kuai
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I65K8D
CVE: NA

--------------------------------

Before commit f60df4a0 ("blk-mq: fix kabi broken in struct request"),
drivers got the cmd address right after the request; after that commit,
drivers get the cmd address after request_wrapper instead, which is
bigger than request and causes compatibility issues.

Fix the problem by placing request_wrapper behind cmd, so that the cmd
address seen by drivers stays the same.

Before commit:   |request|cmd|
After commit:    |request|request_wrapper|cmd|
With this patch: |request|cmd|request_wrapper|

Performance test: arm64 Kunpeng-920, 96 cores

1) null_blk setup:

    modprobe null_blk nr_devices=0 && udevadm settle &&
    cd /sys/kernel/config/nullb && mkdir nullb0 && cd nullb0 &&
    echo 0 > completion_nsec && echo 512 > blocksize &&
    echo 0 > home_node && echo 0 > irqmode && echo 1024 > size &&
    echo 0 > memory_backed && echo 2 > queue_mode &&
    echo 4096 > hw_queue_depth && echo 96 > submit_queues &&
    echo 1 > power

2) fio test script:

    [global]
    ioengine=libaio
    direct=1
    numjobs=96
    iodepth=32
    bs=4k
    rw=randwrite
    allow_mounted_write=0
    time_based
    runtime=60
    group_reporting=1
    ioscheduler=none
    cpus_allowed_policy=split
    cpus_allowed=0-95

    [test]
    filename=/dev/nullb0

3) iops test result:

    without this patch: 23.9M
    with this patch:    24.1M

Fixes: f60df4a0 ("blk-mq: fix kabi broken in struct request")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
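
To make the layout concrete, here is a minimal sketch of the pointer
arithmetic this implies, assuming the driver cmd size is available from
the tag set; the helper name and the wrapper's field are illustrative,
not the exact openEuler code:

    /*
     * Hedged sketch: with the |request|cmd|request_wrapper| layout, the
     * wrapper sits past the request and the driver cmd payload.
     */
    struct request_wrapper {
            u64 stat_time_ns;       /* illustrative extension field */
    };

    static inline struct request_wrapper *request_to_wrapper(struct request *rq)
    {
            return (void *)rq + sizeof(struct request) +
                   rq->q->tag_set->cmd_size;
    }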

- November 2, 2022 (3 commits)

Submitted by Yu Kuai
hulk inclusion
category: bugfix
bugzilla: 186917, https://gitee.com/openeuler/kernel/issues/I5N1S5
CVE: NA

--------------------------------

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

Submitted by John Garry
mainline inclusion
from mainline-v5.16-rc1
commit e155b0c2
category: performance
bugzilla: 186917, https://gitee.com/openeuler/kernel/issues/I5N1S5
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e155b0c238b20f0a866f4334d292656665836c8a

--------------------------------

Currently we use separate sbitmap pairs and an active_queues atomic_t
for shared sbitmap support. However, a full set of static requests is
allocated per HW queue, which is quite wasteful, considering that the
total number of requests usable at any given time across all HW queues
is limited by the shared sbitmap depth.

As such, it is considerably more memory efficient in the shared sbitmap
case to allocate a set of static rqs per tag set or request queue, not
per HW queue.

So replace the sbitmap pairs and active_queues atomic_t with shared
tags per tagset and request queue, which will hold a set of shared
static rqs.

Since there is now no valid HW queue index to be passed to the
blk_mq_ops .init and .exit_request callbacks, pass an invalid index
token. This changes the semantics of the APIs, such that the callback
needs to validate the HW queue index before using it. Currently no user
of the shared sbitmap actually uses the HW queue index (as would be
expected).

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1633429419-228500-13-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflict: c6f9c0e2 ("blk-mq: allow hardware queue to get more tag while
sharing a tag set") is merged, which causes lots of conflicts for this
patch; the functionality is adapted in that patch.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
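
As a sketch of what the changed callback semantics mean for a driver,
assuming the invalid-index token is the BLK_MQ_NO_HCTX_IDX definition
this change introduces (the foo_* names are illustrative):

    /* Validate the hctx index now that shared tags pass an invalid
     * token to .init_request / .exit_request. */
    static int foo_init_request(struct blk_mq_tag_set *set, struct request *rq,
                                unsigned int hctx_idx, unsigned int numa_node)
    {
            if (hctx_idx != BLK_MQ_NO_HCTX_IDX) {
                    /* per-hctx setup only when a real index is passed */
                    foo_bind_cmd_to_hwq(blk_mq_rq_to_pdu(rq), hctx_idx);
            }
            return 0;
    }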

Submitted by Zhang Wensheng
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5N162
CVE: NA

--------------------------------

blk_cleanup_queue() uses "wait_event(q->mq_freeze_wq,
percpu_ref_is_zero(&q->q_usage_counter))" to wait for q_usage_counter
to drop to zero. However, if q_usage_counter reaches zero quickly,
percpu_ref_exit() may run and free ref->data while another process is
still inside percpu_ref_put(), causing a null-deref like the following:

    CPU0                               CPU1
    blk_cleanup_queue
     blk_freeze_queue
      blk_mq_freeze_queue_wait
                                       scsi_end_request
                                        percpu_ref_get
                                        ...
                                        percpu_ref_put
                                         atomic_long_sub_and_test
     percpu_ref_exit
      ref->data -> NULL
                                          ref->data->release(ref) -> null-deref

Fix it by adding a flag (QUEUE_FLAG_USAGE_COUNT_SYNC) as a
synchronization mechanism: the flag is set when ref->data->release is
called, and the wait_event in blk_mq_freeze_queue_wait must also wait
for the flag to become true, which prevents percpu_ref_exit() from
executing ahead of time.

Signed-off-by: Zhang Wensheng <zhangwensheng5@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
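
A rough sketch of the ordering this creates, using the flag and wait
condition named in the message (how the flag is cleared on queue
re-initialization is omitted; surrounding details are illustrative):

    static void blk_queue_usage_counter_release(struct percpu_ref *ref)
    {
            struct request_queue *q =
                    container_of(ref, struct request_queue, q_usage_counter);

            /* mark that the release callback has finished running... */
            blk_queue_flag_set(QUEUE_FLAG_USAGE_COUNT_SYNC, q);
            wake_up_all(&q->mq_freeze_wq);
    }

    void blk_mq_freeze_queue_wait(struct request_queue *q)
    {
            /* ...so the freezer cannot reach percpu_ref_exit() before it */
            wait_event(q->mq_freeze_wq,
                       percpu_ref_is_zero(&q->q_usage_counter) &&
                       test_bit(QUEUE_FLAG_USAGE_COUNT_SYNC, &q->queue_flags));
    }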

- July 13, 2022 (1 commit)

Submitted by Yu Kuai
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I57S8D
CVE: NA

--------------------------------

Since there are no reserved fields in struct request, declare a wrapper
to fix the kabi breakage.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
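
A minimal sketch of the wrapper idiom, assuming new fields simply wrap
the kabi-frozen structure (the extension field is illustrative):

    /* Extend struct request without changing its kabi-frozen layout:
     * existing code keeps using rq, new code goes through the wrapper. */
    struct request_wrapper {
            struct request rq;
            u64 stat_time_ns;       /* illustrative new field */
    };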

- March 21, 2022 (1 commit)

Submitted by Yu Kuai
hulk inclusion
category: bugfix
bugzilla: 186389, https://gitee.com/openeuler/kernel/issues/I4Y43S
CVE: NA

--------------------------------

blk_mq_realloc_hw_ctxs() will free 'queue_hw_ctx' (e.g. when updating
submit_queues through configfs for null_blk), while it might still be
used from another context (e.g. when switching the elevator to none):

    t1                                  t2
    elevator_switch
     blk_mq_unquiesce_queue
      blk_mq_run_hw_queues
       queue_for_each_hw_ctx
        // assembly code for hctx = (q)->queue_hw_ctx[i]
        mov 0x48(%rbp),%rdx  -> read old queue_hw_ctx
                                        __blk_mq_update_nr_hw_queues
                                         blk_mq_realloc_hw_ctxs
                                          hctxs = q->queue_hw_ctx
                                          q->queue_hw_ctx = new_hctxs
                                          kfree(hctxs)
        movslq %ebx,%rax
        mov (%rdx,%rax,8),%rdi  -> uaf

Since the queue is frozen in __blk_mq_update_nr_hw_queues(), fix the
problem by protecting 'queue_hw_ctx' through RCU where it can be
accessed without grabbing 'q_usage_counter'.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Zhang Wensheng <zhangwensheng5@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
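
A rough sketch of the RCU pattern described, with illustrative helper
names (the actual patch touches blk_mq_realloc_hw_ctxs() and the
queue_for_each_hw_ctx readers):

    /* Reader: pick up the array under RCU instead of relying on
     * q_usage_counter. */
    static struct blk_mq_hw_ctx *blk_mq_get_hctx(struct request_queue *q,
                                                 unsigned int i)
    {
            struct blk_mq_hw_ctx *hctx;

            rcu_read_lock();
            hctx = rcu_dereference(q->queue_hw_ctx)[i];
            rcu_read_unlock();

            return hctx;
    }

    /* Resizer: publish the new array, free the old one only after all
     * RCU readers are done with it. */
    static void blk_mq_swap_hw_ctx_array(struct request_queue *q,
                                         struct blk_mq_hw_ctx **new_hctxs)
    {
            struct blk_mq_hw_ctx **old = q->queue_hw_ctx;

            rcu_assign_pointer(q->queue_hw_ctx, new_hctxs);
            synchronize_rcu();
            kfree(old);
    }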

- March 8, 2022 (3 commits)

Submitted by Yu Kuai
hulk inclusion
category: performance
bugzilla: https://gitee.com/openeuler/kernel/issues/I4S8DW

---------------------------

If pending_queues is increased once, it will only be decreased when
nr_active reaches zero, which leads to under-utilization of host tags:
while pending_queues is non-zero, the available tags for the queue are
max(host tags / active_queues, 4) instead of what the queue actually
needs.

Fix it by attaching an expiration time to each increase of
pending_queues and decreasing it when it expires. That way
pending_queues drops back to zero as long as there are no further tag
allocation failures, and the available tags for the queue become the
whole host tag space.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

Submitted by Yu Kuai
hulk inclusion
category: performance
bugzilla: https://gitee.com/openeuler/kernel/issues/I4S8DW

---------------------------

When sharing a tag set, most disks may issue only a small amount of IO
while a few issue a lot. The current approach limits the maximum number
of tags a disk can get to the average share of the total, so the few
heavily loaded disks cannot get enough tags even though many tags are
still free in the tag set.

Add 'pending_queues' to blk_mq_tag_set to count how many queues failed
to get a driver tag. If this value is zero, there is no need to limit
the maximum number of available tags, as sketched below.

On the other hand, if a queue stops issuing IO, 'active_queues' is not
decreased for a period of time (the request timeout), so many tags stay
unavailable because the maximum number of available tags is capped at
max(total tags / active_queues, 4). Therefore, decrease it when
'nr_active' is 0.

This functionality is enabled by default; to disable it, add
"blk_mq.unfair_dtag=0" to the boot cmdline.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
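
A sketch of the intended gating in the tag-sharing check; the
pending_queues placement follows the commit text loosely, and the
surrounding details are illustrative rather than the exact patch:

    /* Skip the fair-share cap entirely while no queue has recorded a
     * driver-tag allocation failure. */
    static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
                                      struct sbitmap_queue *bt)
    {
            unsigned int depth, users;

            if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
                    return true;
            if (!atomic_read(&hctx->tags->pending_queues))
                    return true;    /* nobody starved: no per-queue cap */

            users = atomic_read(&hctx->tags->active_queues);
            if (!users)
                    return true;

            depth = max((bt->sb.depth + users - 1) / users, 4U);
            return atomic_read(&hctx->nr_active) < depth;
    }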

Submitted by Jan Kara
mainline inclusion
from mainline-5.12-rc1
commit 5ac83c64
category: perf
bugzilla: https://gitee.com/openeuler/kernel/issues/I4SW26
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5ac83c644f5fb924f0b2c09102ab82fc788f8411

-------------------------------------------------

This reverts commit b445547e.

Since both mq-deadline and BFQ completely ignore the hctx they are
passed to their dispatch function, and dispatch whatever request they
deem fit, checking whether any request for a particular hctx is queued
is just pointless, since we'll very likely get a request from a
different hctx anyway. In the following commit we'll deal with lock
contention in these IO schedulers in presence of multiple HW queues in
a different way.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
    block/bfq-iosched.c

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

- December 31, 2021 (1 commit)

Submitted by Zhihao Cheng
hulk inclusion
category: feature
bugzilla: 185747, https://gitee.com/openeuler/kernel/issues/I4OUFN
CVE: NA

-------------------------------

Introduce kabi reserved fields for the storage module.

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
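
A minimal sketch of the reserved-field idiom such a patch adds; the
macro body below is a simplified stand-in for the tree's actual kabi
helpers, and foo_ops is illustrative:

    struct request;

    /* simplified stand-in for the tree's kabi reserve helper */
    #define KABI_RESERVE(n) unsigned long kabi_reserve##n

    struct foo_ops {
            void (*existing_cb)(struct request *rq);

            /* padding that later fixes can turn into real members
             * without changing the structure's size or layout */
            KABI_RESERVE(1);
            KABI_RESERVE(2);
            KABI_RESERVE(3);
            KABI_RESERVE(4);
    };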

- November 15, 2021 (1 commit)

Submitted by Bart Van Assche
mainline inclusion
from mainline-5.15-rc1
commit 90b71980
category: performance
bugzilla: 174005, https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=90b7198001f23ea37d3b46dc631bdaa2357a20b1

---------------------------

elevator_get_default() uses the following algorithm to select an I/O
scheduler from inside add_disk():
- In case of a single hardware queue or if sharing hardware queues
  across multiple request queues (BLK_MQ_F_TAG_HCTX_SHARED), use
  mq-deadline.
- Otherwise, use 'none'.

This is a good choice for most but not for all block drivers. Make it
possible to override the selection of mq-deadline with a new flag,
namely BLK_MQ_F_NO_SCHED_BY_DEFAULT.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Martijn Coenen <maco@android.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210805174200.3250718-2-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
    block/elevator.c

Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
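
The resulting selection logic looks roughly like this (a sketch based
on the commit description; helper names as in mainline v5.15, details
may differ across stable trees):

    static struct elevator_type *elevator_get_default(struct request_queue *q)
    {
            if (q->tag_set && q->tag_set->flags & BLK_MQ_F_NO_SCHED_BY_DEFAULT)
                    return NULL;    /* driver opted out: stay with 'none' */

            if (q->nr_hw_queues != 1 &&
                !blk_mq_is_sbitmap_shared(q->tag_set->flags))
                    return NULL;

            return elevator_get(q, "mq-deadline", false);
    }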

- July 6, 2021 (1 commit)

Submitted by Ming Lei
mainline inclusion
from mainline-5.11-rc1
commit fb01a293
category: bugfix
bugzilla: 108493
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fb01a2932e81a1fb2273f87ff92dc8172b8880ee

---------------------------

flush_end_io() may be called recursively from some drivers, such as
nvme-loop, so lockdep may complain about 'possible recursive locking'.
Commit b3c6a599 ("block: Fix a lockdep complaint triggered by request
queue flushing") tried to address this issue by assigning a dynamically
allocated per-flush-queue lock class. This solution adds
synchronize_rcu() to each hctx's release handler, and causes a horrible
SCSI MQ probe delay (more than half an hour on megaraid sas).

Add the new API blk_mq_hctx_set_fq_lock_class() for these drivers, so
we just need to use a driver-specific lock class to avoid the lockdep
warning of 'possible recursive locking'.

Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reported-by: Qian Cai <cai@redhat.com>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
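
Usage from a driver looks roughly like this (the foo_* names are
illustrative):

    /* Give all hctx flush queues one driver-private lock class so
     * recursive flush_end_io() calls don't trip lockdep. */
    static struct lock_class_key foo_flush_fq_lock_class;

    static void foo_set_fq_lock_class(struct request_queue *q)
    {
            struct blk_mq_hw_ctx *hctx;
            unsigned int i;

            queue_for_each_hw_ctx(q, hctx, i)
                    blk_mq_hctx_set_fq_lock_class(hctx,
                                                  &foo_flush_fq_lock_class);
    }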

- January 27, 2021 (2 commits)

Submitted by Bart Van Assche
stable inclusion
from stable-5.10.7
commit 782c9ef2ac059a25d6afbac344319574414258db
bugzilla: 47429

--------------------------------

[ Upstream commit a4d34da7 ]

Remove flag RQF_PREEMPT and BLK_MQ_REQ_PREEMPT since these are no
longer used by any kernel code.

Link: https://lore.kernel.org/r/20201209052951.16136-8-bvanassche@acm.org
Cc: Can Guo <cang@codeaurora.org>
Cc: Stanley Chu <stanley.chu@mediatek.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Martin Kepplinger <martin.kepplinger@puri.sm>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>

Submitted by Bart Van Assche
stable inclusion
from stable-5.10.7
commit 8ed46b329d4e62a1d0c7b17361c0e364eaf4a9da
bugzilla: 47429

--------------------------------

[ Upstream commit 0854bcdc ]

Introduce the BLK_MQ_REQ_PM flag. This flag makes the request
allocation functions set RQF_PM. This is the first step towards
removing BLK_MQ_REQ_PREEMPT.

Link: https://lore.kernel.org/r/20201209052951.16136-3-bvanassche@acm.org
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Stanley Chu <stanley.chu@mediatek.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Can Guo <cang@codeaurora.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
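
A sketch of an allocation site after this step (the surrounding foo_*
code is illustrative):

    static int foo_send_pm_cmd(struct request_queue *q)
    {
            struct request *rq;

            /* BLK_MQ_REQ_PM makes the allocator set RQF_PM */
            rq = blk_mq_alloc_request(q, REQ_OP_DRV_IN, BLK_MQ_REQ_PM);
            if (IS_ERR(rq))
                    return PTR_ERR(rq);

            blk_execute_rq(q, NULL, rq, 0);
            blk_mq_free_request(rq);
            return 0;
    }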

- October 29, 2020 (1 commit)

Submitted by Mauro Carvalho Chehab
As reported by kernel-doc:

    ./include/linux/blk-mq.h:267: warning: Function parameter or member
    'active_queues_shared_sbitmap' not described in 'blk_mq_tag_set'

There is now a new member in struct blk_mq_tag_set. Add a description
for it, based on the commit that introduced it.

Fixes: f1b49fdc ("blk-mq: Record active_queues_shared_sbitmap per tag_set for when using shared sbitmap")
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/8e513153b83eefc05e358f51f2632b592c3f6772.1603791716.git.mchehab+huawei@kernel.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
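
The added description is an ordinary kernel-doc member line, along
these lines (abridged sketch; the exact wording is in the patch):

    /**
     * struct blk_mq_tag_set - tag set that can be shared between request queues
     * @active_queues_shared_sbitmap:
     *      number of active request queues when using a shared sbitmap,
     *      accounted per tag_set rather than per hctx
     */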

- September 4, 2020 (4 commits)

Submitted by Kashyap Desai
High CPU utilization on "native_queued_spin_lock_slowpath" due to lock
contention is possible for the mq-deadline and bfq IO schedulers when
nr_hw_queues is more than one.

It is because the kblockd work queue can submit IO from all online CPUs
(through blk_mq_run_hw_queues()) even though only one hctx has pending
commands.

The elevator callback .has_work for the mq-deadline and bfq schedulers
considers pending work if there are any IOs on the request queue, but
it does not account for the hctx context.

Add a per-hctx 'elevator_queued' count to avoid triggering the elevator
even though there are no requests queued.

[jpg: Relocated atomic_dec() in dd_dispatch_request(), update commit
message per Kashyap]

Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
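
The mq-deadline side of the change looks roughly like this (a sketch;
BFQ gets an equivalent early-out):

    /* Bail out of .has_work early when this hctx queued nothing, so
     * kblockd does not touch the scheduler on every online CPU. */
    static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
    {
            struct deadline_data *dd = hctx->queue->elevator->elevator_data;

            if (!atomic_read(&hctx->elevator_queued))
                    return false;

            return !list_empty_careful(&dd->dispatch) ||
                   !list_empty_careful(&dd->fifo_list[READ]) ||
                   !list_empty_careful(&dd->fifo_list[WRITE]);
    }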

Submitted by John Garry
When using a shared sbitmap, the number of active request queues per
hctx should no longer be relied on when judging how to share the tag
bitmap.

Instead, maintain the number of active request queues per tag_set, and
make the judgement based on that.

Originally-from: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace <don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Submitted by John Garry
Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ...)
support multiple reply queues with single hostwide tags. In addition,
these drivers want to use interrupt assignment in
pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
CPU hotplug may cause in-flight IO completions to not be serviced when
an interrupt is shut down. That problem is solved in commit bf0beec0
("blk-mq: drain I/O when all CPUs in a hctx are offline").

However, to take advantage of that blk-mq feature, the HBA HW queues
are required to be mapped to the blk-mq hctx's; to do that, the HBA HW
queues need to be exposed to the upper layer.

In making that transition, the per-SCSI-command request tags are no
longer unique per Scsi host - they are just unique per hctx. As such,
the HBA LLDD would have to generate this tag internally, which has a
certain performance overhead.

However, another problem is that blk-mq assumes the host may accept
(Scsi_host.can_queue * #hw queue) commands. In commit 6eb045e0
("scsi: core: avoid host-wide host_busy counter for scsi_mq"), the Scsi
host busy counter was removed, which would stop the LLDD being sent
more than .can_queue commands; however, it should still be ensured that
the block layer does not issue more than .can_queue commands to the
Scsi host.

To solve this problem, introduce a shared sbitmap per blk_mq_tag_set,
which may be requested at init time. The new flag
BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the tagset to
indicate whether the shared sbitmap should be used.

Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and
requests is still allocated per hctx; the reason for this is that if
tags and requests were only allocated for a single hctx - like hctx0 -
it may break block drivers which expect a request to be associated with
a specific hctx, i.e. not always hctx0. This will introduce extra
memory usage.

This change is based on work originally from Ming Lei in [1] and from
Bart's suggestion in [2].

[0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
[1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
[2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be

Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace <don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
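Opting in from an LLDD then comes down to one flag at tag set
allocation time (a sketch; the foo_* names are illustrative):

    /* Hostwide tags for an HBA exposing its HW queues to blk-mq. */
    static int foo_init_tag_set(struct foo_host *host)
    {
            struct blk_mq_tag_set *set = &host->tag_set;

            set->ops = &foo_mq_ops;
            set->nr_hw_queues = host->nr_reply_queues;
            set->queue_depth = host->can_queue;  /* shared by all hctxs */
            set->numa_node = NUMA_NO_NODE;
            set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_TAG_HCTX_SHARED;

            return blk_mq_alloc_tag_set(set);
    }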

Submitted by Ming Lei
BLK_MQ_F_TAG_SHARED actually means that the tags are shared among
request queues, all of which should belong to LUNs attached to the same
HBA. So rename it to make the point explicit.

[jpg: rebase a few times, add rnbd-clt.c change]

Suggested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

- September 2, 2020 (1 commit)

Submitted by Baolin Wang
Move blk_mq_bio_list_merge() into blk-merge.c and rename it to a
generic name.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

- July 29, 2020 (1 commit)

Submitted by Daniel Wagner
No need to define typedefs for the callbacks, because there is not a
single user except blk_mq_ops.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

- July 1, 2020 (1 commit)

Submitted by Christoph Hellwig
The make_request_fn is a little weird in that it sits directly in
struct request_queue instead of an operation vector. Replace it with a
block_device_operations method called submit_bio (which describes much
better what it does). Also remove the request_queue argument to it, as
the queue can be derived pretty trivially from the bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
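
For a bio-based driver the conversion looks roughly like this (foo_*
names are illustrative; note the queue is now derived from the bio):

    static blk_qc_t foo_submit_bio(struct bio *bio)
    {
            struct foo_dev *dev = bio->bi_disk->private_data;

            foo_handle_bio(dev, bio);
            return BLK_QC_T_NONE;
    }

    static const struct block_device_operations foo_fops = {
            .owner          = THIS_MODULE,
            .submit_bio     = foo_submit_bio,
    };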

- June 30, 2020 (1 commit)

Submitted by Ming Lei
The blk-mq budget is abstracted from scsi's device queue depth, and it
is always per-request-queue instead of per-hctx. It can be quite absurd
to get a budget from one hctx, then dequeue a request from the
scheduler queue, and find that this request may not belong to this
hctx, at least for bfq and deadline. So fix the mess and always pass
the request queue to the get/put budget callbacks.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Baolin Wang <baolin.wang7@gmail.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Baolin Wang <baolin.wang7@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
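
The new hook signatures in blk_mq_ops are then (abridged sketch):

    struct blk_mq_ops {
            /* ... other callbacks elided ... */

            /* Reserve and release budget, now keyed on the request
             * queue rather than on a possibly-mismatched hctx. */
            bool (*get_budget)(struct request_queue *q);
            void (*put_budget)(struct request_queue *q);
    };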

- June 29, 2020 (1 commit)

Submitted by Christoph Hellwig
Just check for a non-NULL elevator directly to make the code more
clear.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

- June 24, 2020 (2 commits)

Submitted by Christoph Hellwig
blk_mq_complete_request_remote() is a variant of
blk_mq_complete_request() that only completes the request if it needs
to be bounced to another CPU or a softirq. If the request can be
completed locally, the function returns false and lets the driver
complete it without requiring an indirect function call.

Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
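
The driver-side pattern the helper enables (a sketch; the foo_* names
are illustrative):

    /* Try to bounce the completion to the submitting CPU or a softirq;
     * on false, finish it right here with a direct call. */
    static void foo_handle_cqe(struct request *rq)
    {
            if (!blk_mq_complete_request_remote(rq))
                    foo_complete_rq(rq);    /* same routine ->complete uses */
    }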

Submitted by Christoph Hellwig
Move the call to blk_should_fake_timeout out of blk_mq_complete_request
and into the drivers, skipping call sites that are obvious error
handlers, and remove the now superfluous blk_mq_force_complete_rq
helper. This ensures we don't keep injecting errors into completions
that just terminate the Linux request after the hardware has been reset
or the command has been aborted.

Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

- May 30, 2020 (2 commits)

Submitted by Ming Lei
Most blk-mq drivers depend on managed IRQs' auto-affinity to set up
queue mapping. Thomas mentioned the following point [1]:

    "That was the constraint of managed interrupts from the very
    beginning: The driver/subsystem has to quiesce the interrupt line
    and the associated queue _before_ it gets shutdown in CPU unplug
    and not fiddle with it until it's restarted by the core when the
    CPU is plugged in again."

However, the current blk-mq implementation doesn't quiesce the hw queue
before the last CPU in the hctx is shut down. Even worse,
CPUHP_BLK_MQ_DEAD is a cpuhp state handled after the CPU is down, so
there isn't any chance to quiesce the hctx before shutting down the
CPU.

Add a new CPUHP_AP_BLK_MQ_ONLINE state to stop allocating from blk-mq
hctxs where the last CPU goes away, and wait for completion of
in-flight requests. This guarantees that there is no inflight I/O
before shutting down the managed IRQ.

Add a BLK_MQ_F_STACKING flag and set it for dm-rq and loop, so we don't
need to wait for completion of in-flight requests from these drivers to
avoid a potential dead-lock. It is safe to do this for stacking drivers
as those do not use interrupts at all and their I/O completions are
triggered by the underlying devices' I/O completion.

[1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/

[hch: different retry mechanism, merged two patches, minor cleanups]

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
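
For a stacking driver, opting out of the new hotplug drain is a single
flag in its tag set setup, e.g. loop-style (abridged sketch; field
values are illustrative):

    lo->tag_set.ops = &loop_mq_ops;
    lo->tag_set.nr_hw_queues = 1;
    lo->tag_set.queue_depth = 128;
    lo->tag_set.numa_node = NUMA_NO_NODE;
    /* no managed IRQs here: skip the CPU-hotplug quiesce */
    lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_STACKING;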

Submitted by Keith Busch
Drivers may need to bypass error injection for error recovery. Rename
__blk_mq_complete_request() to blk_mq_force_complete_rq() and export
that function so drivers may skip potential fake timeouts after they've
reclaimed lost requests.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

- April 25, 2020 (1 commit)

Submitted by Christoph Hellwig
Call blk_mq_make_request when no ->make_request_fn is set. This is safe
now that blk_alloc_queue always sets up the pointer for make_request
based drivers. This avoids an indirect call in the blk-mq driver I/O
fast path, which is rather expensive due to spectre mitigations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

- April 21, 2020 (1 commit)

Submitted by Douglas Anderson
We have:

* blk_mq_run_hw_queue()
* blk_mq_delay_run_hw_queue()
* blk_mq_run_hw_queues()

...but not blk_mq_delay_run_hw_queues(), presumably because nobody
needed it before now. Since we need it for a later patch in this
series, add it.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
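
The helper is the plural counterpart of the existing delayed run, along
these lines (a sketch):

    /* Kick every hardware queue after a delay, skipping stopped ones. */
    void blk_mq_delay_run_hw_queues(struct request_queue *q, unsigned long msecs)
    {
            struct blk_mq_hw_ctx *hctx;
            int i;

            queue_for_each_hw_ctx(q, hctx, i) {
                    if (blk_mq_hctx_stopped(hctx))
                            continue;
                    blk_mq_delay_run_hw_queue(hctx, msecs);
            }
    }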

- April 19, 2020 (1 commit)

Submitted by Gustavo A. R. Silva
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array
member [1][2], introduced in C99:

    struct foo {
            int stuff;
            struct boo array[];
    };

By making use of the mechanism above, we will get a compiler warning in
case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced [3] to the codebase from now on.

Also, notice that dynamic memory allocations won't be affected by this
change:

    "Flexible array members have incomplete type, and so the sizeof
    operator may not be applied. As a quirk of the original
    implementation of zero-length arrays, sizeof evaluates to zero." [1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>

- March 28, 2020 (1 commit)

Submitted by Christoph Hellwig
This allows a driver to pass a queuedata member before ->init_hctx is
called. null_blk currently open codes this logic, but I'd rather have
it in the core to ease future maintenance.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
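
A sketch of a driver using the new helper (the foo_* names are
illustrative):

    static int foo_init_queue(struct foo_dev *dev)
    {
            struct request_queue *q;

            /* queuedata is visible by the time ->init_hctx runs */
            q = blk_mq_init_queue_data(&dev->tag_set, dev);
            if (IS_ERR(q))
                    return PTR_ERR(q);

            dev->queue = q;
            return 0;
    }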

- March 10, 2020 (1 commit)

Submitted by Bart Van Assche
The 'hctx_list' member of struct blk_mq_hw_ctx is not a list head but
instead an entry in q->unused_hctx_list. Fix the comment above this
struct member.

Fixes: d386732b ("blk-mq: fill header with kernel-doc")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Cc: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

- November 14, 2019 (1 commit)

Submitted by John Garry
These functions are not referenced, so delete them.

Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

- November 1, 2019 (1 commit)

Submitted by John Garry
Since commit 97889f9a ("blk-mq: remove synchronize_rcu() from
blk_mq_del_queue_tag_set()"), the return value of blk_mq_run_hw_queue()
is never checked, so make it return void, which very marginally
simplifies the code.

Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

- October 26, 2019 (1 commit)

Submitted by André Almeida
Insert documentation for structs, enums and functions in the header
file. Format existing and new comments in struct blk_mq_ops as
kernel-doc comments.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

- October 7, 2019 (2 commits)

Submitted by Pavel Begunkov
blk_mq_request_completed() and blk_mq_request_started() are short, so
inline them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
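
After the change the checkers live in the header as static inlines,
along these lines (a sketch; the open-coded state reads stand in for
whatever helper the header uses):

    static inline bool blk_mq_request_started(struct request *rq)
    {
            return READ_ONCE(rq->state) != MQ_RQ_IDLE;
    }

    static inline bool blk_mq_request_completed(struct request *rq)
    {
            return READ_ONCE(rq->state) == MQ_RQ_COMPLETE;
    }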

Submitted by Bart Van Assche
The meaning of several member variables of these two data structures is
nontrivial. Hence document all member variables using the kernel-doc
syntax.

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

- September 6, 2019 (1 commit)

Submitted by Damien Le Moal
When elevator_init_mq() is called from blk_mq_init_allocated_queue(),
the only information known about the device is the number of hardware
queues, as the block device scan by the device driver is not completed
yet for most drivers. The device type and elevator required features
are not set yet, preventing correct selection of the default elevator
most suitable for the device.

This currently affects all multi-queue zoned block devices, which
default to the "none" elevator instead of the required "mq-deadline"
elevator. These drives currently include host-managed SMR disks
connected to a smartpqi HBA and null_blk block devices with zoned mode
enabled. Upcoming NVMe Zoned Namespace devices will also be affected.

Fix this by adding the boolean elevator_init argument to
blk_mq_init_allocated_queue() to control the execution of
elevator_init_mq(). Two cases exist:

1) elevator_init = false is used for calls to
   blk_mq_init_allocated_queue() within blk_mq_init_queue(). In this
   case, a call to elevator_init_mq() is added to __device_add_disk(),
   resulting in the delayed initialization of the queue elevator after
   the device driver has finished probing the device information. This
   effectively allows elevator_init_mq() access to more information
   about the device.

2) elevator_init = true preserves the current behavior of initializing
   the elevator directly from blk_mq_init_allocated_queue(). This case
   is used for the special request-based DM devices where the device
   gendisk is created before the queue initialization and device
   information (e.g. queue limits) is already known when the queue
   initialization is executed.

Additionally, to make sure that the elevator initialization is never
done while requests are in flight (there should be none when the device
driver calls device_add_disk()), freeze and quiesce the device request
queue before calling blk_mq_init_sched() in elevator_init_mq().

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
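
The two cases reduce to the following call patterns (a sketch per the
commit text):

    /* 1) generic probe path: defer elevator setup to device_add_disk(),
     *    after the driver has finished probing the device */
    q = blk_mq_init_allocated_queue(set, q, false);

    /* 2) request-based DM: the gendisk already exists, so set up the
     *    elevator right away */
    q = blk_mq_init_allocated_queue(set, q, true);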