提交 · 786d4e01c550e8bb7c9f9f23bca0596a2a33483c · openeuler / Kernel

27 11月, 2021 1 次提交

block: call rq_qos_done() before ref check in batch completions · 98b26a0e

由 Jens Axboe 提交于 11月 26, 2021

We need to call rq_qos_done() regardless of whether or not we're freeing
the request or not, as the reference count doesn't cover the IO completion
tracking.

Fixes: f794f335 ("block: add support for blk_mq_end_request_batch()")
Reported-by: NShinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reported-by: NKenneth R. Crudup <kenny@panix.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

98b26a0e

19 11月, 2021 1 次提交

blk-mq: don't insert FUA request with data into scheduler queue · 2b504bd4

由 Ming Lei 提交于 11月 18, 2021

We never insert flush request into scheduler queue before.

Recently commit d92ca9d8 ("blk-mq: don't handle non-flush requests in
blk_insert_flush") tries to handle FUA data request as normal request.
This way has caused warning[1] in mq-deadline dd_exit_sched() or io hang in
case of kyber since RQF_ELVPRIV isn't set for flush request, then
->finish_request won't be called.

Fix the issue by inserting FUA data request with blk_mq_request_bypass_insert()
when the device supports FUA, just like what we did before.

[1] https://lore.kernel.org/linux-block/CAHj4cs-_vkTW=dAzbZYGxpEWSpzpcmaNeY1R=vH311+9vMUSdg@mail.gmail.com/Reported-by: NYi Zhang <yi.zhang@redhat.com>
Fixes: d92ca9d8 ("blk-mq: don't handle non-flush requests in blk_insert_flush")
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Reviewed-by: NBart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20211118153041.2163228-1-ming.lei@redhat.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

2b504bd4

16 11月, 2021 2 次提交

blk-mq: cancel blk-mq dispatch work in both blk_cleanup_queue and disk_release() · 2a19b28f

由 Ming Lei 提交于 11月 16, 2021

For avoiding to slow down queue destroy, we don't call
blk_mq_quiesce_queue() in blk_cleanup_queue(), instead of delaying to
cancel dispatch work in blk_release_queue().

However, this way has caused kernel oops[1], reported by Changhui. The log
shows that scsi_device can be freed before running blk_release_queue(),
which is expected too since scsi_device is released after the scsi disk
is closed and the scsi_device is removed.

Fixes the issue by canceling blk-mq dispatch work in both blk_cleanup_queue()
and disk_release():

1) when disk_release() is run, the disk has been closed, and any sync
dispatch activities have been done, so canceling dispatch work is enough to
quiesce filesystem I/O dispatch activity.

2) in blk_cleanup_queue(), we only focus on passthrough request, and
passthrough request is always explicitly allocated & freed by
its caller, so once queue is frozen, all sync dispatch activity
for passthrough request has been done, then it is enough to just cancel
dispatch work for avoiding any dispatch activity.

[1] kernel panic log
[12622.769416] BUG: kernel NULL pointer dereference, address: 0000000000000300
[12622.777186] #PF: supervisor read access in kernel mode
[12622.782918] #PF: error_code(0x0000) - not-present page
[12622.788649] PGD 0 P4D 0
[12622.791474] Oops: 0000 [#1] PREEMPT SMP PTI
[12622.796138] CPU: 10 PID: 744 Comm: kworker/10:1H Kdump: loaded Not tainted 5.15.0+ #1
[12622.804877] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015
[12622.813321] Workqueue: kblockd blk_mq_run_work_fn
[12622.818572] RIP: 0010:sbitmap_get+0x75/0x190
[12622.823336] Code: 85 80 00 00 00 41 8b 57 08 85 d2 0f 84 b1 00 00 00 45 31 e4 48 63 cd 48 8d 1c 49 48 c1 e3 06 49 03 5f 10 4c 8d 6b 40 83 f0 01 <48> 8b 33 44 89 f2 4c 89 ef 0f b6 c8 e8 fa f3 ff ff 83 f8 ff 75 58
[12622.844290] RSP: 0018:ffffb00a446dbd40 EFLAGS: 00010202
[12622.850120] RAX: 0000000000000001 RBX: 0000000000000300 RCX: 0000000000000004
[12622.858082] RDX: 0000000000000006 RSI: 0000000000000082 RDI: ffffa0b7a2dfe030
[12622.866042] RBP: 0000000000000004 R08: 0000000000000001 R09: ffffa0b742721334
[12622.874003] R10: 0000000000000008 R11: 0000000000000008 R12: 0000000000000000
[12622.881964] R13: 0000000000000340 R14: 0000000000000000 R15: ffffa0b7a2dfe030
[12622.889926] FS:  0000000000000000(0000) GS:ffffa0baafb40000(0000) knlGS:0000000000000000
[12622.898956] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12622.905367] CR2: 0000000000000300 CR3: 0000000641210001 CR4: 00000000001706e0
[12622.913328] Call Trace:
[12622.916055]  <TASK>
[12622.918394]  scsi_mq_get_budget+0x1a/0x110
[12622.922969]  __blk_mq_do_dispatch_sched+0x1d4/0x320
[12622.928404]  ? pick_next_task_fair+0x39/0x390
[12622.933268]  __blk_mq_sched_dispatch_requests+0xf4/0x140
[12622.939194]  blk_mq_sched_dispatch_requests+0x30/0x60
[12622.944829]  __blk_mq_run_hw_queue+0x30/0xa0
[12622.949593]  process_one_work+0x1e8/0x3c0
[12622.954059]  worker_thread+0x50/0x3b0
[12622.958144]  ? rescuer_thread+0x370/0x370
[12622.962616]  kthread+0x158/0x180
[12622.966218]  ? set_kthread_struct+0x40/0x40
[12622.970884]  ret_from_fork+0x22/0x30
[12622.974875]  </TASK>
[12622.977309] Modules linked in: scsi_debug rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs sunrpc dm_multipath intel_rapl_msr intel_rapl_common dell_wmi_descriptor sb_edac rfkill video x86_pkg_temp_thermal intel_powerclamp dcdbas coretemp kvm_intel kvm mgag200 irqbypass i2c_algo_bit rapl drm_kms_helper ipmi_ssif intel_cstate intel_uncore syscopyarea sysfillrect sysimgblt fb_sys_fops pcspkr cec mei_me lpc_ich mei ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm fuse xfs libcrc32c sr_mod cdrom sd_mod t10_pi sg ixgbe ahci libahci crct10dif_pclmul crc32_pclmul crc32c_intel libata megaraid_sas ghash_clmulni_intel tg3 wdat_wdt mdio dca wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_debug]
Reported-by: NChanghuiZhong <czhong@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211116014343.610501-1-ming.lei@redhat.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

2a19b28f

block: fix missing queue put in error path · 95febeb6

由 Jens Axboe 提交于 11月 15, 2021

If we fail the submission queue checks, we don't put the queue afterwards.
This can cause various issues like stalls on scheduler switch or failure
to remove the device, or like in the original bug report, timeout waiting
for the device on reboot/restart.

While in there, fix a few whitespace discrepancies in the surrounding
code.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215039
Fixes: b637108a ("blk-mq: fix filesystem I/O request allocation")
Reported-and-tested-by: NStephen Smith <stephenmsmith@blueyonder.co.uk>
Reviewed-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

95febeb6

13 11月, 2021 1 次提交

blk-mq: fix filesystem I/O request allocation · b637108a

由 Ming Lei 提交于 11月 12, 2021

submit_bio_checks() may update bio->bi_opf, so we have to initialize
blk_mq_alloc_data.cmd_flags with bio->bi_opf after submit_bio_checks()
returns when allocating new request.

In case of using cached request, fallback to allocate new request if
cached rq isn't compatible with the incoming bio, otherwise change
rq->cmd_flags with incoming bio->bi_opf.

Fixes: 900e0807 ("block: move queue enter logic into blk_mq_submit_bio()")
Reported-by: NGeert Uytterhoeven <geert@linux-m68k.org>
Tested-by: NGeert Uytterhoeven <geert@linux-m68k.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b637108a

12 11月, 2021 1 次提交

blk-mq: rename blk_attempt_bio_merge · b131f201

由 Ming Lei 提交于 11月 11, 2021

It is very annoying to have two block layer functions which share same
name, so rename blk_attempt_bio_merge in blk-mq.c as
blk_mq_attempt_bio_merge.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211111085134.345235-3-ming.lei@redhat.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

b131f201

09 11月, 2021 1 次提交

blk-mq: add one API for waiting until quiesce is done · 9ef4d020

由 Ming Lei 提交于 11月 09, 2021

Some drivers(NVMe, SCSI) need to call quiesce and unquiesce in pair, but it
is hard to switch to this style, so these drivers need one atomic flag for
helping to balance quiesce and unquiesce.

When quiesce is in-progress, the driver still needs to wait until
the quiesce is done, so add API of blk_mq_wait_quiesce_done() for
these drivers.
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211109071144.181581-2-ming.lei@redhat.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

9ef4d020

08 11月, 2021 1 次提交

blk-mq: don't free tags if the tag_set is used by other device in queue initialztion · a846a8e6

由 Ye Bin 提交于 11月 08, 2021

We got UAF report on v5.10 as follows:
[ 1446.674930] ==================================================================
[ 1446.675970] BUG: KASAN: use-after-free in blk_mq_get_driver_tag+0x9a4/0xa90
[ 1446.676902] Read of size 8 at addr ffff8880185afd10 by task kworker/1:2/12348
[ 1446.677851]
[ 1446.678073] CPU: 1 PID: 12348 Comm: kworker/1:2 Not tainted 5.10.0-10177-gc9c81b1e346a #2
[ 1446.679168] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 1446.680692] Workqueue: kthrotld blk_throtl_dispatch_work_fn
[ 1446.681448] Call Trace:
[ 1446.681800]  dump_stack+0x9b/0xce
[ 1446.682916]  print_address_description.constprop.6+0x3e/0x60
[ 1446.685999]  kasan_report.cold.9+0x22/0x3a
[ 1446.687186]  blk_mq_get_driver_tag+0x9a4/0xa90
[ 1446.687785]  blk_mq_dispatch_rq_list+0x21a/0x1d40
[ 1446.692576]  __blk_mq_do_dispatch_sched+0x394/0x830
[ 1446.695758]  __blk_mq_sched_dispatch_requests+0x398/0x4f0
[ 1446.698279]  blk_mq_sched_dispatch_requests+0xdf/0x140
[ 1446.698967]  __blk_mq_run_hw_queue+0xc0/0x270
[ 1446.699561]  __blk_mq_delay_run_hw_queue+0x4cc/0x550
[ 1446.701407]  blk_mq_run_hw_queue+0x13b/0x2b0
[ 1446.702593]  blk_mq_sched_insert_requests+0x1de/0x390
[ 1446.703309]  blk_mq_flush_plug_list+0x4b4/0x760
[ 1446.705408]  blk_flush_plug_list+0x2c5/0x480
[ 1446.708471]  blk_finish_plug+0x55/0xa0
[ 1446.708980]  blk_throtl_dispatch_work_fn+0x23b/0x2e0
[ 1446.711236]  process_one_work+0x6d4/0xfe0
[ 1446.711778]  worker_thread+0x91/0xc80
[ 1446.713400]  kthread+0x32d/0x3f0
[ 1446.714362]  ret_from_fork+0x1f/0x30
[ 1446.714846]
[ 1446.715062] Allocated by task 1:
[ 1446.715509]  kasan_save_stack+0x19/0x40
[ 1446.716026]  __kasan_kmalloc.constprop.1+0xc1/0xd0
[ 1446.716673]  blk_mq_init_tags+0x6d/0x330
[ 1446.717207]  blk_mq_alloc_rq_map+0x50/0x1c0
[ 1446.717769]  __blk_mq_alloc_map_and_request+0xe5/0x320
[ 1446.718459]  blk_mq_alloc_tag_set+0x679/0xdc0
[ 1446.719050]  scsi_add_host_with_dma.cold.3+0xa0/0x5db
[ 1446.719736]  virtscsi_probe+0x7bf/0xbd0
[ 1446.720265]  virtio_dev_probe+0x402/0x6c0
[ 1446.720808]  really_probe+0x276/0xde0
[ 1446.721320]  driver_probe_device+0x267/0x3d0
[ 1446.721892]  device_driver_attach+0xfe/0x140
[ 1446.722491]  __driver_attach+0x13a/0x2c0
[ 1446.723037]  bus_for_each_dev+0x146/0x1c0
[ 1446.723603]  bus_add_driver+0x3fc/0x680
[ 1446.724145]  driver_register+0x1c0/0x400
[ 1446.724693]  init+0xa2/0xe8
[ 1446.725091]  do_one_initcall+0x9e/0x310
[ 1446.725626]  kernel_init_freeable+0xc56/0xcb9
[ 1446.726231]  kernel_init+0x11/0x198
[ 1446.726714]  ret_from_fork+0x1f/0x30
[ 1446.727212]
[ 1446.727433] Freed by task 26992:
[ 1446.727882]  kasan_save_stack+0x19/0x40
[ 1446.728420]  kasan_set_track+0x1c/0x30
[ 1446.728943]  kasan_set_free_info+0x1b/0x30
[ 1446.729517]  __kasan_slab_free+0x111/0x160
[ 1446.730084]  kfree+0xb8/0x520
[ 1446.730507]  blk_mq_free_map_and_requests+0x10b/0x1b0
[ 1446.731206]  blk_mq_realloc_hw_ctxs+0x8cb/0x15b0
[ 1446.731844]  blk_mq_init_allocated_queue+0x374/0x1380
[ 1446.732540]  blk_mq_init_queue_data+0x7f/0xd0
[ 1446.733155]  scsi_mq_alloc_queue+0x45/0x170
[ 1446.733730]  scsi_alloc_sdev+0x73c/0xb20
[ 1446.734281]  scsi_probe_and_add_lun+0x9a6/0x2d90
[ 1446.734916]  __scsi_scan_target+0x208/0xc50
[ 1446.735500]  scsi_scan_channel.part.3+0x113/0x170
[ 1446.736149]  scsi_scan_host_selected+0x25a/0x360
[ 1446.736783]  store_scan+0x290/0x2d0
[ 1446.737275]  dev_attr_store+0x55/0x80
[ 1446.737782]  sysfs_kf_write+0x132/0x190
[ 1446.738313]  kernfs_fop_write_iter+0x319/0x4b0
[ 1446.738921]  new_sync_write+0x40e/0x5c0
[ 1446.739429]  vfs_write+0x519/0x720
[ 1446.739877]  ksys_write+0xf8/0x1f0
[ 1446.740332]  do_syscall_64+0x2d/0x40
[ 1446.740802]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1446.741462]
[ 1446.741670] The buggy address belongs to the object at ffff8880185afd00
[ 1446.741670]  which belongs to the cache kmalloc-256 of size 256
[ 1446.743276] The buggy address is located 16 bytes inside of
[ 1446.743276]  256-byte region [ffff8880185afd00, ffff8880185afe00)
[ 1446.744765] The buggy address belongs to the page:
[ 1446.745416] page:ffffea0000616b00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x185ac
[ 1446.746694] head:ffffea0000616b00 order:2 compound_mapcount:0 compound_pincount:0
[ 1446.747719] flags: 0x1fffff80010200(slab|head)
[ 1446.748337] raw: 001fffff80010200 ffffea00006a3208 ffffea000061bf08 ffff88801004f240
[ 1446.749404] raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
[ 1446.750455] page dumped because: kasan: bad access detected
[ 1446.751227]
[ 1446.751445] Memory state around the buggy address:
[ 1446.752102]  ffff8880185afc00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1446.753090]  ffff8880185afc80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1446.754079] >ffff8880185afd00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1446.755065]                          ^
[ 1446.755589]  ffff8880185afd80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1446.756574]  ffff8880185afe00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1446.757566] ==================================================================

Flag 'BLK_MQ_F_TAG_QUEUE_SHARED' will be set if the second device on the
same host initializes it's queue successfully. However, if the second
device failed to allocate memory in blk_mq_alloc_and_init_hctx() from
blk_mq_realloc_hw_ctxs() from blk_mq_init_allocated_queue(),
__blk_mq_free_map_and_rqs() will be called on error path, and if
'BLK_MQ_TAG_HCTX_SHARED' is not set, 'tag_set->tags' will be freed
while it's still used by the first device.

To fix this issue we move release newly allocated hardware context from
blk_mq_realloc_hw_ctxs to __blk_mq_update_nr_hw_queues. As there is needn't to
release hardware context in blk_mq_init_allocated_queue.

Fixes: 868f2f0b ("blk-mq: dynamic h/w context count")
Signed-off-by: NYe Bin <yebin10@huawei.com>
Signed-off-by: NYu Kuai <yukuai3@huawei.com>
Reviewed-by: NMing Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211108074019.1058843-1-yebin10@huawei.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

a846a8e6

05 11月, 2021 4 次提交

block: ensure cached plug request matches the current queue · 10c47870

由 Jens Axboe 提交于 11月 04, 2021

If we're driving multiple devices, we could have pre-populated the cache
for a different device. Ensure that the empty request matches the current
queue.

Fixes: 47c122e3 ("block: pre-allocate requests if plug is started and is a batch")
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

10c47870

block: move queue enter logic into blk_mq_submit_bio() · 900e0807

由 Jens Axboe 提交于 11月 03, 2021

Retain the old logic for the fops based submit, but for our internal
blk_mq_submit_bio(), move the queue entering logic into the core
function itself.

We need to be a bit careful if going into the scheduler, as a scheduler
or queue mappings can arbitrarily change before we have entered the queue.
Have the bio scheduler mapping do that separately, it's a very cheap
operation compared to actually doing merging locking and lookups.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
[axboe: update to check merge post submit_bio_checks() doing remap...]
Signed-off-by: NJens Axboe <axboe@kernel.dk>

900e0807

block: split request allocation components into helpers · 71539717

由 Jens Axboe 提交于 11月 03, 2021

This is in preparation for a fix, but serves as a cleanup as well moving
the cached vs regular alloc logic out of blk_mq_submit_bio().
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

71539717

block: have plug stored requests hold references to the queue · c5fc7b93

由 Jens Axboe 提交于 11月 03, 2021

Requests that were stored in the cache deliberately didn't hold an enter
reference to the queue, instead we grabbed one every time we pulled a
request out of there. That made for awkward logic on freeing the remainder
of the cached list, if needed, where we had to artificially raise the
queue usage count before each free.

Grab references up front for cached plug requests. That's safer, and also
more efficient.

Fixes: 47c122e3 ("block: pre-allocate requests if plug is started and is a batch")
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

c5fc7b93

03 11月, 2021 2 次提交

blk-mq: update hctx->nr_active in blk_mq_end_request_batch() · 3b87c6ea

由 Ming Lei 提交于 11月 02, 2021

In case of shared tags and none io sched, batched completion still may
be run into, and hctx->nr_active is accounted when getting driver tag,
so it has to be updated in blk_mq_end_request_batch().

Otherwise, hctx->nr_active may become same with queue depth, then
hctx_may_queue() always return false, then io hang is caused.

Fixes the issue by updating the counter in batched way.
Reported-by: NShinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: f794f335 ("block: add support for blk_mq_end_request_batch()")
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211102153619.3627505-4-ming.lei@redhat.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

3b87c6ea

block: move RQF_ELV setting into allocators · 781dd830

由 Jens Axboe 提交于 11月 02, 2021

It's not safe to do this before blk_queue_enter(), as the scheduler state
could have changed in between. Hence move the RQF_ELV setting into the
allocators, where we know the queue is already entered.
Suggested-by: NMing Lei <ming.lei@redhat.com>
Reported-by: NYi Zhang <yi.zhang@redhat.com>
Reported-by: NSteffen Maier <maier@linux.ibm.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

781dd830

02 11月, 2021 2 次提交

block: replace always false argument with 'false' · b2280909

由 Jens Axboe 提交于 11月 01, 2021

A previous commit fixed up the condition for doing direct issue, but that
left the 'from_schedule' argument dead inside the branch. Replace it with
'false'.

Fixes: ff155223 ("blk-mq: don't issue request directly in case that current is to be blocked")
Reviewed-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b2280909

block: assign correct tag before doing prefetch of request · a22c00be

由 Jens Axboe 提交于 11月 01, 2021

Ensure that current tag is correctly assigned before attempting
to prefetch the first cacheline of the request.

Fixes: 92aff191 ("block: prefetch request to be initialized")
Reported-and-tested-by: syzbot+cd20829ac44b92bf6ed0@syzkaller.appspotmail.com
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a22c00be

29 10月, 2021 1 次提交

block: improve readability of blk_mq_end_request_batch() · 02f7eab0

由 Jens Axboe 提交于 10月 28, 2021

It's faster and easier to read if we tolerate cur_hctx being NULL in
the "when to flush" condition. Rename last_hctx to cur_hctx while at it,
as it better describes the role of that variable.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

02f7eab0

27 10月, 2021 5 次提交

block: re-flow blk_mq_rq_ctx_init() · c7b84d42

由 Jens Axboe 提交于 10月 19, 2021

Now that we have flags passed in, we can do a final re-arrange of the
flow of blk_mq_rq_ctx_init() so we're always writing request in the
order in which it is laid out.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20211019153300.623322-5-axboe@kernel.dkSigned-off-by: NJens Axboe <axboe@kernel.dk>

c7b84d42

block: prefetch request to be initialized · 92aff191

由 Jens Axboe 提交于 10月 19, 2021

Now we have the tags available in __blk_mq_alloc_requests_batch(), we
can start fetching the first request cacheline before calling into the
request initialization.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20211019153300.623322-4-axboe@kernel.dkSigned-off-by: NJens Axboe <axboe@kernel.dk>

92aff191

block: pass in blk_mq_tags to blk_mq_rq_ctx_init() · fe6134f6

由 Jens Axboe 提交于 10月 19, 2021

Instead of getting this from data for every invocation of request
initialization, pass it in as an argument instead.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20211019153300.623322-3-axboe@kernel.dkSigned-off-by: NJens Axboe <axboe@kernel.dk>

fe6134f6

block: add rq_flags to struct blk_mq_alloc_data · 56f8da64

由 Jens Axboe 提交于 10月 19, 2021

There's a hole here we can use, and it's faster to set this earlier
rather than need to check q->elevator multiple times.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20211019153300.623322-2-axboe@kernel.dkSigned-off-by: NJens Axboe <axboe@kernel.dk>

56f8da64

block: schedule queue restart after BLK_STS_ZONE_RESOURCE · 9586e67b

由 Naohiro Aota 提交于 10月 27, 2021

When dispatching a zone append write request to a SCSI zoned block device,
if the target zone of the request is already locked, the device driver will
return BLK_STS_ZONE_RESOURCE and the request will be pushed back to the
hctx dipatch queue. The queue will be marked as RESTART in
dd_finish_request() and restarted in __blk_mq_free_request(). However, this
restart applies to the hctx of the completed request. If the requeued
request is on a different hctx, dispatch will no be retried until another
request is submitted or the next periodic queue run triggers, leading to up
to 30 seconds latency for the requeued request.

Fix this problem by scheduling a queue restart similarly to the
BLK_STS_RESOURCE case or when we cannot get the budget.

Also, consolidate the checks into the "need_resource" variable to simplify
the condition.
Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
Link: https://lore.kernel.org/r/20211026165127.4151055-1-naohiro.aota@wdc.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

9586e67b

26 10月, 2021 1 次提交

blk-mq: don't issue request directly in case that current is to be blocked · ff155223

由 Ming Lei 提交于 10月 26, 2021

When flushing plug list in case that current will be blocked, we can't
issue request directly because ->queue_rq() may sleep, otherwise scheduler
may complain.

Fixes: dc5fc361 ("block: attempt direct issue of plug list")
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211026082257.2889890-1-ming.lei@redhat.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

ff155223

22 10月, 2021 1 次提交

block: fix req_bio_endio append error handling · 297db731

由 Pavel Begunkov 提交于 10月 22, 2021

Shinichiro Kawasaki reports that there is a bug in a recent
req_bio_endio() patch causing problems with zonefs. As Shinichiro
suggested, inverse the condition in zone append path to resemble how it
was before: fail when it's not fully completed.

Fixes: 478eb72b ("block: optimise req_bio_endio()")
Reported-by: NShinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/344ea4e334aace9148b41af5f2426da38c8aa65a.1634914228.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

297db731

21 10月, 2021 1 次提交

block: clean up blk_mq_submit_bio() merging · 179ae84f

由 Pavel Begunkov 提交于 10月 20, 2021

Combine blk_mq_sched_bio_merge() and blk_attempt_plug_merge() under a
common if, so we don't check it twice.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/daedc90d4029a5d1d73344771632b1faca3aaf81.1634755800.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

179ae84f

20 10月, 2021 6 次提交

blk-mq: only flush requests from the plug in blk_mq_submit_bio · a214b949

由 Christoph Hellwig 提交于 10月 20, 2021

Replace the call to blk_flush_plug_list in blk_mq_submit_bio with a
direct call to blk_mq_flush_plug_list. This means we do not flush
plug callback from stackable devices, which doesn't really help with
the accumulated requests anyway, and it also means the cached requests
aren't freed here as they can still be used later on.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211020144119.142582-2-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

a214b949

block: remove inaccurate requeue check · 037057a5

由 Jens Axboe 提交于 10月 20, 2021

This check is meant to catch cases where a requeue is attempted on a
request that is still inserted. It's never really been useful to catch any
misuse, and now it's actively wrong. Outside of that, this should not be a
BUG_ON() to begin with.

Remove the check as it's now causing active harm, as requeue off the plug
path will trigger it even though the request state is just fine.
Reported-by: NYi Zhang <yi.zhang@redhat.com>
Link: https://lore.kernel.org/linux-block/CAHj4cs80zAUc2grnCZ015-2Rvd-=gXRfB_dFKy=RTm+wRo09HQ@mail.gmail.com/Signed-off-by: NJens Axboe <axboe@kernel.dk>

037057a5

block: optimise req_bio_endio() · 478eb72b

由 Pavel Begunkov 提交于 10月 19, 2021

First, get rid of an extra branch and chain error checks. Also reshuffle
it with bio_advance(), so it goes closer to the final check, with that
the compiler loads rq->rq_flags only once, and also doesn't reload
bio->bi_iter.bi_size if bio_advance() didn't actually advanced the iter.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

478eb72b

blk-mq: support concurrent queue quiesce/unquiesce · e70feb8b

由 Ming Lei 提交于 10月 14, 2021

blk_mq_quiesce_queue() has been used a bit wide now, so far we don't support
concurrent/nested quiesce. One biggest issue is that unquiesce can happen
unexpectedly in case that quiesce/unquiesce are run concurrently from
more than one context.

This patch introduces q->mq_quiesce_depth to deal concurrent quiesce,
and we only unquiesce queue when it is the last/outer-most one of all
contexts.

Several kernel panic issue has been reported[1][2][3] when running stress
quiesce test. And this patch has been verified in these reports.

[1] https://lore.kernel.org/linux-block/9b21c797-e505-3821-4f5b-df7bf9380328@huawei.com/T/#m1fc52431fad7f33b1ffc3f12c4450e4238540787
[2] https://lore.kernel.org/linux-block/9b21c797-e505-3821-4f5b-df7bf9380328@huawei.com/T/#m10ad90afeb9c8cc318334190a7c24c8b5c5e0722
[3] https://listman.redhat.com/archives/dm-devel/2021-September/msg00189.htmlSigned-off-by: NMing Lei <ming.lei@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211014081710.1871747-7-ming.lei@redhat.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

e70feb8b

block: inline fast path of driver tag allocation · a808a9d5

由 Jens Axboe 提交于 10月 13, 2021

If we don't use an IO scheduler or have shared tags, then we don't need
to call into this external function at all. This saves ~2% for such
a setup.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a808a9d5

blk-mq: don't handle non-flush requests in blk_insert_flush · d92ca9d8

由 Christoph Hellwig 提交于 10月 19, 2021

Return to the normal blk_mq_submit_bio flow if the bio did not end up
actually being a flush because the device didn't support it. Note that
this is basically impossible to hit without special instrumentation given
that submit_bio_checks already clears these flags usually, so we'd need a
tight race to actually hit this code path.

With this the call to blk_mq_run_hw_queue for the flush requests can be
removed given that the actual flush requests are always issued via the
requeue workqueue which runs the queue unconditionally.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211019122553.2467817-1-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

d92ca9d8

19 10月, 2021 9 次提交

block: attempt direct issue of plug list · dc5fc361

由 Jens Axboe 提交于 10月 19, 2021

If we have just one queue type in the plug list, then we can extend our
direct issue to cover a full plug list as well. This allows sending a
batch of requests for direct issue, which is more efficient than doing
one-at-a-time kind of issue.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

dc5fc361

block: change plugging to use a singly linked list · bc490f81

由 Jens Axboe 提交于 10月 18, 2021

Use a singly linked list for the blk_plug. This saves 8 bytes in the
blk_plug struct, and makes for faster list manipulations than doubly
linked lists. As we don't use the doubly linked lists for anything,
singly linked is just fine.

This yields a bump in default (merging enabled) performance from 7.0
to 7.1M IOPS, and ~7.5M IOPS with merging disabled.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

bc490f81

block: move blk_mq_tag_to_rq() inline · e028f167

由 Jens Axboe 提交于 10月 16, 2021

This is in the fast path of driver issue or completion, and it's a single
array index operation. Move it inline to avoid a function call for it.

This does mean making struct blk_mq_tags block layer public, but there's
not really much in there.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e028f167

block: get rid of plug list sorting · df87eb0f

由 Jens Axboe 提交于 10月 18, 2021

Even if we have multiple queues in the plug list, chances that they
are very interspersed is minimal. Don't bother spending CPU cycles
sorting the list.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

df87eb0f

block: return whether or not to unplug through boolean · 87c037d1

由 Jens Axboe 提交于 10月 18, 2021

Instead of returning the same queue request through a request pointer,
use a boolean to accomplish the same.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

87c037d1

block: don't call blk_status_to_errno in blk_update_request · 8a7d267b

由 Christoph Hellwig 提交于 10月 18, 2021

We only need to call it to resolve the blk_status_t -> errno mapping for
tracing, so move the conversion into the tracepoints that are not called
at all when tracing isn't enabled.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

8a7d267b

block: fix too broad elevator check in blk_mq_free_request() · e0d78afe

由 Jens Axboe 提交于 10月 18, 2021

We added RQF_ELV to tell whether there's an IO scheduler attached, and
RQF_ELVPRIV tells us whether there's an IO scheduler with private data
attached. Don't check RQF_ELV in blk_mq_free_request(), what we care
about here is just if we have scheduler private data attached.

This fixes a boot crash

Fixes: 2ff0682d ("block: store elevator state in request")
Reported-by: NYi Zhang <yi.zhang@redhat.com>
Reported-by: syzbot+eb8104072aeab6cc1195@syzkaller.appspotmail.com
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e0d78afe

block: add support for blk_mq_end_request_batch() · f794f335

由 Jens Axboe 提交于 10月 08, 2021

Instead of calling blk_mq_end_request() on a single request, add a helper
that takes the new struct io_comp_batch and completes any request stored
in there.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

f794f335

block: add a struct io_comp_batch argument to fops->iopoll() · 5a72e899

由 Jens Axboe 提交于 10月 12, 2021

struct io_comp_batch contains a list head and a completion handler, which
will allow completions to more effciently completed batches of IO.

For now, no functional changes in this patch, we just define the
io_comp_batch structure and add the argument to the file_operations iopoll
handler.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

5a72e899

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功