1. 30 Sep 2022, 18 commits
    • nvme: Use blk_rq_map_user_io helper · 7f056357
      Anuj Gupta authored
      Use blk_rq_map_user_io instead of duplicating the same code in
      different places.
      Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
      Link: https://lore.kernel.org/r/20220930062749.152261-6-anuj20.g@samsung.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • scsi: Use blk_rq_map_user_io helper · 6732932c
      Anuj Gupta authored
      Use the new blk_rq_map_user_io helper instead of duplicating code at
      various places. This also takes advantage of the on-stack iov fast
      path.
      Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20220930062749.152261-5-anuj20.g@samsung.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: add blk_rq_map_user_io · 55765402
      Anuj Gupta authored
      Create a helper, blk_rq_map_user_io, for mapping both vectored and
      non-vectored requests. This saves duplication of code at a few places
      in scsi and nvme; a sketch of a call site follows below.
      Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20220930062749.152261-4-anuj20.g@samsung.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
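
      A minimal call-site sketch in kernel C, assuming the signature this
      series introduces (map_data pointer, gfp flags, and the vec/iov_count
      pair); treat it as an illustration rather than a verbatim driver
      excerpt:

          /* Map a user buffer (flat or iovec array) into a passthrough request. */
          static int map_user_buffer(struct request *req, void __user *ubuf,
                                     unsigned long bufflen, bool vec, int iov_count)
          {
                  /*
                   * vec = false: ubuf/bufflen describe one flat buffer.
                   * vec = true:  ubuf points at an iovec array with iov_count entries.
                   */
                  return blk_rq_map_user_io(req, NULL, ubuf, bufflen, GFP_KERNEL,
                                            vec, iov_count, false, rq_data_dir(req));
          }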
    • io_uring: introduce fixed buffer support for io_uring_cmd · 9cda70f6
      Anuj Gupta authored
      Add an IORING_URING_CMD_FIXED flag that is to be used for sending
      io_uring commands with previously registered buffers. User space
      passes the buffer index in sqe->buf_index, the same as in the
      read/write variants that use fixed buffers. A user-space sketch
      follows below.
      Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
      Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
      Link: https://lore.kernel.org/r/20220930062749.152261-3-anuj20.g@samsung.com
      [axboe: shuffle valid flags check before acting on it]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
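
      A user-space sketch using liburing: the sqe fields come from the uapi
      header as extended by this patch, while the fd and cmd_op value are
      placeholders (e.g. an NVMe char device with NVME_URING_CMD_IO).
      Assumes the buffers were registered with io_uring_register_buffers()
      beforehand:

          #include <liburing.h>

          static void prep_fixed_uring_cmd(struct io_uring_sqe *sqe, int fd,
                                           unsigned int cmd_op, int buf_index)
          {
                  io_uring_prep_rw(IORING_OP_URING_CMD, sqe, fd, NULL, 0, 0);
                  sqe->cmd_op = cmd_op;                   /* driver command opcode */
                  sqe->uring_cmd_flags = IORING_URING_CMD_FIXED;
                  sqe->buf_index = buf_index;             /* registered buffer slot */
          }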
    • io_uring: add io_uring_cmd_import_fixed · a9216fac
      Anuj Gupta authored
      This is a new helper that callers can use to obtain a bvec iterator
      for a previously mapped buffer. This is preparatory work to enable
      fixed-buffer support for io_uring_cmd; a driver-side sketch follows
      below.
      Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
      Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
      Link: https://lore.kernel.org/r/20220930062749.152261-2-anuj20.g@samsung.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
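
      A sketch of how a ->uring_cmd() implementation might consume the
      helper, assuming the signature added by this series; the surrounding
      driver plumbing is elided:

          static int drv_import_fixed_buf(struct io_uring_cmd *ioucmd, u64 ubuf,
                                          unsigned long len, int rw,
                                          struct iov_iter *iter)
          {
                  int ret;

                  /* Resolve the registered buffer into a bvec-backed iov_iter. */
                  ret = io_uring_cmd_import_fixed(ubuf, len, rw, iter, ioucmd);
                  if (ret < 0)
                          return ret;
                  /* 'iter' now walks pre-pinned pages; map it onto the request. */
                  return 0;
          }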
    • nvme: enable batched completions of passthrough IO · 851eb780
      Jens Axboe authored
      Now that the normal passthrough end_io path doesn't need the request
      anymore, we can kill the explicit blk_mq_free_request() and just pass
      back RQ_END_IO_FREE instead. This enables the batched completion path
      to free batches of requests at a time; a sketch of the pattern follows
      below.

      This brings passthrough IO performance at least on par with bdev-based
      O_DIRECT with io_uring. With this and batched allocations, peak
      performance goes from 110M IOPS to 122M IOPS. For IRQ-based completions,
      passthrough is now also about 10% faster than before, going from ~61M
      to ~67M IOPS.
      Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Co-developed-by: Stefan Roesch <shr@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
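
      The pattern in sketch form, assuming the rq_end_io_ret contract from
      this series; struct my_cmd and finish_user_cmd() are hypothetical
      stand-ins for the driver's own completion bookkeeping:

          static enum rq_end_io_ret passthru_end_io(struct request *req,
                                                    blk_status_t err)
          {
                  struct my_cmd *cmd = req->end_io_data;

                  finish_user_cmd(cmd, err);      /* hypothetical bookkeeping */
                  /*
                   * No blk_mq_free_request() here: the block layer frees the
                   * request, which lets it batch the frees.
                   */
                  return RQ_END_IO_FREE;
          }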
    • nvme: split out metadata vs non metadata end_io uring_cmd completions · c0a7ba77
      Jens Axboe authored
      By splitting up the metadata and non-metadata end_io handling, we can
      remove any request dependencies from the normal non-metadata IO path.
      This is in preparation for enabling the normal IO passthrough path to
      pass ownership of the request back to the block layer.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Co-developed-by: Stefan Roesch <shr@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: allow end_io based requests in the completion batch handling · ab3e1d3b
      Jens Axboe authored
      With end_io handlers now being able to pass ownership of the request
      upon completion, we can allow requests with end_io handlers in the
      batch completion handling.
      Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Co-developed-by: Stefan Roesch <shr@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: change request end_io handler to pass back a return value · de671d61
      Jens Axboe authored
      In preparation for allowing the end_io handler to pass ownership back
      to the block layer, rather than retain ownership of the request, change
      the handler to return a value. Everything is simply converted to
      returning RQ_END_IO_NONE, so there should be no functional changes with
      this patch; the new contract is sketched below.
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
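
      The contract in sketch form (enum and typedef names per this series;
      the handler body and handle_completion() are hypothetical):

          enum rq_end_io_ret {
                  RQ_END_IO_NONE,         /* handler retains/frees the request */
                  RQ_END_IO_FREE,         /* block layer frees the request */
          };
          typedef enum rq_end_io_ret (rq_end_io_fn)(struct request *, blk_status_t);

          /* A mechanically converted handler just grows a return statement: */
          static enum rq_end_io_ret my_end_io(struct request *req, blk_status_t err)
          {
                  handle_completion(req, err);    /* hypothetical driver work */
                  return RQ_END_IO_NONE;          /* behavior unchanged */
          }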
    • block: enable batched allocation for blk_mq_alloc_request() · 4b6a5d9c
      Jens Axboe authored
      The filesystem IO path can take advantage of allocating batches of
      requests, if the underlying submitter tells the block layer about it
      through the blk_plug. For passthrough IO, the exported API is the
      blk_mq_alloc_request() helper, and that one does not allow for
      request caching.

      Wire up request caching for blk_mq_alloc_request(), which is generally
      done without having a bio available upfront; a sketch of a submission
      loop that benefits follows below.
      Tested-by: Anuj Gupta <anuj20.g@samsung.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
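
      A kernel-side sketch of a passthrough submission loop that benefits:
      with a plug active on the task, repeated blk_mq_alloc_request() calls
      can be served from a cached batch. The caching is internal, so call
      sites do not change; submit_one() is a hypothetical stand-in:

          struct blk_plug plug;
          int i;

          blk_start_plug(&plug);
          for (i = 0; i < nr; i++) {
                  struct request *req;

                  req = blk_mq_alloc_request(q, REQ_OP_DRV_IN, 0);
                  if (IS_ERR(req))
                          break;
                  submit_one(req);        /* hypothetical driver submission */
          }
          blk_finish_plug(&plug);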
    • block: kill deprecated BUG_ON() in the flush handling · e73a625b
      Jens Axboe authored
      We've never had any useful reports from this BUG_ON(), and in fact a
      number of the BUG_ON()s in the flush handling need to be turned into
      more graceful handling.

      In preparation for allowing batched completions of the end_io handling,
      where we can enter the flush completion with the queuelist having been
      reused for the batch, get rid of this BUG_ON().
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Merge branch 'for-6.1/io_uring' into for-6.1/passthrough · 5853a7b5
      Jens Axboe authored
      * for-6.1/io_uring: (56 commits)
        io_uring/net: fix notif cqe reordering
        io_uring/net: don't update msg_name if not provided
        io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL
        io_uring/rw: defer fsnotify calls to task context
        io_uring/net: fix fast_iov assignment in io_setup_async_msg()
        io_uring/net: fix non-zc send with address
        io_uring/net: don't skip notifs for failed requests
        io_uring/rw: don't lose short results on io_setup_async_rw()
        io_uring/rw: fix unexpected link breakage
        io_uring/net: fix cleanup double free free_iov init
        io_uring: fix CQE reordering
        io_uring/net: fix UAF in io_sendrecv_fail()
        selftest/net: adjust io_uring sendzc notif handling
        io_uring: ensure local task_work marks task as running
        io_uring/net: zerocopy sendmsg
        io_uring/net: combine fail handlers
        io_uring/net: rename io_sendzc()
        io_uring/net: support non-zerocopy sendto
        io_uring/net: refactor io_setup_async_addr
        io_uring/net: don't lose partial send_zc on fail
        ...
    • Merge branch 'for-6.1/block' into for-6.1/passthrough · 736feaa3
      Jens Axboe authored
      * for-6.1/block: (162 commits)
        sbitmap: fix lockup while swapping
        block: add rationale for not using blk_mq_plug() when applicable
        block: adapt blk_mq_plug() to not plug for writes that require a zone lock
        s390/dasd: use blk_mq_alloc_disk
        blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep
        nvmet: don't look at the request_queue in nvmet_bdev_set_limits
        nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all
        blk-mq: use quiesced elevator switch when reinitializing queues
        block: replace blk_queue_nowait with bdev_nowait
        nvme: remove nvme_ctrl_init_connect_q
        nvme-loop: use the tagset alloc/free helpers
        nvme-loop: store the generic nvme_ctrl in set->driver_data
        nvme-loop: initialize sqsize later
        nvme-fc: use the tagset alloc/free helpers
        nvme-fc: store the generic nvme_ctrl in set->driver_data
        nvme-fc: keep ctrl->sqsize in sync with opts->queue_size
        nvme-rdma: use the tagset alloc/free helpers
        nvme-rdma: store the generic nvme_ctrl in set->driver_data
        nvme-tcp: use the tagset alloc/free helpers
        nvme-tcp: store the generic nvme_ctrl in set->driver_data
        ...
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • sbitmap: fix lockup while swapping · 30514bd2
      Hugh Dickins authored
      Commit 4acb8341 ("sbitmap: fix batched wait_cnt accounting")
      is a big improvement: without it, I had to revert to before commit
      040b83fc ("sbitmap: fix possible io hung due to lost wakeup")
      to avoid the high system time and freezes which that had introduced.
      
      Now okay on the NVMe laptop, but 4acb8341 is a disaster for heavy
      swapping (kernel builds in low memory) on another: it soon locks up in
      sbitmap_queue_wake_up() (into which __sbq_wake_up() is inlined),
      cycling around with waitqueue_active() but wait_cnt 0. Here is a
      backtrace, showing the common pattern of the outer
      sbitmap_queue_wake_up() being interrupted before resetting wait_cnt
      from 0 back to wake_batch (in some cases other CPUs are idle, in other
      cases they're spinning for a lock in dd_bio_merge()):
      
      sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag <
      __blk_mq_free_request < blk_mq_free_request < __blk_mq_end_request <
      scsi_end_request < scsi_io_completion < scsi_finish_command <
      scsi_complete < blk_complete_reqs < blk_done_softirq < __do_softirq <
      __irq_exit_rcu < irq_exit_rcu < common_interrupt < asm_common_interrupt <
      _raw_spin_unlock_irqrestore < __wake_up_common_lock < __wake_up <
      sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag <
      __blk_mq_free_request < blk_mq_free_request < dd_bio_merge <
      blk_mq_sched_bio_merge < blk_mq_attempt_bio_merge < blk_mq_submit_bio <
      __submit_bio < submit_bio_noacct_nocheck < submit_bio_noacct <
      submit_bio < __swap_writepage < swap_writepage < pageout <
      shrink_folio_list < evict_folios < lru_gen_shrink_lruvec <
      shrink_lruvec < shrink_node < do_try_to_free_pages < try_to_free_pages <
      __alloc_pages_slowpath < __alloc_pages < folio_alloc < vma_alloc_folio <
      do_anonymous_page < __handle_mm_fault < handle_mm_fault <
      do_user_addr_fault < exc_page_fault < asm_exc_page_fault
      
      See how the process-context sbitmap_queue_wake_up() has been interrupted,
      after bringing wait_cnt down to 0 (and in this example, after doing its
      wakeups), before advancing wake_index and refilling wake_cnt: an
      interrupt-context sbitmap_queue_wake_up() of the same sbq gets stuck.
      
      I have almost no grasp of all the possible sbitmap races, and their
      consequences: but __sbq_wake_up() can do nothing useful while wait_cnt 0,
      so it is better if sbq_wake_ptr() skips on to the next ws in that case:
      which fixes the lockup and shows no adverse consequence for me.
      
      The check for wait_cnt being 0 is obviously racy, and ultimately can lead
      to lost wakeups: for example, when there is only a single waitqueue with
      waiters.  However, lost wakeups are unlikely to matter in these cases,
      and a proper fix requires redesign (and benchmarking) of the batched
      wakeup code: so let's plug the hole with this bandaid for now
      (sketched after this entry).
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Link: https://lore.kernel.org/r/9c2038a7-cdc5-5ee-854c-fbc6168bf16@google.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
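
      The bandaid in sketch form, approximating the waitqueue-selection loop
      in sbq_wake_ptr() with the added wait_cnt check (not a verbatim copy
      of the patch):

          static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
          {
                  int i, wake_index = atomic_read(&sbq->wake_index);

                  for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
                          struct sbq_wait_state *ws = &sbq->ws[wake_index];

                          /*
                           * Skip a ws whose wait_cnt has already been consumed:
                           * nothing useful can happen with it until another
                           * CPU refills wake_batch.
                           */
                          if (waitqueue_active(&ws->wait) &&
                              atomic_read(&ws->wait_cnt) > 0) {
                                  if (wake_index != atomic_read(&sbq->wake_index))
                                          atomic_set(&sbq->wake_index, wake_index);
                                  return ws;
                          }
                          wake_index = sbq_index_inc(wake_index);
                  }
                  return NULL;
          }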
    • io_uring/net: fix notif cqe reordering · 108893dd
      Pavel Begunkov authored
      send zc is not restricted to !IO_URING_F_UNLOCKED anymore, so we can't
      use the task-tw ordering trick to order notification cqes with request
      completions. In this case, leave it alone and let io_send_zc_cleanup()
      flush it.
      
      Cc: stable@vger.kernel.org
      Fixes: 53bdc88a ("io_uring/notif: order notif vs send CQEs")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/0031f3a00d492e814a4a0935a2029a46d9c9ba06.1664486545.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/net: don't update msg_name if not provided · 6f10ae8a
      Pavel Begunkov authored
      io_sendmsg_copy_hdr() may clear msg->msg_name if userspace didn't
      provide one; we should retain NULL in this case. A sketch of the
      pattern follows below.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/97d49f61b5ec76d0900df658cfde3aa59ff22121.1664486545.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
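
      An illustrative sketch of the pattern (not the actual io_uring diff;
      kmsg layout is assumed): when copying a msghdr from user space, only
      point msg_name at kernel-side storage if an address was actually
      supplied, since a NULL msg_name is meaningful (e.g. a connected
      socket):

          if (umsg->msg_name) {
                  ret = move_addr_to_kernel(umsg->msg_name, umsg->msg_namelen,
                                            &kmsg->addr);
                  if (ret < 0)
                          return ret;
                  kmsg->msg.msg_name = &kmsg->addr;
          } else {
                  kmsg->msg.msg_name = NULL;      /* retain NULL, don't update */
          }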
    • io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL · 46a525e1
      Jens Axboe authored
      This isn't a reliable mechanism for telling whether we have task_work
      pending; we really should be looking at whether we have any items
      queued. This is problematic if forward progress is gated on running
      said task_work. One such example is reading from a pipe where the write
      side has been closed right before the read is started. The fput() of
      the file queues TWA_RESUME task_work, and we need that task_work to be
      run before ->release() is called for the pipe. If ->release() isn't
      called, then the read will sit forever waiting on data that will never
      arrive.

      Fix this by changing io_run_task_work() so that it checks whether we
      have task_work pending rather than relying on TIF_NOTIFY_SIGNAL, which
      obviously doesn't work for task_work that is queued without TWA_SIGNAL.
      A sketch of the fixed check follows below.
      Reported-by: Christiano Haesbaert <haesbaert@haesbaert.org>
      Cc: stable@vger.kernel.org
      Link: https://github.com/axboe/liburing/issues/665
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
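
      A sketch approximating io_run_task_work() after this patch: key off
      the actual task_work list via task_work_pending() instead of
      TIF_NOTIFY_SIGNAL, which is not set for TWA_RESUME-queued work (helper
      names match the kernel's, body approximate):

          static inline bool io_run_task_work(void)
          {
                  if (task_work_pending(current)) {
                          __set_current_state(TASK_RUNNING);
                          clear_notify_signal();  /* TWA_SIGNAL work may be queued too */
                          task_work_run();
                          return true;
                  }
                  return false;
          }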
    • io_uring/rw: defer fsnotify calls to task context · b000145e
      Jens Axboe authored
      We can't call these off the kiocb completion, as that might be in
      soft/hard IRQ context. Defer the calls to when we process the task_work
      for this request (a sketch of the shape of the fix follows after the
      trace). That avoids valid complaints like:
      
      stack backtrace:
      CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.0.0-rc6-syzkaller-00321-g105a36f3 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
       print_usage_bug kernel/locking/lockdep.c:3961 [inline]
       valid_state kernel/locking/lockdep.c:3973 [inline]
       mark_lock_irq kernel/locking/lockdep.c:4176 [inline]
       mark_lock.part.0.cold+0x18/0xd8 kernel/locking/lockdep.c:4632
       mark_lock kernel/locking/lockdep.c:4596 [inline]
       mark_usage kernel/locking/lockdep.c:4527 [inline]
       __lock_acquire+0x11d9/0x56d0 kernel/locking/lockdep.c:5007
       lock_acquire kernel/locking/lockdep.c:5666 [inline]
       lock_acquire+0x1ab/0x570 kernel/locking/lockdep.c:5631
       __fs_reclaim_acquire mm/page_alloc.c:4674 [inline]
       fs_reclaim_acquire+0x115/0x160 mm/page_alloc.c:4688
       might_alloc include/linux/sched/mm.h:271 [inline]
       slab_pre_alloc_hook mm/slab.h:700 [inline]
       slab_alloc mm/slab.c:3278 [inline]
       __kmem_cache_alloc_lru mm/slab.c:3471 [inline]
       kmem_cache_alloc+0x39/0x520 mm/slab.c:3491
       fanotify_alloc_fid_event fs/notify/fanotify/fanotify.c:580 [inline]
       fanotify_alloc_event fs/notify/fanotify/fanotify.c:813 [inline]
       fanotify_handle_event+0x1130/0x3f40 fs/notify/fanotify/fanotify.c:948
       send_to_group fs/notify/fsnotify.c:360 [inline]
       fsnotify+0xafb/0x1680 fs/notify/fsnotify.c:570
       __fsnotify_parent+0x62f/0xa60 fs/notify/fsnotify.c:230
       fsnotify_parent include/linux/fsnotify.h:77 [inline]
       fsnotify_file include/linux/fsnotify.h:99 [inline]
       fsnotify_access include/linux/fsnotify.h:309 [inline]
       __io_complete_rw_common+0x485/0x720 io_uring/rw.c:195
       io_complete_rw+0x1a/0x1f0 io_uring/rw.c:228
       iomap_dio_complete_work fs/iomap/direct-io.c:144 [inline]
       iomap_dio_bio_end_io+0x438/0x5e0 fs/iomap/direct-io.c:178
       bio_endio+0x5f9/0x780 block/bio.c:1564
       req_bio_endio block/blk-mq.c:695 [inline]
       blk_update_request+0x3fc/0x1300 block/blk-mq.c:825
       scsi_end_request+0x7a/0x9a0 drivers/scsi/scsi_lib.c:541
       scsi_io_completion+0x173/0x1f70 drivers/scsi/scsi_lib.c:971
       scsi_complete+0x122/0x3b0 drivers/scsi/scsi_lib.c:1438
       blk_complete_reqs+0xad/0xe0 block/blk-mq.c:1022
       __do_softirq+0x1d3/0x9c6 kernel/softirq.c:571
       invoke_softirq kernel/softirq.c:445 [inline]
       __irq_exit_rcu+0x123/0x180 kernel/softirq.c:650
       irq_exit_rcu+0x5/0x20 kernel/softirq.c:662
       common_interrupt+0xa9/0xc0 arch/x86/kernel/irq.c:240
      
      Fixes: f63cf519 ("io_uring: ensure that fsnotify is always called")
      Link: https://lore.kernel.org/all/20220929135627.ykivmdks2w5vzrwg@quack3/
      Reported-by: syzbot+dfcc5f4da15868df7d4d@syzkaller.appspotmail.com
      Reported-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
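
      The shape of the fix as a sketch (function names approximate,
      req_is_read() a hypothetical predicate): the kiocb completion only
      queues task_work, and the fsnotify calls run in the task-context
      callback, where taking sleeping locks and allocating is allowed:

          static void io_req_rw_complete(struct io_kiocb *req, bool *locked)
          {
                  /* Safe here: task context, not soft/hard IRQ. */
                  if (req->flags & REQ_F_ISREG) {
                          if (req_is_read(req))           /* hypothetical */
                                  fsnotify_access(req->file);
                          else
                                  fsnotify_modify(req->file);
                  }
                  io_req_task_complete(req, locked);
          }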
  2. 29 Sep 2022, 5 commits
  3. 28 Sep 2022, 5 commits
  4. 27 Sep 2022, 12 commits