提交 · 294e73ffb0efac4c8bac2d9e6a896225098bd419 · openeuler / Kernel

20 10月, 2022 2 次提交

io-wq: Fix memory leak in worker creation · 996d3efe

由 Rafael Mendonca 提交于 10月 19, 2022

If the CPU mask allocation for a node fails, then the memory allocated for
the 'io_wqe' struct of the current node doesn't get freed on the error
handling path, since it has not yet been added to the 'wqes' array.

This was spotted when fuzzing v6.1-rc1 with Syzkaller:
BUG: memory leak
unreferenced object 0xffff8880093d5000 (size 1024):
  comm "syz-executor.2", pid 7701, jiffies 4295048595 (age 13.900s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000cb463369>] __kmem_cache_alloc_node+0x18e/0x720
    [<00000000147a3f9c>] kmalloc_node_trace+0x2a/0x130
    [<000000004e107011>] io_wq_create+0x7b9/0xdc0
    [<00000000c38b2018>] io_uring_alloc_task_context+0x31e/0x59d
    [<00000000867399da>] __io_uring_add_tctx_node.cold+0x19/0x1ba
    [<000000007e0e7a79>] io_uring_setup.cold+0x1b80/0x1dce
    [<00000000b545e9f6>] __x64_sys_io_uring_setup+0x5d/0x80
    [<000000008a8a7508>] do_syscall_64+0x5d/0x90
    [<000000004ac08bec>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: 0e03496d ("io-wq: use private CPU mask")
Cc: stable@vger.kernel.org
Signed-off-by: NRafael Mendonca <rafaelmendsr@gmail.com>
Link: https://lore.kernel.org/r/20221020014710.902201-1-rafaelmendsr@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

996d3efe

io_uring/msg_ring: Fix NULL pointer dereference in io_msg_send_fd() · 16bbdfe5

由 Harshit Mogalapalli 提交于 10月 19, 2022

Syzkaller produced the below call trace:

 BUG: KASAN: null-ptr-deref in io_msg_ring+0x3cb/0x9f0
 Write of size 8 at addr 0000000000000070 by task repro/16399

 CPU: 0 PID: 16399 Comm: repro Not tainted 6.1.0-rc1 #28
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7
 Call Trace:
  <TASK>
  dump_stack_lvl+0xcd/0x134
  ? io_msg_ring+0x3cb/0x9f0
  kasan_report+0xbc/0xf0
  ? io_msg_ring+0x3cb/0x9f0
  kasan_check_range+0x140/0x190
  io_msg_ring+0x3cb/0x9f0
  ? io_msg_ring_prep+0x300/0x300
  io_issue_sqe+0x698/0xca0
  io_submit_sqes+0x92f/0x1c30
  __do_sys_io_uring_enter+0xae4/0x24b0
....
 RIP: 0033:0x7f2eaf8f8289
 RSP: 002b:00007fff40939718 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f2eaf8f8289
 RDX: 0000000000000000 RSI: 0000000000006f71 RDI: 0000000000000004
 RBP: 00007fff409397a0 R08: 0000000000000000 R09: 0000000000000039
 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004006d0
 R13: 00007fff40939880 R14: 0000000000000000 R15: 0000000000000000
  </TASK>
 Kernel panic - not syncing: panic_on_warn set ...

We don't have a NULL check on file_ptr in io_msg_send_fd() function,
so when file_ptr is NUL src_file is also NULL and get_file()
dereferences a NULL pointer and leads to above crash.

Add a NULL check to fix this issue.

Fixes: e6130eba ("io_uring: add support for passing fixed file descriptors")
Reported-by: Nsyzkaller <syzkaller@googlegroups.com>
Signed-off-by: NHarshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Link: https://lore.kernel.org/r/20221019171218.1337614-1-harshit.m.mogalapalli@oracle.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

16bbdfe5

17 10月, 2022 5 次提交

io_uring/rw: remove leftover debug statement · 5c61795e

由 Jens Axboe 提交于 10月 16, 2022

This debug statement was never meant to go into the upstream release,
kill it off before it ends up in a release. It was just part of the
testing for the initial version of the patch.

Fixes: 2ec33a6c ("io_uring/rw: ensure kiocb_end_write() is always called")
Signed-off-by: NJens Axboe <axboe@kernel.dk>

5c61795e

io_uring: don't iopoll from io_ring_ctx_wait_and_kill() · 02bac94b

由 Pavel Begunkov 提交于 10月 16, 2022

We should not be completing requests from a task context that has already
undergone io_uring cancellations, i.e. __io_uring_cancel(), as there are
some assumptions, e.g. around cached task refs draining. Remove
iopolling from io_ring_ctx_wait_and_kill() as it can be called later
after PF_EXITING is set with the last task_work run.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7c03cc91455c4a1af49c6b9cbda4e57ea467aa11.1665891182.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

02bac94b

io_uring: reuse io_alloc_req() · 34f0bc42

由 Pavel Begunkov 提交于 10月 16, 2022

Don't duplicate io_alloc_req() in io_req_caches_free() but reuse the
helper.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6005fc88274864a49fc3096c22d8bdd605cf8576.1665891182.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

34f0bc42

io_uring: kill hot path fixed file bitmap debug checks · 4d505951

由 Pavel Begunkov 提交于 10月 16, 2022

We test file_table.bitmap in io_file_get_fixed() to check invariants,
don't do it, it's expensive and was showing up in profiles. No reports of
this triggering has come in. Move the check to the file clear instead,
which will still catch any wrong usage.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cf77f2ded68d2e5b2bc7355784d969837d48e023.1665891182.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

4d505951

io_uring: remove FFS_SCM · 38eddb2c

由 Pavel Begunkov 提交于 10月 16, 2022

THe lifetime of SCM'ed files is bound to ring_sock, which is destroyed
strictly after we're done with registered file tables. This means there
is no need for the FFS_SCM hack, which was not available on 32-bit builds
anyway.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/984226a1045adf42dc35d8bd7fb5a8bbfa472ce1.1665891182.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

38eddb2c

13 10月, 2022 10 次提交

io_uring/rw: ensure kiocb_end_write() is always called · 2ec33a6c

由 Jens Axboe 提交于 10月 11, 2022

A previous commit moved the notifications and end-write handling, but
it is now missing a few spots where we also want to call both of those.
Without that, we can potentially be missing file notifications, and
more importantly, have an imbalance in the super_block writers sem
accounting.

Fixes: b000145e ("io_uring/rw: defer fsnotify calls to task context")
Reported-by: NDave Chinner <david@fromorbit.com>
Link: https://lore.kernel.org/all/20221010050319.GC2703033@dread.disaster.area/Signed-off-by: NJens Axboe <axboe@kernel.dk>

2ec33a6c

io_uring: fix fdinfo sqe offsets calculation · 00927931

由 Pavel Begunkov 提交于 10月 11, 2022

Only with the big sqe feature they take 128 bytes per entry, but we
unconditionally advance by 128B. Fix it by using sq_shift.

Fixes: 3b8fdd1d ("io_uring/fdinfo: fix sqe dumping for IORING_SETUP_SQE128")
Reported-and-tested-by: syzbot+e5198737e8a2d23d958c@syzkaller.appspotmail.com
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8b41287cb75d5efb8fcb5cccde845ddbbadd8372.1665449983.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

00927931

io_uring: local variable rw shadows outer variable in io_write · c86416c6

由 Stefan Roesch 提交于 10月 10, 2022

This fixes the shadowing of the outer variable rw in the function
io_write(). No issue is caused by this, but let's silence the shadowing
warning anyway.
Reported-by: Nkernel test robot <lkp@intel.com>
Signed-off-by: NStefan Roesch <shr@devkernel.io>
Link: https://lore.kernel.org/r/20221010234330.244244-1-shr@devkernel.ioSigned-off-by: NJens Axboe <axboe@kernel.dk>

c86416c6

io_uring/opdef: remove 'audit_skip' from SENDMSG_ZC · 11528491

由 Jens Axboe 提交于 10月 07, 2022

The msg variants of sending aren't audited separately, so we should not
be setting audit_skip for the zerocopy sendmsg variant either.

Fixes: 493108d9 ("io_uring/net: zerocopy sendmsg")
Reported-by: NPaul Moore <paul@paul-moore.com>
Reviewed-by: NPaul Moore <paul@paul-moore.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

11528491

io_uring: optimise locking for local tw with submit_wait · 44f87745

由 Pavel Begunkov 提交于 10月 06, 2022

Running local task_work requires taking uring_lock, for submit + wait we
can try to run them right after submit while we still hold the lock and
save one lock/unlokc pair. The optimisation was implemented in the first
local tw patches but got dropped for simplicity.
Suggested-by: NDylan Yudaken <dylany@fb.com>
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/281fc79d98b5d91fe4778c5137a17a2ab4693e5c.1665088876.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

44f87745

io_uring: remove redundant memory barrier in io_req_local_work_add · fc86f9d3

由 Pavel Begunkov 提交于 10月 06, 2022

io_cqring_wake() needs a barrier for the waitqueue_active() check.
However, in the case of io_req_local_work_add(), we call llist_add()
first, which implies an atomic. Hence we can replace smb_mb() with
smp_mb__after_atomic().
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/43983bc8bc507172adda7a0f00cab1aff09fd238.1665018309.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

fc86f9d3

io_uring/net: handle -EINPROGRESS correct for IORING_OP_CONNECT · 3fb1bd68

由 Jens Axboe 提交于 10月 04, 2022

We treat EINPROGRESS like EAGAIN, but if we're retrying post getting
EINPROGRESS, then we just need to check the socket for errors and
terminate the request.

This was exposed on a bluetooth connection request which ends up
taking a while and hitting EINPROGRESS, and yields a CQE result of
-EBADFD because we're retrying a connect on a socket that is now
connected.

Cc: stable@vger.kernel.org
Fixes: 87f80d62 ("io_uring: handle connect -EINPROGRESS like -EAGAIN")
Link: https://github.com/axboe/liburing/issues/671Reported-by: NAidan Sun <aidansun05@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

3fb1bd68

io_uring: remove notif leftovers · b7a81775

由 Pavel Begunkov 提交于 10月 04, 2022

Notifications were killed but there is a couple of fields and struct
declarations left, remove them.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8df8877d677be5a2b43afd936d600e60105ea960.1664849941.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

b7a81775

io_uring: correct pinned_vm accounting · 42b6419d

由 Pavel Begunkov 提交于 10月 04, 2022

->mm_account should be released only after we free all registered
buffers, otherwise __io_sqe_buffers_unregister() will see a NULL
->mm_account and skip locked_vm accounting.

Cc: <Stable@vger.kernel.org>
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6d798f65ed4ab8db3664c4d3397d4af16ca98846.1664849932.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

42b6419d

io_uring/af_unix: defer registered files gc to io_uring release · 0091bfc8

由 Pavel Begunkov 提交于 10月 03, 2022

Instead of putting io_uring's registered files in unix_gc() we want it
to be done by io_uring itself. The trick here is to consider io_uring
registered files for cycle detection but not actually putting them down.
Because io_uring can't register other ring instances, this will remove
all refs to the ring file triggering the ->release path and clean up
with io_ring_ctx_free().

Cc: stable@vger.kernel.org
Fixes: 6b06314c ("io_uring: add file set registration")
Reported-and-tested-by: NDavid Bouman <dbouman03@gmail.com>
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NThadeu Lima de Souza Cascardo <cascardo@canonical.com>
[axboe: add kerneldoc comment to skb, fold in skb leak fix]
Signed-off-by: NJens Axboe <axboe@kernel.dk>

0091bfc8

08 10月, 2022 3 次提交

io_uring: limit registration w/ SINGLE_ISSUER · d7cce96c

由 Pavel Begunkov 提交于 9月 27, 2022

IORING_SETUP_SINGLE_ISSUER restricts what tasks can submit requests.
Extend it to registration as well, so non-owning task can't do
registrations. It's not necessary at the moment but might be useful in
the future.

Cc: <stable@vger.kernel.org> # 6.0
Fixes: 97bbdc06 ("io_uring: add IORING_SETUP_SINGLE_ISSUER")
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f52a6a9c8a8990d4a831f73c0571e7406aac2bba.1664237592.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

d7cce96c

io_uring: remove io_register_submitter · 4add705e

由 Dylan Yudaken 提交于 9月 26, 2022

this is no longer needed, as submitter_task is set at creation time.
Signed-off-by: NDylan Yudaken <dylany@fb.com>
Fixes: 97bbdc06 ("io_uring: add IORING_SETUP_SINGLE_ISSUER")
Signed-off-by: NJens Axboe <axboe@kernel.dk>

4add705e

io_uring: simplify __io_uring_add_tctx_node · 97c96e9f

由 Dylan Yudaken 提交于 9月 26, 2022

Remove submitter parameter from __io_uring_add_tctx_node.

It was only called from one place, and we can do that logic in that one
place.
Signed-off-by: NDylan Yudaken <dylany@fb.com>
Fixes: 97bbdc06 ("io_uring: add IORING_SETUP_SINGLE_ISSUER")
Signed-off-by: NJens Axboe <axboe@kernel.dk>

97c96e9f

30 9月, 2022 6 次提交

io_uring: introduce fixed buffer support for io_uring_cmd · 9cda70f6

由 Anuj Gupta 提交于 9月 30, 2022

Add IORING_URING_CMD_FIXED flag that is to be used for sending io_uring
command with previously registered buffers. User-space passes the buffer
index in sqe->buf_index, same as done in read/write variants that uses
fixed buffers.
Signed-off-by: NAnuj Gupta <anuj20.g@samsung.com>
Signed-off-by: NKanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220930062749.152261-3-anuj20.g@samsung.com
[axboe: shuffle valid flags check before acting on it]
Signed-off-by: NJens Axboe <axboe@kernel.dk>

9cda70f6

io_uring: add io_uring_cmd_import_fixed · a9216fac

由 Anuj Gupta 提交于 9月 30, 2022

This is a new helper that callers can use to obtain a bvec iterator for
the previously mapped buffer. This is preparatory work to enable
fixed-buffer support for io_uring_cmd.
Signed-off-by: NAnuj Gupta <anuj20.g@samsung.com>
Signed-off-by: NKanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220930062749.152261-2-anuj20.g@samsung.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

a9216fac

io_uring/net: fix notif cqe reordering · 108893dd

由 Pavel Begunkov 提交于 9月 29, 2022

send zc is not restricted to !IO_URING_F_UNLOCKED anymore and so
we can't use task-tw ordering trick to order notification cqes
with requests completions. In this case leave it alone and let
io_send_zc_cleanup() flush it.

Cc: stable@vger.kernel.org
Fixes: 53bdc88a ("io_uring/notif: order notif vs send CQEs")
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0031f3a00d492e814a4a0935a2029a46d9c9ba06.1664486545.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

108893dd

io_uring/net: don't update msg_name if not provided · 6f10ae8a

由 Pavel Begunkov 提交于 9月 29, 2022

io_sendmsg_copy_hdr() may clear msg->msg_name if the userspace didn't
provide it, we should retain NULL in this case.

Cc: stable@vger.kernel.org
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/97d49f61b5ec76d0900df658cfde3aa59ff22121.1664486545.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

6f10ae8a

io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL · 46a525e1

由 Jens Axboe 提交于 9月 29, 2022

This isn't a reliable mechanism to tell if we have task_work pending, we
really should be looking at whether we have any items queued. This is
problematic if forward progress is gated on running said task_work. One
such example is reading from a pipe, where the write side has been closed
right before the read is started. The fput() of the file queues TWA_RESUME
task_work, and we need that task_work to be run before ->release() is
called for the pipe. If ->release() isn't called, then the read will sit
forever waiting on data that will never arise.

Fix this by io_run_task_work() so it checks if we have task_work pending
rather than rely on TIF_NOTIFY_SIGNAL for that. The latter obviously
doesn't work for task_work that is queued without TWA_SIGNAL.
Reported-by: NChristiano Haesbaert <haesbaert@haesbaert.org>
Cc: stable@vger.kernel.org
Link: https://github.com/axboe/liburing/issues/665Signed-off-by: NJens Axboe <axboe@kernel.dk>

46a525e1

io_uring/rw: defer fsnotify calls to task context · b000145e

由 Jens Axboe 提交于 9月 29, 2022

We can't call these off the kiocb completion as that might be off
soft/hard irq context. Defer the calls to when we process the
task_work for this request. That avoids valid complaints like:

stack backtrace:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.0.0-rc6-syzkaller-00321-g105a36f3 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
 print_usage_bug kernel/locking/lockdep.c:3961 [inline]
 valid_state kernel/locking/lockdep.c:3973 [inline]
 mark_lock_irq kernel/locking/lockdep.c:4176 [inline]
 mark_lock.part.0.cold+0x18/0xd8 kernel/locking/lockdep.c:4632
 mark_lock kernel/locking/lockdep.c:4596 [inline]
 mark_usage kernel/locking/lockdep.c:4527 [inline]
 __lock_acquire+0x11d9/0x56d0 kernel/locking/lockdep.c:5007
 lock_acquire kernel/locking/lockdep.c:5666 [inline]
 lock_acquire+0x1ab/0x570 kernel/locking/lockdep.c:5631
 __fs_reclaim_acquire mm/page_alloc.c:4674 [inline]
 fs_reclaim_acquire+0x115/0x160 mm/page_alloc.c:4688
 might_alloc include/linux/sched/mm.h:271 [inline]
 slab_pre_alloc_hook mm/slab.h:700 [inline]
 slab_alloc mm/slab.c:3278 [inline]
 __kmem_cache_alloc_lru mm/slab.c:3471 [inline]
 kmem_cache_alloc+0x39/0x520 mm/slab.c:3491
 fanotify_alloc_fid_event fs/notify/fanotify/fanotify.c:580 [inline]
 fanotify_alloc_event fs/notify/fanotify/fanotify.c:813 [inline]
 fanotify_handle_event+0x1130/0x3f40 fs/notify/fanotify/fanotify.c:948
 send_to_group fs/notify/fsnotify.c:360 [inline]
 fsnotify+0xafb/0x1680 fs/notify/fsnotify.c:570
 __fsnotify_parent+0x62f/0xa60 fs/notify/fsnotify.c:230
 fsnotify_parent include/linux/fsnotify.h:77 [inline]
 fsnotify_file include/linux/fsnotify.h:99 [inline]
 fsnotify_access include/linux/fsnotify.h:309 [inline]
 __io_complete_rw_common+0x485/0x720 io_uring/rw.c:195
 io_complete_rw+0x1a/0x1f0 io_uring/rw.c:228
 iomap_dio_complete_work fs/iomap/direct-io.c:144 [inline]
 iomap_dio_bio_end_io+0x438/0x5e0 fs/iomap/direct-io.c:178
 bio_endio+0x5f9/0x780 block/bio.c:1564
 req_bio_endio block/blk-mq.c:695 [inline]
 blk_update_request+0x3fc/0x1300 block/blk-mq.c:825
 scsi_end_request+0x7a/0x9a0 drivers/scsi/scsi_lib.c:541
 scsi_io_completion+0x173/0x1f70 drivers/scsi/scsi_lib.c:971
 scsi_complete+0x122/0x3b0 drivers/scsi/scsi_lib.c:1438
 blk_complete_reqs+0xad/0xe0 block/blk-mq.c:1022
 __do_softirq+0x1d3/0x9c6 kernel/softirq.c:571
 invoke_softirq kernel/softirq.c:445 [inline]
 __irq_exit_rcu+0x123/0x180 kernel/softirq.c:650
 irq_exit_rcu+0x5/0x20 kernel/softirq.c:662
 common_interrupt+0xa9/0xc0 arch/x86/kernel/irq.c:240

Fixes: f63cf519 ("io_uring: ensure that fsnotify is always called")
Link: https://lore.kernel.org/all/20220929135627.ykivmdks2w5vzrwg@quack3/
Reported-by: syzbot+dfcc5f4da15868df7d4d@syzkaller.appspotmail.com
Reported-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b000145e

29 9月, 2022 3 次提交

io_uring/net: fix fast_iov assignment in io_setup_async_msg() · 3e4cb6eb

由 Stefan Metzmacher 提交于 9月 29, 2022

I hit a very bad problem during my tests of SENDMSG_ZC.
BUG(); in first_iovec_segment() triggered very easily.
The problem was io_setup_async_msg() in the partial retry case,
which seems to happen more often with _ZC.

iov_iter_iovec_advance() may change i->iov in order to have i->iov_offset
being only relative to the first element.

Which means kmsg->msg.msg_iter.iov is no longer the
same as kmsg->fast_iov.

But this would rewind the copy to be the start of
async_msg->fast_iov, which means the internal
state of sync_msg->msg.msg_iter is inconsitent.

I tested with 5 vectors with length like this 4, 0, 64, 20, 8388608
and got a short writes with:
- ret=2675244 min_ret=8388692 => remaining 5713448 sr->done_io=2675244
- ret=-EAGAIN => io_uring_poll_arm
- ret=4911225 min_ret=5713448 => remaining 802223  sr->done_io=7586469
- ret=-EAGAIN => io_uring_poll_arm
- ret=802223  min_ret=802223  => res=8388692

While this was easily triggered with SENDMSG_ZC (queued for 6.1),
it was a potential problem starting with 7ba89d2a
in 5.18 for IORING_OP_RECVMSG.
And also with 4c3c0943 in 5.19
for IORING_OP_SENDMSG.

However 257e84a5 introduced the critical
code into io_setup_async_msg() in 5.11.

Fixes: 7ba89d2a ("io_uring: ensure recv and recvmsg handle MSG_WAITALL correctly")
Fixes: 257e84a5 ("io_uring: refactor sendmsg/recvmsg iov managing")
Cc: stable@vger.kernel.org
Signed-off-by: NStefan Metzmacher <metze@samba.org>
Reviewed-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b2e7be246e2fb173520862b0c7098e55767567a2.1664436949.git.metze@samba.orgSigned-off-by: NJens Axboe <axboe@kernel.dk>

3e4cb6eb

io_uring/net: fix non-zc send with address · 04360d3e

由 Pavel Begunkov 提交于 9月 29, 2022

We're currently ignoring the dest address with non-zerocopy send because
even though we copy it from the userspace shortly after ->msg_name gets
zeroed. Move msghdr init earlier.

Fixes: 516e82f0 ("io_uring/net: support non-zerocopy sendto")
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/176ced5e8568aa5d300ca899b7f05b303ebc49fd.1664409532.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

04360d3e

io_uring/poll: disable level triggered poll · d59bd748

由 Jens Axboe 提交于 9月 28, 2022

Stefan reports that there are issues with the level triggered
notification. Since we're late in the cycle, and it was introduced for
the 6.0 release, just disable it at prep time and we can bring this
back when Samba is happy with it.
Reported-by: NStefan Metzmacher <metze@samba.org>
Reviewed-by: NStefan Metzmacher <metze@samba.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

d59bd748

28 9月, 2022 1 次提交

io_uring/net: don't skip notifs for failed requests · 6ae91ac9

由 Pavel Begunkov 提交于 9月 28, 2022

We currently only add a notification CQE when the send succeded, i.e.
cqe.res >= 0. However, it'd be more robust to do buffer notifications
for failed requests as well in case drivers decide do something fanky.

Always return a buffer notification after initial prep, don't hide it.
This behaviour is better aligned with documentation and the patch also
helps the userspace to respect it.

Cc: stable@vger.kernel.org # 6.0
Suggested-by: NStefan Metzmacher <metze@samba.org>
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9c8bead87b2b980fcec441b8faef52188b4a6588.1664292100.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

6ae91ac9

27 9月, 2022 4 次提交

block: replace blk_queue_nowait with bdev_nowait · 568ec936

由 Christoph Hellwig 提交于 9月 27, 2022

Replace blk_queue_nowait with a bdev_nowait helpers that takes the
block_device given that the I/O submission path should not have to
look into the request_queue.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NPankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20220927075815.269694-1-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

568ec936

io_uring/rw: don't lose short results on io_setup_async_rw() · c278d9f8

由 Pavel Begunkov 提交于 9月 27, 2022

If a retry io_setup_async_rw() fails we lose result from the first
io_iter_do_read(), which is a problem mostly for streams/sockets.

Cc: stable@vger.kernel.org
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0e8d20cebe5fc9c96ed268463c394237daabc384.1664235732.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

c278d9f8

io_uring/rw: fix unexpected link breakage · bf68b5b3

由 Pavel Begunkov 提交于 9月 27, 2022

req->cqe.res is set in io_read() to the amount of bytes left to be done,
which is used to figure out whether to fail a read or not. However,
io_read() may do another without returning, and we stash the previous
value into ->bytes_done but forget to update cqe.res. Then we ask a read
to do strictly less than cqe.res but expect the return to be exactly
cqe.res.

Fix the bug by updating cqe.res for retries.

Cc: stable@vger.kernel.org
Reported-and-Tested-by: NBeld Zhang <beldzhang@gmail.com>
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3a1088440c7be98e5800267af922a67da0ef9f13.1664235732.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

bf68b5b3

io_uring: register single issuer task at creation · 7cae596b

由 Dylan Yudaken 提交于 9月 26, 2022

Instead of picking the task from the first submitter task, rather use the
creator task or in the case of disabled (IORING_SETUP_R_DISABLED) the
enabling task.

This approach allows a lot of simplification of the logic here. This
removes init logic from the submission path, which can always be a bit
confusing, but also removes the need for locking to write (or read) the
submitter_task.

Users that want to move a ring before submitting can create the ring
disabled and then enable it on the submitting task.
Signed-off-by: NDylan Yudaken <dylany@fb.com>
Fixes: 97bbdc06 ("io_uring: add IORING_SETUP_SINGLE_ISSUER")
Signed-off-by: NJens Axboe <axboe@kernel.dk>

7cae596b

26 9月, 2022 1 次提交

io_uring/net: fix cleanup double free free_iov init · 4c17a496

由 Pavel Begunkov 提交于 9月 26, 2022

Having ->async_data doesn't mean it's initialised and previously we vere
relying on setting F_CLEANUP at the right moment. With zc sendmsg
though, we set F_CLEANUP early in prep when we alloc a notif and so we
may allocate async_data, fail in copy_msg_hdr() leaving
struct io_async_msghdr not initialised correctly but with F_CLEANUP
set, which causes a ->free_iov double free and probably other nastiness.

Always initialise ->free_iov. Also, now it might point to fast_iov when
fails, so avoid freeing it during cleanups.

Reported-by: syzbot+edfd15cd4246a3fc615a@syzkaller.appspotmail.com
Fixes: 493108d9 ("io_uring/net: zerocopy sendmsg")
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

4c17a496

24 9月, 2022 3 次提交

io_uring: ensure that cached task references are always put on exit · e775f93f

由 Jens Axboe 提交于 9月 23, 2022

io_uring caches task references to avoid doing atomics for each of them
per request. If a request is put from the same task that allocated it,
then we can maintain a per-ctx cache of them. This obviously relies
on io_uring always pruning caches in a reliable way, and there's
currently a case off io_uring fd release where we can miss that.

One example is a ring setup with IOPOLL, which relies on the task
polling for completions, which will free them. However, if such a task
submits a request and then exits or closes the ring without reaping
the completion, then ring release will reap and put. If release happens
from that very same task, the completed request task refs will get
put back into the cache pool. This is problematic, as we're now beyond
the point of pruning caches.

Manually drop these caches after doing an IOPOLL reap. This releases
references from the current task, which is enough. If another task
happens to be doing the release, then the caching will not be
triggered and there's no issue.

Cc: stable@vger.kernel.org
Fixes: e98e49b2 ("io_uring: extend task put optimisations")
Reported-by: NHomin Rhee <hominlab@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e775f93f

io_uring: fix CQE reordering · aa1df3a3

由 Pavel Begunkov 提交于 9月 23, 2022

Overflowing CQEs may result in reordering, which is buggy in case of
links, F_MORE and so on. If we guarantee that we don't reorder for
the unlikely event of a CQ ring overflow, then we can further extend
this to not have to terminate multishot requests if it happens. For
other operations, like zerocopy sends, we have no choice but to honor
CQE ordering.
Reported-by: NDylan Yudaken <dylany@fb.com>
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ec3bc55687b0768bbe20fb62d7d06cfced7d7e70.1663892031.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

aa1df3a3

io_uring/net: fix UAF in io_sendrecv_fail() · a75155fa

由 Pavel Begunkov 提交于 9月 23, 2022

We should not assume anything about ->free_iov just from
REQ_F_ASYNC_DATA but rather rely on REQ_F_NEED_CLEANUP, as we may
allocate ->async_data but failed init would leave the field in not
consistent state. The easiest solution is to remove removing
REQ_F_NEED_CLEANUP and so ->async_data dealloc from io_sendrecv_fail()
and let io_send_zc_cleanup() do the job. The catch here is that we also
need to prevent double notif flushing, just test it for NULL and zero
where it's needed.

BUG: KASAN: use-after-free in io_sendrecv_fail+0x3b0/0x3e0 io_uring/net.c:1221
Write of size 8 at addr ffff8880771b4080 by task syz-executor.3/30199

CPU: 1 PID: 30199 Comm: syz-executor.3 Not tainted 6.0.0-rc6-next-20220923-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
 print_address_description mm/kasan/report.c:284 [inline]
 print_report+0x15e/0x45d mm/kasan/report.c:395
 kasan_report+0xbb/0x1f0 mm/kasan/report.c:495
 io_sendrecv_fail+0x3b0/0x3e0 io_uring/net.c:1221
 io_req_complete_failed+0x155/0x1b0 io_uring/io_uring.c:873
 io_drain_req io_uring/io_uring.c:1648 [inline]
 io_queue_sqe_fallback.cold+0x29f/0x788 io_uring/io_uring.c:1931
 io_submit_sqe io_uring/io_uring.c:2160 [inline]
 io_submit_sqes+0x1180/0x1df0 io_uring/io_uring.c:2276
 __do_sys_io_uring_enter+0xac6/0x2410 io_uring/io_uring.c:3216
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: c4c0009e ("io_uring/net: combine fail handlers")
Reported-by: syzbot+4c597a574a3f5a251bda@syzkaller.appspotmail.com
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/23ab8346e407ea50b1198a172c8a97e1cf22915b.1663945875.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

a75155fa

22 9月, 2022 2 次提交

io_uring: ensure local task_work marks task as running · ec7fd256

由 Jens Axboe 提交于 9月 21, 2022

io_uring will run task_work from contexts that have been prepared for
waiting, and in doing so it'll implicitly set the task running again
to avoid issues with blocking conditions. The new deferred local
task_work doesn't do that, which can result in spews on this being
an invalid condition:

  [  112.917576] do not call blocking ops when !TASK_RUNNING; state=1 set at [<00000000ad64af64>] prepare_to_wait_exclusive+0x3f/0xd0
[  112.983088] WARNING: CPU: 1 PID: 190 at kernel/sched/core.c:9819 __might_sleep+0x5a/0x60
[  112.987240] Modules linked in:
[  112.990504] CPU: 1 PID: 190 Comm: io_uring Not tainted 6.0.0-rc6+ #1617
[  113.053136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
[  113.133650] RIP: 0010:__might_sleep+0x5a/0x60
[  113.136507] Code: ee 48 89 df 5b 31 d2 5d e9 33 ff ff ff 48 8b 90 30 0b 00 00 48 c7 c7 90 de 45 82 c6 05 20 8b 79 01 01 48 89 d1 e8 3a 49 77 00 <0f> 0b eb d1 66 90 0f 1f 44 00 00 9c 58 f6 c4 02 74 35 65 8b 05 ed
[  113.223940] RSP: 0018:ffffc90000537ca0 EFLAGS: 00010286
[  113.232903] RAX: 0000000000000000 RBX: ffffffff8246782c RCX: ffffffff8270bcc8
IOPS=133.15K, BW=520MiB/s, IOS/call=32/31
[  113.353457] RDX: ffffc90000537b50 RSI: 00000000ffffdfff RDI: 0000000000000001
[  113.358970] RBP: 00000000000003bc R08: 0000000000000000 R09: c0000000ffffdfff
[  113.361746] R10: 0000000000000001 R11: ffffc90000537b48 R12: ffff888103f97280
[  113.424038] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
[  113.428009] FS:  00007f67ae7fc700(0000) GS:ffff88842fc80000(0000) knlGS:0000000000000000
[  113.432794] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  113.503186] CR2: 00007f67b8b9b3b0 CR3: 0000000102b9b005 CR4: 0000000000770ee0
[  113.507291] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  113.512669] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  113.574374] PKRU: 55555554
[  113.576800] Call Trace:
[  113.578325]  <TASK>
[  113.579799]  set_page_dirty_lock+0x1b/0x90
[  113.582411]  __bio_release_pages+0x141/0x160
[  113.673078]  ? set_next_entity+0xd7/0x190
[  113.675632]  blk_rq_unmap_user+0xaa/0x210
[  113.678398]  ? timerqueue_del+0x2a/0x40
[  113.679578]  nvme_uring_task_cb+0x94/0xb0
[  113.683025]  __io_run_local_work+0x8a/0x150
[  113.743724]  ? io_cqring_wait+0x33d/0x500
[  113.746091]  io_run_local_work.part.76+0x2e/0x60
[  113.750091]  io_cqring_wait+0x2e7/0x500
[  113.752395]  ? trace_event_raw_event_io_uring_req_failed+0x180/0x180
[  113.823533]  __x64_sys_io_uring_enter+0x131/0x3c0
[  113.827382]  ? switch_fpu_return+0x49/0xc0
[  113.830753]  do_syscall_64+0x34/0x80
[  113.832620]  entry_SYSCALL_64_after_hwframe+0x5e/0xc8

Ensure that we mark current as TASK_RUNNING for deferred task_work
as well.

Fixes: c0e0d6ba ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Reported-by: NStefan Roesch <shr@fb.com>
Reviewed-by: NDylan Yudaken <dylany@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

ec7fd256

io_uring/net: zerocopy sendmsg · 493108d9

由 Pavel Begunkov 提交于 9月 21, 2022

Add a zerocopy version of sendmsg.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6aabc4bdfc0ec78df6ec9328137e394af9d4e7ef.1663668091.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

493108d9

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功