Commits · 534ca6d684f1feaf2edd90e641129725cba7e86d · openeuler / Kernel

01 Oct, 2020 15 commits

io_uring: split SQPOLL data into separate structure · 534ca6d6

Authored by Jens Axboe on Sep 02, 2020

Move all the necessary state out of io_ring_ctx, and into a new
structure, io_sq_data. The latter now deals with any state or
variables associated with the SQPOLL thread itself.

This is in preparation for supporting more than one io_ring_ctx per
SQPOLL thread.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

534ca6d6

io_uring: split work handling part of SQPOLL into helper · c8d1ba58

Authored by Jens Axboe on Sep 14, 2020

This is done in preparation for handling more than one ctx, but it also
cleans up the code a bit, since io_sq_thread() was a bit too unwieldy to
get a good overview of.

__io_sq_thread() is now the main handler, and it returns an enum sq_ret
that tells io_sq_thread() what it ended up doing. The parent then makes
a decision on idle, spinning, or work handling based on that.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c8d1ba58
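
A minimal userspace sketch of the helper/parent split described in this commit: a per-pass helper reports what it did, and the parent loop decides whether to keep going, spin, or go idle. Only `enum sq_ret` and the idle/spin/work distinction come from the commit message. The enumerator names and values, every `toy_*` name, and the idle-budget logic are assumptions made for illustration and may not match the kernel code.

```c
/* Assumed enumerators; the commit message only names the enum itself. */
enum sq_ret {
    SQT_IDLE = 1,     /* nothing to do, the caller may go to sleep */
    SQT_SPIN = 2,     /* nothing yet, but keep busy-polling */
    SQT_DID_WORK = 4, /* at least one request was submitted */
};

/* Hypothetical stand-in for the per-ring state the poller looks at. */
struct toy_sq_ctx {
    unsigned pending;    /* SQEs waiting to be submitted */
    unsigned spins_left; /* remaining busy-poll budget */
};

/* Helper: perform one pass over the ring and report what happened. */
enum sq_ret toy_sq_thread_once(struct toy_sq_ctx *ctx)
{
    if (ctx->pending) {
        ctx->pending = 0; /* "submit" everything that is queued */
        return SQT_DID_WORK;
    }
    if (ctx->spins_left) {
        ctx->spins_left--;
        return SQT_SPIN;
    }
    return SQT_IDLE;
}

/* Parent: loops on the helper and returns the number of passes it took
 * before the ring went idle. */
unsigned toy_sq_thread(struct toy_sq_ctx *ctx, unsigned idle_budget)
{
    unsigned passes = 0;

    for (;;) {
        enum sq_ret ret = toy_sq_thread_once(ctx);

        passes++;
        if (ret == SQT_DID_WORK) {
            ctx->spins_left = idle_budget; /* work resets the idle budget */
            continue;
        }
        if (ret == SQT_SPIN)
            continue;
        return passes; /* SQT_IDLE: this is where the thread would sleep */
    }
}
```

Keeping the decision in the parent means the helper can later be called once per ring without each call deciding on its own whether the whole thread should sleep.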

io_uring: move SQPOLL post-wakeup ring need wakeup flag into wake handler · 3f0e64d0

Authored by Jens Axboe on Sep 02, 2020

We need to decouple the clearing on wakeup from the inline schedule,
as that is going to be required for handling multiple rings in one
thread.

Wrap our wakeup handler so we can clear the flag when we get the wakeup;
by definition, that is when we no longer need the flag set.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3f0e64d0
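
The idea of clearing the "need wakeup" flag inside the wake handler, rather than after the sleeping thread returns from schedule, can be sketched in plain C as below. This is a userspace model, not kernel code: `toy_sq_ring`, `toy_default_wake()`, `toy_sq_wake()` and `toy_sq_prepare_sleep()` are hypothetical names, and `TOY_SQ_NEED_WAKEUP` merely stands in for the ring's need-wakeup flag.

```c
#include <stdbool.h>

#define TOY_SQ_NEED_WAKEUP (1u << 0) /* stand-in for the ring's wakeup flag */

struct toy_sq_ring {
    unsigned sq_flags;   /* flags word that userspace inspects */
    bool thread_running; /* whether the poller is currently runnable */
};

/* Plain wake function: only makes the poller runnable again. */
int toy_default_wake(struct toy_sq_ring *ring)
{
    ring->thread_running = true;
    return 1;
}

/* Wrapper installed as the wake callback: the flag is cleared at the
 * moment the wakeup arrives, then the normal wake function runs. */
int toy_sq_wake(struct toy_sq_ring *ring)
{
    ring->sq_flags &= ~TOY_SQ_NEED_WAKEUP;
    return toy_default_wake(ring);
}

/* Poller about to sleep: advertise that it must be woken explicitly. */
void toy_sq_prepare_sleep(struct toy_sq_ring *ring)
{
    ring->sq_flags |= TOY_SQ_NEED_WAKEUP;
    ring->thread_running = false;
}
```

With several rings served by one thread, each ring's flag is cleared by its own wakeup, independent of which ring the thread happens to look at first after waking.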

io_uring: use private ctx wait queue entries for SQPOLL · 6a779382

Authored by Jens Axboe on Sep 02, 2020

This is in preparation for sharing the poller thread between rings. For
that we need per-ring wait_queue_entry storage, and we can't easily put
that on the stack if one thread is managing multiple rings.

We'll also be sharing the wait_queue_head across rings for the purposes
of wakeups. Provide the usual private ring wait_queue_head for now, but
make it a pointer so we can easily override it when sharing.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6a779382

io_uring: io_sq_thread() doesn't need to flush signals · e35afcf9

Authored by Jens Axboe on Sep 02, 2020

We're not handling signals by default in kernel threads, and we never
use TWA_SIGNAL for the SQPOLL thread internally. Hence we can never
have a signal pending, and we don't need to check for it (nor flush it).
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e35afcf9

io_uring: allow disabling rings during the creation · 7e84e1c7

Authored by Stefano Garzarella on Aug 27, 2020

This patch adds a new IORING_SETUP_R_DISABLED flag to start the
rings disabled, allowing the user to register restrictions,
buffers, and files before starting to process SQEs.

When IORING_SETUP_R_DISABLED is set, SQEs are not processed and
the SQPOLL kthread is not started.

Registering restrictions is allowed only while the rings are
disabled, to prevent concurrency issues while processing SQEs.

The rings can be enabled using the IORING_REGISTER_ENABLE_RINGS
opcode with io_uring_register(2).
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7e84e1c7
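
The lifecycle this commit describes (create disabled, register things, enable, then submit) can be modelled with a small state machine. The sketch below is a userspace model only: it does not call io_uring, all `toy_*` names are hypothetical, and the use of -EBADFD for "wrong ring state" is an assumption about the kernel's choice of error code.

```c
#include <errno.h>
#include <stdbool.h>

struct toy_ring {
    bool disabled;      /* models a ring created with IORING_SETUP_R_DISABLED */
    unsigned processed; /* number of SQEs handled so far */
};

void toy_ring_init(struct toy_ring *ring, bool start_disabled)
{
    ring->disabled = start_disabled;
    ring->processed = 0;
}

/* Models IORING_REGISTER_ENABLE_RINGS: only meaningful on a disabled ring. */
int toy_ring_enable(struct toy_ring *ring)
{
    if (!ring->disabled)
        return -EBADFD; /* assumed error code */
    ring->disabled = false;
    return 0;
}

/* Models submission: no SQE is processed while the ring is disabled. */
int toy_ring_submit(struct toy_ring *ring, unsigned nr)
{
    if (ring->disabled)
        return -EBADFD; /* assumed error code */
    ring->processed += nr;
    return (int)nr;
}
```

The window between creation and enabling is the only time setup work is guaranteed not to race with request processing, which is why restrictions are tied to it.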

io_uring: add IORING_REGISTER_RESTRICTIONS opcode · 21b55dbc

Authored by Stefano Garzarella on Aug 27, 2020

The new io_uring_register(2) IORING_REGISTER_RESTRICTIONS opcode
permanently installs a feature allowlist on an io_ring_ctx.
The io_ring_ctx can then be passed to untrusted code with the
knowledge that only operations present in the allowlist can be
executed.

The allowlist approach ensures that new features added to io_uring
do not accidentally become available when an existing application
is launched on a newer kernel version.

Currently it is possible to restrict sqe opcodes, sqe flags, and
register opcodes.

IORING_REGISTER_RESTRICTIONS can only be called once. Afterwards
it is not possible to change restrictions anymore.
This prevents untrusted code from removing restrictions.
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

21b55dbc
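
A rough model of a register-once opcode allowlist, as described in this commit, might look like the following. It is a sketch under stated assumptions, not the kernel implementation: the bitmap representation, the 64-opcode limit, every `toy_*` name, and the specific error codes (-EBADFD, -EBUSY, -EACCES, -EINVAL) are illustrative choices. It also models only sqe opcodes, while the commit additionally covers sqe flags and register opcodes.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_OP_MAX 64 /* hypothetical cap so one 64-bit word is enough */

struct toy_restricted_ring {
    bool disabled;        /* ring has not been enabled yet */
    bool registered;      /* an allowlist has already been installed */
    uint64_t allowed_ops; /* bit i set means opcode i is allowed */
};

/* Install the allowlist. Succeeds at most once, and only while the
 * ring is still disabled. */
int toy_register_restrictions(struct toy_restricted_ring *ring,
                              const unsigned *ops, unsigned nr)
{
    uint64_t mask = 0;

    if (!ring->disabled)
        return -EBADFD;
    if (ring->registered)
        return -EBUSY;
    for (unsigned i = 0; i < nr; i++) {
        if (ops[i] >= TOY_OP_MAX)
            return -EINVAL;
        mask |= UINT64_C(1) << ops[i];
    }
    ring->allowed_ops = mask;
    ring->registered = true;
    return 0;
}

/* Check one opcode at submission time. */
int toy_check_op(const struct toy_restricted_ring *ring, unsigned op)
{
    if (!ring->registered)
        return 0; /* no allowlist installed: nothing is restricted */
    if (op >= TOY_OP_MAX || !(ring->allowed_ops & (UINT64_C(1) << op)))
        return -EACCES;
    return 0;
}
```

Because the check is an allowlist, an opcode added in a later version is rejected by default, which is the property the commit message highlights.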

io_uring: reference ->nsproxy for file table commands · 9b828492

Authored by Jens Axboe on Sep 18, 2020

If we don't get and assign the namespace for the async work, then certain
paths just don't work properly (like /dev/stdin, /proc/mounts, etc.).
Anything that references the current namespace of the given task should
be assigned for async work on behalf of that task.

Cc: stable@vger.kernel.org # v5.5+
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9b828492

io_uring: don't rely on weak ->files references · 0f212204

Authored by Jens Axboe on Sep 13, 2020

Grab actual references to the files_struct. To avoid circular reference
issues due to this, we add a per-task note that keeps track of what
io_uring contexts a task has used. When the task execs or exits its
assigned files, we cancel requests based on this tracking.

With that, we can grab proper references to the files table, and no
longer need to rely on stashing away ring_fd and ring_file to check
if the ring_fd may have been closed.

Cc: stable@vger.kernel.org # v5.5+
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0f212204

io_uring: enable task/files specific overflow flushing · e6c8aa9a

Authored by Jens Axboe on Sep 28, 2020

This allows us to selectively flush out pending overflows, depending on
the task and/or files_struct being passed in.

No intended functional changes in this patch.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e6c8aa9a

io_uring: return cancelation status from poll/timeout/files handlers · 76e1b642

Authored by Jens Axboe on Sep 26, 2020

Return whether we found and canceled requests or not. This is in
preparation for using this information; there are no functional changes
in this patch.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

76e1b642

io_uring: unconditionally grab req->task · e3bc8e9d

Authored by Jens Axboe on Sep 24, 2020

Sometimes we assign a weak reference to req->task, and sometimes we grab
a reference to it. Clean this up and make it unconditional, and drop the
flag related to tracking this state.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e3bc8e9d

io_uring: stash ctx task reference for SQPOLL · 2aede0e4

Authored by Jens Axboe on Sep 14, 2020

We can grab a reference to the task instead of stashing away the task's
files_struct. This is doable without creating a circular reference
between the ring fd and the task itself.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2aede0e4

io_uring: move dropping of files into separate helper · f573d384

Authored by Jens Axboe on Sep 22, 2020

No functional changes in this patch; this is a prep patch for grabbing
references to the files_struct.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f573d384

io_uring: allow timeout/poll/files killing to take task into account · f3606e3a

Authored by Jens Axboe on Sep 22, 2020

We currently cancel these when the ring exits, and we cancel all of
them. This is in preparation for killing only the ones associated
with a given task.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f3606e3a

29 Sep, 2020 1 commit

io_uring: fix async buffered reads when readahead is disabled · c8d317aa

Authored by Hao Xu on Sep 29, 2020

The async buffered reads feature is not working when readahead is
turned off. There are two issues:

- when retrying in io_read, not only the IOCB_WAITQ flag but also
  the IOCB_NOWAIT flag is still set, which makes it go to the would_block
  phase in generic_file_buffered_read() and then return -EAGAIN. After
  that, the io-wq thread work is queued, and the async read is later
  done the old way.

- even if we remove IOCB_NOWAIT when retrying, the feature is still
  not running properly, since generic_file_buffered_read() goes to
  lock_page_killable() after calling mapping->a_ops->readpage() to do
  IO, thus causing the process to sleep.

Fixes: 1a0a7853 ("mm: support async buffered reads in generic_file_buffered_read()")
Fixes: 3b2a4439 ("io_uring: get rid of kiocb_wait_page_queue_init()")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c8d317aa

28 Sep, 2020 2 commits

io_uring: fix potential ABBA deadlock in ->show_fdinfo() · fad8e0de

Authored by Jens Axboe on Sep 28, 2020

syzbot reports a potential lock deadlock between the normal IO path and
->show_fdinfo():

======================================================
WARNING: possible circular locking dependency detected
5.9.0-rc6-syzkaller #0 Not tainted
------------------------------------------------------
syz-executor.2/19710 is trying to acquire lock:
ffff888098ddc450 (sb_writers#4){.+.+}-{0:0}, at: io_write+0x6b5/0xb30 fs/io_uring.c:3296

but task is already holding lock:
ffff8880a11b8428 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0xe9a/0x1bd0 fs/io_uring.c:8348

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&ctx->uring_lock){+.+.}-{3:3}:
       __mutex_lock_common kernel/locking/mutex.c:956 [inline]
       __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
       __io_uring_show_fdinfo fs/io_uring.c:8417 [inline]
       io_uring_show_fdinfo+0x194/0xc70 fs/io_uring.c:8460
       seq_show+0x4a8/0x700 fs/proc/fd.c:65
       seq_read+0x432/0x1070 fs/seq_file.c:208
       do_loop_readv_writev fs/read_write.c:734 [inline]
       do_loop_readv_writev fs/read_write.c:721 [inline]
       do_iter_read+0x48e/0x6e0 fs/read_write.c:955
       vfs_readv+0xe5/0x150 fs/read_write.c:1073
       kernel_readv fs/splice.c:355 [inline]
       default_file_splice_read.constprop.0+0x4e6/0x9e0 fs/splice.c:412
       do_splice_to+0x137/0x170 fs/splice.c:871
       splice_direct_to_actor+0x307/0x980 fs/splice.c:950
       do_splice_direct+0x1b3/0x280 fs/splice.c:1059
       do_sendfile+0x55f/0xd40 fs/read_write.c:1540
       __do_sys_sendfile64 fs/read_write.c:1601 [inline]
       __se_sys_sendfile64 fs/read_write.c:1587 [inline]
       __x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1587
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #1 (&p->lock){+.+.}-{3:3}:
       __mutex_lock_common kernel/locking/mutex.c:956 [inline]
       __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
       seq_read+0x61/0x1070 fs/seq_file.c:155
       pde_read fs/proc/inode.c:306 [inline]
       proc_reg_read+0x221/0x300 fs/proc/inode.c:318
       do_loop_readv_writev fs/read_write.c:734 [inline]
       do_loop_readv_writev fs/read_write.c:721 [inline]
       do_iter_read+0x48e/0x6e0 fs/read_write.c:955
       vfs_readv+0xe5/0x150 fs/read_write.c:1073
       kernel_readv fs/splice.c:355 [inline]
       default_file_splice_read.constprop.0+0x4e6/0x9e0 fs/splice.c:412
       do_splice_to+0x137/0x170 fs/splice.c:871
       splice_direct_to_actor+0x307/0x980 fs/splice.c:950
       do_splice_direct+0x1b3/0x280 fs/splice.c:1059
       do_sendfile+0x55f/0xd40 fs/read_write.c:1540
       __do_sys_sendfile64 fs/read_write.c:1601 [inline]
       __se_sys_sendfile64 fs/read_write.c:1587 [inline]
       __x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1587
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #0 (sb_writers#4){.+.+}-{0:0}:
       check_prev_add kernel/locking/lockdep.c:2496 [inline]
       check_prevs_add kernel/locking/lockdep.c:2601 [inline]
       validate_chain kernel/locking/lockdep.c:3218 [inline]
       __lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4441
       lock_acquire+0x1f3/0xaf0 kernel/locking/lockdep.c:5029
       percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
       __sb_start_write+0x228/0x450 fs/super.c:1672
       io_write+0x6b5/0xb30 fs/io_uring.c:3296
       io_issue_sqe+0x18f/0x5c50 fs/io_uring.c:5719
       __io_queue_sqe+0x280/0x1160 fs/io_uring.c:6175
       io_queue_sqe+0x692/0xfa0 fs/io_uring.c:6254
       io_submit_sqe fs/io_uring.c:6324 [inline]
       io_submit_sqes+0x1761/0x2400 fs/io_uring.c:6521
       __do_sys_io_uring_enter+0xeac/0x1bd0 fs/io_uring.c:8349
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

other info that might help us debug this:

Chain exists of:
  sb_writers#4 --> &p->lock --> &ctx->uring_lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&ctx->uring_lock);
                               lock(&p->lock);
                               lock(&ctx->uring_lock);
  lock(sb_writers#4);

 *** DEADLOCK ***

1 lock held by syz-executor.2/19710:
 #0: ffff8880a11b8428 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0xe9a/0x1bd0 fs/io_uring.c:8348

stack backtrace:
CPU: 0 PID: 19710 Comm: syz-executor.2 Not tainted 5.9.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x198/0x1fd lib/dump_stack.c:118
 check_noncircular+0x324/0x3e0 kernel/locking/lockdep.c:1827
 check_prev_add kernel/locking/lockdep.c:2496 [inline]
 check_prevs_add kernel/locking/lockdep.c:2601 [inline]
 validate_chain kernel/locking/lockdep.c:3218 [inline]
 __lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4441
 lock_acquire+0x1f3/0xaf0 kernel/locking/lockdep.c:5029
 percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
 __sb_start_write+0x228/0x450 fs/super.c:1672
 io_write+0x6b5/0xb30 fs/io_uring.c:3296
 io_issue_sqe+0x18f/0x5c50 fs/io_uring.c:5719
 __io_queue_sqe+0x280/0x1160 fs/io_uring.c:6175
 io_queue_sqe+0x692/0xfa0 fs/io_uring.c:6254
 io_submit_sqe fs/io_uring.c:6324 [inline]
 io_submit_sqes+0x1761/0x2400 fs/io_uring.c:6521
 __do_sys_io_uring_enter+0xeac/0x1bd0 fs/io_uring.c:8349
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45e179
Code: 3d b2 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 0b b2 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f1194e74c78 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
RAX: ffffffffffffffda RBX: 00000000000082c0 RCX: 000000000045e179
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000004
RBP: 000000000118cf98 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000118cf4c
R13: 00007ffd1aa5756f R14: 00007f1194e759c0 R15: 000000000118cf4c

Fix this by just not diving into details if we fail to trylock the
io_uring mutex. We know the ctx isn't going away during this operation,
but we cannot safely iterate buffers/files/personalities if we don't
hold the io_uring mutex.

Reported-by: syzbot+2f8fa4e860edc3066aba@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fad8e0de
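
The fix for this deadlock, "show only what is safe if the lock cannot be taken immediately", is a general trylock pattern. Below is a minimal userspace sketch using a C11 `atomic_flag` as a stand-in for the io_uring mutex. All `toy_*` names, the struct layout, and the output format are assumptions for illustration; the kernel code uses its own mutex and seq_file APIs.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct toy_fdinfo_ctx {
    atomic_flag uring_lock; /* stand-in for the io_uring mutex */
    unsigned nr_files;
    const char *files[4];
};

bool toy_trylock(atomic_flag *lock)
{
    /* test_and_set returns the previous value: false means we now own it */
    return !atomic_flag_test_and_set_explicit(lock, memory_order_acquire);
}

void toy_unlock(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}

/* Prints a summary, plus per-file details only if the lock was obtained.
 * Returns the number of detail lines that were printed. */
unsigned toy_show_fdinfo(struct toy_fdinfo_ctx *ctx, FILE *out)
{
    bool has_lock = toy_trylock(&ctx->uring_lock);
    unsigned shown = 0;

    /* The summary counter is printed with or without the lock. */
    fprintf(out, "UserFiles:\t%u\n", ctx->nr_files);

    /* Walk the table only while holding the lock; never block here. */
    for (unsigned i = 0; has_lock && i < ctx->nr_files; i++) {
        fprintf(out, "%5u: %s\n", i, ctx->files[i]);
        shown++;
    }
    if (has_lock)
        toy_unlock(&ctx->uring_lock);
    return shown;
}
```

Since the reporting path never waits for the lock, it cannot take part in a lock-ordering cycle; the price is that output is occasionally less detailed.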

io_uring: always delete double poll wait entry on match · 8706e04e

Authored by Jens Axboe on Sep 28, 2020

syzbot reports a crash with tty polling, which is using the double poll
handling:

general protection fault, probably for non-canonical address 0xdffffc0000000009: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000048-0x000000000000004f]
CPU: 0 PID: 6874 Comm: syz-executor749 Not tainted 5.9.0-rc6-next-20200924-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:io_poll_get_single fs/io_uring.c:4778 [inline]
RIP: 0010:io_poll_double_wake+0x51/0x510 fs/io_uring.c:4845
Code: fc ff df 48 c1 ea 03 80 3c 02 00 0f 85 9e 03 00 00 48 b8 00 00 00 00 00 fc ff df 49 8b 5d 08 48 8d 7b 48 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 06 0f 8e 63 03 00 00 0f b6 6b 48 bf 06 00 00
RSP: 0018:ffffc90001c1fb70 EFLAGS: 00010006
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000004
RDX: 0000000000000009 RSI: ffffffff81d9b3ad RDI: 0000000000000048
RBP: dffffc0000000000 R08: ffff8880a3cac798 R09: ffffc90001c1fc60
R10: fffff52000383f73 R11: 0000000000000000 R12: 0000000000000004
R13: ffff8880a3cac798 R14: ffff8880a3cac7a0 R15: 0000000000000004
FS:  0000000001f98880(0000) GS:ffff8880ae400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f18886916c0 CR3: 0000000094c5a000 CR4: 00000000001506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 __wake_up_common+0x147/0x650 kernel/sched/wait.c:93
 __wake_up_common_lock+0xd0/0x130 kernel/sched/wait.c:123
 tty_ldisc_hangup+0x1cf/0x680 drivers/tty/tty_ldisc.c:735
 __tty_hangup.part.0+0x403/0x870 drivers/tty/tty_io.c:625
 __tty_hangup drivers/tty/tty_io.c:575 [inline]
 tty_vhangup+0x1d/0x30 drivers/tty/tty_io.c:698
 pty_close+0x3f5/0x550 drivers/tty/pty.c:79
 tty_release+0x455/0xf60 drivers/tty/tty_io.c:1679
 __fput+0x285/0x920 fs/file_table.c:281
 task_work_run+0xdd/0x190 kernel/task_work.c:141
 tracehook_notify_resume include/linux/tracehook.h:188 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:165 [inline]
 exit_to_user_mode_prepare+0x1e2/0x1f0 kernel/entry/common.c:192
 syscall_exit_to_user_mode+0x7a/0x2c0 kernel/entry/common.c:267
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x401210

which is due to a failure in removing the double poll wait entry if we
hit a wakeup match. This can cause multiple invocations of the wakeup,
which isn't safe.

Cc: stable@vger.kernel.org # v5.8
Reported-by: syzbot+81b3883093f772addf6d@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8706e04e

26 Sep, 2020 1 commit

io_uring: ensure async buffered read-retry is setup properly · f38c7e3a

Authored by Jens Axboe on Sep 25, 2020

A previous commit for fixing up short reads botched the async retry
path, so we ended up going to worker threads more often than we should.
Fix this up, so retries work the way they were originally intended to.

Fixes: 227c0c96 ("io_uring: internally retry short reads")
Reported-by: Hao_Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f38c7e3a

25 Sep, 2020 2 commits

io_uring: don't unconditionally set plug->nowait = true · 62c774ed

Authored by Jens Axboe on Sep 25, 2020

This causes all the bios to be submitted with REQ_NOWAIT, which can be
problematic on either btrfs or on file systems that otherwise use a mix
of block devices where only some of them support it.

For now, just remove the setting of plug->nowait = true.
Reported-by: Dan Melnic <dmm@fb.com>
Reported-by: Brian Foster <bfoster@redhat.com>
Fixes: b63534c4 ("io_uring: re-issue block requests that failed because of resources")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

62c774ed

io_uring: ensure open/openat2 name is cleaned on cancelation · f3cd4850

Authored by Jens Axboe on Sep 24, 2020

If we cancel these requests, we'll leak the memory associated with the
filename. Add them to the table of ops that need cleaning, if
REQ_F_NEED_CLEANUP is set.

Cc: stable@vger.kernel.org
Fixes: e62753e4 ("io_uring: call statx directly")
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f3cd4850

21 Sep, 2020 4 commits

io_uring: fix openat/openat2 unified prep handling · 4eb8dded

Authored by Jens Axboe on Sep 18, 2020

A previous commit unified how we handle prep for these two functions,
but this means that we check the allowed context (SQPOLL, specifically)
later than we should. Move the ring type checking into the two parent
functions, instead of doing it after we've done some setup work.

Fixes: ec65fea5 ("io_uring: deduplicate io_openat{,2}_prep()")
Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4eb8dded

io_uring: mark statx/files_update/epoll_ctl as non-SQPOLL · 6ca56f84

Authored by Jens Axboe on Sep 18, 2020

These will naturally fail when attempted through SQPOLL, but either
with -EFAULT or -EBADF. Make it explicit that these are not workable
through SQPOLL and return -EINVAL, just like other ops that need to
use ->files.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6ca56f84

io_uring: don't use retry based buffered reads for non-async bdev · f5cac8b1

Authored by Jens Axboe on Sep 14, 2020

Some block devices, like dm, bubble back -EAGAIN through the completion
handler. We check for this in io_read(), but don't honor it when
we have copied the iov. Return -EAGAIN for this case before retrying,
to force punt to io-wq.

Fixes: bcf5a063 ("io_uring: support true async buffered reads, if file provides it")
Reported-by: Zorro Lang <zlang@redhat.com>
Tested-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f5cac8b1

io_uring: don't re-setup vecs/iter in io_resubmit_prep() if already there · 8f3d7496

Authored by Jens Axboe on Sep 14, 2020

If we already have mapped the necessary data for retry, then don't set
it up again. It's a pointless operation, and we leak the iovec if it's
a large (non-stack) vec.

Fixes: b63534c4 ("io_uring: re-issue block requests that failed because of resources")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8f3d7496

15 Sep, 2020 2 commits

io_uring: don't run task work on an exiting task · 6200b0ae

Authored by Jens Axboe on Sep 13, 2020

This isn't safe, and isn't needed either. We are guaranteed that any
work we queue is on a live task (and will be run), or it goes to
our backup io-wq threads if the task is exiting.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6200b0ae

io_uring: drop 'ctx' ref on task work cancelation · 87ceb6a6

Authored by Jens Axboe on Sep 14, 2020

If task_work ends up being marked for cancelation, we go through a
cancelation helper instead of the queue path. In converting task_work to
always hold a ctx reference, this path was missed. Make sure that
io_req_task_cancel() puts the reference that is being held against the
ctx.

Fixes: 6d816e08 ("io_uring: hold 'ctx' reference around task_work queue + execute")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

87ceb6a6

14 Sep, 2020 1 commit

io_uring: grab any needed state during defer prep · 202700e1

Authored by Jens Axboe on Sep 12, 2020

Always grab work environment for deferred links. The assumption that we
will be running it always from the task in question is false, as exiting
tasks may mean that we're deferring this one to a thread helper. And at
that point it's too late to grab the work environment.

Fixes: debb85f4 ("io_uring: factor out grab_env() from defer_prep()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

202700e1

06 Sep, 2020 2 commits

io_uring: fix linked deferred ->files cancellation · c127a2a1

Authored by Pavel Begunkov on Sep 06, 2020

While looking for ->files in ->defer_list, consider that requests there
may actually be links.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c127a2a1

io_uring: fix cancel of deferred reqs with ->files · b7ddce3c

Authored by Pavel Begunkov on Sep 06, 2020

While trying to cancel requests with ->files, we should also look for
requests in ->defer_list, otherwise it might end up hanging a thread.

Cancel all requests in ->defer_list up to the last request there with
matching ->files; that's needed to follow drain ordering semantics.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b7ddce3c
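
The rule "cancel everything in the defer list up to and including the last request whose ->files matches" can be illustrated on a plain array. This is only a sketch of the selection logic: `toy_defer_req` and `toy_cancel_defer_files()` are hypothetical names, and the kernel operates on a linked list under a lock rather than on an array.

```c
#include <stddef.h>

struct toy_defer_req {
    const void *files; /* identity of the files table the request uses */
    int canceled;
};

/* Cancels list[0..last_match] in order, where last_match is the last entry
 * whose files pointer equals 'files'. Returns the number of canceled
 * entries, or 0 if nothing matched. */
size_t toy_cancel_defer_files(struct toy_defer_req *list, size_t n,
                              const void *files)
{
    size_t last = n; /* n means "no match found" */

    for (size_t i = 0; i < n; i++) {
        if (list[i].files == files)
            last = i;
    }
    if (last == n)
        return 0;

    /* Earlier entries go too, even if they belong to someone else:
     * deferred requests must not be reordered around a drain. */
    for (size_t i = 0; i <= last; i++)
        list[i].canceled = 1;
    return last + 1;
}
```

Entries after the last match are left alone, since canceling them is not needed to preserve ordering.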

05 Sep, 2020 1 commit

io_uring: fix explicit async read/write mapping for large segments · c183edff

Authored by Jens Axboe on Sep 04, 2020

If we exceed UIO_FASTIOV, we don't handle the transition correctly
between an allocated vec for requests that are queued with IOSQE_ASYNC.
Store the iovec appropriately and re-set it in the iter iov in case
it changed.

Fixes: ff6165b2 ("io_uring: retain iov_iter state over io_read/io_write calls")
Reported-by: Nick Hill <nick@nickhill.org>
Tested-by: Norman Maurer <norman.maurer@googlemail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c183edff

03 Sep, 2020 1 commit

io_uring: no read/write-retry on -EAGAIN error and O_NONBLOCK marked file · 355afaeb

Authored by Jens Axboe on Sep 02, 2020

There are actually two things that need fixing up here:

- The io_rw_reissue() -EAGAIN retry is explicit to block devices and
  regular files, so don't ever attempt to do that on other types of
  files.

- If we hit -EAGAIN on a nonblock marked file, don't arm poll handler for
  it. It should just complete with -EAGAIN.

Cc: stable@vger.kernel.org
Reported-by: Norman Maurer <norman.maurer@googlemail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

355afaeb

02 Sep, 2020 1 commit

io_uring: set table->files[i] to NULL when io_sqe_file_register failed · 95d1c8e5

Authored by Jiufei Xue on Sep 02, 2020

When io_sqe_file_register() fails in __io_sqe_files_update(),
table->files[i] still points to the original file, which may be freed
soon, and that will trigger use-after-free problems.

Cc: stable@vger.kernel.org
Fixes: f3bd9dae ("io_uring: fix memleak in __io_sqe_files_update()")
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

95d1c8e5

01 Sep, 2020 1 commit

io_uring: fix removing the wrong file in __io_sqe_files_update() · 98dfd502

Authored by Jiufei Xue on Sep 01, 2020

Index here is already the position of the file in fixed_file_table, so
we should not use io_file_from_index() again to get it. Otherwise, the
wrong file, which is still in use, may be released unexpectedly.

Cc: stable@vger.kernel.org # v5.6
Fixes: 05f3fb3c ("io_uring: avoid ring quiesce for fixed file set unregister and update")
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

98dfd502
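
The bug class fixed here, translating an index that has already been translated, is easy to reproduce with any two-level table. The sketch below is a userspace model: the table shift of 9, the struct layout, and all `toy_*` names are assumptions for illustration, and `toy_file_from_index()` only plays the role that io_file_from_index() has in the commit message.

```c
#define TOY_TABLE_SHIFT 9
#define TOY_TABLE_SIZE  (1u << TOY_TABLE_SHIFT)
#define TOY_TABLE_MASK  (TOY_TABLE_SIZE - 1)

struct toy_file_table {
    const char *files[TOY_TABLE_SIZE];
};

/* Global index to file: pick the sub-table, then the slot inside it. */
const char *toy_file_from_index(struct toy_file_table *tables, unsigned index)
{
    return tables[index >> TOY_TABLE_SHIFT].files[index & TOY_TABLE_MASK];
}

/* Buggy pattern: 'index' was already reduced to a slot number, so a second
 * global lookup always lands in sub-table 0. */
const char *toy_lookup_buggy(struct toy_file_table *tables, unsigned i)
{
    unsigned index = i & TOY_TABLE_MASK;

    return toy_file_from_index(tables, index);
}

/* Fixed pattern: use the sub-table and slot that were already computed. */
const char *toy_lookup_fixed(struct toy_file_table *tables, unsigned i)
{
    struct toy_file_table *table = &tables[i >> TOY_TABLE_SHIFT];
    unsigned index = i & TOY_TABLE_MASK;

    return table->files[index];
}
```

For global index 513, the fixed lookup reads slot 1 of sub-table 1, while the buggy one reads slot 1 of sub-table 0, which is an unrelated file that may still be in use.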

28 Aug, 2020 2 commits

io_uring: don't bounce block based -EAGAIN retry off task_work · fdee946d

Authored by Jens Axboe on Aug 27, 2020

These events happen inline from submission, so there's no need to
bounce them through the original task. Just set them up for retry
and issue retry directly instead of going over task_work.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fdee946d

io_uring: fix IOPOLL -EAGAIN retries · eefdf30f

Authored by Jens Axboe on Aug 27, 2020

This normally isn't hit, as polling is mostly done on NVMe with deep
queue depths. But if we do run into request starvation, we need to
ensure that retries are properly serialized.
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

eefdf30f

27 Aug, 2020 2 commits

io_uring: clear req->result on IOPOLL re-issue · 56450c20

Authored by Jens Axboe on Aug 26, 2020

Make sure we clear req->result, which was set to -EAGAIN for retry
purposes, when moving it to the reissue list. Otherwise we can end up
retrying a request more than once, which leads to weird results in
the io-wq handling (and other spots).

Cc: stable@vger.kernel.org
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

56450c20
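
Why a stale -EAGAIN in the result field leads to duplicate retries can be shown with a toy reap loop. This is a sketch, not the kernel's IOPOLL code: `toy_poll_req` and `toy_iopoll_reap()` are hypothetical, and the "reissue list" is reduced to a counter.

```c
#include <errno.h>
#include <stddef.h>

struct toy_poll_req {
    int result;        /* last completion result; -EAGAIN means "retry me" */
    unsigned reissues; /* how many times this request was re-queued */
};

/* One reap pass: every request still marked -EAGAIN is moved to the
 * reissue path. Clearing 'result' as part of the move keeps a later
 * pass from picking the same request up a second time. */
unsigned toy_iopoll_reap(struct toy_poll_req *reqs, size_t n)
{
    unsigned moved = 0;

    for (size_t i = 0; i < n; i++) {
        if (reqs[i].result != -EAGAIN)
            continue;
        reqs[i].result = 0; /* the step this commit adds */
        reqs[i].reissues++;
        moved++;
    }
    return moved;
}
```

Without the reset, a second pass over the same request would see -EAGAIN again and queue another retry.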

io_uring: make offset == -1 consistent with preadv2/pwritev2 · 0fef9483

Authored by Jens Axboe on Aug 26, 2020

The man page for io_uring generally claims we're consistent with what
preadv2 and pwritev2 accept, but it turns out there's a slight discrepancy
in how offset == -1 is handled for pipes/streams. preadv doesn't allow
it, but preadv2 does. This currently causes io_uring to return -EINVAL
if that is attempted, but we should allow that as documented.

This change makes us consistent with preadv2/pwritev2 for just passing
in a NULL ppos for streams if the offset is -1.

Cc: stable@vger.kernel.org # v5.7+
Reported-by: Benedikt Ames <wisp3rwind@posteo.eu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0fef9483
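
One way to picture the offset handling described in this commit is a helper that decides which position pointer is handed to the read/write call. The following is a simplified userspace model with hypothetical names (`toy_file`, `toy_kiocb`, `toy_prep_rw()`, `toy_kiocb_ppos()`); the rule that a seekable file with offset -1 uses its current position follows preadv2 semantics and is an assumption about how io_uring applies it.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_file {
    bool is_stream;  /* pipe, socket, ...: no meaningful file position */
    long long f_pos; /* current position for seekable files */
};

struct toy_kiocb {
    struct toy_file *file;
    long long ki_pos;
};

/* Prep step: offset -1 on a seekable file means "use the current position". */
void toy_prep_rw(struct toy_kiocb *kiocb, struct toy_file *file, long long off)
{
    kiocb->file = file;
    kiocb->ki_pos = off;
    if (off == -1 && !file->is_stream)
        kiocb->ki_pos = file->f_pos;
}

/* Position pointer for the actual I/O call: streams get NULL instead of a
 * pointer to a bogus offset. */
long long *toy_kiocb_ppos(struct toy_kiocb *kiocb)
{
    return kiocb->file->is_stream ? NULL : &kiocb->ki_pos;
}
```

Passing NULL rather than a pointer to -1 is what lets the stream case succeed instead of failing offset validation.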

26 Aug, 2020 2 commits

io_uring: ensure read requests go through -ERESTART* transformation · 00d23d51

Authored by Jens Axboe on Aug 25, 2020

We need to call kiocb_done() for any ret < 0 to ensure that we always
get the proper -ERESTARTSYS (and friends) transformation done.

At some point this should be tied into general error handling, so we
can get rid of the various (mostly network) related commands that check
and perform this substitution.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

00d23d51
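
The "-ERESTARTSYS (and friends) transformation" mentioned in this commit amounts to replacing kernel-internal restart codes with an error that userspace can act on. A minimal sketch follows, assuming the replacement is -EINTR. The `TOY_` constants mirror the numeric values of the kernel-internal restart codes, which are not exposed through the userspace <errno.h>; `toy_rw_fixup_result()` is a hypothetical name, not a kernel function.

```c
#include <errno.h>

/* Kernel-internal restart codes; these must never reach userspace. */
#define TOY_ERESTARTSYS           512
#define TOY_ERESTARTNOINTR        513
#define TOY_ERESTARTNOHAND        514
#define TOY_ERESTART_RESTARTBLOCK 516

/* An async request cannot simply restart a syscall, so every restart
 * code is reported as -EINTR; all other results pass through unchanged. */
long toy_rw_fixup_result(long ret)
{
    switch (ret) {
    case -TOY_ERESTARTSYS:
    case -TOY_ERESTARTNOINTR:
    case -TOY_ERESTARTNOHAND:
    case -TOY_ERESTART_RESTARTBLOCK:
        return -EINTR;
    default:
        return ret;
    }
}
```

Routing every negative result through one such function is what guarantees that no completion path leaks a restart code.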

io_uring: don't use poll handler if file can't be nonblocking read/written · 9dab14b8

Authored by Jens Axboe on Aug 25, 2020

There's no point in using the poll handler if we can't do a nonblocking
IO attempt of the operation, since we'll need to go async anyway. In
fact this is actively harmful, as reading from e.g. pipes won't return 0
to indicate EOF.

Cc: stable@vger.kernel.org # v5.7+
Reported-by: Benedikt Ames <wisp3rwind@posteo.eu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9dab14b8
