提交 · f67676d160c6ee2ed82917fadfed6d29cab8237c · openeuler / Kernel

03 12月, 2019 3 次提交

io_uring: ensure async punted read/write requests copy iovec · f67676d1

由 Jens Axboe 提交于 12月 02, 2019

Currently we don't copy the iovecs when we punt to async context. This
can be problematic for applications that store the iovec on the stack,
as they often assume that it's safe to let the iovec go out of scope
as soon as IO submission has been called. This isn't always safe, as we
will re-copy the iovec once we're in async context.

Make this 100% safe by copying the iovec just once. With this change,
applications may safely store the iovec on the stack for all cases.
Reported-by: N李通洲 <carter.li@eoitek.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

f67676d1

io_uring: add general async offload context · 1a6b74fc

由 Jens Axboe 提交于 12月 02, 2019

Right now we just copy the sqe for async offload, but we want to store
more context across an async punt. In preparation for doing so, put the
sqe copy inside a structure that we can expand. With this pointer added,
we can get rid of REQ_F_FREE_SQE, as that is now indicated by whether
req->io is NULL or not.

No functional changes in this patch.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

1a6b74fc

io_uring: transform send/recvmsg() -ERESTARTSYS to -EINTR · 441cdbd5

由 Jens Axboe 提交于 12月 02, 2019

We should never return -ERESTARTSYS to userspace, transform it into
-EINTR.

Cc: stable@vger.kernel.org # v5.3+
Signed-off-by: NJens Axboe <axboe@kernel.dk>

441cdbd5

02 12月, 2019 1 次提交

io_uring: use current task creds instead of allocating a new one · 0b8c0ec7

由 Jens Axboe 提交于 12月 02, 2019

syzbot reports:

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 9217 Comm: io_uring-sq Not tainted 5.4.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c
24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84
c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  io_sq_thread+0x1c7/0xa20 fs/io_uring.c:3274
  kthread+0x361/0x430 kernel/kthread.c:255
  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
Modules linked in:
---[ end trace f2e1a4307fbe2245 ]---
RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c
24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84
c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

which is caused by slab fault injection triggering a failure in
prepare_creds(). We don't actually need to create a copy of the creds
as we're not modifying it, we just need a reference on the current task
creds. This avoids the failure case as well, and propagates the const
throughout the stack.

Fixes: 181e448d ("io_uring: async workers should inherit the user creds")
Reported-by: syzbot+5320383e16029ba057ff@syzkaller.appspotmail.com
Signed-off-by: NJens Axboe <axboe@kernel.dk>

0b8c0ec7

30 11月, 2019 1 次提交

io_uring: fix missing kmap() declaration on powerpc · aa4c3967

由 Jens Axboe 提交于 11月 29, 2019

Christophe reports that current master fails building on powerpc with
this error:

   CC      fs/io_uring.o
fs/io_uring.c: In function ‘loop_rw_iter’:
fs/io_uring.c:1628:21: error: implicit declaration of function ‘kmap’
[-Werror=implicit-function-declaration]
     iovec.iov_base = kmap(iter->bvec->bv_page)
                      ^
fs/io_uring.c:1628:19: warning: assignment makes pointer from integer
without a cast [-Wint-conversion]
     iovec.iov_base = kmap(iter->bvec->bv_page)
                    ^
fs/io_uring.c:1643:4: error: implicit declaration of function ‘kunmap’
[-Werror=implicit-function-declaration]
     kunmap(iter->bvec->bv_page);
     ^

which is caused by a missing highmem.h include. Fix it by including
it.

Fixes: 311ae9e1 ("io_uring: fix dead-hung for non-iter fixed rw")
Reported-by: NChristophe Leroy <christophe.leroy@c-s.fr>
Tested-by: NChristophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

aa4c3967

29 11月, 2019 1 次提交

io_uring: add mapping support for NOMMU archs · 6c5c240e

由 Roman Penyaev 提交于 11月 28, 2019

That is a bit weird scenario but I find it interesting to run fio loads
using LKL linux, where MMU is disabled.  Probably other real archs which
run uClinux can also benefit from this patch.
Signed-off-by: NRoman Penyaev <rpenyaev@suse.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

6c5c240e

27 11月, 2019 4 次提交

io_uring: make poll->wait dynamically allocated · e944475e

由 Jens Axboe 提交于 11月 26, 2019

In the quest to bring io_kiocb down to 3 cachelines, this one does
the trick. Make the wait_queue_entry for the poll command come out
of kmalloc instead of embedding it in struct io_poll_iocb, as the
latter is the largest member of io_kiocb. Once we trim this down a
bit, we're back at a healthy 192 bytes for struct io_kiocb.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e944475e

io_uring: cleanup io_import_fixed() · 7d009165

由 Pavel Begunkov 提交于 11月 25, 2019

Clean io_import_fixed() call site and make it return proper type.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

7d009165

io_uring: inline struct sqe_submit · cf6fd4bd

由 Pavel Begunkov 提交于 11月 25, 2019

There is no point left in keeping struct sqe_submit. Inline it
into struct io_kiocb, so any req->submit.field is now just req->field

- moves initialisation of ring_file into io_get_req()
- removes duplicated req->sequence.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

cf6fd4bd

io_uring: store timeout's sqe->off in proper place · cc42e0ac

由 Pavel Begunkov 提交于 11月 25, 2019

Timeouts' sequence offset (i.e. sqe->off) is stored in
req->submit.sequence under a false name. Keep it in timeout.data
instead. The unused space for sequence will be reclaimed in the
following patches.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

cc42e0ac

26 11月, 2019 30 次提交

io_uring: remove superfluous check for sqe->off in io_accept() · 8042d6ce

由 Hrvoje Zeba 提交于 11月 25, 2019

This field contains a pointer to addrlen and checking to see if it's set
returns -EINVAL if the caller sets addr & addrlen pointers.

Fixes: 17f2fe35 ("io_uring: add support for IORING_OP_ACCEPT")
Signed-off-by: NHrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

8042d6ce

io_uring: async workers should inherit the user creds · 181e448d

由 Jens Axboe 提交于 11月 25, 2019

If we don't inherit the original task creds, then we can confuse users
like fuse that pass creds in the request header. See link below on
identical aio issue.

Link: https://lore.kernel.org/linux-fsdevel/26f0d78e-99ca-2f1b-78b9-433088053a61@scylladb.com/T/#uSigned-off-by: NJens Axboe <axboe@kernel.dk>

181e448d

io-wq: have io_wq_create() take a 'data' argument · 576a347b

由 Jens Axboe 提交于 11月 25, 2019

We currently pass in 4 arguments outside of the bounded size. In
preparation for adding one more argument, let's bundle them up in
a struct to make it more readable.

No functional changes in this patch.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

576a347b

io_uring: fix dead-hung for non-iter fixed rw · 311ae9e1

由 Pavel Begunkov 提交于 11月 24, 2019

Read/write requests to devices without implemented read/write_iter
using fixed buffers can cause general protection fault, which totally
hangs a machine.

io_import_fixed() initialises iov_iter with bvec, but loop_rw_iter()
accesses it as iovec, dereferencing random address.

kmap() page by page in this case

Cc: stable@vger.kernel.org
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

311ae9e1

io_uring: add support for IORING_OP_CONNECT · f8e85cf2

由 Jens Axboe 提交于 11月 23, 2019

This allows an application to call connect() in an async fashion. Like
other opcodes, we first try a non-blocking connect, then punt to async
context if we have to.

Note that we can still return -EINPROGRESS, and in that case the caller
should use IORING_OP_POLL_ADD to do an async wait for completion of the
connect request (just like for regular connect(2), except we can do it
async here too).
Signed-off-by: NJens Axboe <axboe@kernel.dk>

f8e85cf2

io_uring: only return -EBUSY for submit on non-flushed backlog · c4a2ed72

由 Jens Axboe 提交于 11月 21, 2019

We return -EBUSY on submit when we have a CQ ring overflow backlog, but
that can be a bit problematic if the application is using pure userspace
poll of the CQ ring. For that case, if the ring briefly overflowed and
we have pending entries in the backlog, the submit flushes the backlog
successfully but still returns -EBUSY. If we're able to fully flush the
CQ ring backlog, let the submission proceed.
Reported-by: NDan Melnic <dmm@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

c4a2ed72

io_uring: only !null ptr to io_issue_sqe() · f9bd67f6

由 Pavel Begunkov 提交于 11月 21, 2019

Pass only non-null @nxt to io_issue_sqe() and handle it at the caller's
side. And propagate it.

- kiocb_done() is only called from io_read() and io_write(), which are
only called from io_issue_sqe(), so it's @nxt != NULL

- io_put_req_find_next() is called either with explicitly non-null local
nxt, or from one of the functions in io_issue_sqe() switch (or their
callees).
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

f9bd67f6

io_uring: simplify io_req_link_next() · b18fdf71

由 Pavel Begunkov 提交于 11月 21, 2019

"if (nxt)" is always true, as it was checked in the while's condition.
io_wq_current_is_worker() is unnecessary, as non-async callers don't
pass nxt, so io_queue_async_work() will be called for them anyway.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b18fdf71

io_uring: pass only !null to io_req_find_next() · 944e58bf

由 Pavel Begunkov 提交于 11月 21, 2019

Make io_req_find_next() and io_req_link_next() to accept only non-null
nxt, and handle it in callers.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

944e58bf

io_uring: remove io_free_req_find_next() · 70cf9f32

由 Pavel Begunkov 提交于 11月 21, 2019

There is only one one-liner user of io_free_req_find_next(). Inline it.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

70cf9f32

io_uring: add likely/unlikely in io_get_sqring() · 9835d6fa

由 Pavel Begunkov 提交于 11月 21, 2019

The number of SQEs to submit is specified by a user, so io_get_sqring()
in most of the cases succeeds. Hint compilers about that.

Checking ASM genereted by gcc 9.2.0 for x64, there is one branch
misprediction.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

9835d6fa

io_uring: rename __io_submit_sqe() · d732447f

由 Pavel Begunkov 提交于 11月 21, 2019

__io_submit_sqe() is issuing requests, so call it as
such. Moreover, it ends by calling io_iopoll_req_issued().

Rename it and make terminology clearer.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

d732447f

io_uring: improve trace_io_uring_defer() trace point · 915967f6

由 Jens Axboe 提交于 11月 21, 2019

We don't have shadow requests anymore, so get rid of the shadow
argument. Add the user_data argument, as that's often useful to easily
match up requests, instead of having to look at request pointers.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

915967f6

io_uring: drain next sqe instead of shadowing · 1b4a51b6

由 Pavel Begunkov 提交于 11月 21, 2019

There's an issue with the shadow drain logic in that we drop the
completion lock after deciding to defer a request, then re-grab it later
and assume that the state is still the same. In the mean time, someone
else completing a request could have found and issued it. This can cause
a stall in the queue, by having a shadow request inserted that nobody is
going to drain.

Additionally, if we fail allocating the shadow request, we simply ignore
the drain.

Instead of using a shadow request, defer the next request/link instead.
This also has the following advantages:

- removes semi-duplicated code
- doesn't allocate memory for shadows
- works better if only the head marked for drain
- doesn't need complex synchronisation

On the flip side, it removes the shadow->seq ==
last_drain_in_in_link->seq optimization. That shouldn't be a common
case, and can always be added back, if needed.

Fixes: 4fe2c963 ("io_uring: add support for link with drain")
Cc: Jackie Liu <liuyun01@kylinos.cn>
Reported-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

1b4a51b6

io_uring: close lookup gap for dependent next work · b76da70f

由 Jens Axboe 提交于 11月 20, 2019

When we find new work to process within the work handler, we queue the
linked timeout before we have issued the new work. This can be
problematic for very short timeouts, as we have a window where the new
work isn't visible.

Allow the work handler to store a callback function for this in the work
item, and flag it with IO_WQ_WORK_CB if the caller has done so. If that
is set, then io-wq will call the callback when it has setup the new work
item.
Reported-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b76da70f

io_uring: allow finding next link independent of req reference count · 4d7dd462

由 Jens Axboe 提交于 11月 20, 2019

We currently try and start the next link when we put the request, and
only if we were going to free it. This means that the optimization to
continue executing requests from the same context often fails, as we're
not putting the final reference.

Add REQ_F_LINK_NEXT to keep track of this, and allow io_uring to find the
next request more efficiently.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

4d7dd462

io_uring: io_allocate_scq_urings() should return a sane state · eb065d30

由 Jens Axboe 提交于 11月 20, 2019

We currently rely on the ring destroy on cleaning things up in case of
failure, but io_allocate_scq_urings() can leave things half initialized
if only parts of it fails.

Be nice and return with either everything setup in success, or return an
error with things nicely cleaned up.

Reported-by: syzbot+0d818c0d39399188f393@syzkaller.appspotmail.com
Signed-off-by: NJens Axboe <axboe@kernel.dk>

eb065d30

io_uring: Always REQ_F_FREE_SQE for allocated sqe · bbad27b2

由 Pavel Begunkov 提交于 11月 19, 2019

Always mark requests with allocated sqe and deallocate it in
__io_free_req(). It's easier to follow and doesn't add edge cases.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

bbad27b2

io_uring: io_fail_links() should only consider first linked timeout · 5d960724

由 Jens Axboe 提交于 11月 19, 2019

We currently clear the linked timeout field if we cancel such a timeout,
but we should only attempt to cancel if it's the first one we see.
Others should simply be freed like other requests, as they haven't
been started yet.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

5d960724

io_uring: Fix leaking linked timeouts · 09fbb0a8

由 Pavel Begunkov 提交于 11月 19, 2019

let have a dependant link: REQ -> LINK_TIMEOUT -> LINK_TIMEOUT

1. submission stage: submission references for REQ and LINK_TIMEOUT
are dropped. So, references respectively (1,1,2)

2. io_put(REQ) + FAIL_LINKS stage: calls io_fail_links(), which for all
linked timeouts will call cancel_timeout() and drop 1 reference.
So, references after: (0,0,1). That's a leak.

Make it treat only the first linked timeout as such, and pass others
through __io_double_put_req().
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

09fbb0a8

io_uring: remove redundant check · f70193d6

由 Pavel Begunkov 提交于 11月 19, 2019

Pass any IORING_OP_LINK_TIMEOUT request further, where it will
eventually fail in io_issue_sqe().
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

f70193d6

io_uring: break links for failed defer · d3b35796

由 Pavel Begunkov 提交于 11月 19, 2019

If io_req_defer() failed, it needs to cancel a dependant link.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

d3b35796

io_uring: request cancellations should break links · fba38c27

由 Jens Axboe 提交于 11月 18, 2019

We currently don't explicitly break links if a request is cancelled, but
we should. Add explicitly link breakage for all types of request
cancellations that we support.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

fba38c27

io_uring: correct poll cancel and linked timeout expiration completion · b0dd8a41

由 Jens Axboe 提交于 11月 18, 2019

Currently a poll request fills a completion entry of 0, even if it got
cancelled. This is odd, and it makes it harder to support with chains.
Ensure that it returns -ECANCELED in the completions events if it got
cancelled, and furthermore ensure that the linked timeout that triggered
it completes with -ETIME if we did indeed trigger the completions
through a timeout.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b0dd8a41

io_uring: remove dead REQ_F_SEQ_PREV flag · e0e328c4

由 Jens Axboe 提交于 11月 15, 2019

With the conversion to io-wq, we no longer use that flag. Kill it.

Fixes: 561fb04a ("io_uring: replace workqueue usage with io-wq")
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e0e328c4

io_uring: fix sequencing issues with linked timeouts · 94ae5e77

由 Jens Axboe 提交于 11月 14, 2019

We have an issue with timeout links that are deeper in the submit chain,
because we only handle it upfront, not from later submissions. Move the
prep + issue of the timeout link to the async work prep handler, and do
it normally for non-async queue. If we validate and prepare the timeout
links upfront when we first see them, there's nothing stopping us from
supporting any sort of nesting.

Fixes: 2665abfd ("io_uring: add support for linked SQE timeouts")
Reported-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

94ae5e77

io_uring: make req->timeout be dynamically allocated · ad8a48ac

由 Jens Axboe 提交于 11月 15, 2019

There are a few reasons for this:

- As a prep to improving the linked timeout logic
- io_timeout is the biggest member in the io_kiocb opcode union

This also enables a few cleanups, like unifying the timer setup between
IORING_OP_TIMEOUT and IORING_OP_LINK_TIMEOUT, and not needing multiple
arguments to the link/prep helpers.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

ad8a48ac

io_uring: make io_double_put_req() use normal completion path · 978db57e

由 Jens Axboe 提交于 11月 14, 2019

If we don't use the normal completion path, we may skip killing links
that should be errored and freed. Add __io_double_put_req() for use
within the completion path itself, other calls should just use
io_double_put_req().
Signed-off-by: NJens Axboe <axboe@kernel.dk>

978db57e

io_uring: cleanup return values from the queueing functions · 0e0702da

由 Jens Axboe 提交于 11月 14, 2019

__io_queue_sqe(), io_queue_sqe(), io_queue_link_head() all return 0/err,
but the caller doesn't care since the errors are handled inline. Clean
these up and just make them void.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

0e0702da

io_uring: io_async_cancel() should pass in 'nxt' request pointer · 95a5bbae

由 Jens Axboe 提交于 11月 14, 2019

If we have a linked request, this enables us to pass it back directly
without having to go through async context.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

95a5bbae

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功