Commits · d55e5f5b70dd6214ef81fb2313121b72a7dd2200 · openeuler / Kernel

20 Dec, 2019 1 commit

io_uring: use u64_to_user_ptr() consistently · d55e5f5b

Committed by Jens Axboe on Dec 11, 2019

We use it in some spots, but not consistently. Convert the rest over,
makes it easier to read as well.

No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d55e5f5b
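
To illustrate the kind of conversion this commit standardizes, here is a minimal userspace sketch. The helper names u64_to_ptr() and ptr_to_u64() are hypothetical stand-ins, not the kernel macro itself; the point is only that a 64-bit ABI field is cast through uintptr_t in one place instead of with ad hoc casts at each call site.

```c
#include <stdint.h>

/* Hypothetical userspace analogue of a u64-to-pointer helper.
 * Casting through uintptr_t keeps the conversion well-defined on
 * targets where pointers are narrower than 64 bits. */
static inline void *u64_to_ptr(uint64_t addr)
{
    return (void *)(uintptr_t)addr;
}

/* Inverse direction, as used when filling a 64-bit address field. */
static inline uint64_t ptr_to_u64(const void *p)
{
    return (uint64_t)(uintptr_t)p;
}
```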

19 Dec, 2019 2 commits

io_uring: io_wq_submit_work() should not touch req->rw · fd6c2e4c

Committed by Jens Axboe on Dec 18, 2019

I've been chasing a weird and obscure crash that was userspace stack
corruption, and finally narrowed it down to a bit flip that made a
stack address invalid. io_wq_submit_work() unconditionally flips
the req->rw.ki_flags IOCB_NOWAIT bit, but since it's a generic work
handler, this isn't valid. Normal read/write operations own that
part of the request, on other types it could be something else.

Move the IOCB_NOWAIT clear to the read/write handlers where it belongs.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fd6c2e4c
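
The failure mode described above can be sketched with a small union. Everything here (struct request, FLAG_NOWAIT, the handler names) is a hypothetical simplification, not the kernel's layout: it only shows why a generic handler must not write through one view of shared storage when the request may be using another view.

```c
#include <stdint.h>

#define FLAG_NOWAIT (1u << 3)   /* hypothetical bit standing in for IOCB_NOWAIT */

enum op_kind { OP_RW, OP_OTHER };   /* hypothetical request kinds */

struct request {
    enum op_kind kind;
    union {
        struct { uint32_t ki_flags; } rw;      /* valid only for OP_RW */
        struct { uint32_t addr_bits; } other;  /* same bytes, different meaning */
    };
};

/* Buggy pattern: assumes every request uses the rw view. */
static void generic_handler_buggy(struct request *req)
{
    req->rw.ki_flags &= ~FLAG_NOWAIT;
}

/* Fixed pattern: the flag is cleared only by the handler that owns it. */
static void rw_handler(struct request *req)
{
    req->rw.ki_flags &= ~FLAG_NOWAIT;
}

static void generic_handler_fixed(struct request *req)
{
    if (req->kind == OP_RW)
        rw_handler(req);
}
```

With a non-rw request, the buggy handler silently clears one bit of unrelated data, which matches the "bit flip" symptom in the commit message.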

io_uring: don't wait when under-submitting · 7c504e65

Committed by Pavel Begunkov on Dec 18, 2019

There is no reliable way to submit and wait in a single syscall, as
io_submit_sqes() may under-consume sqes (in case of an early error).
Then it will wait for not-yet-submitted requests, deadlocking the user
in most cases.

Don't wait/poll if we can't submit all sqes.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7c504e65
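
The rule in this commit can be sketched as a single decision. The function name wait_target() and its parameters are assumptions made for illustration; the real check lives inside the io_uring enter path.

```c
/* Returns how many completions the caller should wait for.
 * 0 means "return to userspace immediately". */
static unsigned wait_target(unsigned to_submit, unsigned submitted,
                            unsigned min_complete)
{
    /* Under-submitted: some requests were never queued, so waiting
     * for their completions could block forever. */
    if (submitted != to_submit)
        return 0;
    return min_complete;
}
```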

18 Dec, 2019 9 commits

io_uring: warn about unhandled opcode · e781573e

Committed by Jens Axboe on Dec 17, 2019

Now that we have all the opcodes handled in terms of command prep and
SQE reuse, add a printk_once() to warn about any potentially new and
unhandled ones.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e781573e

io_uring: read opcode and user_data from SQE exactly once · d625c6ee

Committed by Jens Axboe on Dec 17, 2019

If we defer a request, we can't be reading the opcode again. Ensure that
the user_data and opcode fields are stable. For the user_data we already
have a place for it, for the opcode we can fill a one byte hole and store
that as well. For both of them, assign them when we originally read the
SQE in io_get_sqring(). Any code that uses sqe->opcode or sqe->user_data
is switched to req->opcode and req->user_data.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d625c6ee
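
A minimal sketch of the "read once" idea follows. The struct layouts and fetch_sqe() are hypothetical simplifications: the fields are snapshotted into the request when the SQE is first fetched, so later changes to the application-owned SQE cannot affect a deferred request.

```c
#include <stdint.h>

/* Hypothetical, trimmed-down SQE; the real one lives in memory
 * shared with (and writable by) the application. */
struct sqe {
    uint8_t  opcode;
    uint64_t user_data;
};

struct request {
    const struct sqe *sqe;
    uint8_t  opcode;      /* stable copy */
    uint64_t user_data;   /* stable copy */
};

/* Snapshot the fields exactly once, at fetch time. */
static void fetch_sqe(struct request *req, const struct sqe *sqe)
{
    req->sqe = sqe;
    req->opcode = sqe->opcode;
    req->user_data = sqe->user_data;
}
```

Code that runs later consults req->opcode and req->user_data rather than dereferencing the SQE again.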

io_uring: make IORING_OP_TIMEOUT_REMOVE deferrable · b29472ee

Committed by Jens Axboe on Dec 17, 2019

If we defer this command as part of a link, we have to make sure that
the SQE data has been read upfront. Integrate the timeout remove op into
the prep handling to make it safe for SQE reuse.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b29472ee

io_uring: make IORING_OP_CANCEL_ASYNC deferrable · fbf23849

Committed by Jens Axboe on Dec 17, 2019

If we defer this command as part of a link, we have to make sure that
the SQE data has been read upfront. Integrate the async cancel op into
the prep handling to make it safe for SQE reuse.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fbf23849

io_uring: make IORING_POLL_ADD and IORING_POLL_REMOVE deferrable · 0969e783

Committed by Jens Axboe on Dec 17, 2019

If we defer these commands as part of a link, we have to make sure that
the SQE data has been read upfront. Integrate the poll add/remove into
the prep handling to make it safe for SQE reuse.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0969e783

io_uring: make HARDLINK imply LINK · ffbb8d6b

Committed by Pavel Begunkov on Dec 17, 2019

The rules are as follows, if IOSQE_IO_HARDLINK is specified, then it's a
link and there is no need to set IOSQE_IO_LINK separately, though it
could be there. Add proper check and ensure that IOSQE_IO_HARDLINK
implies IOSQE_IO_LINK.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ffbb8d6b
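
The rule stated above can be sketched as two flag tests. The bit positions are written to match the uapi header as far as I know, and should be treated as an assumption here; the helper names sqe_is_link() and sqe_is_hardlink() are hypothetical.

```c
#include <stdbool.h>

/* Assumed to match the uapi flag bits; verify against your headers. */
#define IOSQE_IO_LINK     (1U << 2)
#define IOSQE_IO_HARDLINK (1U << 3)

/* A request is part of a link if either flag is set, so HARDLINK
 * on its own is enough. */
static bool sqe_is_link(unsigned flags)
{
    return (flags & (IOSQE_IO_LINK | IOSQE_IO_HARDLINK)) != 0;
}

static bool sqe_is_hardlink(unsigned flags)
{
    return (flags & IOSQE_IO_HARDLINK) != 0;
}
```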

io_uring: any deferred command must have stable sqe data · 8ed8d3c3

Committed by Jens Axboe on Dec 16, 2019

We're currently not retaining sqe data for accept, fsync, and
sync_file_range. None of these commands need data outside of what
is directly provided, hence it can't go stale when the request is
deferred. However, it can get reused, if an application reuses
SQE entries.

Ensure that we retain the information we need and only read the sqe
contents once, off the submission path. Most of this is just moving
code into a prep and finish function.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8ed8d3c3

io_uring: remove 'sqe' parameter to the OP helpers that take it · fc4df999

Committed by Jens Axboe on Dec 10, 2019

We pass in req->sqe for all of them, no need to pass it in as the
request is always passed in. This is a necessary prep patch to be
able to cleanup/fix the request prep path.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fc4df999

io_uring: fix pre-prepped issue with force_nonblock == true · b7bb4f7d

Committed by Jens Axboe on Dec 15, 2019

Some of these code paths assume that any force_nonblock == true issue
is not prepped, but that's not true if we did prep as part of link setup
earlier. Check if we already have an async context allocated before
setting up a new one.

Cleanup the async context setup in general, we have a lot of duplicated
code there.

Fixes: 03b1230c ("io_uring: ensure async punted sendmsg/recvmsg requests copy data")
Fixes: f67676d1 ("io_uring: ensure async punted read/write requests copy iovec")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b7bb4f7d

16 Dec, 2019 2 commits

io_uring: fix sporadic -EFAULT from IORING_OP_RECVMSG · 0b416c3e

Committed by Jens Axboe on Dec 15, 2019

If we have to punt the recvmsg to async context, we copy all the
context. But since the iovec used can be either on-stack (if small) or
dynamically allocated, if it's on-stack, then we need to ensure we reset
the iov pointer. If we don't, then we're reusing old stack data, and
that can lead to -EFAULTs if things get overwritten.

Ensure we retain the right pointers for the iov, and free it as well if
we end up having to go beyond UIO_FASTIOV number of vectors.

Fixes: 03b1230c ("io_uring: ensure async punted sendmsg/recvmsg requests copy data")
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0b416c3e
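
The pointer hazard described above can be sketched as follows. struct msg_ctx and msg_ctx_copy() are hypothetical; FAST_IOV stands in for UIO_FASTIOV. When a context that uses its own inline array is copied byte for byte, the iov pointer still refers to the old (possibly on-stack) array unless it is reset.

```c
#include <string.h>
#include <sys/uio.h>

#define FAST_IOV 8   /* stand-in for UIO_FASTIOV */

/* Hypothetical context: small vectors live inline, larger ones
 * are allocated elsewhere and referenced through iov. */
struct msg_ctx {
    struct iovec fast_iov[FAST_IOV];
    struct iovec *iov;
};

static void msg_ctx_copy(struct msg_ctx *dst, const struct msg_ctx *src)
{
    memcpy(dst, src, sizeof(*dst));
    /* If the source used its inline array, point the copy at its
     * own inline array instead of the source's storage. */
    if (src->iov == src->fast_iov)
        dst->iov = dst->fast_iov;
}
```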

io_uring: fix stale comment and a few typos · d195a66e

Committed by Brian Gianforcaro on Dec 13, 2019

- Fix a few typos found while reading the code.

- Fix stale io_get_sqring comment referencing s->sqe, the 's' parameter
  was renamed to 'req', but the comment still holds.
Signed-off-by: Brian Gianforcaro <b.gianfo@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d195a66e

12 Dec, 2019 1 commit

io_uring: ensure we return -EINVAL on unknown opcode · 9e3aa61a

Committed by Jens Axboe on Dec 11, 2019

If we submit an unknown opcode and have fd == -1, io_op_needs_file()
will return true as we default to needing a file. Then when we go and
assign the file, we find the 'fd' invalid and return -EBADF. We really
should be returning -EINVAL for that case, as we normally do for
unsupported opcodes.

Change io_op_needs_file() to have the following return values:

0   - does not need a file
1   - does need a file
< 0 - error value

and use this to pass back the right value for this invalid case.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9e3aa61a
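
The three-way return convention listed above can be sketched like this. The opcode enum and both function names are hypothetical and cover only a tiny subset; the sketch only shows how an error from the "needs file" check takes precedence over the fd check.

```c
#include <errno.h>

/* Hypothetical opcode subset for illustration. */
enum { OP_NOP, OP_READV, OP_TIMEOUT };

/* 0: no file needed, 1: file needed, < 0: error. */
static int op_needs_file(int opcode)
{
    switch (opcode) {
    case OP_NOP:
    case OP_TIMEOUT:
        return 0;
    case OP_READV:
        return 1;
    default:
        return -EINVAL;   /* unknown opcode */
    }
}

static int assign_file(int opcode, int fd)
{
    int ret = op_needs_file(opcode);

    if (ret <= 0)
        return ret;              /* no file needed, or error passed back */
    return fd < 0 ? -EBADF : 0;  /* file needed: validate the fd */
}
```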

11 Dec, 2019 7 commits

io_uring: add sockets to list of files that support non-blocking issue · 10d59345

Committed by Jens Axboe on Dec 09, 2019

In chasing a performance issue between using IORING_OP_RECVMSG and
IORING_OP_READV on sockets, tracing showed that we always punt the
socket reads to async offload. This is due to io_file_supports_async()
not checking for S_ISSOCK on the inode. Since sockets support the
O_NONBLOCK (or MSG_DONTWAIT) flag just fine, add sockets to the list
of file types that we can do a non-blocking issue to.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

10d59345

io_uring: only hash regular files for async work execution · 53108d47

Committed by Jens Axboe on Dec 09, 2019

We hash regular files to avoid having multiple threads hammer on the
inode mutex, but it should not be needed on other types of files
(like sockets).
Signed-off-by: Jens Axboe <axboe@kernel.dk>

53108d47

io_uring: run next sqe inline if possible · 4a0a7a18

Committed by Jens Axboe on Dec 09, 2019

One major use case of linked commands is the ability to run the next
link inline, if at all possible. This is done correctly for async
offload, but somewhere along the line we lost the ability to do so when
we were able to complete a request without having to punt it. Ensure
that we do so correctly.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4a0a7a18

io_uring: don't dynamically allocate poll data · 392edb45

Committed by Jens Axboe on Dec 09, 2019

This essentially reverts commit e944475e. For high poll ops
workloads, like TAO, the dynamic allocation of the wait_queue
entry for IORING_OP_POLL_ADD adds considerable extra overhead.
Go back to embedding the wait_queue_entry, but keep the usage of
wait->private for the pointer stashing.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

392edb45

io_uring: deferred send/recvmsg should assign iov · d9688565

Committed by Jens Axboe on Dec 09, 2019

Don't just assign it from the main call path, that can miss the case
when we're called from issue deferral.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d9688565

io_uring: sqthread should grab ctx->uring_lock for submissions · 8a4955ff

Committed by Jens Axboe on Dec 09, 2019

We use the mutex to guard against registered file updates, for instance.
Ensure we're safe in accessing that state against concurrent updates.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8a4955ff

io_uring: allow unbreakable links · 4e88d6e7

Committed by Jens Axboe on Dec 07, 2019

Some commands will invariably end in a failure in the sense that the
completion result will be less than zero. One such example is timeouts
that don't have a completion count set, they will always complete with
-ETIME unless cancelled.

For linked commands, we sever links and fail the rest of the chain if
the result is less than zero. Since we have commands where we know that
will happen, add IOSQE_IO_HARDLINK as a stronger link that doesn't sever
regardless of the completion result. Note that the link will still sever
if we fail submitting the parent request, hard links are only resilient
in the presence of completion results for requests that did submit
correctly.

Cc: stable@vger.kernel.org # v5.4
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4e88d6e7
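
The completion-time behaviour described above can be modelled with a small simulation. struct step and run_chain() are hypothetical; the model covers only completion results and deliberately ignores the submission-failure case, which still severs the chain.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* One request in a chain: the result it would complete with, and
 * whether its link to the next request is a hard link. */
struct step {
    int  result;
    bool hard;
};

/* Fill out[] with the final completion code of each request. */
static void run_chain(const struct step *chain, size_t n, int *out)
{
    bool severed = false;

    for (size_t i = 0; i < n; i++) {
        if (severed) {
            out[i] = -ECANCELED;   /* rest of the chain fails */
            continue;
        }
        out[i] = chain[i].result;
        /* A normal link severs on a negative result; a hard link
         * keeps the chain going regardless. */
        if (out[i] < 0 && !chain[i].hard)
            severed = true;
    }
}
```

In this model a timeout that completes with -ETIME no longer cancels the request linked after it, provided the link is hard.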

05 Dec, 2019 6 commits

io_uring: fix a typo in a comment · 0b4295b5

Committed by LimingWu on Dec 05, 2019

thatn -> than.
Signed-off-by: Liming Wu <19092205@suning.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0b4295b5

io_uring: hook all linked requests via link_list · 4493233e

Committed by Pavel Begunkov on Dec 05, 2019

Links are created by chaining requests through req->list with an
exception that head uses req->link_list. (e.g. link_list->list->list)
Because of that, io_req_link_next() needs complex splicing to advance.

Link them all through link_list. Also, it seems to be simpler and more
consistent IMHO.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4493233e

io_uring: fix error handling in io_queue_link_head · 2e6e1fde

Committed by Pavel Begunkov on Dec 05, 2019

In case of an error io_submit_sqe() drops a request and continues
without it, even if the request was a part of a link. Not only does it
fail to cancel links, it may also execute the wrong sequence of actions.

Stop consuming sqes, and let the user handle errors.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2e6e1fde

io_uring: use hash table for poll command lookups · 78076bb6

Committed by Jens Axboe on Dec 04, 2019

We recently changed this from a single list to an rbtree, but for some
real life workloads, the rbtree slows down the submission/insertion
case enough so that it's the top cycle consumer on the io_uring side.
In testing, using a hash table is a more well rounded compromise. It
is fast for insertion, and as long as it's sized appropriately, it
works well for the cancellation case as well. Running TAO with a lot
of network sockets, this removes io_poll_req_insert() from spending
2% of the CPU cycles.
Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

78076bb6
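
The trade-off described above (cheap insertion, acceptable lookup for cancellation when the table is sized well) can be sketched with a chained hash table keyed by user_data. All names, the bucket count, and the hash function are assumptions for illustration; the kernel uses its own hlist and hashing helpers.

```c
#include <stddef.h>
#include <stdint.h>

#define HASH_BITS 4
#define HASH_SIZE (1u << HASH_BITS)

struct poll_req {
    uint64_t user_data;
    struct poll_req *next;
};

struct poll_table {
    struct poll_req *buckets[HASH_SIZE];
};

/* Multiplicative hash: keep the top HASH_BITS bits of the product. */
static unsigned bucket_of(uint64_t key)
{
    return (unsigned)((key * 0x9E3779B97F4A7C15ull) >> (64 - HASH_BITS));
}

/* O(1) insertion at the head of the bucket's chain. */
static void poll_insert(struct poll_table *t, struct poll_req *req)
{
    unsigned b = bucket_of(req->user_data);

    req->next = t->buckets[b];
    t->buckets[b] = req;
}

/* Cancellation path: walk one bucket only, unlink and return the match. */
static struct poll_req *poll_remove(struct poll_table *t, uint64_t user_data)
{
    struct poll_req **pp = &t->buckets[bucket_of(user_data)];

    for (; *pp; pp = &(*pp)->next) {
        if ((*pp)->user_data == user_data) {
            struct poll_req *found = *pp;

            *pp = found->next;
            found->next = NULL;
            return found;
        }
    }
    return NULL;
}
```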

io_uring: ensure deferred timeouts copy necessary data · 2d28390a

Committed by Jens Axboe on Dec 04, 2019

If we defer a timeout, we should ensure that we copy the timespec
when we have consumed the sqe. This is similar to commit f67676d1
for read/write requests. We already did this correctly for timeouts
deferred as links, but do it generally and use the infrastructure added
by commit 1a6b74fc instead of having the timeout deferral use its
own.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2d28390a

io_uring: allow IO_SQE_* flags on IORING_OP_TIMEOUT · 901e59bb

Committed by Jens Axboe on Dec 04, 2019

There's really no reason why we forbid things like link/drain etc on
regular timeout commands. Enable the usual SQE flags on timeouts.
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

901e59bb

04 Dec, 2019 1 commit

io_uring: handle connect -EINPROGRESS like -EAGAIN · 87f80d62

Committed by Jens Axboe on Dec 03, 2019

Right now we return it to userspace, which means the application has
to poll for the socket to be writeable. Let's just treat it like
-EAGAIN and have io_uring handle it internally, this makes it much
easier to use.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

87f80d62
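
The decision this commit describes can be sketched as a single predicate. The name should_retry_async() is hypothetical; the sketch only shows -EINPROGRESS being grouped with -EAGAIN on a non-blocking attempt, so the request is retried internally rather than completed with that code.

```c
#include <errno.h>
#include <stdbool.h>

/* True if a non-blocking connect attempt should be handed to the
 * internal retry path instead of being reported to userspace. */
static bool should_retry_async(int ret, bool force_nonblock)
{
    return force_nonblock && (ret == -EAGAIN || ret == -EINPROGRESS);
}
```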

03 Dec, 2019 7 commits

io_uring: remove parameter ctx of io_submit_state_start · 22efde59

Committed by Jackie Liu on Dec 02, 2019

The ctx parameter has never been used, so clean it up.
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

22efde59

io_uring: mark us with IORING_FEAT_SUBMIT_STABLE · da8c9690

Committed by Jens Axboe on Dec 02, 2019

If this flag is set, applications can be certain that any data for
async offload has been consumed when the kernel has consumed the
SQE.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

da8c9690

io_uring: ensure async punted connect requests copy data · f499a021

Committed by Jens Axboe on Dec 02, 2019

Just like commit f67676d1 for read/write requests, this one ensures
that the sockaddr data has been copied for IORING_OP_CONNECT if we need
to punt the request to async context.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f499a021

io_uring: ensure async punted sendmsg/recvmsg requests copy data · 03b1230c

Committed by Jens Axboe on Dec 02, 2019

Just like commit f67676d1 for read/write requests, this one ensures
that the msghdr data is fully copied if we need to punt a recvmsg or
sendmsg system call to async context.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

03b1230c

io_uring: ensure async punted read/write requests copy iovec · f67676d1

Committed by Jens Axboe on Dec 02, 2019

Currently we don't copy the iovecs when we punt to async context. This
can be problematic for applications that store the iovec on the stack,
as they often assume that it's safe to let the iovec go out of scope
as soon as IO submission has been called. This isn't always safe, as we
will re-copy the iovec once we're in async context.

Make this 100% safe by copying the iovec just once. With this change,
applications may safely store the iovec on the stack for all cases.
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f67676d1
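
The "copy once at submission" idea can be sketched in userspace as below. iovec_dup() is a hypothetical helper, not kernel code: it copies the iovec array (the base/length descriptors, not the data buffers) so that the caller's array may go out of scope before the deferred work runs.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

/* Duplicate an array of nr iovec descriptors. Returns NULL on
 * allocation failure; the caller frees the result with free(). */
static struct iovec *iovec_dup(const struct iovec *iov, size_t nr)
{
    struct iovec *copy = malloc(nr * sizeof(*copy));

    if (!copy)
        return NULL;
    memcpy(copy, iov, nr * sizeof(*copy));
    return copy;
}
```

Only the descriptors are copied; the buffers they point to must still stay valid until the I/O completes.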

io_uring: add general async offload context · 1a6b74fc

Committed by Jens Axboe on Dec 02, 2019

Right now we just copy the sqe for async offload, but we want to store
more context across an async punt. In preparation for doing so, put the
sqe copy inside a structure that we can expand. With this pointer added,
we can get rid of REQ_F_FREE_SQE, as that is now indicated by whether
req->io is NULL or not.

No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1a6b74fc

io_uring: transform send/recvmsg() -ERESTARTSYS to -EINTR · 441cdbd5

Committed by Jens Axboe on Dec 02, 2019

We should never return -ERESTARTSYS to userspace, transform it into
-EINTR.

Cc: stable@vger.kernel.org # v5.3+
Signed-off-by: Jens Axboe <axboe@kernel.dk>

441cdbd5
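
The transformation named in this commit can be sketched as a small result filter. ERESTARTSYS is a kernel-internal code that userspace errno.h does not define, so the constant below is declared locally under a hypothetical name; sanitize_result() is likewise hypothetical.

```c
#include <errno.h>

/* Kernel-internal restart code (value 512 in the kernel's own
 * errno header); defined locally because userspace never sees it. */
#define KERNEL_ERESTARTSYS 512

/* Map the internal restart code to -EINTR before the result is
 * posted to userspace; everything else passes through. */
static int sanitize_result(int ret)
{
    return ret == -KERNEL_ERESTARTSYS ? -EINTR : ret;
}
```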

02 Dec, 2019 1 commit

io_uring: use current task creds instead of allocating a new one · 0b8c0ec7

Committed by Jens Axboe on Dec 02, 2019

syzbot reports:

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 9217 Comm: io_uring-sq Not tainted 5.4.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c
24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84
c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  io_sq_thread+0x1c7/0xa20 fs/io_uring.c:3274
  kthread+0x361/0x430 kernel/kthread.c:255
  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
Modules linked in:
---[ end trace f2e1a4307fbe2245 ]---
RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c
24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84
c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

which is caused by slab fault injection triggering a failure in
prepare_creds(). We don't actually need to create a copy of the creds
as we're not modifying it, we just need a reference on the current task
creds. This avoids the failure case as well, and propagates the const
throughout the stack.

Fixes: 181e448d ("io_uring: async workers should inherit the user creds")
Reported-by: syzbot+5320383e16029ba057ff@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0b8c0ec7

30 Nov, 2019 1 commit

io_uring: fix missing kmap() declaration on powerpc · aa4c3967

Committed by Jens Axboe on Nov 29, 2019

Christophe reports that current master fails building on powerpc with
this error:

   CC      fs/io_uring.o
fs/io_uring.c: In function ‘loop_rw_iter’:
fs/io_uring.c:1628:21: error: implicit declaration of function ‘kmap’
[-Werror=implicit-function-declaration]
     iovec.iov_base = kmap(iter->bvec->bv_page)
                      ^
fs/io_uring.c:1628:19: warning: assignment makes pointer from integer
without a cast [-Wint-conversion]
     iovec.iov_base = kmap(iter->bvec->bv_page)
                    ^
fs/io_uring.c:1643:4: error: implicit declaration of function ‘kunmap’
[-Werror=implicit-function-declaration]
     kunmap(iter->bvec->bv_page);
     ^

which is caused by a missing highmem.h include. Fix it by including
it.

Fixes: 311ae9e1 ("io_uring: fix dead-hung for non-iter fixed rw")
Reported-by: Christophe Leroy <christophe.leroy@c-s.fr>
Tested-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

aa4c3967

29 Nov, 2019 1 commit

io_uring: add mapping support for NOMMU archs · 6c5c240e

Committed by Roman Penyaev on Nov 28, 2019

This is a somewhat unusual scenario, but I find it interesting to run fio loads
using LKL linux, where the MMU is disabled. Other real archs that run uClinux
can probably also benefit from this patch.
Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6c5c240e

27 Nov, 2019 1 commit

io_uring: make poll->wait dynamically allocated · e944475e

Committed by Jens Axboe on Nov 26, 2019

In the quest to bring io_kiocb down to 3 cachelines, this one does
the trick. Make the wait_queue_entry for the poll command come out
of kmalloc instead of embedding it in struct io_poll_iocb, as the
latter is the largest member of io_kiocb. Once we trim this down a
bit, we're back at a healthy 192 bytes for struct io_kiocb.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e944475e

openeuler / Kernel: synced successfully nearly 2 years ago