Commits · ad8a48acc23cb13cbf4332ebabb867b1baa81842 · openeuler / Kernel

26 Nov, 2019 4 commits

io_uring: make req->timeout be dynamically allocated · ad8a48ac

Committed by Jens Axboe on Nov 15, 2019

There are a few reasons for this:

- As a prep to improving the linked timeout logic
- io_timeout is the biggest member in the io_kiocb opcode union

This also enables a few cleanups, like unifying the timer setup between
IORING_OP_TIMEOUT and IORING_OP_LINK_TIMEOUT, and not needing multiple
arguments to the link/prep helpers.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

ad8a48ac

io_uring: make io_double_put_req() use normal completion path · 978db57e

Committed by Jens Axboe on Nov 14, 2019

If we don't use the normal completion path, we may skip killing links
that should be errored and freed. Add __io_double_put_req() for use
within the completion path itself; other callers should just use
io_double_put_req().

Signed-off-by: Jens Axboe <axboe@kernel.dk>

978db57e

io_uring: cleanup return values from the queueing functions · 0e0702da

Committed by Jens Axboe on Nov 14, 2019

__io_queue_sqe(), io_queue_sqe(), io_queue_link_head() all return 0/err,
but the caller doesn't care since the errors are handled inline. Clean
these up and just make them void.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

0e0702da

io_uring: io_async_cancel() should pass in 'nxt' request pointer · 95a5bbae

Committed by Jens Axboe on Nov 14, 2019

If we have a linked request, this enables us to pass it back directly
without having to go through async context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

95a5bbae

15 Nov, 2019 1 commit

io_uring: make POLL_ADD/POLL_REMOVE scale better · eac406c6

Committed by Jens Axboe on Nov 14, 2019

One of the obvious use cases for these commands is networking, where
it's not uncommon to have tons of sockets open and polled for. The
current implementation uses a list for insertion and lookup, which works
fine for file based use cases where the count is usually low, but it breaks
down somewhat for a higher number of files / sockets. A test case with
30k sockets being polled for and cancelled takes:

real    0m6.968s
user    0m0.002s
sys     0m6.936s

with the patch it takes:

real    0m0.233s
user    0m0.010s
sys     0m0.176s

If you go to 50k sockets, it gets even more abysmal with the current
code:

real    0m40.602s
user    0m0.010s
sys     0m40.555s

with the patch it takes:

real    0m0.398s
user    0m0.000s
sys     0m0.341s

The change is pretty straightforward: just replace the cancel_list with
a red/black tree.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

eac406c6
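
A rough userspace sketch of the idea in this commit, not the kernel code: pending poll requests are kept in a balanced tree so that cancel-by-key stays logarithmic instead of walking a list. It assumes requests are keyed by `user_data`, and it uses POSIX `tsearch()`/`tfind()`/`tdelete()` (which glibc implements as a red-black tree) in place of the kernel's rbtree. `struct poll_req`, `poll_add()` and `poll_remove()` are hypothetical names.

```c
#include <assert.h>
#include <search.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a pending poll request, keyed by user_data. */
struct poll_req {
    uint64_t user_data;
};

/* Ordering callback for the tree: compare two requests by user_data. */
static int poll_req_cmp(const void *a, const void *b)
{
    const struct poll_req *x = a, *y = b;

    if (x->user_data < y->user_data)
        return -1;
    return x->user_data > y->user_data;
}

/* Insert a request. Returns 0 on success, -1 on allocation failure
 * or if a request with the same key is already in the tree. */
static int poll_add(void **root, struct poll_req *req)
{
    void *node = tsearch(req, root, poll_req_cmp);

    if (!node)
        return -1;
    /* tsearch() returns a pointer to the stored key pointer. */
    return *(struct poll_req **)node == req ? 0 : -1;
}

/* Find a request by user_data and unlink it. Returns it, or NULL. */
static struct poll_req *poll_remove(void **root, uint64_t user_data)
{
    struct poll_req key = { .user_data = user_data };
    void *node = tfind(&key, root, poll_req_cmp);
    struct poll_req *req;

    if (!node)
        return NULL;
    req = *(struct poll_req **)node;
    tdelete(&key, root, poll_req_cmp);
    return req;
}
```

With a list, each cancel has to scan up to every pending request, which is consistent with the super-linear growth in the timings above; a tree lookup avoids that scan.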

14 Nov, 2019 7 commits

io_uring: Fix getting file for non-fd opcodes · a320e9fa

Committed by Pavel Begunkov on Nov 14, 2019

For timeout requests and a bunch of others, io_uring tries to grab a file
with the specified fd, which is usually stdin/fd=0.
Update io_op_needs_file() so that these opcodes don't.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a320e9fa

io_uring: introduce req_need_defer() · 9d858b21

Committed by Bob Liu on Nov 13, 2019

Makes the code easier to read.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9d858b21

io_uring: clean up io_uring_cancel_files() · 2f6d9b9d

Committed by Bob Liu on Nov 13, 2019

We don't use the return value anymore, so drop it. Also drop the
unnecessary double cancel_req value check.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2f6d9b9d

io_uring: ensure registered buffer import returns the IO length · 5e559561

Committed by Jens Axboe on Nov 13, 2019

A test case was reported where two linked reads with registered buffers
always failed the second link. This is because we set the expected value
of a request in req->result, and if we don't get this result, then we
fail the dependent links. For some reason the registered buffer import
returned -ERROR/0, while the normal import returns -ERROR/length. This
broke linked commands with registered buffers.

Fix this by making io_import_fixed() correctly return the mapped length.

Cc: stable@vger.kernel.org # v5.3
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5e559561
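
To make the return-value convention concrete, here is a small sketch under stated assumptions. `struct fixed_buf`, `import_fixed()` and `link_may_continue()` are hypothetical stand-ins, not the kernel's `io_import_fixed()`; the point is only that an import returning 0 on success would record the wrong expected length, so a full-length read would look like a failure to the link logic.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical registered buffer: a fixed region of `len` bytes. */
struct fixed_buf {
    size_t len;
};

/* Import the range [off, off + len) from a registered buffer.
 * Returns the mapped length on success, or -EFAULT if out of range. */
static ssize_t import_fixed(const struct fixed_buf *buf, size_t off, size_t len)
{
    if (off > buf->len || len > buf->len - off)
        return -EFAULT;
    /* Returning 0 here instead of len is the bug class described above. */
    return (ssize_t)len;
}

/* A dependent link is only allowed to run if the request produced
 * exactly the result that was recorded as expected at import time. */
static int link_may_continue(ssize_t expected, ssize_t result)
{
    return result == expected;
}
```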

io_uring: Fix getting file for timeout · 5683e540

Committed by Pavel Begunkov on Nov 14, 2019

For timeout requests, io_uring tries to grab a file with the specified fd,
which is usually stdin/fd=0.
Update io_op_needs_file() so that timeouts don't.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5683e540

io_wq: add get/put_work handlers to io_wq_create() · 7d723065

Committed by Jens Axboe on Nov 12, 2019

For cancellation, we need to ensure that the work item stays valid for
as long as ->cur_work is valid. Right now we can't safely dereference
the work item even under the wqe->lock, because while the ->cur_work
pointer will remain valid, the work could be completing and be freed
in parallel.

Only invoke ->get/put_work() on items we know that the caller queued
themselves. Add IO_WQ_WORK_INTERNAL for io-wq to use, which is needed
when we're queueing a flush item, for instance.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

7d723065

io_uring: check for validity of ->rings in teardown · 15dff286

Committed by Jens Axboe on Nov 13, 2019

Normally the rings are always valid; the exception is if we failed to
allocate the rings at setup time. syzbot reports this:

RSP: 002b:00007ffd6e8aa078 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441229
RDX: 0000000000000002 RSI: 0000000020000140 RDI: 0000000000000d0d
RBP: 00007ffd6e8aa090 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff
R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 8903 Comm: syz-executor410 Not tainted 5.4.0-rc7-next-20191113
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:__read_once_size include/linux/compiler.h:199 [inline]
RIP: 0010:__io_commit_cqring fs/io_uring.c:496 [inline]
RIP: 0010:io_commit_cqring+0x1e1/0xdb0 fs/io_uring.c:592
Code: 03 0f 8e df 09 00 00 48 8b 45 d0 4c 8d a3 c0 00 00 00 4c 89 e2 48 c1
ea 03 44 8b b8 c0 01 00 00 48 b8 00 00 00 00 00 fc ff df <0f> b6 14 02 4c
89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 61
RSP: 0018:ffff88808f51fc08 EFLAGS: 00010006
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff815abe4a
RDX: 0000000000000018 RSI: ffffffff81d168d5 RDI: ffff8880a9166100
RBP: ffff88808f51fc70 R08: 0000000000000004 R09: ffffed1011ea3f7d
R10: ffffed1011ea3f7c R11: 0000000000000003 R12: 00000000000000c0
R13: ffff8880a91661c0 R14: 1ffff1101522cc10 R15: 0000000000000000
FS:  0000000001e7a880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000140 CR3: 000000009a74c000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  io_cqring_overflow_flush+0x6b9/0xa90 fs/io_uring.c:673
  io_ring_ctx_wait_and_kill+0x24f/0x7c0 fs/io_uring.c:4260
  io_uring_create fs/io_uring.c:4600 [inline]
  io_uring_setup+0x1256/0x1cc0 fs/io_uring.c:4626
  __do_sys_io_uring_setup fs/io_uring.c:4639 [inline]
  __se_sys_io_uring_setup fs/io_uring.c:4636 [inline]
  __x64_sys_io_uring_setup+0x54/0x80 fs/io_uring.c:4636
  do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x441229
Code: e8 5c ae 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 0f 83 bb 0a fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffd6e8aa078 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441229
RDX: 0000000000000002 RSI: 0000000020000140 RDI: 0000000000000d0d
RBP: 00007ffd6e8aa090 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff
R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
Modules linked in:
---[ end trace b0f5b127a57f623f ]---
RIP: 0010:__read_once_size include/linux/compiler.h:199 [inline]
RIP: 0010:__io_commit_cqring fs/io_uring.c:496 [inline]
RIP: 0010:io_commit_cqring+0x1e1/0xdb0 fs/io_uring.c:592
Code: 03 0f 8e df 09 00 00 48 8b 45 d0 4c 8d a3 c0 00 00 00 4c 89 e2 48 c1
ea 03 44 8b b8 c0 01 00 00 48 b8 00 00 00 00 00 fc ff df <0f> b6 14 02 4c
89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 61
RSP: 0018:ffff88808f51fc08 EFLAGS: 00010006
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff815abe4a
RDX: 0000000000000018 RSI: ffffffff81d168d5 RDI: ffff8880a9166100
RBP: ffff88808f51fc70 R08: 0000000000000004 R09: ffffed1011ea3f7d
R10: ffffed1011ea3f7c R11: 0000000000000003 R12: 00000000000000c0
R13: ffff8880a91661c0 R14: 1ffff1101522cc10 R15: 0000000000000000
FS:  0000000001e7a880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000140 CR3: 000000009a74c000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

which is exactly the case of failing to allocate the SQ/CQ rings, and
then entering shutdown. Check if the rings are valid before trying to
access them at shutdown time.

Reported-by: syzbot+21147d79607d724bd6f3@syzkaller.appspotmail.com
Fixes: 1d7bb1d5 ("io_uring: add support for backlogged CQ ring")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

15dff286

13 Nov, 2019 1 commit

io_uring: fix potential deadlock in io_poll_wake() · 7c9e7f0f

Committed by Jens Axboe on Nov 12, 2019

We attempt to run the poll completion inline, but we're using trylock to
do so. This avoids a deadlock since we're grabbing the locks in reverse
order at this point: we already hold the poll wq lock and we're trying
to grab the completion lock, while the normal rules are the reverse of
that order.

IO completion for a timeout link will need to grab the completion lock,
but that's not safe from this context. Put the completion under the
completion_lock in io_poll_wake(), and mark the request as entering
the completion with the completion_lock already held.

Fixes: 2665abfd ("io_uring: add support for linked SQE timeouts")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7c9e7f0f

12 Nov, 2019 7 commits

io_uring: use correct "is IO worker" helper · 960e432d

Committed by Jens Axboe on Nov 12, 2019

Since we switched to io-wq, the dependent link optimization for when to
pass back work inline has been broken. Fix this by providing a suitable
io-wq helper for io_uring to use to detect when to do this.

Fixes: 561fb04a ("io_uring: replace workqueue usage with io-wq")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

960e432d

io_uring: make timeout sequence == 0 mean no sequence · 93bd25bb

Committed by Jens Axboe on Nov 11, 2019

Currently we make sequence == 0 be the same as sequence == 1, but that's
not super useful if the intent is really to have a timeout that's just
a pure timeout.

If the user passes in sqe->off == 0, then don't apply any sequence logic
to the request, let it purely be driven by the timeout specified.

Reported-by: 李通洲 <carter.li@eoitek.com>
Reviewed-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

93bd25bb

io_uring: fix -ENOENT issue with linked timer with short timeout · 76a46e06

Committed by Jens Axboe on Nov 10, 2019

If you prep a read (for example) that needs to get punted to async
context with a timer, if the timeout is sufficiently short, the timer
request will get completed with -ENOENT as it could not find the read.

The issue is that we prep and start the timer before we start the read.
Hence the timer can trigger before the read is even started, and the end
result is then that the timer completes with -ENOENT, while the read
starts instead of being cancelled by the timer.

Fix this by splitting the linked timer into two parts:

1) Prep and validate the linked timer
2) Start timer

The read is then started between steps 1 and 2, so we know that the
timer will always have a consistent view of the read request state.

Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

76a46e06

io_uring: don't do flush cancel under inflight_lock · 768134d4

Committed by Jens Axboe on Nov 10, 2019

We can't safely cancel under the inflight lock. If the work hasn't been
started yet, then io_wq_cancel_work() simply marks the work as cancelled
and invokes the work handler. But if the work completion needs to grab
the inflight lock because it's grabbing user files, then we'll deadlock
trying to finish the work as we already hold that lock.

Instead grab a reference to the request, if it isn't already zero. If
it's zero, then we know it's going through completion anyway, and we
can safely ignore it. If it's not zero, then we can drop the lock and
attempt to cancel from there.

This also fixes a missing finish_wait() at the end of
io_uring_cancel_files().

Signed-off-by: Jens Axboe <axboe@kernel.dk>

768134d4
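
The "grab a reference if it isn't already zero" step can be sketched in userspace with C11 atomics. This is an illustration only: the kernel has its own `refcount_inc_not_zero()` helper for this pattern, and `struct req` and `req_get_not_zero()` below are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical request with a reference count. */
struct req {
    atomic_int refs;
};

/* Take a reference only if the count has not already dropped to zero.
 * Returns false if the request is on its way through completion. */
static bool req_get_not_zero(struct req *r)
{
    int old = atomic_load(&r->refs);

    while (old != 0) {
        /* On failure, `old` is refreshed with the current value. */
        if (atomic_compare_exchange_weak(&r->refs, &old, old + 1))
            return true;
    }
    return false;
}
```

A caller that gets `true` can drop the lock and cancel safely, because its own reference keeps the request alive; a caller that gets `false` can skip the request.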

io_uring: flag SQPOLL busy condition to userspace · c1edbf5f

Committed by Jens Axboe on Nov 10, 2019

Now that we have backpressure, for SQPOLL, we have one more condition
that warrants flagging that the application needs to enter the kernel:
we failed to submit IO due to backpressure. Make sure we catch that
and flag it appropriately.

If we run into backpressure issues with the SQPOLL thread, flag it
as such to the application by setting IORING_SQ_NEED_WAKEUP. This will
cause the application to enter the kernel, and that will flush the
backlog and clear the condition.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

c1edbf5f
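
From the application's side, reacting to this flag might look like the sketch below. The two constants mirror the values in the io_uring uapi header and are redefined only so the example builds standalone; `sq_enter_flags()` is a hypothetical helper, and the `_Atomic` flags word stands in for the flags field of the mmap'ed SQ ring.

```c
#include <assert.h>
#include <stdatomic.h>

/* Mirrors include/uapi/linux/io_uring.h; redefined for a standalone build. */
#define IORING_SQ_NEED_WAKEUP   (1U << 0)
#define IORING_ENTER_SQ_WAKEUP  (1U << 1)

/* Hypothetical helper: inspect the shared SQ ring flags and return the
 * flags the application would pass to io_uring_enter(), or 0 if no
 * system call is needed right now. */
static unsigned sq_enter_flags(_Atomic unsigned *sq_flags)
{
    unsigned flags = atomic_load_explicit(sq_flags, memory_order_acquire);

    return (flags & IORING_SQ_NEED_WAKEUP) ? IORING_ENTER_SQ_WAKEUP : 0;
}
```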

io_uring: make ASYNC_CANCEL work with poll and timeout · 47f46768

Committed by Jens Axboe on Nov 09, 2019

It's a little confusing that we have multiple types of command
cancellation opcodes now that we have a generic one. Make the generic
one work with POLL_ADD and TIMEOUT commands as well, that makes for an
easier to use API for the application. The fact that they currently
don't is a bit confusing.

Add a helper that takes care of it, so we can use it from both
IORING_OP_ASYNC_CANCEL and from the linked timeout cancellation.

Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

47f46768

io_uring: provide fallback request for OOM situations · 0ddf92e8

Committed by Jens Axboe on Nov 08, 2019

One thing that really sucks for userspace APIs is if the kernel passes
back -ENOMEM/-EAGAIN for resource shortages. The application really has
no idea of what to do in those cases. Should it try and reap
completions? Probably a good idea. Will it solve the issue? Who knows.

This patch adds a simple fallback mechanism if we fail to allocate
memory for a request. If we fail allocating memory from the slab for a
request, we punt to a pre-allocated request. There's just one of these
per io_ring_ctx, but the important part is that if we ever return -EBUSY to
the application, the application knows that it can wait for events and
make forward progress when events have completed. This is the important
part.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

0ddf92e8
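
A minimal sketch of the fallback idea, assuming one pre-allocated request per context guarded by a busy flag. All names (`struct ctx`, `get_req()`, `put_req()`, the `alloc` hook) are hypothetical, and the real patch may guard the fallback differently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical request and ring context. */
struct req {
    int opcode;
};

struct ctx {
    struct req fallback;        /* single pre-allocated request */
    atomic_bool fallback_busy;  /* true while the fallback is in use */
};

/* Allocate a request. `alloc` is injectable so failure can be simulated.
 * Returns NULL only if allocation failed AND the fallback is already in
 * use; a caller would report -EBUSY in that case. */
static struct req *get_req(struct ctx *ctx, void *(*alloc)(size_t))
{
    struct req *r = alloc(sizeof(*r));

    if (r)
        return r;
    if (!atomic_exchange(&ctx->fallback_busy, true))
        return &ctx->fallback;
    return NULL;
}

/* Release a request, handing the fallback back instead of freeing it. */
static void put_req(struct ctx *ctx, struct req *r)
{
    if (r == &ctx->fallback)
        atomic_store(&ctx->fallback_busy, false);
    else
        free(r);
}

/* Allocator that always fails, to exercise the fallback path. */
static void *failing_alloc(size_t size)
{
    (void)size;
    return NULL;
}
```

Because the fallback is released when its request completes, an application that sees -EBUSY can wait for a completion and then retry, which is the forward-progress guarantee the message describes.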

11 Nov, 2019 6 commits

io_uring: convert accept4() -ERESTARTSYS into -EINTR · 8e3cca12

Committed by Jens Axboe on Nov 09, 2019

If we cancel a pending accept operation with a signal, we get
-ERESTARTSYS returned. Turn that into -EINTR for userspace; we should
not be returning -ERESTARTSYS.

Fixes: 17f2fe35 ("io_uring: add support for IORING_OP_ACCEPT")
Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8e3cca12
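
The conversion itself is small. A sketch, assuming a hypothetical `fixup_accept_result()` helper: ERESTARTSYS is a kernel-internal code (512) that is not part of the userspace `<errno.h>`, so it is defined locally here.

```c
#include <assert.h>
#include <errno.h>

/* Kernel-internal restart code; never meant to reach userspace. */
#ifndef ERESTARTSYS
#define ERESTARTSYS 512
#endif

/* Hypothetical helper: map the internal restart code to -EINTR and
 * pass every other result through unchanged. */
static int fixup_accept_result(int ret)
{
    if (ret == -ERESTARTSYS)
        return -EINTR;
    return ret;
}
```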

io_uring: fix error clear of ->file_table in io_sqe_files_register() · 46568e9b

Committed by Jens Axboe on Nov 10, 2019

syzbot reports that when using failslab and friends, we can get a double
free in io_sqe_files_unregister():

BUG: KASAN: double-free or invalid-free in
io_sqe_files_unregister+0x20b/0x300 fs/io_uring.c:3185

CPU: 1 PID: 8819 Comm: syz-executor452 Not tainted 5.4.0-rc6-next-20191108
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x197/0x210 lib/dump_stack.c:118
  print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
  kasan_report_invalid_free+0x65/0xa0 mm/kasan/report.c:468
  __kasan_slab_free+0x13a/0x150 mm/kasan/common.c:450
  kasan_slab_free+0xe/0x10 mm/kasan/common.c:480
  __cache_free mm/slab.c:3426 [inline]
  kfree+0x10a/0x2c0 mm/slab.c:3757
  io_sqe_files_unregister+0x20b/0x300 fs/io_uring.c:3185
  io_ring_ctx_free fs/io_uring.c:3998 [inline]
  io_ring_ctx_wait_and_kill+0x348/0x700 fs/io_uring.c:4060
  io_uring_release+0x42/0x50 fs/io_uring.c:4068
  __fput+0x2ff/0x890 fs/file_table.c:280
  ____fput+0x16/0x20 fs/file_table.c:313
  task_work_run+0x145/0x1c0 kernel/task_work.c:113
  exit_task_work include/linux/task_work.h:22 [inline]
  do_exit+0x904/0x2e60 kernel/exit.c:817
  do_group_exit+0x135/0x360 kernel/exit.c:921
  __do_sys_exit_group kernel/exit.c:932 [inline]
  __se_sys_exit_group kernel/exit.c:930 [inline]
  __x64_sys_exit_group+0x44/0x50 kernel/exit.c:930
  do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x43f2c8
Code: 31 b8 c5 f7 ff ff 48 8b 5c 24 28 48 8b 6c 24 30 4c 8b 64 24 38 4c 8b
6c 24 40 4c 8b 74 24 48 4c 8b 7c 24 50 48 83 c4 58 c3 66 <0f> 1f 84 00 00
00 00 00 48 8d 35 59 ca 00 00 0f b6 d2 48 89 fb 48
RSP: 002b:00007ffd5b976008 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000043f2c8
RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
RBP: 00000000004bf0a8 R08: 00000000000000e7 R09: ffffffffffffffd0
R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000001
R13: 00000000006d1180 R14: 0000000000000000 R15: 0000000000000000

This happens if we fail allocating the file tables. For that case we do
free the file table correctly, but we forget to set it to NULL. This
means that ring teardown will see it as being non-NULL, and attempt to
free it again.

Fix this by clearing the file_table pointer if we free the table.

Reported-by: syzbot+3254bc44113ae1e331ee@syzkaller.appspotmail.com
Fixes: 65e19f54 ("io_uring: support for larger fixed file sets")
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

46568e9b
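
The bug pattern and its fix can be shown in a few lines of userspace C. This is a simplified sketch, not the kernel function: `struct ctx`, `files_register()` and `files_unregister()` are hypothetical, and the `simulate_failure` flag stands in for a later allocation failing during registration.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical context holding a dynamically allocated file table. */
struct ctx {
    void **file_table;
    unsigned nr_tables;
};

/* Register a table; `simulate_failure` models a later step failing. */
static int files_register(struct ctx *ctx, unsigned nr_tables, int simulate_failure)
{
    ctx->file_table = calloc(nr_tables, sizeof(*ctx->file_table));
    if (!ctx->file_table)
        return -ENOMEM;
    ctx->nr_tables = nr_tables;

    if (simulate_failure) {
        free(ctx->file_table);
        ctx->file_table = NULL;  /* the fix: don't leave a dangling pointer */
        ctx->nr_tables = 0;
        return -ENOMEM;
    }
    return 0;
}

/* Teardown: safe to call after a failed register, since free(NULL)
 * is a no-op and the pointer was cleared on the error path. */
static void files_unregister(struct ctx *ctx)
{
    free(ctx->file_table);
    ctx->file_table = NULL;
    ctx->nr_tables = 0;
}
```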

io_uring: separate the io_free_req and io_free_req_find_next interface · c69f8dbe

Committed by Jackie Liu on Nov 09, 2019

io_free_req is now split from io_free_req_find_next in the same way that
io_put_req is split from io_put_req_find_next. No functional changes.

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c69f8dbe

io_uring: keep io_put_req only responsible for release and put req · ec9c02ad

Committed by Jackie Liu on Nov 08, 2019

We already have io_put_req_find_next to find the next req of the link,
so io_put_req should not be used for that. The two should be
functions of the same level.

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ec9c02ad

io_uring: remove passed in 'ctx' function parameter ctx if possible · a197f664

Committed by Jackie Liu on Nov 08, 2019

Many times, the core of the function is req, and req has already set
req->ctx at initialization time, so there is no need to pass in the
ctx from the caller.

Cleanup, no functional change.

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a197f664

io_uring: reduce/pack size of io_ring_ctx · 206aefde

Committed by Jens Axboe on Nov 07, 2019

With the recent flurry of additions and changes to io_uring, the
layout of io_ring_ctx has become a bit stale. We're right now at
704 bytes in size on my x86-64 build, or 11 cachelines. This
patch does two things:

- We have two completion structs embedded, which we only use for
  quiesce of the ctx (or shutdown) and for sqthread init cases.
  That's 2x32 bytes right there, so let's dynamically allocate them.

- Reorder the struct a bit with an eye on cachelines, use cases,
  and holes.

With this patch, we're down to 512 bytes, or 8 cachelines.

Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

206aefde

08 Nov, 2019 2 commits

io_uring: properly mark async work as bounded vs unbounded · 5f8fd2d3

Committed by Jens Axboe on Nov 07, 2019

Now that io-wq supports separating the two request lifetime types, mark
the following IO as having unbounded runtimes:

- Any read/write to a non-regular file
- Any specific networked IO
- Any poll command

Signed-off-by: Jens Axboe <axboe@kernel.dk>

5f8fd2d3

io-wq: add support for bounded vs unbounded work · c5def4ab

Committed by Jens Axboe on Nov 07, 2019

io_uring supports request types that basically have two different
lifetimes:

1) Bounded completion time. These are requests like disk reads or writes,
   which we know will finish in a finite amount of time.
2) Unbounded completion time. These are generally networked IO, where we
   have no idea how long they will take to complete. Another example is
   POLL commands.

This patch provides support for io-wq to handle these differently, so we
don't starve bounded requests by tying up workers for too long. By default
all work is bounded, unless otherwise specified in the work item.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

c5def4ab
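
One way to picture the separation is per-class worker accounting, so that unbounded work can exhaust only its own worker budget. The sketch below is an assumption about the general shape, not io-wq's implementation; `WORK_UNBOUND`, `struct acct`, `struct wq` and `worker_start()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Two accounting classes; work is bounded unless flagged otherwise. */
enum { ACCT_BOUND, ACCT_UNBOUND, ACCT_NR };

#define WORK_UNBOUND (1U << 0)  /* hypothetical per-work-item flag */

struct acct {
    unsigned nr_workers;   /* workers currently running for this class */
    unsigned max_workers;  /* limit for this class */
};

struct wq {
    struct acct acct[ACCT_NR];
};

/* Pick the accounting class for a work item from its flags. */
static struct acct *work_acct(struct wq *wq, unsigned work_flags)
{
    return &wq->acct[(work_flags & WORK_UNBOUND) ? ACCT_UNBOUND : ACCT_BOUND];
}

/* Try to start a worker for this work item. Each class is limited
 * independently, so one class cannot use up the other's workers. */
static bool worker_start(struct wq *wq, unsigned work_flags)
{
    struct acct *acct = work_acct(wq, work_flags);

    if (acct->nr_workers >= acct->max_workers)
        return false;
    acct->nr_workers++;
    return true;
}
```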

10 Nov, 2019 1 commit

io_uring: add support for backlogged CQ ring · 1d7bb1d5

Committed by Jens Axboe on Nov 06, 2019

Currently we drop completion events if the CQ ring is full. That's fine
for requests with bounded completion times, but it may make it harder or
impossible to use io_uring with networked IO where request completion
times are generally unbounded. Or with POLL, for example, which is also
unbounded.

After this patch, we never overflow the ring, we simply store requests
in a backlog for later flushing. This flushing is done automatically by
the kernel. To prevent the backlog from growing indefinitely, if the
backlog is non-empty, we apply back pressure on IO submissions. Any
attempt to submit new IO with a non-empty backlog will get an -EBUSY
return from the kernel. This is a signal to the application that it has
backlogged CQ events, and that it must reap those before being allowed
to submit more IO.

Note that if we do return -EBUSY, we will have filled whatever
backlogged events into the CQ ring first, if there's room. This means
the application can safely reap events WITHOUT entering the kernel and
waiting for them, they are already available in the CQ ring.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

1d7bb1d5
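
The backlog and back-pressure behaviour described above can be modelled with a toy ring. This is a sketch under simplifying assumptions: completions are plain ints, the backlog is a small fixed array rather than an unbounded list, and `struct cq`, `cq_post()`, `submit_check()` and `cq_reap()` are hypothetical names.

```c
#include <assert.h>
#include <errno.h>

#define CQ_ENTRIES  2   /* toy ring size */
#define BACKLOG_MAX 8   /* sketch-only limit; the real backlog is a list */

struct cq {
    unsigned head, tail;          /* free-running ring indices */
    int ring[CQ_ENTRIES];
    int backlog[BACKLOG_MAX];     /* overflowed completions, FIFO */
    unsigned backlog_len;
};

static unsigned cq_space(const struct cq *cq)
{
    return CQ_ENTRIES - (cq->tail - cq->head);
}

/* Move as many backlogged completions into the ring as will fit. */
static void cq_flush_backlog(struct cq *cq)
{
    unsigned moved = 0, i;

    while (moved < cq->backlog_len && cq_space(cq) > 0)
        cq->ring[cq->tail++ % CQ_ENTRIES] = cq->backlog[moved++];
    for (i = moved; i < cq->backlog_len; i++)
        cq->backlog[i - moved] = cq->backlog[i];
    cq->backlog_len -= moved;
}

/* Post a completion: into the ring if possible, else onto the backlog. */
static int cq_post(struct cq *cq, int res)
{
    if (cq->backlog_len == 0 && cq_space(cq) > 0) {
        cq->ring[cq->tail++ % CQ_ENTRIES] = res;
        return 0;
    }
    if (cq->backlog_len == BACKLOG_MAX)
        return -ENOSPC;  /* limitation of this sketch only */
    cq->backlog[cq->backlog_len++] = res;
    return 0;
}

/* Submission gate: with a non-empty backlog, flush what fits into the
 * ring first, then refuse the submission with -EBUSY. */
static int submit_check(struct cq *cq)
{
    if (cq->backlog_len == 0)
        return 0;
    cq_flush_backlog(cq);
    return -EBUSY;
}

/* Reap one completion from the ring. Returns 1 if one was read. */
static int cq_reap(struct cq *cq, int *res)
{
    if (cq->head == cq->tail)
        return 0;
    *res = cq->ring[cq->head++ % CQ_ENTRIES];
    return 1;
}
```

Note that in this model a refused submission has already moved backlogged events into the ring where there was room, so the application can reap them straight from the ring.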

08 Nov, 2019 4 commits

io_uring: pass in io_kiocb to fill/add CQ handlers · 78e19bbe

Committed by Jens Axboe on Nov 06, 2019

This is in preparation for handling CQ ring overflow a bit smarter. We
should not have any functional changes in this patch. Most of the
changes are fairly straightforward, the only ones that stick out a bit
are the ones that change __io_free_req() to take the reference count
into account. If the request hasn't been submitted yet, we know it's
safe to simply ignore references and free it. But let's clean these up
too, as later patches will depend on the caller doing the right thing if
the completion logging grabs a reference to the request.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

78e19bbe

io_uring: make io_cqring_events() take 'ctx' as argument · 84f97dc2

Committed by Jens Axboe on Nov 06, 2019

The rings can be derived from the ctx, and we need the ctx there for
a future change.

No functional changes in this patch.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

84f97dc2

io_uring: add support for linked SQE timeouts · 2665abfd

Committed by Jens Axboe on Nov 05, 2019

While we have support for generic timeouts, we don't have a way to tie
a timeout to a specific SQE. The generic timeouts simply trigger wakeups
on the CQ ring.

This adds support for IORING_OP_LINK_TIMEOUT. This command is only valid
as a link to a previous command. The timeout specified can be either
relative or absolute, following the same rules as IORING_OP_TIMEOUT. If
the timeout triggers before the dependent command completes, it will
attempt to cancel that command. Likewise, if the dependent command
completes before the timeout triggers, it will cancel the timeout.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

2665abfd

io_uring: abstract out io_async_cancel_one() helper · e977d6d3

Committed by Jens Axboe on Nov 05, 2019

We're going to need this helper in a future patch, so move it out
of io_async_cancel() and into its own separate function.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

e977d6d3

07 Nov, 2019 5 commits

io_uring: use inlined struct sqe_submit · 267bc904

Committed by Pavel Begunkov on Nov 07, 2019

req->submit is always up-to-date, so use it directly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

267bc904

io_uring: Use submit info inlined into req · 50585b9a

Committed by Pavel Begunkov on Nov 07, 2019

A stack-allocated struct sqe_submit is passed down the submission path
along with a request (a.k.a. struct io_kiocb), and is copied into
req->submit for async requests.

As space for it is already allocated, fill req->submit in the first
place instead of using the on-stack one. As a result:

1. req->submit is the only place for sqe_submit and is always valid,
so we don't need to track which one to use.
2. There is no need to copy in the async case.
3. The code is simpler, as sqe_submit isn't carried as an argument all
the way down.
4. Functions take fewer arguments, which potentially improves
spilling.

The downside is that the stack is most probably cached, which is not true
for freshly allocated request memory. Another concern is cache
pollution. However, a request would be touched and fetched along with
req->submit at some point anyway, so it shouldn't be a problem.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

50585b9a

io_uring: allocate io_kiocb upfront · 196be95c

Committed by Pavel Begunkov on Nov 07, 2019

Let io_submit_sqes() allocate the io_kiocb before fetching an sqe.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

196be95c

io_uring: io_queue_link*() right after submit · e5eb6366

Committed by Pavel Begunkov on Nov 06, 2019

After a call to io_submit_sqe(), it's already known whether it needs
to queue a link or not. Do it there, as it's simpler and doesn't keep
an extra variable across the loop.

Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e5eb6366

io_uring: Merge io_submit_sqes and io_ring_submit · ae9428ca

Committed by Pavel Begunkov on Nov 06, 2019

io_submit_sqes() and io_ring_submit() are doing the same stuff with
a little difference. Deduplicate them.

Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ae9428ca

06 Nov, 2019 2 commits

io_uring: kill dead REQ_F_LINK_DONE flag · 3aa5fa03

Committed by Jens Axboe on Nov 05, 2019

We had no more use for this flag after the conversion to io-wq, kill it
off.

Fixes: 561fb04a ("io_uring: replace workqueue usage with io-wq")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3aa5fa03

io_uring: fixup a few spots where link failure isn't flagged · f1f40853

Committed by Jens Axboe on Nov 05, 2019

If a request fails, we need to ensure we set REQ_F_FAIL_LINK on it if
REQ_F_LINK is set. Any failure in the chain should break the chain.

We were missing a few spots where this should be done. It might be nice
to generalize this somewhat at some point, as long as we factor in the
fact that failure looks different for each request type.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

f1f40853
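
The rule in this message (a failing request that is part of a link must be flagged so the chain breaks) can be sketched as a single completion helper. The flag names come from the commit message, but the bit values, `struct req` and `req_complete()` are assumptions made for illustration.

```c
#include <assert.h>

/* Flag names from the commit message; bit values are illustrative. */
#define REQ_F_LINK      (1U << 0)
#define REQ_F_FAIL_LINK (1U << 1)

/* Hypothetical request: flags plus the completion result. */
struct req {
    unsigned flags;
    int result;
};

/* Record the result; a failed request that heads a link is marked so
 * that the rest of the chain is failed rather than run. */
static void req_complete(struct req *req, int ret)
{
    req->result = ret;
    if (ret < 0 && (req->flags & REQ_F_LINK))
        req->flags |= REQ_F_FAIL_LINK;
}
```

As the message notes, what counts as failure differs per request type, so a real generalization would need more than the `ret < 0` test used here.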

openeuler / Kernel synced successfully over 1 year ago
