- 25 July 2022, 40 commits
-
Committed by Dylan Yudaken

recvmsg has semantics that do not make it trivial to extend to multishot. Specifically, it has user pointers and returns data in the original parameter. In order to make this API useful, these would need to be somehow included with the provided buffers. For now, remove multishot for recvmsg as it is not useful.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220704140106.200167-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Dylan Yudaken

In overflow we see a duplicate line in the trace, and in some cases 3 lines (if the initial io_post_aux_cqe fails). Instead, just trace once for each CQE.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-13-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Dylan Yudaken

Make the trace format consistent with io_uring_complete for cflags.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-12-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Dylan Yudaken

Support multishot receive for io_uring. Typical server applications will run a loop where for each recv CQE they requeue another recv/recvmsg. This can be simplified by using the existing multishot functionality combined with io_uring's provided buffers.

The API is to add the IORING_RECV_MULTISHOT flag to the SQE. CQEs will then be posted (with the IORING_CQE_F_MORE flag set) when data is available and is read. Once an error occurs or the socket ends, the multishot will be removed and a completion without IORING_CQE_F_MORE will be posted.

The benefit of this is that the recv is much more performant:

* Subsequent receives are queued up straight away, without requiring the application to finish a processing loop.
* If there is more data in the socket (say the provided buffer size is smaller than the socket buffer), then the data is immediately returned, improving batching.
* Poll is only armed once and reused, saving CPU cycles.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-11-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
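As a rough illustration, a userspace consumer might arm such a receive like the sketch below. This is a minimal sketch, assuming liburing, a connected socket, and a provided-buffer group BGID registered beforehand (all assumptions); the multishot flag goes in the SQE's ioprio field as of this series.

    /* Sketch: arm a multishot recv against a provided-buffer group.
     * BGID and the buffer-ring registration are assumed to exist.
     */
    #include <liburing.h>

    #define BGID 0

    static void arm_multishot_recv(struct io_uring *ring, int sockfd)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        /* NULL/0 buffer: the kernel picks one from the group instead */
        io_uring_prep_recv(sqe, sockfd, NULL, 0, 0);
        sqe->flags |= IOSQE_BUFFER_SELECT;
        sqe->buf_group = BGID;
        sqe->ioprio |= IORING_RECV_MULTISHOT;
        io_uring_submit(ring);
    }

Each completion then carries IORING_CQE_F_MORE until the multishot terminates, at which point the request must be re-armed if more data is expected.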
-
Committed by Dylan Yudaken

Similar to multishot poll, drop multishot accept when a CQE overflow occurs.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-10-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Dylan Yudaken

On overflow, multishot poll can still complete with the IORING_CQE_F_MORE flag set. If in the meantime the user clears a CQE and the poll is cancelled, then the poll will post a CQE without IORING_CQE_F_MORE (and likely result -ECANCELED). However, when processing, the application will encounter the non-overflow CQE first, which indicates that there will be no more events posted. Typical userspace applications would free memory associated with the poll in this case. They will then subsequently receive the earlier CQE which has overflowed, which breaks the contract given by the IORING_CQE_F_MORE flag.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-9-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
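The contract being preserved is that only a CQE without IORING_CQE_F_MORE is terminal. A hedged sketch of the userspace side (my_req_state and the process_event helper are hypothetical names, not part of any API):

    #include <liburing.h>
    #include <stdlib.h>

    struct my_req_state { int id; };   /* hypothetical per-request state */

    static void process_event(struct my_req_state *s, int res, unsigned flags)
    {
        (void)s; (void)res; (void)flags;   /* application-specific */
    }

    static void handle_cqe(struct io_uring_cqe *cqe)
    {
        struct my_req_state *st = io_uring_cqe_get_data(cqe);

        process_event(st, cqe->res, cqe->flags);

        /* Free only on the terminal CQE; an overflowed CQE for this
         * request may otherwise still be in flight. */
        if (!(cqe->flags & IORING_CQE_F_MORE))
            free(st);
    }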
-
Committed by Dylan Yudaken

Some users of io_post_aux_cqe would not want to overflow as-is, but might want to change the flags/result. For example, multishot receive requires in-order CQEs, and so if there is an overflow it would need to stop receiving until the overflow is taken care of.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-8-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Dylan Yudaken

For multishot we want a way to signal the caller that the multishot has ended, but this might not be an error return. For example, sockets return 0 when closed, which should end a multishot recv but still produce a CQE with result 0.

Introduce IOU_STOP_MULTISHOT, which does this and indicates that the return code is stored inside req->cqe.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-7-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
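Together with the definitions added in the next entry below, the issue-path return codes plausibly end up as a small enum along these lines (a sketch; the names follow the commit messages, and the errno-based encodings reflect how mainline io_uring/io_uring.h encodes them):

    #include <linux/errno.h>

    enum {
        IOU_OK                  = 0,             /* complete the request */
        IOU_ISSUE_SKIP_COMPLETE = -EIOCBQUEUED,  /* completion handled elsewhere */
        /* terminate multishot; the real return code sits in req->cqe */
        IOU_STOP_MULTISHOT      = -ECANCELED,
    };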
-
Committed by Dylan Yudaken

The values returned are a bit confusing, where 0 and 1 have implied meanings, so add some definitions for them.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-6-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Dylan Yudaken

Rather than passing an error back to the user with a buffer attached, recycle the buffer immediately.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-5-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Dylan Yudaken

When using BUFFER_SELECT there is no technical requirement that the user actually provides an iov, and this removes one copy_from_user call. So allow iov_len to be 0.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-4-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Dylan Yudaken

Attempt to restore the bgid. This is needed when recycling unused buffers, as the next time around it will want the correct bgid.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Dylan Yudaken

If the user gives 0 for the length, we can set it from the available buffer size.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-2-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov

io_uring recently gained an option to allocate a file index for operations registering fixed files. However, it's utterly unusable with mixed approaches, where for some of the files userspace knows better where to place them, as it may race and users don't have any sane way to pick a slot and just hope it will not be taken.

Let userspace register a range of fixed file slots in which the auto-allocation happens. The use case is splitting the fixed table in two parts, where one of them is used for auto-allocation and the other for slot-specified operations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/66ab0394e436f38437cf7c44676e1920d09687ad.1656154403.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
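A userspace sketch of the split-table use case, assuming liburing's io_uring_register_files_sparse() and io_uring_register_file_alloc_range() wrappers (the 64/32 split is arbitrary):

    #include <liburing.h>

    static int setup_split_table(struct io_uring *ring)
    {
        int ret;

        /* reserve 64 sparse fixed-file slots */
        ret = io_uring_register_files_sparse(ring, 64);
        if (ret)
            return ret;

        /* slots 0..31 stay application-managed; auto-allocation
         * (e.g. for accept-direct) draws only from [32, 64) */
        return io_uring_register_file_alloc_range(ring, 32, 32);
    }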
-
Committed by Jens Axboe

With IORING_OP_MSG_RING, one ring can send a message to another ring. Extend that support to also allow sending a fixed file descriptor to that ring, enabling one ring to pass a registered descriptor to another one.

Arguments are extended to pass in:

sqe->addr3        fixed file slot in source ring
sqe->file_index   fixed file slot in destination ring

IORING_OP_MSG_RING is extended to take a command argument in sqe->addr. If set to zero (or IORING_MSG_DATA), it sends just a message like before. If set to IORING_MSG_SEND_FD, a fixed file descriptor is sent according to the above arguments.

Two common use cases for this are:

1) A server needs to be shut down or restarted; pass file descriptors to another one.
2) The backend is split, and one ring accepts connections while others then get the fd passed and handle the actual connection.

Both of those are classic SCM_RIGHTS use cases, and it's not possible to support them with direct descriptors today.

By default, this will post a CQE to the target ring, similarly to how IORING_MSG_DATA does it. If IORING_MSG_RING_CQE_SKIP is set, no message is posted to the target ring. The issuer is expected to notify the receiver side separately.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
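A raw-SQE sketch following the field layout above; the helper name is hypothetical, and slot-index conventions are assumed to follow the usual fixed-file rules:

    #include <string.h>
    #include <liburing.h>

    static void prep_msg_send_fd(struct io_uring_sqe *sqe, int target_ring_fd,
                                 unsigned src_slot, unsigned dst_slot,
                                 __u64 user_data)
    {
        memset(sqe, 0, sizeof(*sqe));
        sqe->opcode = IORING_OP_MSG_RING;
        sqe->fd = target_ring_fd;        /* fd of the destination ring */
        sqe->addr = IORING_MSG_SEND_FD;  /* command: send a fixed fd */
        sqe->addr3 = src_slot;           /* fixed file slot in source ring */
        sqe->file_index = dst_slot;      /* fixed file slot in destination ring */
        sqe->user_data = user_data;
    }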
-
Committed by Jens Axboe

Put it with the filetable code, which is where it belongs. While doing so, have the helpers take a ctx rather than an io_kiocb. It doesn't make sense to use a request, as it's not an operation on the request itself; it applies to the ring itself.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Gustavo A. R. Silva

There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.16/process/deprecated.html#zero-length-and-one-element-arrays

Link: https://github.com/KSPP/linux/issues/78
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
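For illustration, the transformation in question looks like this (a generic example, not the actual io_uring struct):

    #include <stdlib.h>

    struct old_style {
        int nr;
        int data[1];    /* deprecated one-element array */
    };

    struct new_style {
        int nr;
        int data[];     /* flexible array member */
    };

    static struct new_style *alloc_new_style(int n)
    {
        /* header plus exactly n trailing elements; no over-count
         * from a dummy first element in sizeof(*p) */
        struct new_style *p = malloc(sizeof(*p) + n * sizeof(p->data[0]));

        if (p)
            p->nr = n;
        return p;
    }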
-
Committed by Pavel Begunkov

io_uring_enter() takes ctx->refs, which was previously preventing racing with register quiesce. However, as register now doesn't touch the refs, we can freely kill the extra ctx pinning and rely on the fact that we're holding a file reference preventing the ring from being destroyed.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a11c57ad33a1be53541fce90669c1b79cf4d8940.1656153286.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov

Registered rings are by definition io_uring files, so we don't need to additionally verify them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/425cd64fd885b8e329a46c205ee811987691baaf.1656153286.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov

io_run_task_work() accounts for TIF_NOTIFY_SIGNAL, so there is no need to have a second check in io_run_task_work_sig().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/52ce41a592ad904511697f432141e5690fd4b968.1656153285.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov

Now that both the normal and fallback paths use llist, just keep one node head in struct io_task_work and kill off ->fallback_node.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d04ebde409f7b162fe247b361b4486b193293e46.1656153285.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov

io_fail_links() is called with ->completion_lock held, and for that reason we'd want to keep it as small as we can. Instead of doing __io_req_complete_post() for each linked request under the lock, fail them in a task_work handler under ->uring_lock.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a2f68708b970a21f4e84ddfa7b3abd67a8fffb27.1656153285.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe

We really don't care about this at all in terms of performance. Outside of having it already be marked unlikely(), shove it into a separate __cold function.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Hao Xu

Make io_kbuf_recycle_ring() inline, since it is the fast path of provided buffers.

Signed-off-by: Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220623130126.179232-1-hao.xu@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov

The final poll_refs put in __io_arm_poll_handler() takes quite some cycles. When we're arming from the original task context, task_work won't be run, so in this case we can assume that we won't race with task_works and so not take the initial ownership ref.

One caveat is that after arming a poll we may race with it, so we have to add a bunch of io_poll_get_ownership() calls hidden inside of io_poll_can_finish_inline() whenever we want to complete arming inline. For the same reason we can't just set REQ_F_DOUBLE_POLL in __io_queue_proc() and so need to sync with the first poll entry by taking its wq head lock.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8825315d7f5e182ac1578a031e546f79b1c97d01.1655990418.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov

__io_arm_poll_handler()'s error parsing is a horror: when it fails, it returns 0 and the caller is expected to look at ipt.error, which has already led us to a number of problems before.

When it returns a valid mask, leave it as it is now, i.e. return 1 and store the mask in ipt.result_mask. In case of a failure that can be handled inline, return an error code (negative value), and return 0 if __io_arm_poll_handler() took ownership of the request and will complete it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/018cacdaef5fe95d7dc56b32e85d752cab7607f6.1655990418.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov

The rules for __io_arm_poll_handler()'s result parsing are complicated; as a first step, don't return a mask but pass back a positive return code and fill ipt->result_mask.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/529e29e9f97f2e6e383ccd44234d8b576a83a921.1655990418.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov

Extract a helper function for apoll allocation; it makes the code easier to read.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2f93282b47dd678e805dd0d7097f66968ced495c.1655990418.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov

Remove the events argument from *io_poll_execute(); it's not needed and not used.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/12efd4e15c6a90cf9e5b59807cfcb57852b51dc7.1655990418.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov

We store a req pointer in wqe->private, but also take one bit to mark double poll entries. Replace the macro helpers with inline functions for better type checking, and also name the double flag.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9a61240555c64ac0b7a9b0eb59a9efeb638a35a4.1655990418.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
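A generic sketch of the technique, with typed inline helpers instead of macros (the names here are illustrative, not the kernel's):

    #include <stdint.h>

    /* pointers are at least 2-byte aligned, so bit 0 is free */
    #define WQE_DOUBLE_BIT 1UL

    struct io_kiocb;    /* opaque request type */

    static inline void *wqe_pack(struct io_kiocb *req, int is_double)
    {
        return (void *)((uintptr_t)req | (is_double ? WQE_DOUBLE_BIT : 0));
    }

    static inline struct io_kiocb *wqe_to_req(void *priv)
    {
        return (struct io_kiocb *)((uintptr_t)priv & ~WQE_DOUBLE_BIT);
    }

    static inline int wqe_is_double(void *priv)
    {
        return (uintptr_t)priv & WQE_DOUBLE_BIT;
    }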
-
Committed by Jens Axboe

The io_uring cancelation API is async, like any other API that we expose there. For the case of finding a request to cancel, or not finding one, it is fully sync in that when submission returns, the CQEs for both the cancelation request and the targeted request have been posted to the CQ ring. However, if the targeted work is being executed by io-wq, the API can only start the act of canceling it. This makes it difficult to use in some circumstances, as the caller then has to wait for the CQEs to come in and match on the same cancelation data there.

Provide an IORING_REGISTER_SYNC_CANCEL command for io_uring_register() that does sync cancelations, always. For the io-wq case, it'll wait for the cancelation to come in before returning. The only expected returns from this API are:

0        Request found and canceled fine.
> 0      Requests found and canceled. Only happens if asked to cancel multiple requests, and if the work wasn't in progress.
-ENOENT  Request not found.
-ETIME   A timeout on the operation was requested, but the timeout expired before we could cancel.

and we won't get -EALREADY via this API.

If the timeout value passed in is -1 (tv_sec and tv_nsec), then that means that no timeout is requested. Otherwise, the timespec passed in is the amount of time the sync cancel will wait for a successful cancelation.

Link: https://github.com/axboe/liburing/discussions/608
Signed-off-by: Jens Axboe <axboe@kernel.dk>
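A hedged userspace sketch, assuming liburing grows an io_uring_register_sync_cancel() wrapper and a register struct carrying the match data plus a timespec (both assumptions about the eventual userspace API):

    #include <string.h>
    #include <liburing.h>

    /* Cancel the request matching `user_data`, waiting at most 1s. */
    static int cancel_sync(struct io_uring *ring, __u64 user_data)
    {
        struct io_uring_sync_cancel_reg reg;

        memset(&reg, 0, sizeof(reg));
        reg.addr = user_data;       /* cancelation match data */
        reg.timeout.tv_sec = 1;     /* -1/-1 would mean no timeout */
        reg.timeout.tv_nsec = 0;

        /* 0: canceled; -ENOENT: not found; -ETIME: timed out */
        return io_uring_register_sync_cancel(ring, &reg);
    }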
-
Committed by Jens Axboe

In preparation for not having a request to pass in that carries this state, add a separate cancelation flag that allows the caller to ask for a fixed file for cancelation.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe

We just use the io_kiocb passed in to find the io_uring_task, and we already pass in the ctx via cd->ctx anyway.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Hao Xu

__io_kbuf_recycle() is only called from io_kbuf_recycle(). Kill it and tweak the code so that the legacy pbuf and ring pbuf code becomes clear.

Signed-off-by: Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220622055551.642370-1-hao.xu@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Dylan Yudaken

Trace task_work_run to help provide stats on how often task work is run and what batch sizes are coming through.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-9-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Dylan Yudaken

This is useful for investigating whether task_work is batching.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-8-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Dylan Yudaken

Batching task work up is an important performance optimisation, as task_work_add is expensive. In order to keep the semantics, replace the task_list with a fake node while processing the old list, and then do a cmpxchg at the end to see if there is more work.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-6-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
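A sketch of the scheme, written against the llist-switch helpers introduced in the next entry below (tw_run and the handle callback are illustrative names): while the fake node sits at the head, concurrent adders see a non-empty list and don't re-kick task work; it is only cleared with a cmpxchg once everything queued in the meantime has been drained.

    #include <linux/llist.h>

    static void tw_run(struct llist_head *list,
                       void (*handle)(struct llist_node *node))
    {
        struct llist_node fake = {};
        struct llist_node *node = io_llist_xchg(list, &fake);

        for (;;) {
            /* drain the detached chain; it may end at &fake */
            while (node && node != &fake) {
                struct llist_node *next = node->next;

                handle(node);
                node = next;
            }
            /* if the head is still our fake node, nothing was added
             * meanwhile: clear it and stop */
            if (io_llist_cmpxchg(list, &fake, NULL) == &fake)
                break;
            node = io_llist_xchg(list, &fake);
        }
    }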
-
Committed by Dylan Yudaken

Introduce helpers to atomically switch an llist. Will later move this into common code.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-5-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
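The helpers are plausibly thin wrappers over xchg/cmpxchg on the list head, along these lines (a sketch of the assumed io_llist_xchg/io_llist_cmpxchg shape):

    #include <linux/atomic.h>
    #include <linux/llist.h>

    /* swap in a new first node, returning the old chain */
    static inline struct llist_node *io_llist_xchg(struct llist_head *head,
                                                   struct llist_node *new)
    {
        return xchg(&head->first, new);
    }

    /* replace the first node only if it is still `old` */
    static inline struct llist_node *io_llist_cmpxchg(struct llist_head *head,
                                                      struct llist_node *old,
                                                      struct llist_node *new)
    {
        return cmpxchg(&head->first, old, new);
    }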
-
Committed by Dylan Yudaken

With networking use cases we see contention on the spinlock used to protect the task_list when multiple threads try to add completions at once. Instead we can use a lockless list, and assume that the first caller to add to the list is responsible for kicking off task work.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-4-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
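The handoff relies on llist_add() returning true only when the list was previously empty, so exactly one adder kicks the notification (queue_tw is an illustrative name, not the kernel's):

    #include <linux/llist.h>
    #include <linux/task_work.h>

    static void queue_tw(struct task_struct *task, struct llist_head *list,
                         struct llist_node *node, struct callback_head *work)
    {
        /* true only for the adder that found the list empty */
        if (llist_add(node, list))
            task_work_add(task, work, TWA_SIGNAL);
    }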
-
Committed by Dylan Yudaken

This is no longer needed, as there is only one caller.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-