Commits · c80ca4707d1aa8b6ba2cb8e57a521ebb6f9f22a2 · openeuler / Kernel

12 Apr, 2021 40 commits

io-wq: cancel task_work on exit only targeting the current 'wq' · c80ca470

Committed by Jens Axboe on Apr 01, 2021

By using task_work_cancel(), we're potentially canceling task_work
that isn't related to this specific io_wq. Use the newly added
task_work_cancel_match() to ensure that we only remove and cancel work
items that are specific to this io_wq.

Fixes: 685fe7fe ("io-wq: eliminate the need for a manager thread")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c80ca470

task_work: add helper for more targeted task_work canceling · c7aab1a7

Committed by Jens Axboe on Apr 01, 2021

The only exported helper we have right now is task_work_cancel(), which
cancels any task_work from a given task where func matches the queued
work item. This is a bit too coarse for some use cases. Add a
task_work_cancel_match() that allows targeting individual work items
more specifically, beyond just the callback function used.

task_work_cancel() can be trivially implemented on top of that, hence do
so.

Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c7aab1a7
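
The idea in c7aab1a7 (a cancel helper that takes a match callback, with the function-only cancel built on top of it) can be illustrated with a small userspace sketch. This is not the kernel code: `struct work_item`, `work_cancel_match()`, `work_cancel()`, `struct wq_work` and `match_owner()` are hypothetical stand-ins, and the real helper also has to deal with locking and concurrent queueing, which is omitted here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct callback_head. */
struct work_item {
    struct work_item *next;
    void (*func)(struct work_item *);
};

typedef void (*work_func_t)(struct work_item *);
typedef bool (*match_fn)(struct work_item *item, void *data);

/* Unlink and return the first queued item accepted by match(), or NULL. */
static struct work_item *work_cancel_match(struct work_item **head,
                                           match_fn match, void *data)
{
    struct work_item **pprev = head;
    struct work_item *item;

    while ((item = *pprev) != NULL) {
        if (match(item, data)) {
            *pprev = item->next;   /* splice the item out of the list */
            item->next = NULL;
            return item;
        }
        pprev = &item->next;
    }
    return NULL;
}

/* Coarse matcher: compares only the callback function. */
static bool match_func(struct work_item *item, void *data)
{
    return item->func == *(work_func_t *)data;
}

/* The function-only cancel, implemented on top of the matcher variant. */
static struct work_item *work_cancel(struct work_item **head, work_func_t func)
{
    return work_cancel_match(head, match_func, &func);
}

/* Hypothetical per-queue work: the generic item plus its owning queue. */
struct wq_work {
    struct work_item item;   /* must stay the first member for the cast below */
    void *owner;
};

static void create_worker_cb(struct work_item *item) { (void)item; }

/* Targeted matcher: right callback AND right owning queue. */
static bool match_owner(struct work_item *item, void *data)
{
    if (item->func != create_worker_cb)
        return false;
    return ((struct wq_work *)item)->owner == data;
}
```

With two items queued for different owners, `work_cancel_match(&head, match_owner, owner)` removes only the item belonging to `owner`, whereas `work_cancel(&head, create_worker_cb)` removes the first item with that callback regardless of owner. That difference is the coarseness c80ca470 is concerned with.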

io_uring: fix race around poll update and poll triggering · b2e720ac

Committed by Jens Axboe on Mar 31, 2021

Joakim reports that in some conditions he sees a multishot poll request
being canceled, and that it coincides with getting -EALREADY on
modification. As part of the poll update procedure, there's a small window
where the request is marked as canceled, and if this coincides with the
event actually triggering, then we can get a spurious -ECANCELED and
termination of the multishot request.

Don't mark the poll request as being canceled for update. We also don't
care if we race on removal unless it's a one-shot request; we can safely
update in either case.

Fixes: b69de288 ("io_uring: allow events and user_data update of running poll requests")
Reported-by: Joakim Hassila <joj@mac.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b2e720ac

io_uring: reg buffer overflow checks hardening · 50e96989

Committed by Pavel Begunkov on Mar 24, 2021

We are safe with overflows in io_sqe_buffer_register() because it will
just yield alloc failure, but it's nicer to check explicitly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2b0625551be3d97b80a5fd21c8cd79dc1c91f0b5.1616624589.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

50e96989
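
As a rough illustration of what "checking explicitly" can look like, here is a userspace sketch that computes how many pages a buffer spans and fails early if the address arithmetic wraps. It is not the code from 50e96989: `buf_page_span()` and the `SKETCH_*` constants are hypothetical, and the GCC/Clang builtin `__builtin_add_overflow()` is used where kernel code would typically use its own overflow helpers.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((uint64_t)1 << SKETCH_PAGE_SHIFT)

/*
 * Store in *nr_pages the number of pages touched by [addr, addr + len).
 * Returns 0 on success, or -EOVERFLOW if the arithmetic would wrap.
 */
static int buf_page_span(uint64_t addr, uint64_t len, uint64_t *nr_pages)
{
    uint64_t end;

    /* addr + len must not wrap around the address space */
    if (__builtin_add_overflow(addr, len, &end))
        return -EOVERFLOW;
    /* rounding the end up to the next page boundary must not wrap either */
    if (__builtin_add_overflow(end, SKETCH_PAGE_SIZE - 1, &end))
        return -EOVERFLOW;

    *nr_pages = (end >> SKETCH_PAGE_SHIFT) - (addr >> SKETCH_PAGE_SHIFT);
    return 0;
}
```

Without the explicit checks, a wrapped sum would produce a nonsensical page count that only fails later, for instance at allocation time. The commit message describes that implicit behaviour as safe but less clean.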

io_uring: allow SQPOLL without CAP_SYS_ADMIN or CAP_SYS_NICE · 548d819d

Committed by Jens Axboe on Mar 25, 2021

Now that we have any worker being attached to the original task as
threads, accounting of CPU time is directly attributed to the original
task as well. This means that we no longer have to restrict SQPOLL to
needing elevated privileges, as it's really no different from just having
the task spawn a busy looping thread in userspace.

Reported-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

548d819d

io-wq: eliminate the need for a manager thread · 685fe7fe

Committed by Jens Axboe on Mar 08, 2021

io-wq relies on a manager thread to create/fork new workers, as needed.
But there's really no strong need for it anymore. We have the following
cases that fork a new worker:

1) Work queue. This is done from the task itself always, and it's trivial
to create a worker off that path, if needed.

2) All workers have gone to sleep, and we have more work. This is called
off the sched out path. For this case, use a task_work item to queue
a fork-worker operation.

3) Hashed work completion. Don't think we need to do anything off this
case. If need be, it could just use approach 2 as well.

Part of this change is incrementing the running worker count before the
fork, to avoid cases where we observe we need a worker and then queue
creation of one. Then new work comes in, we fork a new one. That last
queue operation should have waited for the previous worker to come up,
it's quite possible we don't even need it. Hence account the worker as
running before we fork it off, to handle that case more efficiently.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

685fe7fe

kernel: allow fork with TIF_NOTIFY_SIGNAL pending · 66ae0d1e

Committed by Jens Axboe on Mar 22, 2021

fork() fails if signal_pending() is true, but there are two conditions
that can lead to that:

1) An actual signal is pending. We want fork to fail for that one, like
   we always have.

2) TIF_NOTIFY_SIGNAL is pending, because the task has pending task_work.
   We don't need to make it fail for that case.

Allow fork() to proceed if just task_work is pending, by changing the
signal_pending() check to task_sigpending().

Signed-off-by: Jens Axboe <axboe@kernel.dk>

66ae0d1e

io_uring: allow events and user_data update of running poll requests · b69de288

Committed by Jens Axboe on Mar 17, 2021

This adds two new POLL_ADD flags, IORING_POLL_UPDATE_EVENTS and
IORING_POLL_UPDATE_USER_DATA. As with the other POLL_ADD flag, these are
masked into sqe->len. If set, the POLL_ADD will have the following
behavior:

- sqe->addr must contain the user_data of the poll request that
  needs to be modified. This field is otherwise invalid for a POLL_ADD
  command.

- If IORING_POLL_UPDATE_EVENTS is set, sqe->poll_events must contain the
  new mask for the existing poll request. There are no checks for whether
  these are identical or not; if a matching poll request is found, then it
  is re-armed with the new mask.

- If IORING_POLL_UPDATE_USER_DATA is set, sqe->off must contain the new
  user_data for the existing poll request.

A POLL_ADD with any of these flags set may complete with any of the
following results:

1) 0, which means that we successfully found the existing poll request
   specified, and performed the re-arm procedure. Any error from that
   re-arm will be exposed as a completion event for that original poll
   request, not for the update request.
2) -ENOENT, if no existing poll request was found with the given
   user_data.
3) -EALREADY, if the existing poll request was already in the process of
   being removed/canceled/completing.
4) -EACCES, if an attempt was made to modify an internal poll request
   (e.g. not one originally issued as IORING_OP_POLL_ADD).

The usual -EINVAL cases apply as well, if any invalid fields are set
in the sqe for this command type.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

b69de288
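
The field usage and result codes listed in b69de288 can be summarised in a small sketch. Everything named `sketch_*` or `SKETCH_*` is hypothetical: the struct is a cut-down stand-in for the real SQE, and the flag bit values are assumptions for illustration only, so take the real definitions from the `linux/io_uring.h` header of the kernel you target. The sketch reflects only what this commit message describes, and the interface may differ in other kernel versions.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Assumed bit values, for illustration only. */
#define SKETCH_POLL_UPDATE_EVENTS    (1U << 1)
#define SKETCH_POLL_UPDATE_USER_DATA (1U << 2)

/* Cut-down stand-in for the SQE fields the commit message mentions. */
struct sketch_sqe {
    uint64_t off;          /* new user_data, if UPDATE_USER_DATA is set */
    uint64_t addr;         /* user_data of the poll request to modify */
    uint32_t len;          /* flag space for POLL_ADD */
    uint32_t poll_events;  /* new mask, if UPDATE_EVENTS is set */
};

/* Fill an SQE for a poll update as laid out in the commit message. */
static void sketch_prep_poll_update(struct sketch_sqe *sqe,
                                    uint64_t old_user_data,
                                    uint64_t new_user_data,
                                    uint32_t new_events, uint32_t flags)
{
    memset(sqe, 0, sizeof(*sqe));
    sqe->addr = old_user_data;
    sqe->len = flags;
    if (flags & SKETCH_POLL_UPDATE_EVENTS)
        sqe->poll_events = new_events;
    if (flags & SKETCH_POLL_UPDATE_USER_DATA)
        sqe->off = new_user_data;
}

/* Map the completion result of the update request to its meaning. */
static const char *sketch_update_result(int res)
{
    switch (res) {
    case 0:         return "found and re-armed";
    case -ENOENT:   return "no poll request with that user_data";
    case -EALREADY: return "poll request already being removed or completing";
    case -EACCES:   return "internal poll request, not modifiable";
    default:        return "other error";
    }
}
```

Note that a result of 0 only says the update was applied. Per the commit message, an error from the re-arm itself is reported on the original poll request, not on the update request.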

io_uring: abstract out a io_poll_find_helper() · b2cb805f

Committed by Jens Axboe on Mar 17, 2021

We'll need this helper for another purpose, for now just abstract it
out and have io_poll_cancel() use it for lookups.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

b2cb805f

io_uring: terminate multishot poll for CQ ring overflow · 5082620f

Committed by Jens Axboe on Feb 23, 2021

If we hit overflow and fail to allocate an overflow entry for the
completion, terminate the multishot poll mode.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

5082620f

io_uring: abstract out helper for removing poll waitqs/hashes · b2c3f7e1

Committed by Jens Axboe on Feb 23, 2021

No functional changes in this patch, just preparation for killing
multishot poll on CQ overflow.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

b2c3f7e1

io_uring: add multishot mode for IORING_OP_POLL_ADD · 88e41cf9

Committed by Jens Axboe on Feb 22, 2021

The default io_uring poll mode is one-shot, where once the event triggers,
the poll command is completed and won't trigger any further events. If
we're doing repeated polling on the same file or socket, then it can be
more efficient to do multishot, where we keep triggering whenever the
event becomes true.

This deviates from the usual norm of having one CQE per SQE submitted. Add
a CQE flag, IORING_CQE_F_MORE, which tells the application to expect
further completion events from the submitted SQE. Right now the only user
of this is POLL_ADD in multishot mode.

Since sqe->poll_events is using the space that we normally use for adding
flags to commands, use sqe->len for the flag space for POLL_ADD. Multishot
mode is selected by setting IORING_POLL_ADD_MULTI in sqe->len. An
application should expect more CQEs for the specified SQE if the CQE is
flagged with IORING_CQE_F_MORE. In multishot mode, only cancelation or an
error will terminate the poll request, in which case the flag will be
cleared.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

88e41cf9
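
From the application's side, the multishot contract in 88e41cf9 boils down to checking one flag on every CQE. The sketch below is a self-contained model and not liburing code: `struct sketch_cqe` and `count_poll_events()` are hypothetical, and the bit value of `SKETCH_CQE_F_MORE` is an assumption, so use `IORING_CQE_F_MORE` from the real uapi header in actual code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit value, for illustration only. */
#define SKETCH_CQE_F_MORE (1U << 1)

/* Cut-down stand-in for a completion queue entry. */
struct sketch_cqe {
    uint64_t user_data;
    int32_t  res;
    uint32_t flags;
};

/*
 * Walk a batch of CQEs and count the successful poll events reported for
 * the request tagged with user_data. Stops at the first CQE for that
 * request without the MORE flag and sets *terminated, meaning the
 * application must submit a new POLL_ADD if it wants further events.
 */
static unsigned count_poll_events(const struct sketch_cqe *cqes, unsigned n,
                                  uint64_t user_data, bool *terminated)
{
    unsigned events = 0;

    *terminated = false;
    for (unsigned i = 0; i < n; i++) {
        if (cqes[i].user_data != user_data)
            continue;                 /* CQE belongs to another request */
        if (cqes[i].res >= 0)
            events++;                 /* res carries the triggered mask */
        if (!(cqes[i].flags & SKETCH_CQE_F_MORE)) {
            *terminated = true;       /* no further CQEs will follow */
            break;
        }
    }
    return events;
}
```

A one-shot poll behaves like the degenerate case here: its single CQE arrives without the MORE flag, so `*terminated` is set right away.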

io_uring: include cflags in completion trace event · 7471e1af

Committed by Jens Axboe on Feb 22, 2021

We should be including the completion flags for better introspection on
exactly what completion event was logged.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

7471e1af

io_uring: allocate memory for overflowed CQEs · 6c2450ae

Committed by Pavel Begunkov on Feb 23, 2021

Instead of using a request itself for overflowed CQE stashing, allocate a
separate entry. The disadvantage is that the allocation may fail and it
will be accounted as lost (see rings->cq_overflow), so we lose reliability
in case of memory pressure if the application is driving the CQ ring into
overflow. However, it opens a way for multiple CQEs per SQE and
even generating SQE-less CQEs.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: use GFP_ATOMIC | __GFP_ACCOUNT]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6c2450ae

io_uring: mask in error/nval/hangup consistently for poll · 464dca61

Committed by Jens Axboe on Mar 19, 2021

Instead of masking these in as part of regular POLL_ADD prep, do it in
io_init_poll_iocb(), and include NVAL as that's generally unmaskable,
and RDHUP alongside the HUP that is already set.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

464dca61

io_uring: optimise rw complete error handling · 9532b99b

Committed by Pavel Begunkov on Mar 22, 2021

Expect read/write to succeed and create a hot path for this case, in
particular hide all error handling with resubmission under a single
check with the desired result.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9532b99b

io_uring: hide iter revert in resubmit_prep · ab454438

Committed by Pavel Begunkov on Mar 22, 2021

Move iov_iter_revert() resetting iterator in case of -EIOCBQUEUED into
io_resubmit_prep(), so we don't do a heavy revert in the hot path. This
also saves a couple of checks.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ab454438

io_uring: don't alter iopoll reissue fail ret code · 8c130827

Committed by Pavel Begunkov on Mar 22, 2021

When reissue_prep failed in io_complete_rw_iopoll(), we change return
code to -EIO to prevent io_iopoll_complete() from doing resubmission.
Mark requests with a new flag (i.e. REQ_F_DONT_REISSUE) instead and
retain the original return value.

It also removes io_rw_reissue() from io_iopoll_complete() that will be
used later.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8c130827

io_uring: optimise kiocb_end_write for !ISREG · 1c98679d

Committed by Pavel Begunkov on Mar 22, 2021

file_end_write() is only for regular files, so the function does a couple
of dereferences to get the inode and check for it. However, we already
have REQ_F_ISREG at hand, so just use it and inline file_end_write().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1c98679d

io_uring: kill unused REQ_F_NO_FILE_TABLE · 59d70013

Committed by Pavel Begunkov on Mar 22, 2021

current->files is always valid now, even for io-wq threads, so kill the
no longer used REQ_F_NO_FILE_TABLE.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

59d70013

io_uring: don't init req->work fully in advance · e1d675df

Committed by Pavel Begunkov on Mar 22, 2021

req->work is mostly unused unless it's punted, and io_init_req() is too
hot for fully initialising it. Fortunately, we can skip initialising
work.next, as it's controlled by io-wq, and can avoid touching work.flags
by moving everything related into io_prep_async_work(). The only field
left is req->work.creds, but nothing can be done about it, so keep
maintaining it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e1d675df

io-wq: refactor *_get_acct() · 8418f22a

Committed by Pavel Begunkov on Mar 22, 2021

Extract a helper for io_work_get_acct() and io_wqe_get_acct() to avoid
duplication.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8418f22a

io_uring: remove tctx->sqpoll · 05356d86

Committed by Pavel Begunkov on Mar 22, 2021

struct io_uring_task::sqpoll is not used anymore, kill it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

05356d86

io_uring: don't do extra EXITING cancellations · 68207680

Committed by Pavel Begunkov on Mar 22, 2021

io_match_task() matches all requests with PF_EXITING task, even though
those may be valid requests. It was necessary for SQPOLL cancellation,
but now it kills all requests before exiting via
io_uring_cancel_sqpoll(), so it's not needed.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

68207680

io_uring: don't clear REQ_F_LINK_TIMEOUT · d4729fbd

Committed by Pavel Begunkov on Mar 22, 2021

REQ_F_LINK_TIMEOUT is a hint to look for linked timeouts to cancel;
we're leaving it set even when the timeout has already fired. Hence don't
bother clearing it in io_kill_linked_timeout(); it's safe and is called
only once.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d4729fbd

io_uring: optimise io_req_task_work_add() · c15b79de

Committed by Pavel Begunkov on Mar 19, 2021

Inline io_task_work_add() into io_req_task_work_add(). They both work
with a request, so keeping them separate doesn't make things much
clearer, but merging allows optimising it. Apart from small wins like not
reading req->ctx or not calculating @notify in the hot path, i.e. with
tctx->task_state set, it avoids doing wake_up_process() for every single
add, doing it only after task_work_add() has actually been done.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c15b79de

io_uring: abolish old io_put_file() · e1d767f0

Committed by Pavel Begunkov on Mar 19, 2021

io_put_file() doesn't do a good job at generating good code. Inline
it, so we can check REQ_F_FIXED_FILE first, prioritising FIXED_FILE case
over requests without files, and saving a memory load in that case.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e1d767f0

io_uring: optimise io_dismantle_req() fast path · 094bae49

Committed by Pavel Begunkov on Mar 19, 2021

Reshuffle io_dismantle_req() checks to put most of slow path stuff under
a single if.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

094bae49

io_uring: inline io_clean_op()'s fast path · 68fb8979

Committed by Pavel Begunkov on Mar 19, 2021

Inline io_clean_op(), leaving __io_clean_op() but renaming it. This will
be used in following patches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

68fb8979

io_uring: remove __io_req_task_cancel() · 2593553a

Committed by Pavel Begunkov on Mar 19, 2021

Both io_req_complete_failed() and __io_req_task_cancel() do the same
thing: set the failure flag, put both req refs and emit a CQE. The former
is a bit more advanced, as it puts the req back into a req cache, so make
it take over from __io_req_task_cancel() and remove the latter.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2593553a

io_uring: add helper flushing locked_free_list · dac7a098

Committed by Pavel Begunkov on Mar 19, 2021

Add a new helper io_flush_cached_locked_reqs() that splices
locked_free_list to free_list, and does it right, performing all the sync
and invariant reinit.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

dac7a098

io_uring: refactor io_free_req_deferred() · a05432fb

Committed by Pavel Begunkov on Mar 19, 2021

We don't care about ret value in io_free_req_deferred(), make the code a
bit more concise.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a05432fb

io_uring: inline io_put_req and friends · 0d85035a

Committed by Pavel Begunkov on Mar 19, 2021

One big omission is that io_put_req() hasn't been marked inline, and at
least gcc 9 doesn't inline it, not to mention that it's really hot and
an extra function call is intolerable, especially when it doesn't put a
final ref.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0d85035a

io_uring: refactor rsrc refnode allocation · 8dd03afe

Committed by Pavel Begunkov on Mar 19, 2021

There are two problems:
1) we always allocate refnodes in advance and free them if those
haven't been used. It's expensive, takes two allocations, where one of
them is percpu. And it may be pretty common not to actually use them.

2) Current API with allocating a refnode and setting some of the fields
is error prone, we don't ever want to have a file node running fixed
buffer callback...

Solve both with pre-init/get API. Pre-init just leaves the node for
later if not used, and for get (i.e. io_rsrc_refnode_get()), you need to
explicitly pass all arguments setting callbacks/etc., so it's more
resilient.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8dd03afe

io_uring: refactor io_flush_cached_reqs() · dd78f492

Committed by Pavel Begunkov on Mar 19, 2021

Emphasize that the return value of io_flush_cached_reqs() depends on the
number of requests in the cache. It looks nicer and might help tools
avoid false-negative analyses.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

dd78f492

io_uring: optimise success case of __io_queue_sqe · 1840038e

Committed by Pavel Begunkov on Mar 19, 2021

Handle the case of a successfully issued request first, by moving that
check up front. It's not much of a difference, just generates slightly
better code for me.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1840038e

io_uring: inline __io_queue_linked_timeout() · de968c18

Committed by Pavel Begunkov on Mar 19, 2021

Inline __io_queue_linked_timeout(), we don't need it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

de968c18

io_uring: keep io_req_free_batch() call locality · 96670657

Committed by Pavel Begunkov on Mar 19, 2021

Don't do a function call (io_dismantle_req()) in the middle; place it
near the other function calls, as otherwise it may lead to excessive
register spilling.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

96670657

io_uring: optimise tctx node checks/alloc · cf27f3b1

Committed by Pavel Begunkov on Mar 19, 2021

First of all, we need to set tctx->sqpoll only when we add a new entry
into ->xa, so move it from the hot path. Also extract a hot path for
io_uring_add_task_file() as an inline helper.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cf27f3b1

io_uring: optimise io_uring_enter() · 33f993da

Committed by Pavel Begunkov on Mar 19, 2021

Add unlikely annotations, because my compiler pretty much mispredicts
every first check, and apart from jumping around in the fast path, it also
generates extra instructions, such as setting the ret value in advance.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

33f993da
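
For readers unfamiliar with the annotation mentioned in 33f993da, a minimal userspace sketch follows. `sketch_unlikely()` mirrors the common definition on top of the GCC/Clang builtin `__builtin_expect()`, while `sketch_enter()` and `SKETCH_VALID_FLAGS` are hypothetical and are not the real io_uring_enter() checks.

```c
#include <errno.h>

/* Hint to the compiler that the condition is expected to be false. */
#define sketch_unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical set of accepted flag bits. */
#define SKETCH_VALID_FLAGS 0x0fu

/*
 * Argument validation with the error branches marked as cold, so the
 * compiler can lay out the success path as straight-line code.
 */
static int sketch_enter(unsigned int flags, int fd)
{
    if (sketch_unlikely(flags & ~SKETCH_VALID_FLAGS))
        return -EINVAL;
    if (sketch_unlikely(fd < 0))
        return -EBADF;
    return 0;
}
```

The annotation does not change results, only code layout. Whether it helps depends on the compiler and workload, which matches the author's "my compiler" caveat.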

openeuler / Kernel: synced successfully nearly 2 years ago
