提交 · a954587a6e334ff98e4147fd17660bb73f1036d6 · openeuler / Kernel

15 4月, 2021 40 次提交

io_uring: only !null ptr to io_issue_sqe() · a954587a

由 Pavel Begunkov 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit f9bd67f6
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

Pass only non-null @nxt to io_issue_sqe() and handle it at the caller's
side. And propagate it.

- kiocb_done() is only called from io_read() and io_write(), which are
only called from io_issue_sqe(), so it's @nxt != NULL

- io_put_req_find_next() is called either with explicitly non-null local
nxt, or from one of the functions in io_issue_sqe() switch (or their
callees).
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

a954587a

io_uring: simplify io_req_link_next() · d67d819a

由 Pavel Begunkov 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit b18fdf71
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

"if (nxt)" is always true, as it was checked in the while's condition.
io_wq_current_is_worker() is unnecessary, as non-async callers don't
pass nxt, so io_queue_async_work() will be called for them anyway.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

d67d819a

io_uring: pass only !null to io_req_find_next() · bf5b48ee

由 Pavel Begunkov 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit 944e58bf
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

Make io_req_find_next() and io_req_link_next() to accept only non-null
nxt, and handle it in callers.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

bf5b48ee

io_uring: remove io_free_req_find_next() · 04ad9852

由 Pavel Begunkov 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit 70cf9f32
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

There is only one one-liner user of io_free_req_find_next(). Inline it.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

04ad9852

io_uring: add likely/unlikely in io_get_sqring() · 21a4ae0c

由 Pavel Begunkov 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit 9835d6fa
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

The number of SQEs to submit is specified by a user, so io_get_sqring()
in most of the cases succeeds. Hint compilers about that.

Checking ASM genereted by gcc 9.2.0 for x64, there is one branch
misprediction.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

21a4ae0c

io_uring: rename __io_submit_sqe() · 66b6bbd9

由 Pavel Begunkov 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit d732447f
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

__io_submit_sqe() is issuing requests, so call it as
such. Moreover, it ends by calling io_iopoll_req_issued().

Rename it and make terminology clearer.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

66b6bbd9

io_uring: improve trace_io_uring_defer() trace point · 3576dfdf

由 Jens Axboe 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit 915967f6
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

We don't have shadow requests anymore, so get rid of the shadow
argument. Add the user_data argument, as that's often useful to easily
match up requests, instead of having to look at request pointers.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

3576dfdf

io_uring: drain next sqe instead of shadowing · 4251d384

由 Pavel Begunkov 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit 1b4a51b6
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

There's an issue with the shadow drain logic in that we drop the
completion lock after deciding to defer a request, then re-grab it later
and assume that the state is still the same. In the mean time, someone
else completing a request could have found and issued it. This can cause
a stall in the queue, by having a shadow request inserted that nobody is
going to drain.

Additionally, if we fail allocating the shadow request, we simply ignore
the drain.

Instead of using a shadow request, defer the next request/link instead.
This also has the following advantages:

- removes semi-duplicated code
- doesn't allocate memory for shadows
- works better if only the head marked for drain
- doesn't need complex synchronisation

On the flip side, it removes the shadow->seq ==
last_drain_in_in_link->seq optimization. That shouldn't be a common
case, and can always be added back, if needed.

Fixes: 4fe2c963 ("io_uring: add support for link with drain")
Cc: Jackie Liu <liuyun01@kylinos.cn>
Reported-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

4251d384

io_uring: close lookup gap for dependent next work · 32d7d11b

由 Jens Axboe 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit b76da70f
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

When we find new work to process within the work handler, we queue the
linked timeout before we have issued the new work. This can be
problematic for very short timeouts, as we have a window where the new
work isn't visible.

Allow the work handler to store a callback function for this in the work
item, and flag it with IO_WQ_WORK_CB if the caller has done so. If that
is set, then io-wq will call the callback when it has setup the new work
item.
Reported-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

32d7d11b

io_uring: allow finding next link independent of req reference count · 953d69cc

由 Jens Axboe 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit 4d7dd462
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

We currently try and start the next link when we put the request, and
only if we were going to free it. This means that the optimization to
continue executing requests from the same context often fails, as we're
not putting the final reference.

Add REQ_F_LINK_NEXT to keep track of this, and allow io_uring to find the
next request more efficiently.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

953d69cc

io_uring: io_allocate_scq_urings() should return a sane state · 87162e3b

由 Jens Axboe 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit eb065d30
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

We currently rely on the ring destroy on cleaning things up in case of
failure, but io_allocate_scq_urings() can leave things half initialized
if only parts of it fails.

Be nice and return with either everything setup in success, or return an
error with things nicely cleaned up.

Reported-by: syzbot+0d818c0d39399188f393@syzkaller.appspotmail.com
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

87162e3b

io_uring: Always REQ_F_FREE_SQE for allocated sqe · 02ec86a4

由 Pavel Begunkov 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit bbad27b2
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

Always mark requests with allocated sqe and deallocate it in
__io_free_req(). It's easier to follow and doesn't add edge cases.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

02ec86a4

io_uring: io_fail_links() should only consider first linked timeout · ebb11a13

由 Jens Axboe 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit 5d960724
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

We currently clear the linked timeout field if we cancel such a timeout,
but we should only attempt to cancel if it's the first one we see.
Others should simply be freed like other requests, as they haven't
been started yet.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

ebb11a13

io_uring: Fix leaking linked timeouts · 9129f9b8

由 Pavel Begunkov 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit 09fbb0a8
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

let have a dependant link: REQ -> LINK_TIMEOUT -> LINK_TIMEOUT

1. submission stage: submission references for REQ and LINK_TIMEOUT
are dropped. So, references respectively (1,1,2)

2. io_put(REQ) + FAIL_LINKS stage: calls io_fail_links(), which for all
linked timeouts will call cancel_timeout() and drop 1 reference.
So, references after: (0,0,1). That's a leak.

Make it treat only the first linked timeout as such, and pass others
through __io_double_put_req().
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

9129f9b8

io_uring: remove redundant check · f6c7e0cb

由 Pavel Begunkov 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit f70193d6
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

Pass any IORING_OP_LINK_TIMEOUT request further, where it will
eventually fail in io_issue_sqe().
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

f6c7e0cb

io_uring: break links for failed defer · 528d1ef6

由 Pavel Begunkov 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit d3b35796
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

If io_req_defer() failed, it needs to cancel a dependant link.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

528d1ef6

io_uring: request cancellations should break links · deb6b159

由 Jens Axboe 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit fba38c27
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

We currently don't explicitly break links if a request is cancelled, but
we should. Add explicitly link breakage for all types of request
cancellations that we support.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

deb6b159

io_uring: correct poll cancel and linked timeout expiration completion · 90026541

由 Jens Axboe 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit b0dd8a41
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

Currently a poll request fills a completion entry of 0, even if it got
cancelled. This is odd, and it makes it harder to support with chains.
Ensure that it returns -ECANCELED in the completions events if it got
cancelled, and furthermore ensure that the linked timeout that triggered
it completes with -ETIME if we did indeed trigger the completions
through a timeout.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

90026541

io_uring: remove dead REQ_F_SEQ_PREV flag · edbc5fb7

由 Jens Axboe 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit e0e328c4
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

With the conversion to io-wq, we no longer use that flag. Kill it.

Fixes: 561fb04a ("io_uring: replace workqueue usage with io-wq")
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

edbc5fb7

io_uring: fix sequencing issues with linked timeouts · 32c21196

由 Jens Axboe 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit 94ae5e77
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

We have an issue with timeout links that are deeper in the submit chain,
because we only handle it upfront, not from later submissions. Move the
prep + issue of the timeout link to the async work prep handler, and do
it normally for non-async queue. If we validate and prepare the timeout
links upfront when we first see them, there's nothing stopping us from
supporting any sort of nesting.

Fixes: 2665abfd ("io_uring: add support for linked SQE timeouts")
Reported-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

32c21196

io_uring: make req->timeout be dynamically allocated · 818621bc

由 Jens Axboe 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit ad8a48ac
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

There are a few reasons for this:

- As a prep to improving the linked timeout logic
- io_timeout is the biggest member in the io_kiocb opcode union

This also enables a few cleanups, like unifying the timer setup between
IORING_OP_TIMEOUT and IORING_OP_LINK_TIMEOUT, and not needing multiple
arguments to the link/prep helpers.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

818621bc

io_uring: make io_double_put_req() use normal completion path · 584284ca

由 Jens Axboe 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit 978db57e
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

If we don't use the normal completion path, we may skip killing links
that should be errored and freed. Add __io_double_put_req() for use
within the completion path itself, other calls should just use
io_double_put_req().
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

584284ca

io_uring: cleanup return values from the queueing functions · 84914a4c

由 Jens Axboe 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit 0e0702da
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

__io_queue_sqe(), io_queue_sqe(), io_queue_link_head() all return 0/err,
but the caller doesn't care since the errors are handled inline. Clean
these up and just make them void.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

84914a4c

io_uring: io_async_cancel() should pass in 'nxt' request pointer · a6caf01e

由 Jens Axboe 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit 95a5bbae
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

If we have a linked request, this enables us to pass it back directly
without having to go through async context.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

a6caf01e

io_uring: make POLL_ADD/POLL_REMOVE scale better · 8cb840fc

由 Jens Axboe 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit eac406c6
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

One of the obvious use cases for these commands is networking, where
it's not uncommon to have tons of sockets open and polled for. The
current implementation uses a list for insertion and lookup, which works
fine for file based use cases where the count is usually low, it breaks
down somewhat for higher number of files / sockets. A test case with
30k sockets being polled for and cancelled takes:

real    0m6.968s
user    0m0.002s
sys     0m6.936s

with the patch it takes:

real    0m0.233s
user    0m0.010s
sys     0m0.176s

If you go to 50k sockets, it gets even more abysmal with the current
code:

real    0m40.602s
user    0m0.010s
sys     0m40.555s

with the patch it takes:

real    0m0.398s
user    0m0.000s
sys     0m0.341s

Change is pretty straight forward, just replace the cancel_list with
a red/black tree instead.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

8cb840fc

io_uring: Fix getting file for non-fd opcodes · 39de31a1

由 Pavel Begunkov 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit a320e9fa
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

For timeout requests and bunch of others io_uring tries to grab a file
with specified fd, which is usually stdin/fd=0.
Update io_op_needs_file()
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

39de31a1

io_uring: introduce req_need_defer() · f5c255d2

由 Bob Liu 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit 9d858b21
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

Makes the code easier to read.
Signed-off-by: NBob Liu <bob.liu@oracle.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

f5c255d2

io_uring: clean up io_uring_cancel_files() · 6214622b

由 Bob Liu 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit 2f6d9b9d
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

We don't use the return value anymore, drop it. Also drop the
unecessary double cancel_req value check.
Signed-off-by: NBob Liu <bob.liu@oracle.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

6214622b

io_uring: ensure registered buffer import returns the IO length · efec3474

由 Jens Axboe 提交于 4月 15, 2021

mainline inclusion
from mainline-5.4-rc8
commit 5e559561
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

A test case was reported where two linked reads with registered buffers
failed the second link always. This is because we set the expected value
of a request in req->result, and if we don't get this result, then we
fail the dependent links. For some reason the registered buffer import
returned -ERROR/0, while the normal import returns -ERROR/length. This
broke linked commands with registered buffers.

Fix this by making io_import_fixed() correctly return the mapped length.

Cc: stable@vger.kernel.org # v5.3
Reported-by: N李通洲 <carter.li@eoitek.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

efec3474

io_wq: add get/put_work handlers to io_wq_create() · e31d7411

由 Jens Axboe 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit 7d723065
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

For cancellation, we need to ensure that the work item stays valid for
as long as ->cur_work is valid. Right now we can't safely dereference
the work item even under the wqe->lock, because while the ->cur_work
pointer will remain valid, the work could be completing and be freed
in parallel.

Only invoke ->get/put_work() on items we know that the caller queued
themselves. Add IO_WQ_WORK_INTERNAL for io-wq to use, which is needed
when we're queueing a flush item, for instance.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

e31d7411

io_uring: Fix getting file for timeout · a8c7c250

由 Pavel Begunkov 提交于 4月 15, 2021

mainline inclusion
from mainline-5.4-rc8
commit 5683e540
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

For timeout requests io_uring tries to grab a file with specified fd,
which is usually stdin/fd=0.
Update io_op_needs_file()
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

a8c7c250

io_uring: check for validity of ->rings in teardown · f4b7d5af

由 Jens Axboe 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit 15dff286
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

Normally the rings are always valid, the exception is if we failed to
allocate the rings at setup time. syzbot reports this:

RSP: 002b:00007ffd6e8aa078 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441229
RDX: 0000000000000002 RSI: 0000000020000140 RDI: 0000000000000d0d
RBP: 00007ffd6e8aa090 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff
R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 8903 Comm: syz-executor410 Not tainted 5.4.0-rc7-next-20191113
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:__read_once_size include/linux/compiler.h:199 [inline]
RIP: 0010:__io_commit_cqring fs/io_uring.c:496 [inline]
RIP: 0010:io_commit_cqring+0x1e1/0xdb0 fs/io_uring.c:592
Code: 03 0f 8e df 09 00 00 48 8b 45 d0 4c 8d a3 c0 00 00 00 4c 89 e2 48 c1
ea 03 44 8b b8 c0 01 00 00 48 b8 00 00 00 00 00 fc ff df <0f> b6 14 02 4c
89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 61
RSP: 0018:ffff88808f51fc08 EFLAGS: 00010006
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff815abe4a
RDX: 0000000000000018 RSI: ffffffff81d168d5 RDI: ffff8880a9166100
RBP: ffff88808f51fc70 R08: 0000000000000004 R09: ffffed1011ea3f7d
R10: ffffed1011ea3f7c R11: 0000000000000003 R12: 00000000000000c0
R13: ffff8880a91661c0 R14: 1ffff1101522cc10 R15: 0000000000000000
FS:  0000000001e7a880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000140 CR3: 000000009a74c000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  io_cqring_overflow_flush+0x6b9/0xa90 fs/io_uring.c:673
  io_ring_ctx_wait_and_kill+0x24f/0x7c0 fs/io_uring.c:4260
  io_uring_create fs/io_uring.c:4600 [inline]
  io_uring_setup+0x1256/0x1cc0 fs/io_uring.c:4626
  __do_sys_io_uring_setup fs/io_uring.c:4639 [inline]
  __se_sys_io_uring_setup fs/io_uring.c:4636 [inline]
  __x64_sys_io_uring_setup+0x54/0x80 fs/io_uring.c:4636
  do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x441229
Code: e8 5c ae 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 0f 83 bb 0a fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffd6e8aa078 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441229
RDX: 0000000000000002 RSI: 0000000020000140 RDI: 0000000000000d0d
RBP: 00007ffd6e8aa090 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff
R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
Modules linked in:
---[ end trace b0f5b127a57f623f ]---
RIP: 0010:__read_once_size include/linux/compiler.h:199 [inline]
RIP: 0010:__io_commit_cqring fs/io_uring.c:496 [inline]
RIP: 0010:io_commit_cqring+0x1e1/0xdb0 fs/io_uring.c:592
Code: 03 0f 8e df 09 00 00 48 8b 45 d0 4c 8d a3 c0 00 00 00 4c 89 e2 48 c1
ea 03 44 8b b8 c0 01 00 00 48 b8 00 00 00 00 00 fc ff df <0f> b6 14 02 4c
89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 61
RSP: 0018:ffff88808f51fc08 EFLAGS: 00010006
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff815abe4a
RDX: 0000000000000018 RSI: ffffffff81d168d5 RDI: ffff8880a9166100
RBP: ffff88808f51fc70 R08: 0000000000000004 R09: ffffed1011ea3f7d
R10: ffffed1011ea3f7c R11: 0000000000000003 R12: 00000000000000c0
R13: ffff8880a91661c0 R14: 1ffff1101522cc10 R15: 0000000000000000
FS:  0000000001e7a880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000140 CR3: 000000009a74c000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

which is exactly the case of failing to allocate the SQ/CQ rings, and
then entering shutdown. Check if the rings are valid before trying to
access them at shutdown time.

Reported-by: syzbot+21147d79607d724bd6f3@syzkaller.appspotmail.com
Fixes: 1d7bb1d5 ("io_uring: add support for backlogged CQ ring")
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

f4b7d5af

io_uring: fix potential deadlock in io_poll_wake() · 1cbc18e5

由 Jens Axboe 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit 7c9e7f0f
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

We attempt to run the poll completion inline, but we're using trylock to
do so. This avoids a deadlock since we're grabbing the locks in reverse
order at this point, we already hold the poll wq lock and we're trying
to grab the completion lock, while the normal rules are the reverse of
that order.

IO completion for a timeout link will need to grab the completion lock,
but that's not safe from this context. Put the completion under the
completion_lock in io_poll_wake(), and mark the request as entering
the completion with the completion_lock already held.

Fixes: 2665abfd ("io_uring: add support for linked SQE timeouts")
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

1cbc18e5

io_uring: use correct "is IO worker" helper · 8165b072

由 Jens Axboe 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit 960e432d
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

Since we switched to io-wq, the dependent link optimization for when to
pass back work inline has been broken. Fix this by providing a suitable
io-wq helper for io_uring to use to detect when to do this.

Fixes: 561fb04a ("io_uring: replace workqueue usage with io-wq")
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

8165b072

io_uring: make timeout sequence == 0 mean no sequence · 94ce3cdb

由 Jens Axboe 提交于 4月 15, 2021

mainline inclusion
from mainline-5.4-rc8
commit 93bd25bb
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

Currently we make sequence == 0 be the same as sequence == 1, but that's
not super useful if the intent is really to have a timeout that's just
a pure timeout.

If the user passes in sqe->off == 0, then don't apply any sequence logic
to the request, let it purely be driven by the timeout specified.
Reported-by: N李通洲 <carter.li@eoitek.com>
Reviewed-by: N李通洲 <carter.li@eoitek.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

94ce3cdb

io_uring: fix -ENOENT issue with linked timer with short timeout · d92e15aa

由 Jens Axboe 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit 76a46e06
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

If you prep a read (for example) that needs to get punted to async
context with a timer, if the timeout is sufficiently short, the timer
request will get completed with -ENOENT as it could not find the read.

The issue is that we prep and start the timer before we start the read.
Hence the timer can trigger before the read is even started, and the end
result is then that the timer completes with -ENOENT, while the read
starts instead of being cancelled by the timer.

Fix this by splitting the linked timer into two parts:

1) Prep and validate the linked timer
2) Start timer

The read is then started between steps 1 and 2, so we know that the
timer will always have a consistent view of the read request state.
Reported-by: NHrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

d92e15aa

io_uring: don't do flush cancel under inflight_lock · f6c60cf7

由 Jens Axboe 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit 768134d4
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

We can't safely cancel under the inflight lock. If the work hasn't been
started yet, then io_wq_cancel_work() simply marks the work as cancelled
and invokes the work handler. But if the work completion needs to grab
the inflight lock because it's grabbing user files, then we'll deadlock
trying to finish the work as we already hold that lock.

Instead grab a reference to the request, if it isn't already zero. If
it's zero, then we know it's going through completion anyway, and we
can safely ignore it. If it's not zero, then we can drop the lock and
attempt to cancel from there.

This also fixes a missing finish_wait() at the end of
io_uring_cancel_files().
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

f6c60cf7

io_uring: flag SQPOLL busy condition to userspace · 7587136d

由 Jens Axboe 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit c1edbf5f
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

Now that we have backpressure, for SQPOLL, we have one more condition
that warrants flagging that the application needs to enter the kernel:
we failed to submit IO due to backpressure. Make sure we catch that
and flag it appropriately.

If we run into backpressure issues with the SQPOLL thread, flag it
as such to the application by setting IORING_SQ_NEED_WAKEUP. This will
cause the application to enter the kernel, and that will flush the
backlog and clear the condition.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

7587136d

io_uring: make ASYNC_CANCEL work with poll and timeout · 917b0ff3

由 Jens Axboe 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit 47f46768
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

It's a little confusing that we have multiple types of command
cancellation opcodes now that we have a generic one. Make the generic
one work with POLL_ADD and TIMEOUT commands as well, that makes for an
easier to use API for the application. The fact that they currently
don't is a bit confusing.

Add a helper that takes care of it, so we can user it from both
IORING_OP_ASYNC_CANCEL and from the linked timeout cancellation.
Reported-by: NHrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

917b0ff3

io_uring: provide fallback request for OOM situations · c8e7d4db

由 Jens Axboe 提交于 4月 15, 2021

mainline inclusion
from mainline-5.5-rc1
commit 0ddf92e8
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA
---------------------------

One thing that really sucks for userspace APIs is if the kernel passes
back -ENOMEM/-EAGAIN for resource shortages. The application really has
no idea of what to do in those cases. Should it try and reap
completions? Probably a good idea. Will it solve the issue? Who knows.

This patch adds a simple fallback mechanism if we fail to allocate
memory for a request. If we fail allocating memory from the slab for a
request, we punt to a pre-allocated request. There's just one of these
per io_ring_ctx, but the important part is if we ever return -EBUSY to
the application, the applications knows that it can wait for events and
make forward progress when events have completed. This is the important
part.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>

c8e7d4db

openeuler / Kernel 大约 2 年 前同步成功

openeuler / Kernel
大约 2 年前同步成功