提交 · db71020d7ec634e34ea72cd0c295f1b6b7460a63 · openanolis / cloud-kernel

11 2月, 2020 40 次提交

io_uring: allow application controlled CQ ring size · db71020d

由 Jens Axboe 提交于 10月 04, 2019

commit 33a107f0a1b8df0ad925e39d8afc97bb78e0cec1 upstream.

We currently size the CQ ring as twice the SQ ring, to allow some
flexibility in not overflowing the CQ ring. This is done because the
SQE life time is different than that of the IO request itself, the SQE
is consumed as soon as the kernel has seen the entry.

Certain application don't need a huge SQ ring size, since they just
submit IO in batches. But they may have a lot of requests pending, and
hence need a big CQ ring to hold them all. By allowing the application
to control the CQ ring size multiplier, we can cater to those
applications more efficiently.

If an application wants to define its own CQ ring size, it must set
IORING_SETUP_CQSIZE in the setup flags, and fill out
io_uring_params->cq_entries. The value must be a power of two.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

db71020d

io_uring: add support for IORING_REGISTER_FILES_UPDATE · 02017a40

由 Jens Axboe 提交于 10月 03, 2019

commit c3a31e605620c279163c14068a60869ea3fda203 upstream.

Allows the application to remove/replace/add files to/from a file set.
Passes in a struct:

struct io_uring_files_update {
	__u32 offset;
	__s32 *fds;
};

that holds an array of fds, size of array passed in through the usual
nr_args part of the io_uring_register() system call. The logic is as
follows:

1) If ->fds[i] is -1, the existing file at i + ->offset is removed from
   the set.
2) If ->fds[i] is a valid fd, the existing file at i + ->offset is
   replaced with ->fds[i].

For case #2, is the existing file is currently empty (fd == -1), the
new fd is simply added to the array.
Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

02017a40

io_uring: allow sparse fixed file sets · deb05ca0

由 Jens Axboe 提交于 10月 03, 2019

commit 08a451739a9b5783f67de51e84cb6d9559bb9dc4 upstream.

This is in preparation for allowing updates to fixed file sets without
requiring a full unregister+register.
Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

deb05ca0

io_uring: run dependent links inline if possible · 6412618c

由 Jens Axboe 提交于 9月 28, 2019

commit ba816ad61fdf31f59f423a773b00bfa2ed38243a upstream.

Currently any dependent link is executed from a new workqueue context,
which means that we'll be doing a context switch per link in the chain.
If we are running the completion of the current request from our async
workqueue and find that the next request is a link, then run it directly
from the workqueue context instead of forcing another switch.

This improves the performance of linked SQEs, and reduces the CPU
overhead.
Reviewed-by: NJackie Liu <liuyun01@kylinos.cn>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

6412618c

io_uring: don't touch ctx in setup after ring fd install · bb4d087e

由 Jens Axboe 提交于 10月 28, 2019

commit 044c1ab399afbe9f2ebef49a3204ef1509826dc7 upstream.

syzkaller reported an issue where it looks like a malicious app can
trigger a use-after-free of reading the ctx ->sq_array and ->rings
value right after having installed the ring fd in the process file
table.

Defer ring fd installation until after we're done reading those
values.

Fixes: 75b28affdd6a ("io_uring: allocate the two rings together")
Reported-by: syzbot+6f03d895a6cd0d06187f@syzkaller.appspotmail.com
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

bb4d087e

io_uring: Fix leaked shadow_req · 4903035e

由 Pavel Begunkov 提交于 10月 27, 2019

commit 7b20238d28da46f394d37d4d51cc420e1ff9414a upstream.

io_queue_link_head() owns shadow_req after taking it as an argument.
By not freeing it in case of an error, it can leak the request along
with taken ctx->refs.
Reviewed-by: NJackie Liu <liuyun01@kylinos.cn>
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

4903035e

io_uring: fix bad inflight accounting for SETUP_IOPOLL|SETUP_SQTHREAD · c45e560e

由 Jens Axboe 提交于 10月 25, 2019

commit 2b2ed9750fc9d040b9f6d076afcef6f00b6f1f7c upstream.

We currently assume that submissions from the sqthread are successful,
and if IO polling is enabled, we use that value for knowing how many
completions to look for. But if we overflowed the CQ ring or some
requests simply got errored and already completed, they won't be
available for polling.

For the case of IO polling and SQTHREAD usage, look at the pending
poll list. If it ever hits empty then we know that we don't have
anymore pollable requests inflight. For that case, simply reset
the inflight count to zero.
Reported-by: NPavel Begunkov <asml.silence@gmail.com>
Reviewed-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

c45e560e

io_uring: used cached copies of sq->dropped and cq->overflow · 431e9ab1

由 Jens Axboe 提交于 10月 25, 2019

commit 498ccd9eda49117c34e0041563d0da6ac40e52b8 upstream.

We currently use the ring values directly, but that can lead to issues
if the application is malicious and changes these values on our behalf.
Created in-kernel cached versions of them, and just overwrite the user
side when we update them. This is similar to how we treat the sq/cq
ring tail/head updates.
Reported-by: NPavel Begunkov <asml.silence@gmail.com>
Reviewed-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

431e9ab1

io_uring: Fix race for sqes with userspace · 332ec4c6

由 Pavel Begunkov 提交于 10月 25, 2019

commit 935d1e45908afb8853c497f2c2bbbb685dec51dc upstream.

io_ring_submit() finalises with
1. io_commit_sqring(), which releases sqes to the userspace
2. Then calls to io_queue_link_head(), accessing released head's sqe

Reorder them.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

332ec4c6

io_uring: Fix broken links with offloading · 777f45c0

由 Pavel Begunkov 提交于 10月 25, 2019

commit fb5ccc98782f654778cb8d96ba8a998304f9a51f upstream.

io_sq_thread() processes sqes by 8 without considering links. As a
result, links will be randomely subdivided.

The easiest way to fix it is to call io_get_sqring() inside
io_submit_sqes() as do io_ring_submit().

Downsides:
1. This removes optimisation of not grabbing mm_struct for fixed files
2. It submitting all sqes in one go, without finer-grained sheduling
with cq processing.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

777f45c0

io_uring: Fix corrupted user_data · 47be7764

由 Pavel Begunkov 提交于 10月 25, 2019

commit 84d55dc5b9e57b513a702fbc358e1b5489651590 upstream.

There is a bug, where failed linked requests are returned not with
specified @user_data, but with garbage from a kernel stack.

The reason is that io_fail_links() uses req->user_data, which is
uninitialised when called from io_queue_sqe() on fail path.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

47be7764

io_uring: correct timeout req sequence when inserting a new entry · 8280a743

由 zhangyi (F) 提交于 10月 23, 2019

commit a1f58ba46f794b1168d1107befcf3d4b9f9fd453 upstream.

The sequence number of the timeout req (req->sequence) indicate the
expected completion request. Because of each timeout req consume a
sequence number, so the sequence of each timeout req on the timeout
list shouldn't be the same. But now, we may get the same number (also
incorrect) if we insert a new entry before the last one, such as submit
such two timeout reqs on a new ring instance below.

                    req->sequence
 req_1 (count = 2):       2
 req_2 (count = 1):       2

Then, if we submit a nop req, req_2 will still timeout even the nop req
finished. This patch fix this problem by adjust the sequence number of
each reordered reqs when inserting a new entry.
Signed-off-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

8280a743

io_uring : correct timeout req sequence when waiting timeout · 1f0e6877

由 zhangyi (F) 提交于 10月 23, 2019

commit ef03681ae8df770745978148a7fb84796ae99cba upstream.

The sequence number of reqs on the timeout_list before the timeout req
should be adjusted in io_timeout_fn(), because the current timeout req
will consumes a slot in the cq_ring and cq_tail pointer will be
increased, otherwise other timeout reqs may return in advance without
waiting for enough wait_nr.
Signed-off-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

1f0e6877

io_uring: revert "io_uring: optimize submit_and_wait API" · e6c79819

由 Jens Axboe 提交于 10月 22, 2019

commit bc808bced39f4e4b626c5ea8c63d5e41fce7205a upstream.

There are cases where it isn't always safe to block for submission,
even if the caller asked to wait for events as well. Revert the
previous optimization of doing that.

This reverts two commits:

bf7ec93c644cb
c576666863b78

Fixes: c576666863b78 ("io_uring: optimize submit_and_wait API")
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

e6c79819

io_uring: fix logic error in io_timeout · 15c77f44

由 yangerkun 提交于 10月 17, 2019

commit 8b07a65ad30e5612d9590fb50468ff4fa314cfc7 upstream.

If ctx->cached_sq_head < nxt_sq_head, we should add UINT_MAX to tmp, not
tmp_nxt.

Fixes: 5da0fb1ab34c ("io_uring: consider the overflow of sequence for timeout req")
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

15c77f44

io_uring: fix up O_NONBLOCK handling for sockets · d388e8b4

由 Jens Axboe 提交于 10月 17, 2019

commit 491381ce07ca57f68c49c79a8a43da5b60749e32 upstream.

We've got two issues with the non-regular file handling for non-blocking
IO:

1) We don't want to re-do a short read in full for a non-regular file,
   as we can't just read the data again.
2) For non-regular files that don't support non-blocking IO attempts,
   we need to punt to async context even if the file is opened as
   non-blocking. Otherwise the caller always gets -EAGAIN.

Add two new request flags to handle these cases. One is just a cache
of the inode S_ISREG() status, the other tells io_uring that we always
need to punt this request to async context, even if REQ_F_NOWAIT is set.

Cc: stable@vger.kernel.org
Reported-by: NHrvoje Zeba <zeba.hrvoje@gmail.com>
Tested-by: NHrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

d388e8b4

io_uring: consider the overflow of sequence for timeout req · 2ab45c71

由 yangerkun 提交于 10月 15, 2019

commit 5da0fb1ab34ccfe6d49210b4f5a739c59fcbf25e upstream.

Now we recalculate the sequence of timeout with 'req->sequence =
ctx->cached_sq_head + count - 1', judge the right place to insert
for timeout_list by compare the number of request we still expected for
completion. But we have not consider about the situation of overflow:

1. ctx->cached_sq_head + count - 1 may overflow. And a bigger count for
the new timeout req can have a small req->sequence.

2. cached_sq_head of now may overflow compare with before req. And it
will lead the timeout req with small req->sequence.

This overflow will lead to the misorder of timeout_list, which can lead
to the wrong order of the completion of timeout_list. Fix it by reuse
req->submit.sequence to store the count, and change the logic of
inserting sort in io_timeout.
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

2ab45c71

io_uring: fix sequence logic for timeout requests · a241f540

由 Jens Axboe 提交于 10月 10, 2019

commit 7adf4eaf60f3d8c3584bed51fe7066d4dfc2cbe1 upstream.

We have two ways a request can be deferred:

1) It's a regular request that depends on another one
2) It's a timeout that tracks completions

We have a shared helper to determine whether to defer, and that
attempts to make the right decision based on the request. But we
only have some of this information in the caller. Un-share the
two timeout/defer helpers so the caller can use the right one.

Fixes: 5262f567987d ("io_uring: IORING_OP_TIMEOUT support")
Reported-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: NJackie Liu <liuyun01@kylinos.cn>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

a241f540

io_uring: only flush workqueues on fileset removal · 5fa689c8

由 Jens Axboe 提交于 10月 09, 2019

commit 8a99734081775c012a4a6c442fdef0379fe52bdf upstream.

We should not remove the workqueue, we just need to ensure that the
workqueues are synced. The workqueues are torn down on ctx removal.

Cc: stable@vger.kernel.org
Fixes: 6b06314c47e1 ("io_uring: add file set registration")
Reported-by: NStefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

5fa689c8

io_uring: remove wait loop spurious wakeups · aa6eb993

由 Pavel Begunkov 提交于 10月 08, 2019

commit 6805b32ec2b0897eb180295385efe306e5ac3b3d upstream.

Any changes interesting to tasks waiting in io_cqring_wait() are
commited with io_cqring_ev_posted(). However, io_ring_drop_ctx_refs()
also tries to do that but with no reason, that means spurious wakeups
every io_free_req() and io_uring_enter().

Just use percpu_ref_put() instead.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

aa6eb993

io_uring: fix reversed nonblock flag for link submission · 151eede2

由 Pavel Begunkov 提交于 10月 04, 2019

commit bf7ec93c644cb0064ba7d2fc40d4841c5ba382ab upstream.

io_queue_link_head() accepts @force_nonblock flag, but io_ring_submit()
passes something opposite.

Fixes: c576666863b78 ("io_uring: optimize submit_and_wait API")
Reported-by: Nkbuild test robot <lkp@intel.com>
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

151eede2

io_uring: use __kernel_timespec in timeout ABI · 84336694

由 Arnd Bergmann 提交于 10月 01, 2019

commit bdf200731145f07a6127cb16753e2e8fdc159cf4 upstream.

All system calls use struct __kernel_timespec instead of the old struct
timespec, but this one was just added with the old-style ABI. Change it
now to enforce the use of __kernel_timespec, avoiding ABI confusion and
the need for compat handlers on 32-bit architectures.

Any user space caller will have to use __kernel_timespec now, but this
is unambiguous and works for any C library regardless of the time_t
definition. A nicer way to specify the timeout would have been a less
ambiguous 64-bit nanosecond value, but I suppose it's too late now to
change that as this would impact both 32-bit and 64-bit users.

Fixes: 5262f567987d ("io_uring: IORING_OP_TIMEOUT support")
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

84336694

io_uring: make CQ ring wakeups be more efficient · 393e9bf6

由 Jens Axboe 提交于 9月 24, 2019

commit bda521624e75c665c407b3d9cece6e7a28178cd8 upstream.

For batched IO, it's not uncommon for waiters to ask for more than 1
IO to complete before being woken up. This is a problem with
wait_event() since tasks will get woken for every IO that completes,
re-check condition, then go back to sleep. For batch counts on the
order of what you do for high IOPS, that can result in 10s of extra
wakeups for the waiting task.

Add a private wake function that checks for the wake up count criteria
being met before calling autoremove_wake_function(). Pavel reports that
one test case he has runs 40% faster with proper batching of wakeups.
Reported-by: NPavel Begunkov <asml.silence@gmail.com>
Tested-by: NPavel Begunkov <asml.silence@gmail.com>
Reviewed-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

393e9bf6

io_uring: compare cached_cq_tail with cq.head in_io_uring_poll · f88256b5

由 yangerkun 提交于 9月 24, 2019

commit daa5de5415849b9a53056ec1e1e88fe4c5c9aa2b upstream.

After 75b28af("io_uring: allocate the two rings together"), we compare
sq.head with cached_cq_tail to determine does there any cq invalid.
Actually, we should use cq.head.

Fixes: 75b28affdd6a ("io_uring: allocate the two rings together")
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

f88256b5

io_uring: correctly handle non ->{read,write}_iter() file_operations · b8623077

由 Jens Axboe 提交于 9月 23, 2019

commit 32960613b7c3352ddf38c42596e28a16ae36335e upstream.

Currently we just -EINVAL a read or write to an fd that isn't backed
by ->read_iter() or ->write_iter(). But we can handle them just fine,
as long as we punt fo async context first.

Implement a simple loop function for doing ->read() or ->write()
instead, and ensure we call it appropriately.
Reported-by: N李通洲 <carter.li@eoitek.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

b8623077

io_uring: IORING_OP_TIMEOUT support · 81d2a78f

由 Jens Axboe 提交于 9月 17, 2019

commit 5262f567987d3c30052b22e78c35c2313d07b230 upstream.

There's been a few requests for functionality similar to io_getevents()
and epoll_wait(), where the user can specify a timeout for waiting on
events. I deliberately did not add support for this through the system
call initially to avoid overloading the args, but I can see that the use
cases for this are valid.

This adds support for IORING_OP_TIMEOUT. If a user wants to get woken
when waiting for events, simply submit one of these timeout commands
with your wait call (or before). This ensures that the application
sleeping on the CQ ring waiting for events will get woken. The timeout
command is passed in as a pointer to a struct timespec. Timeouts are
relative. The timeout command also includes a way to auto-cancel after
N events has passed.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

81d2a78f

io_uring: use cond_resched() in sqthread · c1a41ef0

由 Jens Axboe 提交于 9月 19, 2019

commit 9831a90ce64362f8429e8fd23838a9db2cdf7803 upstream.

If preempt isn't enabled in the kernel, we can run into hang issues with
sqthread submissions. Use cond_resched() to play nice instead of
cpu_relax(), if we end up starting the loop and not having any events
pending for submissions.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

c1a41ef0

io_uring: fix potential crash issue due to io_get_req failure · f37acde5

由 Jackie Liu 提交于 9月 18, 2019

commit a1041c27b64ce744632147e19701c95fed14fab1 upstream.

Sometimes io_get_req will return a NUL, then we need to do the
correct error handling, otherwise it will cause the kernel null
pointer exception.

Fixes: 4fe2c963154c ("io_uring: add support for link with drain")
Signed-off-by: NJackie Liu <liuyun01@kylinos.cn>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

f37acde5

io_uring: ensure poll commands clear ->sqe · 6a503523

由 Jens Axboe 提交于 9月 18, 2019

commit 6cc47d1d2a9b631f62405f56df651975c7587a97 upstream.

If we end up getting woken in poll (due to a signal), then we may need
to punt the poll request to an async worker. When we do that, we look up
the list to queue at, deferefencing req->submit.sqe, however that is
only set for requests we initially decided to queue async.

This fixes a crash with poll command usage and wakeups that need to punt
to async context.

Fixes: 54a91f3bb9b9 ("io_uring: limit parallelism of buffered writes")
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

6a503523

io_uring: fix use-after-free of shadow_req · ca573b47

由 Jackie Liu 提交于 9月 18, 2019

commit 5f5ad9ced33621d353be6429c3900f8a526fcae8 upstream.

There is a potential dangling pointer problem. we never clean
shadow_req, if there are multiple link lists in this series of
sqes, then the shadow_req will not reallocate, and continue to
use the last one. but in the previous, his memory has been
released, thus forming a dangling pointer. let's clean up him
and make sure that every new link list can reapply for a new
shadow_req.

Fixes: 4fe2c963154c ("io_uring: add support for link with drain")
Signed-off-by: NJackie Liu <liuyun01@kylinos.cn>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

ca573b47

io_uring: use kmemdup instead of kmalloc and memcpy · 7adbf4bd

由 Jackie Liu 提交于 9月 18, 2019

commit 954dab193d19cbbff8f83b58c9360bf00ddb273c upstream.

Just clean up the code, no function changes.
Signed-off-by: NJackie Liu <liuyun01@kylinos.cn>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

7adbf4bd

io_uring: increase IORING_MAX_ENTRIES to 32K · eec2c657

由 Daniel Xu 提交于 9月 14, 2019

commit 5277deaab9f98229bdfb8d1e30019b6c25052708 upstream.

Some workloads can require far more than 4K oustanding entries. For
example memcached can have ~300K sockets over ~40 cores. Bumping the max
to 32K seems to work pretty well.
Reported-by: NDan Melnic <dmm@fb.com>
Signed-off-by: NDaniel Xu <dxu@dxuuu.xyz>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

eec2c657

io_uring: make sqpoll wakeup possible with getevents · 5a349b91

由 Jens Axboe 提交于 9月 12, 2019

commit b2a9eadab85730935f5a6fe19f3f61faaaced601 upstream.

The way the logic is setup in io_uring_enter() means that you can't wake
up the SQ poller thread while at the same time waiting (or polling) for
completions afterwards. There's no reason for that to be the case.
Reported-by: NLewis Baker <lbaker@fb.com>
Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

5a349b91

io_uring: extend async work merging · 4b60edc3

由 Jens Axboe 提交于 9月 11, 2019

commit 6d5d5ac522b20b65167dafe0656b7cad05ec48b3 upstream.

We currently merge async work items if we see a strict sequential hit.
This helps avoid unnecessary workqueue switches when we don't need
them. We can extend this merging to cover cases where it's not a strict
sequential hit, but the IO still fits within the same page. If an
application is doing multiple requests within the same page, we don't
want separate workers waiting on the same page to complete IO. It's much
faster to let the first worker bring in the page, then operate on that
page from the same worker to complete the next request(s).
Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

4b60edc3

io_uring: limit parallelism of buffered writes · b9da29f6

由 Jens Axboe 提交于 9月 10, 2019

commit 54a91f3bb9b96ed86bc12b2f7e06b3fce8e86503 upstream.

All the popular filesystems need to grab the inode lock for buffered
writes. With io_uring punting buffered writes to async context, we
observe a lot of contention with all workers hamming this mutex.

For buffered writes, we generally don't need a lot of parallelism on
the submission side, as the flushing will take care of that for us.
Hence we don't need a deep queue on the write side, as long as we
can safely punt from the original submission context.

Add a workqueue with a limit of 2 that we can use for buffered writes.
This greatly improves the performance and efficiency of higher queue
depth buffered async writes with io_uring.
Reported-by: NAndres Freund <andres@anarazel.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

b9da29f6

io_uring: add io_queue_async_work() helper · 24c5de37

由 Jens Axboe 提交于 9月 10, 2019

commit 18d9be1a970c3704366df902b00871bea88d9f14 upstream.

Add a helper for queueing a request for async execution, in preparation
for optimizing it.

No functional change in this patch.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

24c5de37

io_uring: optimize submit_and_wait API · 691657a2

由 Jens Axboe 提交于 9月 09, 2019

commit c576666863b788c2d7e8ab4ef4edd0e9059cb47b upstream.

For some applications that end up using a submit-and-wait type of
approach for certain batches of IO, we can make that a bit more
efficient by allowing the application to block for the last IO
submission. This prevents an async when we don't need it, as the
application will be blocking for the completion event(s) anyway.

Typical use cases are using the liburing
io_uring_submit_and_wait() API, or just using io_uring_enter()
doing both submissions and completions. As a specific example,
RocksDB doing MultiGet() is sped up quite a bit with this
change.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

691657a2

io_uring: add support for link with drain · b891e566

由 Jackie Liu 提交于 9月 09, 2019

commit 4fe2c963154c31227bec2f2d690e01f9cab383ea upstream.

To support the link with drain, we need to do two parts.

There is an sqes:

    0     1     2     3     4     5     6
 +-----+-----+-----+-----+-----+-----+-----+
 |  N  |  L  |  L  | L+D |  N  |  N  |  N  |
 +-----+-----+-----+-----+-----+-----+-----+

First, we need to ensure that the io before the link is completed,
there is a easy way is set drain flag to the link list's head, so
all subsequent io will be inserted into the defer_list.

	+-----+
    (0) |  N  |
	+-----+
           |          (2)         (3)         (4)
	+-----+     +-----+     +-----+     +-----+
    (1) | L+D | --> |  L  | --> | L+D | --> |  N  |
	+-----+     +-----+     +-----+     +-----+
           |
	+-----+
    (5) |  N  |
	+-----+
           |
	+-----+
    (6) |  N  |
	+-----+

Second, ensure that the following IO will not be completed first,
an easy way is to create a mirror of drain io and insert it into
defer_list, in this way, as long as drain io is not processed, the
following io in the defer_list will not be actively process.

	+-----+
    (0) |  N  |
	+-----+
           |          (2)         (3)         (4)
	+-----+     +-----+     +-----+     +-----+
    (1) | L+D | --> |  L  | --> | L+D | --> |  N  |
	+-----+     +-----+     +-----+     +-----+
           |
	+-----+
   ('3) |  D  |   <== This is a shadow of (3)
	+-----+
           |
	+-----+
    (5) |  N  |
	+-----+
           |
	+-----+
    (6) |  N  |
	+-----+
Signed-off-by: NJackie Liu <liuyun01@kylinos.cn>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

b891e566

io_uring: fix wrong sequence setting logic · ee110e60

由 Jackie Liu 提交于 9月 09, 2019

commit 8776f3fa15a5cd213c4dfab7ddaf557983374ea6 upstream.

Sqo_thread will get sqring in batches, which will cause
ctx->cached_sq_head to be added in batches. if one of these
sqes is set with the DRAIN flag, then he will never get a
chance to process, and finally sqo_thread will not exit.

Fixes: de0617e4671 ("io_uring: add support for marking commands as draining")
Signed-off-by: NJackie Liu <liuyun01@kylinos.cn>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

ee110e60

io_uring: expose single mmap capability · 4c60dc2e

由 Jens Axboe 提交于 9月 06, 2019

commit ac90f249e15cd2a850daa9e36e15f81ce1ff6550 upstream.

After commit 75b28affdd6a we can get by with just a single mmap to
map both the sq and cq ring. However, userspace doesn't know that.

Add a features variable to io_uring_params, and notify userspace
that the kernel has this ability. This can then be used in liburing
(or in applications directly) to avoid the second mmap.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

4c60dc2e

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功