Commits · 36f72fe2792c4304f1203a44a6a7178e49b447f7 · openeuler / Kernel

10 Dec, 2020 28 commits

io_uring: share fixed_file_refs b/w multiple rsrcs · 36f72fe2

Committed by Pavel Begunkov on Nov 18, 2020

Double fixed files for splice/tee are done in a nasty way: it takes 2
ref_node refs, and the second time it blindly overrides
req->fixed_file_refs hoping that it hasn't changed. That works because
all of that is done under uring_lock in a single go, but it is error-prone.

Bind everything explicitly to a single ref_node and take only one ref;
with the current ref_node ordering it's guaranteed to keep all files valid
while the request is in flight.

That's mainly a cleanup + preparation for generic resource handling,
but it also saves a pcpu_ref get/put for splice/tee with 2 fixed files.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: replace inflight_wait with tctx->wait · c98de08c

Committed by Pavel Begunkov on Nov 15, 2020

As tasks now cancel only their own requests, and inflight_wait is awaited
only in io_uring_cancel_files(), which should be called with ->in_idle
set, use tctx->wait instead of keeping a separate inflight_wait.

That will add some spurious wakeups, but it is actually safer from the
point of view of not hanging the task.

e.g.
task1                   | IRQ
                        | *start* io_complete_rw_common(link)
                        |        link: req1 -> req2 -> req3(with files)
*cancel_files()         |
io_wq_cancel(), etc.    |
                        | put_req(link), adds to io-wq req2
schedule()              |

So, task1 will never try to cancel req2 or req3. If req2 is
long-standing (e.g. read(empty_pipe)), this may hang.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: don't take fs for recvmsg/sendmsg · 10cad2c4

Committed by Pavel Begunkov on Nov 07, 2020

We don't even allow non-plain-data msg_control; it is disallowed in
__sys_{send,recv}msg_sock(). So there is no need for fs in
IORING_OP_SENDMSG and IORING_OP_RECVMSG. fs->lock is less contended than
before, but there are cases where it can be, e.g. IOSQE_ASYNC.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: only wake up sq thread while current task is in io worker context · 2e9dbe90

Committed by Xiaoguang Wang on Nov 13, 2020

If IORING_SETUP_SQPOLL is enabled, sqes are handled either in sq thread
task context or in io worker task context. If the current task context is
the sq thread, we don't need to check whether we should wake up the sq
thread.

io_iopoll_req_issued() calls wq_has_sleeper(), which has an smp_mb() memory
barrier. Before this patch, perf shows obvious overhead:
  Samples: 481K of event 'cycles', Event count (approx.): 299807382878
  Overhead  Comma  Shared Object     Symbol
     3.69%  :9630  [kernel.vmlinux]  [k] io_issue_sqe

With this patch, perf shows:
  Samples: 482K of event 'cycles', Event count (approx.): 299929547283
  Overhead  Comma  Shared Object     Symbol
     0.70%  :4015  [kernel.vmlinux]  [k] io_issue_sqe

It shows some obvious improvements.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: don't acquire uring_lock twice · 906a3c6f

Committed by Xiaoguang Wang on Nov 12, 2020

Both IOPOLL and sqes handling need to acquire uring_lock. Combine
them so that we only need to acquire uring_lock once.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: initialize 'timeout' properly in io_sq_thread() · a0d9205f

Committed by Xiaoguang Wang on Nov 12, 2020

A static checker reports the warning below:
    fs/io_uring.c:6939 io_sq_thread()
    error: uninitialized symbol 'timeout'.

This is a false positive, but let's just initialize 'timeout' to make
sure we don't trip over this.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: refactor io_sq_thread() handling · 08369246

Committed by Xiaoguang Wang on Nov 03, 2020

There are some issues with the current io_sq_thread() implementation:
  1. The prepare_to_wait() usage in __io_sq_thread() is weird. If
multiple ctxs share the same poll thread, one ctx will put the poll
thread in TASK_INTERRUPTIBLE, but if other ctxs have work to do, we
don't need to change the task's state at all. I think we can do it only
if none of the ctxs have work to do.
  2. We use a round-robin strategy to make multiple ctxs share the same
poll thread, but there are various conditions in __io_sq_thread(), which
seem complicated and may affect the round-robin strategy.

To address the above issues, I take the following actions:
  1. If multiple ctxs share the same poll thread, we call
prepare_to_wait() and schedule() to put the poll thread to sleep only
if none of the ctxs have work to do.
  2. To make the round-robin strategy more straightforward, I simplify
__io_sq_thread() a bit: it just does io poll and sqes submit work once
and does not check various conditions.
  3. When multiple ctxs share the same poll thread, we choose the biggest
sq_thread_idle among these ctxs as the timeout condition, and update
it when a ctx is added or removed.
  4. There is no need to check EBUSY specially: if io_submit_sqes()
returns EBUSY, IORING_SQ_CQ_OVERFLOW should be set, and the helper in
liburing should be aware of cq overflow and enter the kernel to flush
work.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
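
The two sharing rules described in the commit above (sleep only when no ctx has work, and use the largest sq_thread_idle as the shared timeout) can be modeled in a few lines of userspace C. This is only an illustrative sketch, not the kernel code: `struct ring_ctx`, its `has_work` field, and both function names are hypothetical stand-ins.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a ring context; only the two
 * properties discussed in the commit message are modeled. */
struct ring_ctx {
    unsigned int sq_thread_idle; /* idle timeout requested by this ring */
    bool has_work;               /* does this ring have work pending? */
};

/* The shared poll thread may go to sleep only if none of the
 * contexts attached to it has work to do. */
static bool poll_thread_may_sleep(const struct ring_ctx *ctxs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (ctxs[i].has_work)
            return false;
    return true;
}

/* The shared idle timeout is the largest sq_thread_idle among the
 * attached contexts; returns 0 when no context is attached. */
static unsigned int shared_idle_timeout(const struct ring_ctx *ctxs, size_t n)
{
    unsigned int max_idle = 0;

    for (size_t i = 0; i < n; i++)
        if (ctxs[i].sq_thread_idle > max_idle)
            max_idle = ctxs[i].sq_thread_idle;
    return max_idle;
}
```

In this model the timeout would be recomputed whenever a context is attached to or detached from the poll thread, which mirrors point 3 of the commit message.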

io_uring: always batch cancel in *cancel_files() · f6edbabb

Committed by Pavel Begunkov on Nov 06, 2020

Instead of iterating over each request and cancelling it individually in
io_uring_cancel_files(), try to cancel all matching requests and use
->inflight_list only to check if there is anything left.

In many cases it should be faster, and we can reuse a lot of code from
task cancellation.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: pass files into kill timeouts/poll · 6b81928d

Committed by Pavel Begunkov on Nov 06, 2020

Make io_poll_remove_all() and io_kill_timeouts() match against files
as well. This is a preparation patch; the new matching is effectively
unused for now.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: don't iterate io_uring_cancel_files() · b52fda00

Committed by Pavel Begunkov on Nov 06, 2020

io_uring_cancel_files() guarantees to cancel all matching requests, so
it is not necessary to call it in a loop. Move it up in the callchain
into io_uring_cancel_task_requests().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: cancel only requests of current task · df9923f9

Committed by Pavel Begunkov on Nov 06, 2020

io_uring_cancel_files() cancels all requests that match files regardless
of task. There is no real need for that; cancel only requests of the
specified task. That also handles the SQPOLL case, as it already changes
task to it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add a {task,files} pair matching helper · 08d23634

Committed by Pavel Begunkov on Nov 06, 2020

Add io_match_task() that matches both task and files.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: simplify io_task_match() · 06de5f59

Committed by Pavel Begunkov on Nov 06, 2020

If IORING_SETUP_SQPOLL is set, all requests belong to the corresponding
SQPOLL task, so skip task checking in that case and always match.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: inline io_import_iovec() · 2846c481

Committed by Pavel Begunkov on Nov 07, 2020

Inline io_import_iovec() and leave only its former __io_import_iovec()
renamed to the original name. That makes it more obvious what is reused
in io_read/write().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: remove duplicated io_size from rw · 632546c4

Committed by Pavel Begunkov on Nov 07, 2020

io_size and iov_count in io_read() and io_write() hold the same value,
kill the last one.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fs/io_uring Don't use the return value from import_iovec(). · 10fc72e4

Committed by David Laight on Nov 07, 2020

This is the only code that relies on import_iovec() returning
iter.count on success.
Dropping that reliance allows a better interface to import_iovec().

Signed-off-by: David Laight <david.laight@aculab.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: NULL files dereference by SQPOLL · 1a38ffc9

Committed by Pavel Begunkov on Nov 08, 2020

The SQPOLL task may find sqo_task->files == NULL, and
__io_sq_thread_acquire_files() would leave it unset, so the following
fget_many() and others try to dereference NULL and fault. Propagate
an error if files are missing.

[  118.962785] BUG: kernel NULL pointer dereference, address: 0000000000000020
[  118.963812] #PF: supervisor read access in kernel mode
[  118.964534] #PF: error_code(0x0000) - not-present page
[  118.969029] RIP: 0010:__fget_files+0xb/0x80
[  119.005409] Call Trace:
[  119.005651]  fget_many+0x2b/0x30
[  119.005964]  io_file_get+0xcf/0x180
[  119.006315]  io_submit_sqes+0x3a4/0x950
[  119.007481]  io_sq_thread+0x1de/0x6a0
[  119.007828]  kthread+0x114/0x150
[  119.008963]  ret_from_fork+0x22/0x30

Reported-by: Josef Grieb <josef.grieb@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add timeout support for io_uring_enter() · c73ebb68

Committed by Hao Xu on Nov 03, 2020

Currently, users who want to get woken when waiting for events have to
submit a timeout command first. That is not safe for applications that
split SQ and CQ handling between two threads, such as mysql. Users have
to synchronize the two threads explicitly to protect the SQ, and that
will impact performance.

This patch adds support for a timeout to the existing io_uring_enter().
To avoid overloading arguments, it introduces a new parameter structure
which contains sigmask and timeout.

I have tested the workloads with one thread submitting nop requests
while the other reaps the cqes with a timeout. It is 1.8~2x faster
when the iodepth is 16.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
[axboe: various cleanups/fixes, and name change to SIG_IS_DATA]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: only plug when appropriate · 27926b68

Committed by Jens Axboe on Oct 28, 2020

We unconditionally call blk_start_plug() when starting the IO
submission, but we only really should do that if we have more than 1
request to submit AND we're potentially dealing with block based storage
underneath. For any other type of request, it's just a waste of time to
do so.

Add a ->plug bit to io_op_def and set it for read/write requests. We
could make this more precise and check the file itself as well, but it
doesn't matter that much and would quickly become more expensive.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
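
The plugging condition described in the commit above boils down to a two-part check. The snippet below is a minimal userspace sketch of that decision only; `struct op_def` and `should_plug()` are hypothetical names used for illustration and do not reproduce the kernel's io_op_def table or its submission state.

```c
#include <stdbool.h>

/* Hypothetical per-opcode descriptor; only the bit discussed in the
 * commit message is modeled here. */
struct op_def {
    bool plug; /* set for ops that may hit block storage, e.g. read/write */
};

/* Plug only when more than one request is left to submit AND the
 * opcode is marked as potentially block based. */
static bool should_plug(unsigned int reqs_left, const struct op_def *def)
{
    return reqs_left > 1 && def->plug;
}
```

A single read, or a batch of operations whose descriptor does not set the bit, would skip plugging entirely in this model.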

io_uring: rearrange io_kiocb fields for better caching · 0415767e

Committed by Pavel Begunkov on Oct 27, 2020

We've got an extra 8 bytes in the 2nd cacheline; put ->fixed_file_refs
there, so the inline execution path mostly doesn't touch the 3rd
cacheline for fixed_file requests as well.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: link requests with singly linked list · f2f87370

Committed by Pavel Begunkov on Oct 27, 2020

A singly linked list is enough for keeping linked requests, because we
almost always operate on the head and traverse forward, with the
exception of linked timeouts going 1 hop backwards.

Replace ->link_list with a handmade singly linked list. Also kill
REQ_F_LINK_HEAD in favour of checking a newly added ->list for NULL
directly.

That saves 8B in io_kiocb, is not as heavy as list fixup, makes better
use of cache by not touching a previous request (i.e. last request of
the link) each time on list modification, optimises cache use further
in the following patch, and actually makes traversal easier, removing
some lines in the end. Also, keeping the invariant in ->list instead of
having REQ_F_LINK_HEAD is less error-prone.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
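
To make the idea in the commit above concrete, here is a small userspace model of a link chain kept as a singly linked list, with the head and tail tracked during submission. It is a sketch under assumptions: `struct request`, its `link` field, `struct submit_link`, and the helper names are all hypothetical and are not the kernel's io_kiocb layout.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request: a single forward pointer replaces the
 * two pointers a doubly linked list node would need. */
struct request {
    int id;
    struct request *link; /* next request in the chain, NULL at the end */
};

/* Head and tail of the chain being assembled during submission. */
struct submit_link {
    struct request *head;
    struct request *tail;
};

/* Append at the tail; only the previous tail and the new request are
 * touched, and no back-pointer has to be fixed up. */
static void link_append(struct submit_link *l, struct request *req)
{
    req->link = NULL;
    if (!l->head)
        l->head = req;
    else
        l->tail->link = req;
    l->tail = req;
}

/* A NULL check on the forward pointer replaces a separate
 * "this is a link head" flag. */
static bool has_linked(const struct request *req)
{
    return req->link != NULL;
}

/* Forward traversal from the head. */
static size_t link_length(const struct request *head)
{
    size_t n = 0;

    for (; head; head = head->link)
        n++;
    return n;
}
```

The one case that needs a backward hop, a linked timeout finding the request it guards, is handled in the commits listed below this one by storing that request explicitly rather than relying on a back-pointer.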

io_uring: track link timeout's master explicitly · 90cd7e42

Committed by Pavel Begunkov on Oct 27, 2020

In preparation for converting to singly linked lists for chaining
requests, make linked timeouts save the requests that they're
responsible for and not count on a doubly linked list for back
referencing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: track link's head and tail during submit · 863e0560

Committed by Pavel Begunkov on Oct 27, 2020

Explicitly save not only a link's head in io_submit_sqe[s]() but the
tail as well. That's in preparation for keeping linked requests in a
singly linked list.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: split poll and poll_remove structs · 018043be

Committed by Pavel Begunkov on Oct 27, 2020

Don't use a single struct for poll and poll remove requests; they have
totally different layouts.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add support for IORING_OP_UNLINKAT · 14a1143b

Committed by Jens Axboe on Sep 28, 2020

IORING_OP_UNLINKAT behaves like unlinkat(2) and takes the same flags
and arguments.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add support for IORING_OP_RENAMEAT · 80a261fd

Committed by Jens Axboe on Sep 28, 2020

IORING_OP_RENAMEAT behaves like renameat2(), and takes the same flags
etc.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: enable file table usage for SQPOLL rings · 14587a46

Committed by Jens Axboe on Sep 05, 2020

Now that SQPOLL supports non-registered files and grabs the file table,
we can relax the restriction on open/close/accept/connect and allow
them on a ring that is set up with IORING_SETUP_SQPOLL.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: allow non-fixed files with SQPOLL · 28cea78a

Committed by Jens Axboe on Sep 14, 2020

The restriction of needing fixed files for SQPOLL is problematic, and
prevents/inhibits several valid use cases. With the referenced
files_struct that we have now, it's trivially supportable.

Treat ->files like we do the mm for the SQPOLL thread - grab a reference
to it (and assign it), and drop it when we're done.

This feature is exposed as IORING_FEAT_SQPOLL_NONFIXED.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

24 Nov, 2020 2 commits

io_uring: add support for shutdown(2) · 36f4fa68

Committed by Jens Axboe on Sep 05, 2020

This adds support for the shutdown(2) system call, which is useful for
dealing with sockets.

shutdown(2) may block, so we have to punt it to async context.

Suggested-by: Norman Maurer <norman.maurer@googlemail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: allow SQPOLL with CAP_SYS_NICE privileges · ce59fc69

Committed by Jens Axboe on Sep 02, 2020

CAP_SYS_ADMIN is too restrictive for a lot of use cases; allow
CAP_SYS_NICE based on the premise that such users are already allowed
to raise the priority of tasks.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

18 Nov, 2020 3 commits

io_uring: order refnode recycling · e297822b

Committed by Pavel Begunkov on Nov 18, 2020

Don't recycle a refnode until we're done with all requests of nodes
ejected before.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: get an active ref_node from files_data · 1e5d770b

Committed by Pavel Begunkov on Nov 18, 2020

An active ref_node can always be found in ctx->files_data; it's much
safer to get it this way instead of poking into files_data->ref_list.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: don't double complete failed reissue request · c993df5a

Committed by Jens Axboe on Nov 17, 2020

Zorro reports that an xfstest test case is failing, and it turns out that
for the reissue path we can potentially issue a double completion on the
request for the failure path. There's an issue around the retry as well,
but for now, at least just make sure that we handle the error path
correctly.

Cc: stable@vger.kernel.org
Fixes: b63534c4 ("io_uring: re-issue block requests that failed because of resources")
Reported-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

15 Nov, 2020 1 commit

io_uring: handle -EOPNOTSUPP on path resolution · 944d1444

Committed by Jens Axboe on Nov 13, 2020

Any attempt to do path resolution on /proc/self from an async worker will
yield -EOPNOTSUPP. We can safely do that resolution from the task itself,
and without blocking, so retry it from there.

Ideally io_uring would know this upfront and not have to go through the
worker thread to find out, but that doesn't currently seem feasible.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

12 Nov, 2020 1 commit

io_uring: round-up cq size before comparing with rounded sq size · 88ec3211

Committed by Jens Axboe on Nov 11, 2020

If an application specifies IORING_SETUP_CQSIZE to set the CQ ring size
to a specific size, we ensure that the CQ size is at least that of the
SQ ring size. But in doing so, we compare the already rounded up to power
of two SQ size to the as-yet unrounded CQ size. This means that if an
application passes in non power of two sizes, we can return -EINVAL when
the final value would've been fine. As an example, an application passing
in 100/100 for sq/cq size should end up with 128 for both. But since we
round the SQ size first, we compare the CQ size of 100 to 128, and return
-EINVAL as that is too small.

Cc: stable@vger.kernel.org
Fixes: 33a107f0 ("io_uring: allow application controlled CQ ring size")
Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
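
The ordering problem described in the commit above can be reproduced with plain arithmetic. The following is a self-contained userspace sketch, not the kernel's setup code: `round_up_pow2()` stands in for the kernel's power-of-two rounding, both `cq_size_*` functions are hypothetical names, and a return value of -1 stands in for -EINVAL.

```c
#include <stdint.h>

/* Round v up to the next power of two (returns 1 for v == 0). */
static uint64_t round_up_pow2(uint32_t v)
{
    uint64_t p = 1;

    while (p < v)
        p <<= 1;
    return p;
}

/* Ordering with the bug: the SQ size is rounded first and then
 * compared against the still unrounded CQ request. */
static int64_t cq_size_compare_first(uint32_t sq_req, uint32_t cq_req)
{
    uint64_t sq = round_up_pow2(sq_req);

    if (cq_req < sq)
        return -1; /* stands in for -EINVAL */
    return (int64_t)round_up_pow2(cq_req);
}

/* Ordering after the fix: round both sizes, then compare. */
static int64_t cq_size_round_first(uint32_t sq_req, uint32_t cq_req)
{
    uint64_t sq = round_up_pow2(sq_req);
    uint64_t cq = round_up_pow2(cq_req);

    if (cq < sq)
        return -1; /* stands in for -EINVAL */
    return (int64_t)cq;
}
```

With the 100/100 request from the commit message, the first ordering compares 100 against 128 and rejects it, while the second rounds both to 128 and accepts. A CQ request that is genuinely too small, such as 64 against an SQ of 128, is still rejected by both.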

11 Nov, 2020 1 commit

vfs: separate __sb_start_write into blocking and non-blocking helpers · 8a3c84b6

Committed by Darrick J. Wong on Nov 10, 2020

Break this function into two helpers so that it's obvious that the
trylock versions return a value that must be checked, and the blocking
versions don't require that.  While we're at it, clean up the return
type mismatch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>

06 Nov, 2020 3 commits

io_uring: fix link lookup racing with link timeout · 9a472ef7

Committed by Pavel Begunkov on Nov 05, 2020

We can't just go over linked requests because it may race with linked
timeouts. Take ctx->completion_lock in that case.

Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: use correct pointer for io_uring_show_cred() · 6b47ab81

Committed by Jens Axboe on Nov 05, 2020

A previous commit changed how we index the registered credentials, but
neglected to update one spot that is used when the personalities are
iterated through ->show_fdinfo(). Ensure we use the right struct type
for the iteration.

Reported-by: syzbot+a6d494688cdb797bdfce@syzkaller.appspotmail.com
Fixes: 1e6fa521 ("io_uring: COW io_identity on mismatch")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: don't forget to task-cancel drained reqs · ef9865a4

Committed by Pavel Begunkov on Nov 05, 2020

If there is a long-standing request of one task locking up execution of
deferred requests, and the defer list contains requests of another task
(all files-less), then a potential execution of __io_uring_task_cancel()
by that other task will sleep until the first long-standing request
completes, and that may take a long time.

E.g.
tsk1: req1/read(empty_pipe) -> tsk2: req(DRAIN)
Then __io_uring_task_cancel(tsk2) waits for req1 completion.

It seems we can even manufacture a complicated case with many tasks
sharing many rings that can lock them forever.

Cancel deferred requests for __io_uring_task_cancel() as well.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

05 Nov, 2020 1 commit

io_uring: fix overflowed cancel w/ linked ->files · 99b32808

Committed by Pavel Begunkov on Nov 04, 2020

The current io_match_files() check in io_cqring_overflow_flush() is
useless because requests drop ->files before going to the overflow list;
however, requests linked to them do not, and we don't check those.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
