Commits · 071698e13ac6ba786dfa22349a7b62deb5a9464d · openeuler / Kernel

29 Jan, 2020 · 3 commits

io_uring: allow registering credentials · 071698e1

Committed by Jens Axboe on Jan 28, 2020

If an application wants to use a ring with different kinds of
credentials, it can register them upfront. We don't look up credentials;
the credentials of the task calling IORING_REGISTER_PERSONALITY are used.

An 'id' is returned for the application to use in subsequent personality
support.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

071698e1

io_uring: add io-wq workqueue sharing · 24369c2e

Committed by Pavel Begunkov on Jan 28, 2020

If IORING_SETUP_ATTACH_WQ is set, it expects wq_fd in io_uring_params to
be a valid io_uring fd, whose io-wq will be shared with the newly
created io_uring instance. If the flag is set but it can't share io-wq,
it fails.

This allows creation of "sibling" io_urings, where we prefer to keep the
SQ/CQ private, but want to share the async backend to minimize the amount
of overhead associated with having multiple rings that belong to the same
backend.
Reported-by: Jens Axboe <axboe@kernel.dk>
Reported-by: Daurnimator <quae@daurnimator.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

24369c2e

io_uring/io-wq: don't use static creds/mm assignments · cccf0ee8

Committed by Jens Axboe on Jan 27, 2020

We currently set up the io_wq with a static set of mm and creds. Even for
a single-use io-wq per io_uring, this is suboptimal as we may have
multiple enters of the ring. For sharing the io-wq backend, it doesn't
work at all.

Switch to passing in the creds and mm when the work item is set up. This
means that async work is no longer deferred to the io_uring mm and creds,
it is done with the current mm and creds.

Flag this behavior with IORING_FEAT_CUR_PERSONALITY, so applications know
they can rely on the current personality (mm and creds) being the same
for direct issue and async issue.
Reviewed-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cccf0ee8

21 Jan, 2020 · 17 commits

io_uring: optimise sqe-to-req flags translation · 6b47ee6e

Committed by Pavel Begunkov on Jan 18, 2020

For each IOSQE_* flag there is a corresponding REQ_F_* flag. And there
is a repetitive pattern of their translation:
e.g. if (sqe->flags & SQE_FLAG*) req->flags |= REQ_F_FLAG*

Use the same numeric values/bits for them and copy instead of manual
handling.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6b47ee6e
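
A minimal sketch of the idea in 6b47ee6e, not the kernel code. The `SQE_FLAG_*` and `REQ_FLAG_*` names and values below are made up for illustration. Once the user-visible bits and the internal bits share numeric values, the per-flag `if` chain collapses into one masked copy.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative user-visible flags (hypothetical names and values). */
#define SQE_FLAG_FIXED_FILE (1u << 0)
#define SQE_FLAG_DRAIN      (1u << 1)
#define SQE_FLAG_LINK       (1u << 2)
#define SQE_VALID_FLAGS     (SQE_FLAG_FIXED_FILE | SQE_FLAG_DRAIN | SQE_FLAG_LINK)

/* Illustrative internal flags: the low bits mirror the SQE flags. */
#define REQ_FLAG_FIXED_FILE (1u << 0)
#define REQ_FLAG_DRAIN      (1u << 1)
#define REQ_FLAG_LINK       (1u << 2)
#define REQ_FLAG_INTERNAL   (1u << 8) /* has no SQE counterpart */

/* The copy below is only valid while the two sets stay in sync. */
static_assert(SQE_FLAG_FIXED_FILE == REQ_FLAG_FIXED_FILE, "bit mismatch");
static_assert(SQE_FLAG_DRAIN == REQ_FLAG_DRAIN, "bit mismatch");
static_assert(SQE_FLAG_LINK == REQ_FLAG_LINK, "bit mismatch");
static_assert((SQE_VALID_FLAGS & REQ_FLAG_INTERNAL) == 0, "overlap");

/* The repetitive pattern: one branch per flag. */
static uint32_t translate_manual(uint8_t sqe_flags)
{
    uint32_t req_flags = 0;

    if (sqe_flags & SQE_FLAG_FIXED_FILE)
        req_flags |= REQ_FLAG_FIXED_FILE;
    if (sqe_flags & SQE_FLAG_DRAIN)
        req_flags |= REQ_FLAG_DRAIN;
    if (sqe_flags & SQE_FLAG_LINK)
        req_flags |= REQ_FLAG_LINK;
    return req_flags;
}

/* With identical bit values, translation is a masked copy. */
static uint32_t translate_copy(uint8_t sqe_flags)
{
    return sqe_flags & SQE_VALID_FLAGS;
}
```

The static assertions are the design point: they turn an accidental renumbering of either flag set into a build failure instead of a silent mistranslation.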

io_uring: add support for probing opcodes · 66f4af93

Committed by Jens Axboe on Jan 16, 2020

The application currently has no way of knowing if a given opcode is
supported or not without having to try and issue one and see if we get
-EINVAL or not. And even this approach is fraught with peril, as maybe
we're getting -EINVAL due to some fields being missing, or maybe it's
just not that easy to issue that particular command without doing some
other leg work in terms of setup first.

This adds IORING_REGISTER_PROBE, which fills in a structure with info
on what is supported or not. This will work even with sparse opcode
fields, which may happen in the future or even today if someone
backports specific features to older kernels.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

66f4af93

io_uring: add support for IORING_OP_OPENAT2 · cebdb986

Committed by Jens Axboe on Jan 08, 2020

Add support for the new openat2(2) system call. It's trivial to do, as
we can have openat(2) just be wrapped around it.
Suggested-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cebdb986

io_uring: enable option to only trigger eventfd for async completions · f2842ab5

Committed by Jens Axboe on Jan 08, 2020

If an application is using eventfd notifications with poll to know when
new SQEs can be issued, it's expecting the following read/writes to
complete inline. And with that, it knows that there are events available,
and doesn't want spurious wakeups on the eventfd for those requests.

This adds IORING_REGISTER_EVENTFD_ASYNC, which works just like
IORING_REGISTER_EVENTFD, except it only triggers notifications for events
that happen from async completions (IRQ, or io-wq worker completions).
Any completions inline from the submission itself will not trigger
notifications.
Suggested-by: Mark Papadakis <markuspapadakis@icloud.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f2842ab5

io_uring: add support for send(2) and recv(2) · fddaface

Committed by Jens Axboe on Jan 04, 2020

This adds IORING_OP_SEND for send(2) support, and IORING_OP_RECV for
recv(2) support.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fddaface

io_uring: add support for IORING_SETUP_CLAMP · 8110c1a6

Committed by Jens Axboe on Dec 28, 2019

Some applications like to start small in terms of ring size, and then
ramp up as needed. This is a bit tricky to do currently, since we don't
advertise the max ring size.

This adds IORING_SETUP_CLAMP. If set, and the values for SQ or CQ ring
size exceed what we support, then clamp them at the max values instead
of returning -EINVAL. Since we return the chosen ring sizes after setup,
no further changes are needed on the application side. io_uring already
changes the ring sizes if the application doesn't ask for power-of-two
sizes, for example.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8110c1a6
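
The clamping rule from 8110c1a6 can be sketched as a small userspace model. This is not the kernel implementation: `MAX_SQ_ENTRIES`, `roundup_pow2()` and `pick_sq_entries()` are hypothetical names, and the limit value is an assumption chosen for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed upper bound for the model; the real limit is kernel-defined. */
#define MAX_SQ_ENTRIES 32768u

/* Smallest power of two >= n, for 1 <= n <= MAX_SQ_ENTRIES. */
static uint32_t roundup_pow2(uint32_t n)
{
    uint32_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/*
 * Choose the ring size for a request. Without 'clamp' an oversized
 * request fails with -EINVAL; with 'clamp' it is capped at the maximum.
 * The chosen size is written to *out, mirroring how the sizes are
 * reported back to the application after setup.
 */
static int pick_sq_entries(uint32_t requested, bool clamp, uint32_t *out)
{
    assert(out != NULL);

    if (requested == 0)
        return -EINVAL;
    if (requested > MAX_SQ_ENTRIES) {
        if (!clamp)
            return -EINVAL;
        requested = MAX_SQ_ENTRIES;
    }
    *out = roundup_pow2(requested);
    return 0;
}
```

An application that starts small and grows can then always ask for its ideal size with clamping enabled and read back what it actually got.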

io_uring: add IORING_OP_MADVISE · c1ca757b

Committed by Jens Axboe on Dec 25, 2019

This adds support for doing madvise(2) through io_uring. We assume that
any operation can block, and hence punt everything async. This could be
improved, but it is hard to make bulletproof. The async punt ensures it's
safe.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c1ca757b

io_uring: add IORING_OP_FADVISE · 4840e418

Committed by Jens Axboe on Dec 25, 2019

This adds support for doing fadvise through io_uring. We assume that
WILLNEED doesn't block, but that DONTNEED may block.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4840e418

io_uring: allow use of offset == -1 to mean file position · ba04291e

Committed by Jens Axboe on Dec 25, 2019

This behaves like preadv2/pwritev2 with offset == -1, it'll use (and
update) the current file position. This obviously comes with the caveat
that if the application has multiple read/writes in flight, then the
end result will not be as expected. This is similar to threads sharing
a file descriptor and doing IO using the current file position.

Since this feature isn't easily detectable by doing a read or write,
add a feature flag, IORING_FEAT_RW_CUR_POS, to allow applications to
detect presence of this feature.
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ba04291e

io_uring: add non-vectored read/write commands · 3a6820f2

Committed by Jens Axboe on Dec 22, 2019

For use cases that don't already naturally have an iovec, it's easier
(or more convenient) to just use a buffer address + length. This is
particularly true if the use case is from languages that want to create
a memory safe abstraction on top of io_uring, and where introducing
the need for the iovec may impose an ownership issue. For those cases,
they currently need an indirection buffer, which means allocating data
just for this purpose.

Add basic read/write that don't require the iovec.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3a6820f2

io_uring: add IOSQE_ASYNC · ce35a47a

Committed by Jens Axboe on Dec 17, 2019

io_uring defaults to always doing inline submissions, if at all
possible. But for larger copies, even if the data is fully cached, that
can take a long time. Add an IOSQE_ASYNC flag that the application can
set on the SQE - if set, it'll ensure that we always go async for those
kinds of requests. Use the io-wq IO_WQ_WORK_CONCURRENT flag to ensure we
get the concurrency we desire for this case.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ce35a47a

io_uring: add support for IORING_OP_STATX · eddc7ef5

Committed by Jens Axboe on Dec 13, 2019

This provides support for async statx(2) through io_uring.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

eddc7ef5

io_uring: avoid ring quiesce for fixed file set unregister and update · 05f3fb3c

Committed by Jens Axboe on Dec 09, 2019

We currently fully quiesce the ring before an unregister or update of
the fixed fileset. This is very expensive, and we can be a bit smarter
about this.

Add a percpu refcount for the file tables as a whole. Grab a percpu ref
when we use a registered file, and put it on completion. This is cheap
to do. Upon removal of a file from a set, switch the ref count to atomic
mode. When we hit zero ref on the completion side, then we know we can
drop the previously registered files. When the old files have been
dropped, switch the ref back to percpu mode for normal operation.

Since there's a period between doing the update and the kernel being
done with it, add an IORING_OP_FILES_UPDATE opcode that can perform the
same action. The application knows the update has completed when it gets
the CQE for it. Between doing the update and receiving this completion,
the application must continue to use the unregistered fd if submitting
IO on this particular file.

This takes the runtime of test/file-register from liburing from 14s to
about 0.7s.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

05f3fb3c

io_uring: add support for IORING_OP_CLOSE · b5dba59e

Committed by Jens Axboe on Dec 11, 2019

This works just like close(2), unsurprisingly. We remove the file
descriptor and post the completion inline, then offload the actual
(potential) last file put to async context.

Mark the async part of this work as uncancellable, as we really must
guarantee that the latter part of the close is run.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b5dba59e

io_uring: add support for IORING_OP_OPENAT · 15b71abe

Committed by Jens Axboe on Dec 11, 2019

This works just like openat(2), except it can be performed async. For
the normal case of a non-blocking path lookup this will complete
inline. If we have to do IO to perform the open, it'll be done from
async context.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

15b71abe

io_uring: add support for fallocate() · d63d1b5e

Committed by Jens Axboe on Dec 10, 2019

This exposes fallocate(2) through io_uring.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d63d1b5e

io_uring: fix compat for IORING_REGISTER_FILES_UPDATE · 1292e972

Committed by Eugene Syromiatnikov on Jan 15, 2020

fds field of struct io_uring_files_update is problematic with regards
to compat user space, as pointer size is different in 32-bit, 32-on-64-bit,
and 64-bit user space. In order to avoid custom handling of compat in
the syscall implementation, make fds __u64 and use u64_to_user_ptr in
order to retrieve it. Also, align the field naturally and check that
no garbage is passed there.

Fixes: c3a31e60 ("io_uring: add support for IORING_REGISTER_FILES_UPDATE")
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1292e972
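
To illustrate why 1292e972 replaces the pointer member with a 64-bit integer, here is a userspace sketch. The struct names `files_update_ptr` and `files_update_u64` are hypothetical mirrors written for this example, and `u64_to_ptr()` is only a stand-in for the kernel's `u64_to_user_ptr`. The layout of the second struct follows the commit's description (a naturally aligned `__u64` plus a field that must not contain garbage).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Old shape: a raw pointer member, so size and layout follow the ABI. */
struct files_update_ptr {
    uint32_t offset;
    int32_t *fds;
};

/* New shape: identical for 32-bit, 32-on-64-bit and 64-bit callers. */
struct files_update_u64 {
    uint32_t offset;
    uint32_t resv;             /* must be zero, checked by the receiver */
    _Alignas(8) uint64_t fds;  /* carries an int32_t * as an integer */
};

static_assert(sizeof(struct files_update_u64) == 16, "fixed size");
static_assert(offsetof(struct files_update_u64, fds) == 8, "fixed offset");

/* Userspace side: widen the pointer into the 64-bit field. */
static uint64_t ptr_to_u64(const void *p)
{
    return (uint64_t)(uintptr_t)p;
}

/* Receiver side stand-in: recover the pointer from the integer. */
static void *u64_to_ptr(uint64_t v)
{
    return (void *)(uintptr_t)v;
}

/* Build an update request with the reserved field cleared. */
static struct files_update_u64 make_update(uint32_t offset, const int32_t *fds)
{
    struct files_update_u64 up = {
        .offset = offset,
        .resv = 0,
        .fds = ptr_to_u64(fds),
    };
    return up;
}
```

Because the struct no longer changes shape with the pointer width, the syscall path needs no separate compat branch for it.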

12 Dec, 2019 · 1 commit

io_uring: ensure we return -EINVAL on unknown opcode · 9e3aa61a

Committed by Jens Axboe on Dec 11, 2019

If we submit an unknown opcode and have fd == -1, io_op_needs_file()
will return true as we default to needing a file. Then when we go and
assign the file, we find the 'fd' invalid and return -EBADF. We really
should be returning -EINVAL for that case, as we normally do for
unsupported opcodes.

Change io_op_needs_file() to have the following return values:

0   - does not need a file
1   - does need a file
< 0 - error value

and use this to pass back the right value for this invalid case.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9e3aa61a
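
The three-way return convention from 9e3aa61a can be modelled as follows. This is a sketch under stated assumptions: the opcode names and numbering (`OP_NOP`, `OP_READV`, `OP_TIMEOUT`) and the helpers `op_needs_file()` and `assign_file()` are hypothetical, not the kernel's functions.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical opcode numbering used only by this model. */
enum { OP_NOP = 0, OP_READV = 1, OP_TIMEOUT = 2 };

/*
 * Returns 0 if the opcode does not need a file, 1 if it does,
 * and a negative errno for an opcode the model does not know.
 */
static int op_needs_file(unsigned int opcode)
{
    switch (opcode) {
    case OP_NOP:
    case OP_TIMEOUT:
        return 0;
    case OP_READV:
        return 1;
    default:
        return -EINVAL;
    }
}

/*
 * File assignment step. The opcode check runs first, so an unknown
 * opcode reports -EINVAL even when fd is -1; -EBADF is reserved for
 * known opcodes that were given an invalid descriptor.
 */
static int assign_file(unsigned int opcode, int fd)
{
    int ret = op_needs_file(opcode);

    if (ret <= 0)
        return ret; /* no file needed, or error passed through */
    assert(ret == 1);
    if (fd < 0)
        return -EBADF;
    return 0;
}
```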

11 Dec, 2019 · 1 commit

io_uring: allow unbreakable links · 4e88d6e7

Committed by Jens Axboe on Dec 07, 2019

Some commands will invariably end in a failure in the sense that the
completion result will be less than zero. One such example is timeouts
that don't have a completion count set, they will always complete with
-ETIME unless cancelled.

For linked commands, we sever links and fail the rest of the chain if
the result is less than zero. Since we have commands where we know that
will happen, add IOSQE_IO_HARDLINK as a stronger link that doesn't sever
regardless of the completion result. Note that the link will still sever
if we fail submitting the parent request, hard links are only resilient
in the presence of completion results for requests that did submit
correctly.

Cc: stable@vger.kernel.org # v5.4
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4e88d6e7
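
A small model of the difference between a normal link and the hard link described in 4e88d6e7. It only covers completion results; failures at submission time, which still sever a hard link, are not modelled. The types and the `run_chain()` helper are hypothetical names for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* How a request is tied to the next one in the array. */
enum link_kind { LINK_NONE, LINK_SOFT, LINK_HARD };

struct chain_req {
    int result;           /* what the request would complete with */
    enum link_kind link;  /* link to the following request */
    int cqe_res;          /* filled in: what the application sees */
};

/*
 * Walk one chain in order. A negative result severs a soft link and
 * cancels everything after it; a hard link lets the chain continue.
 */
static void run_chain(struct chain_req *reqs, size_t n)
{
    bool severed = false;

    assert(reqs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (severed) {
            reqs[i].cqe_res = -ECANCELED;
            continue;
        }
        reqs[i].cqe_res = reqs[i].result;
        if (reqs[i].result < 0 && reqs[i].link == LINK_SOFT)
            severed = true;
    }
}
```

The motivating case is a timeout without a completion count followed by dependent work: with a soft link the follow-up is always cancelled, with a hard link it still runs after the timeout completes with -ETIME.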

03 Dec, 2019 · 1 commit

io_uring: mark us with IORING_FEAT_SUBMIT_STABLE · da8c9690

Committed by Jens Axboe on Dec 02, 2019

If this flag is set, applications can be certain that any data for
async offload has been consumed when the kernel has consumed the
SQE.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

da8c9690

26 Nov, 2019 · 1 commit

io_uring: add support for IORING_OP_CONNECT · f8e85cf2

Committed by Jens Axboe on Nov 23, 2019

This allows an application to call connect() in an async fashion. Like
other opcodes, we first try a non-blocking connect, then punt to async
context if we have to.

Note that we can still return -EINPROGRESS, and in that case the caller
should use IORING_OP_POLL_ADD to do an async wait for completion of the
connect request (just like for regular connect(2), except we can do it
async here too).
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f8e85cf2

10 Nov, 2019 · 1 commit

io_uring: add support for backlogged CQ ring · 1d7bb1d5

Committed by Jens Axboe on Nov 06, 2019

Currently we drop completion events, if the CQ ring is full. That's fine
for requests with bounded completion times, but it may make it harder or
impossible to use io_uring with networked IO where request completion
times are generally unbounded. Or with POLL, for example, which is also
unbounded.

After this patch, we never overflow the ring, we simply store requests
in a backlog for later flushing. This flushing is done automatically by
the kernel. To prevent the backlog from growing indefinitely, if the
backlog is non-empty, we apply back pressure on IO submissions. Any
attempt to submit new IO with a non-empty backlog will get an -EBUSY
return from the kernel. This is a signal to the application that it has
backlogged CQ events, and that it must reap those before being allowed
to submit more IO.

Note that if we do return -EBUSY, we will have filled whatever
backlogged events into the CQ ring first, if there's room. This means
the application can safely reap events WITHOUT entering the kernel and
waiting for them, they are already available in the CQ ring.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1d7bb1d5
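
The backlog and back-pressure behaviour described in 1d7bb1d5 can be sketched as a toy model. Everything here is an assumption made for illustration: the tiny `CQ_SIZE`, the fixed-size backlog array (the description implies an unbounded backlog), the strict ordering, and all function names. It follows the commit text, where a submit attempt with a non-empty backlog first moves what fits into the ring and then returns -EBUSY.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

#define CQ_SIZE     2u  /* deliberately tiny so it overflows quickly */
#define BACKLOG_MAX 8u  /* model limit only */

struct cq_model {
    int ring[CQ_SIZE];
    unsigned int head, tail;   /* free-running indices */
    int backlog[BACKLOG_MAX];
    unsigned int backlog_len;
};

static bool ring_full(const struct cq_model *cq)
{
    return cq->tail - cq->head == CQ_SIZE;
}

/* Post a completion: into the ring if possible, else into the backlog. */
static void complete(struct cq_model *cq, int res)
{
    if (cq->backlog_len == 0 && !ring_full(cq)) {
        cq->ring[cq->tail++ % CQ_SIZE] = res;
        return;
    }
    assert(cq->backlog_len < BACKLOG_MAX);
    cq->backlog[cq->backlog_len++] = res;
}

/* Move as many backlogged completions into the ring as will fit. */
static void flush_backlog(struct cq_model *cq)
{
    unsigned int moved = 0;

    while (moved < cq->backlog_len && !ring_full(cq))
        cq->ring[cq->tail++ % CQ_SIZE] = cq->backlog[moved++];
    memmove(cq->backlog, cq->backlog + moved,
            (cq->backlog_len - moved) * sizeof(cq->backlog[0]));
    cq->backlog_len -= moved;
}

/* Submission attempt: refused with -EBUSY while a backlog exists. */
static int submit(struct cq_model *cq)
{
    if (cq->backlog_len > 0) {
        flush_backlog(cq);
        return -EBUSY;
    }
    return 0;
}

/* Reap one completion without "entering the kernel"; 1 if one was read. */
static int reap(struct cq_model *cq, int *res)
{
    if (cq->head == cq->tail)
        return 0;
    *res = cq->ring[cq->head++ % CQ_SIZE];
    return 1;
}
```

In this model an application that sees -EBUSY simply reaps from the ring and retries the submit, which matches the note that the flushed events are already available in the CQ ring.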

08 Nov, 2019 · 1 commit

io_uring: add support for linked SQE timeouts · 2665abfd

Committed by Jens Axboe on Nov 05, 2019

While we have support for generic timeouts, we don't have a way to tie
a timeout to a specific SQE. The generic timeouts simply trigger wakeups
on the CQ ring.

This adds support for IORING_OP_LINK_TIMEOUT. This command is only valid
as a link to a previous command. The timeout specified can be either
relative or absolute, following the same rules as IORING_OP_TIMEOUT. If
the timeout triggers before the dependent command completes, it will
attempt to cancel that command. Likewise, if the dependent command
completes before the timeout triggers, it will cancel the timeout.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2665abfd

01 Nov, 2019 · 1 commit

io_uring: support for generic async request cancel · 62755e35

Committed by Jens Axboe on Oct 28, 2019

This adds support for IORING_OP_ASYNC_CANCEL, which will attempt to
cancel requests that have been punted to async context and are now
in-flight. This works for regular read/write requests to files, as
long as they haven't been started yet. For socket based IO (or things
like accept4(2)), we can cancel work that is already running as well.

To cancel a request, the sqe must have ->addr set to the user_data of
the request it wishes to cancel. If the request is cancelled
successfully, the original request is completed with -ECANCELED
and the cancel request is completed with a result of 0. If the
request was already running, the original may or may not complete
in error. The cancel request will complete with -EALREADY for that
case. And finally, if the request to cancel wasn't found, the cancel
request is completed with -ENOENT.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

62755e35
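
The result codes listed in 62755e35 can be summarised with a lookup model. This is a sketch, not the kernel path: `struct inflight` and `async_cancel()` are hypothetical, and the model treats every running request as non-cancellable, which ignores the socket case mentioned above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One request punted to async context, keyed by its user_data. */
struct inflight {
    uint64_t user_data;
    bool running;  /* already picked up by a worker */
    bool done;     /* a completion has been posted */
    int res;       /* completion result once done */
};

/*
 * Cancel the request whose user_data matches 'target' (the value the
 * cancel SQE would carry in ->addr). Returns the cancel request's own
 * result: 0, -EALREADY or -ENOENT.
 */
static int async_cancel(struct inflight *reqs, size_t n, uint64_t target)
{
    assert(reqs != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (reqs[i].done || reqs[i].user_data != target)
            continue;
        if (reqs[i].running)
            return -EALREADY; /* original may or may not fail */
        reqs[i].done = true;
        reqs[i].res = -ECANCELED;
        return 0;
    }
    return -ENOENT;
}
```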

30 Oct, 2019 · 5 commits

io_uring: add support for IORING_OP_ACCEPT · 17f2fe35

Committed by Jens Axboe on Oct 17, 2019

This allows an application to call accept4() in an async fashion. Like
other opcodes, we first try a non-blocking accept, then punt to async
context if we have to.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

17f2fe35

io_uring: add support for canceling timeout requests · 11365043

Committed by Jens Axboe on Oct 16, 2019

We might have cases where the need for a specific timeout is gone, add
support for canceling an existing timeout operation. This works like the
POLL_REMOVE command, where the application passes in the user_data of
the timeout it wishes to cancel in the sqe->addr field.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

11365043

io_uring: add support for absolute timeouts · a41525ab

Committed by Jens Axboe on Oct 15, 2019

This is a pretty trivial addition on top of the relative timeouts
we have now, but it's handy for ensuring tighter timing for those
that are building scheduling primitives on top of io_uring.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a41525ab

io_uring: allow application controlled CQ ring size · 33a107f0

Committed by Jens Axboe on Oct 04, 2019

We currently size the CQ ring as twice the SQ ring, to allow some
flexibility in not overflowing the CQ ring. This is done because the
SQE life time is different than that of the IO request itself, the SQE
is consumed as soon as the kernel has seen the entry.

Certain applications don't need a huge SQ ring size, since they just
submit IO in batches. But they may have a lot of requests pending, and
hence need a big CQ ring to hold them all. By allowing the application
to control the CQ ring size multiplier, we can cater to those
applications more efficiently.

If an application wants to define its own CQ ring size, it must set
IORING_SETUP_CQSIZE in the setup flags, and fill out
io_uring_params->cq_entries. The value must be a power of two.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

33a107f0

io_uring: add support for IORING_REGISTER_FILES_UPDATE · c3a31e60

Committed by Jens Axboe on Oct 03, 2019

Allows the application to remove/replace/add files to/from a file set.
Passes in a struct:

struct io_uring_files_update {
	__u32 offset;
	__s32 *fds;
};

that holds an array of fds, size of array passed in through the usual
nr_args part of the io_uring_register() system call. The logic is as
follows:

1) If ->fds[i] is -1, the existing file at i + ->offset is removed from
   the set.
2) If ->fds[i] is a valid fd, the existing file at i + ->offset is
   replaced with ->fds[i].

For case #2, if the existing file is currently empty (fd == -1), the
new fd is simply added to the array.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c3a31e60
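
The update rules from c3a31e60 can be sketched over a plain array of descriptors. This is a model only: the real fixed file set holds file references rather than fd numbers, and `files_update()`, its bounds check and its return value (the number of slots written) are assumptions for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Apply fds[0..nr_args) to table[offset..offset + nr_args).
 *   fds[i] == -1    : the entry at i + offset is removed (slot emptied)
 *   fds[i] valid fd : the entry at i + offset is replaced, or simply
 *                     added if the slot was empty (-1) before
 * Returns the number of slots written, or -EINVAL if the range does
 * not fit inside the table.
 */
static int files_update(int32_t *table, size_t table_len, uint32_t offset,
                        const int32_t *fds, size_t nr_args)
{
    assert(table != NULL && fds != NULL);

    if (offset > table_len || nr_args > table_len - offset)
        return -EINVAL;
    for (size_t i = 0; i < nr_args; i++)
        table[offset + i] = fds[i]; /* covers remove, replace and add */
    return (int)nr_args;
}
```

In this flattened model all three cases reduce to one store. In the real implementation they differ in which file references are dropped and which are taken.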

19 Sep, 2019 · 1 commit

io_uring: IORING_OP_TIMEOUT support · 5262f567

Committed by Jens Axboe on Sep 17, 2019

There have been a few requests for functionality similar to io_getevents()
and epoll_wait(), where the user can specify a timeout for waiting on
events. I deliberately did not add support for this through the system
call initially to avoid overloading the args, but I can see that the use
cases for this are valid.

This adds support for IORING_OP_TIMEOUT. If a user wants to get woken
when waiting for events, simply submit one of these timeout commands
with your wait call (or before). This ensures that the application
sleeping on the CQ ring waiting for events will get woken. The timeout
command is passed in as a pointer to a struct timespec. Timeouts are
relative. The timeout command also includes a way to auto-cancel after
N events have passed.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5262f567

07 Sep, 2019 · 1 commit

io_uring: expose single mmap capability · ac90f249

Committed by Jens Axboe on Sep 06, 2019

After commit 75b28aff we can get by with just a single mmap to
map both the sq and cq ring. However, userspace doesn't know that.

Add a features variable to io_uring_params, and notify userspace
that the kernel has this ability. This can then be used in liburing
(or in applications directly) to avoid the second mmap.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ac90f249

10 Jul, 2019 · 2 commits

io_uring: add support for recvmsg() · aa1fa28f

Committed by Jens Axboe on Apr 19, 2019

This is done through IORING_OP_RECVMSG. This opcode uses the same
sqe->msg_flags that IORING_OP_SENDMSG added, and we pass in the
msghdr struct in the sqe->addr field as well.

We use MSG_DONTWAIT to force an inline fast path if recvmsg() doesn't
block, and punt to async execution if it would have.
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

aa1fa28f

io_uring: add support for sendmsg() · 0fa03c62

Committed by Jens Axboe on Apr 19, 2019

This is done through IORING_OP_SENDMSG. There's a new sqe->msg_flags
for the flags argument, and the msghdr struct is passed in the
sqe->addr field.

We use MSG_DONTWAIT to force an inline fast path if sendmsg() doesn't
block, and punt to async execution if it would have.
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0fa03c62

24 Jun, 2019 · 1 commit

io_uring: add support for sqe links · 9e645e11

Committed by Jens Axboe on May 10, 2019

With SQE links, we can create chains of dependent SQEs. One example
would be queueing an SQE that's a read from one file descriptor, with
the linked SQE being a write to another with the same set of buffers.

An SQE link will not stall the pipeline, it'll just ensure that
dependent SQEs aren't issued before the previous link has completed.

Any error at submission or completion time will break the chain of SQEs.
For completions, this also includes short reads or writes, as the next
SQE could depend on the previous one being fully completed.

Any SQE in a chain that gets canceled due to any of the above errors,
will get a CQE filled with -ECANCELED as the error value.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9e645e11

03 May, 2019 · 3 commits

io_uring: add support for eventfd notifications · 9b402849

Committed by Jens Axboe on Apr 11, 2019

Allow registration of an eventfd, which will trigger an event every
time a completion event happens for this io_uring instance.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9b402849

io_uring: add support for IORING_OP_SYNC_FILE_RANGE · 5d17b4a4

Committed by Jens Axboe on Apr 09, 2019

This behaves just like sync_file_range(2) does.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5d17b4a4

io_uring: add support for marking commands as draining · de0617e4

Committed by Jens Axboe on Apr 06, 2019

There are no ordering constraints between the submission and completion
side of io_uring. But sometimes that would be useful to have. One common
example is doing an fsync, for instance, and have it ordered with
previous writes. Without support for that, the application must do this
tracking itself.

This adds a general SQE flag, IOSQE_IO_DRAIN. If a command is marked
with this flag, then it will not be issued before previous commands have
completed, and subsequent commands submitted after the drain will not be
issued before the drain is started. If there are no pending commands,
setting this flag will not change the behavior of the issue of the
command.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

de0617e4
