- 25 July 2022, 14 commits
-
Committed by Pavel Begunkov
Allow flushing notifiers as part of a sendzc request by setting the IORING_SENDZC_FLUSH flag. When the sendzc request succeeds, it will flush the used [active] notifier. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e0b4d9a6797e2fd6092824fe42953db7a519bbc8.1657643355.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov
Allow zerocopy sends to use fixed buffers. There is an optimisation for this case: the network layer doesn't need to reference the pages (see SKBFL_MANAGED_FRAG_REFS), so io_uring has to ensure the validity of fixed buffers until the notifier is released. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e1d8bd1b5934e541d90c1824eb4020ae3f5f43f3.1657643355.git.asml.silence@gmail.com [axboe: fold in 32-bit pointer cast warning fix] Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov
Allow specifying an address for zerocopy sends, making them more like sendto(2). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/70417a8f7c5b51ab454690bae08adc0c187f89e8.1657643355.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov
Add a new io_uring opcode, IORING_OP_SENDZC. The main distinction from IORING_OP_SEND is that the user should specify a notification slot index in sqe::notification_idx, and the buffers are safe to reuse only once the used notification has been flushed and completes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a80387c6a68ce9cf99b3b6ef6f71068468761fb7.1657643355.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
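A minimal sketch of issuing such a zerocopy send with raw SQE fields, exactly as described in this series (IORING_OP_SENDZC, sqe->notification_idx and the ioprio-based zc flags are taken from these commit messages; the slot-based interface was reworked upstream before the final release, so treat this as historical illustration rather than the current API):

    #include <liburing.h>
    #include <string.h>

    /* Sketch: queue a zerocopy send using notification slot 0, and ask for
     * the notifier to be flushed on success (IORING_SENDZC_FLUSH) so the
     * "buffers reusable" CQE is posted without a separate flush step.
     * Assumes 'ring' is initialized and a notification slot is registered. */
    static int queue_sendzc(struct io_uring *ring, int sock,
                            const void *buf, unsigned len)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        if (!sqe)
            return -1;
        memset(sqe, 0, sizeof(*sqe));
        sqe->opcode = IORING_OP_SENDZC;      /* new opcode from this series */
        sqe->fd = sock;
        sqe->addr = (unsigned long)buf;
        sqe->len = len;
        sqe->user_data = 1;
        sqe->notification_idx = 0;           /* which registered slot to use */
        sqe->ioprio = IORING_SENDZC_FLUSH;   /* zc send flags live in ioprio */
        return io_uring_submit(ring);
    }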
-
Committed by Pavel Begunkov
Let userspace register and unregister notification slots. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a0aa8161fe3ebb2a4cc6e5dbd0cffb96e6881cf5.1657643355.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Dylan Yudaken
Similar to multishot recv, this requires provided buffers to be used. However, recvmsg is much more complex than recv, as it has multiple outputs: specifically flags, name, and control messages. Support this by introducing a new struct io_uring_recvmsg_out with four fields: namelen, controllen and flags match the similar out fields in msghdr from standard recvmsg(2), and payloadlen is the length of the payload following the header. This struct is placed at the start of the returned buffer. Based on what the user specifies in struct msghdr, the next bytes of the buffer will be the name (the next msg_namelen bytes) and then the control data (the next msg_controllen bytes). The payload comes at the end. The return value in the CQE is the total used size of the provided buffer. Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220714110258.1336200-4-dylany@fb.com [axboe: style fixups, see link] Signed-off-by: Jens Axboe <axboe@kernel.dk>
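The buffer layout described above can be walked like this; a sketch with a local copy of the four-field header for illustration (the real definition lives in the io_uring uapi header), assuming the request was issued with the struct msghdr named msg and that buf points at the selected provided buffer:

    #include <linux/types.h>
    #include <sys/socket.h>

    /* Local copy of the header placed at the start of the provided buffer. */
    struct io_uring_recvmsg_out {
        __u32 namelen;
        __u32 controllen;
        __u32 payloadlen;
        __u32 flags;
    };

    /* Sketch: region offsets use the lengths the caller asked for in the
     * msghdr; the header holds the lengths the kernel actually wrote. */
    static void parse_recvmsg_buf(void *buf, const struct msghdr *msg)
    {
        struct io_uring_recvmsg_out *out = buf;
        char *name    = (char *)buf + sizeof(*out);
        char *control = name + msg->msg_namelen;
        char *payload = control + msg->msg_controllen;

        /* out->namelen/controllen/payloadlen/flags describe how much of
         * each region is valid; cqe->res is the total buffer usage. */
        (void)out; (void)name; (void)control; (void)payload;
    }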
-
Committed by Dylan Yudaken
Support multishot receive for io_uring. Typical server applications run a loop where, for each recv CQE, they requeue another recv/recvmsg. This can be simplified by using the existing multishot functionality combined with io_uring's provided buffers. The API is to add the IORING_RECV_MULTISHOT flag to the SQE. CQEs will then be posted (with the IORING_CQE_F_MORE flag set) when data is available and has been read. Once an error occurs or the socket ends, the multishot will be removed and a completion without IORING_CQE_F_MORE will be posted. The benefit of this is that the recv is much more performant:
* Subsequent receives are queued up straight away, without requiring the application to finish a processing loop.
* If there is more data in the socket (say the provided buffer size is smaller than the socket buffer) then the data is immediately returned, improving batching.
* Poll is only armed once and reused, saving CPU cycles.
Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220630091231.1456789-11-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
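A sketch of arming a multishot recv with buffer selection and draining its CQEs; assumes liburing, a provided-buffer group gid already registered, and a connected socket (the raw ioprio flag is set by hand here, since this predates a dedicated liburing helper):

    #include <liburing.h>

    /* Sketch: one SQE keeps posting CQEs until error or EOF. */
    static void recv_multishot(struct io_uring *ring, int sock, int gid)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        struct io_uring_cqe *cqe;

        io_uring_prep_recv(sqe, sock, NULL, 0, 0); /* buffer picked from group */
        sqe->ioprio |= IORING_RECV_MULTISHOT;      /* recv flags live in ioprio */
        sqe->flags |= IOSQE_BUFFER_SELECT;
        sqe->buf_group = gid;
        io_uring_submit(ring);

        for (;;) {
            if (io_uring_wait_cqe(ring, &cqe))
                break;
            /* cqe->res: bytes read (or -errno); buffer id in cqe->flags */
            int more = cqe->flags & IORING_CQE_F_MORE;
            io_uring_cqe_seen(ring, cqe);
            if (!more)
                break;  /* multishot terminated; re-arm if still needed */
        }
    }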
-
Committed by Pavel Begunkov
Since recently, io_uring provides an option to allocate a file index for operations registering fixed files. However, it's unusable with mixed approaches, where for some of the files userspace knows better where to place them: the allocation may race, and users don't have any sane way to pick a free slot and hope it will not be taken. Let userspace register a range of fixed file slots in which the auto-allocation happens. The use case is splitting the fixed table in two parts, where one of them is used for auto-allocation and the other for slot-specified operations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/66ab0394e436f38437cf7c44676e1920d09687ad.1656154403.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
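A sketch of carving out such a range, assuming a 64-slot sparse fixed file table where the upper half becomes the kernel's auto-allocation pool (io_uring_register_file_alloc_range is the liburing wrapper for this registration; availability depends on your liburing version):

    #include <liburing.h>

    /* Sketch: slots 0..31 stay under manual control, slots 32..63 serve
     * IORING_FILE_INDEX_ALLOC requests. */
    static int setup_split_table(struct io_uring *ring)
    {
        int ret;

        ret = io_uring_register_files_sparse(ring, 64);
        if (ret)
            return ret;
        /* Confine auto-allocation to [32, 32 + 32). */
        return io_uring_register_file_alloc_range(ring, 32, 32);
    }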
-
Committed by Jens Axboe
With IORING_OP_MSG_RING, one ring can send a message to another ring. Extend that support to also allow sending a fixed file descriptor to that ring, enabling one ring to pass a registered descriptor to another one. The arguments are extended to pass in:
sqe->addr3        fixed file slot in the source ring
sqe->file_index   fixed file slot in the destination ring
IORING_OP_MSG_RING is extended to take a command argument in sqe->addr. If set to zero (or IORING_MSG_DATA), it sends just a message like before. If set to IORING_MSG_SEND_FD, a fixed file descriptor is sent according to the above arguments. Two common use cases for this are: 1) A server needs to be shut down or restarted; pass its file descriptors to another one. 2) The backend is split, and one part accepts connections, while the others then get the fd passed and handle the actual connection. Both of those are classic SCM_RIGHTS use cases, and it's not possible to support them with direct descriptors today. By default, this will post a CQE to the target ring, similarly to how IORING_MSG_DATA does it. If IORING_MSG_RING_CQE_SKIP is set, no message is posted to the target ring; the issuer is expected to notify the receiver side separately. Signed-off-by: Jens Axboe <axboe@kernel.dk>
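A raw-SQE sketch of moving a fixed descriptor between rings per the argument layout above; assumes both rings have fixed file tables registered, and that the destination slot is encoded off-by-one in the sqe the way other file_index users encode it (newer liburing hides these details behind a prep helper):

    #include <liburing.h>
    #include <string.h>

    /* Sketch: send the registered file in our slot 3 to the target ring,
     * installing it into the target's fixed slot 7. */
    static int pass_fixed_fd(struct io_uring *ring, int target_ring_fd)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        if (!sqe)
            return -1;
        memset(sqe, 0, sizeof(*sqe));
        sqe->opcode = IORING_OP_MSG_RING;
        sqe->fd = target_ring_fd;         /* fd of the receiving ring */
        sqe->addr = IORING_MSG_SEND_FD;   /* command: pass an fd, not data */
        sqe->addr3 = 3;                   /* fixed file slot in source ring */
        sqe->file_index = 7 + 1;          /* destination slot, off-by-one */
        sqe->user_data = 2;
        return io_uring_submit(ring);
    }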
-
Committed by Gustavo A. R. Silva
There is a regular need in the kernel to provide a way to declare a dynamically sized set of trailing elements in a structure. Kernel code should always use "flexible array members"[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://www.kernel.org/doc/html/v5.16/process/deprecated.html#zero-length-and-one-element-arrays Link: https://github.com/KSPP/linux/issues/78 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
The io_uring cancelation API is async, like any other API that we expose there. For the case of finding a request to cancel, or not finding one, it is fully sync in that when submission returns, the CQEs for both the cancelation request and the targeted request have been posted to the CQ ring. However, if the targeted work is being executed by io-wq, the API can only start the act of canceling it. This makes it difficult to use in some circumstances, as the caller then has to wait for the CQEs to come in and match on the same cancelation data there. Provide an IORING_REGISTER_SYNC_CANCEL command for io_uring_register() that does sync cancelations, always. For the io-wq case, it'll wait for the cancelation to come in before returning. The only expected returns from this API are:
0        Request found and canceled fine.
> 0      Requests found and canceled. Only happens if asked to cancel multiple requests, and if the work wasn't in progress.
-ENOENT  Request not found.
-ETIME   A timeout on the operation was requested, but the timeout expired before we could cancel.
We won't get -EALREADY via this API. If the timeout value passed in is -1 (tv_sec and tv_nsec), then that means no timeout is requested. Otherwise, the timespec passed in is the amount of time the sync cancel will wait for a successful cancelation. Link: https://github.com/axboe/liburing/discussions/608 Signed-off-by: Jens Axboe <axboe@kernel.dk>
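A sketch of the registration argument and call, using liburing's io_uring_register_sync_cancel() wrapper and the struct io_uring_sync_cancel_reg layout that accompanies this command:

    #include <liburing.h>
    #include <string.h>

    /* Sketch: synchronously cancel the request tagged user_data == 0xdead,
     * waiting at most one second for io-wq to give it up. */
    static int cancel_sync(struct io_uring *ring)
    {
        struct io_uring_sync_cancel_reg reg;

        memset(&reg, 0, sizeof(reg));
        reg.addr = 0xdead;        /* match key: user_data by default */
        reg.fd = -1;              /* unused unless IORING_ASYNC_CANCEL_FD */
        reg.timeout.tv_sec = 1;   /* -1 in tv_sec and tv_nsec: no timeout */
        reg.timeout.tv_nsec = 0;
        return io_uring_register_sync_cancel(ring, &reg);
    }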
-
Committed by Jens Axboe
In preparation for not having a request to pass in that carries this state, add a separate cancelation flag that lets the caller ask for a fixed file to be used for cancelation. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov
Add a new IORING_SETUP_SINGLE_ISSUER flag and the userspace-visible part of it, i.e. the limitations it puts on submitters. Also, don't allow it together with IOPOLL, as we're not going to put it to good use. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4bcc41ee467fdf04c8aab8baf6ce3ba21858c3d4.1655371007.git.asml.silence@gmail.com Reviewed-by: Hao Xu <howeyxu@tencent.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
By default, the POLL_ADD command does edge-triggered poll: if we get a non-zero mask on the initial poll attempt, we complete the request successfully. Support level-triggered mode by always waiting for a notification, regardless of whether or not the initial mask matches the file state. Signed-off-by: Jens Axboe <axboe@kernel.dk>
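A sketch of requesting level-triggered behavior, assuming the flag is IORING_POLL_ADD_LEVEL and rides in sqe->len alongside the other poll flags (the convention IORING_POLL_ADD_MULTI already uses):

    #include <liburing.h>
    #include <poll.h>

    /* Sketch: poll a socket for readability, level-triggered, so completion
     * always comes via the waitqueue rather than the initial poll attempt. */
    static int poll_level(struct io_uring *ring, int fd)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        if (!sqe)
            return -1;
        io_uring_prep_poll_add(sqe, fd, POLLIN);
        sqe->len |= IORING_POLL_ADD_LEVEL;  /* poll flags share sqe->len */
        return io_uring_submit(ring);
    }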
-
- 08 July 2022, 1 commit
-
Committed by Pavel Begunkov
The 32-bit sqe->cmd_op is in a union with 64-bit values. It's always a good idea to do the padding explicitly. Also zero-check it in prep, so it can be used in the future if needed without compatibility concerns. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e6b95a05e970af79000435166185e85b196b2ba2.1657202417.git.asml.silence@gmail.com [axboe: turn bitwise OR into logical variant] Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 30 June 2022, 1 commit
-
Committed by Pavel Begunkov
We waste a u64 SQE field on flags, even though we don't need that many bits and it could be used for something more useful later. Store io_uring-specific send/recv flags in sqe->ioprio instead of ->addr2. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Fixes: 0455d4cc ("io_uring: add POLL_FIRST support for send/sendmsg and recv/recvmsg") [axboe: change comment in io_uring.h as well] Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 15 June 2022, 1 commit
-
Committed by Pavel Begunkov
This partially reverts a7c41b46. Even though IORING_CLOSE_FD_AND_FILE_SLOT might save cycles for some users, it tries to do two things at a time, and it's not clear how to handle errors and what to return in a single result field when one part fails and the other completes well. Kill it for now. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/837c745019b3795941eee4fcfd7de697886d645b.1655224415.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 31 May 2022, 1 commit
-
Committed by Xiaoguang Wang
One big issue with the file registration feature is that it needs userspace apps to maintain free-slot info about io_uring's fixed file table, which really is a burden for development. io_uring now supports choosing a free file slot for userspace apps by using the IORING_FILE_INDEX_ALLOC flag in accept, open, and socket operations, but that requires the app to use direct accept or direct open, which not all apps are prepared to adopt yet. To support apps that still need real fds, make the registration feature easier to use: let IORING_OP_FILES_UPDATE support choosing fixed file slots, storing the picked slots in the fd array and letting the cqe return the number of slots allocated. Suggested-by: Hao Xu <howeyxu@tencent.com> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> [axboe: move flag to uapi io_uring header, change goto to break, init] Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 18 May 2022, 1 commit
-
Committed by Jens Axboe
Provided buffers allow an application to supply io_uring with buffers that can then be grabbed for a read/receive request, when the data source is ready to deliver data. The existing scheme relies on using IORING_OP_PROVIDE_BUFFERS to do that, but it can be difficult to use in real-world applications. It's pretty efficient if the application is able to supply back batches of provided buffers when they have been consumed and the application is ready to recycle them, but if fragmentation occurs in the buffer space, it can become difficult to supply enough buffers at a time. This hurts efficiency. Add a register op, IORING_REGISTER_PBUF_RING, which allows an application to set up a shared queue for each buffer group of provided buffers. The application can then supply buffers simply by adding them to this ring, and the kernel can consume them just as easily. The ring shares the head with the application; the tail remains private to the kernel. Provided buffers set up with IORING_REGISTER_PBUF_RING cannot use IORING_OP_{PROVIDE,REMOVE}_BUFFERS for adding or removing entries to the ring; they must use the mapped ring. Mapped provided buffer rings can co-exist with normal provided buffers, just not within the same group ID. To gauge the overhead of the existing scheme and evaluate the mapped ring approach, a simple NOP benchmark was written. It uses a ring of 128 entries, and submits/completes 32 at a time. 'Replenish' is how many buffers are provided back at a time after they have been consumed:

Test                  Replenish   NOPs/sec
================================================================
No provided buffers   NA          ~30M
Provided buffers      32          ~16M
Provided buffers      1           ~10M
Ring buffers          32          ~27M
Ring buffers          1           ~27M

The ring-mapped buffers perform almost as well as not using provided buffers at all, and they don't care whether you provide 1 or more back at the same time. This means applications can just replenish as they go, rather than needing to batch and compact, further reducing overhead in the application. The NOP benchmark above doesn't need to do any compaction, so that overhead isn't even reflected in the test. Co-developed-by: Dylan Yudaken <dylany@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
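A sketch of registering and filling such a mapped buffer ring with liburing's helpers (io_uring_register_buf_ring and the io_uring_buf_ring_* accessors landed in liburing alongside this kernel change; check your version):

    #include <liburing.h>
    #include <stdlib.h>
    #include <string.h>

    #define BGID    1
    #define ENTRIES 128   /* must be a power of two */

    /* Sketch: register a 128-entry buffer ring for group BGID and hand the
     * kernel one buffer per slot; 'bufs' is ENTRIES * buf_len bytes. */
    static struct io_uring_buf_ring *setup_buf_ring(struct io_uring *ring,
                                                    char *bufs, size_t buf_len)
    {
        struct io_uring_buf_ring *br;
        struct io_uring_buf_reg reg;
        int i;

        if (posix_memalign((void **)&br, 4096,
                           ENTRIES * sizeof(struct io_uring_buf)))
            return NULL;
        memset(&reg, 0, sizeof(reg));
        reg.ring_addr = (unsigned long)br;
        reg.ring_entries = ENTRIES;
        reg.bgid = BGID;
        if (io_uring_register_buf_ring(ring, &reg, 0)) {
            free(br);
            return NULL;
        }

        io_uring_buf_ring_init(br);
        for (i = 0; i < ENTRIES; i++)
            io_uring_buf_ring_add(br, bufs + i * buf_len, buf_len, i,
                                  io_uring_buf_ring_mask(ENTRIES), i);
        io_uring_buf_ring_advance(br, ENTRIES);  /* publish the new tail */
        return br;
    }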
-
- 14 May 2022, 1 commit
-
Committed by Hao Xu
Add an accept flag, IORING_ACCEPT_MULTISHOT, to support multishot accept. Signed-off-by: Hao Xu <howeyxu@tencent.com> Link: https://lore.kernel.org/r/20220514142046.58072-2-haoxu.linux@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 13 May 2022, 2 commits
-
Committed by Jens Axboe
Currently, to set up a fully sparse descriptor space upfront, the app needs to allocate an array of the full size, memset it to -1, and then pass that in. Make this a bit easier by allowing a flag that simply does this internally rather than needing to copy each slot separately. This works with IORING_REGISTER_FILES2, as the flag is set in struct io_uring_rsrc_register, and is only allowed when the type is IORING_RSRC_FILE, as this doesn't make sense for registered buffers. Reviewed-by: Hao Xu <howeyxu@tencent.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
If the application passes in IORING_FILE_INDEX_ALLOC as the file_slot, then that's a hint to allocate a fixed file descriptor rather than have one passed in directly. This can be useful for letting io_uring manage the direct descriptor space. Normal direct open requests will complete with 0 for success, and < 0 in case of error. If io_uring is asked to allocate the direct descriptor, then the direct descriptor is returned in case of success. Reviewed-by: Hao Xu <howeyxu@tencent.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
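A sketch of a direct open that lets the kernel pick the slot; assumes a sparse fixed file table is already registered on the ring (io_uring_prep_openat_direct takes the slot as its last argument, and IORING_FILE_INDEX_ALLOC requests auto-allocation):

    #include <liburing.h>
    #include <fcntl.h>

    /* Sketch: open a file straight into a kernel-chosen fixed slot; the
     * allocated slot index comes back in cqe->res on success. */
    static int open_direct_alloc(struct io_uring *ring, const char *path)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        struct io_uring_cqe *cqe;
        int slot;

        io_uring_prep_openat_direct(sqe, AT_FDCWD, path, O_RDONLY, 0,
                                    IORING_FILE_INDEX_ALLOC);
        io_uring_submit(ring);
        if (io_uring_wait_cqe(ring, &cqe))
            return -1;
        slot = cqe->res;            /* >= 0: the fixed slot we were given */
        io_uring_cqe_seen(ring, cqe);
        return slot;
    }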
-
- 11 May 2022, 1 commit
-
Committed by Jens Axboe
file_operations->uring_cmd is a file-private handler. This is somewhat similar to an ioctl, but hopefully a lot saner and more useful, as it can be used to enable many io_uring capabilities for the underlying operation. IORING_OP_URING_CMD is a file-private kind of request. io_uring doesn't know what is in this command type; it's for the provider of ->uring_cmd() to deal with. Co-developed-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220511054750.20432-2-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 09 May 2022, 2 commits
-
Committed by Stefan Roesch
This adds the big_cqe array to struct io_uring_cqe to support large CQEs. Co-developed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20220426182134.136504-2-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
Normal SQEs are 64 bytes in length, which is fine for all the commands we support. However, in preparation for supporting passthrough IO, provide an option for setting up a ring with 128-byte SQEs. We continue to use the same type for io_uring_sqe; it's marked and commented with a zero-sized array pad at the end. This provides up to 80 bytes of data for a passthrough command: 64 bytes of extra added data, plus 16 bytes available at the end of the existing SQE. Signed-off-by: Jens Axboe <axboe@kernel.dk>
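Setting this up is just a matter of setup flags; a sketch pairing the 128-byte SQEs with the 32-byte CQEs from the previous commit, the combination passthrough users typically want:

    #include <liburing.h>
    #include <string.h>

    /* Sketch: a ring suitable for passthrough commands, with big SQEs
     * (an extra 64 bytes of command payload) and big CQEs. */
    static int setup_passthrough_ring(struct io_uring *ring)
    {
        struct io_uring_params p;

        memset(&p, 0, sizeof(p));
        p.flags = IORING_SETUP_SQE128 | IORING_SETUP_CQE32;
        return io_uring_queue_init_params(64, ring, &p);
    }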
-
- 06 May 2022, 1 commit
-
Committed by Jens Axboe
If IORING_RECVSEND_POLL_FIRST is set for recv/recvmsg or send/sendmsg, then we arm poll first rather than attempting a receive or send upfront. This can be useful if we expect there to be no data (or space) available for the request, as we can then avoid wasting time on the initial issue attempt. Reviewed-by: Hao Xu <howeyxu@tencent.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
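A sketch of opting into poll-first on a receive; the flag ended up in sqe->ioprio with the other io_uring-specific send/recv flags (see the 30 June 2022 entry above that moved them there):

    #include <liburing.h>

    /* Sketch: skip the speculative first recv attempt and go straight to
     * poll; the receive runs once the socket signals readability. */
    static int recv_poll_first(struct io_uring *ring, int sock,
                               void *buf, unsigned len)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        if (!sqe)
            return -1;
        io_uring_prep_recv(sqe, sock, buf, len, 0);
        sqe->ioprio |= IORING_RECVSEND_POLL_FIRST;
        return io_uring_submit(ring);
    }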
-
- 30 April 2022, 3 commits
-
Committed by Jens Axboe
If IORING_SETUP_COOP_TASKRUN is set to use cooperative scheduling for running task_work, then IORING_SETUP_TASKRUN_FLAG can be set so the application can tell whether task_work is pending in the kernel for this ring. This allows use cases like io_uring_peek_cqe() to still function appropriately, and lets the task know when it would be useful to call io_uring_wait_cqe() to run pending events. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20220426014904.60384-7-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
If this is set, io_uring will never use an IPI to deliver a task_work notification. This can be used in the common case where a single task or thread communicates with the ring and doesn't rely on io_uring_cqe_peek(). This provides a noticeable win in performance, both from eliminating the IPI itself and from avoiding interrupting the submitting task unnecessarily. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20220426014904.60384-6-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
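The two task_work flags above are typically set together; a sketch (the IORING_SQ_TASKRUN bit referenced in the comment is what the kernel raises in the SQ ring flags, and newer liburing checks it internally when peeking):

    #include <liburing.h>
    #include <string.h>

    /* Sketch: a single-submitter style ring that avoids IPIs for task_work
     * and advertises pending work via IORING_SQ_TASKRUN in the SQ ring
     * flags, so peek-style loops still notice kernel-side completions. */
    static int setup_coop_ring(struct io_uring *ring)
    {
        struct io_uring_params p;

        memset(&p, 0, sizeof(p));
        p.flags = IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG;
        return io_uring_queue_init_params(256, ring, &p);
    }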
-
Committed by Jens Axboe
For now, just use a CQE flag for this; with big CQE support, we could return the actual number of bytes left. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 26 April 2022, 1 commit
-
Committed by Dylan Yudaken
It is useful to have a type enum for opcodes, to allow the compiler to assert that every value is handled in a switch statement. Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220426082907.3600028-2-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 25 April 2022, 6 commits
-
Committed by Jens Axboe
Supports both regular socket(2), where a normal file descriptor is instantiated when called, and direct descriptors. Link: https://lore.kernel.org/r/20220412202240.234207-3-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Stefan Roesch
This adds support to io_uring for the fgetxattr and getxattr API. Signed-off-by: Stefan Roesch <shr@fb.com> Acked-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20220323154420.3301504-5-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Stefan Roesch
This adds support to io_uring for the fsetxattr and setxattr API. Signed-off-by: Stefan Roesch <shr@fb.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20220323154420.3301504-4-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
Rather than matching on a specific key, be it user_data or file, allow canceling any request that we can look up. Works like IORING_ASYNC_CANCEL_ALL in that it cancels multiple requests, but it doesn't key off user_data or the file. Can't be set together with IORING_ASYNC_CANCEL_FD, as that's a key selector; only one may be used at a time. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20220418164402.75259-6-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
Currently, sqe->addr must contain the user_data of the request being canceled. Introduce the IORING_ASYNC_CANCEL_FD flag, which tells the kernel that we're keying off the file fd instead for cancelation. This allows canceling any request that a) uses a file, and b) was assigned the file based on the value being passed in. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20220418164402.75259-5-axboe@kernel.dk
-
Committed by Jens Axboe
The current cancelation will look up and cancel the first request it finds based on the key passed in. Add a flag that allows canceling any request that matches the key. It completes with the number of requests found and canceled, or res < 0 if an error occurred. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20220418164402.75259-4-axboe@kernel.dk
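The cancel flags from the three entries above compose naturally; a sketch using liburing's io_uring_prep_cancel_fd (which sets IORING_ASYNC_CANCEL_FD itself) to cancel everything pending on one fd:

    #include <liburing.h>

    /* Sketch: cancel every request matching the given fd; the number of
     * requests found and canceled comes back in cqe->res. */
    static int cancel_all_on_fd(struct io_uring *ring, int fd)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        struct io_uring_cqe *cqe;
        int nr;

        io_uring_prep_cancel_fd(sqe, fd, IORING_ASYNC_CANCEL_ALL);
        io_uring_submit(ring);
        if (io_uring_wait_cqe(ring, &cqe))
            return -1;
        nr = cqe->res;      /* count of canceled requests, or -errno */
        io_uring_cqe_seen(ring, cqe);
        return nr;
    }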
-
- 11 April 2022, 1 commit
-
Committed by Jens Axboe
Give applications a way to tell whether the kernel supports sane linked files, as in files being assigned at the right time to reliably do <open file direct into slot X><read file from slot X> while using IOSQE_IO_LINK to order them. Not really a bug fix, but flag it as such so that it gets pulled in with backports of the deferred file assignment. Fixes: 6bf9c47a ("io_uring: defer file assignment") Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 24 March 2022, 1 commit
-
Committed by Jens Axboe
This was introduced with the message ring opcode, but isn't strictly required for the request itself. The sender can encode what is needed in user_data, which is passed to the receiver. It's unclear whether having a separate flag that essentially says "this CQE did not originate from an SQE on this ring" provides any real utility to applications. While we can always re-introduce a flag to provide this information, we cannot take it away at a later point in time. Remove the flag while we still can, before it's in a released kernel. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 11 March 2022, 2 commits
-
Committed by Jens Axboe
By default, io_uring will stop submitting a batch of requests if we run into an error submitting one of them. This isn't strictly necessary, as the error result is passed out-of-band via a CQE anyway, and it can be a bit confusing for some applications. Provide a way to set up a ring that will continue submitting on error, once the error CQE has been posted. There's still one case that will break out of submission: if we fail allocating a request, we'll still return -ENOMEM. We could in theory post a CQE for that condition too, even if we never got a request; leave that for a potential followup. Reported-by: Dylan Yudaken <dylany@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
This adds support for IORING_OP_MSG_RING, which allows an SQE to signal another ring. That allows either waking up someone waiting on the ring, or even passing a 64-bit value via the user_data field in the CQE. sqe->fd must contain the fd of the ring that should receive the CQE. sqe->off will be propagated to cqe->user_data on the target ring, and sqe->len will be propagated to cqe->res. The resulting CQE will have IORING_CQE_F_MSG set in its flags, to indicate that this CQE was generated from a messaging request rather than an SQE issued locally on that ring. This effectively allows passing a 64-bit and a 32-bit quantity between the two rings. This request type has the following request-specific error cases:
-EBADFD     Set if sqe->fd doesn't point to a file descriptor of the io_uring type.
-EOVERFLOW  Set if we were not able to deliver a request to the target ring.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
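A sketch of signaling another ring via liburing's io_uring_prep_msg_ring helper, whose arguments map onto the sqe fields described above:

    #include <liburing.h>

    /* Sketch: post a CQE on the target ring with user_data 0xabcd and
     * res 100, e.g. to wake a sibling thread parked in wait_cqe(). */
    static int signal_ring(struct io_uring *src, int target_ring_fd)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(src);

        if (!sqe)
            return -1;
        /* len -> target cqe->res, data -> target cqe->user_data */
        io_uring_prep_msg_ring(sqe, target_ring_fd, 100, 0xabcd, 0);
        return io_uring_submit(src);
    }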
-