- 22 September 2022, 14 commits
-
-
Committed by Pavel Begunkov
We try to restrict CQ waiters when IORING_SETUP_DEFER_TASKRUN is set, but if nothing has been submitted yet it'll allow any waiter, which violates the contract.

Fixes: c0e0d6ba ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/b4f0d3f14236d7059d08c5abe2661ef0b78b5528.1662652536.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov
In case of DEFER_TASK_WORK we try to restrict waiters to only one task, which is also the only submitter; however, we don't do it reliably, which might be very confusing and backfire in the future. E.g. we currently allow multiple tasks in io_iopoll_check().

Fixes: c0e0d6ba ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/94c83c0a7fe468260ee2ec31bdb0095d6e874ba2.1662652536.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov

There is one place where we forgot to replace hand-coded spin locking with io_cq_lock(); change it to be more consistent. Note, the unlock part already uses __io_cq_unlock_post().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/91699b9a00a07128f7ca66136bdbbfc67a64659e.1662639236.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov

Request referencing changed a while ago and there is no notion left of submission/completion references; kill an outdated comment.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/38902e7229d68cecd62702436d627d4858b0d9d4.1662639236.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
Combine the two checks we have for task_work running and whether or not we need to shuffle the mutex into one, so we unify how task_work is run in the iopoll loop. This helps ensure that local task_work is run when needed, and also optimizes that path to avoid a mutex shuffle if it's not needed.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
We have a few spots that drop the mutex just to run local task_work, which immediately tries to grab it again. Add a helper that just passes in whether we're locked already.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Kanchan Joshi

Put this up in the same way as iopoll is done for regular read/write IO. Make a place for storing a cookie in struct io_uring_cmd on submission. Perform the completion using the ->uring_cmd_iopoll handler.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20220823161443.49436-3-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Dylan Yudaken
Add tracing for io_run_local_task_work.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-8-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Dylan Yudaken

Some workloads rely on a registered eventfd (via io_uring_register_eventfd(3)) in order to wake up and process the io_uring. In the case of a ring set up with IORING_SETUP_DEFER_TASKRUN, that eventfd also needs to be signalled when there are tasks to run. This changes an old behaviour which assumed 1 eventfd signal implied at least 1 CQE; however, it only applies when this new flag is set (so old users will not notice). This should be expected with the IORING_SETUP_DEFER_TASKRUN flag, as it is not guaranteed that every task will result in a CQE.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-7-dylany@fb.com
[axboe: fold in call_rcu() serialization fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
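To make the new behaviour concrete, here is a minimal userspace sketch, assuming liburing's io_uring_register_eventfd() and io_uring_get_events() helpers; the wake-up loop is illustrative and not taken from the patch itself:

#include <liburing.h>
#include <sys/eventfd.h>
#include <unistd.h>
#include <stdint.h>

/* Minimal sketch: with IORING_SETUP_DEFER_TASKRUN the eventfd may be
 * signalled because deferred task work is pending, not necessarily
 * because a CQE has been posted. */
static int wait_and_reap(struct io_uring *ring, int efd)
{
    uint64_t cnt;

    if (read(efd, &cnt, sizeof(cnt)) != sizeof(cnt))
        return -1;

    /* Enter the kernel with GETEVENTS so deferred task work runs and
     * any resulting CQEs are flushed; there may still be zero CQEs. */
    return io_uring_get_events(ring);
}

/* Registration, done once after ring setup (assumed helpers):
 *     int efd = eventfd(0, EFD_CLOEXEC);
 *     io_uring_register_eventfd(ring, efd);
 */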
-
Committed by Dylan Yudaken

Non-functional change: move this function above io_eventfd_signal so it can be used from there.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-6-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Dylan Yudaken

Allow deferring async tasks until the user calls io_uring_enter(2) with the IORING_ENTER_GETEVENTS flag. Enable this mode with a flag at io_uring_setup time. This functionality requires that the later io_uring_enter will be called from the same submission task, and therefore this flag is restricted to work only when IORING_SETUP_SINGLE_ISSUER is also set.

Being able to hand-pick when tasks are run prevents the problem where there is current work to be done, yet task work runs anyway. For example, a common workload would obtain a batch of CQEs and process each one. Interrupting this to run additional task work would add latency but not gain anything. If instead task work is deferred to just before more CQEs are obtained, then no additional latency is added.

The way this is implemented is by trying to keep task work local to an io_ring_ctx, rather than to the submission task. This is required, as the application will want to wake up only a single io_ring_ctx at a time to process work, and so the lists of work have to be kept separate. This has some other benefits, like not having to check the task continually in handle_tw_list (and potentially unlocking/locking those), and reducing locks in the submit & process completions path.

There are networking cases where using this option can reduce request latency by 50%. For example, a contrived benchmark using [1], where the client sends 2k data and receives the same data back while doing some system calls (to trigger task work), shows this reduction. The reason ends up being that if sending responses is delayed by processing task work, then the client side sits idle. Whereas reordering the sends first means that the client runs its workload in parallel with the local task work.

[1]: https://github.com/DylanZA/netbench/tree/defer_run
Client:
./netbench --client_only 1 --control_port 10000 --host <host> --tx "epoll --threads 16 --per_thread 1 --size 2048 --resp 2048 --workload 1000"
Server:
./netbench --server_only 1 --control_port 10000 --rx "io_uring --defer_taskrun 0 --workload 100" --rx "io_uring --defer_taskrun 1 --workload 100"

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-5-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
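For orientation, a minimal setup sketch follows, assuming liburing exposes the two setup flags and the usual submit/wait helpers; it illustrates the intended usage pattern rather than being part of the patch:

/* Minimal sketch, assuming liburing with IORING_SETUP_SINGLE_ISSUER and
 * IORING_SETUP_DEFER_TASKRUN definitions; error handling trimmed. */
#include <liburing.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    struct io_uring ring;
    int ret;

    ret = io_uring_queue_init(64, &ring,
                              IORING_SETUP_SINGLE_ISSUER |
                              IORING_SETUP_DEFER_TASKRUN);
    if (ret < 0) {
        fprintf(stderr, "queue_init: %s\n", strerror(-ret));
        return 1;
    }

    /* ... prepare SQEs from this task only ... */

    /* Deferred task work runs here, just before CQEs are waited for. */
    io_uring_submit_and_wait(&ring, 1);

    io_uring_queue_exit(&ring);
    return 0;
}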
-
Committed by Dylan Yudaken

This is not needed, and it is normally better to wait for task work until after submissions. This will allow greater batching if either work arrives in the meanwhile, or if the submissions cause task work to be queued up. For SQPOLL this also no longer runs task work, but that is handled inside the SQPOLL loop anyway. For IOPOLL, io_iopoll_check will run task work anyway. And otherwise io_cqring_wait will run task work.

Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-4-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Dylan Yudaken

This will be used later to know if the ring has outstanding work. Right now that just means there are overflow CQEs to copy to the main CQE ring, but later it will include deferred task work.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Dylan Yudaken

'running' is set once and read once, so we can easily just remove it.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-2-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 08 September 2022, 1 commit
-
-
Committed by Pavel Begunkov

When we queue a request via tw for execution, it's not going to be executed immediately, so when io_queue_async() hits IO_APOLL_READY and queues a tw but doesn't try to recycle/consume the buffer, some other request may try to use the buffer.

Fixes: c7fb1942 ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a19bc9e211e3184215a58e129b62f440180e9212.1662480490.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 01 September 2022, 2 commits
-
-
Committed by Pavel Begunkov

Following user feedback, this patch simplifies the zerocopy send API. One of the main complaints is that the current API is difficult to use with the userspace managing notification slots, and send retries with error handling make it even worse.

Instead of keeping notification slots, change it to the per-request notifications model, which posts both completion and notification CQEs for each request when any data has been sent, and only one CQE if it fails. All notification CQEs will have IORING_CQE_F_NOTIF set, and IORING_CQE_F_MORE in completion CQEs indicates whether to wait for a notification or not. IOSQE_CQE_SKIP_SUCCESS is disallowed with zerocopy sends for now.

This is less flexible, but greatly simplifies the user API and also the kernel implementation. We reuse notif helpers in this patch, but in the future there won't be a need for keeping two requests.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/95287640ab98fc9417370afb16e310677c63e6ce.1662027856.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
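As a hedged illustration of the per-request model from userspace (assuming liburing's io_uring_prep_send_zc() helper; sockfd/buf/len are placeholders, error handling trimmed), each send may produce two CQEs: the completion, possibly flagged IORING_CQE_F_MORE, followed later by a notification flagged IORING_CQE_F_NOTIF once the kernel is done with the buffer:

#include <liburing.h>

static int send_zc_once(struct io_uring *ring, int sockfd,
                        const void *buf, size_t len)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    struct io_uring_cqe *cqe;
    int more;

    io_uring_prep_send_zc(sqe, sockfd, buf, len, 0, 0);
    io_uring_submit(ring);

    /* First CQE: the send result. */
    if (io_uring_wait_cqe(ring, &cqe))
        return -1;
    more = cqe->flags & IORING_CQE_F_MORE;
    io_uring_cqe_seen(ring, cqe);

    if (!more)
        return 0;   /* no notification will follow; buffer is free */

    /* Second CQE: the IORING_CQE_F_NOTIF notification; the buffer may
     * only be reused after this arrives. */
    if (io_uring_wait_cqe(ring, &cqe))
        return -1;
    io_uring_cqe_seen(ring, cqe);
    return 0;
}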
-
Committed by Pavel Begunkov

We're going to remove the userspace-exposed zerocopy notification API; remove notification registration.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6ff00b97be99869c386958a990593c9c31cf105b.1662027856.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 24 August 2022, 1 commit
-
-
Committed by Pavel Begunkov

There are opcodes that need ->async_data only in some cases, and allocating it unconditionally may hurt performance. Add an option to opdef to move the allocation part from the core io_uring code to opcode-specific code. Note, we can't just set opdef->async_size to zero because there are other helpers that rely on it, e.g. io_alloc_async_data().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9dc62be9e88dd0ed63c48365340e8922d2498293.1661342812.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 13 August 2022, 1 commit
-
-
Committed by Stefan Metzmacher
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Link: https://lore.kernel.org/r/ffcaf8dc4778db4af673822df60dbda6efdd3065.1660201408.git.metze@samba.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 27 July 2022, 2 commits
-
-
Committed by Pavel Begunkov

We want to use all the optimisations that we have for io_uring requests, like completion batching, memory caching and more, but for zc notifications. Fortunately, notifications perfectly fit the request model, so we can overlay them onto struct io_kiocb and use all the infrastructure.

Most of the fields of struct io_notif natively fit into io_kiocb, so we replace struct io_notif with struct io_kiocb carrying struct io_notif_data in the cmd cache line. Then we adapt io_alloc_notif() to use io_alloc_req()/io_alloc_req_refill(), and kill leftovers of hand-coded caching. __io_notif_complete_tw() is converted to use io_uring's tw infra.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9e010125175e80baf51f0ca63bdc7cc6a4a9fa56.1658913593.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov

We want to do request allocation outside of the core io_uring code; make the allocation functions public for other io_uring parts.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0314fedd3a02a514210ba42d4720332538c65956.1658913593.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 25 July 2022, 19 commits
-
-
Committed by Pavel Begunkov

Allow flushing notifiers as part of a sendzc request by setting the IORING_SENDZC_FLUSH flag. When the sendzc request succeeds, it will flush the used [active] notifier.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e0b4d9a6797e2fd6092824fe42953db7a519bbc8.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov

Let userspace register and unregister notification slots.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a0aa8161fe3ebb2a4cc6e5dbd0cffb96e6881cf5.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov

kmalloc'ing struct io_notif is too expensive when done frequently; cache them like many other resources in io_uring. Keep two lists: the first one, from which we get notifiers, is protected by ->uring_lock; the second one, to which we queue released notifiers, is protected by ->completion_lock. Then we splice one list into the other when needed.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9dec18f7fcbab9f4bd40b96e5ae158b119945230.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov

Add the internal part of send zerocopy notifications. There are two main structures. The first one is struct io_notif, which carries struct ubuf_info inside and maps 1:1 to it. io_uring will be binding a number of zerocopy send requests to it and asking to complete (aka flush) it. When flushed and all attached requests and skbs complete, it'll generate one and only one CQE. They are intended to be passed into the network layer as struct msghdr::msg_ubuf.

The second concept is notification slots. The userspace will be able to register an array of slots and subsequently address them by index in the array. Slots are independent of each other. Each slot can have only one notifier at a time (called the active notifier) but many notifiers during its lifetime. When active, a notifier is not going to post any completions, but the userspace can attach requests to it by specifying the corresponding slot while issuing send zc requests. Eventually, the userspace will want to "flush" the notifier, losing any way to attach new requests to it; however, it can use the next automatically added notifier of this slot or of any other slot.

When the network layer is done with all enqueued skbs attached to a notifier and doesn't need the user data specified in them, the flushed notifier will post a CQE.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3ecf54c31a85762bf679b0a432c9f43ecf7e61cc.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov

Make io_put_task() available to non-core parts of io_uring; we'll need it for the notification infrastructure.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3686807d4c03b72e389947b0e8692d4d44334ef0.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
If we're offloading requests directly to io-wq because IOSQE_ASYNC was set in the sqe, we can miss hashing writes appropriately because we haven't set REQ_F_ISREG yet. This can cause a performance regression with buffered writes, as io-wq then no longer correctly serializes writes to that file. Ensure that we set the flags in io_prep_async_work(), which will cause the io-wq work item to be hashed appropriately.

Fixes: 584b0180 ("io_uring: move read/write file prep state into actual opcode handler")
Link: https://lore.kernel.org/io-uring/20220608080054.GB22428@xsang-OptiPlex-9020/
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Yin Fengwei <fengwei.yin@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Michal Koutný

The commit 8bb649ee ("io_uring: remove ring quiesce for io_uring_register") removed the workflow relying on reinit/resurrection of the percpu_ref, hence initialization with that requested is a relic.

This is based on code review; it causes no real bug (and theoretically can't). Technically it's a revert of commit 21482896 ("io_uring: initialize percpu refcounters using PERCU_REF_ALLOW_REINIT"), but since the flag omission is now justified, I'm not making this a revert.

Fixes: 8bb649ee ("io_uring: remove ring quiesce for io_uring_register")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
For recvmsg/sendmsg, if they don't complete inline, we currently need to allocate a struct io_async_msghdr for each request. This is a somewhat large struct. Hook up sendmsg/recvmsg to use the io_alloc_cache. This reduces the alloc + free overhead considerably, yielding 4-5% of extra performance running netbench.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe

Caches like this tend to grow to the peak size, and then never get any smaller. Impose a max limit on the size, to prevent it from growing too big. A somewhat randomly chosen 512 is the max size we'll allow the cache to get. If a batch of frees comes in and would bring it over that, we simply start kfree'ing the surplus.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
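To illustrate the bounded-cache idea in isolation (a generic, hedged sketch in userspace C, not the kernel's actual io_alloc_cache code; names and the malloc/free fallback are assumptions):

#include <stdlib.h>

#define CACHE_MAX 512   /* cap chosen to mirror the commit's limit */

struct cache_entry {
    struct cache_entry *next;
};

struct obj_cache {
    struct cache_entry *head;
    unsigned int nr;
};

static void cache_put(struct obj_cache *c, void *obj)
{
    struct cache_entry *e = obj;

    if (c->nr >= CACHE_MAX) {   /* cache is full: free the surplus */
        free(obj);
        return;
    }
    e->next = c->head;
    c->head = e;
    c->nr++;
}

static void *cache_get(struct obj_cache *c, size_t size)
{
    struct cache_entry *e = c->head;

    if (!e)
        return malloc(size);    /* cache empty: fall back to the allocator */
    c->head = e->next;
    c->nr--;
    return e;
}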
-
Committed by Jens Axboe
In preparation for adding limits, and one more user, abstract out the core bits of the allocation+free cache.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
This is where it's used, move the flush handler in there.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Dylan Yudaken

In overflow we see a duplicate line in the trace, and in some cases 3 lines (if the initial io_post_aux_cqe fails). Instead just trace once for each CQE.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-13-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Dylan Yudaken

Some use cases of io_post_aux_cqe would not want to overflow as is, but might want to change the flags/result. For example, multishot receive requires in-order CQEs, and so if there is an overflow it would need to stop receiving until the overflow is taken care of.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-8-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov

io_uring recently started providing an option to allocate a file index for operations registering fixed files. However, it's utterly unusable with mixed approaches, when for a part of the files the userspace knows better where to place them, as it may race, and users don't have any sane way to pick a slot while hoping it will not be taken.

Let the userspace register a range of fixed file slots in which the auto-allocation happens. The use case is splitting the fixed table in two parts, where one of them is used for auto-allocation and the other for slot-specified operations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/66ab0394e436f38437cf7c44676e1920d09687ad.1656154403.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
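A hedged sketch of the split-table use case, assuming liburing's io_uring_register_files_sparse() and io_uring_register_file_alloc_range() helpers; the table size and slot split are arbitrary choices for illustration:

#include <liburing.h>

static int setup_file_table(struct io_uring *ring)
{
    int ret;

    /* Reserve 64 fixed-file slots, all initially empty. */
    ret = io_uring_register_files_sparse(ring, 64);
    if (ret < 0)
        return ret;

    /* Slots 0..31 stay under explicit userspace control; slots 32..63
     * are handed to the kernel for auto-allocation (e.g. direct-open
     * or accept requests using IORING_FILE_INDEX_ALLOC). */
    return io_uring_register_file_alloc_range(ring, 32, 32);
}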
-
Committed by Pavel Begunkov

io_uring_enter() takes ctx->refs, which previously prevented racing with register quiesce. However, as register now doesn't touch the refs, we can freely kill the extra ctx pinning and rely on the fact that we're holding a file reference preventing the ring from being destroyed.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a11c57ad33a1be53541fce90669c1b79cf4d8940.1656153286.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov

Registered rings are by definition io_uring files, so we don't need to additionally verify them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/425cd64fd885b8e329a46c205ee811987691baaf.1656153286.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov

io_run_task_work() accounts for TIF_NOTIFY_SIGNAL, so there is no need to have a second check in io_run_task_work_sig().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/52ce41a592ad904511697f432141e5690fd4b968.1656153285.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov

Now that both normal and fallback paths use llist, just keep one node head in struct io_task_work and kill off ->fallback_node.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d04ebde409f7b162fe247b361b4486b193293e46.1656153285.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe

The io_uring cancelation API is async, like any other API that we expose there. For the case of finding a request to cancel, or not finding one, it is fully sync in that when submission returns, the CQE for both the cancelation request and the targeted request have been posted to the CQ ring. However, if the targeted work is being executed by io-wq, the API can only start the act of canceling it. This makes it difficult to use in some circumstances, as the caller then has to wait for the CQEs to come in and match on the same cancelation data there.

Provide an IORING_REGISTER_SYNC_CANCEL command for io_uring_register() that does sync cancelations, always. For the io-wq case, it'll wait for the cancelation to come in before returning. The only expected returns from this API are:

0        Request found and canceled fine.
> 0      Requests found and canceled. Only happens if asked to cancel multiple requests, and if the work wasn't in progress.
-ENOENT  Request not found.
-ETIME   A timeout on the operation was requested, but the timeout expired before we could cancel.

and we won't get -EALREADY via this API.

If the timeout value passed in is -1 (tv_sec and tv_nsec), then that means that no timeout is requested. Otherwise, the timespec passed in is the amount of time the sync cancel will wait for a successful cancelation.

Link: https://github.com/axboe/liburing/discussions/608
Signed-off-by: Jens Axboe <axboe@kernel.dk>
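A hedged userspace sketch of the synchronous cancel interface, assuming liburing's io_uring_register_sync_cancel() wrapper and struct io_uring_sync_cancel_reg; user_data is whatever was attached to the request being cancelled:

#include <liburing.h>
#include <string.h>

static int cancel_sync(struct io_uring *ring, __u64 user_data)
{
    struct io_uring_sync_cancel_reg reg;

    memset(&reg, 0, sizeof(reg));
    reg.addr = user_data;           /* match by user_data */
    reg.timeout.tv_sec = -1;        /* -1/-1 means "no timeout", as above */
    reg.timeout.tv_nsec = -1;

    /* 0 or > 0: found and canceled; -ENOENT: not found; -ETIME: timed out. */
    return io_uring_register_sync_cancel(ring, &reg);
}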
-