- 25 7月, 2022 40 次提交
-
-
由 Pavel Begunkov 提交于
Allow to flush notifiers as a part of sendzc request by setting IORING_SENDZC_FLUSH flag. When the sendzc request succeedes it will flush the used [active] notifier. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e0b4d9a6797e2fd6092824fe42953db7a519bbc8.1657643355.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
Let the userspace to register and unregister notification slots. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a0aa8161fe3ebb2a4cc6e5dbd0cffb96e6881cf5.1657643355.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
kmalloc'ing struct io_notif is too expensive when done frequently, cache them as many other resources in io_uring. Keep two list, the first one is from where we're getting notifiers, it's protected by ->uring_lock. The second is protected by ->completion_lock, to which we queue released notifiers. Then we splice one list into another when needed. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9dec18f7fcbab9f4bd40b96e5ae158b119945230.1657643355.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
Add internal part of send zerocopy notifications. There are two main structures, the first one is struct io_notif, which carries inside struct ubuf_info and maps 1:1 to it. io_uring will be binding a number of zerocopy send requests to it and ask to complete (aka flush) it. When flushed and all attached requests and skbs complete, it'll generate one and only one CQE. There are intended to be passed into the network layer as struct msghdr::msg_ubuf. The second concept is notification slots. The userspace will be able to register an array of slots and subsequently addressing them by the index in the array. Slots are independent of each other. Each slot can have only one notifier at a time (called active notifier) but many notifiers during the lifetime. When active, a notifier not going to post any completion but the userspace can attach requests to it by specifying the corresponding slot while issueing send zc requests. Eventually, the userspace will want to "flush" the notifier losing any way to attach new requests to it, however it can use the next atomatically added notifier of this slot or of any other slot. When the network layer is done with all enqueued skbs attached to a notifier and doesn't need the specified in them user data, the flushed notifier will post a CQE. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3ecf54c31a85762bf679b0a432c9f43ecf7e61cc.1657643355.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
Make io_put_task() available to non-core parts of io_uring, we'll need it for notification infrastructure. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3686807d4c03b72e389947b0e8692d4d44334ef0.1657643355.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
If we're offloading requests directly to io-wq because IOSQE_ASYNC was set in the sqe, we can miss hashing writes appropriately because we haven't set REQ_F_ISREG yet. This can cause a performance regression with buffered writes, as io-wq then no longer correctly serializes writes to that file. Ensure that we set the flags in io_prep_async_work(), which will cause the io-wq work item to be hashed appropriately. Fixes: 584b0180 ("io_uring: move read/write file prep state into actual opcode handler") Link: https://lore.kernel.org/io-uring/20220608080054.GB22428@xsang-OptiPlex-9020/Reported-by: Nkernel test robot <oliver.sang@intel.com> Tested-by: NYin Fengwei <fengwei.yin@intel.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Michal Koutný 提交于
The commit 8bb649ee ("io_uring: remove ring quiesce for io_uring_register") removed the worklow relying on reinit/resurrection of the percpu_ref, hence, initialization with that requested is a relic. This is based on code review, this causes no real bug (and theoretically can't). Technically it's a revert of commit 21482896 ("io_uring: initialize percpu refcounters using PERCU_REF_ALLOW_REINIT") but since the flag omission is now justified, I'm not making this a revert. Fixes: 8bb649ee ("io_uring: remove ring quiesce for io_uring_register") Signed-off-by: NMichal Koutný <mkoutny@suse.com> Acked-by: NRoman Gushchin <roman.gushchin@linux.dev> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
For recvmsg/sendmsg, if they don't complete inline, we currently need to allocate a struct io_async_msghdr for each request. This is a somewhat large struct. Hook up sendmsg/recvmsg to use the io_alloc_cache. This reduces the alloc + free overhead considerably, yielding 4-5% of extra performance running netbench. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
Caches like this tend to grow to the peak size, and then never get any smaller. Impose a max limit on the size, to prevent it from growing too big. A somewhat randomly chosen 512 is the max size we'll allow the cache to get. If a batch of frees come in and would bring it over that, we simply start kfree'ing the surplus. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
In preparation for adding limits, and one more user, abstract out the core bits of the allocation+free cache. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
This is where it's used, move the flush handler in there. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Dylan Yudaken 提交于
In overflow we see a duplcate line in the trace, and in some cases 3 lines (if initial io_post_aux_cqe fails). Instead just trace once for each CQE Signed-off-by: NDylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220630091231.1456789-13-dylany@fb.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Dylan Yudaken 提交于
Some use cases of io_post_aux_cqe would not want to overflow as is, but might want to change the flags/result. For example multishot receive requires in order CQE, and so if there is an overflow it would need to stop receiving until the overflow is taken care of. Signed-off-by: NDylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220630091231.1456789-8-dylany@fb.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
From recently io_uring provides an option to allocate a file index for operation registering fixed files. However, it's utterly unusable with mixed approaches when for a part of files the userspace knows better where to place it, as it may race and users don't have any sane way to pick a slot and hoping it will not be taken. Let the userspace to register a range of fixed file slots in which the auto-allocation happens. The use case is splittting the fixed table in two parts, where on of them is used for auto-allocation and another for slot-specified operations. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/66ab0394e436f38437cf7c44676e1920d09687ad.1656154403.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
io_uring_enter() takes ctx->refs, which was previously preventing racing with register quiesce. However, as register now doesn't touch the refs, we can freely kill extra ctx pinning and rely on the fact that we're holding a file reference preventing the ring from being destroyed. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a11c57ad33a1be53541fce90669c1b79cf4d8940.1656153286.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
Registered rings are per definitions io_uring files, so we don't need to additionally verify them. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/425cd64fd885b8e329a46c205ee811987691baaf.1656153286.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
io_run_task_work() accounts for TIF_NOTIFY_SIGNAL, so no need to have an second check in io_run_task_work_sig(). Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/52ce41a592ad904511697f432141e5690fd4b968.1656153285.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
Now as both normal and fallback paths use llist, just keep one node head in struct io_task_work and kill off ->fallback_node. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d04ebde409f7b162fe247b361b4486b193293e46.1656153285.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
The io_uring cancelation API is async, like any other API that we expose there. For the case of finding a request to cancel, or not finding one, it is fully sync in that when submission returns, the CQE for both the cancelation request and the targeted request have been posted to the CQ ring. However, if the targeted work is being executed by io-wq, the API can only start the act of canceling it. This makes it difficult to use in some circumstances, as the caller then has to wait for the CQEs to come in and match on the same cancelation data there. Provide a IORING_REGISTER_SYNC_CANCEL command for io_uring_register() that does sync cancelations, always. For the io-wq case, it'll wait for the cancelation to come in before returning. The only expected returns from this API is: 0 Request found and canceled fine. > 0 Requests found and canceled. Only happens if asked to cancel multiple requests, and if the work wasn't in progress. -ENOENT Request not found. -ETIME A timeout on the operation was requested, but the timeout expired before we could cancel. and we won't get -EALREADY via this API. If the timeout value passed in is -1 (tv_sec and tv_nsec), then that means that no timeout is requested. Otherwise, the timespec passed in is the amount of time the sync cancel will wait for a successful cancelation. Link: https://github.com/axboe/liburing/discussions/608Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Dylan Yudaken 提交于
trace task_work_run to help provide stats on how often task work is run and what batch sizes are coming through. Signed-off-by: NDylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220622134028.2013417-9-dylany@fb.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Dylan Yudaken 提交于
Batching task work up is an important performance optimisation, as task_work_add is expensive. In order to keep the semantics replace the task_list with a fake node while processing the old list, and then do a cmpxchg at the end to see if there is more work. Signed-off-by: NDylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220622134028.2013417-6-dylany@fb.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Dylan Yudaken 提交于
Introduce helpers to atomically switch llist. Will later move this into common code Signed-off-by: NDylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220622134028.2013417-5-dylany@fb.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Dylan Yudaken 提交于
With networking use cases we see contention on the spinlock used to protect the task_list when multiple threads try and add completions at once. Instead we can use a lockless list, and assume that the first caller to add to the list is responsible for kicking off task work. Signed-off-by: NDylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220622134028.2013417-4-dylany@fb.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Dylan Yudaken 提交于
this is no longer needed as there is only one caller Signed-off-by: NDylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220622134028.2013417-3-dylany@fb.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Dylan Yudaken 提交于
This optimisation has some built in assumptions that make it easy to introduce bugs. It also does not have clear wins that make it worth keeping. Signed-off-by: NDylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220622134028.2013417-2-dylany@fb.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
Improve naming of the inline/deferred completion helper so it's consistent with it's *_post counterpart. Add some comments and extra lockdeps to ensure the locking is done right. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/797c619943dac06529e9d3fcb16e4c3cde6ad1a3.1655684496.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
Since __io_commit_cqring_flush users moved to different files, introduce io_commit_cqring_flush() helper and encapsulate all flags testing details inside. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0da03887435dd9869ffe46dcd3962bf104afcca3.1655684496.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
spin_lock(&ctx->completion_lock); /* post CQEs */ io_commit_cqring(ctx); spin_unlock(&ctx->completion_lock); io_cqring_ev_posted(ctx); We have many places repeating this sequence, and the three function unlock section is not perfect from the maintainance perspective and also makes it harder to add new locking/sync trick. Introduce two helpers. io_cq_lock(), which is simple and only grabs ->completion_lock, and io_cq_unlock_post() encapsulating the three call section. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/fe0c682bf7f7b55d9be55b0d034be9c1949277dc.1655684496.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
Some io_uring-eventfd users assume that there won't be spurious wakeups. That assumption has to be honoured by all io_cqring_ev_posted() callers, which is inconvenient and from time to time leads to problems but should be maintained to not break the userspace. Instead of making the callers track whether a CQE was posted or not, hide it inside io_eventfd_signal(). It saves ->cached_cq_tail it saw last time and triggers the eventfd only when ->cached_cq_tail changed since then. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0ffc66bae37a2513080b601e4370e147faaa72c5.1655684496.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
io_uring_try_cancel_requests() loops until there is nothing left to do with the ring, however there might be several rings and they might have dependencies between them, e.g. via poll requests. Instead of cancelling rings one by one, try to cancel them all and only then loop over if we still potenially some work to do. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8d491fe02d8ac4c77ff38061cf86b9a827e8845c.1655684496.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
It's not clear how widely used IOSQE_CQE_SKIP_SUCCESS is, and how often ->flush_cqes flag prevents from completion being flushed. Sometimes it's high level of concurrency that enables it at least for one CQE, but sometimes it doesn't save much because nobody waiting on the CQ. Remove ->flush_cqes flag and the optimisation, it should benefit the normal use case. Note, that there is no spurious eventfd problem with that as checks for spuriousness were incorporated into io_eventfd_signal(). Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/692e81eeddccc096f449a7960365fa7b4a18f8e6.1655637157.git.asml.silence@gmail.com [axboe: remove now dead state->flush_cqes variable] Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
Move io_eventfd_signal() in the sources without any changes and kill its forward declaration. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9ebebb3f6f56f5a5448a621e0b6a537720c43334.1655637157.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
We don't post events in __io_commit_cqring_flush() anymore but send all requests to tw, so no need to do io_commit_cqring() there. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f2481e32375e749be89c42e4804268b608722cef.1655637157.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
We have lots of trace events accepting an io_uring request and wanting to print some of its fields like user_data, opcode, flags and so on. However, as trace points were unaware of io_uring structures, we had to pass all the fields as arguments. Teach trace/events/io_uring.h about struct io_kiocb and stop the misery of passing a horde of arguments to trace helpers. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/40ff72f92798114e56d400f2b003beb6cde6ef53.1655384063.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
io_uring/io_uring.h already includes io_uring_types.h, no need to include it every time. Kill it in a bunch of places, it prepares us for following patches. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/94d8c943fbe0ef949981c508ddcee7fc1c18850f.1655384063.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
With IORING_SETUP_CQE32 ->cqe_cached doesn't store a real address but rather an implicit offset into cqes. Store the real cqe pointer and increment it accordingly if CQE32. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1ee1838cba16bed96381a006950b36ba640d998c.1655455613.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
__io_fill_cqe_req() is hot and inlined, we want it to be as small as possible. Add io_req_cqe_overflow() accepting only a request and doing all overflow accounting, and replace with it two calls to 6 argument io_cqring_event_overflow(). Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/048b9fbcce56814d77a1a540409c98c3d383edcb.1655455613.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
__io_get_cqe() is not as hot as io_get_cqe(), no need to inline it, it sheds ~500B from the binary. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c1ac829198a881b7af8710926f99a3559b9f24c0.1655455613.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
Deduplicate some code and add a helper for filling an aux CQE, locking and notification. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b7c6557c8f9dc5c4cfb01292116c682a0ff61081.1655455613.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
Currently we do two extra spin lock/unlock pairs to add a poll/apoll request to the cancellation hash table and remove it from there. On the submission side we often already hold ->uring_lock and tw completion is likely to hold it as well. Add a second cancellation hash table protected by ->uring_lock. In concerns for latency because of a need to have the mutex locked on the completion side, use the new table only in following cases: 1) IORING_SETUP_SINGLE_ISSUER: only one task grabs uring_lock, so there is little to no contention and so the main tw hander will almost always end up grabbing it before calling callbacks. 2) IORING_SETUP_SQPOLL: same as with single issuer, only one task is a major user of ->uring_lock. 3) apoll: we normally grab the lock on the completion side anyway to execute the request, so it's free. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1bbad9c78c454b7b92f100bbf46730a37df7194f.1655371007.git.asml.silence@gmail.comReviewed-by: NHao Xu <howeyxu@tencent.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-