提交 · 8bb649ee1da321ec3a1bbec048a425c5ea18ea21 · openeuler / Kernel

10 3月, 2022 5 次提交

io_uring: remove ring quiesce for io_uring_register · 8bb649ee

由 Usama Arif 提交于 2月 04, 2022

None of the opcodes in io_uring_register use ring quiesce anymore. Hence
io_register_op_must_quiesce always returns false and io_ctx_quiesce is
never called.
Signed-off-by: NUsama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-6-usama.arif@bytedance.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

8bb649ee

io_uring: avoid ring quiesce while registering restrictions and enabling rings · ff16cfcf

由 Usama Arif 提交于 2月 04, 2022

IORING_SETUP_R_DISABLED prevents submitting requests and so there will be
no requests until IORING_REGISTER_ENABLE_RINGS is called. And
IORING_REGISTER_RESTRICTIONS works only before
IORING_REGISTER_ENABLE_RINGS is called. Hence ring quiesce is not needed
for these opcodes.
Signed-off-by: NUsama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-5-usama.arif@bytedance.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

ff16cfcf

io_uring: avoid ring quiesce while registering async eventfd · c75312dd

由 Usama Arif 提交于 2月 04, 2022

This is done using the RCU data structure (io_ev_fd). eventfd_async is
moved from io_ring_ctx to io_ev_fd which is RCU protected hence avoiding
ring quiesce which is much more expensive than an RCU lock. The place
where eventfd_async is read is already under rcu_read_lock so there is no
extra RCU read-side critical section needed.
Signed-off-by: NUsama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-4-usama.arif@bytedance.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

c75312dd

io_uring: avoid ring quiesce while registering/unregistering eventfd · 77bc59b4

由 Usama Arif 提交于 2月 04, 2022

This is done by creating a new RCU data structure (io_ev_fd) as part of
io_ring_ctx that holds the eventfd_ctx.

The function io_eventfd_signal is executed under rcu_read_lock with a
single rcu_dereference to io_ev_fd so that if another thread unregisters
the eventfd while io_eventfd_signal is still being executed, the
eventfd_signal for which io_eventfd_signal was called completes
successfully.

The process of registering/unregistering eventfd is already done under
uring_lock so multiple threads won't enter a race condition while
registering/unregistering eventfd.

With the above approach ring quiesce can be avoided which is much more
expensive then using RCU lock. On the system tested, io_uring_register
with IORING_REGISTER_EVENTFD takes less than 1ms with RCU lock, compared
to 15ms before with ring quiesce.
Signed-off-by: NUsama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-3-usama.arif@bytedance.com
[axboe: long line fixups]
Signed-off-by: NJens Axboe <axboe@kernel.dk>

77bc59b4

io_uring: remove trace for eventfd · 2757be22

由 Usama Arif 提交于 2月 04, 2022

The information on whether eventfd is registered is not very useful and
would result in the tracepoint being enclosed in an rcu_readlock in a
later patch that tries to avoid ring quiesce for registering eventfd.
Suggested-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NUsama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-2-usama.arif@bytedance.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

2757be22

23 2月, 2022 1 次提交

io_uring: disallow modification of rsrc_data during quiesce · 80912cef

由 Dylan Yudaken 提交于 2月 22, 2022

io_rsrc_ref_quiesce will unlock the uring while it waits for references to
the io_rsrc_data to be killed.
There are other places to the data that might add references to data via
calls to io_rsrc_node_switch.
There is a race condition where this reference can be added after the
completion has been signalled. At this point the io_rsrc_ref_quiesce call
will wake up and relock the uring, assuming the data is unused and can be
freed - although it is actually being used.

To fix this check in io_rsrc_ref_quiesce if a resource has been revived.

Reported-by: syzbot+ca8bf833622a1662745b@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Signed-off-by: NDylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220222161751.995746-1-dylany@fb.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

80912cef

21 2月, 2022 1 次提交

io_uring: don't convert to jiffies for waiting on timeouts · 22833966

由 Jens Axboe 提交于 2月 21, 2022

If an application calls io_uring_enter(2) with a timespec passed in,
convert that timespec to ktime_t rather than jiffies. The latter does
not provide the granularity the application may expect, and may in
fact provided different granularity on different systems, depending
on what the HZ value is configured at.

Turn the timespec into an absolute ktime_t, and use that with
schedule_hrtimeout() instead.

Link: https://github.com/axboe/liburing/issues/531
Cc: stable@vger.kernel.org
Reported-by: NBob Chen <chenbo.chen@alibaba-inc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

22833966

15 2月, 2022 1 次提交

io_uring: add a schedule point in io_add_buffers() · f240762f

由 Eric Dumazet 提交于 2月 14, 2022

Looping ~65535 times doing kmalloc() calls can trigger soft lockups,
especially with DEBUG features (like KASAN).

[  253.536212] watchdog: BUG: soft lockup - CPU#64 stuck for 26s! [b219417889:12575]
[  253.544433] Modules linked in: vfat fat i2c_mux_pca954x i2c_mux spidev cdc_acm xhci_pci xhci_hcd sha3_generic gq(O)
[  253.544451] CPU: 64 PID: 12575 Comm: b219417889 Tainted: G S         O      5.17.0-smp-DEV #801
[  253.544457] RIP: 0010:kernel_text_address (./include/asm-generic/sections.h:192 ./include/linux/kallsyms.h:29 kernel/extable.c:67 kernel/extable.c:98)
[  253.544464] Code: 0f 93 c0 48 c7 c1 e0 63 d7 a4 48 39 cb 0f 92 c1 20 c1 0f b6 c1 5b 5d c3 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 53 48 89 fb <48> c7 c0 00 00 80 a0 41 be 01 00 00 00 48 39 c7 72 0c 48 c7 c0 40
[  253.544468] RSP: 0018:ffff8882d8baf4c0 EFLAGS: 00000246
[  253.544471] RAX: 1ffff1105b175e00 RBX: ffffffffa13ef09a RCX: 00000000a13ef001
[  253.544474] RDX: ffffffffa13ef09a RSI: ffff8882d8baf558 RDI: ffffffffa13ef09a
[  253.544476] RBP: ffff8882d8baf4d8 R08: ffff8882d8baf5e0 R09: 0000000000000004
[  253.544479] R10: ffff8882d8baf5e8 R11: ffffffffa0d59a50 R12: ffff8882eab20380
[  253.544481] R13: ffffffffa0d59a50 R14: dffffc0000000000 R15: 1ffff1105b175eb0
[  253.544483] FS:  00000000016d3380(0000) GS:ffff88af48c00000(0000) knlGS:0000000000000000
[  253.544486] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  253.544488] CR2: 00000000004af0f0 CR3: 00000002eabfa004 CR4: 00000000003706e0
[  253.544491] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  253.544492] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  253.544494] Call Trace:
[  253.544496]  <TASK>
[  253.544498] ? io_queue_sqe (fs/io_uring.c:7143)
[  253.544505] __kernel_text_address (kernel/extable.c:78)
[  253.544508] unwind_get_return_address (arch/x86/kernel/unwind_frame.c:19)
[  253.544514] arch_stack_walk (arch/x86/kernel/stacktrace.c:27)
[  253.544517] ? io_queue_sqe (fs/io_uring.c:7143)
[  253.544521] stack_trace_save (kernel/stacktrace.c:123)
[  253.544527] ____kasan_kmalloc (mm/kasan/common.c:39 mm/kasan/common.c:45 mm/kasan/common.c:436 mm/kasan/common.c:515)
[  253.544531] ? ____kasan_kmalloc (mm/kasan/common.c:39 mm/kasan/common.c:45 mm/kasan/common.c:436 mm/kasan/common.c:515)
[  253.544533] ? __kasan_kmalloc (mm/kasan/common.c:524)
[  253.544535] ? kmem_cache_alloc_trace (./include/linux/kasan.h:270 mm/slab.c:3567)
[  253.544541] ? io_issue_sqe (fs/io_uring.c:4556 fs/io_uring.c:4589 fs/io_uring.c:6828)
[  253.544544] ? __io_queue_sqe (fs/io_uring.c:?)
[  253.544551] __kasan_kmalloc (mm/kasan/common.c:524)
[  253.544553] kmem_cache_alloc_trace (./include/linux/kasan.h:270 mm/slab.c:3567)
[  253.544556] ? io_issue_sqe (fs/io_uring.c:4556 fs/io_uring.c:4589 fs/io_uring.c:6828)
[  253.544560] io_issue_sqe (fs/io_uring.c:4556 fs/io_uring.c:4589 fs/io_uring.c:6828)
[  253.544564] ? __kasan_slab_alloc (mm/kasan/common.c:45 mm/kasan/common.c:436 mm/kasan/common.c:469)
[  253.544567] ? __kasan_slab_alloc (mm/kasan/common.c:39 mm/kasan/common.c:45 mm/kasan/common.c:436 mm/kasan/common.c:469)
[  253.544569] ? kmem_cache_alloc_bulk (mm/slab.h:732 mm/slab.c:3546)
[  253.544573] ? __io_alloc_req_refill (fs/io_uring.c:2078)
[  253.544578] ? io_submit_sqes (fs/io_uring.c:7441)
[  253.544581] ? __se_sys_io_uring_enter (fs/io_uring.c:10154 fs/io_uring.c:10096)
[  253.544584] ? __x64_sys_io_uring_enter (fs/io_uring.c:10096)
[  253.544587] ? do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[  253.544590] ? entry_SYSCALL_64_after_hwframe (??:?)
[  253.544596] __io_queue_sqe (fs/io_uring.c:?)
[  253.544600] io_queue_sqe (fs/io_uring.c:7143)
[  253.544603] io_submit_sqe (fs/io_uring.c:?)
[  253.544608] io_submit_sqes (fs/io_uring.c:?)
[  253.544612] __se_sys_io_uring_enter (fs/io_uring.c:10154 fs/io_uring.c:10096)
[  253.544616] __x64_sys_io_uring_enter (fs/io_uring.c:10096)
[  253.544619] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[  253.544623] entry_SYSCALL_64_after_hwframe (??:?)

Fixes: ddf0322d ("io_uring: add IORING_OP_PROVIDE_BUFFERS")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: io-uring <io-uring@vger.kernel.org>
Reported-by: Nsyzbot <syzkaller@googlegroups.com>
Link: https://lore.kernel.org/r/20220215041003.2394784-1-eric.dumazet@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

f240762f

07 2月, 2022 2 次提交

mm: io_uring: allow oom-killer from io_uring_setup · 0a3f1e0b

由 Shakeel Butt 提交于 1月 24, 2022

On an overcommitted system which is running multiple workloads of
varying priorities, it is preferred to trigger an oom-killer to kill a
low priority workload than to let the high priority workload receiving
ENOMEMs. On our memory overcommitted systems, we are seeing a lot of
ENOMEMs instead of oom-kills because io_uring_setup callchain is using
__GFP_NORETRY gfp flag which avoids the oom-killer. Let's remove it and
allow the oom-killer to kill a lower priority job.
Signed-off-by: NShakeel Butt <shakeelb@google.com>
Link: https://lore.kernel.org/r/20220125051736.2981459-1-shakeelb@google.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

0a3f1e0b

io_uring: Clean up a false-positive warning from GCC 9.3.0 · 0d7c1153

由 Alviro Iskandar Setiawan 提交于 2月 07, 2022

In io_recv(), if import_single_range() fails, the @flags variable is
uninitialized, then it will goto out_free.

After the goto, the compiler doesn't know that (ret < min_ret) is
always true, so it thinks the "if ((flags & MSG_WAITALL) ..."  path
could be taken.

The complaint comes from gcc-9 (Debian 9.3.0-22) 9.3.0:
```
  fs/io_uring.c:5238 io_recvfrom() error: uninitialized symbol 'flags'
```
Fix this by bypassing the @ret and @flags check when
import_single_range() fails.

Reasons:
 1. import_single_range() only returns -EFAULT when it fails.
 2. At that point, @flags is uninitialized and shouldn't be read.
Reported-by: Nkernel test robot <lkp@intel.com>
Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
Reported-by: N"Chen, Rong A" <rong.a.chen@intel.com>
Link: https://lore.gnuweeb.org/timl/d33bb5a9-8173-f65b-f653-51fc0681c6d6@intel.com/
Cc: Pavel Begunkov <asml.silence@gmail.com>
Suggested-by: NAmmar Faizi <ammarfaizi2@gnuweeb.org>
Fixes: 7297ce3d ("io_uring: improve send/recv error handling")
Signed-off-by: NAlviro Iskandar Setiawan <alviro.iskandar@gmail.com>
Signed-off-by: NAmmar Faizi <ammarfaizi2@gnuweeb.org>
Link: https://lore.kernel.org/r/20220207140533.565411-1-ammarfaizi2@gnuweeb.orgSigned-off-by: NJens Axboe <axboe@kernel.dk>

0d7c1153

28 1月, 2022 1 次提交

io_uring: remove unused argument from io_rsrc_node_alloc · f6133fbd

由 Usama Arif 提交于 1月 27, 2022

io_ring_ctx is not used in the function.
Signed-off-by: NUsama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220127140444.4016585-1-usama.arif@bytedance.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

f6133fbd

24 1月, 2022 1 次提交

io_uring: fix bug in slow unregistering of nodes · b36a2050

由 Dylan Yudaken 提交于 1月 21, 2022

In some cases io_rsrc_ref_quiesce will call io_rsrc_node_switch_start,
and then immediately flush the delayed work queue &ctx->rsrc_put_work.

However the percpu_ref_put does not immediately destroy the node, it
will be called asynchronously via RCU. That ends up with
io_rsrc_node_ref_zero only being called after rsrc_put_work has been
flushed, and so the process ends up sleeping for 1 second unnecessarily.

This patch executes the put code immediately if we are busy
quiescing.

Fixes: 4a38aed2 ("io_uring: batch reap of dead file registrations")
Signed-off-by: NDylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220121123856.3557884-1-dylany@fb.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

b36a2050

19 1月, 2022 1 次提交

io_uring: perform poll removal even if async work removal is successful · ccbf7261

由 Jens Axboe 提交于 1月 18, 2022

An active work can have poll armed, hence it's not enough to just do
the async work removal and return the value if it's different from "not
found". Rather than make poll removal special, just fall through to do
the remaining type lookups and removals.
Reported-by: NFlorian Fischer <florian.fl.fischer@fau.de>
Link: https://lore.kernel.org/io-uring/20220118151337.fac6cthvbnu7icoc@pasture/Signed-off-by: NJens Axboe <axboe@kernel.dk>

ccbf7261

14 1月, 2022 2 次提交

io_uring: fix UAF due to missing POLLFREE handling · 791f3465

由 Pavel Begunkov 提交于 1月 14, 2022

Fixes a problem described in 50252e4b
("aio: fix use-after-free due to missing POLLFREE handling")
and copies the approach used there.

In short, we have to forcibly eject a poll entry when we meet POLLFREE.
We can't rely on io_poll_get_ownership() as can't wait for potentially
running tw handlers, so we use the fact that wqs are RCU freed. See
Eric's patch and comments for more details.
Reported-by: NEric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20211209010455.42744-6-ebiggers@kernel.org
Reported-and-tested-by: syzbot+5426c7ed6868c705ca14@syzkaller.appspotmail.com
Fixes: 221c5eb2 ("io_uring: add support for IORING_OP_POLL")
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4ed56b6f548f7ea337603a82315750449412748a.1642161259.git.asml.silence@gmail.com
[axboe: drop non-functional change from patch]
Signed-off-by: NJens Axboe <axboe@kernel.dk>

791f3465

io_uring: Remove unused function req_ref_put · c84b8a3f

由 Jiapeng Chong 提交于 1月 14, 2022

Fix the following clang warnings:

fs/io_uring.c:1195:20: warning: unused function 'req_ref_put'
[-Wunused-function].

Fixes: aa43477b ("io_uring: poll rework")
Reported-by: NAbaci Robot <abaci@linux.alibaba.com>
Signed-off-by: NJiapeng Chong <jiapeng.chong@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220113162005.3011-1-jiapeng.chong@linux.alibaba.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

c84b8a3f

10 1月, 2022 1 次提交

io_uring: fix not released cached task refs · 3cc7fdb9

由 Pavel Begunkov 提交于 1月 09, 2022

tctx_task_work() may get run after io_uring cancellation and so there
will be no one to put cached in tctx task refs that may have been added
back by tw handlers using inline completion infra, Call
io_uring_drop_tctx_refs() at the end of the main tw handler to release
them.

Cc: stable@vger.kernel.org # 5.15+
Reported-by: NLukas Bulwahn <lukas.bulwahn@gmail.com>
Fixes: e98e49b2 ("io_uring: extend task put optimisations")
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/69f226b35fbdb996ab799a8bbc1c06bf634ccec1.1641688805.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

3cc7fdb9

06 1月, 2022 3 次提交

io_uring: remove redundant tab space · c0235652

由 GuoYong Zheng 提交于 1月 05, 2022

When show fdinfo, SqMask follow two tab space, which is inconsistent with
other parameters. Remove one, so it lines up nicely.
Signed-off-by: NGuoYong Zheng <zhenggy@chinatelecom.cn>
Link: https://lore.kernel.org/r/1641377585-1891-1-git-send-email-zhenggy@chinatelecom.cnSigned-off-by: NJens Axboe <axboe@kernel.dk>

c0235652

io_uring: remove unused function parameter · 00f6e68b

由 GuoYong Zheng 提交于 1月 05, 2022

Parameter res2 is not used in __io_complete_rw, remove it.

Fixes: 6b19b766 ("fs: get rid of the res2 iocb->ki_complete argument")
Signed-off-by: NGuoYong Zheng <zhenggy@chinatelecom.cn>
Link: https://lore.kernel.org/r/1641377522-1851-1-git-send-email-zhenggy@chinatelecom.cnSigned-off-by: NJens Axboe <axboe@kernel.dk>

00f6e68b

block: move rq_list macros to blk-mq.h · edce22e1

由 Keith Busch 提交于 1月 05, 2022

Move the request list macros to the header file that defines that struct
they operate on.
Signed-off-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220105170518.3181469-2-kbusch@kernel.orgSigned-off-by: NJens Axboe <axboe@kernel.dk>

edce22e1

29 12月, 2021 7 次提交

io_uring: use completion batching for poll rem/upd · cc8e9ba7

由 Pavel Begunkov 提交于 12月 15, 2021

Use __io_req_complete() in io_poll_update(), so we can utilise
completion batching for both update/remove request and the poll
we're killing (if any).
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e2bdc6c5abd9e9b80f09b86d8823eb1c780362cd.1639605189.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

cc8e9ba7

io_uring: single shot poll removal optimisation · eb0089d6

由 Pavel Begunkov 提交于 12月 15, 2021

We don't need to poll oneshot request if we've got a desired mask in
io_poll_wake(), task_work will clean it up correctly, but as we already
hold a wq spinlock, we can remove ourselves and save on additional
spinlocking in io_poll_remove_entries().
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ee170a344a18c9ef36b554d806c64caadfd61c31.1639605189.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

eb0089d6

io_uring: poll rework · aa43477b

由 Pavel Begunkov 提交于 12月 15, 2021

It's not possible to go forward with the current state of io_uring
polling, we need a more straightforward and easier synchronisation.
There are a lot of problems with how it is at the moment, including
missing events on rewait.

The main idea here is to introduce a notion of request ownership while
polling, no one but the owner can modify any part but ->poll_refs of
struct io_kiocb, that grants us protection against all sorts of races.

Main users of such exclusivity are poll task_work handler, so before
queueing a tw one should have/acquire ownership, which will be handed
off to the tw handler.
The other user is __io_arm_poll_handler() do initial poll arming. It
starts taking the ownership, so tw handlers won't be run until it's
released later in the function after vfs_poll. note: also prevents
races in __io_queue_proc().
Poll wake/etc. may not be able to get ownership, then they need to
increase the poll refcount and the task_work should notice it and retry
if necessary, see io_poll_check_events().
There is also IO_POLL_CANCEL_FLAG flag to notify that we want to kill
request.

It makes cancellations more reliable, enables double multishot polling,
fixes double poll rewait, fixes missing poll events and fixes another
bunch of races.

Even though it adds some overhead for new refcounting, and there are a
couple of nice performance wins:
- no req->refs refcounting for poll requests anymore
- if the data is already there (once measured for some test to be 1-2%
  of all apoll requests), it removes it doesn't add atomics and removes
  spin_lock/unlock pair.
- works well with multishots, we don't do remove from queue / add to
  queue for each new poll event.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6b652927c77ed9580ea4330ac5612f0e0848c946.1639605189.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

aa43477b

io_uring: kill poll linking optimisation · ab1dab96

由 Pavel Begunkov 提交于 12月 15, 2021

With IORING_FEAT_FAST_POLL in place, io_put_req_find_next() for poll
requests doesn't make much sense, and in any case re-adding it
shouldn't be a problem considering batching in tctx_task_work(). We can
remove it.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/15699682bf81610ec901d4e79d6da64baa9f70be.1639605189.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

ab1dab96

io_uring: move common poll bits · 5641897a

由 Pavel Begunkov 提交于 12月 15, 2021

Move some poll helpers/etc up, we'll need them there shortly
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6c5c3dba24c86aad5cd389a54a8c7412e6a0621d.1639605189.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

5641897a

io_uring: refactor poll update · 2bbb146d

由 Pavel Begunkov 提交于 12月 15, 2021

Clean up io_poll_update() and unify cancellation paths for remove and
update.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5937138b6265a1285220e2fab1b28132c1d73ce3.1639605189.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

2bbb146d

io_uring: remove double poll on poll update · e840b4ba

由 Pavel Begunkov 提交于 12月 15, 2021

Before updating a poll request we should remove it from poll queues,
including the double poll entry.

Fixes: b69de288 ("io_uring: allow events and user_data update of running poll requests")
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ac39e7f80152613603b8a6cc29a2b6063ac2434f.1639605189.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

e840b4ba

23 12月, 2021 1 次提交

io_uring: zero iocb->ki_pos for stream file types · 7b9762a5

由 Jens Axboe 提交于 12月 22, 2021

io_uring supports using offset == -1 for using the current file position,
and we read that in as part of read/write command setup. For the non-iter
read/write types we pass in NULL for the position pointer, but for the
iter types we should not be passing any anything but 0 for the position
for a stream.

Clear kiocb->ki_pos if the file is a stream, don't leave it as -1. If we
do, then the request will error with -ESPIPE.

Fixes: ba04291e ("io_uring: allow use of offset == -1 to mean file position")
Link: https://github.com/axboe/liburing/discussions/501Reported-by: NSamuel Williams <samuel.williams@oriontransfer.co.nz>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

7b9762a5

14 12月, 2021 1 次提交

io_uring: code clean for some ctx usage · 33ce2aff

由 Hao Xu 提交于 12月 14, 2021

There are some functions doing ctx = req->ctx while still using
req->ctx, update those places.
Signed-off-by: NHao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20211214055904.61772-1-haoxu@linux.alibaba.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

33ce2aff

11 12月, 2021 1 次提交

io_uring: ensure task_work gets run as part of cancelations · 78a78060

由 Jens Axboe 提交于 12月 09, 2021

If we successfully cancel a work item but that work item needs to be
processed through task_work, then we can be sleeping uninterruptibly
in io_uring_cancel_generic() and never process it. Hence we don't
make forward progress and we end up with an uninterruptible sleep
warning.

While in there, correct a comment that should be IFF, not IIF.

Reported-and-tested-by: syzbot+21e6887c0be14181206d@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Signed-off-by: NJens Axboe <axboe@kernel.dk>

78a78060

09 12月, 2021 1 次提交

io_uring: batch completion in prior_task_list · f28c240e

由 Hao Xu 提交于 12月 08, 2021

In previous patches, we have already gathered some tw with
io_req_task_complete() as callback in prior_task_list, let's complete
them in batch while we cannot grab uring lock. In this way, we batch
the req_complete_post path.
Signed-off-by: NHao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20211208052125.351587-1-haoxu@linux.alibaba.comReviewed-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

f28c240e

08 12月, 2021 3 次提交

io_uring: split io_req_complete_post() and add a helper · a37fae8a

由 Hao Xu 提交于 12月 07, 2021

Split io_req_complete_post(), this is a prep for the next patch.
Signed-off-by: NHao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20211207093951.247840-5-haoxu@linux.alibaba.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

a37fae8a

io_uring: add helper for task work execution code · 9f8d032a

由 Hao Xu 提交于 12月 07, 2021

Add a helper for task work execution code. We will use it later.
Reviewed-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NHao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20211207093951.247840-4-haoxu@linux.alibaba.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

9f8d032a

io_uring: add a priority tw list for irq completion work · 4813c377

由 Hao Xu 提交于 12月 07, 2021

Now we have a lot of task_work users, some are just to complete a req
and generate a cqe. Let's put the work to a new tw list which has a
higher priority, so that it can be handled quickly and thus to reduce
avg req latency and users can issue next round of sqes earlier.
An explanatory case:

origin timeline:
    submit_sqe-->irq-->add completion task_work
    -->run heavy work0~n-->run completion task_work
now timeline:
    submit_sqe-->irq-->add completion task_work
    -->run completion task_work-->run heavy work0~n

Limitation: this optimization is only for those that submission and
reaping process are in different threads. Otherwise anyhow we have to
submit new sqes after returning to userspace, then the order of TWs
doesn't matter.

Tested this patch(and the following ones) by manually replace
__io_queue_sqe() in io_queue_sqe() by io_req_task_queue() to construct
'heavy' task works. Then test with fio:

ioengine=io_uring
sqpoll=1
thread=1
bs=4k
direct=1
rw=randread
time_based=1
runtime=600
randrepeat=0
group_reporting=1
filename=/dev/nvme0n1

Tried various iodepth.
The peak IOPS for this patch is 710K, while the old one is 665K.
For avg latency, difference shows when iodepth grow:
depth and avg latency(usec):
	depth      new          old
	 1        7.05         7.10
	 2        8.47         8.60
	 4        10.42        10.42
	 8        13.78        13.22
	 16       27.41        24.33
	 32       49.40        53.08
	 64       102.53       103.36
	 128      196.98       205.61
	 256      372.99       414.88
         512      747.23       791.30
         1024     1472.59      1538.72
         2048     3153.49      3329.01
         4096     6387.86      6682.54
         8192     12150.25     12774.14
         16384    23085.58     26044.71
Signed-off-by: NHao Xu <haoxu@linux.alibaba.com>
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20211207093951.247840-3-haoxu@linux.alibaba.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

4813c377

05 12月, 2021 4 次提交

io_uring: reuse io_req_task_complete for timeouts · a90c8bf6

由 Pavel Begunkov 提交于 12月 05, 2021

With kbuf unification io_req_task_complete() is now a generic function,
use it for timeout's tw completions.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7142fa3cbaf3a4140d59bcba45cbe168cf40fac2.1638714983.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

a90c8bf6

io_uring: tweak iopoll CQE_SKIP event counting · 83a13a41

由 Pavel Begunkov 提交于 12月 05, 2021

When iopolling the userspace specifies the minimum number of "events" it
expects. Previously, we had one CQE per request, so the definition of
an "event" was unequivocal, but that's not more the case anymore with
REQ_F_CQE_SKIP.

Currently it counts the number of completed requests, replace it with
the number of posted CQEs. This allows users of the "one CQE per link"
scheme to wait for all N links in a single syscall, which is not
possible without the patch and requires extra context switches.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d5a965c4d2249827392037bbd0186f87fea49c55.1638714983.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

83a13a41

io_uring: simplify selected buf handling · d1fd1c20

由 Pavel Begunkov 提交于 12月 05, 2021

As selected buffers are now stored in a separate field in a request, get
rid of rw/recv specific helpers and simplify the code.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bd4a866d8d91b044f748c40efff9e4eacd07536e.1638714983.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

d1fd1c20

io_uring: move up io_put_kbuf() and io_put_rw_kbuf() · 3648e526

由 Hao Xu 提交于 12月 05, 2021

Move them up to avoid explicit declaration. We will use them in later
patches.
Reviewed-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NHao Xu <haoxu@linux.alibaba.com>
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3631243d6fc4a79bbba0cd62597fc8cd5be95924.1638714983.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

3648e526

29 11月, 2021 1 次提交

io_uring: validate timespec for timeout removals · 2087009c

由 Ye Bin 提交于 11月 29, 2021

Like commit f6223ff7, timeout removal should also validate the
timespec that is being passed in.
Signed-off-by: NYe Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/r/20211129041537.1936270-1-yebin10@huawei.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

2087009c

27 11月, 2021 2 次提交

io_uring: Fix undefined-behaviour in io_issue_sqe · f6223ff7

由 Ye Bin 提交于 11月 18, 2021

We got issue as follows:
================================================================================
UBSAN: Undefined behaviour in ./include/linux/ktime.h:42:14
signed integer overflow:
-4966321760114568020 * 1000000000 cannot be represented in type 'long long int'
CPU: 1 PID: 2186 Comm: syz-executor.2 Not tainted 4.19.90+ #12
Hardware name: linux,dummy-virt (DT)
Call trace:
 dump_backtrace+0x0/0x3f0 arch/arm64/kernel/time.c:78
 show_stack+0x28/0x38 arch/arm64/kernel/traps.c:158
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x170/0x1dc lib/dump_stack.c:118
 ubsan_epilogue+0x18/0xb4 lib/ubsan.c:161
 handle_overflow+0x188/0x1dc lib/ubsan.c:192
 __ubsan_handle_mul_overflow+0x34/0x44 lib/ubsan.c:213
 ktime_set include/linux/ktime.h:42 [inline]
 timespec64_to_ktime include/linux/ktime.h:78 [inline]
 io_timeout fs/io_uring.c:5153 [inline]
 io_issue_sqe+0x42c8/0x4550 fs/io_uring.c:5599
 __io_queue_sqe+0x1b0/0xbc0 fs/io_uring.c:5988
 io_queue_sqe+0x1ac/0x248 fs/io_uring.c:6067
 io_submit_sqe fs/io_uring.c:6137 [inline]
 io_submit_sqes+0xed8/0x1c88 fs/io_uring.c:6331
 __do_sys_io_uring_enter fs/io_uring.c:8170 [inline]
 __se_sys_io_uring_enter fs/io_uring.c:8129 [inline]
 __arm64_sys_io_uring_enter+0x490/0x980 fs/io_uring.c:8129
 invoke_syscall arch/arm64/kernel/syscall.c:53 [inline]
 el0_svc_common+0x374/0x570 arch/arm64/kernel/syscall.c:121
 el0_svc_handler+0x190/0x260 arch/arm64/kernel/syscall.c:190
 el0_svc+0x10/0x218 arch/arm64/kernel/entry.S:1017
================================================================================

As ktime_set only judge 'secs' if big than KTIME_SEC_MAX, but if we pass
negative value maybe lead to overflow.
To address this issue, we must check if 'sec' is negative.
Signed-off-by: NYe Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/r/20211118015907.844807-1-yebin10@huawei.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

f6223ff7

io_uring: fix soft lockup when call __io_remove_buffers · 1d0254e6

由 Ye Bin 提交于 11月 22, 2021

I got issue as follows:
[ 567.094140] __io_remove_buffers: [1]start ctx=0xffff8881067bf000 bgid=65533 buf=0xffff8881fefe1680
[  594.360799] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [kworker/u32:5:108]
[  594.364987] Modules linked in:
[  594.365405] irq event stamp: 604180238
[  594.365906] hardirqs last  enabled at (604180237): [<ffffffff93fec9bd>] _raw_spin_unlock_irqrestore+0x2d/0x50
[  594.367181] hardirqs last disabled at (604180238): [<ffffffff93fbbadb>] sysvec_apic_timer_interrupt+0xb/0xc0
[  594.368420] softirqs last  enabled at (569080666): [<ffffffff94200654>] __do_softirq+0x654/0xa9e
[  594.369551] softirqs last disabled at (569080575): [<ffffffff913e1d6a>] irq_exit_rcu+0x1ca/0x250
[  594.370692] CPU: 2 PID: 108 Comm: kworker/u32:5 Tainted: G            L    5.15.0-next-20211112+ #88
[  594.371891] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
[  594.373604] Workqueue: events_unbound io_ring_exit_work
[  594.374303] RIP: 0010:_raw_spin_unlock_irqrestore+0x33/0x50
[  594.375037] Code: 48 83 c7 18 53 48 89 f3 48 8b 74 24 10 e8 55 f5 55 fd 48 89 ef e8 ed a7 56 fd 80 e7 02 74 06 e8 43 13 7b fd fb bf 01 00 00 00 <e8> f8 78 474
[  594.377433] RSP: 0018:ffff888101587a70 EFLAGS: 00000202
[  594.378120] RAX: 0000000024030f0d RBX: 0000000000000246 RCX: 1ffffffff2f09106
[  594.379053] RDX: 0000000000000000 RSI: ffffffff9449f0e0 RDI: 0000000000000001
[  594.379991] RBP: ffffffff9586cdc0 R08: 0000000000000001 R09: fffffbfff2effcab
[  594.380923] R10: ffffffff977fe557 R11: fffffbfff2effcaa R12: ffff8881b8f3def0
[  594.381858] R13: 0000000000000246 R14: ffff888153a8b070 R15: 0000000000000000
[  594.382787] FS:  0000000000000000(0000) GS:ffff888399c00000(0000) knlGS:0000000000000000
[  594.383851] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  594.384602] CR2: 00007fcbe71d2000 CR3: 00000000b4216000 CR4: 00000000000006e0
[  594.385540] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  594.386474] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  594.387403] Call Trace:
[  594.387738]  <TASK>
[  594.388042]  find_and_remove_object+0x118/0x160
[  594.389321]  delete_object_full+0xc/0x20
[  594.389852]  kfree+0x193/0x470
[  594.390275]  __io_remove_buffers.part.0+0xed/0x147
[  594.390931]  io_ring_ctx_free+0x342/0x6a2
[  594.392159]  io_ring_exit_work+0x41e/0x486
[  594.396419]  process_one_work+0x906/0x15a0
[  594.399185]  worker_thread+0x8b/0xd80
[  594.400259]  kthread+0x3bf/0x4a0
[  594.401847]  ret_from_fork+0x22/0x30
[  594.402343]  </TASK>

Message from syslogd@localhost at Nov 13 09:09:54 ...
kernel:watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [kworker/u32:5:108]
[  596.793660] __io_remove_buffers: [2099199]start ctx=0xffff8881067bf000 bgid=65533 buf=0xffff8881fefe1680

We can reproduce this issue by follow syzkaller log:
r0 = syz_io_uring_setup(0x401, &(0x7f0000000300), &(0x7f0000003000/0x2000)=nil, &(0x7f0000ff8000/0x4000)=nil, &(0x7f0000000280)=<r1=>0x0, &(0x7f0000000380)=<r2=>0x0)
sendmsg$ETHTOOL_MSG_FEATURES_SET(0xffffffffffffffff, &(0x7f0000003080)={0x0, 0x0, &(0x7f0000003040)={&(0x7f0000000040)=ANY=[], 0x18}}, 0x0)
syz_io_uring_submit(r1, r2, &(0x7f0000000240)=@IORING_OP_PROVIDE_BUFFERS={0x1f, 0x5, 0x0, 0x401, 0x1, 0x0, 0x100, 0x0, 0x1, {0xfffd}}, 0x0)
io_uring_enter(r0, 0x3a2d, 0x0, 0x0, 0x0, 0x0)

The reason above issue  is 'buf->list' has 2,100,000 nodes, occupied cpu lead
to soft lockup.
To solve this issue, we need add schedule point when do while loop in
'__io_remove_buffers'.
After add  schedule point we do regression, get follow data.
[  240.141864] __io_remove_buffers: [1]start ctx=0xffff888170603000 bgid=65533 buf=0xffff8881116fcb00
[  268.408260] __io_remove_buffers: [1]start ctx=0xffff8881b92d2000 bgid=65533 buf=0xffff888130c83180
[  275.899234] __io_remove_buffers: [2099199]start ctx=0xffff888170603000 bgid=65533 buf=0xffff8881116fcb00
[  296.741404] __io_remove_buffers: [1]start ctx=0xffff8881b659c000 bgid=65533 buf=0xffff8881010fe380
[  305.090059] __io_remove_buffers: [2099199]start ctx=0xffff8881b92d2000 bgid=65533 buf=0xffff888130c83180
[  325.415746] __io_remove_buffers: [1]start ctx=0xffff8881b92d1000 bgid=65533 buf=0xffff8881a17d8f00
[  333.160318] __io_remove_buffers: [2099199]start ctx=0xffff8881b659c000 bgid=65533 buf=0xffff8881010fe380
...

Fixes:8bab4c09("io_uring: allow conditional reschedule for intensive iterators")
Signed-off-by: NYe Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/r/20211122024737.2198530-1-yebin10@huawei.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

1d0254e6

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功