提交 · 78f8876c2d9f6fdeb9ff62ed1911505156cd7b3d · openeuler / Kernel

28 9月, 2021 1 次提交

io-wq: exclusively gate signal based exit on get_signal() return · 78f8876c

由 Jens Axboe 提交于 9月 27, 2021

io-wq threads block all signals, except SIGKILL and SIGSTOP. We should not
need any extra checking of signal_pending or fatal_signal_pending, rely
exclusively on whether or not get_signal() tells us to exit.

The original debugging of this issue led to the false positive that we
were exiting on non-fatal signals, but that is not the case. The issue
was around races with nr_workers accounting.

Fixes: 87c16966 ("io-wq: ensure we exit if thread group is exiting")
Fixes: 15e20db2 ("io-wq: only exit on fatal signals")
Reported-by: NEric W. Biederman <ebiederm@xmission.com>
Reported-by: NLinus Torvalds <torvalds@linux-foundation.org>
Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

78f8876c

25 9月, 2021 1 次提交

io-wq: ensure we exit if thread group is exiting · 87c16966

由 Jens Axboe 提交于 9月 21, 2021

Dave reports that a coredumping workload gets stuck in 5.15-rc2, and
identified the culprit in the Fixes line below. The problem is that
relying solely on fatal_signal_pending() to gate whether to exit or not
fails miserably if a process gets eg SIGILL sent. Don't exclusively
rely on fatal signals, also check if the thread group is exiting.

Fixes: 15e20db2 ("io-wq: only exit on fatal signals")
Reported-by: NDave Chinner <david@fromorbit.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

87c16966

14 9月, 2021 1 次提交

io-wq: provide IO_WQ_* constants for IORING_REGISTER_IOWQ_MAX_WORKERS arg items · dd47c104

由 Eugene Syromiatnikov 提交于 9月 13, 2021

The items passed in the array pointed by the arg parameter
of IORING_REGISTER_IOWQ_MAX_WORKERS io_uring_register operation
carry certain semantics: they refer to different io-wq worker categories;
provide IO_WQ_* constants in the UAPI, so these categories can be referenced
in the user space code.
Suggested-by: NJens Axboe <axboe@kernel.dk>
Complements: 2e480058 ("io-wq: provide a way to limit max number of workers")
Signed-off-by: NEugene Syromiatnikov <esyr@redhat.com>
Link: https://lore.kernel.org/r/20210913154415.GA12890@asgard.redhat.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

dd47c104

13 9月, 2021 2 次提交

io-wq: fix potential race of acct->nr_workers · 767a65e9

由 Hao Xu 提交于 9月 12, 2021

Given max_worker is 1, and we currently have 1 running and it is
exiting. There may be race like:
 io_wqe_enqueue                   worker1
                               no work there and timeout
                               unlock(wqe->lock)
 ->insert work
                               -->io_worker_exit
 lock(wqe->lock)
 ->if(!nr_workers) //it's still 1
 unlock(wqe->lock)
    goto run_cancel
                                  lock(wqe->lock)
                                  nr_workers--
                                  ->dec_running
                                    ->worker creation fails
                                  unlock(wqe->lock)

We enqueued one work but there is no workers, causes hung.
Signed-off-by: NHao Xu <haoxu@linux.alibaba.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

767a65e9

io-wq: code clean of io_wqe_create_worker() · 7a842fb5

由 Hao Xu 提交于 9月 12, 2021

Remove do_create to save a local variable.
Signed-off-by: NHao Xu <haoxu@linux.alibaba.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

7a842fb5

09 9月, 2021 2 次提交

io-wq: fix memory leak in create_io_worker() · 66e70be7

由 Qiang.zhang 提交于 9月 09, 2021

BUG: memory leak
unreferenced object 0xffff888126fcd6c0 (size 192):
  comm "syz-executor.1", pid 11934, jiffies 4294983026 (age 15.690s)
  backtrace:
    [<ffffffff81632c91>] kmalloc_node include/linux/slab.h:609 [inline]
    [<ffffffff81632c91>] kzalloc_node include/linux/slab.h:732 [inline]
    [<ffffffff81632c91>] create_io_worker+0x41/0x1e0 fs/io-wq.c:739
    [<ffffffff8163311e>] io_wqe_create_worker fs/io-wq.c:267 [inline]
    [<ffffffff8163311e>] io_wqe_enqueue+0x1fe/0x330 fs/io-wq.c:866
    [<ffffffff81620b64>] io_queue_async_work+0xc4/0x200 fs/io_uring.c:1473
    [<ffffffff8162c59c>] __io_queue_sqe+0x34c/0x510 fs/io_uring.c:6933
    [<ffffffff8162c7ab>] io_req_task_submit+0x4b/0xa0 fs/io_uring.c:2233
    [<ffffffff8162cb48>] io_async_task_func+0x108/0x1c0 fs/io_uring.c:5462
    [<ffffffff816259e3>] tctx_task_work+0x1b3/0x3a0 fs/io_uring.c:2158
    [<ffffffff81269b43>] task_work_run+0x73/0xb0 kernel/task_work.c:164
    [<ffffffff812dcdd1>] tracehook_notify_signal include/linux/tracehook.h:212 [inline]
    [<ffffffff812dcdd1>] handle_signal_work kernel/entry/common.c:146 [inline]
    [<ffffffff812dcdd1>] exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
    [<ffffffff812dcdd1>] exit_to_user_mode_prepare+0x151/0x180 kernel/entry/common.c:209
    [<ffffffff843ff25d>] __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
    [<ffffffff843ff25d>] syscall_exit_to_user_mode+0x1d/0x40 kernel/entry/common.c:302
    [<ffffffff843fa4a2>] do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
    [<ffffffff84600068>] entry_SYSCALL_64_after_hwframe+0x44/0xae

when create_io_thread() return error, and not retry, the worker object
need to be freed.

Reported-by: syzbot+65454c239241d3d647da@syzkaller.appspotmail.com
Signed-off-by: NQiang.zhang <qiang.zhang@windriver.com>
Link: https://lore.kernel.org/r/20210909115822.181188-1-qiang.zhang@windriver.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

66e70be7

io-wq: fix silly logic error in io_task_work_match() · 3b33e3f4

由 Jens Axboe 提交于 9月 08, 2021

We check for the func with an OR condition, which means it always ends
up being false and we never match the task_work we want to cancel. In
the unexpected case that we do exit with that pending, we can trigger
a hang waiting for a worker to exit, but it was never created. syzbot
reports that as such:

INFO: task syz-executor687:8514 blocked for more than 143 seconds.
      Not tainted 5.14.0-syzkaller #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor687 state:D stack:27296 pid: 8514 ppid:  8479 flags:0x00024004
Call Trace:
 context_switch kernel/sched/core.c:4940 [inline]
 __schedule+0x940/0x26f0 kernel/sched/core.c:6287
 schedule+0xd3/0x270 kernel/sched/core.c:6366
 schedule_timeout+0x1db/0x2a0 kernel/time/timer.c:1857
 do_wait_for_common kernel/sched/completion.c:85 [inline]
 __wait_for_common kernel/sched/completion.c:106 [inline]
 wait_for_common kernel/sched/completion.c:117 [inline]
 wait_for_completion+0x176/0x280 kernel/sched/completion.c:138
 io_wq_exit_workers fs/io-wq.c:1162 [inline]
 io_wq_put_and_exit+0x40c/0xc70 fs/io-wq.c:1197
 io_uring_clean_tctx fs/io_uring.c:9607 [inline]
 io_uring_cancel_generic+0x5fe/0x740 fs/io_uring.c:9687
 io_uring_files_cancel include/linux/io_uring.h:16 [inline]
 do_exit+0x265/0x2a30 kernel/exit.c:780
 do_group_exit+0x125/0x310 kernel/exit.c:922
 get_signal+0x47f/0x2160 kernel/signal.c:2868
 arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:865
 handle_signal_work kernel/entry/common.c:148 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
 exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:209
 __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
 syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:302
 do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x445cd9
RSP: 002b:00007fc657f4b308 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: 0000000000000001 RBX: 00000000004cb448 RCX: 0000000000445cd9
RDX: 00000000000f4240 RSI: 0000000000000081 RDI: 00000000004cb44c
RBP: 00000000004cb440 R08: 000000000000000e R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000049b154
R13: 0000000000000003 R14: 00007fc657f4b400 R15: 0000000000022000

While in there, also decrement accr->nr_workers. This isn't strictly
needed as we're exiting, but let's make sure the accounting matches up.

Fixes: 3146cba9 ("io-wq: make worker creation resilient against signals")
Reported-by: syzbot+f62d3e0a4ea4f38f5326@syzkaller.appspotmail.com
Signed-off-by: NJens Axboe <axboe@kernel.dk>

3b33e3f4

08 9月, 2021 1 次提交

io-wq: fix cancellation on create-worker failure · 713b9825

由 Pavel Begunkov 提交于 9月 08, 2021

WARNING: CPU: 0 PID: 10392 at fs/io_uring.c:1151 req_ref_put_and_test
fs/io_uring.c:1151 [inline]
WARNING: CPU: 0 PID: 10392 at fs/io_uring.c:1151 req_ref_put_and_test
fs/io_uring.c:1146 [inline]
WARNING: CPU: 0 PID: 10392 at fs/io_uring.c:1151
io_req_complete_post+0xf5b/0x1190 fs/io_uring.c:1794
Modules linked in:
Call Trace:
 tctx_task_work+0x1e5/0x570 fs/io_uring.c:2158
 task_work_run+0xe0/0x1a0 kernel/task_work.c:164
 tracehook_notify_signal include/linux/tracehook.h:212 [inline]
 handle_signal_work kernel/entry/common.c:146 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
 exit_to_user_mode_prepare+0x232/0x2a0 kernel/entry/common.c:209
 __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
 syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:302
 do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
 entry_SYSCALL_64_after_hwframe+0x44/0xae

When io_wqe_enqueue() -> io_wqe_create_worker() fails, we can't just
call io_run_cancel() to clean up the request, it's already enqueued via
io_wqe_insert_work() and will be executed either by some other worker
during cancellation (e.g. in io_wq_put_and_exit()).
Reported-by: NHao Sun <sunhao.th@gmail.com>
Fixes: 3146cba9 ("io-wq: make worker creation resilient against signals")
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/93b9de0fcf657affab0acfd675d4abcd273ee863.1631092071.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

713b9825

03 9月, 2021 2 次提交

io-wq: make worker creation resilient against signals · 3146cba9

由 Jens Axboe 提交于 9月 01, 2021

If a task is queueing async work and also handling signals, then we can
run into the case where create_io_thread() is interrupted and returns
failure because of that. If this happens for creating the first worker
in a group, then that worker will never get created and we can hang the
ring.

If we do get a fork failure, retry from task_work. With signals we have
to be a bit careful as we cannot simply queue as task_work, as we'll
still have signals pending at that point. Punt over a normal workqueue
first and then create from task_work after that.

Lastly, ensure that we handle fatal worker creations. Worker creation
failures are normally not fatal, only if we fail to create one in an empty
worker group can we not make progress. Right now that is ignored, ensure
that we handle that and run cancel on the work item.

There are two paths that create new workers - one is the "existing worker
going to sleep", and the other is "no workers found for this work, create
one". The former is never fatal, as workers do exist in the group. Only
the latter needs to be carefully handled.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

3146cba9

io-wq: get rid of FIXED worker flag · 05c5f4ee

由 Jens Axboe 提交于 9月 01, 2021

It makes the logic easier to follow if we just get rid of the fixed worker
flag, and simply ensure that we never exit the last worker in the group.
This also means that no particular worker is special.

Just track the last timeout state, and if we have hit it and no work
is pending, check if there are other workers. If yes, then we can exit
this one safely.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

05c5f4ee

02 9月, 2021 2 次提交

io-wq: only exit on fatal signals · 15e20db2

由 Jens Axboe 提交于 9月 01, 2021

If the application uses io_uring and also relies heavily on signals
for communication, that can cause io-wq workers to spuriously exit
just because the parent has a signal pending. Just ignore signals
unless they are fatal.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

15e20db2

io-wq: split bounded and unbounded work into separate lists · f95dc207

由 Jens Axboe 提交于 8月 31, 2021

We've got a few issues that all boil down to the fact that we have one
list of pending work items, yet two different types of workers to
serve them. This causes some oddities around workers switching type and
even hashed work vs regular work on the same bounded list.

Just separate them out cleanly, similarly to how we already do
accounting of what is running. That provides a clean separation and
removes some corner cases that can cause stalls when handling IO
that is punted to io-wq.

Fixes: ecc53c48 ("io-wq: check max_worker limits if a worker transitions bound state")
Signed-off-by: NJens Axboe <axboe@kernel.dk>

f95dc207

01 9月, 2021 3 次提交

io-wq: fix queue stalling race · 0242f642

由 Jens Axboe 提交于 8月 31, 2021

We need to set the stalled bit early, before we drop the lock for adding
us to the stall hash queue. If not, then we can race with new work being
queued between adding us to the stall hash and io_worker_handle_work()
marking us stalled.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

0242f642

io-wq: ensure that hash wait lock is IRQ disabling · 08bdbd39

由 Jens Axboe 提交于 8月 31, 2021

A previous commit removed the IRQ safety of the worker and wqe locks,
but that left one spot of the hash wait lock now being done without
already having IRQs disabled.

Ensure that we use the right locking variant for the hashed waitqueue
lock.

Fixes: a9a4aa9f ("io-wq: wqe and worker locks no longer need to be IRQ safe")
Signed-off-by: NJens Axboe <axboe@kernel.dk>

08bdbd39

io-wq: fix race between adding work and activating a free worker · 94ffb0a2

由 Jens Axboe 提交于 8月 30, 2021

The attempt to find and activate a free worker for new work is currently
combined with creating a new one if we don't find one, but that opens
io-wq up to a race where the worker that is found and activated can
put itself to sleep without knowing that it has been selected to perform
this new work.

Fix this by moving the activation into where we add the new work item,
then we can retain it within the wqe->lock scope and elimiate the race
with the worker itself checking inside the lock, but sleeping outside of
it.

Cc: stable@vger.kernel.org
Reported-by: NAndres Freund <andres@anarazel.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

94ffb0a2

30 8月, 2021 3 次提交

io-wq: fix wakeup race when adding new work · 87df7fb9

由 Jens Axboe 提交于 8月 30, 2021

When new work is added, io_wqe_enqueue() checks if we need to wake or
create a new worker. But that check is done outside the lock that
otherwise synchronizes us with a worker going to sleep, so we can end
up in the following situation:

CPU0				CPU1
lock
insert work
unlock
atomic_read(nr_running) != 0
				lock
				atomic_dec(nr_running)
no wakeup needed

Hold the wqe lock around the "need to wakeup" check. Then we can also get
rid of the temporary work_flags variable, as we know the work will remain
valid as long as we hold the lock.

Cc: stable@vger.kernel.org
Reported-by: NAndres Freund <andres@anarazel.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

87df7fb9

io-wq: wqe and worker locks no longer need to be IRQ safe · a9a4aa9f

由 Jens Axboe 提交于 8月 30, 2021

io_uring no longer queues async work off completion handlers that run in
hard or soft interrupt context, and that use case was the only reason that
io-wq had to use IRQ safe locks for wqe and worker locks.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a9a4aa9f

io-wq: check max_worker limits if a worker transitions bound state · ecc53c48

由 Jens Axboe 提交于 8月 29, 2021

For the two places where new workers are created, we diligently check if
we are allowed to create a new worker. If we're currently at the limit
of how many workers of a given type we can have, then we don't create
any new ones.

If you have a mixed workload with various types of bound and unbounded
work, then it can happen that a worker finishes one type of work and
is then transitioned to the other type. For this case, we don't check
if we are actually allowed to do so. This can cause io-wq to temporarily
exceed the allowed number of workers for a given type.

When retrieving work, check that the types match. If they don't, check
if we are allowed to transition to the other type. If not, then don't
handle the new work.

Cc: stable@vger.kernel.org
Reported-by: NJohannes Lundberg <johalun0@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

ecc53c48

29 8月, 2021 1 次提交

io-wq: provide a way to limit max number of workers · 2e480058

由 Jens Axboe 提交于 8月 27, 2021

io-wq divides work into two categories:

1) Work that completes in a bounded time, like reading from a regular file
or a block device. This type of work is limited based on the size of
the SQ ring.

2) Work that may never complete, we call this unbounded work. The amount
of workers here is just limited by RLIMIT_NPROC.

For various uses cases, it's handy to have the kernel limit the maximum
amount of pending workers for both categories. Provide a way to do with
with a new IORING_REGISTER_IOWQ_MAX_WORKERS operation.

IORING_REGISTER_IOWQ_MAX_WORKERS takes an array of two integers and sets
the max worker count to what is being passed in for each category. The
old values are returned into that same array. If 0 is being passed in for
either category, it simply returns the current value.

The value is capped at RLIMIT_NPROC. This actually isn't that important
as it's more of a hint, if we're exceeding the value then our attempt
to fork a new worker will fail. This happens naturally already if more
than one node is in the system, as these values are per-node internally
for io-wq.
Reported-by: NJohannes Lundberg <johalun0@gmail.com>
Link: https://github.com/axboe/liburing/issues/420Signed-off-by: NJens Axboe <axboe@kernel.dk>

2e480058

24 8月, 2021 2 次提交

io-wq: move nr_running and worker_refs out of wqe->lock protection · 79dca184

由 Hao Xu 提交于 8月 10, 2021

We don't need to protect nr_running and worker_refs by wqe->lock, so
narrow the range of raw_spin_lock_irq - raw_spin_unlock_irq
Signed-off-by: NHao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210810125554.99229-1-haoxu@linux.alibaba.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

79dca184

io-wq: remove GFP_ATOMIC allocation off schedule out path · d3e9f732

由 Jens Axboe 提交于 8月 04, 2021

Daniel reports that the v5.14-rc4-rt4 kernel throws a BUG when running
stress-ng:

| [   90.202543] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:35
| [   90.202549] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 2047, name: iou-wrk-2041
| [   90.202555] CPU: 5 PID: 2047 Comm: iou-wrk-2041 Tainted: G        W         5.14.0-rc4-rt4+ #89
| [   90.202559] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
| [   90.202561] Call Trace:
| [   90.202577]  dump_stack_lvl+0x34/0x44
| [   90.202584]  ___might_sleep.cold+0x87/0x94
| [   90.202588]  rt_spin_lock+0x19/0x70
| [   90.202593]  ___slab_alloc+0xcb/0x7d0
| [   90.202598]  ? newidle_balance.constprop.0+0xf5/0x3b0
| [   90.202603]  ? dequeue_entity+0xc3/0x290
| [   90.202605]  ? io_wqe_dec_running.isra.0+0x98/0xe0
| [   90.202610]  ? pick_next_task_fair+0xb9/0x330
| [   90.202612]  ? __schedule+0x670/0x1410
| [   90.202615]  ? io_wqe_dec_running.isra.0+0x98/0xe0
| [   90.202618]  kmem_cache_alloc_trace+0x79/0x1f0
| [   90.202621]  io_wqe_dec_running.isra.0+0x98/0xe0
| [   90.202625]  io_wq_worker_sleeping+0x37/0x50
| [   90.202628]  schedule+0x30/0xd0
| [   90.202630]  schedule_timeout+0x8f/0x1a0
| [   90.202634]  ? __bpf_trace_tick_stop+0x10/0x10
| [   90.202637]  io_wqe_worker+0xfd/0x320
| [   90.202641]  ? finish_task_switch.isra.0+0xd3/0x290
| [   90.202644]  ? io_worker_handle_work+0x670/0x670
| [   90.202646]  ? io_worker_handle_work+0x670/0x670
| [   90.202649]  ret_from_fork+0x22/0x30

which is due to the RT kernel not liking a GFP_ATOMIC allocation inside
a raw spinlock. Besides that not working on RT, doing any kind of
allocation from inside schedule() is kind of nasty and should be avoided
if at all possible.

This particular path happens when an io-wq worker goes to sleep, and we
need a new worker to handle pending work. We currently allocate a small
data item to hold the information we need to create a new worker, but we
can instead include this data in the io_worker struct itself and just
protect it with a single bit lock. We only really need one per worker
anyway, as we will have run pending work between to sleep cycles.

https://lore.kernel.org/lkml/20210804082418.fbibprcwtzyt5qax@beryllium.lan/Reported-by: NDaniel Wagner <dwagner@suse.de>
Tested-by: NDaniel Wagner <dwagner@suse.de>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

d3e9f732

10 8月, 2021 2 次提交

io-wq: fix IO_WORKER_F_FIXED issue in create_io_worker() · 47cae0c7

由 Hao Xu 提交于 8月 08, 2021

There may be cases like:
        A                                 B
spin_lock(wqe->lock)
nr_workers is 0
nr_workers++
spin_unlock(wqe->lock)
                                     spin_lock(wqe->lock)
                                     nr_wokers is 1
                                     nr_workers++
                                     spin_unlock(wqe->lock)
create_io_worker()
  acct->worker is 1
                                     create_io_worker()
                                       acct->worker is 1

There should be one worker marked IO_WORKER_F_FIXED, but no one is.
Fix this by introduce a new agrument for create_io_worker() to indicate
if it is the first worker.

Fixes: 3d4e4fac ("io-wq: fix no lock protection of acct->nr_worker")
Signed-off-by: NHao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210808135434.68667-3-haoxu@linux.alibaba.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

47cae0c7

io-wq: fix bug of creating io-wokers unconditionally · 49e7f0c7

由 Hao Xu 提交于 8月 08, 2021

The former patch to add check between nr_workers and max_workers has a
bug, which will cause unconditionally creating io-workers. That's
because the result of the check doesn't affect the call of
create_io_worker(), fix it by bringing in a boolean value for it.

Fixes: 21698274 ("io-wq: fix lack of acct->nr_workers < acct->max_workers judgement")
Signed-off-by: NHao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210808135434.68667-2-haoxu@linux.alibaba.com
[axboe: drop hunk that isn't strictly needed]
Signed-off-by: NJens Axboe <axboe@kernel.dk>

49e7f0c7

06 8月, 2021 2 次提交

io-wq: fix lack of acct->nr_workers < acct->max_workers judgement · 21698274

由 Hao Xu 提交于 8月 05, 2021

There should be this judgement before we create an io-worker

Fixes: 685fe7fe ("io-wq: eliminate the need for a manager thread")
Signed-off-by: NHao Xu <haoxu@linux.alibaba.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

21698274

io-wq: fix no lock protection of acct->nr_worker · 3d4e4fac

由 Hao Xu 提交于 8月 05, 2021

There is an acct->nr_worker visit without lock protection. Think about
the case: two callers call io_wqe_wake_worker(), one is the original
context and the other one is an io-worker(by calling
io_wqe_enqueue(wqe, linked)), on two cpus paralelly, this may cause
nr_worker to be larger than max_worker.
Let's fix it by adding lock for it, and let's do nr_workers++ before
create_io_worker. There may be a edge cause that the first caller fails
to create an io-worker, but the second caller doesn't know it and then
quit creating io-worker as well:

say nr_worker = max_worker - 1
        cpu 0                        cpu 1
   io_wqe_wake_worker()          io_wqe_wake_worker()
      nr_worker < max_worker
      nr_worker++
      create_io_worker()         nr_worker == max_worker
         failed                  return
      return

But the chance of this case is very slim.

Fixes: 685fe7fe ("io-wq: eliminate the need for a manager thread")
Signed-off-by: NHao Xu <haoxu@linux.alibaba.com>
[axboe: fix unconditional create_io_worker() call]
Signed-off-by: NJens Axboe <axboe@kernel.dk>

3d4e4fac

05 8月, 2021 1 次提交

io-wq: fix race between worker exiting and activating free worker · 83d6c393

由 Jens Axboe 提交于 8月 03, 2021

Nadav correctly reports that we have a race between a worker exiting,
and new work being queued. This can lead to work being queued behind
an existing worker that could be sleeping on an event before it can
run to completion, and hence introducing potential big latency gaps
if we hit this race condition:

cpu0					cpu1
----					----
					io_wqe_worker()
					schedule_timeout()
					 // timed out
io_wqe_enqueue()
io_wqe_wake_worker()
// work_flags & IO_WQ_WORK_CONCURRENT
io_wqe_activate_free_worker()
					 io_worker_exit()

Fix this by having the exiting worker go through the normal decrement
of a running worker, which will spawn a new one if needed.

The free worker activation is modified to only return success if we
were able to find a sleeping worker - if not, we keep looking through
the list. If we fail, we create a new worker as per usual.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/io-uring/BFF746C0-FEDE-4646-A253-3021C57C26C9@gmail.com/Reported-by: NNadav Amit <nadav.amit@gmail.com>
Tested-by: NNadav Amit <nadav.amit@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

83d6c393

24 7月, 2021 1 次提交

io_uring: explicitly catch any illegal async queue attempt · 991468dc

由 Jens Axboe 提交于 7月 23, 2021

Catch an illegal case to queue async from an unrelated task that got
the ring fd passed to it. This should not be possible to hit, but
better be proactive and catch it explicitly. io-wq is extended to
check for early IO_WQ_WORK_CANCEL being set on a work item as well,
so it can run the request through the normal cancelation path.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

991468dc

18 6月, 2021 3 次提交

io_uring: fix false WARN_ONCE · e6ab8991

由 Pavel Begunkov 提交于 6月 17, 2021

WARNING: CPU: 1 PID: 11749 at fs/io-wq.c:244 io_wqe_wake_worker fs/io-wq.c:244 [inline]
WARNING: CPU: 1 PID: 11749 at fs/io-wq.c:244 io_wqe_enqueue+0x7f6/0x910 fs/io-wq.c:751

A WARN_ON_ONCE() in io_wqe_wake_worker() can be triggered by a valid
userspace setup. Replace it with pr_warn.

Reported-by: syzbot+ea2f1484cffe5109dc10@syzkaller.appspotmail.com
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f7ede342c3342c4c26668f5168e2993e38bbd99c.1623949695.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

e6ab8991

io_uring: allow user configurable IO thread CPU affinity · fe76421d

由 Jens Axboe 提交于 6月 17, 2021

io-wq defaults to per-node masks for IO workers. This works fine by
default, but isn't particularly handy for workloads that prefer more
specific affinities, for either performance or isolation reasons.

This adds IORING_REGISTER_IOWQ_AFF that allows the user to pass in a CPU
mask that is then applied to IO thread workers, and an
IORING_UNREGISTER_IOWQ_AFF that simply resets the masks back to the
default of per-node.

Note that no care is given to existing IO threads, they will need to go
through a reschedule before the affinity is correct if they are already
running or sleeping.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

fe76421d

io-wq: use private CPU mask · 0e03496d

由 Jens Axboe 提交于 6月 17, 2021

In preparation for allowing user specific CPU masks for IO thread
creation, switch to using a mask embedded in the per-node wqe
structure.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

0e03496d

16 6月, 2021 2 次提交

io-wq: remove header files not needed anymore · ec16d35b

由 Olivier Langlois 提交于 5月 31, 2021

mm related header files are not needed for io-wq module.
remove them for a small clean-up.
Reviewed-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NOlivier Langlois <olivier@trillion01.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

ec16d35b

io-wq: remove redundant initialization of variable ret · b1b2fc35

由 Colin Ian King 提交于 6月 15, 2021

The variable ret is being initialized with a value that is never read, the
assignment is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: NColin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20210615143424.60449-1-colin.king@canonical.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

b1b2fc35

14 6月, 2021 4 次提交

io-wq: simplify worker exiting · e587227b

由 Pavel Begunkov 提交于 6月 14, 2021

io_worker_handle_work() already takes care of the empty list case and
releases spinlock, so get rid of ugly conditional unlocking and
unconditionally call handle_work()
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7521e485677f381036676943e876a0afecc23017.1623634181.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

e587227b

io-wq: don't repeat IO_WQ_BIT_EXIT check by worker · 769e6837

由 Pavel Begunkov 提交于 6月 14, 2021

io_wqe_worker()'s main loop does check IO_WQ_BIT_EXIT flag, so no need
for a second test_bit at the end as it will immediately jump to the
first check afterwards.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d6af4a51c86523a527fb5417c9fbc775c4b26497.1623634181.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

769e6837

io-wq: remove unused io-wq refcounting · 382cb030

由 Pavel Begunkov 提交于 6月 14, 2021

iowq->refs is initialised to one and killed on exit, so it's not used
and we can kill it.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/401007393528ea7c102360e69a29b64498e15db2.1623634181.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

382cb030

io-wq: embed wqe ptr array into struct io_wq · c7f405d6

由 Pavel Begunkov 提交于 6月 14, 2021

io-wq keeps an array of pointers to struct io_wqe, allocate this array
as a part of struct io-wq, it's easier to code and saves an extra
indirection for nearly each io-wq call.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1482c6a001923bbed662dc38a8a580fb08b1ed8c.1623634181.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

c7f405d6

26 5月, 2021 2 次提交

io-wq: Fix UAF when wakeup wqe in hash waitqueue · 3743c172

由 Zqiang 提交于 5月 26, 2021

BUG: KASAN: use-after-free in __wake_up_common+0x637/0x650
Read of size 8 at addr ffff8880304250d8 by task iou-wrk-28796/28802

Call Trace:
 __dump_stack [inline]
 dump_stack+0x141/0x1d7
 print_address_description.constprop.0.cold+0x5b/0x2c6
 __kasan_report [inline]
 kasan_report.cold+0x7c/0xd8
 __wake_up_common+0x637/0x650
 __wake_up_common_lock+0xd0/0x130
 io_worker_handle_work+0x9dd/0x1790
 io_wqe_worker+0xb2a/0xd40
 ret_from_fork+0x1f/0x30

Allocated by task 28798:
 kzalloc_node [inline]
 io_wq_create+0x3c4/0xdd0
 io_init_wq_offload [inline]
 io_uring_alloc_task_context+0x1bf/0x6b0
 __io_uring_add_task_file+0x29a/0x3c0
 io_uring_add_task_file [inline]
 io_uring_install_fd [inline]
 io_uring_create [inline]
 io_uring_setup+0x209a/0x2bd0
 do_syscall_64+0x3a/0xb0
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Freed by task 28798:
 kfree+0x106/0x2c0
 io_wq_destroy+0x182/0x380
 io_wq_put [inline]
 io_wq_put_and_exit+0x7a/0xa0
 io_uring_clean_tctx [inline]
 __io_uring_cancel+0x428/0x530
 io_uring_files_cancel
 do_exit+0x299/0x2a60
 do_group_exit+0x125/0x310
 get_signal+0x47f/0x2150
 arch_do_signal_or_restart+0x2a8/0x1eb0
 handle_signal_work[inline]
 exit_to_user_mode_loop [inline]
 exit_to_user_mode_prepare+0x171/0x280
 __syscall_exit_to_user_mode_work [inline]
 syscall_exit_to_user_mode+0x19/0x60
 do_syscall_64+0x47/0xb0
 entry_SYSCALL_64_after_hwframe

There are the following scenarios, hash waitqueue is shared by
io-wq1 and io-wq2. (note: wqe is worker)

io-wq1:worker2     | locks bit1
io-wq2:worker1     | waits bit1
io-wq1:worker3     | waits bit1

io-wq1:worker2     | completes all wqe bit1 work items
io-wq1:worker2     | drop bit1, exit

io-wq2:worker1     | locks bit1
io-wq1:worker3     | can not locks bit1, waits bit1 and exit
io-wq1             | exit and free io-wq1
io-wq2:worker1     | drops bit1
io-wq1:worker3     | be waked up, even though wqe is freed

After all iou-wrk belonging to io-wq1 have exited, remove wqe
form hash waitqueue, it is guaranteed that there will be no more
wqe belonging to io-wq1 in the hash waitqueue.

Reported-by: syzbot+6cb11ade52aa17095297@syzkaller.appspotmail.com
Signed-off-by: NZqiang <qiang.zhang@windriver.com>
Link: https://lore.kernel.org/r/20210526050826.30500-1-qiang.zhang@windriver.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

3743c172

io_uring/io-wq: close io-wq full-stop gap · 17a91051

由 Pavel Begunkov 提交于 5月 23, 2021

There is an old problem with io-wq cancellation where requests should be
killed and are in io-wq but are not discoverable, e.g. in @next_hashed
or @linked vars of io_worker_handle_work(). It adds some unreliability
to individual request canellation, but also may potentially get
__io_uring_cancel() stuck. For instance:

1) An __io_uring_cancel()'s cancellation round have not found any
   request but there are some as desribed.
2) __io_uring_cancel() goes to sleep
3) Then workers wake up and try to execute those hidden requests
   that happen to be unbound.

As we already cancel all requests of io-wq there, set IO_WQ_BIT_EXIT
in advance, so preventing 3) from executing unbound requests. The
workers will initially break looping because of getting a signal as they
are threads of the dying/exec()'ing user task.

Cc: stable@vger.kernel.org
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/abfcf8c54cb9e8f7bfbad7e9a0cc5433cc70bdc2.1621781238.git.asml.silence@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

17a91051

21 4月, 2021 1 次提交

io-wq: remove unused io_wqe_need_worker() function · eb372672

由 Jens Axboe 提交于 4月 20, 2021

A previous commit removed the need for this, but overlooked that we no
longer use it at all. Get rid of it.

Fixes: 685fe7fe ("io-wq: eliminate the need for a manager thread")
Reported-by: Nkernel test robot <lkp@intel.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

eb372672

12 4月, 2021 1 次提交

io-wq: Fix io_wq_worker_affinity() · e0051d7d

由 Peter Zijlstra 提交于 4月 08, 2021

Do not include private headers and do not frob in internals.

On top of that, while the previous code restores the affinity, it
doesn't ensure the task actually moves there if it was running,
leading to the fun situation that it can be observed running outside
of its allowed mask for potentially significant time.

Use the proper API instead.
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/YG7QkiUzlEbW85TU@hirez.programming.kicks-ass.netSigned-off-by: NJens Axboe <axboe@kernel.dk>

e0051d7d

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功