1. 30 Aug 2021, 3 commits
    • io-wq: fix wakeup race when adding new work · 87df7fb9
      Committed by Jens Axboe
      When new work is added, io_wqe_enqueue() checks if we need to wake or
      create a new worker. But that check is done outside the lock that
      otherwise synchronizes us with a worker going to sleep, so we can end
      up in the following situation:
      
      CPU0				CPU1
      lock
      insert work
      unlock
      atomic_read(nr_running) != 0
      				lock
      				atomic_dec(nr_running)
      no wakeup needed
      
      Hold the wqe lock around the "need to wakeup" check. Then we can also get
      rid of the temporary work_flags variable, as we know the work will remain
      valid as long as we hold the lock.
      
      Cc: stable@vger.kernel.org
      Reported-by: Andres Freund <andres@anarazel.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io-wq: wqe and worker locks no longer need to be IRQ safe · a9a4aa9f
      Committed by Jens Axboe
      io_uring no longer queues async work off completion handlers that run in
      hard or soft interrupt context, and that use case was the only reason that
      io-wq had to use IRQ safe locks for wqe and worker locks.
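
      Illustratively, the conversion is mechanical (a sketch; the call
      sites are spread across fs/io-wq.c):

      /* Before: work could be queued from IRQ context. */
      raw_spin_lock_irq(&wqe->lock);
      /* ... */
      raw_spin_unlock_irq(&wqe->lock);

      /* After: only process context queues work, plain locking suffices. */
      raw_spin_lock(&wqe->lock);
      /* ... */
      raw_spin_unlock(&wqe->lock);
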
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io-wq: check max_worker limits if a worker transitions bound state · ecc53c48
      Committed by Jens Axboe
      For the two places where new workers are created, we diligently check if
      we are allowed to create a new worker. If we're currently at the limit
      of how many workers of a given type we can have, then we don't create
      any new ones.
      
      If you have a mixed workload with various types of bound and unbounded
      work, then it can happen that a worker finishes one type of work and
      is then transitioned to the other type. For this case, we don't check
      if we are actually allowed to do so. This can cause io-wq to temporarily
      exceed the allowed number of workers for a given type.
      
      When retrieving work, check that the types match. If they don't, check
      if we are allowed to transition to the other type. If not, then don't
      handle the new work.
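
      A minimal sketch of that check, using a hypothetical helper name
      (the real logic sits in the work-retrieval path in fs/io-wq.c):

      /* Hypothetical helper: refuse to switch a worker to the other
       * bound/unbound type if that type is already at its limit. */
      static bool io_worker_can_switch(struct io_wqe *wqe, bool to_bound)
      {
              struct io_wqe_acct *acct =
                      &wqe->acct[to_bound ? IO_WQ_ACCT_BOUND : IO_WQ_ACCT_UNBOUND];

              return acct->nr_workers < acct->max_workers;
      }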
      
      Cc: stable@vger.kernel.org
      Reported-by: Johannes Lundberg <johalun0@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 29 Aug 2021, 1 commit
    • io-wq: provide a way to limit max number of workers · 2e480058
      Committed by Jens Axboe
      io-wq divides work into two categories:
      
      1) Work that completes in a bounded time, like reading from a regular file
         or a block device. This type of work is limited based on the size of
         the SQ ring.
      
      2) Work that may never complete, we call this unbounded work. The amount
         of workers here is just limited by RLIMIT_NPROC.
      
      For various use cases, it's handy to have the kernel limit the maximum
      number of workers for both categories. Provide a way to do this with a
      new IORING_REGISTER_IOWQ_MAX_WORKERS operation.
      
      IORING_REGISTER_IOWQ_MAX_WORKERS takes an array of two integers and sets
      the max worker count to what is being passed in for each category. The
      old values are returned into that same array. If 0 is being passed in for
      either category, it simply returns the current value.
      
      The value is capped at RLIMIT_NPROC. This isn't that important in
      practice, as it's more of a hint: if we exceed the value, our attempt
      to fork a new worker will fail. That already happens naturally if more
      than one node is in the system, as these values are per-node internally
      for io-wq.
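
      A userspace usage sketch with the raw syscall (with liburing,
      io_uring_register_iowq_max_workers() wraps this; ring_fd is assumed
      to be an already-created io_uring fd, and the headers must be new
      enough to define IORING_REGISTER_IOWQ_MAX_WORKERS):

      #include <stdio.h>
      #include <sys/syscall.h>
      #include <unistd.h>
      #include <linux/io_uring.h>

      /* Cap bounded workers at 8; 0 in the unbounded slot just reads
       * back the current value. */
      int limit_iowq_workers(int ring_fd)
      {
              unsigned int counts[2] = { 8, 0 };  /* [0]=bounded, [1]=unbounded */
              int ret = syscall(__NR_io_uring_register, ring_fd,
                                IORING_REGISTER_IOWQ_MAX_WORKERS, counts, 2);

              if (ret == 0)   /* counts[] now holds the previous limits */
                      printf("old limits: bounded=%u unbounded=%u\n",
                             counts[0], counts[1]);
              return ret;
      }
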
      Reported-by: Johannes Lundberg <johalun0@gmail.com>
      Link: https://github.com/axboe/liburing/issues/420
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 24 Aug 2021, 2 commits
    • io-wq: move nr_running and worker_refs out of wqe->lock protection · 79dca184
      Committed by Hao Xu
      We don't need to protect nr_running and worker_refs with wqe->lock, so
      narrow the raw_spin_lock_irq()/raw_spin_unlock_irq() critical sections
      accordingly.
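
      A simplified before/after sketch of the narrowing:

      /* Before: atomic and refcount ops needlessly inside the lock. */
      raw_spin_lock_irq(&wqe->lock);
      atomic_dec(&acct->nr_running);
      io_worker_ref_put(wqe->wq);
      /* ... list manipulation that genuinely needs wqe->lock ... */
      raw_spin_unlock_irq(&wqe->lock);

      /* After: only the list work stays locked; nr_running and
       * worker_refs are already atomic. */
      raw_spin_lock_irq(&wqe->lock);
      /* ... list manipulation ... */
      raw_spin_unlock_irq(&wqe->lock);
      atomic_dec(&acct->nr_running);
      io_worker_ref_put(wqe->wq);
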
      Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
      Link: https://lore.kernel.org/r/20210810125554.99229-1-haoxu@linux.alibaba.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io-wq: remove GFP_ATOMIC allocation off schedule out path · d3e9f732
      Committed by Jens Axboe
      Daniel reports that the v5.14-rc4-rt4 kernel throws a BUG when running
      stress-ng:
      
      | [   90.202543] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:35
      | [   90.202549] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 2047, name: iou-wrk-2041
      | [   90.202555] CPU: 5 PID: 2047 Comm: iou-wrk-2041 Tainted: G        W         5.14.0-rc4-rt4+ #89
      | [   90.202559] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
      | [   90.202561] Call Trace:
      | [   90.202577]  dump_stack_lvl+0x34/0x44
      | [   90.202584]  ___might_sleep.cold+0x87/0x94
      | [   90.202588]  rt_spin_lock+0x19/0x70
      | [   90.202593]  ___slab_alloc+0xcb/0x7d0
      | [   90.202598]  ? newidle_balance.constprop.0+0xf5/0x3b0
      | [   90.202603]  ? dequeue_entity+0xc3/0x290
      | [   90.202605]  ? io_wqe_dec_running.isra.0+0x98/0xe0
      | [   90.202610]  ? pick_next_task_fair+0xb9/0x330
      | [   90.202612]  ? __schedule+0x670/0x1410
      | [   90.202615]  ? io_wqe_dec_running.isra.0+0x98/0xe0
      | [   90.202618]  kmem_cache_alloc_trace+0x79/0x1f0
      | [   90.202621]  io_wqe_dec_running.isra.0+0x98/0xe0
      | [   90.202625]  io_wq_worker_sleeping+0x37/0x50
      | [   90.202628]  schedule+0x30/0xd0
      | [   90.202630]  schedule_timeout+0x8f/0x1a0
      | [   90.202634]  ? __bpf_trace_tick_stop+0x10/0x10
      | [   90.202637]  io_wqe_worker+0xfd/0x320
      | [   90.202641]  ? finish_task_switch.isra.0+0xd3/0x290
      | [   90.202644]  ? io_worker_handle_work+0x670/0x670
      | [   90.202646]  ? io_worker_handle_work+0x670/0x670
      | [   90.202649]  ret_from_fork+0x22/0x30
      
      which is due to the RT kernel not liking a GFP_ATOMIC allocation inside
      a raw spinlock. Besides that not working on RT, doing any kind of
      allocation from inside schedule() is kind of nasty and should be avoided
      if at all possible.
      
      This particular path happens when an io-wq worker goes to sleep, and we
      need a new worker to handle pending work. We currently allocate a small
      data item to hold the information we need to create a new worker, but we
      can instead include this data in the io_worker struct itself and just
      protect it with a single bit lock. We only really need one per worker
      anyway, as we will have run pending work between two sleep cycles.
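
      A hedged sketch of the approach (field and helper names simplified
      from fs/io-wq.c):

      struct io_worker {
              /* ... existing fields ... */
              unsigned long create_state;       /* bit 0: request pending */
              struct callback_head create_work; /* task_work forking a worker */
              int create_index;                 /* which acct wants it */
      };

      static bool io_queue_worker_create(struct io_worker *worker, int index)
      {
              /* test_and_set_bit_lock() gives acquire semantics; one
               * pending request per worker is enough, since pending work
               * runs between any two sleep cycles. */
              if (test_and_set_bit_lock(0, &worker->create_state))
                      return false;

              worker->create_index = index;
              init_task_work(&worker->create_work, create_worker_cb);
              if (!task_work_add(worker->wqe->wq->task, &worker->create_work,
                                 TWA_SIGNAL))
                      return true;

              clear_bit_unlock(0, &worker->create_state);
              return false;
      }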
      
      https://lore.kernel.org/lkml/20210804082418.fbibprcwtzyt5qax@beryllium.lan/
      Reported-by: Daniel Wagner <dwagner@suse.de>
      Tested-by: Daniel Wagner <dwagner@suse.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 10 Aug 2021, 2 commits
  5. 06 Aug 2021, 2 commits
    • io-wq: fix lack of acct->nr_workers < acct->max_workers judgement · 21698274
      Committed by Hao Xu
      This check should be made before we create an io-worker.
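
      A minimal sketch of the added check (callback and struct names as
      they existed in fs/io-wq.c at the time; error handling elided):

      static void create_worker_cb(struct callback_head *cb)
      {
              struct create_worker_data *cwd =
                      container_of(cb, struct create_worker_data, work);
              struct io_wqe *wqe = cwd->wqe;
              struct io_wqe_acct *acct = &wqe->acct[cwd->index];

              /* The judgement in question: only fork with headroom left. */
              if (acct->nr_workers < acct->max_workers)
                      create_io_worker(wqe->wq, wqe, cwd->index);
              kfree(cwd);
      }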
      
      Fixes: 685fe7fe ("io-wq: eliminate the need for a manager thread")
      Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io-wq: fix no lock protection of acct->nr_worker · 3d4e4fac
      Committed by Hao Xu
      acct->nr_worker is accessed without lock protection. Consider this
      case: two callers invoke io_wqe_wake_worker() in parallel on two CPUs,
      one from the original context and one from an io-worker (via
      io_wqe_enqueue(wqe, linked)). This can cause nr_worker to grow larger
      than max_worker.
      Fix it by adding a lock, and do nr_workers++ before create_io_worker().
      There is an edge case where the first caller fails to create an
      io-worker, but the second caller doesn't know that and quits creating
      an io-worker as well:
      
      say nr_worker = max_worker - 1
              cpu 0                        cpu 1
         io_wqe_wake_worker()          io_wqe_wake_worker()
            nr_worker < max_worker
            nr_worker++
            create_io_worker()         nr_worker == max_worker
               failed                  return
            return
      
      But the chance of this case is very slim.
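
      A hedged sketch of the locked bookkeeping (simplified from
      io_wqe_wake_worker(); the wqe lock was still IRQ-safe at this point
      in the series):

      /* Reserve the slot under wqe->lock before forking, so two CPUs
       * cannot both pass the nr_workers < max_workers test. */
      bool do_create = false;

      raw_spin_lock_irq(&wqe->lock);
      if (acct->nr_workers < acct->max_workers) {
              acct->nr_workers++;
              do_create = true;
      }
      raw_spin_unlock_irq(&wqe->lock);

      if (do_create) {
              atomic_inc(&acct->nr_running);
              atomic_inc(&wqe->wq->worker_refs);
              create_io_worker(wqe->wq, wqe, acct->index);
      }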
      
      Fixes: 685fe7fe ("io-wq: eliminate the need for a manager thread")
      Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
      [axboe: fix unconditional create_io_worker() call]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. 05 Aug 2021, 1 commit
  7. 24 Jul 2021, 1 commit
    • io_uring: explicitly catch any illegal async queue attempt · 991468dc
      Committed by Jens Axboe
      Catch an illegal case to queue async from an unrelated task that got
      the ring fd passed to it. This should not be possible to hit, but
      better be proactive and catch it explicitly. io-wq is extended to
      check for early IO_WQ_WORK_CANCEL being set on a work item as well,
      so it can run the request through the normal cancelation path.
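
      Schematically (a sketch, simplified from the submission path):

      /* Queueing async work from a task outside the thread group that
       * owns the ring should be impossible; if it happens, route the
       * request through cancellation instead of running it. */
      if (WARN_ON_ONCE(!same_thread_group(req->task, current)))
              req->work.flags |= IO_WQ_WORK_CANCEL;
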
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. 18 Jun 2021, 3 commits
  9. 16 Jun 2021, 2 commits
  10. 14 Jun 2021, 4 commits
  11. 26 May 2021, 2 commits
    • io-wq: Fix UAF when wakeup wqe in hash waitqueue · 3743c172
      Committed by Zqiang
      BUG: KASAN: use-after-free in __wake_up_common+0x637/0x650
      Read of size 8 at addr ffff8880304250d8 by task iou-wrk-28796/28802
      
      Call Trace:
       __dump_stack [inline]
       dump_stack+0x141/0x1d7
       print_address_description.constprop.0.cold+0x5b/0x2c6
       __kasan_report [inline]
       kasan_report.cold+0x7c/0xd8
       __wake_up_common+0x637/0x650
       __wake_up_common_lock+0xd0/0x130
       io_worker_handle_work+0x9dd/0x1790
       io_wqe_worker+0xb2a/0xd40
       ret_from_fork+0x1f/0x30
      
      Allocated by task 28798:
       kzalloc_node [inline]
       io_wq_create+0x3c4/0xdd0
       io_init_wq_offload [inline]
       io_uring_alloc_task_context+0x1bf/0x6b0
       __io_uring_add_task_file+0x29a/0x3c0
       io_uring_add_task_file [inline]
       io_uring_install_fd [inline]
       io_uring_create [inline]
       io_uring_setup+0x209a/0x2bd0
       do_syscall_64+0x3a/0xb0
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Freed by task 28798:
       kfree+0x106/0x2c0
       io_wq_destroy+0x182/0x380
       io_wq_put [inline]
       io_wq_put_and_exit+0x7a/0xa0
       io_uring_clean_tctx [inline]
       __io_uring_cancel+0x428/0x530
       io_uring_files_cancel
       do_exit+0x299/0x2a60
       do_group_exit+0x125/0x310
       get_signal+0x47f/0x2150
       arch_do_signal_or_restart+0x2a8/0x1eb0
        handle_signal_work [inline]
       exit_to_user_mode_loop [inline]
       exit_to_user_mode_prepare+0x171/0x280
       __syscall_exit_to_user_mode_work [inline]
       syscall_exit_to_user_mode+0x19/0x60
       do_syscall_64+0x47/0xb0
       entry_SYSCALL_64_after_hwframe
      
      The following scenario is possible when the hash waitqueue is shared
      by io-wq1 and io-wq2 (note: wqe is worker):
      
      io-wq1:worker2     | locks bit1
      io-wq2:worker1     | waits on bit1
      io-wq1:worker3     | waits on bit1

      io-wq1:worker2     | completes all wqe bit1 work items
      io-wq1:worker2     | drops bit1, exits

      io-wq2:worker1     | locks bit1
      io-wq1:worker3     | cannot lock bit1, waits on bit1 and exits
      io-wq1             | exits and frees io-wq1
      io-wq2:worker1     | drops bit1
      io-wq1:worker3     | is woken up, even though the wqe is freed
      
      After all iou-wrk threads belonging to io-wq1 have exited, remove the
      wqe from the hash waitqueue; this guarantees that no wqe belonging to
      io-wq1 remains in the hash waitqueue.
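
      A hedged sketch of the fix, following that description (simplified
      from the worker-exit path):

      /* Last exiting worker unhooks this wqe from the shared hash
       * waitqueue, so a later wake_up() cannot touch freed memory. */
      spin_lock_irq(&wq->hash->wait.lock);
      list_del_init(&wqe->wait.entry);
      spin_unlock_irq(&wq->hash->wait.lock);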
      
      Reported-by: syzbot+6cb11ade52aa17095297@syzkaller.appspotmail.com
      Signed-off-by: Zqiang <qiang.zhang@windriver.com>
      Link: https://lore.kernel.org/r/20210526050826.30500-1-qiang.zhang@windriver.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/io-wq: close io-wq full-stop gap · 17a91051
      Committed by Pavel Begunkov
      There is an old problem with io-wq cancellation where requests that
      should be killed are in io-wq but are not discoverable, e.g. in the
      @next_hashed or @linked vars of io_worker_handle_work(). This adds
      some unreliability to individual request cancellation, but may also
      get __io_uring_cancel() stuck. For instance:
      
      1) A cancellation round in __io_uring_cancel() has not found any
         requests, but some exist as described above.
      2) __io_uring_cancel() goes to sleep
      3) Then workers wake up and try to execute those hidden requests
         that happen to be unbound.
      
      Since we already cancel all io-wq requests there, set IO_WQ_BIT_EXIT
      in advance, preventing step 3) from executing unbound requests. The
      workers will initially break out of their loops because they receive
      a signal, as they are threads of the dying/exec()'ing user task.
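
      Schematically (hedged; io_wq_exit_start() follows this series'
      naming):

      /* Flag io-wq as exiting before the final cancellation pass, so
       * workers can no longer pick up hidden unbound requests. */
      static void io_wq_exit_start(struct io_wq *wq)
      {
              set_bit(IO_WQ_BIT_EXIT, &wq->state);
      }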
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/abfcf8c54cb9e8f7bfbad7e9a0cc5433cc70bdc2.1621781238.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. 21 Apr 2021, 1 commit
  13. 12 Apr 2021, 5 commits
  14. 09 Apr 2021, 1 commit
    • io-wq: cancel unbounded works on io-wq destroy · c60eb049
      Committed by Pavel Begunkov
      WARNING: CPU: 5 PID: 227 at fs/io_uring.c:8578 io_ring_exit_work+0xe6/0x470
      RIP: 0010:io_ring_exit_work+0xe6/0x470
      Call Trace:
       process_one_work+0x206/0x400
       worker_thread+0x4a/0x3d0
       kthread+0x129/0x170
       ret_from_fork+0x22/0x30
      
      INFO: task lfs-openat:2359 blocked for more than 245 seconds.
      task:lfs-openat      state:D stack:    0 pid: 2359 ppid:     1 flags:0x00000004
      Call Trace:
       ...
       wait_for_completion+0x8b/0xf0
       io_wq_destroy_manager+0x24/0x60
       io_wq_put_and_exit+0x18/0x30
       io_uring_clean_tctx+0x76/0xa0
       __io_uring_files_cancel+0x1b9/0x2e0
       do_exit+0xc0/0xb40
       ...
      
      Even after io-wq destroy has been issued, io-wq worker threads will
      continue executing all leftover work items as usual, and may hang
      waiting for I/O that won't ever complete (aka unbounded work):
      
      [<0>] pipe_read+0x306/0x450
      [<0>] io_iter_do_read+0x1e/0x40
      [<0>] io_read+0xd5/0x330
      [<0>] io_issue_sqe+0xd21/0x18a0
      [<0>] io_wq_submit_work+0x6c/0x140
      [<0>] io_worker_handle_work+0x17d/0x400
      [<0>] io_wqe_worker+0x2c0/0x330
      [<0>] ret_from_fork+0x22/0x30
      
      Cancel all unbounded work instead of executing it. This changes the
      user-visible behaviour, but that's inevitable, as io-wq is not per task.
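
      A sketch of the behavioural change (simplified from
      io_worker_handle_work()):

      bool do_kill = test_bit(IO_WQ_BIT_EXIT, &wq->state);

      /* Once the wq is exiting, unbounded work is flagged for
       * cancellation instead of being executed. */
      if (unlikely(do_kill) && (work->flags & IO_WQ_WORK_UNBOUND))
              work->flags |= IO_WQ_WORK_CANCEL;
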
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/cd4b543154154cba055cf86f351441c2174d7f71.1617842918.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  15. 01 Apr 2021, 1 commit
  16. 28 Mar 2021, 1 commit
  17. 26 Mar 2021, 1 commit
    • io-wq: fix race around pending work on teardown · f5d2d23b
      Committed by Jens Axboe
      syzbot reports that it's triggering the warning condition on having
      pending work on shutdown:
      
      WARNING: CPU: 1 PID: 12346 at fs/io-wq.c:1061 io_wq_destroy fs/io-wq.c:1061 [inline]
      WARNING: CPU: 1 PID: 12346 at fs/io-wq.c:1061 io_wq_put+0x153/0x260 fs/io-wq.c:1072
      Modules linked in:
      CPU: 1 PID: 12346 Comm: syz-executor.5 Not tainted 5.12.0-rc2-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:io_wq_destroy fs/io-wq.c:1061 [inline]
      RIP: 0010:io_wq_put+0x153/0x260 fs/io-wq.c:1072
      Code: 8d e8 71 90 ea 01 49 89 c4 41 83 fc 40 7d 4f e8 33 4d 97 ff 42 80 7c 2d 00 00 0f 85 77 ff ff ff e9 7a ff ff ff e8 1d 4d 97 ff <0f> 0b eb b9 8d 6b ff 89 ee 09 de bf ff ff ff ff e8 18 51 97 ff 09
      RSP: 0018:ffffc90001ebfb08 EFLAGS: 00010293
      RAX: ffffffff81e16083 RBX: ffff888019038040 RCX: ffff88801e86b780
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000040
      RBP: 1ffff1100b2f8a80 R08: ffffffff81e15fce R09: ffffed100b2f8a82
      R10: ffffed100b2f8a82 R11: 0000000000000000 R12: 0000000000000000
      R13: dffffc0000000000 R14: ffff8880597c5400 R15: ffff888019038000
      FS:  00007f8dcd89c700(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000055e9a054e160 CR3: 000000001dfb8000 CR4: 00000000001506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       io_uring_clean_tctx+0x1b7/0x210 fs/io_uring.c:8802
       __io_uring_files_cancel+0x13c/0x170 fs/io_uring.c:8820
       io_uring_files_cancel include/linux/io_uring.h:47 [inline]
       do_exit+0x258/0x2340 kernel/exit.c:780
       do_group_exit+0x168/0x2d0 kernel/exit.c:922
       get_signal+0x1734/0x1ef0 kernel/signal.c:2773
       arch_do_signal_or_restart+0x3c/0x610 arch/x86/kernel/signal.c:811
       handle_signal_work kernel/entry/common.c:147 [inline]
       exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
       exit_to_user_mode_prepare+0xac/0x1e0 kernel/entry/common.c:208
       __syscall_exit_to_user_mode_work kernel/entry/common.c:290 [inline]
       syscall_exit_to_user_mode+0x48/0x180 kernel/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x465f69
      
      which shouldn't happen, but seems to be possible due to a race on whether
      or not the io-wq manager sees a fatal signal first, or whether the io-wq
      workers do. If we race with queueing work and then send a fatal signal to
      the owning task, and the io-wq worker sees that before the manager sets
      IO_WQ_BIT_EXIT, then it's possible to have the worker exit and leave work
      behind.
      
      Just turn the WARN_ON_ONCE() into a cancelation condition instead.
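
      A hedged sketch of the teardown change (simplified from
      io_wq_destroy(); match-helper names follow fs/io-wq.c):

      /* Cancel anything still pending on each wqe instead of warning. */
      struct io_cb_cancel_data match = {
              .fn         = io_wq_work_match_all,
              .cancel_all = true,
      };

      io_wqe_cancel_pending_work(wqe, &match);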
      
      Reported-by: syzbot+77a738a6bc947bf639ca@syzkaller.appspotmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  18. 22 Mar 2021, 1 commit
  19. 21 Mar 2021, 1 commit
  20. 13 Mar 2021, 1 commit
    • io_uring: allow IO worker threads to be frozen · 16efa4fc
      Committed by Jens Axboe
      With the freezer using the proper signaling to notify us of when it's
      time to freeze a thread, we can re-enable normal freezer usage for the
      IO threads. Ensure that SQPOLL, io-wq, and the io-wq manager call
      try_to_freeze() appropriately, and remove the default setting of
      PF_NOFREEZE from create_io_thread().
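
      A sketch of the worker-loop side (a fragment from inside the
      worker's idle loop, simplified from io_wqe_worker()):

      /* Between work items, sleep interruptibly and honour the freezer. */
      set_current_state(TASK_INTERRUPTIBLE);
      ret = schedule_timeout(WORKER_IDLE_TIMEOUT);
      if (try_to_freeze() || ret)
              continue;
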
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  21. 10 Mar 2021, 3 commits
  22. 08 Mar 2021, 1 commit