1. 02 Sep, 2021 1 commit
    • J
      io-wq: split bounded and unbounded work into separate lists · f95dc207
      Jens Axboe committed
      We've got a few issues that all boil down to the fact that we have one
      list of pending work items, yet two different types of workers to
      serve them. This causes some oddities around workers switching type and
      even hashed work vs regular work on the same bounded list.
      
      Just separate them out cleanly, similarly to how we already do
      accounting of what is running. That provides a clean separation and
      removes some corner cases that can cause stalls when handling IO
      that is punted to io-wq.
      
      Fixes: ecc53c48 ("io-wq: check max_worker limits if a worker transitions bound state")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f95dc207
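The separation described in this commit can be pictured with a small userspace model. This is only a sketch: the names (`wq_model`, `acct_model`, `wq_enqueue`, `wq_dequeue`) are hypothetical and do not come from fs/io-wq.c. It keeps one pending list per worker type, next to the per-type running count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names come from fs/io-wq.c. */
enum work_type { WORK_BOUND = 0, WORK_UNBOUND = 1, WORK_NR = 2 };

struct work_item {
    enum work_type type;
    struct work_item *next;
};

/* Per-type state: the running count and the pending list live together. */
struct acct_model {
    unsigned int nr_running;
    struct work_item *head;
    struct work_item *tail;
};

struct wq_model {
    struct acct_model acct[WORK_NR];
};

/* Append the item to the list that matches its own type. */
static void wq_enqueue(struct wq_model *wq, struct work_item *w)
{
    struct acct_model *a;

    assert(w->type == WORK_BOUND || w->type == WORK_UNBOUND);
    a = &wq->acct[w->type];
    w->next = NULL;
    if (a->tail)
        a->tail->next = w;
    else
        a->head = w;
    a->tail = w;
}

/* A worker of a given type only ever looks at its own list. */
static struct work_item *wq_dequeue(struct wq_model *wq, enum work_type type)
{
    struct acct_model *a = &wq->acct[type];
    struct work_item *w = a->head;

    if (!w)
        return NULL;
    a->head = w->next;
    if (!a->head)
        a->tail = NULL;
    w->next = NULL;
    return w;
}
```

With this layout an unbounded worker never has to skip over bounded items, because it never sees them.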
  2. 01 Sep, 2021 3 commits
    • J
      io-wq: fix queue stalling race · 0242f642
      Jens Axboe committed
      We need to set the stalled bit early, before we drop the lock for adding
      us to the stall hash queue. If not, then we can race with new work being
      queued between adding us to the stall hash and io_worker_handle_work()
      marking us stalled.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0242f642
    • J
      io-wq: ensure that hash wait lock is IRQ disabling · 08bdbd39
      Jens Axboe committed
      A previous commit removed the IRQ safety of the worker and wqe locks,
      but that left one spot of the hash wait lock now being done without
      already having IRQs disabled.
      
      Ensure that we use the right locking variant for the hashed waitqueue
      lock.
      
      Fixes: a9a4aa9f ("io-wq: wqe and worker locks no longer need to be IRQ safe")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      08bdbd39
    • J
      io-wq: fix race between adding work and activating a free worker · 94ffb0a2
      Jens Axboe committed
      The attempt to find and activate a free worker for new work is currently
      combined with creating a new one if we don't find one, but that opens
      io-wq up to a race where the worker that is found and activated can
      put itself to sleep without knowing that it has been selected to perform
      this new work.
      
      Fix this by moving the activation into where we add the new work item,
      then we can retain it within the wqe->lock scope and eliminate the race
      with the worker itself checking inside the lock, but sleeping outside of
      it.
      
      Cc: stable@vger.kernel.org
      Reported-by: Andres Freund <andres@anarazel.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      94ffb0a2
  3. 30 Aug, 2021 3 commits
    • J
      io-wq: fix wakeup race when adding new work · 87df7fb9
      Jens Axboe committed
      When new work is added, io_wqe_enqueue() checks if we need to wake or
      create a new worker. But that check is done outside the lock that
      otherwise synchronizes us with a worker going to sleep, so we can end
      up in the following situation:
      
      CPU0				CPU1
      lock
      insert work
      unlock
      atomic_read(nr_running) != 0
      				lock
      				atomic_dec(nr_running)
      no wakeup needed
      
      Hold the wqe lock around the "need to wakeup" check. Then we can also get
      rid of the temporary work_flags variable, as we know the work will remain
      valid as long as we hold the lock.
      
      Cc: stable@vger.kernel.org
      Reported-by: Andres Freund <andres@anarazel.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      87df7fb9
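The fix for the wakeup race can be illustrated with a userspace model built on a pthread mutex. This is a sketch under assumptions: the struct and function names are hypothetical, and the worker side is simplified to "sleep only if nothing is pending". The point it shows is that the enqueue side evaluates the "need to wake up" condition inside the same critical section that inserts the work.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace model; names do not come from fs/io-wq.c. */
struct wqe_model {
    pthread_mutex_t lock;
    int nr_pending;   /* queued work items */
    int nr_running;   /* workers that are awake and running */
};

/*
 * Enqueue side: insert the work and decide about the wakeup while still
 * holding the lock, so a worker cannot go to sleep in between.
 */
static bool model_enqueue(struct wqe_model *wqe)
{
    bool need_wake;

    pthread_mutex_lock(&wqe->lock);
    wqe->nr_pending++;
    need_wake = (wqe->nr_running == 0);
    pthread_mutex_unlock(&wqe->lock);
    return need_wake;
}

/*
 * Worker side: the decision to sleep takes the same lock. Returns true if
 * the worker dropped its running count and may go to sleep.
 */
static bool model_worker_try_sleep(struct wqe_model *wqe)
{
    bool may_sleep;

    pthread_mutex_lock(&wqe->lock);
    may_sleep = (wqe->nr_pending == 0);
    if (may_sleep) {
        assert(wqe->nr_running > 0);
        wqe->nr_running--;
    }
    pthread_mutex_unlock(&wqe->lock);
    return may_sleep;
}
```

Because both sides serialize on one lock, either the worker sees the new item and stays awake, or the enqueuer sees `nr_running == 0` and reports that a wakeup is needed.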
    • J
      io-wq: wqe and worker locks no longer need to be IRQ safe · a9a4aa9f
      Jens Axboe committed
      io_uring no longer queues async work off completion handlers that run in
      hard or soft interrupt context, and that use case was the only reason that
      io-wq had to use IRQ safe locks for wqe and worker locks.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a9a4aa9f
    • J
      io-wq: check max_worker limits if a worker transitions bound state · ecc53c48
      Jens Axboe committed
      For the two places where new workers are created, we diligently check if
      we are allowed to create a new worker. If we're currently at the limit
      of how many workers of a given type we can have, then we don't create
      any new ones.
      
      If you have a mixed workload with various types of bound and unbounded
      work, then it can happen that a worker finishes one type of work and
      is then transitioned to the other type. For this case, we don't check
      if we are actually allowed to do so. This can cause io-wq to temporarily
      exceed the allowed number of workers for a given type.
      
      When retrieving work, check that the types match. If they don't, check
      if we are allowed to transition to the other type. If not, then don't
      handle the new work.
      
      Cc: stable@vger.kernel.org
      Reported-by: Johannes Lundberg <johalun0@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ecc53c48
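The type check described in this commit can be sketched as a small standalone function. The names below (`limit_model`, `worker_model`, `worker_may_take`) are hypothetical, and moving the per-type counts along with the worker is an assumption of this model rather than a description of the kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; names do not come from fs/io-wq.c. */
enum wtype { W_BOUND = 0, W_UNBOUND = 1, W_NR = 2 };

struct limit_model {
    int nr_workers;
    int max_workers;
};

struct worker_model {
    enum wtype type;
};

/*
 * Decide whether worker `w` may pick up work of type `work_type`.
 * Same type: always allowed. Different type: allowed only if the target
 * type is below its limit, in which case the worker changes type and the
 * per-type counts move with it (assumed bookkeeping).
 */
static bool worker_may_take(struct limit_model lim[W_NR],
                            struct worker_model *w, enum wtype work_type)
{
    if (w->type == work_type)
        return true;
    if (lim[work_type].nr_workers >= lim[work_type].max_workers)
        return false;   /* at the limit: do not handle the new work */

    assert(lim[w->type].nr_workers > 0);
    lim[w->type].nr_workers--;
    lim[work_type].nr_workers++;
    w->type = work_type;
    return true;
}
```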
  4. 29 Aug, 2021 1 commit
    • J
      io-wq: provide a way to limit max number of workers · 2e480058
      Jens Axboe committed
      io-wq divides work into two categories:
      
      1) Work that completes in a bounded time, like reading from a regular file
         or a block device. This type of work is limited based on the size of
         the SQ ring.
      
      2) Work that may never complete, we call this unbounded work. The amount
         of workers here is just limited by RLIMIT_NPROC.
      
      For various use cases, it's handy to have the kernel limit the maximum
      number of pending workers for both categories. Provide a way to do that
      with a new IORING_REGISTER_IOWQ_MAX_WORKERS operation.
      
      IORING_REGISTER_IOWQ_MAX_WORKERS takes an array of two integers and sets
      the max worker count to what is being passed in for each category. The
      old values are returned into that same array. If 0 is being passed in for
      either category, it simply returns the current value.
      
      The value is capped at RLIMIT_NPROC. This actually isn't that important
      as it's more of a hint, if we're exceeding the value then our attempt
      to fork a new worker will fail. This happens naturally already if more
      than one node is in the system, as these values are per-node internally
      for io-wq.
      Reported-by: Johannes Lundberg <johalun0@gmail.com>
      Link: https://github.com/axboe/liburing/issues/420
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2e480058
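The in/out array semantics described above can be modelled in plain C. This is a userspace sketch of the behaviour as the commit message states it, not the kernel implementation: `iowq_limits`, `iowq_model_set_max_workers` and the `nproc_limit` parameter (standing in for RLIMIT_NPROC) are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model; index 0 is bounded work, index 1 is unbounded work. */
enum { IOWQ_BOUND = 0, IOWQ_UNBOUND = 1, IOWQ_NR = 2 };

struct iowq_limits {
    unsigned int max_workers[IOWQ_NR];
};

/*
 * `values` is both input and output:
 *  - a nonzero entry becomes the new limit, capped at nproc_limit;
 *  - a zero entry leaves the limit unchanged;
 *  - on return, each entry holds the limit that was in effect before.
 */
static void iowq_model_set_max_workers(struct iowq_limits *lim,
                                       unsigned int values[IOWQ_NR],
                                       unsigned int nproc_limit)
{
    assert(lim != NULL && values != NULL);

    for (int i = 0; i < IOWQ_NR; i++) {
        unsigned int old = lim->max_workers[i];
        unsigned int req = values[i];

        if (req) {
            if (req > nproc_limit)
                req = nproc_limit;
            lim->max_workers[i] = req;
        }
        values[i] = old;
    }
}
```

Passing `{0, 0}` is therefore a pure query: nothing changes and the array comes back holding the current limits.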
  5. 24 Aug, 2021 2 commits
    • H
      io-wq: move nr_running and worker_refs out of wqe->lock protection · 79dca184
      Hao Xu committed
      We don't need to protect nr_running and worker_refs by wqe->lock, so
      narrow the range of raw_spin_lock_irq - raw_spin_unlock_irq
      Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
      Link: https://lore.kernel.org/r/20210810125554.99229-1-haoxu@linux.alibaba.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      79dca184
    • J
      io-wq: remove GFP_ATOMIC allocation off schedule out path · d3e9f732
      Jens Axboe committed
      Daniel reports that the v5.14-rc4-rt4 kernel throws a BUG when running
      stress-ng:
      
      | [   90.202543] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:35
      | [   90.202549] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 2047, name: iou-wrk-2041
      | [   90.202555] CPU: 5 PID: 2047 Comm: iou-wrk-2041 Tainted: G        W         5.14.0-rc4-rt4+ #89
      | [   90.202559] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
      | [   90.202561] Call Trace:
      | [   90.202577]  dump_stack_lvl+0x34/0x44
      | [   90.202584]  ___might_sleep.cold+0x87/0x94
      | [   90.202588]  rt_spin_lock+0x19/0x70
      | [   90.202593]  ___slab_alloc+0xcb/0x7d0
      | [   90.202598]  ? newidle_balance.constprop.0+0xf5/0x3b0
      | [   90.202603]  ? dequeue_entity+0xc3/0x290
      | [   90.202605]  ? io_wqe_dec_running.isra.0+0x98/0xe0
      | [   90.202610]  ? pick_next_task_fair+0xb9/0x330
      | [   90.202612]  ? __schedule+0x670/0x1410
      | [   90.202615]  ? io_wqe_dec_running.isra.0+0x98/0xe0
      | [   90.202618]  kmem_cache_alloc_trace+0x79/0x1f0
      | [   90.202621]  io_wqe_dec_running.isra.0+0x98/0xe0
      | [   90.202625]  io_wq_worker_sleeping+0x37/0x50
      | [   90.202628]  schedule+0x30/0xd0
      | [   90.202630]  schedule_timeout+0x8f/0x1a0
      | [   90.202634]  ? __bpf_trace_tick_stop+0x10/0x10
      | [   90.202637]  io_wqe_worker+0xfd/0x320
      | [   90.202641]  ? finish_task_switch.isra.0+0xd3/0x290
      | [   90.202644]  ? io_worker_handle_work+0x670/0x670
      | [   90.202646]  ? io_worker_handle_work+0x670/0x670
      | [   90.202649]  ret_from_fork+0x22/0x30
      
      which is due to the RT kernel not liking a GFP_ATOMIC allocation inside
      a raw spinlock. Besides that not working on RT, doing any kind of
      allocation from inside schedule() is kind of nasty and should be avoided
      if at all possible.
      
      This particular path happens when an io-wq worker goes to sleep, and we
      need a new worker to handle pending work. We currently allocate a small
      data item to hold the information we need to create a new worker, but we
      can instead include this data in the io_worker struct itself and just
      protect it with a single bit lock. We only really need one per worker
      anyway, as we will have run pending work between two sleep cycles.
      
      https://lore.kernel.org/lkml/20210804082418.fbibprcwtzyt5qax@beryllium.lan/
      Reported-by: Daniel Wagner <dwagner@suse.de>
      Tested-by: Daniel Wagner <dwagner@suse.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d3e9f732
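The idea of embedding the "create a new worker" data in the worker itself and guarding it with a single bit lock can be sketched with C11 atomics. This is a userspace illustration only: `sleepy_worker`, `worker_queue_create` and `worker_complete_create` are hypothetical names, and `atomic_flag` stands in for the kernel's bit operations.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a worker that embeds its own creation request. */
struct sleepy_worker {
    atomic_flag create_busy;   /* single-bit lock guarding create_index */
    int create_index;          /* data that previously needed an allocation */
};

/*
 * Sleep path: no allocation, just try to claim the embedded slot.
 * Returns false if a request from this worker is already in flight.
 */
static bool worker_queue_create(struct sleepy_worker *w, int index)
{
    assert(index >= 0);
    if (atomic_flag_test_and_set_explicit(&w->create_busy,
                                          memory_order_acquire))
        return false;
    w->create_index = index;
    return true;
}

/*
 * Called by whoever services the request once it has been handled;
 * reads the embedded data and releases the slot for the next sleep cycle.
 */
static int worker_complete_create(struct sleepy_worker *w)
{
    int index = w->create_index;

    atomic_flag_clear_explicit(&w->create_busy, memory_order_release);
    return index;
}
```

One slot per worker is enough in this model for the reason the commit message gives: a worker runs pending work between two sleep cycles, so it never needs two outstanding requests at once.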
  6. 10 Aug, 2021 2 commits
  7. 06 Aug, 2021 2 commits
    • H
      io-wq: fix lack of acct->nr_workers < acct->max_workers judgement · 21698274
      Hao Xu committed
      This check must be made before we create an io-worker.
      
      Fixes: 685fe7fe ("io-wq: eliminate the need for a manager thread")
      Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      21698274
    • H
      io-wq: fix no lock protection of acct->nr_worker · 3d4e4fac
      Hao Xu committed
      There is an acct->nr_worker access without lock protection. Consider
      this case: two callers call io_wqe_wake_worker() in parallel on two
      CPUs, one being the original context and the other an io-worker (by
      calling io_wqe_enqueue(wqe, linked)). This may cause nr_worker to
      become larger than max_worker.
      Fix it by taking the lock around the check, and do nr_workers++ before
      create_io_worker(). There may be an edge case where the first caller
      fails to create an io-worker, but the second caller doesn't know it and
      then gives up creating an io-worker as well:
      
      say nr_worker = max_worker - 1
              cpu 0                        cpu 1
         io_wqe_wake_worker()          io_wqe_wake_worker()
            nr_worker < max_worker
            nr_worker++
            create_io_worker()         nr_worker == max_worker
               failed                  return
            return
      
      But the chance of this case is very slim.
      
      Fixes: 685fe7fe ("io-wq: eliminate the need for a manager thread")
      Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
      [axboe: fix unconditional create_io_worker() call]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3d4e4fac
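The "reserve the slot under the lock, then create" pattern from this commit can be sketched in userspace with a pthread mutex. The names are hypothetical, the `create_fn` hook exists only so the failure path can be exercised, and giving the slot back on failure is an assumption of this model.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model; names do not come from fs/io-wq.c. */
struct slot_acct {
    pthread_mutex_t lock;
    int nr_workers;
    int max_workers;
};

/* Stand-in for worker creation; returns true on success. */
typedef bool (*create_fn)(void);

static bool create_ok(void)   { return true; }
static bool create_fail(void) { return false; }

/* Returns true if a new worker was created. */
static bool wake_or_create_worker(struct slot_acct *acct, create_fn create)
{
    bool do_create = false;

    /* Check and increment in one critical section, so two callers cannot
     * both pass the limit check for the last free slot. */
    pthread_mutex_lock(&acct->lock);
    if (acct->nr_workers < acct->max_workers) {
        acct->nr_workers++;
        do_create = true;
    }
    pthread_mutex_unlock(&acct->lock);

    if (!do_create)
        return false;
    if (create())
        return true;

    /* Creation failed: return the reserved slot (assumed rollback). */
    pthread_mutex_lock(&acct->lock);
    assert(acct->nr_workers > 0);
    acct->nr_workers--;
    pthread_mutex_unlock(&acct->lock);
    return false;
}
```

The edge case from the commit message is visible here: between the increment and a failed `create()`, a second caller sees the limit as reached and returns without creating anything.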
  8. 05 Aug, 2021 1 commit
  9. 24 Jul, 2021 1 commit
    • J
      io_uring: explicitly catch any illegal async queue attempt · 991468dc
      Jens Axboe committed
      Catch an illegal case to queue async from an unrelated task that got
      the ring fd passed to it. This should not be possible to hit, but
      better be proactive and catch it explicitly. io-wq is extended to
      check for early IO_WQ_WORK_CANCEL being set on a work item as well,
      so it can run the request through the normal cancelation path.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      991468dc
  10. 18 Jun, 2021 3 commits
  11. 16 Jun, 2021 2 commits
  12. 14 Jun, 2021 4 commits
  13. 26 May, 2021 2 commits
    • Z
      io-wq: Fix UAF when wakeup wqe in hash waitqueue · 3743c172
      Zqiang committed
      BUG: KASAN: use-after-free in __wake_up_common+0x637/0x650
      Read of size 8 at addr ffff8880304250d8 by task iou-wrk-28796/28802
      
      Call Trace:
       __dump_stack [inline]
       dump_stack+0x141/0x1d7
       print_address_description.constprop.0.cold+0x5b/0x2c6
       __kasan_report [inline]
       kasan_report.cold+0x7c/0xd8
       __wake_up_common+0x637/0x650
       __wake_up_common_lock+0xd0/0x130
       io_worker_handle_work+0x9dd/0x1790
       io_wqe_worker+0xb2a/0xd40
       ret_from_fork+0x1f/0x30
      
      Allocated by task 28798:
       kzalloc_node [inline]
       io_wq_create+0x3c4/0xdd0
       io_init_wq_offload [inline]
       io_uring_alloc_task_context+0x1bf/0x6b0
       __io_uring_add_task_file+0x29a/0x3c0
       io_uring_add_task_file [inline]
       io_uring_install_fd [inline]
       io_uring_create [inline]
       io_uring_setup+0x209a/0x2bd0
       do_syscall_64+0x3a/0xb0
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Freed by task 28798:
       kfree+0x106/0x2c0
       io_wq_destroy+0x182/0x380
       io_wq_put [inline]
       io_wq_put_and_exit+0x7a/0xa0
       io_uring_clean_tctx [inline]
       __io_uring_cancel+0x428/0x530
       io_uring_files_cancel
       do_exit+0x299/0x2a60
       do_group_exit+0x125/0x310
       get_signal+0x47f/0x2150
       arch_do_signal_or_restart+0x2a8/0x1eb0
       handle_signal_work [inline]
       exit_to_user_mode_loop [inline]
       exit_to_user_mode_prepare+0x171/0x280
       __syscall_exit_to_user_mode_work [inline]
       syscall_exit_to_user_mode+0x19/0x60
       do_syscall_64+0x47/0xb0
       entry_SYSCALL_64_after_hwframe
      
      Consider the following scenario, where the hash waitqueue is shared by
      io-wq1 and io-wq2 (note: wqe is worker):
      
      io-wq1:worker2     | locks bit1
      io-wq2:worker1     | waits bit1
      io-wq1:worker3     | waits bit1
      
      io-wq1:worker2     | completes all wqe bit1 work items
      io-wq1:worker2     | drop bit1, exit
      
      io-wq2:worker1     | locks bit1
      io-wq1:worker3     | cannot lock bit1, waits bit1 and exits
      io-wq1             | exits and frees io-wq1
      io-wq2:worker1     | drops bit1
      io-wq1:worker3     | is woken up, even though wqe is freed
      
      After all iou-wrk threads belonging to io-wq1 have exited, remove the
      wqe from the hash waitqueue. This guarantees that no wqe belonging to
      io-wq1 remains in the hash waitqueue.
      
      Reported-by: syzbot+6cb11ade52aa17095297@syzkaller.appspotmail.com
      Signed-off-by: Zqiang <qiang.zhang@windriver.com>
      Link: https://lore.kernel.org/r/20210526050826.30500-1-qiang.zhang@windriver.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3743c172
    • P
      io_uring/io-wq: close io-wq full-stop gap · 17a91051
      Pavel Begunkov committed
      There is an old problem with io-wq cancellation where requests should be
      killed and are in io-wq but are not discoverable, e.g. in @next_hashed
      or @linked vars of io_worker_handle_work(). It adds some unreliability
      to individual request cancellation, but also may potentially get
      __io_uring_cancel() stuck. For instance:
      
      1) An __io_uring_cancel()'s cancellation round has not found any
         request but there are some as described.
      2) __io_uring_cancel() goes to sleep
      3) Then workers wake up and try to execute those hidden requests
         that happen to be unbound.
      
      As we already cancel all requests of io-wq there, set IO_WQ_BIT_EXIT
      in advance, so preventing 3) from executing unbound requests. The
      workers will initially break looping because of getting a signal as they
      are threads of the dying/exec()'ing user task.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/abfcf8c54cb9e8f7bfbad7e9a0cc5433cc70bdc2.1621781238.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      17a91051
  14. 21 Apr, 2021 1 commit
  15. 12 Apr, 2021 5 commits
  16. 09 Apr, 2021 1 commit
    • P
      io-wq: cancel unbounded works on io-wq destroy · c60eb049
      Pavel Begunkov committed
      WARNING: CPU: 5 PID: 227 at fs/io_uring.c:8578 io_ring_exit_work+0xe6/0x470
      RIP: 0010:io_ring_exit_work+0xe6/0x470
      Call Trace:
       process_one_work+0x206/0x400
       worker_thread+0x4a/0x3d0
       kthread+0x129/0x170
       ret_from_fork+0x22/0x30
      
      INFO: task lfs-openat:2359 blocked for more than 245 seconds.
      task:lfs-openat      state:D stack:    0 pid: 2359 ppid:     1 flags:0x00000004
      Call Trace:
       ...
       wait_for_completion+0x8b/0xf0
       io_wq_destroy_manager+0x24/0x60
       io_wq_put_and_exit+0x18/0x30
       io_uring_clean_tctx+0x76/0xa0
       __io_uring_files_cancel+0x1b9/0x2e0
       do_exit+0xc0/0xb40
       ...
      
      Even after io-wq destroy has been issued io-wq worker threads will
      continue executing all left work items as usual, and may hang waiting
      for I/O that won't ever complete (aka unbounded).
      
      [<0>] pipe_read+0x306/0x450
      [<0>] io_iter_do_read+0x1e/0x40
      [<0>] io_read+0xd5/0x330
      [<0>] io_issue_sqe+0xd21/0x18a0
      [<0>] io_wq_submit_work+0x6c/0x140
      [<0>] io_worker_handle_work+0x17d/0x400
      [<0>] io_wqe_worker+0x2c0/0x330
      [<0>] ret_from_fork+0x22/0x30
      
      Cancel all unbounded I/O instead of executing them. This changes the
      user visible behaviour, but that's inevitable as io-wq is not per task.
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/cd4b543154154cba055cf86f351441c2174d7f71.1617842918.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c60eb049
  17. 01 Apr, 2021 1 commit
  18. 28 Mar, 2021 1 commit
  19. 26 Mar, 2021 1 commit
    • J
      io-wq: fix race around pending work on teardown · f5d2d23b
      Jens Axboe committed
      syzbot reports that it's triggering the warning condition on having
      pending work on shutdown:
      
      WARNING: CPU: 1 PID: 12346 at fs/io-wq.c:1061 io_wq_destroy fs/io-wq.c:1061 [inline]
      WARNING: CPU: 1 PID: 12346 at fs/io-wq.c:1061 io_wq_put+0x153/0x260 fs/io-wq.c:1072
      Modules linked in:
      CPU: 1 PID: 12346 Comm: syz-executor.5 Not tainted 5.12.0-rc2-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:io_wq_destroy fs/io-wq.c:1061 [inline]
      RIP: 0010:io_wq_put+0x153/0x260 fs/io-wq.c:1072
      Code: 8d e8 71 90 ea 01 49 89 c4 41 83 fc 40 7d 4f e8 33 4d 97 ff 42 80 7c 2d 00 00 0f 85 77 ff ff ff e9 7a ff ff ff e8 1d 4d 97 ff <0f> 0b eb b9 8d 6b ff 89 ee 09 de bf ff ff ff ff e8 18 51 97 ff 09
      RSP: 0018:ffffc90001ebfb08 EFLAGS: 00010293
      RAX: ffffffff81e16083 RBX: ffff888019038040 RCX: ffff88801e86b780
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000040
      RBP: 1ffff1100b2f8a80 R08: ffffffff81e15fce R09: ffffed100b2f8a82
      R10: ffffed100b2f8a82 R11: 0000000000000000 R12: 0000000000000000
      R13: dffffc0000000000 R14: ffff8880597c5400 R15: ffff888019038000
      FS:  00007f8dcd89c700(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000055e9a054e160 CR3: 000000001dfb8000 CR4: 00000000001506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       io_uring_clean_tctx+0x1b7/0x210 fs/io_uring.c:8802
       __io_uring_files_cancel+0x13c/0x170 fs/io_uring.c:8820
       io_uring_files_cancel include/linux/io_uring.h:47 [inline]
       do_exit+0x258/0x2340 kernel/exit.c:780
       do_group_exit+0x168/0x2d0 kernel/exit.c:922
       get_signal+0x1734/0x1ef0 kernel/signal.c:2773
       arch_do_signal_or_restart+0x3c/0x610 arch/x86/kernel/signal.c:811
       handle_signal_work kernel/entry/common.c:147 [inline]
       exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
       exit_to_user_mode_prepare+0xac/0x1e0 kernel/entry/common.c:208
       __syscall_exit_to_user_mode_work kernel/entry/common.c:290 [inline]
       syscall_exit_to_user_mode+0x48/0x180 kernel/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x465f69
      
      which shouldn't happen, but seems to be possible due to a race on whether
      or not the io-wq manager sees a fatal signal first, or whether the io-wq
      workers do. If we race with queueing work and then send a fatal signal to
      the owning task, and the io-wq worker sees that before the manager sets
      IO_WQ_BIT_EXIT, then it's possible to have the worker exit and leave work
      behind.
      
      Just turn the WARN_ON_ONCE() into a cancelation condition instead.
      
      Reported-by: syzbot+77a738a6bc947bf639ca@syzkaller.appspotmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f5d2d23b
  20. 22 Mar, 2021 1 commit
  21. 21 Mar, 2021 1 commit
  22. 13 Mar, 2021 1 commit
    • J
      io_uring: allow IO worker threads to be frozen · 16efa4fc
      Jens Axboe committed
      With the freezer using the proper signaling to notify us of when it's
      time to freeze a thread, we can re-enable normal freezer usage for the
      IO threads. Ensure that SQPOLL, io-wq, and the io-wq manager call
      try_to_freeze() appropriately, and remove the default setting of
      PF_NOFREEZE from create_io_thread().
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      16efa4fc