1. 12 Apr, 2021 5 commits
  2. 09 Apr, 2021 1 commit
    • P
      io-wq: cancel unbounded works on io-wq destroy · c60eb049
      Pavel Begunkov committed
      WARNING: CPU: 5 PID: 227 at fs/io_uring.c:8578 io_ring_exit_work+0xe6/0x470
      RIP: 0010:io_ring_exit_work+0xe6/0x470
      Call Trace:
       process_one_work+0x206/0x400
       worker_thread+0x4a/0x3d0
       kthread+0x129/0x170
       ret_from_fork+0x22/0x30
      
      INFO: task lfs-openat:2359 blocked for more than 245 seconds.
      task:lfs-openat      state:D stack:    0 pid: 2359 ppid:     1 flags:0x00000004
      Call Trace:
       ...
       wait_for_completion+0x8b/0xf0
       io_wq_destroy_manager+0x24/0x60
       io_wq_put_and_exit+0x18/0x30
       io_uring_clean_tctx+0x76/0xa0
       __io_uring_files_cancel+0x1b9/0x2e0
       do_exit+0xc0/0xb40
       ...
      
      Even after io-wq destroy has been issued io-wq worker threads will
      continue executing all left work items as usual, and may hang waiting
      for I/O that won't ever complete (aka unbounded).
      
      [<0>] pipe_read+0x306/0x450
      [<0>] io_iter_do_read+0x1e/0x40
      [<0>] io_read+0xd5/0x330
      [<0>] io_issue_sqe+0xd21/0x18a0
      [<0>] io_wq_submit_work+0x6c/0x140
      [<0>] io_worker_handle_work+0x17d/0x400
      [<0>] io_wqe_worker+0x2c0/0x330
      [<0>] ret_from_fork+0x22/0x30
      
      Cancel all unbounded I/O instead of executing them. This changes the
      user visible behaviour, but that's inevitable as io-wq is not per task.
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/cd4b543154154cba055cf86f351441c2174d7f71.1617842918.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c60eb049
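The behaviour change described in c60eb049 can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: `WORK_UNBOUND`, `WORK_CANCEL`, `struct work_item`, `do_work()` and `drain_work()` are hypothetical names invented for illustration, not the kernel's definitions. It shows a worker loop that, once teardown has started, flags unbounded items as cancelled before handing them to the handler, while bounded items still run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag and struct names, for illustration only. */
enum {
    WORK_UNBOUND = 1u << 0, /* may wait indefinitely, e.g. a pipe read */
    WORK_CANCEL  = 1u << 1, /* handler must complete it as cancelled */
};

struct work_item {
    unsigned flags;
    bool executed;
    bool canceled;
};

/* Handler: honours the cancel flag instead of issuing the I/O. */
void do_work(struct work_item *w)
{
    if (w->flags & WORK_CANCEL) {
        w->canceled = true;
        return;
    }
    w->executed = true;
}

/* Worker loop: when exiting, unbounded items are cancelled, not executed. */
void drain_work(struct work_item *items, size_t n, bool exiting)
{
    assert(items != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (exiting && (items[i].flags & WORK_UNBOUND))
            items[i].flags |= WORK_CANCEL;
        do_work(&items[i]);
    }
}
```

Calling `drain_work(items, n, true)` leaves bounded items executed and unbounded items cancelled; with `exiting` false, every item is executed as usual.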
  3. 01 Apr, 2021 1 commit
  4. 28 Mar, 2021 1 commit
  5. 26 Mar, 2021 1 commit
    • J
      io-wq: fix race around pending work on teardown · f5d2d23b
      Jens Axboe committed
      syzbot reports that it's triggering the warning condition on having
      pending work on shutdown:
      
      WARNING: CPU: 1 PID: 12346 at fs/io-wq.c:1061 io_wq_destroy fs/io-wq.c:1061 [inline]
      WARNING: CPU: 1 PID: 12346 at fs/io-wq.c:1061 io_wq_put+0x153/0x260 fs/io-wq.c:1072
      Modules linked in:
      CPU: 1 PID: 12346 Comm: syz-executor.5 Not tainted 5.12.0-rc2-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:io_wq_destroy fs/io-wq.c:1061 [inline]
      RIP: 0010:io_wq_put+0x153/0x260 fs/io-wq.c:1072
      Code: 8d e8 71 90 ea 01 49 89 c4 41 83 fc 40 7d 4f e8 33 4d 97 ff 42 80 7c 2d 00 00 0f 85 77 ff ff ff e9 7a ff ff ff e8 1d 4d 97 ff <0f> 0b eb b9 8d 6b ff 89 ee 09 de bf ff ff ff ff e8 18 51 97 ff 09
      RSP: 0018:ffffc90001ebfb08 EFLAGS: 00010293
      RAX: ffffffff81e16083 RBX: ffff888019038040 RCX: ffff88801e86b780
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000040
      RBP: 1ffff1100b2f8a80 R08: ffffffff81e15fce R09: ffffed100b2f8a82
      R10: ffffed100b2f8a82 R11: 0000000000000000 R12: 0000000000000000
      R13: dffffc0000000000 R14: ffff8880597c5400 R15: ffff888019038000
      FS:  00007f8dcd89c700(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000055e9a054e160 CR3: 000000001dfb8000 CR4: 00000000001506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       io_uring_clean_tctx+0x1b7/0x210 fs/io_uring.c:8802
       __io_uring_files_cancel+0x13c/0x170 fs/io_uring.c:8820
       io_uring_files_cancel include/linux/io_uring.h:47 [inline]
       do_exit+0x258/0x2340 kernel/exit.c:780
       do_group_exit+0x168/0x2d0 kernel/exit.c:922
       get_signal+0x1734/0x1ef0 kernel/signal.c:2773
       arch_do_signal_or_restart+0x3c/0x610 arch/x86/kernel/signal.c:811
       handle_signal_work kernel/entry/common.c:147 [inline]
       exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
       exit_to_user_mode_prepare+0xac/0x1e0 kernel/entry/common.c:208
       __syscall_exit_to_user_mode_work kernel/entry/common.c:290 [inline]
       syscall_exit_to_user_mode+0x48/0x180 kernel/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x465f69
      
      which shouldn't happen, but seems to be possible due to a race on whether
      or not the io-wq manager sees a fatal signal first, or whether the io-wq
      workers do. If we race with queueing work and then send a fatal signal to
      the owning task, and the io-wq worker sees that before the manager sets
      IO_WQ_BIT_EXIT, then it's possible to have the worker exit and leave work
      behind.
      
      Just turn the WARN_ON_ONCE() into a cancelation condition instead.
      
      Reported-by: syzbot+77a738a6bc947bf639ca@syzkaller.appspotmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f5d2d23b
  6. 22 Mar, 2021 1 commit
  7. 21 Mar, 2021 1 commit
  8. 13 Mar, 2021 1 commit
    • J
      io_uring: allow IO worker threads to be frozen · 16efa4fc
      Jens Axboe committed
      With the freezer using the proper signaling to notify us of when it's
      time to freeze a thread, we can re-enable normal freezer usage for the
      IO threads. Ensure that SQPOLL, io-wq, and the io-wq manager call
      try_to_freeze() appropriately, and remove the default setting of
      PF_NOFREEZE from create_io_thread().
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      16efa4fc
  9. 10 Mar, 2021 3 commits
  10. 08 Mar, 2021 1 commit
  11. 07 Mar, 2021 1 commit
    • J
      io-wq: fix race in freeing 'wq' and worker access · 886d0137
      Jens Axboe committed
      Ran into a use-after-free on the main io-wq struct, wq. It has a worker
      ref and completion event, but the manager itself isn't holding a
      reference. This can lead to a race where the manager thinks there are
      no workers and exits, but a worker is being added. That leads to the
      following trace:
      
      BUG: KASAN: use-after-free in io_wqe_worker+0x4c0/0x5e0
      Read of size 8 at addr ffff888108baa8a0 by task iou-wrk-3080422/3080425
      
      CPU: 5 PID: 3080425 Comm: iou-wrk-3080422 Not tainted 5.12.0-rc1+ #110
      Hardware name: Micro-Star International Co., Ltd. MS-7C60/TRX40 PRO 10G (MS-7C60), BIOS 1.60 05/13/2020
      Call Trace:
       dump_stack+0x90/0xbe
       print_address_description.constprop.0+0x67/0x28d
       ? io_wqe_worker+0x4c0/0x5e0
       kasan_report.cold+0x7b/0xd4
       ? io_wqe_worker+0x4c0/0x5e0
       __asan_load8+0x6d/0xa0
       io_wqe_worker+0x4c0/0x5e0
       ? io_worker_handle_work+0xc00/0xc00
       ? recalc_sigpending+0xe5/0x120
       ? io_worker_handle_work+0xc00/0xc00
       ? io_worker_handle_work+0xc00/0xc00
       ret_from_fork+0x1f/0x30
      
      Allocated by task 3080422:
       kasan_save_stack+0x23/0x60
       __kasan_kmalloc+0x80/0xa0
       kmem_cache_alloc_node_trace+0xa0/0x480
       io_wq_create+0x3b5/0x600
       io_uring_alloc_task_context+0x13c/0x380
       io_uring_add_task_file+0x109/0x140
       __x64_sys_io_uring_enter+0x45f/0x660
       do_syscall_64+0x32/0x80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Freed by task 3080422:
       kasan_save_stack+0x23/0x60
       kasan_set_track+0x20/0x40
       kasan_set_free_info+0x24/0x40
       __kasan_slab_free+0xe8/0x120
       kfree+0xa8/0x400
       io_wq_put+0x14a/0x220
       io_wq_put_and_exit+0x9a/0xc0
       io_uring_clean_tctx+0x101/0x140
       __io_uring_files_cancel+0x36e/0x3c0
       do_exit+0x169/0x1340
       __x64_sys_exit+0x34/0x40
       do_syscall_64+0x32/0x80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Have the manager itself hold a reference, and now both drop points drop
      and complete if we hit zero, and the manager can unconditionally do a
      wait_for_completion() instead of having a race between reading the ref
      count and waiting if it was non-zero.
      
      Fixes: fb3a1f6c ("io-wq: have manager wait for all workers to exit")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      886d0137
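The reference scheme described in 886d0137 (the manager holds its own reference, and whichever drop reaches zero signals the completion) can be sketched in userspace C. This is a single-threaded model under stated assumptions: `struct wq_model` and the `wq_model_*()` helpers are hypothetical names, and a plain `bool` stands in for the kernel completion that the manager would wait on.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the worker reference count; not kernel code. */
struct wq_model {
    atomic_int worker_refs;
    bool worker_done; /* stand-in for a completion */
};

/* The count starts at 1: that reference belongs to the manager itself. */
void wq_model_init(struct wq_model *wq)
{
    atomic_init(&wq->worker_refs, 1);
    wq->worker_done = false;
}

/* Taken for each worker that is being added. */
void wq_model_get(struct wq_model *wq)
{
    atomic_fetch_add(&wq->worker_refs, 1);
}

/* Shared drop point for manager and workers: the last put signals. */
bool wq_model_put(struct wq_model *wq)
{
    int old = atomic_fetch_sub(&wq->worker_refs, 1);

    assert(old > 0); /* dropping below zero would be a bug */
    if (old == 1) {
        wq->worker_done = true;
        return true;
    }
    return false;
}
```

Because the manager's own reference keeps the count above zero until the manager drops it, the "done" signal cannot fire while the manager could still add a worker, so the waiter no longer needs to read the count before deciding whether to wait.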
  12. 05 Mar, 2021 2 commits
    • J
      io-wq: kill hashed waitqueue before manager exits · 09ca6c40
      Jens Axboe committed
      If we race with shutting down the io-wq context and someone queueing
      a hashed entry, then we can exit the manager with it armed. If it then
      triggers after the manager has exited, we can have a use-after-free where
      io_wqe_hash_wake() attempts to wake a now gone manager process.
      
      Move the killing of the hashed waitqueue into the manager itself, so
      that we know we've killed it before the task exits.
      
      Fixes: e941894e ("io-wq: make buffered file write hashed work map per-ctx")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      09ca6c40
    • J
      io_uring: move to using create_io_thread() · 46fe18b1
      Jens Axboe committed
      This allows us to do task creation and setup without needing to use
      completions to try and synchronize with the starting thread. Get rid of
      the old io_wq_fork_thread() wrapper, and the 'wq' and 'worker' startup
      completion events - we can now do setup before the task is running.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      46fe18b1
  13. 04 Mar, 2021 10 commits
    • J
      io-wq: ensure all pending work is canceled on exit · f0127254
      Jens Axboe committed
      If we race on shutting down the io-wq, then we should ensure that any
      work that was queued after workers shutdown is canceled. Harden the
      add work check a bit too, checking for IO_WQ_BIT_EXIT and cancel if
      it's set.
      
      Add a WARN_ON() for having any work before we kill the io-wq context.
      
      Reported-by: syzbot+91b4b56ead187d35c9d3@syzkaller.appspotmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f0127254
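The two hardening steps in f0127254 (cancel whatever is still pending at exit, and cancel at enqueue time once the exit bit is set) can be illustrated with a small userspace model. All names here (`QUEUE_BIT_EXIT`, `struct pending_work`, `struct queue_model`, `queue_add()`, `queue_exit()`) are assumptions made up for this sketch and do not match the kernel's structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an exit bit such as IO_WQ_BIT_EXIT. */
enum { QUEUE_BIT_EXIT = 1u << 0 };

struct pending_work {
    struct pending_work *next;
    bool canceled;
};

struct queue_model {
    unsigned state;
    struct pending_work *head;
    struct pending_work *tail;
};

/* Returns true if the work was queued, false if it was cancelled. */
bool queue_add(struct queue_model *q, struct pending_work *w)
{
    assert(q != NULL && w != NULL);
    w->next = NULL;
    w->canceled = false;
    if (q->state & QUEUE_BIT_EXIT) {
        w->canceled = true; /* too late: cancel instead of queueing */
        return false;
    }
    if (q->tail)
        q->tail->next = w;
    else
        q->head = w;
    q->tail = w;
    return true;
}

/* Sets the exit bit, cancels everything still pending, returns the count. */
size_t queue_exit(struct queue_model *q)
{
    size_t n = 0;

    q->state |= QUEUE_BIT_EXIT;
    while (q->head) {
        struct pending_work *w = q->head;

        q->head = w->next;
        w->next = NULL;
        w->canceled = true;
        n++;
    }
    q->tail = NULL;
    return n;
}
```

After `queue_exit()` returns, the list is empty and any later `queue_add()` cancels its item immediately, so no work is left behind when the context is freed.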
    • J
      io_uring: ensure that threads freeze on suspend · e4b4a13f
      Jens Axboe committed
      Alex reports that his system fails to suspend using 5.12-rc1, with the
      following dump:
      
      [  240.650300] PM: suspend entry (deep)
      [  240.650748] Filesystems sync: 0.000 seconds
      [  240.725605] Freezing user space processes ...
      [  260.739483] Freezing of tasks failed after 20.013 seconds (3 tasks refusing to freeze, wq_busy=0):
      [  260.739497] task:iou-mgr-446     state:S stack:    0 pid:  516 ppid:   439 flags:0x00004224
      [  260.739504] Call Trace:
      [  260.739507]  ? sysvec_apic_timer_interrupt+0xb/0x81
      [  260.739515]  ? pick_next_task_fair+0x197/0x1cde
      [  260.739519]  ? sysvec_reschedule_ipi+0x2f/0x6a
      [  260.739522]  ? asm_sysvec_reschedule_ipi+0x12/0x20
      [  260.739525]  ? __schedule+0x57/0x6d6
      [  260.739529]  ? del_timer_sync+0xb9/0x115
      [  260.739533]  ? schedule+0x63/0xd5
      [  260.739536]  ? schedule_timeout+0x219/0x356
      [  260.739540]  ? __next_timer_interrupt+0xf1/0xf1
      [  260.739544]  ? io_wq_manager+0x73/0xb1
      [  260.739549]  ? io_wq_create+0x262/0x262
      [  260.739553]  ? ret_from_fork+0x22/0x30
      [  260.739557] task:iou-mgr-517     state:S stack:    0 pid:  522 ppid:   439 flags:0x00004224
      [  260.739561] Call Trace:
      [  260.739563]  ? sysvec_apic_timer_interrupt+0xb/0x81
      [  260.739566]  ? pick_next_task_fair+0x16f/0x1cde
      [  260.739569]  ? sysvec_apic_timer_interrupt+0xb/0x81
      [  260.739571]  ? asm_sysvec_apic_timer_interrupt+0x12/0x20
      [  260.739574]  ? __schedule+0x5b7/0x6d6
      [  260.739578]  ? del_timer_sync+0x70/0x115
      [  260.739581]  ? schedule_timeout+0x211/0x356
      [  260.739585]  ? __next_timer_interrupt+0xf1/0xf1
      [  260.739588]  ? io_wq_check_workers+0x15/0x11f
      [  260.739592]  ? io_wq_manager+0x69/0xb1
      [  260.739596]  ? io_wq_create+0x262/0x262
      [  260.739600]  ? ret_from_fork+0x22/0x30
      [  260.739603] task:iou-wrk-517     state:S stack:    0 pid:  523 ppid:   439 flags:0x00004224
      [  260.739607] Call Trace:
      [  260.739609]  ? __schedule+0x5b7/0x6d6
      [  260.739614]  ? schedule+0x63/0xd5
      [  260.739617]  ? schedule_timeout+0x219/0x356
      [  260.739621]  ? __next_timer_interrupt+0xf1/0xf1
      [  260.739624]  ? task_thread.isra.0+0x148/0x3af
      [  260.739628]  ? task_thread_unbound+0xa/0xa
      [  260.739632]  ? task_thread_bound+0x7/0x7
      [  260.739636]  ? ret_from_fork+0x22/0x30
      [  260.739647] OOM killer enabled.
      [  260.739648] Restarting tasks ... done.
      [  260.740077] PM: suspend exit
      
      Play nice and ensure that any thread we create will call try_to_freeze()
      at an opportune time so that memory suspend can proceed. For the io-wq
      worker threads, mark them as PF_NOFREEZE. They could potentially be
      blocked for a long time.
      Reported-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca>
      Tested-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e4b4a13f
    • J
      io-wq: fix error path leak of buffered write hash map · dc7bbc9e
      Jens Axboe committed
      The 'err' path should include the hash put, we already grabbed a reference
      once we get that far.
      
      Fixes: e941894e ("io-wq: make buffered file write hashed work map per-ctx")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      dc7bbc9e
    • J
      io_uring: move cred assignment into io_issue_sqe() · 5730b27e
      Jens Axboe committed
      If we move it in there, then we no longer have to care about it in io-wq.
      This means we can drop the cred handling in io-wq, and we can drop the
      REQ_F_WORK_INITIALIZED flag and async init functions as that was the last
      user of it since we moved to the new workers. Then we can also drop
      io_wq_work->creds, and just hold the personality u16 in there instead.
      Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5730b27e
    • J
      io-wq: provide an io_wq_put_and_exit() helper · afcc4015
      Jens Axboe committed
      If we put the io-wq from io_uring, we really want it to exit. Provide
      a helper that does that for us. Couple that with not having the manager
      hold a reference to the 'wq' and the normal SQPOLL exit will tear down
      the io-wq context appropriately.
      
      On the io-wq side, our wq context is per task, so only the task itself
      is manipulating ->manager and hence it's safe to check and clear without
      any extra locking. We just need to ensure that the manager task stays
      around, in case it exits.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      afcc4015
    • J
      io-wq: fix double put of 'wq' in error path · 470ec4ed
      Jens Axboe committed
      We are already freeing the wq struct in both spots, so don't put it and
      get it freed twice.
      
      Reported-by: syzbot+7bf785eedca35ca05501@syzkaller.appspotmail.com
      Fixes: 4fb6ac32 ("io-wq: improve manager/worker handling over exec")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      470ec4ed
    • J
      io-wq: wait for manager exit on wq destroy · d364d9e5
      Jens Axboe committed
      The manager waits for the workers, hence the manager is always valid if
      workers are running. Now also have wq destroy wait for the manager on
      exit, so we know everything is gone.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d364d9e5
    • J
      io-wq: rename wq->done completion to wq->started · dbf99620
      Jens Axboe committed
      This is a leftover from a different use case; it's used to wait for
      the manager to start up. Rename it as such.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      dbf99620
    • J
      io-wq: don't ask for a new worker if we're exiting · 613eeb60
      Jens Axboe committed
      If we're in the process of shutting down the async context, then don't
      create new workers if we already have at least the fixed one.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      613eeb60
    • J
      io-wq: have manager wait for all workers to exit · fb3a1f6c
      Jens Axboe committed
      Instead of having to wait separately on workers and manager, just have
      the manager wait on the workers. We use an atomic_t for the reference
      here, as we need to start at 0 and allow increment from that. Since the
      number of workers is naturally capped by the allowed nr of processes,
      and that uses an int, there is no risk of overflow.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      fb3a1f6c
  14. 02 Mar, 2021 1 commit
  15. 26 Feb, 2021 3 commits
    • J
      io-wq: remove now unused IO_WQ_BIT_ERROR · d6ce7f67
      Jens Axboe committed
      This flag is now dead, remove it.
      
      Fixes: 1cbd9c2b ("io-wq: don't create any IO workers upfront")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d6ce7f67
    • J
      io-wq: improve manager/worker handling over exec · 4fb6ac32
      Jens Axboe committed
      exec will cancel any threads, including the ones that io-wq is using. This
      isn't a problem, in fact we'd prefer it to be that way since it means we
      know that any async work cancels naturally without having to handle it
      proactively.
      
      But it does mean that we need to setup a new manager, as the manager and
      workers are gone. Handle this at queue time, and cancel work if we fail.
      Since the manager can go away without us noticing, ensure that the manager
      itself holds a reference to the 'wq' as well. Rename io_wq_destroy() to
      io_wq_put() to reflect that.
      
      In the future we can now simplify exec cancelation handling, for now just
      leave it the same.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      4fb6ac32
    • J
      io-wq: make buffered file write hashed work map per-ctx · e941894e
      Jens Axboe committed
      Before the io-wq thread change, we maintained a hash work map and lock
      per-node per-ring. That wasn't ideal, as we really wanted it to be per
      ring. But now that we have per-task workers, the hash map ends up being
      just per-task. That'll work just fine for the normal case of having
      one task use a ring, but if you share the ring between tasks, then it's
      considerably worse than it was before.
      
      Make the hash map per ctx instead, which provides full per-ctx buffered
      write serialization on hashed writes.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e941894e
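The effect of making the hash map per ctx in e941894e can be sketched as several per-task queues pointing at one shared "busy bucket" bitmap. This is a simplified single-threaded model: `struct hash_model`, `struct task_wq_model`, `hashed_try_start()` and `hashed_finish()` are hypothetical names, and a real implementation would need locking or atomic bit operations plus a wait mechanism for the losers.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define HASH_BUCKETS (sizeof(unsigned long) * CHAR_BIT)

/* One map per ring (ctx), shared by every task that uses the ring. */
struct hash_model {
    unsigned long busy; /* bit N set: hashed work for bucket N is running */
};

/* Each task has its own queue, but all point at the ring's map. */
struct task_wq_model {
    struct hash_model *hash;
};

/* Returns true if hashed work for this bucket may start now. */
bool hashed_try_start(struct task_wq_model *wq, unsigned bucket)
{
    assert(bucket < HASH_BUCKETS);
    unsigned long bit = 1UL << bucket;

    if (wq->hash->busy & bit)
        return false; /* another writer to the same bucket is active */
    wq->hash->busy |= bit;
    return true;
}

/* Marks the bucket free again once the hashed work has completed. */
void hashed_finish(struct task_wq_model *wq, unsigned bucket)
{
    assert(bucket < HASH_BUCKETS);
    wq->hash->busy &= ~(1UL << bucket);
}
```

With two `task_wq_model` instances sharing one `hash_model`, a second task cannot start work on a bucket the first task holds, which is the cross-task serialization a per-task map would lose.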
  16. 24 Feb, 2021 3 commits
    • J
      io-wq: fix race around io_worker grabbing · eb2de941
      Jens Axboe committed
      There's a small window between lookup dropping the reference to the
      worker and calling wake_up_process() on the worker task, where the worker
      itself could have exited. We ensure that the worker struct itself is
      valid, but worker->task may very well be gone by the time we issue the
      wakeup.
      
      Fix the race by using a completion triggered by the reference going to
      zero, and having exit wait for that completion before proceeding.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      eb2de941
    • J
      io-wq: fix races around manager/worker creation and task exit · 8b3e78b5
      Jens Axboe committed
      These races have always been there, they are just more apparent now that
      we do early cancel of io-wq when the task exits.
      
      Ensure that the io-wq manager sets task state correctly to not miss
      wakeups for task creation. This is important if we get a wakeup after
      having marked ourselves as TASK_INTERRUPTIBLE. If we do end up creating
      workers, then we flip the state back to running, making the subsequent
      schedule() a no-op. Also increment the wq ref count before forking the
      thread, to avoid a use-after-free.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8b3e78b5
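The wakeup ordering that 8b3e78b5 relies on can be shown with a deterministic userspace simulation. It is only a model: `struct task_model`, `wake_up_model()`, `schedule_model()` and `manager_iteration()` are invented for this sketch, and the real scheduler is far more involved. The point it illustrates is that the task marks itself as sleeping before it checks for work, so a wakeup that arrives in between flips the state back to running and the following schedule call does not sleep.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum task_state_model { MODEL_RUNNING, MODEL_INTERRUPTIBLE };

struct task_model {
    enum task_state_model state;
    int times_slept;
};

/* A wakeup simply marks the task runnable again. */
void wake_up_model(struct task_model *t)
{
    t->state = MODEL_RUNNING;
}

/* Models schedule(): sleeps only if the task is still marked as sleeping. */
bool schedule_model(struct task_model *t)
{
    if (t->state == MODEL_RUNNING)
        return false; /* a wakeup already arrived: effectively a no-op */
    t->times_slept++;
    t->state = MODEL_RUNNING; /* assume something wakes it up later */
    return true;
}

/*
 * One manager loop iteration. 'wake_during_check' simulates a wakeup that
 * races in after the state was set but before schedule_model() is called.
 * Returns true if the manager actually went to sleep.
 */
bool manager_iteration(struct task_model *t, bool wake_during_check)
{
    assert(t != NULL);
    t->state = MODEL_INTERRUPTIBLE; /* set state BEFORE checking for work */
    if (wake_during_check)
        wake_up_model(t);
    return schedule_model(t);
}
```

If the state were set only after the check, the racing wakeup would be overwritten and the manager would sleep despite having work to do.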
    • J
      io-wq: remove nr_process accounting · 728f13e7
      Jens Axboe 提交于
      We're now just using fork like we would from userspace, so there's no
      need to try and impose extra restrictions or accounting on the user
      side of things. That's already being done for us. That also means we
      don't have to pass in the user_struct anymore, that's correctly inherited
      through ->creds on fork.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      728f13e7
  17. 22 Feb, 2021 4 commits