  1. 04 March 2021, 8 commits
    • io_uring: remove unused argument 'tsk' from io_req_caches_free() · 4010fec4
      Jens Axboe authored
      We prune the full cache regardless, so get rid of the dead argument.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      4010fec4
    • io_uring: destroy io-wq on exec · 8452d4a6
      Pavel Begunkov authored
      Destroy current's io-wq backend and tctx on __io_uring_task_cancel(),
      aka exec(). It looks like it's not strictly necessary, because it will
      be done at some point when the task dies and changes of creds/files/etc.
      are handled, but it's better to do it earlier to free the io-wq and not
      potentially hold on to the previous mm and other resources in the
      meantime.
      
      It's safe to do because we wait for all requests of the current task to
      complete, so no request will use the tctx afterwards. Note that
      io_uring_files_cancel() may leave some requests for later reaping, so it
      leaves the tctx intact; that's ok as the task is dying anyway.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8452d4a6
    • io_uring: warn on not destroyed io-wq · ef8eaa4e
      Pavel Begunkov authored
      Make sure we have killed the io-wq by the time a task is dead.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ef8eaa4e
    • io_uring: fix race condition in task_work add and clear · 1d5f360d
      Jens Axboe authored
      We clear the bit marking the ctx task_work as active after having run
      the queued work, but we really should be clearing it before. Otherwise
      we can hit a tiny race ala:
      
      CPU0					CPU1
      io_task_work_add()			tctx_task_work()
      					run_work
      	add_to_list
      	test_and_set_bit
      					clear_bit
      		already set
      
      and CPU0 will return thinking the task_work is queued, while in reality
      it's already being run. If we hit the condition after __tctx_task_work()
      found no more work, but before we've cleared the bit, then we'll end up
      thinking it's queued and will be run. In reality it is queued, but we
      didn't queue the ctx task_work to ensure that it gets run.
      
      Fixes: 7cbf1722 ("io_uring: provide FIFO ordering for task_work")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1d5f360d
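      The fix boils down to the ordering sketched below (a minimal,
      illustrative sketch, not the kernel's exact code: the struct, the bit
      name and the list helpers are placeholders):

      #include <linux/bitops.h>

      #define TCTX_TASK_WORK_ACTIVE	0

      struct hypothetical_tctx {
              unsigned long task_state;
              /* the queued work list lives here in the real thing */
      };

      static void drain_queued_work(struct hypothetical_tctx *tctx) { /* omitted */ }
      static void add_to_list(struct hypothetical_tctx *tctx) { /* omitted */ }

      /* consumer side: clear the bit BEFORE draining the list */
      static void tctx_task_work(struct hypothetical_tctx *tctx)
      {
              clear_bit(TCTX_TASK_WORK_ACTIVE, &tctx->task_state);
              drain_queued_work(tctx);
      }

      /* producer side: only the caller that flips the bit queues the task_work */
      static bool io_task_work_add(struct hypothetical_tctx *tctx)
      {
              add_to_list(tctx);
              return !test_and_set_bit(TCTX_TASK_WORK_ACTIVE, &tctx->task_state);
      }

      With the clear done first, a producer racing with the drain always sees
      the bit unset and re-queues the ctx task_work, so no queued item can be
      stranded.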
    • io-wq: provide an io_wq_put_and_exit() helper · afcc4015
      Jens Axboe authored
      If we put the io-wq from io_uring, we really want it to exit. Provide
      a helper that does that for us. Couple that with not having the manager
      hold a reference to the 'wq', and the normal SQPOLL exit will tear down
      the io-wq context appropriately.
      
      On the io-wq side, our wq context is per task, so only the task itself
      is manipulating ->manager and hence it's safe to check and clear it
      without any extra locking. We just need to ensure that the manager task
      stays around, in case it exits.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      afcc4015
    • io_uring: don't use complete_all() on SQPOLL thread exit · 8629397e
      Jens Axboe authored
      We want to reuse this completion, and a single complete should do just
      fine. Ensure that we park ourselves first if requested, as that is what
      led to the initial deadlock in this area. If we've got someone attempting
      to park us, then we can't proceed without having them finish first.
      
      Fixes: 37d1e2e3 ("io_uring: move SQPOLL thread io-wq forked worker")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8629397e
    • io_uring: run fallback on cancellation · ba50a036
      Pavel Begunkov authored
      io_uring_try_cancel_requests() matches not only current's requests, but
      also those of other exiting tasks, so we need to actively cancel them and
      not just wait, especially since the function can be called on flush during
      do_exit() -> exit_files().
      Even if it's not a problem for now, it's much nicer to know that the
      function tries to cancel everything it can.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ba50a036
    • io_uring: SQPOLL stop error handling fixes · e54945ae
      Jens Axboe authored
      If we fail to fork an SQPOLL worker, we can hit cancel, and hence an
      attempted thread stop, with the thread already being stopped. Ensure
      we check for that.
      
      Also guard thread stop fully by the sqd mutex, just like we do for
      park.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e54945ae
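      A minimal sketch of the "check under the same mutex used for park"
      idea (the struct io_sq_data fields and the IO_SQ_THREAD_SHOULD_STOP bit
      are assumed names for illustration, not quoted from the patch):

      static void io_sq_thread_stop(struct io_sq_data *sqd)
      {
              mutex_lock(&sqd->lock);
              /* the thread may already have been stopped on the error path */
              if (!test_and_set_bit(IO_SQ_THREAD_SHOULD_STOP, &sqd->state) &&
                  sqd->thread)
                      wake_up_process(sqd->thread);
              mutex_unlock(&sqd->lock);
      }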
  2. 26 February 2021, 4 commits
    • io_uring: fix SQPOLL thread handling over exec · 5f3f26f9
      Jens Axboe authored
      Just like the changes for io-wq, ensure that we re-fork the SQPOLL
      thread if the owner execs. Mark the ctx sq thread as sqo_exec if
      it dies, and mark the ring as needing a wakeup, which will force the
      task to enter the kernel. When it does, set up the new thread and
      proceed as usual.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5f3f26f9
    • io-wq: improve manager/worker handling over exec · 4fb6ac32
      Jens Axboe authored
      exec will cancel any threads, including the ones that io-wq is using. This
      isn't a problem; in fact we'd prefer it to be that way, since it means we
      know that any async work cancels naturally without having to handle it
      proactively.
      
      But it does mean that we need to set up a new manager, as the manager and
      workers are gone. Handle this at queue time, and cancel work if we fail.
      Since the manager can go away without us noticing, ensure that the manager
      itself holds a reference to the 'wq' as well. Rename io_wq_destroy() to
      io_wq_put() to reflect that.
      
      In the future we can now simplify exec cancellation handling; for now just
      leave it the same.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      4fb6ac32
    • io_uring: ensure SQPOLL startup is triggered before error shutdown · eb85890b
      Jens Axboe authored
      syzbot reports the following hang:
      
      INFO: task syz-executor.0:12538 can't die for more than 143 seconds.
      task:syz-executor.0  state:D stack:28352 pid:12538 ppid:  8423 flags:0x00004004
      Call Trace:
       context_switch kernel/sched/core.c:4324 [inline]
       __schedule+0x90c/0x21a0 kernel/sched/core.c:5075
       schedule+0xcf/0x270 kernel/sched/core.c:5154
       schedule_timeout+0x1db/0x250 kernel/time/timer.c:1868
       do_wait_for_common kernel/sched/completion.c:85 [inline]
       __wait_for_common kernel/sched/completion.c:106 [inline]
       wait_for_common kernel/sched/completion.c:117 [inline]
       wait_for_completion+0x168/0x270 kernel/sched/completion.c:138
       io_sq_thread_finish+0x96/0x580 fs/io_uring.c:7152
       io_sq_offload_create fs/io_uring.c:7929 [inline]
       io_uring_create fs/io_uring.c:9465 [inline]
       io_uring_setup+0x1fb2/0x2c20 fs/io_uring.c:9550
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      which is due to exiting after the SQPOLL thread has been created, but
      hasn't been started yet. Ensure that we always complete the startup
      side when waiting for it to exit.
      
      Reported-by: syzbot+c927c937cba8ef66dd4a@syzkaller.appspotmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      eb85890b
    • io-wq: make buffered file write hashed work map per-ctx · e941894e
      Jens Axboe authored
      Before the io-wq thread change, we maintained a hash work map and lock
      per-node per-ring. That wasn't ideal, as we really wanted it to be per
      ring. But now that we have per-task workers, the hash map ends up being
      just per-task. That'll work just fine for the normal case of having
      one task use a ring, but if you share the ring between tasks, then it's
      considerably worse than it was before.
      
      Make the hash map per ctx instead, which provides full per-ctx buffered
      write serialization on hashed writes.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e941894e
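      A sketch of the data-structure shape this implies: a small refcounted
      hash state owned by the ring (ctx) and shared with the io-wq instances
      its tasks use. Field and helper names here are illustrative:

      #include <linux/refcount.h>
      #include <linux/wait.h>
      #include <linux/slab.h>

      struct io_wq_hash {
              refcount_t              refs;   /* shared by the ctx and each io-wq */
              unsigned long           map;    /* one bit per hash bucket in flight */
              struct wait_queue_head  wait;   /* workers wait here for a busy bucket */
      };

      static inline void io_wq_put_hash(struct io_wq_hash *hash)
      {
              if (refcount_dec_and_test(&hash->refs))
                      kfree(hash);
      }

      Because the serialization state belongs to the ring rather than to any
      one task's workers, hashed buffered writes stay serialized even when
      several tasks share the ring.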
  3. 25 February 2021, 1 commit
    • Revert "io_uring: wait potential ->release() on resurrect" · cb5e1b81
      Jens Axboe authored
      This reverts commit 88f171ab.
      
      I ran into a case where the ref resurrect now spins, so revert
      this change for now until we can further investigate why it's
      broken. The bug seems to indicate spinning on the lock itself,
      likely there's some ABBA deadlock involved:
      
      [<0>] __percpu_ref_switch_mode+0x45/0x180
      [<0>] percpu_ref_resurrect+0x46/0x70
      [<0>] io_refs_resurrect+0x25/0xa0
      [<0>] __io_uring_register+0x135/0x10c0
      [<0>] __x64_sys_io_uring_register+0xc2/0x1a0
      [<0>] do_syscall_64+0x42/0x110
      [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      cb5e1b81
  4. 24 February 2021, 7 commits
  5. 22 February 2021, 8 commits
    • io_uring: clear request count when freeing caches · 8e5c66c4
      Pavel Begunkov authored
      BUG: KASAN: double-free or invalid-free in io_req_caches_free.constprop.0+0x3ce/0x530 fs/io_uring.c:8709
      
      Workqueue: events_unbound io_ring_exit_work
      Call Trace:
       [...]
       __cache_free mm/slab.c:3424 [inline]
       kmem_cache_free_bulk+0x4b/0x1b0 mm/slab.c:3744
       io_req_caches_free.constprop.0+0x3ce/0x530 fs/io_uring.c:8709
       io_ring_ctx_free fs/io_uring.c:8764 [inline]
       io_ring_exit_work+0x518/0x6b0 fs/io_uring.c:8846
       process_one_work+0x98d/0x1600 kernel/workqueue.c:2275
       worker_thread+0x64c/0x1120 kernel/workqueue.c:2421
       kthread+0x3b1/0x4a0 kernel/kthread.c:292
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
      
      Freed by task 11900:
       [...]
       kmem_cache_free_bulk+0x4b/0x1b0 mm/slab.c:3744
       io_req_caches_free.constprop.0+0x3ce/0x530 fs/io_uring.c:8709
       io_uring_flush+0x483/0x6e0 fs/io_uring.c:9237
       filp_close+0xb4/0x170 fs/open.c:1286
       close_files fs/file.c:403 [inline]
       put_files_struct fs/file.c:418 [inline]
       put_files_struct+0x1d0/0x350 fs/file.c:415
       exit_files+0x7e/0xa0 fs/file.c:435
       do_exit+0xc27/0x2ae0 kernel/exit.c:820
       do_group_exit+0x125/0x310 kernel/exit.c:922
       [...]
      
      io_req_caches_free() doesn't zero submit_state->free_reqs, so io_uring
      considers the just-freed requests to be good and sound and will reuse or
      double-free them. Zero the counter.
      
      Reported-by: syzbot+30b4936dcdb3aafa4fb4@syzkaller.appspotmail.com
      Fixes: 41be53e9 ("io_uring: kill cached requests from exiting task closing the ring")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8e5c66c4
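      A minimal sketch of the fix's shape (the submit_state layout is
      simplified and the field names are approximations; the point is
      resetting the count right after the bulk free):

      static void io_req_caches_free(struct io_ring_ctx *ctx)
      {
              struct io_submit_state *state = &ctx->submit_state;

              if (state->free_reqs) {
                      /* req_cachep: the slab cache the requests came from */
                      kmem_cache_free_bulk(req_cachep, state->free_reqs,
                                           state->reqs);
                      state->free_reqs = 0;   /* forget the now-freed entries */
              }
      }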
    • io_uring: remove io_identity · 4379bf8b
      Jens Axboe authored
      We are no longer grabbing state, so there is no need to maintain an IO
      identity that we COW if there are changes.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      4379bf8b
    • io_uring: remove any grabbing of context · 44526bed
      Jens Axboe authored
      The async workers are siblings of the task itself, so by definition we
      have all the state that we need. Remove all of the state grabbing that
      we have, and the request flagging of what they need.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      44526bed
    • io-wq: fork worker threads from original task · 3bfe6106
      Jens Axboe authored
      Instead of using regular kthread kernel threads, create kernel threads
      that are like a real thread that the task would create. This ensures that
      we get all the context that we need, without having to carry that state
      around. This greatly reduces the code complexity, and the risk of missing
      state for a given request type.
      
      With the move away from kthread, we can also drop everything related to
      assigning state to the new threads.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3bfe6106
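      The shape of this is roughly the sketch below, using create_io_thread(),
      the helper mainline provides for creating such task-like threads (whether
      this exact helper was already used in this particular commit isn't shown
      by the log; the worker struct and function names are assumed for
      illustration):

      #include <linux/sched/task.h>
      #include <linux/err.h>

      struct hypothetical_worker {
              struct task_struct *task;
      };

      static int io_wq_worker_fn(void *data)
      {
              /* worker main loop: pull and run work items (omitted here) */
              return 0;
      }

      static bool create_io_worker(struct hypothetical_worker *worker)
      {
              struct task_struct *tsk;

              /* a thread of the submitting task: shares mm, files, creds, ... */
              tsk = create_io_thread(io_wq_worker_fn, worker, NUMA_NO_NODE);
              if (IS_ERR(tsk))
                      return false;

              worker->task = tsk;
              wake_up_new_task(tsk);          /* start it running */
              return true;
      }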
    • io_uring: tie async worker side to the task context · 5aa75ed5
      Jens Axboe authored
      Move it outside of the io_ring_ctx, and tie it to the io_uring task
      context.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5aa75ed5
    • io_uring: disable io-wq attaching · d25e3a3d
      Jens Axboe authored
      We are moving towards making the io_wq per ring per task, so we can't
      really share it between rings. Which is fine, since we've now dropped
      some of that fat from it.
      
      Retain compatibility with how attaching works, so that any attempt to
      attach to an fd that doesn't exist, or isn't an io_uring fd, will fail
      like it did before.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d25e3a3d
    • io_uring: remove the need for relying on an io-wq fallback worker · 7c25c0d1
      Jens Axboe authored
      We hit this case when the task is exiting, and we need somewhere to
      do background cleanup of requests. Instead of relying on the io-wq
      task manager to do this work for us, just stuff it somewhere where
      we can safely run it ourselves directly.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7c25c0d1
    • io_uring: run task_work on io_uring_register() · b6c23dd5
      Pavel Begunkov authored
      Run task_work before io_uring_register(); that might make the first
      quiesce round much nicer. We generally do that for any syscall invocation
      to avoid spurious -EINTR/-ERESTARTSYS from task_work that we generate.
      This patch brings io_uring_register() in line with the other two io_uring
      syscalls.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b6c23dd5
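      A minimal sketch of what "run task_work first" looks like at the
      syscall entry (the wrapper and its locking are simplified; the names
      are illustrative):

      static long do_uring_register(struct io_ring_ctx *ctx, unsigned int opcode,
                                    void __user *arg, unsigned int nr_args)
      {
              long ret;

              io_run_task_work();     /* flush our own pending task_work first */

              mutex_lock(&ctx->uring_lock);
              ret = __io_uring_register(ctx, opcode, arg, nr_args);
              mutex_unlock(&ctx->uring_lock);
              return ret;
      }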
  6. 21 February 2021, 7 commits
    • io_uring: fix leaving invalid req->flags · ebf4a5db
      Pavel Begunkov authored
      sqe->flags are a subset of the req flags, so if they are copied without
      masking they may spill into kernel-internal flags and wreak havoc, e.g.
      by setting REQ_F_INFLIGHT.
      
      Fixes: 5be9ad1e ("io_uring: optimise io_init_req() flags setting")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ebf4a5db
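      A sketch of the masking this implies when the SQE is read (helper name
      approximate; only the userspace-visible IOSQE_* bits may ever end up in
      req->flags):

      #define SQE_VALID_FLAGS (IOSQE_FIXED_FILE | IOSQE_IO_DRAIN |   \
                               IOSQE_IO_LINK | IOSQE_IO_HARDLINK |    \
                               IOSQE_ASYNC | IOSQE_BUFFER_SELECT)

      static int io_init_req_flags(struct io_kiocb *req,
                                   const struct io_uring_sqe *sqe)
      {
              unsigned int sqe_flags = READ_ONCE(sqe->flags);

              /* reject anything outside the userspace-visible bits */
              if (sqe_flags & ~SQE_VALID_FLAGS)
                      return -EINVAL;
              req->flags = sqe_flags; /* kernel-only REQ_F_* bits can never leak in */
              return 0;
      }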
    • io_uring: wait potential ->release() on resurrect · 88f171ab
      Pavel Begunkov authored
      There is a short window where the percpu_refs are already zero, but
      we try to do resurrect(). Play nicer and wait for ->release() to happen
      in this case, then proceed as if everything is ok. One downside for ctx
      refs is that we can ignore signal_pending() on a rare occasion, but
      someone else should check for it later if needed.
      
      Cc: <stable@vger.kernel.org> # 5.5+
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      88f171ab
    • io_uring: keep generic rsrc infra generic · f2303b1f
      Pavel Begunkov authored
      io_rsrc_ref_quiesce() is a generic resource function, though it is now
      wired to allocate and initialise ref nodes with file-specific
      callbacks/etc. Keep it sane by passing in as parameters everything we
      need for initialisation; otherwise it will hurt us badly one day.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f2303b1f
    • io_uring: zero ref_node after killing it · e6cb007c
      Pavel Begunkov authored
      After a rsrc/files reference node's refs are killed, it must never be
      used. And that's how it works: it either assigns a new node or kills the
      whole data table.
      
      Let's explicitly NULL it. That shouldn't be necessary, but if something
      were to go wrong I'd rather catch a NULL dereference than a use of a
      dangling pointer.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e6cb007c
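      A tiny sketch of the defensive NULLing being described (the structure
      and field names are simplified assumptions):

      static void io_rsrc_node_kill(struct io_rsrc_data *data)
      {
              struct io_rsrc_node *node = data->node;

              data->node = NULL;      /* a killed node must never be handed out again */
              if (node)
                      percpu_ref_kill(&node->refs);
      }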
    • io_uring: make the !CONFIG_NET helpers a bit more robust · 99a10081
      Jens Axboe authored
      With the prep and prep async split, we now have potentially 3 helpers
      that need to be defined for !CONFIG_NET. Add some helpers to do just
      that.
      
      Fixes the following compile error on !CONFIG_NET:
      
      fs/io_uring.c:6171:10: error: implicit declaration of function
      'io_sendmsg_prep_async'; did you mean 'io_req_prep_async'?
      [-Werror=implicit-function-declaration]
         return io_sendmsg_prep_async(req);
                   ^~~~~~~~~~~~~~~~~~~~~
      	     io_req_prep_async
      
      Fixes: 93642ef8 ("io_uring: split sqe-prep and async setup")
      Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      99a10081
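      The pattern this refers to looks roughly like the sketch below: when
      CONFIG_NET is not set, each net opcode gets stub prep/prep-async helpers
      that simply reject the request, so the opcode table still has something
      to point at (the macro and helper names here are illustrative, not the
      exact ones the patch adds):

      #if !defined(CONFIG_NET)
      #define IO_NETOP_PREP_STUBS(op)                                        \
      static int io_##op##_prep(struct io_kiocb *req,                        \
                                const struct io_uring_sqe *sqe)              \
      {                                                                      \
              return -EOPNOTSUPP;                                            \
      }                                                                      \
      static int io_##op##_prep_async(struct io_kiocb *req)                  \
      {                                                                      \
              return -EOPNOTSUPP;                                            \
      }

      IO_NETOP_PREP_STUBS(sendmsg);
      IO_NETOP_PREP_STUBS(recvmsg);
      #endif /* !CONFIG_NET */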
    • io_uring: don't hold uring_lock when calling io_run_task_work* · 8bad28d8
      Hao Xu authored
      Abaci reported the below issue:
      [  141.400455] hrtimer: interrupt took 205853 ns
      [  189.869316] process 'usr/local/ilogtail/ilogtail_0.16.26' started with executable stack
      [  250.188042]
      [  250.188327] ============================================
      [  250.189015] WARNING: possible recursive locking detected
      [  250.189732] 5.11.0-rc4 #1 Not tainted
      [  250.190267] --------------------------------------------
      [  250.190917] a.out/7363 is trying to acquire lock:
      [  250.191506] ffff888114dbcbe8 (&ctx->uring_lock){+.+.}-{3:3}, at: __io_req_task_submit+0x29/0xa0
      [  250.192599]
      [  250.192599] but task is already holding lock:
      [  250.193309] ffff888114dbfbe8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_register+0xad/0x210
      [  250.194426]
      [  250.194426] other info that might help us debug this:
      [  250.195238]  Possible unsafe locking scenario:
      [  250.195238]
      [  250.196019]        CPU0
      [  250.196411]        ----
      [  250.196803]   lock(&ctx->uring_lock);
      [  250.197420]   lock(&ctx->uring_lock);
      [  250.197966]
      [  250.197966]  *** DEADLOCK ***
      [  250.197966]
      [  250.198837]  May be due to missing lock nesting notation
      [  250.198837]
      [  250.199780] 1 lock held by a.out/7363:
      [  250.200373]  #0: ffff888114dbfbe8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_register+0xad/0x210
      [  250.201645]
      [  250.201645] stack backtrace:
      [  250.202298] CPU: 0 PID: 7363 Comm: a.out Not tainted 5.11.0-rc4 #1
      [  250.203144] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [  250.203887] Call Trace:
      [  250.204302]  dump_stack+0xac/0xe3
      [  250.204804]  __lock_acquire+0xab6/0x13a0
      [  250.205392]  lock_acquire+0x2c3/0x390
      [  250.205928]  ? __io_req_task_submit+0x29/0xa0
      [  250.206541]  __mutex_lock+0xae/0x9f0
      [  250.207071]  ? __io_req_task_submit+0x29/0xa0
      [  250.207745]  ? 0xffffffffa0006083
      [  250.208248]  ? __io_req_task_submit+0x29/0xa0
      [  250.208845]  ? __io_req_task_submit+0x29/0xa0
      [  250.209452]  ? __io_req_task_submit+0x5/0xa0
      [  250.210083]  __io_req_task_submit+0x29/0xa0
      [  250.210687]  io_async_task_func+0x23d/0x4c0
      [  250.211278]  task_work_run+0x89/0xd0
      [  250.211884]  io_run_task_work_sig+0x50/0xc0
      [  250.212464]  io_sqe_files_unregister+0xb2/0x1f0
      [  250.213109]  __io_uring_register+0x115a/0x1750
      [  250.213718]  ? __x64_sys_io_uring_register+0xad/0x210
      [  250.214395]  ? __fget_files+0x15a/0x260
      [  250.214956]  __x64_sys_io_uring_register+0xbe/0x210
      [  250.215620]  ? trace_hardirqs_on+0x46/0x110
      [  250.216205]  do_syscall_64+0x2d/0x40
      [  250.216731]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [  250.217455] RIP: 0033:0x7f0fa17e5239
      [  250.218034] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05  3d 01 f0 ff ff 73 01 c3 48 8b 0d 27 ec 2c 00 f7 d8 64 89 01 48
      [  250.220343] RSP: 002b:00007f0fa1eeac48 EFLAGS: 00000246 ORIG_RAX: 00000000000001ab
      [  250.221360] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0fa17e5239
      [  250.222272] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000008
      [  250.223185] RBP: 00007f0fa1eeae20 R08: 0000000000000000 R09: 0000000000000000
      [  250.224091] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
      [  250.224999] R13: 0000000000021000 R14: 0000000000000000 R15: 00007f0fa1eeb700
      
      This is caused by calling io_run_task_work_sig() to do work under
      uring_lock while the caller, io_sqe_files_unregister(), already holds
      uring_lock.
      
      To fix this issue, briefly drop uring_lock when calling
      io_run_task_work_sig(). There are a few things to take care of:
      
      - hold uring_lock in io_ring_ctx_free() around io_sqe_files_unregister(),
        for consistency of lock/unlock
      - add the new fixed rsrc ref node before dropping uring_lock, since it's
        not safe to do io_uring_enter-->percpu_ref_get() with a dying one
      - check if rsrc_data->refs is dying, to avoid a parallel
        io_sqe_files_unregister()
      Reported-by: Abaci <abaci@linux.alibaba.com>
      Fixes: 1ffc5422 ("io_uring: fix io_sqe_files_unregister() hangs")
      Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
      [axboe: fixes from Pavel folded in]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8bad28d8
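      The core of the fix is the classic "drop the lock around task_work"
      dance; a minimal sketch (the rsrc-node handling from the second bullet
      above is omitted, and names are approximate):

      static int io_rsrc_quiesce_wait(struct io_ring_ctx *ctx)
      {
              int ret;

              /* caller (the io_uring_register() path) holds uring_lock */
              mutex_unlock(&ctx->uring_lock);
              ret = io_run_task_work_sig();   /* may run work that needs uring_lock */
              mutex_lock(&ctx->uring_lock);
              return ret;
      }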
    • io_uring: fail io-wq submission from a task_work · a3df7698
      Pavel Begunkov authored
      In case of failure, io_wq_submit_work() needs to post a CQE and so
      potentially take uring_lock. The safest way to deal with it is to do
      that from under task_work, where we can safely take the lock.
      
      Also, as io_iopoll_check() holds the lock tight and releases it
      reluctantly, it will play nicer in the future with notifying an
      iopolling task about new such pending failed requests.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a3df7698
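      A sketch of the shape of that: instead of completing the failed request
      directly from the worker, queue a task_work callback so the CQE is
      posted from task context (the helper and field names here are
      assumptions for illustration):

      #include <linux/task_work.h>

      static void io_req_task_cancel(struct callback_head *cb)
      {
              struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work);

              /* runs in the submitting task: safe to take uring_lock and post a CQE */
              io_req_complete_failed(req, -ECANCELED);
      }

      static void io_wq_fail_req(struct io_kiocb *req)
      {
              init_task_work(&req->task_work, io_req_task_cancel);
              /* error handling (task already exiting) omitted in this sketch */
              task_work_add(req->task, &req->task_work, TWA_SIGNAL);
      }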
  7. 19 February 2021, 5 commits
    • io_uring: don't take uring_lock during iowq cancel · 792bb6eb
      Pavel Begunkov authored
      [   97.866748] a.out/2890 is trying to acquire lock:
      [   97.867829] ffff8881046763e8 (&ctx->uring_lock){+.+.}-{3:3}, at:
      io_wq_submit_work+0x155/0x240
      [   97.869735]
      [   97.869735] but task is already holding lock:
      [   97.871033] ffff88810dfe0be8 (&ctx->uring_lock){+.+.}-{3:3}, at:
      __x64_sys_io_uring_enter+0x3f0/0x5b0
      [   97.873074]
      [   97.873074] other info that might help us debug this:
      [   97.874520]  Possible unsafe locking scenario:
      [   97.874520]
      [   97.875845]        CPU0
      [   97.876440]        ----
      [   97.877048]   lock(&ctx->uring_lock);
      [   97.877961]   lock(&ctx->uring_lock);
      [   97.878881]
      [   97.878881]  *** DEADLOCK ***
      [   97.878881]
      [   97.880341]  May be due to missing lock nesting notation
      [   97.880341]
      [   97.881952] 1 lock held by a.out/2890:
      [   97.882873]  #0: ffff88810dfe0be8 (&ctx->uring_lock){+.+.}-{3:3}, at:
      __x64_sys_io_uring_enter+0x3f0/0x5b0
      [   97.885108]
      [   97.885108] stack backtrace:
      [   97.890457] Call Trace:
      [   97.891121]  dump_stack+0xac/0xe3
      [   97.891972]  __lock_acquire+0xab6/0x13a0
      [   97.892940]  lock_acquire+0x2c3/0x390
      [   97.894894]  __mutex_lock+0xae/0x9f0
      [   97.901101]  io_wq_submit_work+0x155/0x240
      [   97.902112]  io_wq_cancel_cb+0x162/0x490
      [   97.904126]  io_async_find_and_cancel+0x3b/0x140
      [   97.905247]  io_issue_sqe+0x86d/0x13e0
      [   97.909122]  __io_queue_sqe+0x10b/0x550
      [   97.913971]  io_queue_sqe+0x235/0x470
      [   97.914894]  io_submit_sqes+0xcce/0xf10
      [   97.917872]  __x64_sys_io_uring_enter+0x3fb/0x5b0
      [   97.921424]  do_syscall_64+0x2d/0x40
      [   97.922329]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      While holding uring_lock, e.g. from inline execution, an async cancel
      request may attempt cancellations through io_wq_submit_work, which may
      in turn try to grab the lock. Delay it to task_work, so we do it from a
      clean context and don't have to worry about locking.
      
      Cc: <stable@vger.kernel.org> # 5.5+
      Fixes: c07e6719 ("io_uring: hold uring_lock while completing failed polled io in io_wq_submit_work()")
      Reported-by: Abaci <abaci@linux.alibaba.com>
      Reported-by: Hao Xu <haoxu@linux.alibaba.com>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      792bb6eb
    • io_uring: fail links more in io_submit_sqe() · de59bc10
      Pavel Begunkov authored
      Instead of marking a link with REQ_F_FAIL_LINK on an error and delaying
      its failing to the caller, do it eagerly right after getting an error in
      io_submit_sqe(). This renders the FAIL_LINK checks in
      io_queue_link_head() useless, and we can skip them.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      de59bc10
    • io_uring: don't do async setup for links' heads · 1ee43ba8
      Pavel Begunkov authored
      Now that we can do async setup without holding an SQE, we can skip doing
      io_req_defer_prep() for link heads; the head will be attempted inline
      and follows all the rules of non-linked requests.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1ee43ba8
    • io_uring: do io_*_prep() early in io_submit_sqe() · be7053b7
      Pavel Begunkov authored
      Now that preparations are split from async setup, we can do the first one
      pretty early, without spilling it across multiple call sites. And once
      it's done the SQE is not needed anymore, so we can save on passing it
      deeply into the submission stack.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      be7053b7
    • io_uring: split sqe-prep and async setup · 93642ef8
      Pavel Begunkov authored
      There are two kinds of opcode-specific preparation we do. The first is
      just initialising the req with what is always needed for an opcode and
      reading all the non-generic SQE fields. The second is copying some of
      the state, like the iovec, in preparation for punting a request somewhere
      async, e.g. to io-wq or for draining. For requests that have tried inline
      execution but still need to be punted, the second prep type is done by
      the opcode handler itself.
      
      Currently we don't explicitly split those preparation steps, but combine
      both of them into io_*_prep(), altering the behaviour by allocating
      ->async_data. That's pretty messy, hard to follow, and also gets in the
      way of some optimisations.
      
      Split the steps: leave the first type where it is now, and put the
      second into a new io_req_prep_async() helper. It may make us do the
      opcode switch twice, but it's worth it.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      93642ef8
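      A minimal sketch of what such an io_req_prep_async() helper looks like
      (the opcode list is abbreviated; the per-opcode *_prep_async() names
      follow the compile error quoted in the "!CONFIG_NET helpers" entry
      above, and the rest is illustrative):

      static int io_req_prep_async(struct io_kiocb *req)
      {
              switch (req->opcode) {
              case IORING_OP_SENDMSG:
                      return io_sendmsg_prep_async(req);
              case IORING_OP_RECVMSG:
                      return io_recvmsg_prep_async(req);
              /* ... other opcodes that need extra async state ... */
              default:
                      return 0;       /* nothing extra to copy for this opcode */
              }
      }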