1. 13 Feb 2021, 3 commits
  2. 12 Feb 2021, 11 commits
  3. 11 Feb 2021, 3 commits
    • io_uring: assign file_slot prior to calling io_sqe_file_register() · e68a3ff8
      Jens Axboe committed
      We use the assigned slot in io_sqe_file_register(), and a previous
      patch moved the assignment to after we have called it. This isn't
      super pretty, and will get cleaned up in the future. For now, fix
      the regression by restoring the previous assignment/clear of the
      file_slot.
      
      Fixes: ea64ec02 ("io_uring: deduplicate file table slot calculation")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: remove redundant initialization of variable ret · 4a245479
      Colin Ian King committed
      The variable ret is being initialized with a value that is never read
      and it is being updated later with a new value.  The initialization is
      redundant and can be removed.
      
      Addresses-Coverity: ("Unused value")
      Fixes: b63534c4 ("io_uring: re-issue block requests that failed because of resources")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: unpark SQPOLL thread for cancelation · 34343786
      Pavel Begunkov committed
      We park the SQPOLL task before going into io_uring_cancel_files(), so the
      task won't run task_works, including those that might be important for
      the cancellation passes. In this case it's io_poll_remove_one(), which
      frees requests via io_put_req_deferred().
      
      Unpark it while waiting; that's ok, as we disable submissions
      beforehand, so no new requests will be generated.
      
      INFO: task syz-executor893:8493 blocked for more than 143 seconds.
      Call Trace:
       context_switch kernel/sched/core.c:4327 [inline]
       __schedule+0x90c/0x21a0 kernel/sched/core.c:5078
       schedule+0xcf/0x270 kernel/sched/core.c:5157
       io_uring_cancel_files fs/io_uring.c:8912 [inline]
       io_uring_cancel_task_requests+0xe70/0x11a0 fs/io_uring.c:8979
       __io_uring_files_cancel+0x110/0x1b0 fs/io_uring.c:9067
       io_uring_files_cancel include/linux/io_uring.h:51 [inline]
       do_exit+0x2fe/0x2ae0 kernel/exit.c:780
       do_group_exit+0x125/0x310 kernel/exit.c:922
       __do_sys_exit_group kernel/exit.c:933 [inline]
       __se_sys_exit_group kernel/exit.c:931 [inline]
       __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:931
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Cc: stable@vger.kernel.org # 5.5+
      Reported-by: syzbot+695b03d82fa8e4901b06@syzkaller.appspotmail.com
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
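      A hedged sketch of the shape of the fix, not the verbatim patch: the cancellation wait loop unparks the SQPOLL task around schedule() so it can run its task_work, then parks it again before the next pass. The helper names io_sq_thread_park()/io_sq_thread_unpark() follow io_uring; the surrounding loop, the inflight_wait field, and io_uring_has_inflight() are assumptions for illustration.

```c
/* Sketch only: unpark SQPOLL while we sleep so it can run task_work. */
static void io_uring_cancel_wait(struct io_ring_ctx *ctx,
				 struct io_sq_data *sqd)
{
	DEFINE_WAIT(wait);

	while (io_uring_has_inflight(ctx)) {	/* hypothetical predicate */
		prepare_to_wait(&ctx->inflight_wait, &wait,
				TASK_UNINTERRUPTIBLE);
		/* let the parked SQPOLL task make progress on its task_work */
		if (sqd)
			io_sq_thread_unpark(sqd);
		schedule();
		/* re-park it before the next cancellation pass */
		if (sqd)
			io_sq_thread_park(sqd);
	}
	finish_wait(&ctx->inflight_wait, &wait);
}
```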
  4. 10 Feb 2021, 21 commits
    • io_uring: place ring SQ/CQ arrays under memcg memory limits · 26bfa89e
      Jens Axboe committed
      Instead of imposing rlimit memlock limits for the rings themselves,
      ensure that we account them properly under memcg with __GFP_ACCOUNT.
      We retain rlimit memlock for registered buffers; this is just for the
      ring arrays themselves.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
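      For reference, memcg accounting is requested by allocating with __GFP_ACCOUNT (or the GFP_KERNEL_ACCOUNT shorthand). A minimal sketch of a ring-array allocator using it; the io_mem_alloc() name mirrors io_uring's helper, the exact flag combination is an assumption.

```c
#include <linux/gfp.h>
#include <linux/mm.h>

/* Sketch: the ring pages get charged to the allocating task's memcg
 * instead of being limited via RLIMIT_MEMLOCK. */
static void *io_mem_alloc(size_t size)
{
	gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP;

	return (void *) __get_free_pages(gfp, get_order(size));
}
```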
    • io_uring: enable kmemcg account for io_uring requests · 91f245d5
      Jens Axboe committed
      This puts io_uring requests under memory cgroup accounting and limits.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
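      The standard way to do this for a slab cache is the SLAB_ACCOUNT flag at cache creation time; a hedged sketch of what that looks like for the io_kiocb cache (the real code may use the KMEM_CACHE() macro and different companion flags).

```c
#include <linux/slab.h>

static struct kmem_cache *req_cachep;

static int __init io_uring_setup_cache(void)
{
	/* SLAB_ACCOUNT: charge each io_kiocb to the allocating task's memcg */
	req_cachep = kmem_cache_create("io_kiocb", sizeof(struct io_kiocb), 0,
				       SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT,
				       NULL);
	return req_cachep ? 0 : -ENOMEM;
}
```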
    • io_uring: enable req cache for IRQ driven IO · c7dae4ba
      Jens Axboe committed
      This is the last class of requests that cannot utilize the req alloc
      cache. Add a per-ctx req cache that is protected by the completion_lock,
      and refill our submit side cache when it gets over our batch count.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
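      A hedged sketch of the scheme described above: the completion side (possibly IRQ context) parks freed requests on a per-ctx list under completion_lock, and the submit side splices that list into its own cache once it exceeds a batch threshold. Field and helper names below are illustrative assumptions, not the exact kernel code.

```c
#define IO_REQ_CACHE_BATCH	32	/* illustrative threshold */

/* completion side, may run from IRQ context */
static void io_req_cache_free_irq(struct io_ring_ctx *ctx,
				  struct io_kiocb *req)
{
	unsigned long flags;

	spin_lock_irqsave(&ctx->completion_lock, flags);
	list_add(&req->compl.list, &ctx->locked_free_list);
	ctx->locked_free_nr++;
	spin_unlock_irqrestore(&ctx->completion_lock, flags);
}

/* submit side, under uring_lock: refill the submit cache in one go */
static void io_refill_req_cache(struct io_ring_ctx *ctx,
				struct io_submit_state *state)
{
	if (ctx->locked_free_nr <= IO_REQ_CACHE_BATCH)
		return;

	spin_lock_irq(&ctx->completion_lock);
	list_splice_init(&ctx->locked_free_list, &state->free_list);
	state->free_nr += ctx->locked_free_nr;
	ctx->locked_free_nr = 0;
	spin_unlock_irq(&ctx->completion_lock);
}
```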
    • io_uring: fix possible deadlock in io_uring_poll · ed670c3f
      Hao Xu committed
      Abaci reported the following issue:
      
      [   30.615891] ======================================================
      [   30.616648] WARNING: possible circular locking dependency detected
      [   30.617423] 5.11.0-rc3-next-20210115 #1 Not tainted
      [   30.618035] ------------------------------------------------------
      [   30.618914] a.out/1128 is trying to acquire lock:
      [   30.619520] ffff88810b063868 (&ep->mtx){+.+.}-{3:3}, at: __ep_eventpoll_poll+0x9f/0x220
      [   30.620505]
      [   30.620505] but task is already holding lock:
      [   30.621218] ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0
      [   30.622349]
      [   30.622349] which lock already depends on the new lock.
      [   30.622349]
      [   30.623289]
      [   30.623289] the existing dependency chain (in reverse order) is:
      [   30.624243]
      [   30.624243] -> #1 (&ctx->uring_lock){+.+.}-{3:3}:
      [   30.625263]        lock_acquire+0x2c7/0x390
      [   30.625868]        __mutex_lock+0xae/0x9f0
      [   30.626451]        io_cqring_overflow_flush.part.95+0x6d/0x70
      [   30.627278]        io_uring_poll+0xcb/0xd0
      [   30.627890]        ep_item_poll.isra.14+0x4e/0x90
      [   30.628531]        do_epoll_ctl+0xb7e/0x1120
      [   30.629122]        __x64_sys_epoll_ctl+0x70/0xb0
      [   30.629770]        do_syscall_64+0x2d/0x40
      [   30.630332]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [   30.631187]
      [   30.631187] -> #0 (&ep->mtx){+.+.}-{3:3}:
      [   30.631985]        check_prevs_add+0x226/0xb00
      [   30.632584]        __lock_acquire+0x1237/0x13a0
      [   30.633207]        lock_acquire+0x2c7/0x390
      [   30.633740]        __mutex_lock+0xae/0x9f0
      [   30.634258]        __ep_eventpoll_poll+0x9f/0x220
      [   30.634879]        __io_arm_poll_handler+0xbf/0x220
      [   30.635462]        io_issue_sqe+0xa6b/0x13e0
      [   30.635982]        __io_queue_sqe+0x10b/0x550
      [   30.636648]        io_queue_sqe+0x235/0x470
      [   30.637281]        io_submit_sqes+0xcce/0xf10
      [   30.637839]        __x64_sys_io_uring_enter+0x3fb/0x5b0
      [   30.638465]        do_syscall_64+0x2d/0x40
      [   30.638999]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [   30.639643]
      [   30.639643] other info that might help us debug this:
      [   30.639643]
      [   30.640618]  Possible unsafe locking scenario:
      [   30.640618]
      [   30.641402]        CPU0                    CPU1
      [   30.641938]        ----                    ----
      [   30.642664]   lock(&ctx->uring_lock);
      [   30.643425]                                lock(&ep->mtx);
      [   30.644498]                                lock(&ctx->uring_lock);
      [   30.645668]   lock(&ep->mtx);
      [   30.646321]
      [   30.646321]  *** DEADLOCK ***
      [   30.646321]
      [   30.647642] 1 lock held by a.out/1128:
      [   30.648424]  #0: ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0
      [   30.649954]
      [   30.649954] stack backtrace:
      [   30.650592] CPU: 1 PID: 1128 Comm: a.out Not tainted 5.11.0-rc3-next-20210115 #1
      [   30.651554] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [   30.652290] Call Trace:
      [   30.652688]  dump_stack+0xac/0xe3
      [   30.653164]  check_noncircular+0x11e/0x130
      [   30.653747]  ? check_prevs_add+0x226/0xb00
      [   30.654303]  check_prevs_add+0x226/0xb00
      [   30.654845]  ? add_lock_to_list.constprop.49+0xac/0x1d0
      [   30.655564]  __lock_acquire+0x1237/0x13a0
      [   30.656262]  lock_acquire+0x2c7/0x390
      [   30.656788]  ? __ep_eventpoll_poll+0x9f/0x220
      [   30.657379]  ? __io_queue_proc.isra.88+0x180/0x180
      [   30.658014]  __mutex_lock+0xae/0x9f0
      [   30.658524]  ? __ep_eventpoll_poll+0x9f/0x220
      [   30.659112]  ? mark_held_locks+0x5a/0x80
      [   30.659648]  ? __ep_eventpoll_poll+0x9f/0x220
      [   30.660229]  ? _raw_spin_unlock_irqrestore+0x2d/0x40
      [   30.660885]  ? trace_hardirqs_on+0x46/0x110
      [   30.661471]  ? __io_queue_proc.isra.88+0x180/0x180
      [   30.662102]  ? __ep_eventpoll_poll+0x9f/0x220
      [   30.662696]  __ep_eventpoll_poll+0x9f/0x220
      [   30.663273]  ? __ep_eventpoll_poll+0x220/0x220
      [   30.663875]  __io_arm_poll_handler+0xbf/0x220
      [   30.664463]  io_issue_sqe+0xa6b/0x13e0
      [   30.664984]  ? __lock_acquire+0x782/0x13a0
      [   30.665544]  ? __io_queue_proc.isra.88+0x180/0x180
      [   30.666170]  ? __io_queue_sqe+0x10b/0x550
      [   30.666725]  __io_queue_sqe+0x10b/0x550
      [   30.667252]  ? __fget_files+0x131/0x260
      [   30.667791]  ? io_req_prep+0xd8/0x1090
      [   30.668316]  ? io_queue_sqe+0x235/0x470
      [   30.668868]  io_queue_sqe+0x235/0x470
      [   30.669398]  io_submit_sqes+0xcce/0xf10
      [   30.669931]  ? xa_load+0xe4/0x1c0
      [   30.670425]  __x64_sys_io_uring_enter+0x3fb/0x5b0
      [   30.671051]  ? lockdep_hardirqs_on_prepare+0xde/0x180
      [   30.671719]  ? syscall_enter_from_user_mode+0x2b/0x80
      [   30.672380]  do_syscall_64+0x2d/0x40
      [   30.672901]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [   30.673503] RIP: 0033:0x7fd89c813239
      [   30.673962] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05  3d 01 f0 ff ff 73 01 c3 48 8b 0d 27 ec 2c 00 f7 d8 64 89 01 48
      [   30.675920] RSP: 002b:00007ffc65a7c628 EFLAGS: 00000217 ORIG_RAX: 00000000000001aa
      [   30.676791] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd89c813239
      [   30.677594] RDX: 0000000000000000 RSI: 0000000000000014 RDI: 0000000000000003
      [   30.678678] RBP: 00007ffc65a7c720 R08: 0000000000000000 R09: 0000000003000000
      [   30.679492] R10: 0000000000000000 R11: 0000000000000217 R12: 0000000000400ff0
      [   30.680282] R13: 00007ffc65a7c840 R14: 0000000000000000 R15: 0000000000000000
      
      This might happen if we do epoll_wait on an io_uring fd while that same
      io_uring instance has an sqe reading/writing the epoll fd.
      So let's not flush the cqring overflow list there; just do a simple check.
      Reported-by: Abaci <abaci@linux.alibaba.com>
      Fixes: 6c503150 ("io_uring: patch up IOPOLL overflow_flush sync")
      Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
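      A hedged sketch of what the fixed poll handler looks like: report readability from a cheap check of pending CQEs or the overflow flag, without taking uring_lock to flush the overflow list. Helper names follow the io_uring source, but the field details are written from memory and may differ.

```c
static __poll_t io_uring_poll(struct file *file, poll_table *wait)
{
	struct io_ring_ctx *ctx = file->private_data;
	__poll_t mask = 0;

	poll_wait(file, &ctx->cq_wait, wait);

	if (!io_sqring_full(ctx))
		mask |= EPOLLOUT | EPOLLWRNORM;

	/*
	 * Don't flush the CQ overflow list here: flushing needs uring_lock
	 * and would nest it inside ep->mtx, the reverse of the epoll_ctl
	 * path, giving the ABBA deadlock in the report above. A simple
	 * check is enough to signal readability.
	 */
	if (io_cqring_events(ctx) || test_bit(0, &ctx->cq_check_overflow))
		mask |= EPOLLIN | EPOLLRDNORM;

	return mask;
}
```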
    • io_uring: defer flushing cached reqs · e5d1bc0a
      Pavel Begunkov committed
      While there are requests in the allocation cache, use them; only when
      those run out go for the stashed memory in comp.free_list. As list
      manipulations are generally heavy and not good for caches, flush them
      all, or as many as we can, in one go.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      [axboe: return success/failure from io_flush_cached_reqs()]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: take comp_state from ctx · c5eef2b9
      Pavel Begunkov committed
      __io_queue_sqe() is always called with a non-NULL comp_state, which is
      taken directly from the context. Don't pass it around; infer it from ctx.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: enable req cache for task_work items · 65453d1e
      Jens Axboe committed
      task_work is run without utilizing the req alloc cache, so any deferred
      items don't get to take advantage of either the alloc or free side of it.
      With task_work now being wrapped by io_uring, we can use the ctx
      completion state to use both the req cache and the completion flush
      batching.
      
      With this, the only request type that cannot take advantage of the req
      cache is IRQ driven IO for regular files / block devices. Anything else,
      including IOPOLL polled IO to those same types, will take advantage of it.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: provide FIFO ordering for task_work · 7cbf1722
      Jens Axboe committed
      task_work is a LIFO list, due to how it's implemented as a lockless
      list. For long chains of task_work, this can be problematic as the
      first entry added is the last one processed. Similarly, we'd waste
      a lot of CPU cycles reversing this list.
      
      Wrap the task_work so we have a single task_work entry per task per
      ctx, and use that to run it in the right order.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
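      A hedged sketch of the wrapping idea: requests queue small nodes onto an ordered per-task/per-ctx list, and only one real task_work callback is registered at a time; that callback drains the list in FIFO order. All structure and helper names below are illustrative assumptions, not the kernel's io_task_work machinery verbatim.

```c
#include <linux/task_work.h>

struct io_tw_node {
	struct list_head	list;
	void			(*func)(struct io_tw_node *node);
};

struct io_tw_queue {			/* per task, per ctx */
	spinlock_t		lock;
	struct list_head	list;
	bool			queued;	/* one callback_head in flight */
	struct callback_head	cb;
};

static void io_tw_run(struct callback_head *cb)
{
	struct io_tw_queue *q = container_of(cb, struct io_tw_queue, cb);
	struct io_tw_node *node, *tmp;
	LIST_HEAD(local);

	spin_lock_irq(&q->lock);
	list_splice_init(&q->list, &local);
	q->queued = false;
	spin_unlock_irq(&q->lock);

	/* list_add_tail() in io_tw_add() keeps submission (FIFO) order */
	list_for_each_entry_safe(node, tmp, &local, list)
		node->func(node);
}

static int io_tw_add(struct task_struct *tsk, struct io_tw_queue *q,
		     struct io_tw_node *node)
{
	bool need_queue;

	spin_lock_irq(&q->lock);
	list_add_tail(&node->list, &q->list);
	need_queue = !q->queued;
	q->queued = true;
	spin_unlock_irq(&q->lock);

	if (!need_queue)
		return 0;
	init_task_work(&q->cb, io_tw_run);
	return task_work_add(tsk, &q->cb, TWA_SIGNAL);
}
```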
    • io_uring: use persistent request cache · 1b4c351f
      Jens Axboe committed
      Now that we have the submit_state in the ring itself, we can have io_kiocb
      allocations that are persistent across invocations. This reduces the time
      spent doing slab allocations and frees.
      
      [sil: rebased]
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: feed reqs back into alloc cache · 6ff119a6
      Pavel Begunkov committed
      Make io_req_free_batch(), which is used for inline-executed requests and
      IOPOLL, return requests back into the allocation cache, avoiding most of
      the kmalloc()/kfree() churn for those cases.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: persistent req cache · bf019da7
      Pavel Begunkov committed
      Don't free batch-allocated requests across syscalls.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: count ctx refs separately from reqs · 9ae72463
      Pavel Begunkov committed
      Currently, batch free handles request memory freeing and ctx ref putting
      together. Separate them and use different counters; that will be needed
      for reusing the reqs' memory.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
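      A hedged sketch of the split: the batch keeps one counter for requests handed back to the cache and a separate one for ctx references, which are dropped in a single percpu_ref_put_many() at the end. Field names and the io_req_cache_put_many() helper are illustrative assumptions.

```c
#include <linux/percpu-refcount.h>

struct req_batch {
	unsigned int	to_free;	/* reqs to hand back to the cache */
	unsigned int	ctx_refs;	/* ctx references to drop in bulk */
};

static void io_req_free_batch_finish(struct io_ring_ctx *ctx,
				     struct req_batch *rb)
{
	if (rb->ctx_refs)
		percpu_ref_put_many(&ctx->refs, rb->ctx_refs);
	if (rb->to_free)
		io_req_cache_put_many(ctx, rb->to_free);  /* hypothetical */
}
```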
    • io_uring: remove fallback_req · 3893f39f
      Pavel Begunkov committed
      Remove fallback_req for now, it gets in the way of other changes.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: submit-completion free batching · 905c172f
      Pavel Begunkov committed
      io_submit_flush_completions() does completion batching, but may also use
      free batching as iopoll does. The main beneficiaries should be buffered
      reads/writes and send/recv.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: replace list with array for compl batch · 6dd0be1e
      Pavel Begunkov committed
      Reincarnation of an old patch that replaces a list in struct
      io_compl_batch with an array. It's needed to avoid hooking requests via
      their compl.list, because that won't always be available in the future.
      
      It's also nice to split io_submit_flush_completions() to avoid freeing
      under locks, and to remove the unlock/lock pair with its long comment
      describing when that can be done.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
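      A hedged sketch of the array-based batch: completions are filled from a small fixed array under completion_lock, the eventfd/waiters are signalled, and the requests are put back outside the lock. The batch size and the helper signatures are assumptions written from memory of the io_uring source.

```c
#define IO_COMPL_BATCH	32			/* illustrative size */

struct io_comp_state {
	unsigned int	 nr;
	struct io_kiocb	*reqs[IO_COMPL_BATCH];	/* array, not a list */
};

static void io_submit_flush_completions(struct io_ring_ctx *ctx,
					 struct io_comp_state *cs)
{
	unsigned int i;

	spin_lock_irq(&ctx->completion_lock);
	for (i = 0; i < cs->nr; i++)
		__io_cqring_fill_event(cs->reqs[i], cs->reqs[i]->result, 0);
	io_commit_cqring(ctx);
	spin_unlock_irq(&ctx->completion_lock);

	io_cqring_ev_posted(ctx);

	/* drop references / recycle outside the lock */
	for (i = 0; i < cs->nr; i++)
		io_put_req(cs->reqs[i]);
	cs->nr = 0;
}
```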
    • io_uring: don't reinit submit state every time · 5087275d
      Pavel Begunkov committed
      As submit_state is now retained across syscalls, we can save ourselves
      from initialising it from the ground up for each io_submit_sqes(). Set
      some fields during ctx allocation, and just keep them always consistent.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      [axboe: remove unnecessary zeroing of ctx members]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: remove ctx from comp_state · ba88ff11
      Pavel Begunkov committed
      Completion state is closely bound to ctx; we don't need to store ctx
      inside it, as we always have it around to pass to the flush.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: don't keep submit_state on stack · 258b29a9
      Pavel Begunkov committed
      struct io_submit_state is quite big (168 bytes) and going to grow. It's
      better not to keep it on the stack as it is now. Move it into the
      context; it's always protected by uring_lock, so it's fine to have only
      one instance of it.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: don't propagate io_comp_state · 889fca73
      Pavel Begunkov committed
      There is no reason to drag io_comp_state into the opcode handlers; we
      just need a flag, and the actual work will be done in __io_queue_sqe().
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: make op handlers always take issue flags · 61e98203
      Pavel Begunkov committed
      Make the opcode handler interfaces a bit more consistent by always
      passing in issue flags. A bulky but pretty easy and mechanical change.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: replace force_nonblock with flags · 45d189c6
      Pavel Begunkov committed
      Replace the bool force_nonblock with flags. This serves the long-standing
      goal of differentiating the context from which we execute. Currently we
      have some places where invariants, like holding uring_lock, are only
      subtly inferred.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
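      A hedged sketch of the resulting interface: a bitmask of issue flags replaces the bool, so a handler can tell both "may I block?" and "is uring_lock held for deferred completion?" from one argument. The flag names follow the io_uring convention but are written from memory; the handler body is illustrative.

```c
/* Sketch: issue flags describing the calling context. */
enum {
	IO_URING_F_NONBLOCK		= 1,	/* was force_nonblock */
	IO_URING_F_COMPLETE_DEFER	= 2,	/* uring_lock held, may batch */
};

static int io_nop(struct io_kiocb *req, unsigned int issue_flags)
{
	if (issue_flags & IO_URING_F_NONBLOCK) {
		/* anything that might sleep gets punted to async context */
	}
	__io_req_complete(req, issue_flags, 0, 0);
	return 0;
}
```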
  5. 08 Feb 2021, 1 commit
  6. 05 Feb 2021, 1 commit
    • io_uring: refactor sendmsg/recvmsg iov managing · 257e84a5
      Pavel Begunkov committed
      Current iov handling with recvmsg/sendmsg may be confusing. First, make a
      rule for msg->iov: either it points to an allocated iov that has to be
      kfree()'d later, or it's NULL and we use fast_iov. That's much better
      than the current three-state scheme (where it can also point to
      fast_iov). And rename it to free_iov for uniformity with read/write.
      
      Also, the fixing up of msg.msg_iter.iov after the struct io_async_msghdr
      copy has been happening in io_recvmsg()/io_sendmsg(). Move it into
      io_setup_async_msg(); that's the right place.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      [axboe: add comment on NULL check before kfree()]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
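      A hedged sketch of the two-state rule: free_iov is either NULL (the inline fast_iov backs the iterator) or a kmalloc()'d array that must be kfree()'d, so cleanup needs only one check. The struct layout and the io_msg_cleanup() helper are illustrative assumptions, not the exact kernel definitions.

```c
#include <linux/socket.h>
#include <linux/uio.h>

struct io_async_msghdr {
	struct iovec	fast_iov[UIO_FASTIOV];	/* inline, never freed */
	struct iovec	*free_iov;		/* NULL or kmalloc()'d */
	struct msghdr	msg;
};

static void io_msg_cleanup(struct io_async_msghdr *kmsg)
{
	/* explicit NULL check documents the invariant, even though
	 * kfree(NULL) would be a harmless no-op */
	if (kmsg->free_iov)
		kfree(kmsg->free_iov);
	kmsg->free_iov = NULL;
}
```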