1. 10 2月, 2021 20 次提交
    • J
      io_uring: enable kmemcg account for io_uring requests · 91f245d5
      Jens Axboe 提交于
      This puts io_uring under the memory cgroups accounting and limits for
      requests.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      91f245d5
    • J
      io_uring: enable req cache for IRQ driven IO · c7dae4ba
      Jens Axboe 提交于
      This is the last class of requests that cannot utilize the req alloc
      cache. Add a per-ctx req cache that is protected by the completion_lock,
      and refill our submit side cache when it gets over our batch count.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c7dae4ba
    • H
      io_uring: fix possible deadlock in io_uring_poll · ed670c3f
      Hao Xu 提交于
      Abaci reported follow issue:
      
      [   30.615891] ======================================================
      [   30.616648] WARNING: possible circular locking dependency detected
      [   30.617423] 5.11.0-rc3-next-20210115 #1 Not tainted
      [   30.618035] ------------------------------------------------------
      [   30.618914] a.out/1128 is trying to acquire lock:
      [   30.619520] ffff88810b063868 (&ep->mtx){+.+.}-{3:3}, at: __ep_eventpoll_poll+0x9f/0x220
      [   30.620505]
      [   30.620505] but task is already holding lock:
      [   30.621218] ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0
      [   30.622349]
      [   30.622349] which lock already depends on the new lock.
      [   30.622349]
      [   30.623289]
      [   30.623289] the existing dependency chain (in reverse order) is:
      [   30.624243]
      [   30.624243] -> #1 (&ctx->uring_lock){+.+.}-{3:3}:
      [   30.625263]        lock_acquire+0x2c7/0x390
      [   30.625868]        __mutex_lock+0xae/0x9f0
      [   30.626451]        io_cqring_overflow_flush.part.95+0x6d/0x70
      [   30.627278]        io_uring_poll+0xcb/0xd0
      [   30.627890]        ep_item_poll.isra.14+0x4e/0x90
      [   30.628531]        do_epoll_ctl+0xb7e/0x1120
      [   30.629122]        __x64_sys_epoll_ctl+0x70/0xb0
      [   30.629770]        do_syscall_64+0x2d/0x40
      [   30.630332]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [   30.631187]
      [   30.631187] -> #0 (&ep->mtx){+.+.}-{3:3}:
      [   30.631985]        check_prevs_add+0x226/0xb00
      [   30.632584]        __lock_acquire+0x1237/0x13a0
      [   30.633207]        lock_acquire+0x2c7/0x390
      [   30.633740]        __mutex_lock+0xae/0x9f0
      [   30.634258]        __ep_eventpoll_poll+0x9f/0x220
      [   30.634879]        __io_arm_poll_handler+0xbf/0x220
      [   30.635462]        io_issue_sqe+0xa6b/0x13e0
      [   30.635982]        __io_queue_sqe+0x10b/0x550
      [   30.636648]        io_queue_sqe+0x235/0x470
      [   30.637281]        io_submit_sqes+0xcce/0xf10
      [   30.637839]        __x64_sys_io_uring_enter+0x3fb/0x5b0
      [   30.638465]        do_syscall_64+0x2d/0x40
      [   30.638999]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [   30.639643]
      [   30.639643] other info that might help us debug this:
      [   30.639643]
      [   30.640618]  Possible unsafe locking scenario:
      [   30.640618]
      [   30.641402]        CPU0                    CPU1
      [   30.641938]        ----                    ----
      [   30.642664]   lock(&ctx->uring_lock);
      [   30.643425]                                lock(&ep->mtx);
      [   30.644498]                                lock(&ctx->uring_lock);
      [   30.645668]   lock(&ep->mtx);
      [   30.646321]
      [   30.646321]  *** DEADLOCK ***
      [   30.646321]
      [   30.647642] 1 lock held by a.out/1128:
      [   30.648424]  #0: ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0
      [   30.649954]
      [   30.649954] stack backtrace:
      [   30.650592] CPU: 1 PID: 1128 Comm: a.out Not tainted 5.11.0-rc3-next-20210115 #1
      [   30.651554] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [   30.652290] Call Trace:
      [   30.652688]  dump_stack+0xac/0xe3
      [   30.653164]  check_noncircular+0x11e/0x130
      [   30.653747]  ? check_prevs_add+0x226/0xb00
      [   30.654303]  check_prevs_add+0x226/0xb00
      [   30.654845]  ? add_lock_to_list.constprop.49+0xac/0x1d0
      [   30.655564]  __lock_acquire+0x1237/0x13a0
      [   30.656262]  lock_acquire+0x2c7/0x390
      [   30.656788]  ? __ep_eventpoll_poll+0x9f/0x220
      [   30.657379]  ? __io_queue_proc.isra.88+0x180/0x180
      [   30.658014]  __mutex_lock+0xae/0x9f0
      [   30.658524]  ? __ep_eventpoll_poll+0x9f/0x220
      [   30.659112]  ? mark_held_locks+0x5a/0x80
      [   30.659648]  ? __ep_eventpoll_poll+0x9f/0x220
      [   30.660229]  ? _raw_spin_unlock_irqrestore+0x2d/0x40
      [   30.660885]  ? trace_hardirqs_on+0x46/0x110
      [   30.661471]  ? __io_queue_proc.isra.88+0x180/0x180
      [   30.662102]  ? __ep_eventpoll_poll+0x9f/0x220
      [   30.662696]  __ep_eventpoll_poll+0x9f/0x220
      [   30.663273]  ? __ep_eventpoll_poll+0x220/0x220
      [   30.663875]  __io_arm_poll_handler+0xbf/0x220
      [   30.664463]  io_issue_sqe+0xa6b/0x13e0
      [   30.664984]  ? __lock_acquire+0x782/0x13a0
      [   30.665544]  ? __io_queue_proc.isra.88+0x180/0x180
      [   30.666170]  ? __io_queue_sqe+0x10b/0x550
      [   30.666725]  __io_queue_sqe+0x10b/0x550
      [   30.667252]  ? __fget_files+0x131/0x260
      [   30.667791]  ? io_req_prep+0xd8/0x1090
      [   30.668316]  ? io_queue_sqe+0x235/0x470
      [   30.668868]  io_queue_sqe+0x235/0x470
      [   30.669398]  io_submit_sqes+0xcce/0xf10
      [   30.669931]  ? xa_load+0xe4/0x1c0
      [   30.670425]  __x64_sys_io_uring_enter+0x3fb/0x5b0
      [   30.671051]  ? lockdep_hardirqs_on_prepare+0xde/0x180
      [   30.671719]  ? syscall_enter_from_user_mode+0x2b/0x80
      [   30.672380]  do_syscall_64+0x2d/0x40
      [   30.672901]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [   30.673503] RIP: 0033:0x7fd89c813239
      [   30.673962] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05  3d 01 f0 ff ff 73 01 c3 48 8b 0d 27 ec 2c 00 f7 d8 64 89 01 48
      [   30.675920] RSP: 002b:00007ffc65a7c628 EFLAGS: 00000217 ORIG_RAX: 00000000000001aa
      [   30.676791] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd89c813239
      [   30.677594] RDX: 0000000000000000 RSI: 0000000000000014 RDI: 0000000000000003
      [   30.678678] RBP: 00007ffc65a7c720 R08: 0000000000000000 R09: 0000000003000000
      [   30.679492] R10: 0000000000000000 R11: 0000000000000217 R12: 0000000000400ff0
      [   30.680282] R13: 00007ffc65a7c840 R14: 0000000000000000 R15: 0000000000000000
      
      This might happen if we do epoll_wait on a uring fd while reading/writing
      the former epoll fd in a sqe in the former uring instance.
      So let's don't flush cqring overflow list, just do a simple check.
      Reported-by: NAbaci <abaci@linux.alibaba.com>
      Fixes: 6c503150 ("io_uring: patch up IOPOLL overflow_flush sync")
      Signed-off-by: NHao Xu <haoxu@linux.alibaba.com>
      Reviewed-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ed670c3f
    • P
      io_uring: defer flushing cached reqs · e5d1bc0a
      Pavel Begunkov 提交于
      Awhile there are requests in the allocation cache -- use them, only if
      those ended go for the stashed memory in comp.free_list. As list
      manipulation are generally heavy and are not good for caches, flush them
      all or as much as can in one go.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      [axboe: return success/failure from io_flush_cached_reqs()]
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e5d1bc0a
    • P
      io_uring: take comp_state from ctx · c5eef2b9
      Pavel Begunkov 提交于
      __io_queue_sqe() is always called with a non-NULL comp_state, which is
      taken directly from context. Don't pass it around but infer from ctx.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c5eef2b9
    • J
      io_uring: enable req cache for task_work items · 65453d1e
      Jens Axboe 提交于
      task_work is run without utilizing the req alloc cache, so any deferred
      items don't get to take advantage of either the alloc or free side of it.
      With task_work now being wrapped by io_uring, we can use the ctx
      completion state to both use the req cache and the completion flush
      batching.
      
      With this, the only request type that cannot take advantage of the req
      cache is IRQ driven IO for regular files / block devices. Anything else,
      including IOPOLL polled IO to those same tyes, will take advantage of it.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      65453d1e
    • J
      io_uring: provide FIFO ordering for task_work · 7cbf1722
      Jens Axboe 提交于
      task_work is a LIFO list, due to how it's implemented as a lockless
      list. For long chains of task_work, this can be problematic as the
      first entry added is the last one processed. Similarly, we'd waste
      a lot of CPU cycles reversing this list.
      
      Wrap the task_work so we have a single task_work entry per task per
      ctx, and use that to run it in the right order.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7cbf1722
    • J
      io_uring: use persistent request cache · 1b4c351f
      Jens Axboe 提交于
      Now that we have the submit_state in the ring itself, we can have io_kiocb
      allocations that are persistent across invocations. This reduces the time
      spent doing slab allocations and frees.
      
      [sil: rebased]
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      1b4c351f
    • P
      io_uring: feed reqs back into alloc cache · 6ff119a6
      Pavel Begunkov 提交于
      Make io_req_free_batch(), which is used for inline executed requests and
      IOPOLL, to return requests back into the allocation cache, so avoid
      most of kmalloc()/kfree() for those cases.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6ff119a6
    • P
      io_uring: persistent req cache · bf019da7
      Pavel Begunkov 提交于
      Don't free batch-allocated requests across syscalls.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      bf019da7
    • P
      io_uring: count ctx refs separately from reqs · 9ae72463
      Pavel Begunkov 提交于
      Currently batch free handles request memory freeing and ctx ref putting
      together. Separate them and use different counters, that will be needed
      for reusing reqs memory.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      9ae72463
    • P
      io_uring: remove fallback_req · 3893f39f
      Pavel Begunkov 提交于
      Remove fallback_req for now, it gets in the way of other changes.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      3893f39f
    • P
      io_uring: submit-completion free batching · 905c172f
      Pavel Begunkov 提交于
      io_submit_flush_completions() does completion batching, but may also use
      free batching as iopoll does. The main beneficiaries should be buffered
      reads/writes and send/recv.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      905c172f
    • P
      io_uring: replace list with array for compl batch · 6dd0be1e
      Pavel Begunkov 提交于
      Reincarnation of an old patch that replaces a list in struct
      io_compl_batch with an array. It's needed to avoid hooking requests via
      their compl.list, because it won't be always available in the future.
      
      It's also nice to split io_submit_flush_completions() to avoid free
      under locks and remove unlock/lock with a long comment describing when
      it can be done.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6dd0be1e
    • P
      io_uring: don't reinit submit state every time · 5087275d
      Pavel Begunkov 提交于
      As now submit_state is retained across syscalls, we can save ourself
      from initialising it from ground up for each io_submit_sqes(). Set some
      fields during ctx allocation, and just keep them always consistent.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      [axboe: remove unnecessary zeroing of ctx members]
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5087275d
    • P
      io_uring: remove ctx from comp_state · ba88ff11
      Pavel Begunkov 提交于
      completion state is closely bound to ctx, we don't need to store ctx
      inside as we always have it around to pass to flush.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ba88ff11
    • P
      io_uring: don't keep submit_state on stack · 258b29a9
      Pavel Begunkov 提交于
      struct io_submit_state is quite big (168 bytes) and going to grow. It's
      better to not keep it on stack as it is now. Move it to context, it's
      always protected by uring_lock, so it's fine to have only one instance
      of it.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      258b29a9
    • P
      io_uring: don't propagate io_comp_state · 889fca73
      Pavel Begunkov 提交于
      There is no reason to drag io_comp_state into opcode handlers, we just
      need a flag and the actual work will be done in __io_queue_sqe().
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      889fca73
    • P
      io_uring: make op handlers always take issue flags · 61e98203
      Pavel Begunkov 提交于
      Make opcode handler interfaces a bit more consistent by always passing
      in issue flags. Bulky but pretty easy and mechanical change.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      61e98203
    • P
      io_uring: replace force_nonblock with flags · 45d189c6
      Pavel Begunkov 提交于
      Replace bool force_nonblock with flags. It has a long standing goal of
      differentiating context from which we execute. Currently we have some
      subtle places where some invariants, like holding of uring_lock, are
      subtly inferred.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      45d189c6
  2. 08 2月, 2021 1 次提交
  3. 05 2月, 2021 3 次提交
  4. 04 2月, 2021 13 次提交
  5. 02 2月, 2021 3 次提交