1. 28 Jun, 2020 11 commits
    • P
      io_uring: kill REQ_F_LINK_NEXT · 9b0d911a
      Pavel Begunkov committed
      After pulling nxt from a request, it is no longer a link's head, so clear
      REQ_F_LINK_HEAD. The absence of this flag also indicates that there are no
      linked requests, so it replaces REQ_F_LINK_NEXT, which can be killed.
      
      Linked timeouts also behave, leaving the flag intact when necessary.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • P
      io_uring: cosmetic changes for batch free · 2d6500d4
      Pavel Begunkov committed
      Move all batch free bits close to each other and rename in a consistent
      way.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • P
      io_uring: batch-free linked requests as well · c3524383
      Pavel Begunkov committed
      There is no reason not to batch deallocation of linked requests. Take
      away a request's next req first and handle it like everything else in
      io_req_multi_free().
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • P
      io_uring: dismantle req early and remove need_iter · 2757a23e
      Pavel Begunkov committed
      Every request in io_req_multi_free() has ->file set. Instead of
      pointlessly deferring and counting reqs with file, dismantle it in place
      and save it for batch dealloc.
      
      It also saves us from potentially skipping io_cleanup_req(), put_task(),
      etc. That never happens though, because ->file is always there.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • P
      io_uring: remove inflight batching in free_many() · e6543a81
      Pavel Begunkov committed
      io_free_req_many() is used only for iopoll requests, i.e. reads/writes.
      Hence no need to batch inflight unhooking. For safety, it'll be done by
      io_dismantle_req(), which replaces __io_req_aux_free(), and looks more
      solid and cleaner.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • P
      io_uring: fix refs underflow in io_iopoll_queue() · 8c9cb6cd
      Pavel Begunkov committed
      Now that io_complete_rw_common() puts a ref, the extra io_put_req() in
      io_iopoll_queue() causes an underflow. Remove it.
      
      [  455.998620] refcount_t: underflow; use-after-free.
      [  455.998743] WARNING: CPU: 6 PID: 285394 at lib/refcount.c:28
      	refcount_warn_saturate+0xae/0xf0
      [  455.998772] CPU: 6 PID: 285394 Comm: read-write2 Tainted: G
                I E     5.8.0-rc2-00048-g1b1aa738f167-dirty #509
      [  455.998772] RIP: 0010:refcount_warn_saturate+0xae/0xf0
      ...
      [  455.998778] Call Trace:
      [  455.998778]  io_put_req+0x44/0x50
      [  455.998778]  io_iopoll_complete+0x245/0x370
      [  455.998779]  io_iopoll_getevents+0x12f/0x1a0
      [  455.998779]  io_iopoll_reap_events.part.0+0x5e/0xa0
      [  455.998780]  io_ring_ctx_wait_and_kill+0x132/0x1c0
      [  455.998780]  io_uring_release+0x20/0x30
      [  455.998780]  __fput+0xcd/0x230
      [  455.998781]  ____fput+0xe/0x10
      [  455.998781]  task_work_run+0x67/0xa0
      [  455.998781]  do_exit+0x35d/0xb70
      [  455.998782]  do_group_exit+0x43/0xa0
      [  455.998783]  get_signal+0x140/0x900
      [  455.998783]  do_signal+0x37/0x780
      [  455.998784]  __prepare_exit_to_usermode+0x126/0x1c0
      [  455.998785]  __syscall_return_slowpath+0x3b/0x1c0
      [  455.998785]  do_syscall_64+0x5f/0xa0
      [  455.998785]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: a1d7c393 ("io_uring: enable READ/WRITE to use deferred completions")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
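The underflow reported in the commit above can be modelled in user space. This is a minimal sketch, not the kernel code: `demo_req`, `demo_put()` and `demo_simulate()` are hypothetical names, and the starting count of two references is an assumption chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a request with a reference count. */
struct demo_req {
    unsigned int refs;
    bool underflow;   /* set instead of wrapping around, loosely like refcount_t */
};

/* Drop one reference; record an underflow if none are left. */
static void demo_put(struct demo_req *req)
{
    if (req->refs == 0) {
        req->underflow = true;
        return;
    }
    req->refs--;
}

/*
 * Model one polled request. The completion helper drops a reference,
 * the queue path optionally drops an extra one (the bug), and the
 * reaper drops the final one. Returns true if an underflow was seen.
 */
bool demo_simulate(bool extra_put_in_queue)
{
    struct demo_req req = { .refs = 2, .underflow = false }; /* assumed initial count */

    demo_put(&req);             /* completion helper puts a ref */
    if (extra_put_in_queue)
        demo_put(&req);         /* the extra put that the fix removes */
    demo_put(&req);             /* final put when the request is reaped */

    assert(req.refs == 0);
    return req.underflow;
}
```

Calling `demo_simulate(true)` reports an underflow, while `demo_simulate(false)` does not, which mirrors why dropping the extra put is sufficient in this simplified model.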
    • P
      io_uring: fix missing io_grab_files() · 710c2bfb
      Pavel Begunkov committed
      We won't have valid ring_fd, ring_file in task work. Grab files early.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • P
      io_uring: don't mark link's head for_async · a6d45dd0
      Pavel Begunkov committed
      There is no reason to mark the head of a link as for-async in
      io_req_defer_prep(). grab_env(), etc. will be done later during
      submission if necessary.
      
      Mark for_async=false, saving an extra grab_env() in many cases.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • P
      io_uring: fix feeding io-wq with uninit reqs · 1bcb8c5d
      Pavel Begunkov committed
      io_steal_work() can't be sure that @nxt has req->work properly set, so we
      can't pass it to io-wq as is.
      
      A dirty quick fix -- drag it through io_req_task_queue(), and always
      return NULL from io_steal_work().
      
      e.g.
      
      [   50.770161] BUG: kernel NULL pointer dereference, address: 00000000
      [   50.770164] #PF: supervisor write access in kernel mode
      [   50.770164] #PF: error_code(0x0002) - not-present page
      [   50.770168] CPU: 1 PID: 1448 Comm: io_wqe_worker-0 Tainted: G
      	I       5.8.0-rc2-00035-g2237d765-dirty #494
      [   50.770172] RIP: 0010:override_creds+0x19/0x30
      ...
      [   50.770183]  io_worker_handle_work+0x25c/0x430
      [   50.770185]  io_wqe_worker+0x2a0/0x350
      [   50.770190]  kthread+0x136/0x180
      [   50.770194]  ret_from_fork+0x22/0x30
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • P
      io_uring: fix punting req w/o grabbed env · 906a8c3f
      Pavel Begunkov committed
      It's not enough to check for REQ_F_WORK_INITIALIZED and punt async
      assuming that io_req_work_grab_env() was called, it may not have been.
      E.g. io_close_prep() and personality path set the flag without further
      async init.
      
      As a quick fix, always pass next work through io_req_task_queue().
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • P
      io_uring: fix req->work corruption · 8ef77766
      Pavel Begunkov committed
      req->work and req->task_work are in a union, so io_req_task_queue() screws
      everything that was in work. De-union them for now.
      
      [  704.367253] BUG: unable to handle page fault for address:
      	ffffffffaf7330d0
      [  704.367256] #PF: supervisor write access in kernel mode
      [  704.367256] #PF: error_code(0x0003) - permissions violation
      [  704.367261] CPU: 6 PID: 1654 Comm: io_wqe_worker-0 Tainted: G
      I       5.8.0-rc2-00038-ge28d0bdc4863-dirty #498
      [  704.367265] RIP: 0010:_raw_spin_lock+0x1e/0x36
      ...
      [  704.367276]  __alloc_fd+0x35/0x150
      [  704.367279]  __get_unused_fd_flags+0x25/0x30
      [  704.367280]  io_openat2+0xcb/0x1b0
      [  704.367283]  io_issue_sqe+0x36a/0x1320
      [  704.367294]  io_wq_submit_work+0x58/0x160
      [  704.367295]  io_worker_handle_work+0x2a3/0x430
      [  704.367296]  io_wqe_worker+0x2a0/0x350
      [  704.367301]  kthread+0x136/0x180
      [  704.367304]  ret_from_fork+0x22/0x30
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
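The effect of keeping two members in a union, as described in the commit above, can be shown with a small user-space sketch. The struct layouts and all `demo_*` names below are hypothetical and much simpler than the real `req->work` and `req->task_work`; the only point is that writing one union member overwrites the other.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the two members. */
struct demo_work { void *files; unsigned int flags; };
struct demo_task_work { void *next; void (*func)(void *); };

/* Layout before the fix: both members share the same storage. */
struct demo_req_union {
    union {
        struct demo_work work;
        struct demo_task_work task_work;
    };
};

/* Layout after the fix: each member has its own storage. */
struct demo_req_split {
    struct demo_work work;
    struct demo_task_work task_work;
};

static void demo_cb(void *arg) { (void)arg; }

static int demo_marker;

/* Returns 1 if work.files survives queueing task_work, 0 otherwise. */
int demo_union_keeps_work(void)
{
    struct demo_req_union req;

    req.work.files = &demo_marker;
    req.work.flags = 1;
    req.task_work.next = NULL;      /* shares storage with work.files */
    req.task_work.func = demo_cb;
    return req.work.files == &demo_marker;
}

int demo_split_keeps_work(void)
{
    struct demo_req_split req;

    req.work.files = &demo_marker;
    req.work.flags = 1;
    req.task_work.next = NULL;      /* separate storage, work is untouched */
    req.task_work.func = demo_cb;
    return req.work.files == &demo_marker;
}
```

In this sketch `demo_union_keeps_work()` returns 0 and `demo_split_keeps_work()` returns 1. Separating the members costs some space per request, which may be why the commit message says "for now".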
  2. 27 Jun, 2020 3 commits
    • R
      io_uring: fix function args for !CONFIG_NET · 1e16c2f9
      Randy Dunlap committed
      Fix build errors when CONFIG_NET is not set/enabled:
      
      ../fs/io_uring.c:5472:10: error: too many arguments to function ‘io_sendmsg’
      ../fs/io_uring.c:5474:10: error: too many arguments to function ‘io_send’
      ../fs/io_uring.c:5484:10: error: too many arguments to function ‘io_recvmsg’
      ../fs/io_uring.c:5486:10: error: too many arguments to function ‘io_recv’
      ../fs/io_uring.c:5510:9: error: too many arguments to function ‘io_accept’
      ../fs/io_uring.c:5518:9: error: too many arguments to function ‘io_connect’
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: io-uring@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • P
      io-wq: return next work from ->do_work() directly · f4db7182
      Pavel Begunkov committed
      It's easier to return next work from ->do_work() than
      having an in-out argument. Looks nicer and easier to compile.
      Also, merge io_wq_assign_next() into its only user.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
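The difference between an in-out argument and a returned next work item, as discussed in the commit above, can be sketched as follows. This is an illustration only: `demo_work` and the `demo_*` functions are hypothetical and do not reproduce the actual io-wq signatures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical work item that may point at a follow-up item. */
struct demo_work {
    int id;
    struct demo_work *next;
};

/* Old shape: the handler replaces *workptr with the next item. */
static void demo_do_work_inout(struct demo_work **workptr)
{
    struct demo_work *work = *workptr;

    /* ... process work ... */
    *workptr = work->next;
}

/* New shape: the handler simply returns the next item. */
static struct demo_work *demo_do_work_ret(struct demo_work *work)
{
    /* ... process work ... */
    return work->next;
}

/* Worker loops for both shapes; each returns the number of items handled. */
int demo_run_inout(struct demo_work *work)
{
    int handled = 0;

    while (work) {
        demo_do_work_inout(&work);
        handled++;
    }
    return handled;
}

int demo_run_ret(struct demo_work *work)
{
    int handled = 0;

    while (work) {
        work = demo_do_work_ret(work);
        handled++;
    }
    return handled;
}
```

Both loops handle the same chain of items. The returning variant avoids taking the address of the loop variable, which is one plausible reading of "easier to compile".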
    • J
      io_uring: use task_work for links if possible · c40f6379
      Jens Axboe committed
      Currently links are always done in an async fashion, unless we catch them
      inline after we successfully complete a request without having to resort
      to blocking. This isn't necessarily the most efficient approach, it'd be
      more ideal if we could just use the task_work handling for this.
      
      Outside of saving an async jump, we can also do less prep work for these
      kinds of requests.
      
      Running dependent links from the task_work handler yields some nice
      performance benefits. As an example, examples/link-cp from the liburing
      repository uses read+write links to implement a copy operation. Without
      this patch, a cache cold 4G file read from a VM runs in about 3
      seconds:
      
      $ time examples/link-cp /data/file /dev/null
      
      real	0m2.986s
      user	0m0.051s
      sys	0m2.843s
      
      and a subsequent cache hot run looks like this:
      
      $ time examples/link-cp /data/file /dev/null
      
      real	0m0.898s
      user	0m0.069s
      sys	0m0.797s
      
      With this patch in place, the cold case takes about 2.4 seconds:
      
      $ time examples/link-cp /data/file /dev/null
      
      real	0m2.400s
      user	0m0.020s
      sys	0m2.366s
      
      and the cache hot case looks like this:
      
      $ time examples/link-cp /data/file /dev/null
      
      real	0m0.676s
      user	0m0.010s
      sys	0m0.665s
      
      As expected, the (mostly) cache hot case yields the biggest improvement,
      running about 25% faster with this change, while the cache cold case
      yields about a 20% increase in performance. Outside of the performance
      increase, we're using less CPU as well, as we're not using the async
      offload threads at all for this anymore.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
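The percentages quoted in the commit above can be checked against the `real` timings it lists. The helper below is a hypothetical sketch that computes the relative drop in wall-clock time; it assumes "faster" refers to that drop.

```c
#include <assert.h>

/* Percentage by which wall-clock time dropped going from `before` to `after`. */
double demo_time_reduction_pct(double before, double after)
{
    assert(before > 0.0);
    return (before - after) / before * 100.0;
}
```

With the listed timings, `demo_time_reduction_pct(2.986, 2.400)` is roughly 19.6 for the cache cold run and `demo_time_reduction_pct(0.898, 0.676)` is roughly 24.7 for the cache hot run, in line with the "about 20%" and "about 25%" figures.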
  3. 25 Jun, 2020 8 commits
    • J
      io_uring: enable READ/WRITE to use deferred completions · a1d7c393
      Jens Axboe committed
      A bit more surgery required here, as completions are generally done
      through the kiocb->ki_complete() callback, even if they complete inline.
      This enables the regular read/write path to use the io_comp_state
      logic to batch inline completions.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • J
      io_uring: pass in completion state to appropriate issue side handlers · 229a7b63
      Jens Axboe committed
      Provide the completion state to the handlers that we know can complete
      inline, so they can utilize this for batching completions.
      
      Cap the max batch count at 32. This should be enough to provide a good
      amortization of the cost of the lock+commit dance for completions, while
      still being low enough not to cause any real latency issues for SQPOLL
      applications.
      
      Xuan Zhuo <xuanzhuo@linux.alibaba.com> reports that this changes his
      profile from:
      
      17.97% [kernel] [k] copy_user_generic_unrolled
      13.92% [kernel] [k] io_commit_cqring
      11.04% [kernel] [k] __io_cqring_fill_event
      10.33% [kernel] [k] udp_recvmsg
       5.94% [kernel] [k] skb_release_data
       4.31% [kernel] [k] udp_rmem_release
       2.68% [kernel] [k] __check_object_size
       2.24% [kernel] [k] __slab_free
       2.22% [kernel] [k] _raw_spin_lock_bh
       2.21% [kernel] [k] kmem_cache_free
       2.13% [kernel] [k] free_pcppages_bulk
       1.83% [kernel] [k] io_submit_sqes
       1.38% [kernel] [k] page_frag_free
       1.31% [kernel] [k] inet_recvmsg
      
      to
      
      19.99% [kernel] [k] copy_user_generic_unrolled
      11.63% [kernel] [k] skb_release_data
       9.36% [kernel] [k] udp_rmem_release
       8.64% [kernel] [k] udp_recvmsg
       6.21% [kernel] [k] __slab_free
       4.39% [kernel] [k] __check_object_size
       3.64% [kernel] [k] free_pcppages_bulk
       2.41% [kernel] [k] kmem_cache_free
       2.00% [kernel] [k] io_submit_sqes
       1.95% [kernel] [k] page_frag_free
       1.54% [kernel] [k] io_put_req
      [...]
       0.07% [kernel] [k] io_commit_cqring
       0.44% [kernel] [k] __io_cqring_fill_event
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
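The batching idea from the commit above, with its cap of 32, can be sketched in user space. Everything here is hypothetical (`demo_comp_state`, `demo_flush()`, `demo_complete()`, `demo_submit()`); the real code takes a lock and commits the completion ring where this sketch only counts flushes.

```c
#include <assert.h>

#define DEMO_BATCH_MAX 32   /* cap taken from the commit message */

/* Hypothetical completion state carried along the issue path. */
struct demo_comp_state {
    int nr;          /* completions queued but not yet flushed */
    int flushes;     /* how many times the lock+commit step ran */
    int completed;   /* total completions made visible */
};

/* Stand-in for the lock+commit step: publish everything queued so far. */
static void demo_flush(struct demo_comp_state *cs)
{
    if (cs->nr == 0)
        return;
    cs->flushes++;
    cs->completed += cs->nr;
    cs->nr = 0;
}

/* Queue one inline completion, flushing once the cap is reached. */
static void demo_complete(struct demo_comp_state *cs)
{
    if (++cs->nr >= DEMO_BATCH_MAX)
        demo_flush(cs);
}

/* Complete n requests inline and flush the remainder at the end. */
int demo_submit(int n, struct demo_comp_state *cs)
{
    for (int i = 0; i < n; i++)
        demo_complete(cs);
    demo_flush(cs);
    return cs->flushes;
}
```

In this model, 100 inline completions need only 4 flushes instead of 100, which is the kind of amortization that would explain `io_commit_cqring` dropping out of the profile.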
    • J
      io_uring: pass down completion state on the issue side · f13fad7b
      Jens Axboe committed
      No functional changes in this patch, just in preparation for having the
      completion state be available on the issue side. Later on, this will
      allow requests that complete inline to be completed in batches.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • J
      io_uring: add 'io_comp_state' to struct io_submit_state · 013538bd
      Jens Axboe committed
      No functional changes in this patch, just in preparation for passing back
      pending completions to the caller and completing them in a batched
      fashion.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • J
      io_uring: provide generic io_req_complete() helper · e1e16097
      Jens Axboe committed
      We have lots of callers of:
      
      io_cqring_add_event(req, result);
      io_put_req(req);
      
      Provide a helper that does this for us. It helps clean up the code, and
      also provides a more convenient location for us to change the completion
      handling.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
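The helper described in the commit above folds a repeated two-call sequence into one function. The sketch below uses hypothetical mock types and `demo_*` names to show the shape of such a helper; it is not the kernel implementation.

```c
#include <assert.h>

/* Hypothetical mock request: just enough state to observe both steps. */
struct demo_req {
    int refs;      /* outstanding references */
    long result;   /* result posted to the completion queue */
    int posted;    /* number of completion events posted */
};

/* Mock of "add a completion event for this request". */
static void demo_cqring_add_event(struct demo_req *req, long res)
{
    req->result = res;
    req->posted++;
}

/* Mock of "drop one reference to this request". */
static void demo_put_req(struct demo_req *req)
{
    assert(req->refs > 0);
    req->refs--;
}

/* The helper: one call instead of the two-line pattern at every call site. */
void demo_req_complete(struct demo_req *req, long res)
{
    demo_cqring_add_event(req, res);
    demo_put_req(req);
}
```

Having a single entry point also means a later change to how completions are posted only has to touch this one function.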
    • P
      io_uring: fix NULL-mm for linked reqs · d3cac64c
      Pavel Begunkov committed
      __io_queue_sqe() tries to handle all requests of a link,
      so it's not enough to grab mm in io_sq_thread_acquire_mm()
      based just on the head.
      
      Don't check req->needs_mm and do it always.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    • P
      io_uring: fix current->mm NULL dereference on exit · d60b5fbc
      Pavel Begunkov committed
      Don't reissue requests from io_iopoll_reap_events(); the task may not
      have an mm, which ends in a NULL dereference. It's better to kill
      everything off on exit anyway.
      
      [  677.734670] RIP: 0010:io_iopoll_complete+0x27e/0x630
      ...
      [  677.734679] Call Trace:
      [  677.734695]  ? __send_signal+0x1f2/0x420
      [  677.734698]  ? _raw_spin_unlock_irqrestore+0x24/0x40
      [  677.734699]  ? send_signal+0xf5/0x140
      [  677.734700]  io_iopoll_getevents+0x12f/0x1a0
      [  677.734702]  io_iopoll_reap_events.part.0+0x5e/0xa0
      [  677.734703]  io_ring_ctx_wait_and_kill+0x132/0x1c0
      [  677.734704]  io_uring_release+0x20/0x30
      [  677.734706]  __fput+0xcd/0x230
      [  677.734707]  ____fput+0xe/0x10
      [  677.734709]  task_work_run+0x67/0xa0
      [  677.734710]  do_exit+0x35d/0xb70
      [  677.734712]  do_group_exit+0x43/0xa0
      [  677.734713]  get_signal+0x140/0x900
      [  677.734715]  do_signal+0x37/0x780
      [  677.734717]  ? enqueue_hrtimer+0x41/0xb0
      [  677.734718]  ? recalibrate_cpu_khz+0x10/0x10
      [  677.734720]  ? ktime_get+0x3e/0xa0
      [  677.734721]  ? lapic_next_deadline+0x26/0x30
      [  677.734723]  ? tick_program_event+0x4d/0x90
      [  677.734724]  ? __hrtimer_get_next_event+0x4d/0x80
      [  677.734726]  __prepare_exit_to_usermode+0x126/0x1c0
      [  677.734741]  prepare_exit_to_usermode+0x9/0x40
      [  677.734742]  idtentry_exit_cond_rcu+0x4c/0x60
      [  677.734743]  sysvec_reschedule_ipi+0x92/0x160
      [  677.734744]  ? asm_sysvec_reschedule_ipi+0xa/0x20
      [  677.734745]  asm_sysvec_reschedule_ipi+0x12/0x20
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • P
      io_uring: fix hanging iopoll in case of -EAGAIN · cd664b0e
      Pavel Begunkov committed
      io_do_iopoll() won't do anything with a request unless
      req->iopoll_completed is set. So io_complete_rw_iopoll() has to set
      it, otherwise io_do_iopoll() will poll a file again and again even
      though the request of interest was completed a long time ago.
      
      Also, remove -EAGAIN check from io_issue_sqe() as it races with
      the changed lines. The request will take the long way and be
      resubmitted from io_iopoll*().
      
      Fixes: bbde017a ("io_uring: add memory barrier to synchronize io_kiocb's result and iopoll_completed")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 24 Jun, 2020 1 commit
    • X
      io_uring: fix io_sq_thread no schedule when busy · b772f07a
      Xuan Zhuo committed
      When the user consumes and generates sqe at a fast rate,
      io_sqring_entries can always get sqe, and ret will not be equal to -EBUSY,
      so that io_sq_thread will never call cond_resched or schedule, and then
      we will get the following system error prompt:
      
      rcu: INFO: rcu_sched self-detected stall on CPU
      or
      watchdog: BUG: soft lockup - CPU#23 stuck for 112s! [io_uring-sq:1863]
      
      This patch checks need_resched() on every cycle to decide whether
      cond_resched() has to be called.
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
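The loop behaviour described in the commit above can be modelled with a toy scheduler. This is a user-space sketch under stated assumptions: `demo_sched`, `demo_need_resched()`, `demo_cond_resched()` and `demo_sq_loop()` are hypothetical, and the periodic "tick" stands in for whatever sets the reschedule flag in a real kernel.

```c
#include <assert.h>

/* Hypothetical scheduler state visible to the polling loop. */
struct demo_sched {
    int need_resched;   /* set when another task should get the CPU */
    int yields;         /* how often the loop gave up the CPU */
};

static int demo_need_resched(const struct demo_sched *s)
{
    return s->need_resched;
}

static void demo_cond_resched(struct demo_sched *s)
{
    s->need_resched = 0;
    s->yields++;
}

/*
 * Busy submission loop in which new entries are always available, so
 * the "nothing to do" path that used to reschedule is never reached.
 * Checking the flag on every cycle still lets the loop yield.
 */
int demo_sq_loop(int iterations, int tick_every)
{
    struct demo_sched s = { 0, 0 };

    assert(tick_every > 0);
    for (int i = 1; i <= iterations; i++) {
        if (i % tick_every == 0)
            s.need_resched = 1;         /* models a scheduler tick */
        if (demo_need_resched(&s))
            demo_cond_resched(&s);
        /* ... submit the available entries ... */
    }
    return s.yields;
}
```

With 1000 iterations and a tick every 100 iterations, the loop yields 10 times in this model; without the per-cycle check it would never yield while entries keep arriving.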
  5. 22 Jun, 2020 14 commits
  6. 18 Jun, 2020 3 commits
    • X
      io_uring: fix possible race condition against REQ_F_NEED_CLEANUP · 6f2cc166
      Xiaoguang Wang committed
      In io_read() or io_write(), when an io request is submitted successfully,
      it'll go through the below sequence:
      
          kfree(iovec);
          req->flags &= ~REQ_F_NEED_CLEANUP;
          return ret;
      
      But clearing REQ_F_NEED_CLEANUP might be unsafe. The io request may
      already have been completed, and then io_complete_rw_iopoll()
      and io_complete_rw() will be called, both of which will also modify
      req->flags if needed. This causes a race condition, with concurrent
      non-atomic modification of req->flags.
      
      To eliminate this race, in io_read() or io_write(), if the io request is
      submitted successfully, we don't remove the REQ_F_NEED_CLEANUP flag. If
      REQ_F_NEED_CLEANUP is set, we'll leave the iovec cleanup work to
      __io_req_aux_free() accordingly.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
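The lost update behind the race in the commit above can be replayed deterministically by writing out one unlucky interleaving by hand. The flag names and functions below are hypothetical; in particular `DEMO_F_COMPLETED` merely stands for whichever bit the completion side sets.

```c
#include <assert.h>

#define DEMO_F_NEED_CLEANUP (1u << 0)
#define DEMO_F_COMPLETED    (1u << 1)   /* hypothetical bit set on completion */

/*
 * Replays one bad interleaving of two non-atomic read-modify-write
 * sequences on the same flags word. Returns the final flags value.
 */
unsigned int demo_racy_flags(void)
{
    unsigned int flags = DEMO_F_NEED_CLEANUP;

    unsigned int submit_tmp = flags;        /* submitter loads flags */
    unsigned int complete_tmp = flags;      /* completion loads flags */

    complete_tmp |= DEMO_F_COMPLETED;
    flags = complete_tmp;                   /* completion stores first */

    submit_tmp &= ~DEMO_F_NEED_CLEANUP;
    flags = submit_tmp;                     /* submitter overwrites that store */

    return flags;
}

/* After the fix the submitter no longer writes flags on success. */
unsigned int demo_fixed_flags(void)
{
    unsigned int flags = DEMO_F_NEED_CLEANUP;

    flags |= DEMO_F_COMPLETED;              /* only the completion side writes */
    return flags;
}
```

In the racy replay the completion bit is lost and the result is 0, while the fixed variant keeps both bits set and leaves the cleanup to the later free path.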
    • J
      io_uring: reap poll completions while waiting for refs to drop on exit · 56952e91
      Jens Axboe committed
      If we're doing polled IO and end up having requests being submitted
      async, then completions can come in while we're waiting for refs to
      drop. We need to reap these manually, as nobody else will be looking
      for them.
      
      Break the wait into waits of 1/20th of a second, and check for done poll
      completions if we time out. Otherwise we can have done poll completions
      sitting in ctx->poll_list, which need us to reap them while we are just
      waiting for them.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • J
      io_uring: acquire 'mm' for task_work for SQPOLL · 9d8426a0
      Jens Axboe committed
      If we're unlucky with timing, we could be running task_work after
      having dropped the memory context in the sq thread. Since dropping
      the context requires a runnable task state, we cannot reliably drop
      it as part of our check-for-work loop in io_sq_thread(). Instead,
      abstract out the mm acquire for the sq thread into a helper, and call
      it from the async task work handler.
      
      Cc: stable@vger.kernel.org # v5.7
      Signed-off-by: Jens Axboe <axboe@kernel.dk>