1. 20 Mar 2020, 2 commits
  2. 15 Mar 2020, 1 commit
    • io_uring: NULL-deref for IOSQE_{ASYNC,DRAIN} · f1d96a8f
      Committed by Pavel Begunkov
      Processing links, io_submit_sqe() prepares requests, drops sqes, and
      passes them with sqe=NULL to io_queue_sqe(). There, IOSQE_DRAIN and/or
      IOSQE_ASYNC requests will go through the same prep, which doesn't expect
      sqe=NULL and fails with a NULL pointer dereference.
      
      Always do the full prepare, including io_alloc_async_ctx(), for linked
      requests; the second preparation can then be skipped.
      
      Cc: stable@vger.kernel.org # 5.5
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 09 Mar 2020, 1 commit
    • io_uring: ensure RCU callback ordering with rcu_barrier() · 805b13ad
      Committed by Jens Axboe
      After more careful studying, Paul informs me that we cannot rely on
      ordering of RCU callbacks in the way that the tagged commit did.
      The current construct looks like this:
      
      	void C(struct rcu_head *rhp)
      	{
      		do_something(rhp);
      		call_rcu(&p->rh, B);
      	}
      
      	call_rcu(&p->rh, A);
      	call_rcu(&p->rh, C);
      
      and we're relying on ordering between A and B, which isn't guaranteed.
      Make this explicit instead, and have a work item issue the rcu_barrier()
      to ensure that A has run before we manually execute B.
      
      While thorough testing never showed this issue, whether it triggers
      depends on the per-cpu RCU callback load. The updated method simplifies
      the code as well, and eliminates the need to maintain an rcu_head in
      the fileset data.
      
      Fixes: c1e2148f ("io_uring: free fixed_file_data after RCU grace period")
      Reported-by: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
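      A minimal sketch of the construct described above (illustrative only, not
      the io_uring code itself; names such as two_stage and stage_a are
      invented): a work item uses rcu_barrier() to guarantee that a previously
      queued call_rcu() callback has finished before the second stage runs.

      	struct two_stage {
      		struct rcu_head rh;
      		struct work_struct work;
      	};

      	static void stage_a(struct rcu_head *rhp)
      	{
      		/* runs after a grace period; must complete before stage B */
      	}

      	static void stage_b_work(struct work_struct *work)
      	{
      		struct two_stage *p = container_of(work, struct two_stage, work);

      		/* rcu_barrier() waits for all previously queued call_rcu()
      		 * callbacks, so stage_a() has definitely run by now */
      		rcu_barrier();
      		kfree(p);	/* stage B: now safe to free */
      	}

      	static void teardown(struct two_stage *p)
      	{
      		call_rcu(&p->rh, stage_a);
      		/* rcu_barrier() may sleep, so run stage B from a work item
      		 * rather than from another RCU callback */
      		INIT_WORK(&p->work, stage_b_work);
      		schedule_work(&p->work);
      	}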
  4. 07 Mar 2020, 2 commits
    • io_uring: fix lockup with timeouts · f0e20b89
      Committed by Pavel Begunkov
      There is a recipe to deadlock the kernel: submit a timeout sqe with a
      linked_timeout (e.g.  test_single_link_timeout_ception() from liburing),
      and SIGKILL the process.
      
      Then io_kill_timeouts() takes @ctx->completion_lock, but the timeout
      isn't flagged with REQ_F_COMP_LOCKED, so it will try to grab the lock a
      second time during io_put_free() when cancelling the linked timeout.
      The same can probably happen at the other io_kill_timeout() call site,
      io_commit_cqring().
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: free fixed_file_data after RCU grace period · c1e2148f
      Committed by Jens Axboe
      The percpu refcount protects this structure, and we can have an atomic
      switch in progress when exiting. This makes it unsafe to just free the
      struct normally, and can trigger the following KASAN warning:
      
      BUG: KASAN: use-after-free in percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
      Read of size 1 at addr ffff888181a19a30 by task swapper/0/0
      
      CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.6.0-rc4+ #5747
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      Call Trace:
       <IRQ>
       dump_stack+0x76/0xa0
       print_address_description.constprop.0+0x3b/0x60
       ? percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
       ? percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
       __kasan_report.cold+0x1a/0x3d
       ? percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
       percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
       rcu_core+0x370/0x830
       ? percpu_ref_exit+0x50/0x50
       ? rcu_note_context_switch+0x7b0/0x7b0
       ? run_rebalance_domains+0x11d/0x140
       __do_softirq+0x10a/0x3e9
       irq_exit+0xd5/0xe0
       smp_apic_timer_interrupt+0x86/0x200
       apic_timer_interrupt+0xf/0x20
       </IRQ>
      RIP: 0010:default_idle+0x26/0x1f0
      
      Fix this by punting the final exit and free of the struct to RCU; then
      we know that it's safe to do so. Jann suggested the approach of using a
      double rcu callback to achieve this. It's important that we do a nested
      call_rcu() callback, as otherwise the free could be ordered before the
      atomic switch, even if the latter was already queued.
      
      Reported-by: syzbot+e017e49c39ab484ac87a@syzkaller.appspotmail.com
      Suggested-by: Jann Horn <jannh@google.com>
      Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 28 Feb 2020, 1 commit
  6. 27 Feb 2020, 2 commits
    • io_uring: define and set show_fdinfo only if procfs is enabled · bebdb65e
      Committed by Tobias Klauser
      Follow the pattern used with other *_show_fdinfo functions and only
      define and use io_uring_show_fdinfo and its helper functions if
      CONFIG_PROC_FS is set.
      
      Fixes: 87ce955b ("io_uring: add ->show_fdinfo() for the io_uring file descriptor")
      Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
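      The pattern referred to above, in a rough sketch (not the exact hunk from
      the commit): the fdinfo hook is only compiled, and only wired into the
      file_operations, when CONFIG_PROC_FS is set.

      	#ifdef CONFIG_PROC_FS
      	static void io_uring_show_fdinfo(struct seq_file *m, struct file *f)
      	{
      		/* dump SQ/CQ and fixed-file state for /proc/<pid>/fdinfo/<fd> */
      	}
      	#endif

      	static const struct file_operations io_uring_fops = {
      		/* ...other methods... */
      	#ifdef CONFIG_PROC_FS
      		.show_fdinfo	= io_uring_show_fdinfo,
      	#endif
      	};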
    • io_uring: drop file set ref put/get on switch · dd3db2a3
      Committed by Jens Axboe
      Dan reports that he triggered a warning on ring exit doing some testing:
      
      percpu ref (io_file_data_ref_zero) <= 0 (0) after switching to atomic
      WARNING: CPU: 3 PID: 0 at lib/percpu-refcount.c:160 percpu_ref_switch_to_atomic_rcu+0xe8/0xf0
      Modules linked in:
      CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.6.0-rc3+ #5648
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      RIP: 0010:percpu_ref_switch_to_atomic_rcu+0xe8/0xf0
      Code: e7 ff 55 e8 eb d2 80 3d bd 02 d2 00 00 75 8b 48 8b 55 d8 48 c7 c7 e8 70 e6 81 c6 05 a9 02 d2 00 01 48 8b 75 e8 e8 3a d0 c5 ff <0f> 0b e9 69 ff ff ff 90 55 48 89 fd 53 48 89 f3 48 83 ec 28 48 83
      RSP: 0018:ffffc90000110ef8 EFLAGS: 00010292
      RAX: 0000000000000045 RBX: 7fffffffffffffff RCX: 0000000000000000
      RDX: 0000000000000045 RSI: ffffffff825be7a5 RDI: ffffffff825bc32c
      RBP: ffff8881b75eac38 R08: 000000042364b941 R09: 0000000000000045
      R10: ffffffff825beb40 R11: ffffffff825be78a R12: 0000607e46005aa0
      R13: ffff888107dcdd00 R14: 0000000000000000 R15: 0000000000000009
      FS:  0000000000000000(0000) GS:ffff8881b9d80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f49e6a5ea20 CR3: 00000001b747c004 CR4: 00000000001606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <IRQ>
       rcu_core+0x1e4/0x4d0
       __do_softirq+0xdb/0x2f1
       irq_exit+0xa0/0xb0
       smp_apic_timer_interrupt+0x60/0x140
       apic_timer_interrupt+0xf/0x20
       </IRQ>
      RIP: 0010:default_idle+0x23/0x170
      Code: ff eb ab cc cc cc cc 0f 1f 44 00 00 41 54 55 53 65 8b 2d 10 96 92 7e 0f 1f 44 00 00 e9 07 00 00 00 0f 00 2d 21 d0 51 00 fb f4 <65> 8b 2d f6 95 92 7e 0f 1f 44 00 00 5b 5d 41 5c c3 65 8b 05 e5 95
      
      Turns out that this is due to percpu_ref_switch_to_atomic() only
      grabbing a reference to the percpu refcount if it's not already in
      atomic mode. io_uring drops a ref and re-gets it when switching back to
      percpu mode. We attempt to protect against this with the FFD_F_ATOMIC
      bit, but that isn't reliable.
      
      We don't actually need to juggle these refcounts between the atomic and
      percpu switches; we can just do them once we've switched to atomic mode.
      This removes the need for FFD_F_ATOMIC.
      
      Fixes: 05f3fb3c ("io_uring: avoid ring quiesce for fixed file set unregister and update")
      Reported-by: Dan Melnic <dmm@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  7. 26 Feb 2020, 2 commits
    • io_uring: import_single_range() returns 0/-ERROR · 3a901598
      Committed by Jens Axboe
      Unlike the other core import helpers, import_single_range() returns 0 on
      success, not the length imported. This means that links depending on the
      result of the non-vectored IORING_OP_{READ,WRITE}, which were added for
      5.5, get errored when they should not be.
      
      Fixes: 3a6820f2 ("io_uring: add non-vectored read/write commands")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
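      A hedged caller sketch of the convention described above (illustrative,
      not the actual io_uring hunk; 'buf' and 'len' are assumed to come from
      the caller): on success the byte count must be taken from the iov_iter,
      not from the return value.

      	struct iovec iov;
      	struct iov_iter iter;
      	ssize_t ret;

      	ret = import_single_range(READ, buf, len, &iov, &iter);
      	if (ret < 0)
      		return ret;	/* -EFAULT etc. */

      	/* ret is 0 here; the imported length lives in the iterator */
      	return iov_iter_count(&iter);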
    • io_uring: pick up link work on submit reference drop · 2a44f467
      Committed by Jens Axboe
      If work completes inline, then we should pick up a dependent link item
      in __io_queue_sqe() as well. If we don't do so, we're forced to go async
      with that item, which is suboptimal.
      
      This also fixes an issue with io_put_req_find_next(), which always looks
      up the next work item. That should only be done if we're dropping the
      last reference to the request, to prevent multiple lookups of the same
      work item.
      
      Outside of being a fix, this also enables a good cleanup series for 5.7,
      where we never have to pass 'nxt' around or into the work handlers.
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. 25 Feb 2020, 1 commit
    • io_uring: fix poll_list race for SETUP_IOPOLL|SETUP_SQPOLL · bdcd3eab
      Committed by Xiaoguang Wang
      After making ext4 support the iopoll method (letting ext4_file_operations'
      iopoll method be iomap_dio_iopoll()), we found that fio can easily hang in
      fio_ioring_getevents() with the fio job below:
          rm -f testfile; sync;
          sudo fio -name=fiotest -filename=testfile -iodepth=128 -thread
      -rw=write -ioengine=io_uring  -hipri=1 -sqthread_poll=1 -direct=1
      -bs=4k -size=10G -numjobs=8 -runtime=2000 -group_reporting
      i.e. with IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL enabled.
      
      There are two issues that result in this hang. One is that when
      IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL are enabled, fio does not
      use io_uring_enter to get completed events; it relies on the kernel
      io_sq_thread to poll for completed events.

      The other is a race: when io_submit_sqes() in io_sq_thread() submits a
      batch of sqes, the variable 'inflight' records the number of submitted
      reqs, and io_sq_thread then polls for the reqs that have been added to
      poll_list. But note, if some of the previous reqs were punted to an io
      worker, those reqs won't show up in poll_list in time. io_sq_thread()
      will therefore poll for only part of the previously submitted reqs, then
      find poll_list empty and reset 'inflight' to zero. If the app just waits
      for these deferred reqs and does not wake up io_sq_thread again, a hang
      happens.

      For an app that entirely relies on io_sq_thread to poll for completed
      requests, let io_iopoll_req_issued() wake up io_sq_thread properly when
      adding a new element to poll_list, and, when io_sq_thread prepares to
      sleep, check whether poll_list is empty again; if it is not empty,
      continue polling.
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
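      A rough sketch of the wakeup half of that fix (illustrative; the
      ctx->sqo_wait wait-queue name is assumed from the io_uring context of
      that era rather than quoted from the patch): after a hipri request is
      added to poll_list under SQPOLL, nudge a sleeping io_sq_thread.

      	/* after list_add_tail(&req->list, &ctx->poll_list); */
      	if ((ctx->flags & IORING_SETUP_SQPOLL) &&
      	    wq_has_sleeper(&ctx->sqo_wait))
      		wake_up(&ctx->sqo_wait);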
  9. 24 Feb 2020, 2 commits
  10. 22 Feb 2020, 2 commits
    • io_uring: fix __io_iopoll_check deadlock in io_sq_thread · c7849be9
      Committed by Xiaoguang Wang
      Since commit a3a0e43f ("io_uring: don't enter poll loop if we have
      CQEs pending"), if we already have events pending, we won't enter the
      poll loop. When SETUP_IOPOLL and SETUP_SQPOLL are both enabled, if the
      app has been terminated without reaping the pending events already in
      the cq ring, and there are some reqs in poll_list, io_sq_thread will
      enter __io_iopoll_check(), find pending events, and return; this loop
      never gets a chance to exit.

      I have seen this issue in fio stress tests. To fix it, let io_sq_thread
      call io_iopoll_getevents() with the 'min' argument set to zero, and
      remove __io_iopoll_check().
      
      Fixes: a3a0e43f ("io_uring: don't enter poll loop if we have CQEs pending")
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: prevent sq_thread from spinning when it should stop · 7143b5ac
      Committed by Stefano Garzarella
      This patch drops 'cur_mm' before calling cond_resched(), to prevent
      the sq_thread from spinning even when the user process is finished.
      
      Before this patch, if the user process ended without closing the
      io_uring fd, the sq_thread continues to spin until the
      'sq_thread_idle' timeout ends.
      
      In the worst case where the 'sq_thread_idle' parameter is bigger than
      INT_MAX, the sq_thread will spin forever.
      
      Fixes: 6c271ce2 ("io_uring: add submission polling")
      Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
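      A minimal sketch of the idle-path ordering described above, using the mm
      helpers of that kernel generation (not a verbatim quote of the patch):
      release the borrowed mm before yielding, so a finished user process no
      longer keeps the sq_thread attached to it.

      	if (cur_mm) {
      		unuse_mm(cur_mm);
      		mmput(cur_mm);
      		cur_mm = NULL;
      	}
      	cond_resched();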
  11. 19 Feb 2020, 2 commits
  12. 17 Feb 2020, 1 commit
  13. 14 Feb 2020, 1 commit
    • io_uring: prune request from overflow list on flush · 2ca10259
      Committed by Jens Axboe
      Carter reported an issue where he could produce a stall on ring exit,
      when we're cleaning up requests that match the given file table. For
      this particular test case, a combination of a few things caused the
      issue:
      
      - The cq ring was overflown
      - The request being canceled was in the overflow list
      
      The combination of the above means that the cq overflow list holds a
      reference to the request. The request is canceled correctly, but since
      the overflow list holds a reference to it, the final put won't happen.
      Since the final put doesn't happen, the request remains on the inflight
      list, and hence we never finish the cancelation flush.
      
      Fix this by removing requests from the overflow list if we're canceling
      them.
      
      Cc: stable@vger.kernel.org # 5.5
      Reported-by: Carter Li 李通洲 <carter.li@eoitek.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  14. 10 Feb 2020, 2 commits
  15. 09 Feb 2020, 11 commits
  16. 07 Feb 2020, 3 commits
    • io_uring: fix deferred req iovec leak · 1e95081c
      Committed by Pavel Begunkov
      After a defer, a request will be prepared (which includes allocating an
      iovec if needed) and then submitted through io_wq_submit_work(), not
      through a custom handler (e.g. io_rw_async()/io_sendrecv_async()).
      However, this leaks the iovec, as the request is in io-wq and the code
      goes as follows:
      
      io_read() {
      	if (!io_wq_current_is_worker())
      		kfree(iovec);
      }
      
      Put all deallocation logic in io_{read,write,send,recv}(), which will
      leave the memory in place if going async with -EAGAIN.
      
      It also fixes a leak after failed io_alloc_async_ctx() in
      io_{recv,send}_msg().
      
      Cc: stable@vger.kernel.org # 5.5
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
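      A hypothetical sketch of the ownership rule this establishes (all
      example_* names are invented, not from the commit): the opcode handler
      frees its own iovec on every path except when it returns -EAGAIN to go
      async, in which case the async context takes over the iovec for the
      retry.

      	static int example_read(struct example_req *req, bool force_nonblock)
      	{
      		struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
      		int ret;

      		ret = example_import_iovec(req, &iovec);
      		if (ret < 0)
      			return ret;

      		ret = example_do_read(req, force_nonblock);
      		if (ret == -EAGAIN && force_nonblock) {
      			/* going async: hand the iovec to the async context,
      			 * do NOT free it here */
      			example_setup_async_ctx(req, iovec, inline_vecs);
      			return -EAGAIN;
      		}

      		if (iovec != inline_vecs)
      			kfree(iovec);	/* every other path frees it here */
      		return ret;
      	}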
    • io_uring: fix 1-bit bitfields to be unsigned · e1d85334
      Committed by Randy Dunlap
      Make bitfields of size 1 bit be unsigned (since there is no room
      for the sign bit).
      This clears up the sparse warnings:
      
        CHECK   ../fs/io_uring.c
      ../fs/io_uring.c:207:50: error: dubious one-bit signed bitfield
      ../fs/io_uring.c:208:55: error: dubious one-bit signed bitfield
      ../fs/io_uring.c:209:63: error: dubious one-bit signed bitfield
      ../fs/io_uring.c:210:54: error: dubious one-bit signed bitfield
      ../fs/io_uring.c:211:57: error: dubious one-bit signed bitfield
      
      Found by sight and then verified with sparse.
      
      Fixes: 69b3e546 ("io_uring: change io_ring_ctx bool fields into bit fields")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: io-uring@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
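      A small user-space illustration of why sparse complains (not kernel
      code): with a plain 'int' bitfield of width 1, the single bit is the
      sign bit, so on compilers that treat it as signed the field can only
      hold 0 and -1.

      	#include <stdio.h>

      	struct bad  { int      ready : 1; };	/* dubious one-bit signed bitfield */
      	struct good { unsigned ready : 1; };	/* holds 0 or 1 as intended */

      	int main(void)
      	{
      		struct bad  b = { .ready = 1 };	/* gcc may warn here, which is the point */
      		struct good g = { .ready = 1 };

      		/* with gcc, b.ready reads back as -1, so (b.ready == 1) is false */
      		printf("bad: %d  good: %d\n", b.ready, g.ready);
      		return 0;
      	}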
    • io_uring: get rid of delayed mm check · 1cb1edb2
      Committed by Pavel Begunkov
      Fail fast if we can't grab the mm, so that past that point requests
      always have an mm when required. This allows us to remove req->user
      altogether.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  17. 05 Feb 2020, 2 commits
    • io_uring: cleanup fixed file data table references · 2faf852d
      Committed by Jens Axboe
      syzbot reports a use-after-free in io_ring_file_ref_switch() when it
      tries to switch back to percpu mode. When we put the final reference to
      the table by calling percpu_ref_kill_and_confirm(), we don't want the
      zero reference to queue async work for flushing the potentially queued
      up items. We currently do a few flush_work() calls, but they merely
      paper over the issue, since the work item may not have been queued yet,
      depending on when the percpu-ref callback gets run.
      
      Coming into the file unregister, we know we have the ring quiesced.
      io_ring_file_ref_switch() can check whether the ref is dying and, if so,
      not queue anything async at that point. Once the ref has been confirmed
      killed, flush any potential items manually.
      
      Reported-by: syzbot+7caeaea49c2c8a591e3d@syzkaller.appspotmail.com
      Fixes: 05f3fb3c ("io_uring: avoid ring quiesce for fixed file set unregister and update")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
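      A hedged fragment of the check described above (illustrative; the struct
      and field names are assumed, not quoted from the patch): if the ref has
      already been killed by the unregister path, the switch helper bails out
      instead of queueing async work.

      	static void example_file_ref_switch(struct fixed_file_data *data)
      	{
      		if (percpu_ref_is_dying(&data->refs))
      			return;	/* unregister path flushes remaining work itself */

      		percpu_ref_switch_to_percpu(&data->refs);
      	}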
    • io_uring: spin for sq thread to idle on shutdown · df069d80
      Committed by Jens Axboe
      As part of io_uring shutdown, we cancel work that is pending and won't
      necessarily complete on its own. That includes requests like poll
      commands and timeouts.
      
      If we're using SQPOLL for kernel-side submission and we shut down the
      ring immediately after queueing such work, we can race with the sqthread
      doing the submission. This means we may miss cancelling some work, which
      results in the io_uring shutdown hanging forever.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  18. 04 Feb 2020, 2 commits