1. 04 6月, 2020 28 次提交
  2. 28 5月, 2020 12 次提交
    • J
      io_uring: make sure accept honor rlimit nofile · 4520967e
      Jens Axboe 提交于
      to #26323588
      
      commit 09952e3e7826119ddd4357c453d54bcc7ef25156 upstream.
      
      Just like commit 4022e7af86be, this fixes the fact that
      IORING_OP_ACCEPT ends up using get_unused_fd_flags(), which checks
      current->signal->rlim[] for limits.
      
      Add an extra argument to __sys_accept4_file() that allows us to pass
      in the proper nofile limit, and grab it at request prep time.
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      4520967e
    • J
      io_uring: make sure openat/openat2 honor rlimit nofile · 4d28e850
      Jens Axboe 提交于
      to #26323588
      
      commit 4022e7af86be2dd62975dedb6b7ea551d108695e upstream.
      
      Dmitry reports that a test case shows that io_uring isn't honoring a
      modified rlimit nofile setting. get_unused_fd_flags() checks the task
      signal->rlimi[] for the limits. As this isn't easily inheritable,
      provide a __get_unused_fd_flags() that takes the value instead. Then we
      can grab it when the request is prepared (from the original task), and
      pass that in when we do the async part part of the open.
      Reported-by: NDmitry Kadashev <dkadashev@gmail.com>
      Tested-by: NDmitry Kadashev <dkadashev@gmail.com>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      4d28e850
    • P
      io_uring: NULL-deref for IOSQE_{ASYNC,DRAIN} · 86f0692a
      Pavel Begunkov 提交于
      to #26323588
      
      commit f1d96a8fcbbbb22d4fbc1d69eaaa678bbb0ff6e2 upstream.
      
      Processing links, io_submit_sqe() prepares requests, drops sqes, and
      passes them with sqe=NULL to io_queue_sqe(). There IOSQE_DRAIN and/or
      IOSQE_ASYNC requests will go through the same prep, which doesn't expect
      sqe=NULL and fail with NULL pointer deference.
      
      Always do full prepare including io_alloc_async_ctx() for linked
      requests, and then it can skip the second preparation.
      
      Cc: stable@vger.kernel.org # 5.5
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      86f0692a
    • J
      io_uring: ensure RCU callback ordering with rcu_barrier() · 7b8de11e
      Jens Axboe 提交于
      to #26323588
      
      commit 805b13adde3964c78cba125a15527e88c19f87b3 upstream.
      
      After more careful studying, Paul informs me that we cannot rely on
      ordering of RCU callbacks in the way that the the tagged commit did.
      The current construct looks like this:
      
      	void C(struct rcu_head *rhp)
      	{
      		do_something(rhp);
      		call_rcu(&p->rh, B);
      	}
      
      	call_rcu(&p->rh, A);
      	call_rcu(&p->rh, C);
      
      and we're relying on ordering between A and B, which isn't guaranteed.
      Make this explicit instead, and have a work item issue the rcu_barrier()
      to ensure that A has run before we manually execute B.
      
      While thorough testing never showed this issue, it's dependent on the
      per-cpu load in terms of RCU callbacks. The updated method simplifies
      the code as well, and eliminates the need to maintain an rcu_head in
      the fileset data.
      
      Fixes: c1e2148f8ecb ("io_uring: free fixed_file_data after RCU grace period")
      Reported-by: NPaul E. McKenney <paulmck@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      7b8de11e
    • P
      io_uring: fix lockup with timeouts · 12e8d647
      Pavel Begunkov 提交于
      to #26323588
      
      commit f0e20b8943509d81200cef5e30af2adfddba0f5c upstream.
      
      There is a recipe to deadlock the kernel: submit a timeout sqe with a
      linked_timeout (e.g.  test_single_link_timeout_ception() from liburing),
      and SIGKILL the process.
      
      Then, io_kill_timeouts() takes @ctx->completion_lock, but the timeout
      isn't flagged with REQ_F_COMP_LOCKED, and will try to double grab it
      during io_put_free() to cancel the linked timeout. Probably, the same
      can happen with another io_kill_timeout() call site, that is
      io_commit_cqring().
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      12e8d647
    • J
      io_uring: free fixed_file_data after RCU grace period · 33d7aab5
      Jens Axboe 提交于
      to #26323588
      
      commit c1e2148f8ecb26863b899d402a823dab8e26efd1 upstream.
      
      The percpu refcount protects this structure, and we can have an atomic
      switch in progress when exiting. This makes it unsafe to just free the
      struct normally, and can trigger the following KASAN warning:
      
      BUG: KASAN: use-after-free in percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
      Read of size 1 at addr ffff888181a19a30 by task swapper/0/0
      
      CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.6.0-rc4+ #5747
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      Call Trace:
       <IRQ>
       dump_stack+0x76/0xa0
       print_address_description.constprop.0+0x3b/0x60
       ? percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
       ? percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
       __kasan_report.cold+0x1a/0x3d
       ? percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
       percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
       rcu_core+0x370/0x830
       ? percpu_ref_exit+0x50/0x50
       ? rcu_note_context_switch+0x7b0/0x7b0
       ? run_rebalance_domains+0x11d/0x140
       __do_softirq+0x10a/0x3e9
       irq_exit+0xd5/0xe0
       smp_apic_timer_interrupt+0x86/0x200
       apic_timer_interrupt+0xf/0x20
       </IRQ>
      RIP: 0010:default_idle+0x26/0x1f0
      
      Fix this by punting the final exit and free of the struct to RCU, then
      we know that it's safe to do so. Jann suggested the approach of using a
      double rcu callback to achieve this. It's important that we do a nested
      call_rcu() callback, as otherwise the free could be ordered before the
      atomic switch, even if the latter was already queued.
      
      Reported-by: syzbot+e017e49c39ab484ac87a@syzkaller.appspotmail.com
      Suggested-by: NJann Horn <jannh@google.com>
      Reviewed-by: NPaul E. McKenney <paulmck@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      33d7aab5
    • J
      io_uring: fix 32-bit compatability with sendmsg/recvmsg · 4d6d3604
      Jens Axboe 提交于
      to #26323588
      
      commit d876836204897b6d7d911f942084f69a1e9d5c4d upstream.
      
      We must set MSG_CMSG_COMPAT if we're in compatability mode, otherwise
      the iovec import for these commands will not do the right thing and fail
      the command with -EINVAL.
      
      Found by running the test suite compiled as 32-bit.
      
      Cc: stable@vger.kernel.org
      Fixes: aa1fa28fc73e ("io_uring: add support for recvmsg()")
      Fixes: 0fa03c624d8f ("io_uring: add support for sendmsg()")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      4d6d3604
    • T
      io_uring: define and set show_fdinfo only if procfs is enabled · 0cc82904
      Tobias Klauser 提交于
      to #26323588
      
      commit bebdb65e077267957f48e43d205d4a16cc7b8161 upstream.
      
      Follow the pattern used with other *_show_fdinfo functions and only
      define and use io_uring_show_fdinfo and its helper functions if
      CONFIG_PROC_FS is set.
      
      Fixes: 87ce955b24c9 ("io_uring: add ->show_fdinfo() for the io_uring file descriptor")
      Signed-off-by: NTobias Klauser <tklauser@distanz.ch>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      0cc82904
    • J
      io_uring: drop file set ref put/get on switch · 2d56c6fc
      Jens Axboe 提交于
      to #26323588
      
      commit dd3db2a34cff14e152da7c8e320297719a35abf9 upstream.
      
      Dan reports that he triggered a warning on ring exit doing some testing:
      
      percpu ref (io_file_data_ref_zero) <= 0 (0) after switching to atomic
      WARNING: CPU: 3 PID: 0 at lib/percpu-refcount.c:160 percpu_ref_switch_to_atomic_rcu+0xe8/0xf0
      Modules linked in:
      CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.6.0-rc3+ #5648
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      RIP: 0010:percpu_ref_switch_to_atomic_rcu+0xe8/0xf0
      Code: e7 ff 55 e8 eb d2 80 3d bd 02 d2 00 00 75 8b 48 8b 55 d8 48 c7 c7 e8 70 e6 81 c6 05 a9 02 d2 00 01 48 8b 75 e8 e8 3a d0 c5 ff <0f> 0b e9 69 ff ff ff 90 55 48 89 fd 53 48 89 f3 48 83 ec 28 48 83
      RSP: 0018:ffffc90000110ef8 EFLAGS: 00010292
      RAX: 0000000000000045 RBX: 7fffffffffffffff RCX: 0000000000000000
      RDX: 0000000000000045 RSI: ffffffff825be7a5 RDI: ffffffff825bc32c
      RBP: ffff8881b75eac38 R08: 000000042364b941 R09: 0000000000000045
      R10: ffffffff825beb40 R11: ffffffff825be78a R12: 0000607e46005aa0
      R13: ffff888107dcdd00 R14: 0000000000000000 R15: 0000000000000009
      FS:  0000000000000000(0000) GS:ffff8881b9d80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f49e6a5ea20 CR3: 00000001b747c004 CR4: 00000000001606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <IRQ>
       rcu_core+0x1e4/0x4d0
       __do_softirq+0xdb/0x2f1
       irq_exit+0xa0/0xb0
       smp_apic_timer_interrupt+0x60/0x140
       apic_timer_interrupt+0xf/0x20
       </IRQ>
      RIP: 0010:default_idle+0x23/0x170
      Code: ff eb ab cc cc cc cc 0f 1f 44 00 00 41 54 55 53 65 8b 2d 10 96 92 7e 0f 1f 44 00 00 e9 07 00 00 00 0f 00 2d 21 d0 51 00 fb f4 <65> 8b 2d f6 95 92 7e 0f 1f 44 00 00 5b 5d 41 5c c3 65 8b 05 e5 95
      
      Turns out that this is due to percpu_ref_switch_to_atomic() only
      grabbing a reference to the percpu refcount if it's not already in
      atomic mode. io_uring drops a ref and re-gets it when switching back to
      percpu mode. We attempt to protect against this with the FFD_F_ATOMIC
      bit, but that isn't reliable.
      
      We don't actually need to juggle these refcounts between atomic and
      percpu switch, we can just do them when we've switched to atomic mode.
      This removes the need for FFD_F_ATOMIC, which wasn't reliable.
      
      Fixes: 05f3fb3c5397 ("io_uring: avoid ring quiesce for fixed file set unregister and update")
      Reported-by: NDan Melnic <dmm@fb.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      2d56c6fc
    • J
      io_uring: import_single_range() returns 0/-ERROR · f7c811d5
      Jens Axboe 提交于
      to #26323588
      
      commit 3a9015988b3d41027cda61f4fdeaaeee73be8b24 upstream.
      
      Unlike the other core import helpers, import_single_range() returns 0 on
      success, not the length imported. This means that links that depend on
      the result of non-vec based IORING_OP_{READ,WRITE} that were added for
      5.5 get errored when they should not be.
      
      Fixes: 3a6820f2bb8a ("io_uring: add non-vectored read/write commands")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      f7c811d5
    • J
      io_uring: pick up link work on submit reference drop · 14bacbdd
      Jens Axboe 提交于
      to #26323588
      
      commit 2a44f46781617c5040372b59da33553a02b1f46d upstream.
      
      If work completes inline, then we should pick up a dependent link item
      in __io_queue_sqe() as well. If we don't do so, we're forced to go async
      with that item, which is suboptimal.
      
      This also fixes an issue with io_put_req_find_next(), which always looks
      up the next work item. That should only be done if we're dropping the
      last reference to the request, to prevent multiple lookups of the same
      work item.
      
      Outside of being a fix, this also enables a good cleanup series for 5.7,
      where we never have to pass 'nxt' around or into the work handlers.
      Reviewed-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      14bacbdd
    • J
      io_uring: fix personality idr leak · 6767af13
      Jens Axboe 提交于
      to #26323588
      
      commit 41726c9a50e7464beca7112d0aebf3a0090c62d2 upstream.
      
      We somehow never free the idr, even though we init it for every ctx.
      Free it when the rest of the ring data is freed.
      
      Fixes: 071698e13ac6 ("io_uring: allow registering credentials")
      Reviewed-by: NStefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      6767af13