1. 01 Oct, 2020 40 commits
    • J
      io_uring: kill callback_head argument for io_req_task_work_add() · 87c4311f
      Jens Axboe committed
      We always use &req->task_work anyway, no point in passing it in.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      87c4311f
    • P
      io_uring: move req preps out of io_issue_sqe() · c1379e24
      Pavel Begunkov committed
      All request preparations are done only during submission, reflect it in
      the code by moving io_req_prep() much earlier into io_queue_sqe().

      That's much cleaner, because it doesn't expose bits to async code which
      it won't ever use. Also it makes the interface harder to misuse, and
      there are potential places for bugs.

      For instance, __io_queue() doesn't clear @sqe before proceeding to a
      next linked request, that could have been disastrous, but fortunately
      there are linked requests IFF sqe==NULL, so not actually a bug.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c1379e24
    • P
      io_uring: decouple issuing and req preparation · bfe76559
      Pavel Begunkov committed
      io_issue_sqe() does two things at once: preparing requests and
      issuing them. Split it in two and deduplicate with io_defer_prep().
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      bfe76559
    • P
      io_uring: remove nonblock arg from io_{rw}_prep() · 73debe68
      Pavel Begunkov committed
      All io_*_prep() functions including io_{read,write}_prep() are called
      only during submission where @force_nonblock is always true. Don't keep
      propagating it and instead remove the @force_nonblock argument
      from prep() altogether.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      73debe68
    • P
      io_uring: set/clear IOCB_NOWAIT into io_read/write · a88fc400
      Pavel Begunkov committed
      Move setting IOCB_NOWAIT from io_prep_rw() into io_read()/io_write(), so
      it's set/cleared in a single place. Also remove @force_nonblock
      parameter from io_prep_rw().
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a88fc400
    • P
      io_uring: remove F_NEED_CLEANUP check in *prep() · 2d199895
      Pavel Begunkov committed
      REQ_F_NEED_CLEANUP is set only by io_*_prep() and they're guaranteed to
      be called only once, so there is no one who may have set the flag
      before. Kill REQ_F_NEED_CLEANUP check in these *prep() handlers.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2d199895
    • P
      io_uring: io_kiocb_ppos() style change · 5b09e37e
      Pavel Begunkov committed
      Put brackets around bitwise ops in a complex expression.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5b09e37e
    • P
      io_uring: simplify io_alloc_req() · 291b2821
      Pavel Begunkov committed
      Extract common code from if/else branches. That is cleaner and optimised
      even better.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      291b2821
    • J
      io-wq: kill unused IO_WORKER_F_EXITING · 145cc8c6
      Jens Axboe committed
      This flag is no longer used, remove it.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      145cc8c6
    • H
      io-wq: fix use-after-free in io_wq_worker_running · c4068bf8
      Hillf Danton committed
      The smart syzbot has found a reproducer for the following issue:
      
       ==================================================================
       BUG: KASAN: use-after-free in instrument_atomic_write include/linux/instrumented.h:71 [inline]
       BUG: KASAN: use-after-free in atomic_inc include/asm-generic/atomic-instrumented.h:240 [inline]
       BUG: KASAN: use-after-free in io_wqe_inc_running fs/io-wq.c:301 [inline]
       BUG: KASAN: use-after-free in io_wq_worker_running+0xde/0x110 fs/io-wq.c:613
       Write of size 4 at addr ffff8882183db08c by task io_wqe_worker-0/7771
      
       CPU: 0 PID: 7771 Comm: io_wqe_worker-0 Not tainted 5.9.0-rc4-syzkaller #0
       Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
       Call Trace:
        __dump_stack lib/dump_stack.c:77 [inline]
        dump_stack+0x198/0x1fd lib/dump_stack.c:118
        print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383
        __kasan_report mm/kasan/report.c:513 [inline]
        kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
        check_memory_region_inline mm/kasan/generic.c:186 [inline]
        check_memory_region+0x13d/0x180 mm/kasan/generic.c:192
        instrument_atomic_write include/linux/instrumented.h:71 [inline]
        atomic_inc include/asm-generic/atomic-instrumented.h:240 [inline]
        io_wqe_inc_running fs/io-wq.c:301 [inline]
        io_wq_worker_running+0xde/0x110 fs/io-wq.c:613
        schedule_timeout+0x148/0x250 kernel/time/timer.c:1879
        io_wqe_worker+0x517/0x10e0 fs/io-wq.c:580
        kthread+0x3b5/0x4a0 kernel/kthread.c:292
        ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
      
       Allocated by task 7768:
        kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
        kasan_set_track mm/kasan/common.c:56 [inline]
        __kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:461
        kmem_cache_alloc_node_trace+0x17b/0x3f0 mm/slab.c:3594
        kmalloc_node include/linux/slab.h:572 [inline]
        kzalloc_node include/linux/slab.h:677 [inline]
        io_wq_create+0x57b/0xa10 fs/io-wq.c:1064
        io_init_wq_offload fs/io_uring.c:7432 [inline]
        io_sq_offload_start fs/io_uring.c:7504 [inline]
        io_uring_create fs/io_uring.c:8625 [inline]
        io_uring_setup+0x1836/0x28e0 fs/io_uring.c:8694
        do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       Freed by task 21:
        kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
        kasan_set_track+0x1c/0x30 mm/kasan/common.c:56
        kasan_set_free_info+0x1b/0x30 mm/kasan/generic.c:355
        __kasan_slab_free+0xd8/0x120 mm/kasan/common.c:422
        __cache_free mm/slab.c:3418 [inline]
        kfree+0x10e/0x2b0 mm/slab.c:3756
        __io_wq_destroy fs/io-wq.c:1138 [inline]
        io_wq_destroy+0x2af/0x460 fs/io-wq.c:1146
        io_finish_async fs/io_uring.c:6836 [inline]
        io_ring_ctx_free fs/io_uring.c:7870 [inline]
        io_ring_exit_work+0x1e4/0x6d0 fs/io_uring.c:7954
        process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
        worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
        kthread+0x3b5/0x4a0 kernel/kthread.c:292
        ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
      
       The buggy address belongs to the object at ffff8882183db000
        which belongs to the cache kmalloc-1k of size 1024
       The buggy address is located 140 bytes inside of
        1024-byte region [ffff8882183db000, ffff8882183db400)
       The buggy address belongs to the page:
       page:000000009bada22b refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2183db
       flags: 0x57ffe0000000200(slab)
       raw: 057ffe0000000200 ffffea0008604c48 ffffea00086a8648 ffff8880aa040700
       raw: 0000000000000000 ffff8882183db000 0000000100000002 0000000000000000
       page dumped because: kasan: bad access detected
      
       Memory state around the buggy address:
        ffff8882183daf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
        ffff8882183db000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       >ffff8882183db080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                             ^
        ffff8882183db100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ffff8882183db180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ==================================================================
      
      which is down to the comment below,
      
      	/* all workers gone, wq exit can proceed */
      	if (!nr_workers && refcount_dec_and_test(&wqe->wq->refs))
      		complete(&wqe->wq->done);
      
      because a wq may contain multiple wqe instances, and we should wait
      for every worker in every wqe to exit before releasing the wq's
      resources on destroy.

      To that end, rework the wq's refcount by making it independent of the
      tracking of workers, because after all they are two different things,
      and keep it balanced when workers come and go. Note that the manager
      kthread, like other workers, now holds a reference to the wq during its
      lifetime.
      
      Finally to help destroy wq, check IO_WQ_BIT_EXIT upon creating worker
      and do nothing for exiting wq.
      
      Cc: stable@vger.kernel.org # v5.5+
      Reported-by: syzbot+45fa0a195b941764e0f0@syzkaller.appspotmail.com
      Reported-by: syzbot+9af99580130003da82b1@syzkaller.appspotmail.com
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Hillf Danton <hdanton@sina.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c4068bf8
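The lifetime rule described in the commit above can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions, not the kernel patch: `struct wq_model` and all the helper names are hypothetical, and a plain `int` and `bool` stand in for the kernel's refcount and completion. It shows the wq reference count being kept separately from worker bookkeeping, with each worker holding one reference, and worker creation refusing to run once the exit bit is set.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct io_wq: only lifetime state is modelled. */
struct wq_model {
    int refs;      /* references held by the creator and by each worker */
    bool exiting;  /* models IO_WQ_BIT_EXIT */
    bool done;     /* models complete(&wq->done): resources may be freed */
};

static void wq_init(struct wq_model *wq)
{
    wq->refs = 1;          /* the creator's reference, dropped on destroy */
    wq->exiting = false;
    wq->done = false;
}

/* Drop one reference; the last put signals that teardown can proceed. */
static void wq_put(struct wq_model *wq)
{
    if (--wq->refs == 0)
        wq->done = true;
}

/* A new worker pins the wq, unless the wq is already exiting. */
static bool worker_create(struct wq_model *wq)
{
    if (wq->exiting)
        return false;      /* do nothing for an exiting wq */
    wq->refs++;
    return true;
}

/* A worker drops its reference when it goes away. */
static void worker_exit(struct wq_model *wq)
{
    wq_put(wq);
}

/* Start teardown: set the exit bit and drop the creator's reference. */
static void wq_destroy_begin(struct wq_model *wq)
{
    wq->exiting = true;
    wq_put(wq);
}
```

In this model `done` only becomes true after every worker has called `worker_exit()`, regardless of which wqe the worker belonged to, which is the property the original per-wqe check did not guarantee.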
    • J
      io_uring: show sqthread pid and cpu in fdinfo · dbbe9c64
      Joseph Qi committed
      In most cases we'll specify IORING_SETUP_SQPOLL and run multiple
      io_uring instances in a host. Since all sqthreads are named
      "io_uring-sq", it's hard to distinguish the relations between
      application process and its io_uring sqthread.
      With this patch, application can get its corresponding sqthread pid
      and cpu through show_fdinfo.
      Steps:
      1. Get io_uring fd first.
      $ ls -l /proc/<pid>/fd | grep -w io_uring
      2. Then get io_uring instance related info, including corresponding
      sqthread pid and cpu.
      $ cat /proc/<pid>/fdinfo/<io_uring_fd>
      
      pos:	0
      flags:	02000002
      mnt_id:	13
      SqThread:	6929
      SqThreadCpu:	2
      UserFiles:	1
          0: testfile
      UserBufs:	0
      PollList:
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
      [axboe: fixed for new shared SQPOLL]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      dbbe9c64
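A program that wants the `SqThread` and `SqThreadCpu` values shown in the commit above has to pick them out of the fdinfo text. The following is a minimal sketch, assuming the `key:<tab>value` layout of the sample output; `fdinfo_field()` is a hypothetical helper, not part of any library, and reading `/proc/<pid>/fdinfo/<io_uring_fd>` into a buffer is left to the caller.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Return the integer value of the line "key:<whitespace>value" in an
 * fdinfo text buffer, or -1 if no line starts with that exact key.
 */
static long fdinfo_field(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        /* Require ':' right after the key so "SqThread" != "SqThreadCpu". */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':')
            return strtol(line + klen + 1, NULL, 10);
        line = strchr(line, '\n');
        if (line)
            line++;        /* step past the newline to the next line */
    }
    return -1;
}
```

With the sample output from the commit message, `fdinfo_field(buf, "SqThread")` yields the sqthread pid and `fdinfo_field(buf, "SqThreadCpu")` the CPU it runs on.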
    • J
      io_uring: process task work in io_uring_register() · af9c1a44
      Jens Axboe committed
      We do this for CQ ring wait, in case task_work completions come in. We
      should do the same in io_uring_register(), to avoid spurious -EINTR
      if the ring quiescing ends up having to process task_work to complete
      the operation.
      Reported-by: Dan Melnic <dmm@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      af9c1a44
    • D
      io_uring: add blkcg accounting to offloaded operations · 91d8f519
      Dennis Zhou committed
      There are a few operations that are offloaded to the worker threads. In
      this case, we lose process context and end up in kthread context. As a
      result, IOs are not accounted to the issuing cgroup and consequently
      end up as issued by root. Just like others, adopt the personality of
      the blkcg too when issuing via the workqueues.

      For the SQPOLL thread, it will live and attach in the inited cgroup's
      context.
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      91d8f519
    • J
      io_uring: improve registered buffer accounting for huge pages · de293938
      Jens Axboe committed
      io_uring accounts any registered buffer as pinned/locked memory, checks
      the limit, and fails if the given user doesn't have a big enough limit
      to register the ranges specified. However, if huge pages are used, we
      are potentially under-accounting the memory in terms of what gets pinned
      on the vm side.

      This patch rectifies that, by ensuring that we account the full size of
      a compound page, regardless of how much of it is being registered. Huge
      pages are not accounted multiple times - if multiple sections of a huge
      page are registered, then the page is only accounted once.
      Reported-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      de293938
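The accounting rule in the commit above (charge the whole compound page, but only once) can be sketched with a simplified model. Everything here is an assumption made for illustration: `struct page_model`, `head_seen()` and `account_pages()` are hypothetical names, and a linear scan over already-seen head pages replaces whatever bookkeeping the kernel actually uses.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one pinned base page. */
struct page_model {
    unsigned long head;        /* identifier of the compound head page */
    unsigned long nr_in_comp;  /* base pages in the compound page (1 if not huge) */
};

/* Has this head page already been charged? */
static bool head_seen(const unsigned long *seen, size_t n, unsigned long head)
{
    for (size_t i = 0; i < n; i++)
        if (seen[i] == head)
            return true;
    return false;
}

/*
 * Return how many base pages to charge for the registered pages.
 * Each compound page is charged in full the first time one of its
 * pages appears and is skipped afterwards. `seen` is caller-provided
 * scratch space with room for `nr` entries.
 */
static unsigned long account_pages(const struct page_model *pages, size_t nr,
                                   unsigned long *seen)
{
    unsigned long charged = 0;
    size_t nseen = 0;

    for (size_t i = 0; i < nr; i++) {
        if (head_seen(seen, nseen, pages[i].head))
            continue;               /* this compound page is already accounted */
        seen[nseen++] = pages[i].head;
        charged += pages[i].nr_in_comp;
    }
    return charged;
}
```

For example, registering two sections of one huge page made of 512 base pages (an assumed size) plus one ordinary page charges 513 base pages in this model, not 3 and not 1025.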
    • Z
      io_uring: remove unneeded semicolon · 14db8411
      Zheng Bin committed
      Fixes coccicheck warning:

      fs/io_uring.c:4242:13-14: Unneeded semicolon
      Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      14db8411
    • J
      io_uring: cap SQ submit size for SQPOLL with multiple rings · e95eee2d
      Jens Axboe committed
      In the spirit of fairness, cap the max number of SQ entries we'll submit
      for SQPOLL if we have multiple rings. If we don't do that, we could be
      submitting tons of entries for one ring, while others are waiting to get
      service.

      The value of 8 is somewhat arbitrarily chosen as something that allows
      a fair bit of batching, without using an excessive time per ring.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e95eee2d
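The fairness cap described above can be modelled in a few lines. This is a hedged sketch rather than the kernel code: `SQ_SUBMIT_CAP`, `sq_to_submit()` and `sq_pass()` are hypothetical names, and an array of pending counts stands in for the rings served by one SQPOLL thread. Only the value 8 and the "multiple rings" condition come from the commit message.

```c
#include <stddef.h>

#define SQ_SUBMIT_CAP 8u  /* the value named in the commit message */

/* Entries to submit for one ring in a single pass of the poller. */
static unsigned int sq_to_submit(unsigned int pending, size_t nr_rings)
{
    /* Only cap when several rings share the thread. */
    if (nr_rings > 1 && pending > SQ_SUBMIT_CAP)
        return SQ_SUBMIT_CAP;
    return pending;
}

/* One pass over all rings; returns the total number of entries submitted. */
static unsigned int sq_pass(unsigned int *pending, size_t nr_rings)
{
    unsigned int total = 0;

    for (size_t i = 0; i < nr_rings; i++) {
        unsigned int n = sq_to_submit(pending[i], nr_rings);

        pending[i] -= n;   /* the rest waits for the next pass */
        total += n;
    }
    return total;
}
```

With two rings holding 100 and 3 pending entries, one pass submits 8 and 3 respectively, so the lightly loaded ring is served without waiting for the busy one to drain.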
    • J
      io_uring: get rid of req->io/io_async_ctx union · e8c2bc1f
      Jens Axboe committed
      There's really no point in having this union, it just means that we're
      always allocating enough room to cater to any command. But that's
      pointless, as the ->io field is request type private anyway.

      This gets rid of the io_async_ctx structure, and fills in the required
      size in the io_op_defs[] instead.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e8c2bc1f
    • P
      io_uring: kill extra user_bufs check · 4be1c615
      Pavel Begunkov committed
      Testing ctx->user_bufs for NULL in io_import_fixed() is not necessary,
      because in that case ctx->nr_user_bufs would be zero, and the following
      check would fail.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      4be1c615
    • P
      io_uring: fix overlapped memcpy in io_req_map_rw() · ab0b196c
      Pavel Begunkov committed
      When io_req_map_rw() is called from io_rw_prep_async(), it memcpy()s
      iorw->iter into itself. Even though it doesn't lead to an error, such a
      violation of memcpy()'s aliasing rules is considered bad practice.

      Inline io_req_map_rw() into io_rw_prep_async(). We don't really need any
      remapping there, so it's much simpler than the generic implementation.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ab0b196c
    • P
      io_uring: refactor io_req_map_rw() · afb87658
      Pavel Begunkov committed
      Set rw->free_iovec to @iovec, that gives an identical result and stresses
      that the @iovec param and rw->free_iovec play the same role.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      afb87658
    • P
      io_uring: simplify io_rw_prep_async() · f4bff104
      Pavel Begunkov committed
      Don't touch iter->iov and iov in between __io_import_iovec() and
      io_req_map_rw(), the former function already sets it correctly, because it
      creates one more case with NULL'ed iov to consider in io_req_map_rw().
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f4bff104
    • J
      io_uring: provide IORING_ENTER_SQ_WAIT for SQPOLL SQ ring waits · 90554200
      Jens Axboe committed
      When using SQPOLL, applications can run into the issue of running out of
      SQ ring entries because the thread hasn't consumed them yet. The only
      option for dealing with that is checking later, or busy checking for the
      condition.

      Provide IORING_ENTER_SQ_WAIT if applications want to wait on this
      condition.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      90554200
    • J
      io_uring: mark io_uring_fops/io_op_defs as __read_mostly · 738277ad
      Jens Axboe committed
      These structures are never written, move them appropriately.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      738277ad
    • J
      io_uring: enable IORING_SETUP_ATTACH_WQ to attach to SQPOLL thread too · aa06165d
      Jens Axboe committed
      We support using IORING_SETUP_ATTACH_WQ to share async backends between
      rings created by the same process, this now also allows the same to
      happen with SQPOLL. The setup procedure remains the same, the caller
      sets io_uring_params->wq_fd to the 'parent' context, and then the newly
      created ring will attach to that async backend.

      This means that multiple rings can share the same SQPOLL thread, saving
      resources.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      aa06165d
    • J
      io_uring: base SQPOLL handling off io_sq_data · 69fb2131
      Jens Axboe committed
      Remove the SQPOLL thread from the ctx, and use the io_sq_data as the
      data structure we pass in. io_sq_data has a list of ctx's that we can
      then iterate over and handle.

      As of now we're ready to handle multiple ctx's, though we're still just
      handling a single one after this patch.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      69fb2131
    • J
      io_uring: split SQPOLL data into separate structure · 534ca6d6
      Jens Axboe committed
      Move all the necessary state out of io_ring_ctx, and into a new
      structure, io_sq_data. The latter now deals with any state or
      variables associated with the SQPOLL thread itself.

      In preparation for supporting more than one io_ring_ctx per SQPOLL
      thread.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      534ca6d6
    • J
      io_uring: split work handling part of SQPOLL into helper · c8d1ba58
      Jens Axboe committed
      This is done in preparation for handling more than one ctx, but it also
      cleans up the code a bit since io_sq_thread() was a bit too unwieldy to
      get a good overview of.

      __io_sq_thread() is now the main handler, and it returns an enum sq_ret
      that tells io_sq_thread() what it ended up doing. The parent then makes
      a decision on idle, spinning, or work handling based on that.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c8d1ba58
    • J
      io_uring: move SQPOLL post-wakeup ring need wakeup flag into wake handler · 3f0e64d0
      Jens Axboe committed
      We need to decouple the clearing on wakeup from the inline schedule,
      as that is going to be required for handling multiple rings in one
      thread.

      Wrap our wakeup handler so we can clear it when we get the wakeup, by
      definition that is when we no longer need the flag set.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3f0e64d0
    • J
      io_uring: use private ctx wait queue entries for SQPOLL · 6a779382
      Jens Axboe committed
      This is in preparation to sharing the poller thread between rings. For
      that we need per-ring wait_queue_entry storage, and we can't easily put
      that on the stack if one thread is managing multiple rings.

      We'll also be sharing the wait_queue_head across rings for the purposes
      of wakeups, provide the usual private ring wait_queue_head for now but
      make it a pointer so we can easily override it when sharing.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6a779382
    • J
      io_uring: io_sq_thread() doesn't need to flush signals · e35afcf9
      Jens Axboe committed
      We're not handling signals by default in kernel threads, and we never
      use TWA_SIGNAL for the SQPOLL thread internally. Hence we can never
      have a signal pending, and we don't need to check for it (nor flush it).
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e35afcf9
    • S
      io_wq: Make io_wqe::lock a raw_spinlock_t · 95da8465
      Sebastian Andrzej Siewior committed
      During a context switch the scheduler invokes wq_worker_sleeping() with
      disabled preemption. Disabling preemption is needed because it protects
      access to `worker->sleeping'. As an optimisation it avoids invoking
      schedule() within the schedule path as part of possible wake up (thus
      preempt_enable_no_resched() afterwards).
      
      The io-wq has been added to the mix in the same section with disabled
      preemption. This breaks on PREEMPT_RT because io_wq_worker_sleeping()
      acquires a spinlock_t. Also within the schedule() the spinlock_t must be
      acquired after tsk_is_pi_blocked() otherwise it will block on the
      sleeping lock again while scheduling out.
      
      While playing with `io_uring-bench' I didn't notice a significant
      latency spike after converting io_wqe::lock to a raw_spinlock_t. The
      latency was more or less the same.
      
      In order to keep the spinlock_t it would have to be moved after the
      tsk_is_pi_blocked() check which would introduce a branch instruction
      into the hot path.
      
      The lock is used to maintain the `work_list' and wakes one task up at
      most.
      Should io_wqe_cancel_pending_work() cause latency spikes, while
      searching for a specific item, then it would need to drop the lock
      during iterations.
      revert_creds() is also invoked under the lock. According to debug
      cred::non_rcu is 0. Otherwise it should be moved outside of the locked
      section because put_cred_rcu()->free_uid() acquires a sleeping lock.
      
      Convert io_wqe::lock to a raw_spinlock_t.
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      95da8465
    • S
      io_uring: allow disabling rings during the creation · 7e84e1c7
      Stefano Garzarella committed
      This patch adds a new IORING_SETUP_R_DISABLED flag to start the
      rings disabled, allowing the user to register restrictions,
      buffers, and files before starting to process SQEs.

      When IORING_SETUP_R_DISABLED is set, SQEs are not processed and
      the SQPOLL kthread is not started.

      Registering restrictions is allowed only when the rings
      are disabled, to prevent concurrency issues while processing SQEs.

      The rings can be enabled using IORING_REGISTER_ENABLE_RINGS
      opcode with io_uring_register(2).
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7e84e1c7
    • S
      io_uring: add IOURING_REGISTER_RESTRICTIONS opcode · 21b55dbc
      Stefano Garzarella committed
      The new io_uring_register(2) IOURING_REGISTER_RESTRICTIONS opcode
      permanently installs a feature allowlist on an io_ring_ctx.
      The io_ring_ctx can then be passed to untrusted code with the
      knowledge that only operations present in the allowlist can be
      executed.

      The allowlist approach ensures that new features added to io_uring
      do not accidentally become available when an existing application
      is launched on a newer kernel version.

      Currently it is possible to restrict sqe opcodes, sqe flags, and
      register opcodes.

      IOURING_REGISTER_RESTRICTIONS can only be made once. Afterwards
      it is not possible to change restrictions anymore.
      This prevents untrusted code from removing restrictions.
      Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      21b55dbc
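The two properties the commit above relies on, an allowlist rather than a denylist and one-time registration, can be shown with a small model. This is a sketch under assumptions and does not use the real io_uring_register(2) interface: `struct restrictions_model` and both helpers are hypothetical, only SQE opcodes are modelled, and a 64-bit mask stands in for the kernel's bookkeeping.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-ring restriction state (SQE opcodes only). */
struct restrictions_model {
    uint64_t sqe_ops;   /* bit n set => SQE opcode n is allowed (n < 64) */
    bool registered;    /* set once; restrictions cannot be changed later */
};

/*
 * Install the allowlist. Returns 0 on success, or -1 if restrictions
 * were already registered (they can only be installed once).
 */
static int restrictions_register(struct restrictions_model *r,
                                 uint64_t allowed_ops)
{
    if (r->registered)
        return -1;
    r->sqe_ops = allowed_ops;
    r->registered = true;
    return 0;
}

/* Check an opcode against the allowlist. */
static bool sqe_op_allowed(const struct restrictions_model *r, unsigned int op)
{
    if (!r->registered)
        return true;    /* no restrictions installed on this ring */
    /* Opcodes unknown to the allowlist, including future ones, are denied. */
    return op < 64 && ((r->sqe_ops >> op) & 1);
}
```

Because anything absent from the mask is denied, an opcode added by a newer kernel stays unavailable to the restricted ring, and the failed second registration is what stops untrusted code from widening the list.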
    • J
      io_uring: reference ->nsproxy for file table commands · 9b828492
      Jens Axboe committed
      If we don't get and assign the namespace for the async work, then certain
      paths just don't work properly (like /dev/stdin, /proc/mounts, etc).
      Anything that references the current namespace of the given task should
      be assigned for async work on behalf of that task.

      Cc: stable@vger.kernel.org # v5.5+
      Reported-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9b828492
    • J
      io_uring: don't rely on weak ->files references · 0f212204
      Jens Axboe committed
      Grab actual references to the files_struct. To avoid circular reference
      issues due to this, we add a per-task note that keeps track of what
      io_uring contexts a task has used. When the task execs or exits its
      assigned files, we cancel requests based on this tracking.

      With that, we can grab proper references to the files table, and no
      longer need to rely on stashing away ring_fd and ring_file to check
      if the ring_fd may have been closed.

      Cc: stable@vger.kernel.org # v5.5+
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0f212204
    • J
      io_uring: enable task/files specific overflow flushing · e6c8aa9a
      Jens Axboe committed
      This allows us to selectively flush out pending overflows, depending on
      the task and/or files_struct being passed in.

      No intended functional changes in this patch.
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e6c8aa9a
    • J
      io_uring: return cancelation status from poll/timeout/files handlers · 76e1b642
      Jens Axboe committed
      Return whether we found and canceled requests or not. This is in
      preparation for using this information, no functional changes in this
      patch.
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      76e1b642
    • J
      io_uring: unconditionally grab req->task · e3bc8e9d
      Jens Axboe committed
      Sometimes we assign a weak reference to it, sometimes we grab a
      reference to it. Clean this up and make it unconditional, and drop the
      flag related to tracking this state.
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e3bc8e9d
    • J
      io_uring: stash ctx task reference for SQPOLL · 2aede0e4
      Jens Axboe committed
      We can grab a reference to the task instead of stashing away the task
      files_struct. This is doable without creating a circular reference
      between the ring fd and the task itself.
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2aede0e4
    • J
      io_uring: move dropping of files into separate helper · f573d384
      Jens Axboe committed
      No functional changes in this patch, prep patch for grabbing references
      to the files_struct.
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f573d384