1. 01 10月, 2020 35 次提交
    • P
      io_uring: io_kiocb_ppos() style change · 5b09e37e
      Pavel Begunkov 提交于
      Put brackets around bitwise ops in a complex expression
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5b09e37e
    • P
      io_uring: simplify io_alloc_req() · 291b2821
      Pavel Begunkov 提交于
      Extract common code from if/else branches. That is cleaner and optimised
      even better.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      291b2821
    • J
      io-wq: kill unused IO_WORKER_F_EXITING · 145cc8c6
      Jens Axboe 提交于
      This flag is no longer used, remove it.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      145cc8c6
    • H
      io-wq: fix use-after-free in io_wq_worker_running · c4068bf8
      Hillf Danton 提交于
      The smart syzbot has found a reproducer for the following issue:
      
       ==================================================================
       BUG: KASAN: use-after-free in instrument_atomic_write include/linux/instrumented.h:71 [inline]
       BUG: KASAN: use-after-free in atomic_inc include/asm-generic/atomic-instrumented.h:240 [inline]
       BUG: KASAN: use-after-free in io_wqe_inc_running fs/io-wq.c:301 [inline]
       BUG: KASAN: use-after-free in io_wq_worker_running+0xde/0x110 fs/io-wq.c:613
       Write of size 4 at addr ffff8882183db08c by task io_wqe_worker-0/7771
      
       CPU: 0 PID: 7771 Comm: io_wqe_worker-0 Not tainted 5.9.0-rc4-syzkaller #0
       Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
       Call Trace:
        __dump_stack lib/dump_stack.c:77 [inline]
        dump_stack+0x198/0x1fd lib/dump_stack.c:118
        print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383
        __kasan_report mm/kasan/report.c:513 [inline]
        kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
        check_memory_region_inline mm/kasan/generic.c:186 [inline]
        check_memory_region+0x13d/0x180 mm/kasan/generic.c:192
        instrument_atomic_write include/linux/instrumented.h:71 [inline]
        atomic_inc include/asm-generic/atomic-instrumented.h:240 [inline]
        io_wqe_inc_running fs/io-wq.c:301 [inline]
        io_wq_worker_running+0xde/0x110 fs/io-wq.c:613
        schedule_timeout+0x148/0x250 kernel/time/timer.c:1879
        io_wqe_worker+0x517/0x10e0 fs/io-wq.c:580
        kthread+0x3b5/0x4a0 kernel/kthread.c:292
        ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
      
       Allocated by task 7768:
        kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
        kasan_set_track mm/kasan/common.c:56 [inline]
        __kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:461
        kmem_cache_alloc_node_trace+0x17b/0x3f0 mm/slab.c:3594
        kmalloc_node include/linux/slab.h:572 [inline]
        kzalloc_node include/linux/slab.h:677 [inline]
        io_wq_create+0x57b/0xa10 fs/io-wq.c:1064
        io_init_wq_offload fs/io_uring.c:7432 [inline]
        io_sq_offload_start fs/io_uring.c:7504 [inline]
        io_uring_create fs/io_uring.c:8625 [inline]
        io_uring_setup+0x1836/0x28e0 fs/io_uring.c:8694
        do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       Freed by task 21:
        kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
        kasan_set_track+0x1c/0x30 mm/kasan/common.c:56
        kasan_set_free_info+0x1b/0x30 mm/kasan/generic.c:355
        __kasan_slab_free+0xd8/0x120 mm/kasan/common.c:422
        __cache_free mm/slab.c:3418 [inline]
        kfree+0x10e/0x2b0 mm/slab.c:3756
        __io_wq_destroy fs/io-wq.c:1138 [inline]
        io_wq_destroy+0x2af/0x460 fs/io-wq.c:1146
        io_finish_async fs/io_uring.c:6836 [inline]
        io_ring_ctx_free fs/io_uring.c:7870 [inline]
        io_ring_exit_work+0x1e4/0x6d0 fs/io_uring.c:7954
        process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
        worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
        kthread+0x3b5/0x4a0 kernel/kthread.c:292
        ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
      
       The buggy address belongs to the object at ffff8882183db000
        which belongs to the cache kmalloc-1k of size 1024
       The buggy address is located 140 bytes inside of
        1024-byte region [ffff8882183db000, ffff8882183db400)
       The buggy address belongs to the page:
       page:000000009bada22b refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2183db
       flags: 0x57ffe0000000200(slab)
       raw: 057ffe0000000200 ffffea0008604c48 ffffea00086a8648 ffff8880aa040700
       raw: 0000000000000000 ffff8882183db000 0000000100000002 0000000000000000
       page dumped because: kasan: bad access detected
      
       Memory state around the buggy address:
        ffff8882183daf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
        ffff8882183db000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       >ffff8882183db080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                             ^
        ffff8882183db100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ffff8882183db180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ==================================================================
      
      which is down to the comment below,
      
      	/* all workers gone, wq exit can proceed */
      	if (!nr_workers && refcount_dec_and_test(&wqe->wq->refs))
      		complete(&wqe->wq->done);
      
      because there might be multiple cases of wqe in a wq and we would wait
      for every worker in every wqe to go home before releasing wq's resources
      on destroying.
      
      To that end, rework wq's refcount by making it independent of the tracking
      of workers because after all they are two different things, and keeping
      it balanced when workers come and go. Note the manager kthread, like
      other workers, now holds a grab to wq during its lifetime.
      
      Finally to help destroy wq, check IO_WQ_BIT_EXIT upon creating worker
      and do nothing for exiting wq.
      
      Cc: stable@vger.kernel.org # v5.5+
      Reported-by: syzbot+45fa0a195b941764e0f0@syzkaller.appspotmail.com
      Reported-by: syzbot+9af99580130003da82b1@syzkaller.appspotmail.com
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NHillf Danton <hdanton@sina.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c4068bf8
    • J
      io_uring: show sqthread pid and cpu in fdinfo · dbbe9c64
      Joseph Qi 提交于
      In most cases we'll specify IORING_SETUP_SQPOLL and run multiple
      io_uring instances in a host. Since all sqthreads are named
      "io_uring-sq", it's hard to distinguish the relations between
      application process and its io_uring sqthread.
      With this patch, application can get its corresponding sqthread pid
      and cpu through show_fdinfo.
      Steps:
      1. Get io_uring fd first.
      $ ls -l /proc/<pid>/fd | grep -w io_uring
      2. Then get io_uring instance related info, including corresponding
      sqthread pid and cpu.
      $ cat /proc/<pid>/fdinfo/<io_uring_fd>
      
      pos:	0
      flags:	02000002
      mnt_id:	13
      SqThread:	6929
      SqThreadCpu:	2
      UserFiles:	1
          0: testfile
      UserBufs:	0
      PollList:
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NStefano Garzarella <sgarzare@redhat.com>
      [axboe: fixed for new shared SQPOLL]
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      dbbe9c64
    • J
      io_uring: process task work in io_uring_register() · af9c1a44
      Jens Axboe 提交于
      We do this for CQ ring wait, in case task_work completions come in. We
      should do the same in io_uring_register(), to avoid spurious -EINTR
      if the ring quiescing ends up having to process task_work to complete
      the operation
      Reported-by: NDan Melnic <dmm@fb.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      af9c1a44
    • D
      io_uring: add blkcg accounting to offloaded operations · 91d8f519
      Dennis Zhou 提交于
      There are a few operations that are offloaded to the worker threads. In
      this case, we lose process context and end up in kthread context. This
      results in ios to be not accounted to the issuing cgroup and
      consequently end up as issued by root. Just like others, adopt the
      personality of the blkcg too when issuing via the workqueues.
      
      For the SQPOLL thread, it will live and attach in the inited cgroup's
      context.
      Signed-off-by: NDennis Zhou <dennis@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      91d8f519
    • J
      io_uring: improve registered buffer accounting for huge pages · de293938
      Jens Axboe 提交于
      io_uring does account any registered buffer as pinned/locked memory, and
      checks limit and fails if the given user doesn't have a big enough limit
      to register the ranges specified. However, if huge pages are used, we
      are potentially under-accounting the memory in terms of what gets pinned
      on the vm side.
      
      This patch rectifies that, by ensuring that we account the full size of
      a compound page, regardless of how much of it is being registered. Huge
      pages are not accounted mulitple times - if multiple sections of a huge
      page is registered, then the page is only accounted once.
      Reported-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      de293938
    • Z
      io_uring: remove unneeded semicolon · 14db8411
      Zheng Bin 提交于
      Fixes coccicheck warning:
      
      fs/io_uring.c:4242:13-14: Unneeded semicolon
      Signed-off-by: NZheng Bin <zhengbin13@huawei.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      14db8411
    • J
      io_uring: cap SQ submit size for SQPOLL with multiple rings · e95eee2d
      Jens Axboe 提交于
      In the spirit of fairness, cap the max number of SQ entries we'll submit
      for SQPOLL if we have multiple rings. If we don't do that, we could be
      submitting tons of entries for one ring, while others are waiting to get
      service.
      
      The value of 8 is somewhat arbitrarily chosen as something that allows
      a fair bit of batching, without using an excessive time per ring.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e95eee2d
    • J
      io_uring: get rid of req->io/io_async_ctx union · e8c2bc1f
      Jens Axboe 提交于
      There's really no point in having this union, it just means that we're
      always allocating enough room to cater to any command. But that's
      pointless, as the ->io field is request type private anyway.
      
      This gets rid of the io_async_ctx structure, and fills in the required
      size in the io_op_defs[] instead.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e8c2bc1f
    • P
      io_uring: kill extra user_bufs check · 4be1c615
      Pavel Begunkov 提交于
      Testing ctx->user_bufs for NULL in io_import_fixed() is not neccessary,
      because in that case ctx->nr_user_bufs would be zero, and the following
      check would fail.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      4be1c615
    • P
      io_uring: fix overlapped memcpy in io_req_map_rw() · ab0b196c
      Pavel Begunkov 提交于
      When io_req_map_rw() is called from io_rw_prep_async(), it memcpy()
      iorw->iter into itself. Even though it doesn't lead to an error, such a
      memcpy()'s aliasing rules violation is considered to be a bad practise.
      
      Inline io_req_map_rw() into io_rw_prep_async(). We don't really need any
      remapping there, so it's much simpler than the generic implementation.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ab0b196c
    • P
      io_uring: refactor io_req_map_rw() · afb87658
      Pavel Begunkov 提交于
      Set rw->free_iovec to @iovec, that gives an identical result and stresses
      that @iovec param rw->free_iovec play the same role.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      afb87658
    • P
      io_uring: simplify io_rw_prep_async() · f4bff104
      Pavel Begunkov 提交于
      Don't touch iter->iov and iov in between __io_import_iovec() and
      io_req_map_rw(), the former function aleady sets it correctly, because it
      creates one more case with NULL'ed iov to consider in io_req_map_rw().
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f4bff104
    • J
      io_uring: provide IORING_ENTER_SQ_WAIT for SQPOLL SQ ring waits · 90554200
      Jens Axboe 提交于
      When using SQPOLL, applications can run into the issue of running out of
      SQ ring entries because the thread hasn't consumed them yet. The only
      option for dealing with that is checking later, or busy checking for the
      condition.
      
      Provide IORING_ENTER_SQ_WAIT if applications want to wait on this
      condition.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      90554200
    • J
      io_uring: mark io_uring_fops/io_op_defs as __read_mostly · 738277ad
      Jens Axboe 提交于
      These structures are never written, move them appropriately.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      738277ad
    • J
      io_uring: enable IORING_SETUP_ATTACH_WQ to attach to SQPOLL thread too · aa06165d
      Jens Axboe 提交于
      We support using IORING_SETUP_ATTACH_WQ to share async backends between
      rings created by the same process, this now also allows the same to
      happen with SQPOLL. The setup procedure remains the same, the caller
      sets io_uring_params->wq_fd to the 'parent' context, and then the newly
      created ring will attach to that async backend.
      
      This means that multiple rings can share the same SQPOLL thread, saving
      resources.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      aa06165d
    • J
      io_uring: base SQPOLL handling off io_sq_data · 69fb2131
      Jens Axboe 提交于
      Remove the SQPOLL thread from the ctx, and use the io_sq_data as the
      data structure we pass in. io_sq_data has a list of ctx's that we can
      then iterate over and handle.
      
      As of now we're ready to handle multiple ctx's, though we're still just
      handling a single one after this patch.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      69fb2131
    • J
      io_uring: split SQPOLL data into separate structure · 534ca6d6
      Jens Axboe 提交于
      Move all the necessary state out of io_ring_ctx, and into a new
      structure, io_sq_data. The latter now deals with any state or
      variables associated with the SQPOLL thread itself.
      
      In preparation for supporting more than one io_ring_ctx per SQPOLL
      thread.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      534ca6d6
    • J
      io_uring: split work handling part of SQPOLL into helper · c8d1ba58
      Jens Axboe 提交于
      This is done in preparation for handling more than one ctx, but it also
      cleans up the code a bit since io_sq_thread() was a bit too unwieldy to
      get a get overview on.
      
      __io_sq_thread() is now the main handler, and it returns an enum sq_ret
      that tells io_sq_thread() what it ended up doing. The parent then makes
      a decision on idle, spinning, or work handling based on that.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c8d1ba58
    • J
      io_uring: move SQPOLL post-wakeup ring need wakeup flag into wake handler · 3f0e64d0
      Jens Axboe 提交于
      We need to decouple the clearing on wakeup from the the inline schedule,
      as that is going to be required for handling multiple rings in one
      thread.
      
      Wrap our wakeup handler so we can clear it when we get the wakeup, by
      definition that is when we no longer need the flag set.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      3f0e64d0
    • J
      io_uring: use private ctx wait queue entries for SQPOLL · 6a779382
      Jens Axboe 提交于
      This is in preparation to sharing the poller thread between rings. For
      that we need per-ring wait_queue_entry storage, and we can't easily put
      that on the stack if one thread is managing multiple rings.
      
      We'll also be sharing the wait_queue_head across rings for the purposes
      of wakeups, provide the usual private ring wait_queue_head for now but
      make it a pointer so we can easily override it when sharing.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6a779382
    • J
      io_uring: io_sq_thread() doesn't need to flush signals · e35afcf9
      Jens Axboe 提交于
      We're not handling signals by default in kernel threads, and we never
      use TWA_SIGNAL for the SQPOLL thread internally. Hence we can never
      have a signal pending, and we don't need to check for it (nor flush it).
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e35afcf9
    • S
      io_wq: Make io_wqe::lock a raw_spinlock_t · 95da8465
      Sebastian Andrzej Siewior 提交于
      During a context switch the scheduler invokes wq_worker_sleeping() with
      disabled preemption. Disabling preemption is needed because it protects
      access to `worker->sleeping'. As an optimisation it avoids invoking
      schedule() within the schedule path as part of possible wake up (thus
      preempt_enable_no_resched() afterwards).
      
      The io-wq has been added to the mix in the same section with disabled
      preemption. This breaks on PREEMPT_RT because io_wq_worker_sleeping()
      acquires a spinlock_t. Also within the schedule() the spinlock_t must be
      acquired after tsk_is_pi_blocked() otherwise it will block on the
      sleeping lock again while scheduling out.
      
      While playing with `io_uring-bench' I didn't notice a significant
      latency spike after converting io_wqe::lock to a raw_spinlock_t. The
      latency was more or less the same.
      
      In order to keep the spinlock_t it would have to be moved after the
      tsk_is_pi_blocked() check which would introduce a branch instruction
      into the hot path.
      
      The lock is used to maintain the `work_list' and wakes one task up at
      most.
      Should io_wqe_cancel_pending_work() cause latency spikes, while
      searching for a specific item, then it would need to drop the lock
      during iterations.
      revert_creds() is also invoked under the lock. According to debug
      cred::non_rcu is 0. Otherwise it should be moved outside of the locked
      section because put_cred_rcu()->free_uid() acquires a sleeping lock.
      
      Convert io_wqe::lock to a raw_spinlock_t.c
      Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      95da8465
    • S
      io_uring: allow disabling rings during the creation · 7e84e1c7
      Stefano Garzarella 提交于
      This patch adds a new IORING_SETUP_R_DISABLED flag to start the
      rings disabled, allowing the user to register restrictions,
      buffers, files, before to start processing SQEs.
      
      When IORING_SETUP_R_DISABLED is set, SQE are not processed and
      SQPOLL kthread is not started.
      
      The restrictions registration are allowed only when the rings
      are disable to prevent concurrency issue while processing SQEs.
      
      The rings can be enabled using IORING_REGISTER_ENABLE_RINGS
      opcode with io_uring_register(2).
      Suggested-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NStefano Garzarella <sgarzare@redhat.com>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7e84e1c7
    • S
      io_uring: add IOURING_REGISTER_RESTRICTIONS opcode · 21b55dbc
      Stefano Garzarella 提交于
      The new io_uring_register(2) IOURING_REGISTER_RESTRICTIONS opcode
      permanently installs a feature allowlist on an io_ring_ctx.
      The io_ring_ctx can then be passed to untrusted code with the
      knowledge that only operations present in the allowlist can be
      executed.
      
      The allowlist approach ensures that new features added to io_uring
      do not accidentally become available when an existing application
      is launched on a newer kernel version.
      
      Currently is it possible to restrict sqe opcodes, sqe flags, and
      register opcodes.
      
      IOURING_REGISTER_RESTRICTIONS can only be made once. Afterwards
      it is not possible to change restrictions anymore.
      This prevents untrusted code from removing restrictions.
      Suggested-by: NStefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: NStefano Garzarella <sgarzare@redhat.com>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      21b55dbc
    • J
      io_uring: reference ->nsproxy for file table commands · 9b828492
      Jens Axboe 提交于
      If we don't get and assign the namespace for the async work, then certain
      paths just don't work properly (like /dev/stdin, /proc/mounts, etc).
      Anything that references the current namespace of the given task should
      be assigned for async work on behalf of that task.
      
      Cc: stable@vger.kernel.org # v5.5+
      Reported-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      9b828492
    • J
      io_uring: don't rely on weak ->files references · 0f212204
      Jens Axboe 提交于
      Grab actual references to the files_struct. To avoid circular references
      issues due to this, we add a per-task note that keeps track of what
      io_uring contexts a task has used. When the tasks execs or exits its
      assigned files, we cancel requests based on this tracking.
      
      With that, we can grab proper references to the files table, and no
      longer need to rely on stashing away ring_fd and ring_file to check
      if the ring_fd may have been closed.
      
      Cc: stable@vger.kernel.org # v5.5+
      Reviewed-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0f212204
    • J
      io_uring: enable task/files specific overflow flushing · e6c8aa9a
      Jens Axboe 提交于
      This allows us to selectively flush out pending overflows, depending on
      the task and/or files_struct being passed in.
      
      No intended functional changes in this patch.
      Reviewed-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e6c8aa9a
    • J
      io_uring: return cancelation status from poll/timeout/files handlers · 76e1b642
      Jens Axboe 提交于
      Return whether we found and canceled requests or not. This is in
      preparation for using this information, no functional changes in this
      patch.
      Reviewed-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      76e1b642
    • J
      io_uring: unconditionally grab req->task · e3bc8e9d
      Jens Axboe 提交于
      Sometimes we assign a weak reference to it, sometimes we grab a
      reference to it. Clean this up and make it unconditional, and drop the
      flag related to tracking this state.
      Reviewed-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e3bc8e9d
    • J
      io_uring: stash ctx task reference for SQPOLL · 2aede0e4
      Jens Axboe 提交于
      We can grab a reference to the task instead of stashing away the task
      files_struct. This is doable without creating a circular reference
      between the ring fd and the task itself.
      Reviewed-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2aede0e4
    • J
      io_uring: move dropping of files into separate helper · f573d384
      Jens Axboe 提交于
      No functional changes in this patch, prep patch for grabbing references
      to the files_struct.
      Reviewed-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f573d384
    • J
      io_uring: allow timeout/poll/files killing to take task into account · f3606e3a
      Jens Axboe 提交于
      We currently cancel these when the ring exits, and we cancel all of
      them. This is in preparation for killing only the ones associated
      with a given task.
      Reviewed-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f3606e3a
  2. 29 9月, 2020 1 次提交
    • H
      io_uring: fix async buffered reads when readahead is disabled · c8d317aa
      Hao Xu 提交于
      The async buffered reads feature is not working when readahead is
      turned off. There are two things to concern:
      
      - when doing retry in io_read, not only the IOCB_WAITQ flag but also
        the IOCB_NOWAIT flag is still set, which makes it goes to would_block
        phase in generic_file_buffered_read() and then return -EAGAIN. After
        that, the io-wq thread work is queued, and later doing the async
        reads in the old way.
      
      - even if we remove IOCB_NOWAIT when doing retry, the feature is still
        not running properly, since in generic_file_buffered_read() it goes to
        lock_page_killable() after calling mapping->a_ops->readpage() to do
        IO, and thus causing process to sleep.
      
      Fixes: 1a0a7853 ("mm: support async buffered reads in generic_file_buffered_read()")
      Fixes: 3b2a4439 ("io_uring: get rid of kiocb_wait_page_queue_init()")
      Signed-off-by: NHao Xu <haoxu@linux.alibaba.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c8d317aa
  3. 28 9月, 2020 2 次提交
    • J
      io_uring: fix potential ABBA deadlock in ->show_fdinfo() · fad8e0de
      Jens Axboe 提交于
      syzbot reports a potential lock deadlock between the normal IO path and
      ->show_fdinfo():
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.9.0-rc6-syzkaller #0 Not tainted
      ------------------------------------------------------
      syz-executor.2/19710 is trying to acquire lock:
      ffff888098ddc450 (sb_writers#4){.+.+}-{0:0}, at: io_write+0x6b5/0xb30 fs/io_uring.c:3296
      
      but task is already holding lock:
      ffff8880a11b8428 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0xe9a/0x1bd0 fs/io_uring.c:8348
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #2 (&ctx->uring_lock){+.+.}-{3:3}:
             __mutex_lock_common kernel/locking/mutex.c:956 [inline]
             __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
             __io_uring_show_fdinfo fs/io_uring.c:8417 [inline]
             io_uring_show_fdinfo+0x194/0xc70 fs/io_uring.c:8460
             seq_show+0x4a8/0x700 fs/proc/fd.c:65
             seq_read+0x432/0x1070 fs/seq_file.c:208
             do_loop_readv_writev fs/read_write.c:734 [inline]
             do_loop_readv_writev fs/read_write.c:721 [inline]
             do_iter_read+0x48e/0x6e0 fs/read_write.c:955
             vfs_readv+0xe5/0x150 fs/read_write.c:1073
             kernel_readv fs/splice.c:355 [inline]
             default_file_splice_read.constprop.0+0x4e6/0x9e0 fs/splice.c:412
             do_splice_to+0x137/0x170 fs/splice.c:871
             splice_direct_to_actor+0x307/0x980 fs/splice.c:950
             do_splice_direct+0x1b3/0x280 fs/splice.c:1059
             do_sendfile+0x55f/0xd40 fs/read_write.c:1540
             __do_sys_sendfile64 fs/read_write.c:1601 [inline]
             __se_sys_sendfile64 fs/read_write.c:1587 [inline]
             __x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1587
             do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #1 (&p->lock){+.+.}-{3:3}:
             __mutex_lock_common kernel/locking/mutex.c:956 [inline]
             __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
             seq_read+0x61/0x1070 fs/seq_file.c:155
             pde_read fs/proc/inode.c:306 [inline]
             proc_reg_read+0x221/0x300 fs/proc/inode.c:318
             do_loop_readv_writev fs/read_write.c:734 [inline]
             do_loop_readv_writev fs/read_write.c:721 [inline]
             do_iter_read+0x48e/0x6e0 fs/read_write.c:955
             vfs_readv+0xe5/0x150 fs/read_write.c:1073
             kernel_readv fs/splice.c:355 [inline]
             default_file_splice_read.constprop.0+0x4e6/0x9e0 fs/splice.c:412
             do_splice_to+0x137/0x170 fs/splice.c:871
             splice_direct_to_actor+0x307/0x980 fs/splice.c:950
             do_splice_direct+0x1b3/0x280 fs/splice.c:1059
             do_sendfile+0x55f/0xd40 fs/read_write.c:1540
             __do_sys_sendfile64 fs/read_write.c:1601 [inline]
             __se_sys_sendfile64 fs/read_write.c:1587 [inline]
             __x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1587
             do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #0 (sb_writers#4){.+.+}-{0:0}:
             check_prev_add kernel/locking/lockdep.c:2496 [inline]
             check_prevs_add kernel/locking/lockdep.c:2601 [inline]
             validate_chain kernel/locking/lockdep.c:3218 [inline]
             __lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4441
             lock_acquire+0x1f3/0xaf0 kernel/locking/lockdep.c:5029
             percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
             __sb_start_write+0x228/0x450 fs/super.c:1672
             io_write+0x6b5/0xb30 fs/io_uring.c:3296
             io_issue_sqe+0x18f/0x5c50 fs/io_uring.c:5719
             __io_queue_sqe+0x280/0x1160 fs/io_uring.c:6175
             io_queue_sqe+0x692/0xfa0 fs/io_uring.c:6254
             io_submit_sqe fs/io_uring.c:6324 [inline]
             io_submit_sqes+0x1761/0x2400 fs/io_uring.c:6521
             __do_sys_io_uring_enter+0xeac/0x1bd0 fs/io_uring.c:8349
             do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      other info that might help us debug this:
      
      Chain exists of:
        sb_writers#4 --> &p->lock --> &ctx->uring_lock
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&ctx->uring_lock);
                                     lock(&p->lock);
                                     lock(&ctx->uring_lock);
        lock(sb_writers#4);
      
       *** DEADLOCK ***
      
      1 lock held by syz-executor.2/19710:
       #0: ffff8880a11b8428 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0xe9a/0x1bd0 fs/io_uring.c:8348
      
      stack backtrace:
      CPU: 0 PID: 19710 Comm: syz-executor.2 Not tainted 5.9.0-rc6-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x198/0x1fd lib/dump_stack.c:118
       check_noncircular+0x324/0x3e0 kernel/locking/lockdep.c:1827
       check_prev_add kernel/locking/lockdep.c:2496 [inline]
       check_prevs_add kernel/locking/lockdep.c:2601 [inline]
       validate_chain kernel/locking/lockdep.c:3218 [inline]
       __lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4441
       lock_acquire+0x1f3/0xaf0 kernel/locking/lockdep.c:5029
       percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
       __sb_start_write+0x228/0x450 fs/super.c:1672
       io_write+0x6b5/0xb30 fs/io_uring.c:3296
       io_issue_sqe+0x18f/0x5c50 fs/io_uring.c:5719
       __io_queue_sqe+0x280/0x1160 fs/io_uring.c:6175
       io_queue_sqe+0x692/0xfa0 fs/io_uring.c:6254
       io_submit_sqe fs/io_uring.c:6324 [inline]
       io_submit_sqes+0x1761/0x2400 fs/io_uring.c:6521
       __do_sys_io_uring_enter+0xeac/0x1bd0 fs/io_uring.c:8349
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x45e179
      Code: 3d b2 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 0b b2 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f1194e74c78 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
      RAX: ffffffffffffffda RBX: 00000000000082c0 RCX: 000000000045e179
      RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000004
      RBP: 000000000118cf98 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 000000000118cf4c
      R13: 00007ffd1aa5756f R14: 00007f1194e759c0 R15: 000000000118cf4c
      
      Fix this by just not diving into details if we fail to trylock the
      io_uring mutex. We know the ctx isn't going away during this operation,
      but we cannot safely iterate buffers/files/personalities if we don't
      hold the io_uring mutex.
      
      Reported-by: syzbot+2f8fa4e860edc3066aba@syzkaller.appspotmail.com
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      fad8e0de
    • J
      io_uring: always delete double poll wait entry on match · 8706e04e
      Jens Axboe 提交于
      syzbot reports a crash with tty polling, which is using the double poll
      handling:
      
      general protection fault, probably for non-canonical address 0xdffffc0000000009: 0000 [#1] PREEMPT SMP KASAN
      KASAN: null-ptr-deref in range [0x0000000000000048-0x000000000000004f]
      CPU: 0 PID: 6874 Comm: syz-executor749 Not tainted 5.9.0-rc6-next-20200924-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:io_poll_get_single fs/io_uring.c:4778 [inline]
      RIP: 0010:io_poll_double_wake+0x51/0x510 fs/io_uring.c:4845
      Code: fc ff df 48 c1 ea 03 80 3c 02 00 0f 85 9e 03 00 00 48 b8 00 00 00 00 00 fc ff df 49 8b 5d 08 48 8d 7b 48 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 06 0f 8e 63 03 00 00 0f b6 6b 48 bf 06 00 00
      RSP: 0018:ffffc90001c1fb70 EFLAGS: 00010006
      RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000004
      RDX: 0000000000000009 RSI: ffffffff81d9b3ad RDI: 0000000000000048
      RBP: dffffc0000000000 R08: ffff8880a3cac798 R09: ffffc90001c1fc60
      R10: fffff52000383f73 R11: 0000000000000000 R12: 0000000000000004
      R13: ffff8880a3cac798 R14: ffff8880a3cac7a0 R15: 0000000000000004
      FS:  0000000001f98880(0000) GS:ffff8880ae400000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f18886916c0 CR3: 0000000094c5a000 CR4: 00000000001506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       __wake_up_common+0x147/0x650 kernel/sched/wait.c:93
       __wake_up_common_lock+0xd0/0x130 kernel/sched/wait.c:123
       tty_ldisc_hangup+0x1cf/0x680 drivers/tty/tty_ldisc.c:735
       __tty_hangup.part.0+0x403/0x870 drivers/tty/tty_io.c:625
       __tty_hangup drivers/tty/tty_io.c:575 [inline]
       tty_vhangup+0x1d/0x30 drivers/tty/tty_io.c:698
       pty_close+0x3f5/0x550 drivers/tty/pty.c:79
       tty_release+0x455/0xf60 drivers/tty/tty_io.c:1679
       __fput+0x285/0x920 fs/file_table.c:281
       task_work_run+0xdd/0x190 kernel/task_work.c:141
       tracehook_notify_resume include/linux/tracehook.h:188 [inline]
       exit_to_user_mode_loop kernel/entry/common.c:165 [inline]
       exit_to_user_mode_prepare+0x1e2/0x1f0 kernel/entry/common.c:192
       syscall_exit_to_user_mode+0x7a/0x2c0 kernel/entry/common.c:267
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x401210
      
      which is due to a failure in removing the double poll wait entry if we
      hit a wakeup match. This can cause multiple invocations of the wakeup,
      which isn't safe.
      
      Cc: stable@vger.kernel.org # v5.8
      Reported-by: syzbot+81b3883093f772addf6d@syzkaller.appspotmail.com
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      8706e04e
  4. 26 9月, 2020 1 次提交
  5. 25 9月, 2020 1 次提交