1. 04 2月, 2020 5 次提交
    • Z
      io_uring: fix the sequence comparison in io_sequence_defer · 170a1188
      Zhengyuan Liu 提交于
      commit dbd0f6d6c2a11eb9c31ca9cd454f95bb5713e92e upstream.
      
      sq->cached_sq_head and cq->cached_cq_tail are both unsigned int. If
      cached_sq_head overflows before cached_cq_tail, then we may miss a
      barrier req. As cached_cq_tail always follows cached_sq_head, the NQ
      should be enough.
      
      Cc: stable@vger.kernel.org
      Fixes: de0617e46717 ("io_uring: add support for marking commands as draining")
      Signed-off-by: NZhengyuan Liu <liuzhengyuan@kylinos.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      170a1188
    • J
      io_uring: fix io_sq_thread_stop running in front of io_sq_thread · db4c235e
      Jackie Liu 提交于
      commit a4c0b3decb33fb4a2b5ecc6234a50680f0b21e7d upstream.
      
      INFO: task syz-executor.5:8634 blocked for more than 143 seconds.
             Not tainted 5.2.0-rc5+ #3
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      syz-executor.5  D25632  8634   8224 0x00004004
      Call Trace:
        context_switch kernel/sched/core.c:2818 [inline]
        __schedule+0x658/0x9e0 kernel/sched/core.c:3445
        schedule+0x131/0x1d0 kernel/sched/core.c:3509
        schedule_timeout+0x9a/0x2b0 kernel/time/timer.c:1783
        do_wait_for_common+0x35e/0x5a0 kernel/sched/completion.c:83
        __wait_for_common kernel/sched/completion.c:104 [inline]
        wait_for_common kernel/sched/completion.c:115 [inline]
        wait_for_completion+0x47/0x60 kernel/sched/completion.c:136
        kthread_stop+0xb4/0x150 kernel/kthread.c:559
        io_sq_thread_stop fs/io_uring.c:2252 [inline]
        io_finish_async fs/io_uring.c:2259 [inline]
        io_ring_ctx_free fs/io_uring.c:2770 [inline]
        io_ring_ctx_wait_and_kill+0x268/0x880 fs/io_uring.c:2834
        io_uring_release+0x5d/0x70 fs/io_uring.c:2842
        __fput+0x2e4/0x740 fs/file_table.c:280
        ____fput+0x15/0x20 fs/file_table.c:313
        task_work_run+0x17e/0x1b0 kernel/task_work.c:113
        tracehook_notify_resume include/linux/tracehook.h:185 [inline]
        exit_to_usermode_loop arch/x86/entry/common.c:168 [inline]
        prepare_exit_to_usermode+0x402/0x4f0 arch/x86/entry/common.c:199
        syscall_return_slowpath+0x110/0x440 arch/x86/entry/common.c:279
        do_syscall_64+0x126/0x140 arch/x86/entry/common.c:304
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x412fb1
      Code: 80 3b 7c 0f 84 c7 02 00 00 c7 85 d0 00 00 00 00 00 00 00 48 8b 05 cf
      a6 24 00 49 8b 14 24 41 b9 cb 2a 44 00 48 89 ee 48 89 df <48> 85 c0 4c 0f
      45 c8 45 31 c0 31 c9 e8 0e 5b 00 00 85 c0 41 89 c7
      RSP: 002b:00007ffe7ee6a180 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
      RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000412fb1
      RDX: 0000001b2d920000 RSI: 0000000000000000 RDI: 0000000000000003
      RBP: 0000000000000001 R08: 00000000f3a3e1f8 R09: 00000000f3a3e1fc
      R10: 00007ffe7ee6a260 R11: 0000000000000293 R12: 000000000075c9a0
      R13: 000000000075c9a0 R14: 0000000000024c00 R15: 000000000075bf2c
      
      =============================================
      
      There is an wrong logic, when kthread_park running
      in front of io_sq_thread.
      
      CPU#0					CPU#1
      
      io_sq_thread_stop:			int kthread(void *_create):
      
      kthread_park()
      					__kthread_parkme(self);	 <<< Wrong
      kthread_stop()
          << wait for self->exited
          << clear_bit KTHREAD_SHOULD_PARK
      
      					ret = threadfn(data);
      					   |
      					   |- io_sq_thread
      					       |- kthread_should_park()	<< false
      					       |- schedule() <<< nobody wake up
      
      stuck CPU#0				stuck CPU#1
      
      So, use a new variable sqo_thread_started to ensure that io_sq_thread
      run first, then io_sq_thread_stop.
      
      Reported-by: syzbot+94324416c485d422fe15@syzkaller.appspotmail.com
      Suggested-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      db4c235e
    • J
      io_uring: add support for recvmsg() · 7cabfcb1
      Jens Axboe 提交于
      commit aa1fa28fc73ea6b740ee7b62bf3b07141883dbb8 upstream.
      
      This is done through IORING_OP_RECVMSG. This opcode uses the same
      sqe->msg_flags that IORING_OP_SENDMSG added, and we pass in the
      msghdr struct in the sqe->addr field as well.
      
      We use MSG_DONTWAIT to force an inline fast path if recvmsg() doesn't
      block, and punt to async execution if it would have.
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      7cabfcb1
    • J
      io_uring: add support for sendmsg() · ce630b99
      Jens Axboe 提交于
      commit 0fa03c624d8fc9932d0f27c39a9deca6a37e0e17 upstream.
      
      This is done through IORING_OP_SENDMSG. There's a new sqe->msg_flags
      for the flags argument, and the msghdr struct is passed in the
      sqe->addr field.
      
      We use MSG_DONTWAIT to force an inline fast path if sendmsg() doesn't
      block, and punt to async execution if it would have.
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      ce630b99
    • C
      block: never take page references for ITER_BVEC · 3f9da4d9
      Christoph Hellwig 提交于
      Cherry-pick from commit b620743077e291ae7d0debd21f50413a8c266229 upstream.
      
      If we pass pages through an iov_iter we always already have a reference
      in the caller.  Thus remove the ITER_BVEC_FLAG_NO_REF and don't take
      reference to pages by default for bvec backed iov_iters.
      
      [Joseph] Resolve conflicts since we don't have:
      81ba6abd2bcd "block: loop: mark bvec as ITER_BVEC_FLAG_NO_REF"
      7321ecbfc7cf "block: change how we get page references in bio_iov_iter_get_pages"
      Reviewed-by: NMinwoo Im <minwoo.im.dev@gmail.com>
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      3f9da4d9
  2. 03 2月, 2020 21 次提交
  3. 15 1月, 2020 2 次提交
    • X
      alinux: jbd2: track slow handle which is preventing transaction committing · 861575c9
      Xiaoguang Wang 提交于
      While transaction is going to commit, it first sets its state to be
      T_LOCKED and waits all outstanding handles to complete, and the
      committing transaction will always be in locked state so long as it
      has outstanding handles, also the whole fs will be locked and all later
      fs modification operations will be stucked in wait_transaction_locked().
      
      It's hard to tell why handles are that slow, so here we add a new staic
      tracepoint to track such slow handle, and show io wait time and sched
      wait time, output likes below:
        fsstress-20347 [024] ....  1570.305454: jbd2_slow_handle_stats: dev 254,17
      tid 15853 type 4 line_no 3101 interval 126 sync 0 requested_blocks 24
      dirtied_blocks 0 trans_wait 122 space_wait 0 sched_wait 0 io_wait 126
      
      "trans_wait 122" means that this current committing transaction has been
      locked for 122ms, due to this handle is not completed quickly.
      
      From "io_wait 126", we can see that io is the major reason.
      
      In this patch, we also add a per fs control file used to determine
      whether a handle can be considered to be slow.
          /proc/fs/jbd2/vdb1-8/stall_thresh
      default value is 100ms, users can set new threshold by echoing new value
      to this file.
      
      Later I also plan to add a proc file fs per fs to record these info.
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      861575c9
    • X
      alinux: fs: record page or bio info while process is waitting on it · 79209707
      Xiaoguang Wang 提交于
      If one process context is stucked in wait_on_buffer(), lock_buffer(),
      lock_page() and wait_on_page_writeback() and wait_on_bit_io(), it's
      hard to tell ture reason, for example, whether this page is under io,
      or this page is just locked too long by other process context.
      
      Normally io request has multiple bios, and every bio contains multiple
      pages which will hold data to be read from or written to device, so here
      we record page info or bio info in task_struct while process calls
      lock_page(), lock_buffer(), wait_on_page_writeback(), wait_on_buffer()
      and wait_on_bit_io(), we add a new proce interface:
      [lege@localhost linux]$ cat /proc/4516/wait_res
      1 ffffd0969f95d3c0 4295369599 4295381596
      
      Above info means that thread 4516 is waitting on a page, address is
      ffffd0969f95d3c0, and has waited for 11997ms.
      
      First field denotes the page address process is waitting on.
      Second field denotes the wait moment and the third denotes current moment.
      
      In practice, if we found a process waitting on one page for too long time,
      we can get page's address by reading /proc/$pid/wait_page, and search this
      page address in all block devices' /sys/kernel/debug/block/${devname}/rq_hang,
      if search operation hits one, we can get the request and know why this io
      request hangs that long.
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      79209707
  4. 12 1月, 2020 12 次提交
    • M
      io_uring: avoid page allocation warnings · 8f74a3b0
      Mark Rutland 提交于
      commit d4ef647510b1200fe1c996ff1cbf5ac47eb930cc upstream.
      
      In io_sqe_buffer_register() we allocate a number of arrays based on the
      iov_len from the user-provided iov. While we limit iov_len to SZ_1G,
      we can still attempt to allocate arrays exceeding MAX_ORDER.
      
      On a 64-bit system with 4KiB pages, for an iov where iov_base = 0x10 and
      iov_len = SZ_1G, we'll calculate that nr_pages = 262145. When we try to
      allocate a corresponding array of (16-byte) bio_vecs, requiring 4194320
      bytes, which is greater than 4MiB. This results in SLUB warning that
      we're trying to allocate greater than MAX_ORDER, and failing the
      allocation.
      
      Avoid this by using kvmalloc() for allocations dependent on the
      user-provided iov_len. At the same time, fix a leak of imu->bvec when
      registration fails.
      
      Full splat from before this patch:
      
      WARNING: CPU: 1 PID: 2314 at mm/page_alloc.c:4595 __alloc_pages_nodemask+0x7ac/0x2938 mm/page_alloc.c:4595
      Kernel panic - not syncing: panic_on_warn set ...
      CPU: 1 PID: 2314 Comm: syz-executor326 Not tainted 5.1.0-rc7-dirty #4
      Hardware name: linux,dummy-virt (DT)
      Call trace:
       dump_backtrace+0x0/0x2f0 include/linux/compiler.h:193
       show_stack+0x20/0x30 arch/arm64/kernel/traps.c:158
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x110/0x190 lib/dump_stack.c:113
       panic+0x384/0x68c kernel/panic.c:214
       __warn+0x2bc/0x2c0 kernel/panic.c:571
       report_bug+0x228/0x2d8 lib/bug.c:186
       bug_handler+0xa0/0x1a0 arch/arm64/kernel/traps.c:956
       call_break_hook arch/arm64/kernel/debug-monitors.c:301 [inline]
       brk_handler+0x1d4/0x388 arch/arm64/kernel/debug-monitors.c:316
       do_debug_exception+0x1a0/0x468 arch/arm64/mm/fault.c:831
       el1_dbg+0x18/0x8c
       __alloc_pages_nodemask+0x7ac/0x2938 mm/page_alloc.c:4595
       alloc_pages_current+0x164/0x278 mm/mempolicy.c:2132
       alloc_pages include/linux/gfp.h:509 [inline]
       kmalloc_order+0x20/0x50 mm/slab_common.c:1231
       kmalloc_order_trace+0x30/0x2b0 mm/slab_common.c:1243
       kmalloc_large include/linux/slab.h:480 [inline]
       __kmalloc+0x3dc/0x4f0 mm/slub.c:3791
       kmalloc_array include/linux/slab.h:670 [inline]
       io_sqe_buffer_register fs/io_uring.c:2472 [inline]
       __io_uring_register fs/io_uring.c:2962 [inline]
       __do_sys_io_uring_register fs/io_uring.c:3008 [inline]
       __se_sys_io_uring_register fs/io_uring.c:2990 [inline]
       __arm64_sys_io_uring_register+0x9e0/0x1bc8 fs/io_uring.c:2990
       __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
       invoke_syscall arch/arm64/kernel/syscall.c:47 [inline]
       el0_svc_common.constprop.0+0x148/0x2e0 arch/arm64/kernel/syscall.c:83
       el0_svc_handler+0xdc/0x100 arch/arm64/kernel/syscall.c:129
       el0_svc+0x8/0xc arch/arm64/kernel/entry.S:948
      SMP: stopping secondary CPUs
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Kernel Offset: disabled
      CPU features: 0x002,23000438
      Memory Limit: none
      Rebooting in 1 seconds..
      
      Fixes: edafccee56ff3167 ("io_uring: add support for pre-mapped user IO buffers")
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-block@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      8f74a3b0
    • J
      io_uring: drop req submit reference always in async punt · ff7a6c23
      Jens Axboe 提交于
      commit 817869d2519f0cb7be5b3482129dadc806dfb747 upstream.
      
      If we don't end up actually calling submit in io_sq_wq_submit_work(),
      we still need to drop the submit reference to the request. If we
      don't, then we can leak the request. This can happen if we race
      with ring shutdown while flushing the workqueue for requests that
      require use of the mm_struct.
      
      Fixes: e65ef56db494 ("io_uring: use regular request ref counts")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      ff7a6c23
    • M
      io_uring: free allocated io_memory once · 956fe246
      Mark Rutland 提交于
      commit 52e04ef4c9d459cba3afd86ec335a411b40b7fd2 upstream.
      
      If io_allocate_scq_urings() fails to allocate an sq_* region, it will
      call io_mem_free() for any previously allocated regions, but leave
      dangling pointers to these regions in the ctx. Any regions which have
      not yet been allocated are left NULL. Note that when returning
      -EOVERFLOW, the previously allocated sq_ring is not freed, which appears
      to be an unintentional leak.
      
      When io_allocate_scq_urings() fails, io_uring_create() will call
      io_ring_ctx_wait_and_kill(), which calls io_mem_free() on all the sq_*
      regions, assuming the pointers are valid and not NULL.
      
      This can result in pages being freed multiple times, which has been
      observed to corrupt the page state, leading to subsequent fun. This can
      also result in virt_to_page() on NULL, resulting in the use of bogus
      page addresses, and yet more subsequent fun. The latter can be detected
      with CONFIG_DEBUG_VIRTUAL on arm64.
      
      Adding a cleanup path to io_allocate_scq_urings() complicates the logic,
      so let's leave it to io_ring_ctx_free() to consistently free these
      pointers, and simplify the io_allocate_scq_urings() error paths.
      
      Full splats from before this patch below. Note that the pointer logged
      by the DEBUG_VIRTUAL "non-linear address" warning has been hashed, and
      is actually NULL.
      
      [   26.098129] page:ffff80000e949a00 count:0 mapcount:-128 mapping:0000000000000000 index:0x0
      [   26.102976] flags: 0x63fffc000000()
      [   26.104373] raw: 000063fffc000000 ffff80000e86c188 ffff80000ea3df08 0000000000000000
      [   26.108917] raw: 0000000000000000 0000000000000001 00000000ffffff7f 0000000000000000
      [   26.137235] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
      [   26.143960] ------------[ cut here ]------------
      [   26.146020] kernel BUG at include/linux/mm.h:547!
      [   26.147586] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
      [   26.149163] Modules linked in:
      [   26.150287] Process syz-executor.21 (pid: 20204, stack limit = 0x000000000e9cefeb)
      [   26.153307] CPU: 2 PID: 20204 Comm: syz-executor.21 Not tainted 5.1.0-rc7-00004-g7d30b2ea43d6 #18
      [   26.156566] Hardware name: linux,dummy-virt (DT)
      [   26.158089] pstate: 40400005 (nZcv daif +PAN -UAO)
      [   26.159869] pc : io_mem_free+0x9c/0xa8
      [   26.161436] lr : io_mem_free+0x9c/0xa8
      [   26.162720] sp : ffff000013003d60
      [   26.164048] x29: ffff000013003d60 x28: ffff800025048040
      [   26.165804] x27: 0000000000000000 x26: ffff800025048040
      [   26.167352] x25: 00000000000000c0 x24: ffff0000112c2820
      [   26.169682] x23: 0000000000000000 x22: 0000000020000080
      [   26.171899] x21: ffff80002143b418 x20: ffff80002143b400
      [   26.174236] x19: ffff80002143b280 x18: 0000000000000000
      [   26.176607] x17: 0000000000000000 x16: 0000000000000000
      [   26.178997] x15: 0000000000000000 x14: 0000000000000000
      [   26.181508] x13: 00009178a5e077b2 x12: 0000000000000001
      [   26.183863] x11: 0000000000000000 x10: 0000000000000980
      [   26.186437] x9 : ffff000013003a80 x8 : ffff800025048a20
      [   26.189006] x7 : ffff8000250481c0 x6 : ffff80002ffe9118
      [   26.191359] x5 : ffff80002ffe9118 x4 : 0000000000000000
      [   26.193863] x3 : ffff80002ffefe98 x2 : 44c06ddd107d1f00
      [   26.196642] x1 : 0000000000000000 x0 : 000000000000003e
      [   26.198892] Call trace:
      [   26.199893]  io_mem_free+0x9c/0xa8
      [   26.201155]  io_ring_ctx_wait_and_kill+0xec/0x180
      [   26.202688]  io_uring_setup+0x6c4/0x6f0
      [   26.204091]  __arm64_sys_io_uring_setup+0x18/0x20
      [   26.205576]  el0_svc_common.constprop.0+0x7c/0xe8
      [   26.207186]  el0_svc_handler+0x28/0x78
      [   26.208389]  el0_svc+0x8/0xc
      [   26.209408] Code: aa0203e0 d0006861 9133a021 97fcdc3c (d4210000)
      [   26.211995] ---[ end trace bdb81cd43a21e50d ]---
      
      [   81.770626] ------------[ cut here ]------------
      [   81.825015] virt_to_phys used for non-linear address: 000000000d42f2c7 (          (null))
      [   81.827860] WARNING: CPU: 1 PID: 30171 at arch/arm64/mm/physaddr.c:15 __virt_to_phys+0x48/0x68
      [   81.831202] Modules linked in:
      [   81.832212] CPU: 1 PID: 30171 Comm: syz-executor.20 Not tainted 5.1.0-rc7-00004-g7d30b2ea43d6 #19
      [   81.835616] Hardware name: linux,dummy-virt (DT)
      [   81.836863] pstate: 60400005 (nZCv daif +PAN -UAO)
      [   81.838727] pc : __virt_to_phys+0x48/0x68
      [   81.840572] lr : __virt_to_phys+0x48/0x68
      [   81.842264] sp : ffff80002cf67c70
      [   81.843858] x29: ffff80002cf67c70 x28: ffff800014358e18
      [   81.846463] x27: 0000000000000000 x26: 0000000020000080
      [   81.849148] x25: 0000000000000000 x24: ffff80001bb01f40
      [   81.851986] x23: ffff200011db06c8 x22: ffff2000127e3c60
      [   81.854351] x21: ffff800014358cc0 x20: ffff800014358d98
      [   81.856711] x19: 0000000000000000 x18: 0000000000000000
      [   81.859132] x17: 0000000000000000 x16: 0000000000000000
      [   81.861586] x15: 0000000000000000 x14: 0000000000000000
      [   81.863905] x13: 0000000000000000 x12: ffff1000037603e9
      [   81.866226] x11: 1ffff000037603e8 x10: 0000000000000980
      [   81.868776] x9 : ffff80002cf67840 x8 : ffff80001bb02920
      [   81.873272] x7 : ffff1000037603e9 x6 : ffff80001bb01f47
      [   81.875266] x5 : ffff1000037603e9 x4 : dfff200000000000
      [   81.876875] x3 : ffff200010087528 x2 : ffff1000059ecf58
      [   81.878751] x1 : 44c06ddd107d1f00 x0 : 0000000000000000
      [   81.880453] Call trace:
      [   81.881164]  __virt_to_phys+0x48/0x68
      [   81.882919]  io_mem_free+0x18/0x110
      [   81.886585]  io_ring_ctx_wait_and_kill+0x13c/0x1f0
      [   81.891212]  io_uring_setup+0xa60/0xad0
      [   81.892881]  __arm64_sys_io_uring_setup+0x2c/0x38
      [   81.894398]  el0_svc_common.constprop.0+0xac/0x150
      [   81.896306]  el0_svc_handler+0x34/0x88
      [   81.897744]  el0_svc+0x8/0xc
      [   81.898715] ---[ end trace b4a703802243cbba ]---
      
      Fixes: 2b188cc1bb857a9d ("Add io_uring IO interface")
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-block@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      956fe246
    • M
      io_uring: fix SQPOLL cpu validation · de675a8f
      Mark Rutland 提交于
      commit 975554b03eddc1df73bda3a764a09e18cadd5f1c upstream.
      
      In io_sq_offload_start(), we call cpu_possible() on an unbounded cpu
      value from userspace. On v5.1-rc7 on arm64 with
      CONFIG_DEBUG_PER_CPU_MAPS, this results in a splat:
      
        WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpu_max_bits_warn include/linux/cpumask.h:121 [inline]
      
      There was an attempt to fix this in commit:
      
        917257daa0fea7a0 ("io_uring: only test SQPOLL cpu after we've verified it")
      
      ... by adding a check after the cpu value had been limited to NR_CPU_IDS
      using array_index_nospec(). However, this left an unbound check at the
      start of the function, for which the warning still fires.
      
      Let's fix this correctly by checking that the cpu value is bound by
      nr_cpu_ids before passing it to cpu_possible(). Note that only
      nr_cpu_ids of a cpumask are guaranteed to exist at runtime, and
      nr_cpu_ids can be significantly smaller than NR_CPUs. For example, an
      arm64 defconfig has NR_CPUS=256, while my test VM has 4 vCPUs.
      
      Following the intent from the commit message for 917257daa0fea7a0, the
      check is moved under the SQ_AFF branch, which is the only branch where
      the cpu values is consumed. The check is performed before bounding the
      value with array_index_nospec() so that we don't silently accept bogus
      cpu values from userspace, where array_index_nospec() would force these
      values to 0.
      
      I suspect we can remove the array_index_nospec() call entirely, but I've
      conservatively left that in place, updated to use nr_cpu_ids to match
      the prior check.
      
      Tested on arm64 with the Syzkaller reproducer:
      
        https://syzkaller.appspot.com/bug?extid=cd714a07c6de2bc34293
        https://syzkaller.appspot.com/x/repro.syz?x=15d8b397200000
      
      Full splat from before this patch:
      
      WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpu_max_bits_warn include/linux/cpumask.h:121 [inline]
      WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpumask_check include/linux/cpumask.h:128 [inline]
      WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpumask_test_cpu include/linux/cpumask.h:344 [inline]
      WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 io_sq_offload_start fs/io_uring.c:2244 [inline]
      WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 io_uring_create fs/io_uring.c:2864 [inline]
      WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 io_uring_setup+0x1108/0x15a0 fs/io_uring.c:2916
      Kernel panic - not syncing: panic_on_warn set ...
      CPU: 1 PID: 27601 Comm: syz-executor.0 Not tainted 5.1.0-rc7 #3
      Hardware name: linux,dummy-virt (DT)
      Call trace:
       dump_backtrace+0x0/0x2f0 include/linux/compiler.h:193
       show_stack+0x20/0x30 arch/arm64/kernel/traps.c:158
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x110/0x190 lib/dump_stack.c:113
       panic+0x384/0x68c kernel/panic.c:214
       __warn+0x2bc/0x2c0 kernel/panic.c:571
       report_bug+0x228/0x2d8 lib/bug.c:186
       bug_handler+0xa0/0x1a0 arch/arm64/kernel/traps.c:956
       call_break_hook arch/arm64/kernel/debug-monitors.c:301 [inline]
       brk_handler+0x1d4/0x388 arch/arm64/kernel/debug-monitors.c:316
       do_debug_exception+0x1a0/0x468 arch/arm64/mm/fault.c:831
       el1_dbg+0x18/0x8c
       cpu_max_bits_warn include/linux/cpumask.h:121 [inline]
       cpumask_check include/linux/cpumask.h:128 [inline]
       cpumask_test_cpu include/linux/cpumask.h:344 [inline]
       io_sq_offload_start fs/io_uring.c:2244 [inline]
       io_uring_create fs/io_uring.c:2864 [inline]
       io_uring_setup+0x1108/0x15a0 fs/io_uring.c:2916
       __do_sys_io_uring_setup fs/io_uring.c:2929 [inline]
       __se_sys_io_uring_setup fs/io_uring.c:2926 [inline]
       __arm64_sys_io_uring_setup+0x50/0x70 fs/io_uring.c:2926
       __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
       invoke_syscall arch/arm64/kernel/syscall.c:47 [inline]
       el0_svc_common.constprop.0+0x148/0x2e0 arch/arm64/kernel/syscall.c:83
       el0_svc_handler+0xdc/0x100 arch/arm64/kernel/syscall.c:129
       el0_svc+0x8/0xc arch/arm64/kernel/entry.S:948
      SMP: stopping secondary CPUs
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Kernel Offset: disabled
      CPU features: 0x002,23000438
      Memory Limit: none
      Rebooting in 1 seconds..
      
      Fixes: 917257daa0fea7a0 ("io_uring: only test SQPOLL cpu after we've verified it")
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-block@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      
      Simplied the logic
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      de675a8f
    • J
      io_uring: have submission side sqe errors post a cqe · 39c1adf7
      Jens Axboe 提交于
      commit 5c8b0b54db22c54f2aec991b388f550d3a927f26 upstream.
      
      Currently we only post a cqe if we get an error OUTSIDE of submission.
      For submission, we return the error directly through io_uring_enter().
      This is a bit awkward for applications, and it makes more sense to
      always post a cqe with an error, if the error happens on behalf of an
      sqe.
      
      This changes submission behavior a bit. io_uring_enter() returns -ERROR
      for an error, and > 0 for number of sqes submitted. Before this change,
      if you wanted to submit 8 entries and had an error on the 5th entry,
      io_uring_enter() would return 4 (for number of entries successfully
      submitted) and rewind the sqring. The application would then have to
      peek at the sqring and figure out what was wrong with the head sqe, and
      then skip it itself. With this change, we'll return 5 since we did
      consume 5 sqes, and the last sqe (with the error) will result in a cqe
      being posted with the error.
      
      This makes the logic easier to handle in the application, and it cleans
      up the submission part.
      Suggested-by: NStefan Bühler <source@stbuehler.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      39c1adf7
    • S
      io_uring: remove unnecessary barrier after unsetting IORING_SQ_NEED_WAKEUP · 31f0e246
      Stefan Bühler 提交于
      commit 62977281a6384d3904c02272a638cc3ac3bac54d upstream.
      
      There is no operation to order with afterwards, and removing the flag is
      not critical in any way.
      
      There will always be a "race condition" where the application will
      trigger IORING_ENTER_SQ_WAKEUP when it isn't actually needed.
      Signed-off-by: NStefan Bühler <source@stbuehler.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      31f0e246
    • S
      io_uring: remove unnecessary barrier after incrementing dropped counter · b3764e21
      Stefan Bühler 提交于
      commit b841f19524a16cd93a39f9306191f85c549a2bc2 upstream.
      
      smp_store_release in io_commit_sqring already orders the store to
      dropped before the update to SQ head.
      Signed-off-by: NStefan Bühler <source@stbuehler.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      b3764e21
    • S
      io_uring: remove unnecessary barrier before reading SQ tail · 9b88a603
      Stefan Bühler 提交于
      commit 82ab082c0e2f8592c2ff6b2ab99a92d8406c8c2c upstream.
      
      There is no operation before to order with.
      Signed-off-by: NStefan Bühler <source@stbuehler.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      9b88a603
    • S
      io_uring: remove unnecessary barrier after updating SQ head · c8034b7a
      Stefan Bühler 提交于
      commit 9e4c15a3939448d2ea9b9bf59561183bbe3fdc49 upstream.
      
      There is no operation afterwards to order with.
      Signed-off-by: NStefan Bühler <source@stbuehler.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      c8034b7a
    • S
      io_uring: remove unnecessary barrier before reading cq head · 9c231919
      Stefan Bühler 提交于
      commit 115e12e58dbc055e98c965e3255aed7b20214f95 upstream.
      
      The memory operations before reading cq head are unrelated and we
      don't care about their order.
      
      Document that the control dependency in combination with READ_ONCE and
      WRITE_ONCE forms a barrier we need.
      Signed-off-by: NStefan Bühler <source@stbuehler.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      9c231919
    • S
      io_uring: remove unnecessary barrier before wq_has_sleeper · fc6b8e86
      Stefan Bühler 提交于
      commit 4f7067c3fb7f2974363a28c597a41949d971af02 upstream.
      
      wq_has_sleeper has a full barrier internally. The smp_rmb barrier in
      io_uring_poll synchronizes with it.
      Signed-off-by: NStefan Bühler <source@stbuehler.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      fc6b8e86
    • S
      io_uring: fix notes on barriers · a79e96b2
      Stefan Bühler 提交于
      commit 1e84b97b7377bd0198f87b49ad3e396e84bf0458 upstream.
      
      The application reading the CQ ring needs a barrier to pair with the
      smp_store_release in io_commit_cqring, not the barrier after it.
      
      Also a write barrier *after* writing something (but not *before*
      writing anything interesting) doesn't order anything, so an smp_wmb()
      after writing SQ tail is not needed.
      
      Additionally consider reading SQ head and writing CQ tail in the notes.
      
      Also add some clarifications how the various other fields in the ring
      buffers are used.
      Signed-off-by: NStefan Bühler <source@stbuehler.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      a79e96b2