17 January 2020, 40 commits
    • io_uring: avoid page allocation warnings · 71f6bbff
      Committed by Mark Rutland
      commit d4ef647510b1200fe1c996ff1cbf5ac47eb930cc upstream.
      
      In io_sqe_buffer_register() we allocate a number of arrays based on the
      iov_len from the user-provided iov. While we limit iov_len to SZ_1G,
      we can still attempt to allocate arrays exceeding MAX_ORDER.
      
      On a 64-bit system with 4KiB pages, for an iov where iov_base = 0x10 and
      iov_len = SZ_1G, we'll calculate that nr_pages = 262145. The
      corresponding array of (16-byte) bio_vecs then requires 4194320 bytes,
      which is greater than 4MiB. This results in SLUB warning that we're
      trying to allocate greater than MAX_ORDER, and failing the
      allocation.
      
      Avoid this by using kvmalloc() for allocations dependent on the
      user-provided iov_len. At the same time, fix a leak of imu->bvec when
      registration fails.
      
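      A minimal sketch of the resulting pattern (kernel-style C; the local
      variable names are assumptions, not the upstream diff): size the
      iov_len-derived arrays with kvmalloc_array() so oversized requests fall
      back to vmalloc, and release them with kvfree(), which also covers the
      imu->bvec leak on the failure path.

      pages = kvmalloc_array(nr_pages, sizeof(struct page *), GFP_KERNEL);
      imu->bvec = kvmalloc_array(nr_pages, sizeof(struct bio_vec), GFP_KERNEL);
      if (!pages || !imu->bvec) {
              kvfree(imu->bvec);      /* kvfree(NULL) is a no-op */
              kvfree(pages);
              return -ENOMEM;
      }
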
      Full splat from before this patch:
      
      WARNING: CPU: 1 PID: 2314 at mm/page_alloc.c:4595 __alloc_pages_nodemask+0x7ac/0x2938 mm/page_alloc.c:4595
      Kernel panic - not syncing: panic_on_warn set ...
      CPU: 1 PID: 2314 Comm: syz-executor326 Not tainted 5.1.0-rc7-dirty #4
      Hardware name: linux,dummy-virt (DT)
      Call trace:
       dump_backtrace+0x0/0x2f0 include/linux/compiler.h:193
       show_stack+0x20/0x30 arch/arm64/kernel/traps.c:158
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x110/0x190 lib/dump_stack.c:113
       panic+0x384/0x68c kernel/panic.c:214
       __warn+0x2bc/0x2c0 kernel/panic.c:571
       report_bug+0x228/0x2d8 lib/bug.c:186
       bug_handler+0xa0/0x1a0 arch/arm64/kernel/traps.c:956
       call_break_hook arch/arm64/kernel/debug-monitors.c:301 [inline]
       brk_handler+0x1d4/0x388 arch/arm64/kernel/debug-monitors.c:316
       do_debug_exception+0x1a0/0x468 arch/arm64/mm/fault.c:831
       el1_dbg+0x18/0x8c
       __alloc_pages_nodemask+0x7ac/0x2938 mm/page_alloc.c:4595
       alloc_pages_current+0x164/0x278 mm/mempolicy.c:2132
       alloc_pages include/linux/gfp.h:509 [inline]
       kmalloc_order+0x20/0x50 mm/slab_common.c:1231
       kmalloc_order_trace+0x30/0x2b0 mm/slab_common.c:1243
       kmalloc_large include/linux/slab.h:480 [inline]
       __kmalloc+0x3dc/0x4f0 mm/slub.c:3791
       kmalloc_array include/linux/slab.h:670 [inline]
       io_sqe_buffer_register fs/io_uring.c:2472 [inline]
       __io_uring_register fs/io_uring.c:2962 [inline]
       __do_sys_io_uring_register fs/io_uring.c:3008 [inline]
       __se_sys_io_uring_register fs/io_uring.c:2990 [inline]
       __arm64_sys_io_uring_register+0x9e0/0x1bc8 fs/io_uring.c:2990
       __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
       invoke_syscall arch/arm64/kernel/syscall.c:47 [inline]
       el0_svc_common.constprop.0+0x148/0x2e0 arch/arm64/kernel/syscall.c:83
       el0_svc_handler+0xdc/0x100 arch/arm64/kernel/syscall.c:129
       el0_svc+0x8/0xc arch/arm64/kernel/entry.S:948
      SMP: stopping secondary CPUs
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Kernel Offset: disabled
      CPU features: 0x002,23000438
      Memory Limit: none
      Rebooting in 1 seconds..
      
      Fixes: edafccee56ff3167 ("io_uring: add support for pre-mapped user IO buffers")
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-block@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: drop req submit reference always in async punt · 3f1e3e7d
      Committed by Jens Axboe
      commit 817869d2519f0cb7be5b3482129dadc806dfb747 upstream.
      
      If we don't end up actually calling submit in io_sq_wq_submit_work(),
      we still need to drop the submit reference to the request. If we
      don't, then we can leak the request. This can happen if we race
      with ring shutdown while flushing the workqueue for requests that
      require use of the mm_struct.
      
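      The shape of the fix, sketched (io_put_req() is the era's internal
      helper; surrounding code abridged): when the submit never happened,
      the submit reference is still ours to drop.

      if (ret) {
              /* we never called __io_submit_sqe(), so the submit
               * reference taken at queue time must be dropped here */
              io_put_req(req);
      }
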
      Fixes: e65ef56db494 ("io_uring: use regular request ref counts")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: free allocated io_memory once · a064fda3
      Committed by Mark Rutland
      commit 52e04ef4c9d459cba3afd86ec335a411b40b7fd2 upstream.
      
      If io_allocate_scq_urings() fails to allocate an sq_* region, it will
      call io_mem_free() for any previously allocated regions, but leave
      dangling pointers to these regions in the ctx. Any regions which have
      not yet been allocated are left NULL. Note that when returning
      -EOVERFLOW, the previously allocated sq_ring is not freed, which appears
      to be an unintentional leak.
      
      When io_allocate_scq_urings() fails, io_uring_create() will call
      io_ring_ctx_wait_and_kill(), which calls io_mem_free() on all the sq_*
      regions, assuming the pointers are valid and not NULL.
      
      This can result in pages being freed multiple times, which has been
      observed to corrupt the page state, leading to subsequent fun. This can
      also result in virt_to_page() on NULL, resulting in the use of bogus
      page addresses, and yet more subsequent fun. The latter can be detected
      with CONFIG_DEBUG_VIRTUAL on arm64.
      
      Adding a cleanup path to io_allocate_scq_urings() complicates the logic,
      so let's leave it to io_ring_ctx_free() to consistently free these
      pointers, and simplify the io_allocate_scq_urings() error paths.
      
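      A sketch of the consolidated free path (roughly the io_mem_free() of
      that era with a NULL guard added; treat it as illustrative rather than
      the exact diff):

      static void io_mem_free(void *ptr)
      {
              struct page *page;

              /* regions that were never allocated stay NULL in the ctx */
              if (!ptr)
                      return;

              page = virt_to_head_page(ptr);
              if (put_page_testzero(page))
                      free_compound_page(page);
      }
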
      Full splats from before this patch below. Note that the pointer logged
      by the DEBUG_VIRTUAL "non-linear address" warning has been hashed, and
      is actually NULL.
      
      [   26.098129] page:ffff80000e949a00 count:0 mapcount:-128 mapping:0000000000000000 index:0x0
      [   26.102976] flags: 0x63fffc000000()
      [   26.104373] raw: 000063fffc000000 ffff80000e86c188 ffff80000ea3df08 0000000000000000
      [   26.108917] raw: 0000000000000000 0000000000000001 00000000ffffff7f 0000000000000000
      [   26.137235] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
      [   26.143960] ------------[ cut here ]------------
      [   26.146020] kernel BUG at include/linux/mm.h:547!
      [   26.147586] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
      [   26.149163] Modules linked in:
      [   26.150287] Process syz-executor.21 (pid: 20204, stack limit = 0x000000000e9cefeb)
      [   26.153307] CPU: 2 PID: 20204 Comm: syz-executor.21 Not tainted 5.1.0-rc7-00004-g7d30b2ea43d6 #18
      [   26.156566] Hardware name: linux,dummy-virt (DT)
      [   26.158089] pstate: 40400005 (nZcv daif +PAN -UAO)
      [   26.159869] pc : io_mem_free+0x9c/0xa8
      [   26.161436] lr : io_mem_free+0x9c/0xa8
      [   26.162720] sp : ffff000013003d60
      [   26.164048] x29: ffff000013003d60 x28: ffff800025048040
      [   26.165804] x27: 0000000000000000 x26: ffff800025048040
      [   26.167352] x25: 00000000000000c0 x24: ffff0000112c2820
      [   26.169682] x23: 0000000000000000 x22: 0000000020000080
      [   26.171899] x21: ffff80002143b418 x20: ffff80002143b400
      [   26.174236] x19: ffff80002143b280 x18: 0000000000000000
      [   26.176607] x17: 0000000000000000 x16: 0000000000000000
      [   26.178997] x15: 0000000000000000 x14: 0000000000000000
      [   26.181508] x13: 00009178a5e077b2 x12: 0000000000000001
      [   26.183863] x11: 0000000000000000 x10: 0000000000000980
      [   26.186437] x9 : ffff000013003a80 x8 : ffff800025048a20
      [   26.189006] x7 : ffff8000250481c0 x6 : ffff80002ffe9118
      [   26.191359] x5 : ffff80002ffe9118 x4 : 0000000000000000
      [   26.193863] x3 : ffff80002ffefe98 x2 : 44c06ddd107d1f00
      [   26.196642] x1 : 0000000000000000 x0 : 000000000000003e
      [   26.198892] Call trace:
      [   26.199893]  io_mem_free+0x9c/0xa8
      [   26.201155]  io_ring_ctx_wait_and_kill+0xec/0x180
      [   26.202688]  io_uring_setup+0x6c4/0x6f0
      [   26.204091]  __arm64_sys_io_uring_setup+0x18/0x20
      [   26.205576]  el0_svc_common.constprop.0+0x7c/0xe8
      [   26.207186]  el0_svc_handler+0x28/0x78
      [   26.208389]  el0_svc+0x8/0xc
      [   26.209408] Code: aa0203e0 d0006861 9133a021 97fcdc3c (d4210000)
      [   26.211995] ---[ end trace bdb81cd43a21e50d ]---
      
      [   81.770626] ------------[ cut here ]------------
      [   81.825015] virt_to_phys used for non-linear address: 000000000d42f2c7 (          (null))
      [   81.827860] WARNING: CPU: 1 PID: 30171 at arch/arm64/mm/physaddr.c:15 __virt_to_phys+0x48/0x68
      [   81.831202] Modules linked in:
      [   81.832212] CPU: 1 PID: 30171 Comm: syz-executor.20 Not tainted 5.1.0-rc7-00004-g7d30b2ea43d6 #19
      [   81.835616] Hardware name: linux,dummy-virt (DT)
      [   81.836863] pstate: 60400005 (nZCv daif +PAN -UAO)
      [   81.838727] pc : __virt_to_phys+0x48/0x68
      [   81.840572] lr : __virt_to_phys+0x48/0x68
      [   81.842264] sp : ffff80002cf67c70
      [   81.843858] x29: ffff80002cf67c70 x28: ffff800014358e18
      [   81.846463] x27: 0000000000000000 x26: 0000000020000080
      [   81.849148] x25: 0000000000000000 x24: ffff80001bb01f40
      [   81.851986] x23: ffff200011db06c8 x22: ffff2000127e3c60
      [   81.854351] x21: ffff800014358cc0 x20: ffff800014358d98
      [   81.856711] x19: 0000000000000000 x18: 0000000000000000
      [   81.859132] x17: 0000000000000000 x16: 0000000000000000
      [   81.861586] x15: 0000000000000000 x14: 0000000000000000
      [   81.863905] x13: 0000000000000000 x12: ffff1000037603e9
      [   81.866226] x11: 1ffff000037603e8 x10: 0000000000000980
      [   81.868776] x9 : ffff80002cf67840 x8 : ffff80001bb02920
      [   81.873272] x7 : ffff1000037603e9 x6 : ffff80001bb01f47
      [   81.875266] x5 : ffff1000037603e9 x4 : dfff200000000000
      [   81.876875] x3 : ffff200010087528 x2 : ffff1000059ecf58
      [   81.878751] x1 : 44c06ddd107d1f00 x0 : 0000000000000000
      [   81.880453] Call trace:
      [   81.881164]  __virt_to_phys+0x48/0x68
      [   81.882919]  io_mem_free+0x18/0x110
      [   81.886585]  io_ring_ctx_wait_and_kill+0x13c/0x1f0
      [   81.891212]  io_uring_setup+0xa60/0xad0
      [   81.892881]  __arm64_sys_io_uring_setup+0x2c/0x38
      [   81.894398]  el0_svc_common.constprop.0+0xac/0x150
      [   81.896306]  el0_svc_handler+0x34/0x88
      [   81.897744]  el0_svc+0x8/0xc
      [   81.898715] ---[ end trace b4a703802243cbba ]---
      
      Fixes: 2b188cc1bb857a9d ("Add io_uring IO interface")
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-block@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: fix SQPOLL cpu validation · 5425cccc
      Committed by Mark Rutland
      commit 975554b03eddc1df73bda3a764a09e18cadd5f1c upstream.
      
      In io_sq_offload_start(), we call cpu_possible() on an unbounded cpu
      value from userspace. On v5.1-rc7 on arm64 with
      CONFIG_DEBUG_PER_CPU_MAPS, this results in a splat:
      
        WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpu_max_bits_warn include/linux/cpumask.h:121 [inline]
      
      There was an attempt to fix this in commit:
      
        917257daa0fea7a0 ("io_uring: only test SQPOLL cpu after we've verified it")
      
      ... by adding a check after the cpu value had been limited to NR_CPU_IDS
      using array_index_nospec(). However, this left an unbounded check at the
      start of the function, for which the warning still fires.
      
      Let's fix this correctly by checking that the cpu value is bound by
      nr_cpu_ids before passing it to cpu_possible(). Note that only
      nr_cpu_ids of a cpumask are guaranteed to exist at runtime, and
      nr_cpu_ids can be significantly smaller than NR_CPUS. For example, an
      arm64 defconfig has NR_CPUS=256, while my test VM has 4 vCPUs.
      
      Following the intent from the commit message for 917257daa0fea7a0, the
      check is moved under the SQ_AFF branch, which is the only branch where
      the cpu value is consumed. The check is performed before bounding the
      value with array_index_nospec() so that we don't silently accept bogus
      cpu values from userspace, where array_index_nospec() would force these
      values to 0.
      
      I suspect we can remove the array_index_nospec() call entirely, but I've
      conservatively left that in place, updated to use nr_cpu_ids to match
      the prior check.
      
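      The resulting check, sketched (field names from the io_uring_params and
      ctx of that era; not a verbatim diff):

      if (p->flags & IORING_SETUP_SQ_AFF) {
              int cpu = p->sq_thread_cpu;

              ret = -EINVAL;
              /* only nr_cpu_ids bits of the cpumask are guaranteed to exist */
              if (cpu >= nr_cpu_ids)
                      goto err;
              if (!cpu_possible(cpu))
                      goto err;
              /* ... create the SQPOLL thread bound to 'cpu' ... */
      }
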
      Tested on arm64 with the Syzkaller reproducer:
      
        https://syzkaller.appspot.com/bug?extid=cd714a07c6de2bc34293
        https://syzkaller.appspot.com/x/repro.syz?x=15d8b397200000
      
      Full splat from before this patch:
      
      WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpu_max_bits_warn include/linux/cpumask.h:121 [inline]
      WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpumask_check include/linux/cpumask.h:128 [inline]
      WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpumask_test_cpu include/linux/cpumask.h:344 [inline]
      WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 io_sq_offload_start fs/io_uring.c:2244 [inline]
      WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 io_uring_create fs/io_uring.c:2864 [inline]
      WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 io_uring_setup+0x1108/0x15a0 fs/io_uring.c:2916
      Kernel panic - not syncing: panic_on_warn set ...
      CPU: 1 PID: 27601 Comm: syz-executor.0 Not tainted 5.1.0-rc7 #3
      Hardware name: linux,dummy-virt (DT)
      Call trace:
       dump_backtrace+0x0/0x2f0 include/linux/compiler.h:193
       show_stack+0x20/0x30 arch/arm64/kernel/traps.c:158
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x110/0x190 lib/dump_stack.c:113
       panic+0x384/0x68c kernel/panic.c:214
       __warn+0x2bc/0x2c0 kernel/panic.c:571
       report_bug+0x228/0x2d8 lib/bug.c:186
       bug_handler+0xa0/0x1a0 arch/arm64/kernel/traps.c:956
       call_break_hook arch/arm64/kernel/debug-monitors.c:301 [inline]
       brk_handler+0x1d4/0x388 arch/arm64/kernel/debug-monitors.c:316
       do_debug_exception+0x1a0/0x468 arch/arm64/mm/fault.c:831
       el1_dbg+0x18/0x8c
       cpu_max_bits_warn include/linux/cpumask.h:121 [inline]
       cpumask_check include/linux/cpumask.h:128 [inline]
       cpumask_test_cpu include/linux/cpumask.h:344 [inline]
       io_sq_offload_start fs/io_uring.c:2244 [inline]
       io_uring_create fs/io_uring.c:2864 [inline]
       io_uring_setup+0x1108/0x15a0 fs/io_uring.c:2916
       __do_sys_io_uring_setup fs/io_uring.c:2929 [inline]
       __se_sys_io_uring_setup fs/io_uring.c:2926 [inline]
       __arm64_sys_io_uring_setup+0x50/0x70 fs/io_uring.c:2926
       __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
       invoke_syscall arch/arm64/kernel/syscall.c:47 [inline]
       el0_svc_common.constprop.0+0x148/0x2e0 arch/arm64/kernel/syscall.c:83
       el0_svc_handler+0xdc/0x100 arch/arm64/kernel/syscall.c:129
       el0_svc+0x8/0xc arch/arm64/kernel/entry.S:948
      SMP: stopping secondary CPUs
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Kernel Offset: disabled
      CPU features: 0x002,23000438
      Memory Limit: none
      Rebooting in 1 seconds..
      
      Fixes: 917257daa0fea7a0 ("io_uring: only test SQPOLL cpu after we've verified it")
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-block@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      
      Simplified the logic.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: have submission side sqe errors post a cqe · b0195834
      Committed by Jens Axboe
      commit 5c8b0b54db22c54f2aec991b388f550d3a927f26 upstream.
      
      Currently we only post a cqe if we get an error OUTSIDE of submission.
      For submission, we return the error directly through io_uring_enter().
      This is a bit awkward for applications, and it makes more sense to
      always post a cqe with an error, if the error happens on behalf of an
      sqe.
      
      This changes submission behavior a bit. io_uring_enter() returns -ERROR
      for an error, and > 0 for number of sqes submitted. Before this change,
      if you wanted to submit 8 entries and had an error on the 5th entry,
      io_uring_enter() would return 4 (for number of entries successfully
      submitted) and rewind the sqring. The application would then have to
      peek at the sqring and figure out what was wrong with the head sqe, and
      then skip it itself. With this change, we'll return 5 since we did
      consume 5 sqes, and the last sqe (with the error) will result in a cqe
      being posted with the error.
      
      This makes the logic easier to handle in the application, and it cleans
      up the submission part.
      Suggested-by: Stefan Bühler <source@stbuehler.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: remove unnecessary barrier after unsetting IORING_SQ_NEED_WAKEUP · 51097e80
      Committed by Stefan Bühler
      commit 62977281a6384d3904c02272a638cc3ac3bac54d upstream.
      
      There is no operation to order with afterwards, and removing the flag is
      not critical in any way.
      
      There will always be a "race condition" where the application will
      trigger IORING_ENTER_SQ_WAKEUP when it isn't actually needed.
      Signed-off-by: Stefan Bühler <source@stbuehler.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: remove unnecessary barrier after incrementing dropped counter · 2af9e86a
      Committed by Stefan Bühler
      commit b841f19524a16cd93a39f9306191f85c549a2bc2 upstream.
      
      smp_store_release in io_commit_sqring already orders the store to
      dropped before the update to SQ head.
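      In other words (sketch with assumed field names), the release store that
      publishes the new SQ head already orders every earlier store, including
      the dropped update, so no smp_wmb() is needed in between:

      /* io_get_sqring() failure path: no barrier needed after this store */
      ring->dropped++;

      /* io_commit_sqring(): the release store that publishes the new head
       * also orders the dropped update before it for any acquire reader */
      smp_store_release(&ring->r.head, ctx->cached_sq_head);
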
      Signed-off-by: Stefan Bühler <source@stbuehler.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: remove unnecessary barrier before reading SQ tail · 27288aff
      Committed by Stefan Bühler
      commit 82ab082c0e2f8592c2ff6b2ab99a92d8406c8c2c upstream.
      
      There is no operation before to order with.
      Signed-off-by: Stefan Bühler <source@stbuehler.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: remove unnecessary barrier after updating SQ head · ddc03f6d
      Committed by Stefan Bühler
      commit 9e4c15a3939448d2ea9b9bf59561183bbe3fdc49 upstream.
      
      There is no operation afterwards to order with.
      Signed-off-by: Stefan Bühler <source@stbuehler.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: remove unnecessary barrier before reading cq head · 45f8cbda
      Committed by Stefan Bühler
      commit 115e12e58dbc055e98c965e3255aed7b20214f95 upstream.
      
      The memory operations before reading cq head are unrelated and we
      don't care about their order.
      
      Document that the control dependency in combination with READ_ONCE and
      WRITE_ONCE forms a barrier we need.
      Signed-off-by: Stefan Bühler <source@stbuehler.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: remove unnecessary barrier before wq_has_sleeper · 64956e14
      Committed by Stefan Bühler
      commit 4f7067c3fb7f2974363a28c597a41949d971af02 upstream.
      
      wq_has_sleeper has a full barrier internally. The smp_rmb barrier in
      io_uring_poll synchronizes with it.
      Signed-off-by: Stefan Bühler <source@stbuehler.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: fix notes on barriers · 2286b79e
      Committed by Stefan Bühler
      commit 1e84b97b7377bd0198f87b49ad3e396e84bf0458 upstream.
      
      The application reading the CQ ring needs a barrier to pair with the
      smp_store_release in io_commit_cqring, not the barrier after it.
      
      Also a write barrier *after* writing something (but not *before*
      writing anything interesting) doesn't order anything, so an smp_wmb()
      after writing SQ tail is not needed.
      
      Additionally consider reading SQ head and writing CQ tail in the notes.
      
      Also add some clarifications how the various other fields in the ring
      buffers are used.
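
      The pairing the note describes, sketched with assumed field names: the
      kernel publishes new completions with a release store of the CQ tail,
      and the application's read side is what has to pair with that store
      (an acquire load of the tail, or an equivalent read barrier after
      loading it), not a barrier placed after the kernel store.

      /* kernel, io_commit_cqring(): make the filled cqes visible */
      smp_store_release(&ring->r.tail, ctx->cached_cq_tail);

      /* application (kernel notation for illustration): pair with the
       * store above before touching the cqes up to 'tail' */
      tail = smp_load_acquire(&ring->r.tail);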
      Signed-off-by: Stefan Bühler <source@stbuehler.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: fix handling SQEs requesting NOWAIT · fb367b56
      Committed by Stefan Bühler
      commit 8449eedaa1da6a51d67190c905b1b54243e095f6 upstream.
      
      Not all request types set REQ_F_FORCE_NONBLOCK when they need async
      punting; reverse the logic instead and set REQ_F_NOWAIT if the request
      must not be punted.
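
      Roughly what the reversed logic looks like in io_prep_rw() (flag names
      follow the commit; the surrounding code is an assumption):

      /* the request itself asked for RWF_NOWAIT: never punt it async */
      if (kiocb->ki_flags & IOCB_NOWAIT)
              req->flags |= REQ_F_NOWAIT;

      /* otherwise only go non-blocking when the submit path wants it */
      if (force_nonblock)
              kiocb->ki_flags |= IOCB_NOWAIT;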
      Signed-off-by: Stefan Bühler <source@stbuehler.de>
      
      Merged with my previous patch for this.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: remove 'state' argument from io_{read,write} path · 54578c52
      Committed by Jens Axboe
      commit 8358e3a8264a228cf2dfb6f3a05c0328f4118f12 upstream.
      
      Since commit 09bb839434b we don't use the state argument for any sort
      of on-stack caching in the io read and write path. Remove the stale
      and unused argument from them, and bubble it up to __io_submit_sqe()
      and down to io_prep_rw().
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: fix poll full SQ detection · af1b6490
      Committed by Stefan Bühler
      commit 0d7bae69c574c5f25802f8a71252e7d66933a3ab upstream.
      
      io_uring_poll shouldn't signal EPOLLOUT | EPOLLWRNORM if the queue is
      full; the old check would always signal EPOLLOUT | EPOLLWRNORM (unless
      there were U32_MAX - 1 entries in the SQ queue).
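
      The corrected check, sketched (ring field names as in that era's
      io_uring; treat it as illustrative):

      /* io_uring_poll(): writable only while the SQ ring has free slots */
      if (READ_ONCE(ring->r.tail) - ctx->cached_sq_head != ring->ring_entries)
              mask |= EPOLLOUT | EPOLLWRNORM;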
      Signed-off-by: Stefan Bühler <source@stbuehler.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: fix race condition when sq thread goes sleeping · 03bf0f6f
      Committed by Stefan Bühler
      commit 0d7bae69c574c5f25802f8a71252e7d66933a3ab upstream.
      
      Reading the SQ tail needs to come after setting IORING_SQ_NEED_WAKEUP in
      flags; there is no cheap barrier for ordering a store before a load, so
      a full memory barrier is required.
      
      Userspace needs a full memory barrier between updating SQ tail and
      checking for the IORING_SQ_NEED_WAKEUP too.
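
      Sketch of the kernel side (helper and field names assumed): the flag
      store has to be ordered before the SQ tail re-check, and only a full
      barrier orders a store against a later load.

      ctx->sq_ring->flags |= IORING_SQ_NEED_WAKEUP;
      /* make sure the tail is (re)read after the flag store is ordered */
      smp_mb();
      if (!io_get_sqring(ctx, &sqes[0]))
              schedule();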
      Signed-off-by: Stefan Bühler <source@stbuehler.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: fix race condition reading SQ entries · 95d7a8c9
      Committed by Stefan Bühler
      commit e523a29c4f2703bdb98f68ce1bb256e259fd8d5f upstream.
      
      A read memory barrier is required between reading SQ tail and reading
      the actual data belonging to the SQ entry.
      
      Userspace needs a matching write barrier between writing SQ entries and
      updating SQ tail (using smp_store_release to update tail will do).
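
      One way to get that ordering in the kernel, sketched with assumed field
      names (an acquire load of the tail is the tidier equivalent):

      if (head == READ_ONCE(ring->r.tail))
              return false;
      /* order the tail load before loading the entry it just published;
       * pairs with the application's smp_store_release() of the tail */
      smp_rmb();
      index = READ_ONCE(ring->array[head & ctx->sq_mask]);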
      Signed-off-by: Stefan Bühler <source@stbuehler.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: fail io_uring_register(2) on a dying io_uring instance · 581b8394
      Committed by Jens Axboe
      commit 35fa71a030caa50458a043560d4814ea9bcd639f upstream.
      
      If we have multiple threads doing io_uring_register(2) on an io_uring
      fd, then we can potentially try and kill the percpu reference while
      someone else has already killed it.
      
      Prevent this race by failing io_uring_register(2) if the ref is marked
      dying. This is safe since we're inside the io_uring mutex.
      
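      The guard amounts to the following sketch (we already hold the io_uring
      mutex here, so the check cannot race with another register call):

      if (percpu_ref_is_dying(&ctx->refs))
              return -ENXIO;
      percpu_ref_kill(&ctx->refs);
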
      Fixes: b19062a56726 ("io_uring: fix possible deadlock between io_uring_{enter,register}")
      Reported-by: syzbot <syzbot+10d25e23199614b7721f@syzkaller.appspotmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: fix CQ overflow condition · eddff5bd
      Committed by Jens Axboe
      commit 74f464e97044da33b25aaed00213914b0edf1f2e upstream.
      
      This is a leftover from when the rings initially were not free flowing,
      and hence a test for tail + 1 == head would indicate full. Since we now
      let them wrap instead of masking them with the size, we need to check if
      they drift more than the ring size from each other.
      
      This fixes a case where we'd overwrite CQ ring entries, if the user
      failed to reap completions. Both cases would ultimately result in lost
      completions as the application violated the depth it asked for. The only
      difference is that before this fix we'd return invalid entries for the
      overflowed completions, instead of properly flagging it in the
      cq_ring->overflow variable.
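
      The fixed "ring full" test, sketched with assumed field names: with
      free-flowing indices the ring is full exactly when tail and head are a
      whole ring apart.

      /* io_get_cqring(): no free cqe if the producer is a full ring ahead */
      if (ctx->cached_cq_tail - READ_ONCE(ring->r.head) == ring->ring_entries)
              return NULL;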
      Reported-by: Stefan Bühler <source@stbuehler.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: fix possible deadlock between io_uring_{enter,register} · ad0e85b1
      Committed by Jens Axboe
      commit b19062a567266ee1f10f6709325f766bbcc07d1c upstream.
      
      If we have multiple threads, one doing io_uring_enter() while the other
      is doing io_uring_register(), we can run into a deadlock between the
      two. io_uring_register() must wait for existing users of the io_uring
      instance to exit. But it does so while holding the io_uring mutex.
      Callers of io_uring_enter() may need this mutex to make progress (and
      eventually exit). If we wait for users to exit in io_uring_register(),
      we can't do so with the io_uring mutex held without potentially risking
      a deadlock.
      
      Drop the io_uring mutex while waiting for existing callers to exit. This
      is safe and guaranteed to make forward progress, since we already killed
      the percpu ref before doing so. Hence later callers of io_uring_enter()
      will be rejected.
      
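      The shape of the fix, sketched (ctx field names as in that era's
      io_uring): kill the ref first, then wait with the mutex dropped.

      percpu_ref_kill(&ctx->refs);
      /* io_uring_enter() callers may need uring_lock to finish and drop
       * their reference, so don't hold it across the wait */
      mutex_unlock(&ctx->uring_lock);
      wait_for_completion(&ctx->ctx_done);
      mutex_lock(&ctx->uring_lock);
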
      Reported-by: syzbot+16dc03452dee970a0c3e@syzkaller.appspotmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • arch: add io_uring syscalls everywhere · cb6fa366
      Committed by Arnd Bergmann
      Cherry-pick from commit 39036cd2727395c3369b1051005da74059a85317
      upstream.
      
      Add the io_uring system calls to all architectures.
      
      These system calls are designed to handle both native and compat tasks,
      so all entries are the same across architectures; only arm-compat and
      the generic table still use an old format.
      
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> (s390)
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: drop io_file_put() 'file' argument · 52460705
      Committed by Jens Axboe
      commit 3d6770fbd9353988839611bab107e4e891506aad upstream.
      
      Since the fget/fput handling was reworked in commit 09bb839434bd, we
      never call io_file_put() with state == NULL (and hence file != NULL)
      anymore. Remove that case.
      Reported-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: only test SQPOLL cpu after we've verified it · e738eb4e
      Committed by Jens Axboe
      commit 917257daa0fea7a007102691c0e27d9216a96768 upstream.
      
      We currently call cpu_possible() even if we don't use the CPU. Move the
      test under the SQ_AFF branch, which is the only place where we'll use
      the value. Do the cpu_possible() test AFTER we've limited it to a max
      of NR_CPUS. This avoids triggering the following warning:
      
      WARNING: CPU: 1 PID: 7600 at include/linux/cpumask.h:121 cpu_max_bits_warn
      
      if CONFIG_DEBUG_PER_CPU_MAPS is enabled.
      
      While in there, also move the SQ thread idle period assignment inside
      SETUP_SQPOLL, as we don't use it otherwise either.
      
      Reported-by: syzbot+cd714a07c6de2bc34293@syzkaller.appspotmail.com
      Fixes: 6c271ce2f1d5 ("io_uring: add submission polling")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: park SQPOLL thread if it's percpu · 35fea020
      Committed by Jens Axboe
      commit 06058632464845abb1af91521122fd04dd3daaec upstream.
      
      kthread expects this, or we can throw a warning on exit:
      
      WARNING: CPU: 0 PID: 7822 at kernel/kthread.c:399
      __kthread_bind_mask+0x3b/0xc0 kernel/kthread.c:399
      Kernel panic - not syncing: panic_on_warn set ...
      CPU: 0 PID: 7822 Comm: syz-executor030 Not tainted 5.1.0-rc4-next-20190412
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      Call Trace:
        __dump_stack lib/dump_stack.c:77 [inline]
        dump_stack+0x172/0x1f0 lib/dump_stack.c:113
        panic+0x2cb/0x72b kernel/panic.c:214
        __warn.cold+0x20/0x46 kernel/panic.c:576
        report_bug+0x263/0x2b0 lib/bug.c:186
        fixup_bug arch/x86/kernel/traps.c:179 [inline]
        fixup_bug arch/x86/kernel/traps.c:174 [inline]
        do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:272
        do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:291
        invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:973
      RIP: 0010:__kthread_bind_mask+0x3b/0xc0 kernel/kthread.c:399
      Code: 48 89 fb e8 f7 ab 24 00 4c 89 e6 48 89 df e8 ac e1 02 00 31 ff 49 89
      c4 48 89 c6 e8 7f ad 24 00 4d 85 e4 75 15 e8 d5 ab 24 00 <0f> 0b e8 ce ab
      24 00 5b 41 5c 41 5d 41 5e 5d c3 e8 c0 ab 24 00 4c
      RSP: 0018:ffff8880a89bfbb8 EFLAGS: 00010293
      RAX: ffff88808ca7a280 RBX: ffff8880a98e4380 RCX: ffffffff814bdd11
      RDX: 0000000000000000 RSI: ffffffff814bdd1b RDI: 0000000000000007
      RBP: ffff8880a89bfbd8 R08: ffff88808ca7a280 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      R13: ffffffff87691148 R14: ffff8880a98e43a0 R15: ffffffff81c91e10
        __kthread_bind kernel/kthread.c:412 [inline]
        kthread_unpark+0x123/0x160 kernel/kthread.c:480
        kthread_stop+0xfa/0x6c0 kernel/kthread.c:556
        io_sq_thread_stop fs/io_uring.c:2057 [inline]
        io_sq_thread_stop fs/io_uring.c:2052 [inline]
        io_finish_async+0xab/0x180 fs/io_uring.c:2064
        io_ring_ctx_free fs/io_uring.c:2534 [inline]
        io_ring_ctx_wait_and_kill+0x133/0x510 fs/io_uring.c:2591
        io_uring_release+0x42/0x50 fs/io_uring.c:2599
        __fput+0x2e5/0x8d0 fs/file_table.c:278
        ____fput+0x16/0x20 fs/file_table.c:309
        task_work_run+0x14a/0x1c0 kernel/task_work.c:113
        exit_task_work include/linux/task_work.h:22 [inline]
        do_exit+0x90a/0x2fa0 kernel/exit.c:876
        do_group_exit+0x135/0x370 kernel/exit.c:980
        __do_sys_exit_group kernel/exit.c:991 [inline]
        __se_sys_exit_group kernel/exit.c:989 [inline]
        __x64_sys_exit_group+0x44/0x50 kernel/exit.c:989
        do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Reported-by: syzbot+6d4a92619eb0ad08602b@syzkaller.appspotmail.com
      Fixes: 6c271ce2f1d5 ("io_uring: add submission polling")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: restrict IORING_SETUP_SQPOLL to root · 29745a0d
      Committed by Jens Axboe
      commit 3ec482d15cb986bf08b923f9193eeddb3b9ca69f upstream.
      
      This option spawns a kernel side thread that will poll for submissions
      (and completions, if IORING_SETUP_IOPOLL is set). As this allows a user
      to potentially use more cycles outside of the normal hierarchy,
      restrict the use of this feature to root.
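
      The restriction boils down to a capability check in the setup path,
      along these lines (sketch, not the verbatim diff):

      if (ctx->flags & IORING_SETUP_SQPOLL) {
              ret = -EPERM;
              if (!capable(CAP_SYS_ADMIN))
                      goto err;
              /* ... start the submission polling thread ... */
      }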
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • tools/io_uring: remove IOCQE_FLAG_CACHEHIT · 204fcca3
      Committed by Jens Axboe
      commit 704236672edacf353c362bab70c3d3eda7bb4a51 upstream.
      
      This ended up not being included in the mainline version of io_uring,
      so drop it from the test app as well.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: fix double free in case of fileset registration failure · b7e56f26
      Committed by Jens Axboe
      commit 25adf50fe25d506d3fc12070a5ff4be858a1ac1b upstream.
      
      Will Deacon reported the following KASAN complaint:
      
      [  149.890370] ==================================================================
      [  149.891266] BUG: KASAN: double-free or invalid-free in io_sqe_files_unregister+0xa8/0x140
      [  149.892218]
      [  149.892411] CPU: 113 PID: 3974 Comm: io_uring_regist Tainted: G    B             5.1.0-rc3-00012-g40b114779944 #3
      [  149.893623] Hardware name: linux,dummy-virt (DT)
      [  149.894169] Call trace:
      [  149.894539]  dump_backtrace+0x0/0x228
      [  149.895172]  show_stack+0x14/0x20
      [  149.895747]  dump_stack+0xe8/0x124
      [  149.896335]  print_address_description+0x60/0x258
      [  149.897148]  kasan_report_invalid_free+0x78/0xb8
      [  149.897936]  __kasan_slab_free+0x1fc/0x228
      [  149.898641]  kasan_slab_free+0x10/0x18
      [  149.899283]  kfree+0x70/0x1f8
      [  149.899798]  io_sqe_files_unregister+0xa8/0x140
      [  149.900574]  io_ring_ctx_wait_and_kill+0x190/0x3c0
      [  149.901402]  io_uring_release+0x2c/0x48
      [  149.902068]  __fput+0x18c/0x510
      [  149.902612]  ____fput+0xc/0x18
      [  149.903146]  task_work_run+0xf0/0x148
      [  149.903778]  do_notify_resume+0x554/0x748
      [  149.904467]  work_pending+0x8/0x10
      [  149.905060]
      [  149.905331] Allocated by task 3974:
      [  149.905934]  __kasan_kmalloc.isra.0.part.1+0x48/0xf8
      [  149.906786]  __kasan_kmalloc.isra.0+0xb8/0xd8
      [  149.907531]  kasan_kmalloc+0xc/0x18
      [  149.908134]  __kmalloc+0x168/0x248
      [  149.908724]  __arm64_sys_io_uring_register+0x2b8/0x15a8
      [  149.909622]  el0_svc_common+0x100/0x258
      [  149.910281]  el0_svc_handler+0x48/0xc0
      [  149.910928]  el0_svc+0x8/0xc
      [  149.911425]
      [  149.911696] Freed by task 3974:
      [  149.912242]  __kasan_slab_free+0x114/0x228
      [  149.912955]  kasan_slab_free+0x10/0x18
      [  149.913602]  kfree+0x70/0x1f8
      [  149.914118]  __arm64_sys_io_uring_register+0xc2c/0x15a8
      [  149.915009]  el0_svc_common+0x100/0x258
      [  149.915670]  el0_svc_handler+0x48/0xc0
      [  149.916317]  el0_svc+0x8/0xc
      [  149.916817]
      [  149.917101] The buggy address belongs to the object at ffff8004ce07ed00
      [  149.917101]  which belongs to the cache kmalloc-128 of size 128
      [  149.919197] The buggy address is located 0 bytes inside of
      [  149.919197]  128-byte region [ffff8004ce07ed00, ffff8004ce07ed80)
      [  149.921142] The buggy address belongs to the page:
      [  149.921953] page:ffff7e0013381f00 count:1 mapcount:0 mapping:ffff800503417c00 index:0x0 compound_mapcount: 0
      [  149.923595] flags: 0x1ffff00000010200(slab|head)
      [  149.924388] raw: 1ffff00000010200 dead000000000100 dead000000000200 ffff800503417c00
      [  149.925706] raw: 0000000000000000 0000000080400040 00000001ffffffff 0000000000000000
      [  149.927011] page dumped because: kasan: bad access detected
      [  149.927956]
      [  149.928224] Memory state around the buggy address:
      [  149.929054]  ffff8004ce07ec00: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
      [  149.930274]  ffff8004ce07ec80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [  149.931494] >ffff8004ce07ed00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  149.932712]                    ^
      [  149.933281]  ffff8004ce07ed80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [  149.934508]  ffff8004ce07ee00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [  149.935725] ==================================================================
      
      which is due to a failure in registering a fileset. This frees the
      ctx->user_files pointer, but doesn't clear it. When the io_uring
      instance is later freed through the normal channels, we free this
      pointer again. At this point it's invalid.
      
      Ensure we clear the pointer when we free it for the error case.
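
      The error path then looks roughly like this (field names as in that
      era's ctx):

      if (ret) {
              kfree(ctx->user_files);
              ctx->user_files = NULL;    /* don't free it again at teardown */
              ctx->nr_user_files = 0;
              return ret;
      }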
      Reported-by: Will Deacon <will.deacon@arm.com>
      Tested-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • tools headers: Update x86's syscall_64.tbl and uapi/asm-generic/unistd · 00400e96
      Committed by Arnaldo Carvalho de Melo
      commit 8142bd82a59e452fefea7b21113101d6a87d9fa8 upstream.
      
      To pick up the changes introduced in the following csets:
      
        2b188cc1bb85 ("Add io_uring IO interface")
        edafccee56ff ("io_uring: add support for pre-mapped user IO buffers")
        3eb39f47934f ("signal: add pidfd_send_signal() syscall")
      
      This makes 'perf trace' become aware of these new syscalls, so that
      one can use them like 'perf trace -e io_uring*,*signal' to do a system
      wide strace-like session looking at those syscalls, for instance.
      
      For example:
      
        # perf trace -s io_uring-cp ~acme/isos/RHEL-x86_64-dvd1.iso ~/bla
      
         Summary of events:
      
         io_uring-cp (383), 1208866 events, 100.0%
      
           syscall         calls   total    min     avg     max   stddev
                                   (msec) (msec)  (msec)  (msec)     (%)
           -------------- ------ -------- ------ ------- -------  ------
           io_uring_enter 605780 2955.615  0.000   0.005  33.804   1.94%
           openat              4  459.446  0.004 114.861 459.435 100.00%
           munmap              4    0.073  0.009   0.018   0.042  44.03%
           mmap               10    0.054  0.002   0.005   0.026  43.24%
           brk                28    0.038  0.001   0.001   0.003   7.51%
           io_uring_setup      1    0.030  0.030   0.030   0.030   0.00%
           mprotect            4    0.014  0.002   0.004   0.005  14.32%
           close               5    0.012  0.001   0.002   0.004  28.87%
           fstat               3    0.006  0.001   0.002   0.003  35.83%
           read                4    0.004  0.001   0.001   0.002  13.58%
           access              1    0.003  0.003   0.003   0.003   0.00%
           lseek               3    0.002  0.001   0.001   0.001   9.00%
           arch_prctl          2    0.002  0.001   0.001   0.001   0.69%
           execve              1    0.000  0.000   0.000   0.000   0.00%
        #
        # perf trace -e io_uring* -s io_uring-cp ~acme/isos/RHEL-x86_64-dvd1.iso ~/bla
      
         Summary of events:
      
         io_uring-cp (390), 1191250 events, 100.0%
      
           syscall         calls   total    min    avg    max  stddev
                                   (msec) (msec) (msec) (msec)    (%)
           -------------- ------ -------- ------ ------ ------ ------
           io_uring_enter 597093 2706.060  0.001  0.005 14.761  1.10%
           io_uring_setup      1    0.038  0.038  0.038  0.038  0.00%
        #
      
      More work is needed to make the tools/perf/examples/bpf/augmented_raw_syscalls.c
      BPF program copy the 'struct io_uring_params' arguments to perf's ring
      buffer so that 'perf trace' can use the BTF info put in place by pahole's
      conversion of the kernel DWARF and then auto-beautify those arguments.
      
      This patch produces the expected change in the generated syscalls table
      for x86_64:
      
        --- /tmp/build/perf/arch/x86/include/generated/asm/syscalls_64.c.before	2019-03-26 13:37:46.679057774 -0300
        +++ /tmp/build/perf/arch/x86/include/generated/asm/syscalls_64.c	2019-03-26 13:38:12.755990383 -0300
        @@ -334,5 +334,9 @@ static const char *syscalltbl_x86_64[] =
         	[332] = "statx",
         	[333] = "io_pgetevents",
         	[334] = "rseq",
        +	[424] = "pidfd_send_signal",
        +	[425] = "io_uring_setup",
        +	[426] = "io_uring_enter",
        +	[427] = "io_uring_register",
         };
        -#define SYSCALLTBL_x86_64_MAX_ID 334
        +#define SYSCALLTBL_x86_64_MAX_ID 427
      
      This silences these perf build warnings:
      
        Warning: Kernel ABI header at 'tools/include/uapi/asm-generic/unistd.h' differs from latest version at 'include/uapi/asm-generic/unistd.h'
        diff -u tools/include/uapi/asm-generic/unistd.h include/uapi/asm-generic/unistd.h
        Warning: Kernel ABI header at 'tools/perf/arch/x86/entry/syscalls/syscall_64.tbl' differs from latest version at 'arch/x86/entry/syscalls/syscall_64.tbl'
        diff -u tools/perf/arch/x86/entry/syscalls/syscall_64.tbl arch/x86/entry/syscalls/syscall_64.tbl
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Yonghong Song <yhs@fb.com>
      Link: https://lkml.kernel.org/n/tip-p0ars3otuc52x5iznf21shhw@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: offload write to async worker in case of -EAGAIN · 691ad7b2
      Committed by Roman Penyaev
      commit 9bf7933fc3f306bc4ce74ad734f690a71670178a upstream.
      
      In case of direct write -EAGAIN will be returned if page cache was
      previously populated.  To avoid immediate completion of a request
      with -EAGAIN error write has to be offloaded to the async worker,
      like io_read() does.
      Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-block@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: fix big-endian compat signal mask handling · d895ee01
      Committed by Arnd Bergmann
      commit 9e75ad5d8f399a21c86271571aa630dd080223e2 upstream.
      
      On big-endian architectures, the signal masks are different
      between 32-bit and 64-bit tasks, so we have to use a different
      function for reading them from user space.
      
      io_cqring_wait() initially got this wrong and always interpreted the
      mask as a native structure. This is ok on x86 and most arm64,
      but not on s390, ppc64be, mips64be, sparc64 and parisc.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • block: add BIO_NO_PAGE_REF flag · c0d2a0b9
      Committed by Jens Axboe
      commit 399254aaf4892113c806816f7e64cf40c804d46d upstream.
      
      If bio_iov_iter_get_pages() is called on an iov_iter that is flagged
      with NO_REF, then we don't need to add a page reference for the pages
      that we add.
      
      Add BIO_NO_PAGE_REF to track this in the bio, so IO completion knows
      not to drop a reference to these pages.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • iov_iter: add ITER_BVEC_FLAG_NO_REF flag · 209cc5b5
      Committed by Jens Axboe
      commit 875f1d0769cdcfe1596ff0ca609b453359e42ec9 upstream.
      
      For ITER_BVEC, if we're holding on to kernel pages, the caller
      doesn't need to grab a reference to the bvec pages, and drop that
      same reference on IO completion. This is essentially safe for any
      ITER_BVEC, but some use cases end up reusing pages and unconditionally
      dropping a page reference on completion. An example of that is
      sendfile(2), which ends up being a splice_in + splice_out on the
      pipe pages.
      
      Add a flag that tells us it's fine to not grab a page reference
      to the bvec pages, since that caller knows not to drop a reference
      when it's done with the pages.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: mark me as the maintainer · b8b92e8f
      Committed by Jens Axboe
      commit bf33a7699e992b12d4c7d39dc3f0b61f6b26c5c2 upstream.
      
      And io_uring as maintained in general.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: retry bulk slab allocs as single allocs · ef1ee13b
      Committed by Jens Axboe
      commit fd6fab2cb78d3b6023c26ec53e0aa6f0b477d2f7 upstream.
      
      I've seen cases where bulk alloc fails, since the bulk alloc API
      is all-or-nothing - either we get the number we ask for, or it
      returns 0 as number of entries.
      
      If we fail a batch bulk alloc, retry a "normal" kmem_cache_alloc()
      and just use that instead of failing with -EAGAIN.
      
      While in there, ensure we use GFP_KERNEL. That was an oversight in
      the original code, when we switched away from GFP_ATOMIC.
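
      Sketch of the fallback (req_cachep and the state fields are the era's
      names; gfp is assumed to be GFP_KERNEL here):

      ret = kmem_cache_alloc_bulk(req_cachep, gfp, nr, state->reqs);
      if (unlikely(ret <= 0)) {
              /* bulk alloc is all-or-nothing: retry a single allocation */
              state->reqs[0] = kmem_cache_alloc(req_cachep, gfp);
              if (!state->reqs[0])
                      return NULL;
              ret = 1;
      }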
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: fix poll races · 845a60d5
      Committed by Jens Axboe
      commit 8c838788775a593527803786d376393b7c28f589 upstream.
      
      This is a straight port of Al's fix for the aio poll implementation,
      since the io_uring version is heavily based on that. The below
      description is almost straight from that patch, just modified to
      fit the io_uring situation.
      
      io_poll() has to cope with several unpleasant problems:
      	* requests that might stay around indefinitely need to
      be made visible for io_cancel(2); that must not be done to
      a request already completed, though.
      	* in cases when ->poll() has placed us on a waitqueue,
      wakeup might have happened (and request completed) before ->poll()
      returns.
      	* worse, in some early wakeup cases request might end
      up re-added into the queue later - we can't treat "woken up and
      currently not in the queue" as "it's not going to stick around
      indefinitely"
      	* ... moreover, ->poll() might have decided not to
      put it on any queues to start with, and that needs to be distinguished
      from the previous case
      	* ->poll() might have tried to put us on more than one queue.
      Only the first will succeed for io poll, so we might end up missing
      wakeups.  OTOH, we might very well notice that only after the
      wakeup hits and request gets completed (all before ->poll() gets
      around to the second poll_wait()).  In that case it's too late to
      decide that we have an error.
      
      req->woken was an attempt to deal with that.  Unfortunately, it was
      broken.  What we need to keep track of is not that wakeup has happened -
      the thing might come back after that.  It's that async reference is
      already gone and won't come back, so we can't (and needn't) put the
      request on the list of cancellables.
      
      The easiest case is "request hadn't been put on any waitqueues"; we
      can tell by seeing NULL apt.head, and in that case there won't be
      anything async.  We should either complete the request ourselves
      (if vfs_poll() reports anything of interest) or return an error.
      
      In all other cases we get exclusion with wakeups by grabbing the
      queue lock.
      
      If request is currently on queue and we have something interesting
      from vfs_poll(), we can steal it and complete the request ourselves.
      
      If it's on queue and vfs_poll() has not reported anything interesting,
      we either put it on the cancellable list, or, if we know that it
      hadn't been put on all queues ->poll() wanted it on, we steal it and
      return an error.
      
      If it's _not_ on queue, it's either been already dealt with (in which
      case we do nothing), or there's io_poll_complete_work() about to be
      executed.  In that case we either put it on the cancellable list,
      or, if we know it hadn't been put on all queues ->poll() wanted it on,
      simulate what cancel would've done.
      
      Fixes: 221c5eb23382 ("io_uring: add support for IORING_OP_POLL")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: fix fget/fput handling · 1a18e019
      Committed by Jens Axboe
      commit 09bb839434bd845c01da3d159b0c126fe7fa90da upstream.
      
      This isn't a straight port of commit 84c4e1f89fef for aio.c, since
      io_uring doesn't use files in exactly the same way. But it's pretty
      close. See the commit message for that commit.
      
      This essentially fixes a use-after-free with the poll command
      handling, but it takes cue from Linus's approach to just simplifying
      the file handling. We move the setup of the file into a higher level
      location, so the individual commands don't have to deal with it. And
      then we release the reference when we free the associated io_kiocb.
      
      Fixes: 221c5eb23382 ("io_uring: add support for IORING_OP_POLL")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: add prepped flag · 2904810b
      Committed by Jens Axboe
      commit d530a402a114efcf6d2b88d7f628856dade5b90b upstream.
      
      We currently use the fact that if ->ki_filp is already set, then we've
      done the prep. In preparation for moving the file assignment earlier,
      use a separate flag to tell whether the request has been prepped for
      IO or not.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: make io_read/write return an integer · b586a15b
      Committed by Jens Axboe
      commit e0c5c576d5074b5bb7b1b4b59848c25ceb521331 upstream.
      
      The callers all convert to an integer, and we only return 0/-ERROR
      anyway.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: use regular request ref counts · 3a3cfffc
      Committed by Jens Axboe
      commit e65ef56db4945fb18a0d522e056c02ddf939e644 upstream.
      
      Get rid of the special casing of "normal" requests not having
      any references to the io_kiocb. We initialize the ref count to 2,
      one for the submission side, and one for the completion side.
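
      Sketch of the scheme: every io_kiocb starts with two references, the
      submission path drops one and completion drops the other (helper name
      assumed from that era's code).

      refcount_set(&req->refs, 2);
      /* ... */
      io_put_req(req);        /* submission side is done with the request */
      io_put_req(req);        /* completion frees it when this hits zero */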
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: add a few test tools · b0034373
      Committed by Jens Axboe
      commit 21b4aa5d20fd07207e73270cadffed5c63fb4343 upstream.
      
      This adds two test programs in tools/io_uring/ that demonstrate both
      the raw io_uring API (and all features) through a small benchmark
      app, io_uring-bench, and the liburing exposed API in a simplified
      cp(1) implementation through io_uring-cp.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>