1. 17 1月, 2020 40 次提交
    • J
      io_uring: drop req submit reference always in async punt · 3f1e3e7d
      Jens Axboe 提交于
      commit 817869d2519f0cb7be5b3482129dadc806dfb747 upstream.
      
      If we don't end up actually calling submit in io_sq_wq_submit_work(),
      we still need to drop the submit reference to the request. If we
      don't, then we can leak the request. This can happen if we race
      with ring shutdown while flushing the workqueue for requests that
      require use of the mm_struct.
      
      Fixes: e65ef56db494 ("io_uring: use regular request ref counts")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      3f1e3e7d
    • M
      io_uring: free allocated io_memory once · a064fda3
      Mark Rutland 提交于
      commit 52e04ef4c9d459cba3afd86ec335a411b40b7fd2 upstream.
      
      If io_allocate_scq_urings() fails to allocate an sq_* region, it will
      call io_mem_free() for any previously allocated regions, but leave
      dangling pointers to these regions in the ctx. Any regions which have
      not yet been allocated are left NULL. Note that when returning
      -EOVERFLOW, the previously allocated sq_ring is not freed, which appears
      to be an unintentional leak.
      
      When io_allocate_scq_urings() fails, io_uring_create() will call
      io_ring_ctx_wait_and_kill(), which calls io_mem_free() on all the sq_*
      regions, assuming the pointers are valid and not NULL.
      
      This can result in pages being freed multiple times, which has been
      observed to corrupt the page state, leading to subsequent fun. This can
      also result in virt_to_page() on NULL, resulting in the use of bogus
      page addresses, and yet more subsequent fun. The latter can be detected
      with CONFIG_DEBUG_VIRTUAL on arm64.
      
      Adding a cleanup path to io_allocate_scq_urings() complicates the logic,
      so let's leave it to io_ring_ctx_free() to consistently free these
      pointers, and simplify the io_allocate_scq_urings() error paths.
      
      Full splats from before this patch below. Note that the pointer logged
      by the DEBUG_VIRTUAL "non-linear address" warning has been hashed, and
      is actually NULL.
      
      [   26.098129] page:ffff80000e949a00 count:0 mapcount:-128 mapping:0000000000000000 index:0x0
      [   26.102976] flags: 0x63fffc000000()
      [   26.104373] raw: 000063fffc000000 ffff80000e86c188 ffff80000ea3df08 0000000000000000
      [   26.108917] raw: 0000000000000000 0000000000000001 00000000ffffff7f 0000000000000000
      [   26.137235] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
      [   26.143960] ------------[ cut here ]------------
      [   26.146020] kernel BUG at include/linux/mm.h:547!
      [   26.147586] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
      [   26.149163] Modules linked in:
      [   26.150287] Process syz-executor.21 (pid: 20204, stack limit = 0x000000000e9cefeb)
      [   26.153307] CPU: 2 PID: 20204 Comm: syz-executor.21 Not tainted 5.1.0-rc7-00004-g7d30b2ea43d6 #18
      [   26.156566] Hardware name: linux,dummy-virt (DT)
      [   26.158089] pstate: 40400005 (nZcv daif +PAN -UAO)
      [   26.159869] pc : io_mem_free+0x9c/0xa8
      [   26.161436] lr : io_mem_free+0x9c/0xa8
      [   26.162720] sp : ffff000013003d60
      [   26.164048] x29: ffff000013003d60 x28: ffff800025048040
      [   26.165804] x27: 0000000000000000 x26: ffff800025048040
      [   26.167352] x25: 00000000000000c0 x24: ffff0000112c2820
      [   26.169682] x23: 0000000000000000 x22: 0000000020000080
      [   26.171899] x21: ffff80002143b418 x20: ffff80002143b400
      [   26.174236] x19: ffff80002143b280 x18: 0000000000000000
      [   26.176607] x17: 0000000000000000 x16: 0000000000000000
      [   26.178997] x15: 0000000000000000 x14: 0000000000000000
      [   26.181508] x13: 00009178a5e077b2 x12: 0000000000000001
      [   26.183863] x11: 0000000000000000 x10: 0000000000000980
      [   26.186437] x9 : ffff000013003a80 x8 : ffff800025048a20
      [   26.189006] x7 : ffff8000250481c0 x6 : ffff80002ffe9118
      [   26.191359] x5 : ffff80002ffe9118 x4 : 0000000000000000
      [   26.193863] x3 : ffff80002ffefe98 x2 : 44c06ddd107d1f00
      [   26.196642] x1 : 0000000000000000 x0 : 000000000000003e
      [   26.198892] Call trace:
      [   26.199893]  io_mem_free+0x9c/0xa8
      [   26.201155]  io_ring_ctx_wait_and_kill+0xec/0x180
      [   26.202688]  io_uring_setup+0x6c4/0x6f0
      [   26.204091]  __arm64_sys_io_uring_setup+0x18/0x20
      [   26.205576]  el0_svc_common.constprop.0+0x7c/0xe8
      [   26.207186]  el0_svc_handler+0x28/0x78
      [   26.208389]  el0_svc+0x8/0xc
      [   26.209408] Code: aa0203e0 d0006861 9133a021 97fcdc3c (d4210000)
      [   26.211995] ---[ end trace bdb81cd43a21e50d ]---
      
      [   81.770626] ------------[ cut here ]------------
      [   81.825015] virt_to_phys used for non-linear address: 000000000d42f2c7 (          (null))
      [   81.827860] WARNING: CPU: 1 PID: 30171 at arch/arm64/mm/physaddr.c:15 __virt_to_phys+0x48/0x68
      [   81.831202] Modules linked in:
      [   81.832212] CPU: 1 PID: 30171 Comm: syz-executor.20 Not tainted 5.1.0-rc7-00004-g7d30b2ea43d6 #19
      [   81.835616] Hardware name: linux,dummy-virt (DT)
      [   81.836863] pstate: 60400005 (nZCv daif +PAN -UAO)
      [   81.838727] pc : __virt_to_phys+0x48/0x68
      [   81.840572] lr : __virt_to_phys+0x48/0x68
      [   81.842264] sp : ffff80002cf67c70
      [   81.843858] x29: ffff80002cf67c70 x28: ffff800014358e18
      [   81.846463] x27: 0000000000000000 x26: 0000000020000080
      [   81.849148] x25: 0000000000000000 x24: ffff80001bb01f40
      [   81.851986] x23: ffff200011db06c8 x22: ffff2000127e3c60
      [   81.854351] x21: ffff800014358cc0 x20: ffff800014358d98
      [   81.856711] x19: 0000000000000000 x18: 0000000000000000
      [   81.859132] x17: 0000000000000000 x16: 0000000000000000
      [   81.861586] x15: 0000000000000000 x14: 0000000000000000
      [   81.863905] x13: 0000000000000000 x12: ffff1000037603e9
      [   81.866226] x11: 1ffff000037603e8 x10: 0000000000000980
      [   81.868776] x9 : ffff80002cf67840 x8 : ffff80001bb02920
      [   81.873272] x7 : ffff1000037603e9 x6 : ffff80001bb01f47
      [   81.875266] x5 : ffff1000037603e9 x4 : dfff200000000000
      [   81.876875] x3 : ffff200010087528 x2 : ffff1000059ecf58
      [   81.878751] x1 : 44c06ddd107d1f00 x0 : 0000000000000000
      [   81.880453] Call trace:
      [   81.881164]  __virt_to_phys+0x48/0x68
      [   81.882919]  io_mem_free+0x18/0x110
      [   81.886585]  io_ring_ctx_wait_and_kill+0x13c/0x1f0
      [   81.891212]  io_uring_setup+0xa60/0xad0
      [   81.892881]  __arm64_sys_io_uring_setup+0x2c/0x38
      [   81.894398]  el0_svc_common.constprop.0+0xac/0x150
      [   81.896306]  el0_svc_handler+0x34/0x88
      [   81.897744]  el0_svc+0x8/0xc
      [   81.898715] ---[ end trace b4a703802243cbba ]---
      
      Fixes: 2b188cc1bb857a9d ("Add io_uring IO interface")
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-block@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      a064fda3
    • M
      io_uring: fix SQPOLL cpu validation · 5425cccc
      Mark Rutland 提交于
      commit 975554b03eddc1df73bda3a764a09e18cadd5f1c upstream.
      
      In io_sq_offload_start(), we call cpu_possible() on an unbounded cpu
      value from userspace. On v5.1-rc7 on arm64 with
      CONFIG_DEBUG_PER_CPU_MAPS, this results in a splat:
      
        WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpu_max_bits_warn include/linux/cpumask.h:121 [inline]
      
      There was an attempt to fix this in commit:
      
        917257daa0fea7a0 ("io_uring: only test SQPOLL cpu after we've verified it")
      
      ... by adding a check after the cpu value had been limited to NR_CPU_IDS
      using array_index_nospec(). However, this left an unbound check at the
      start of the function, for which the warning still fires.
      
      Let's fix this correctly by checking that the cpu value is bound by
      nr_cpu_ids before passing it to cpu_possible(). Note that only
      nr_cpu_ids of a cpumask are guaranteed to exist at runtime, and
      nr_cpu_ids can be significantly smaller than NR_CPUs. For example, an
      arm64 defconfig has NR_CPUS=256, while my test VM has 4 vCPUs.
      
      Following the intent from the commit message for 917257daa0fea7a0, the
      check is moved under the SQ_AFF branch, which is the only branch where
      the cpu values is consumed. The check is performed before bounding the
      value with array_index_nospec() so that we don't silently accept bogus
      cpu values from userspace, where array_index_nospec() would force these
      values to 0.
      
      I suspect we can remove the array_index_nospec() call entirely, but I've
      conservatively left that in place, updated to use nr_cpu_ids to match
      the prior check.
      
      Tested on arm64 with the Syzkaller reproducer:
      
        https://syzkaller.appspot.com/bug?extid=cd714a07c6de2bc34293
        https://syzkaller.appspot.com/x/repro.syz?x=15d8b397200000
      
      Full splat from before this patch:
      
      WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpu_max_bits_warn include/linux/cpumask.h:121 [inline]
      WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpumask_check include/linux/cpumask.h:128 [inline]
      WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpumask_test_cpu include/linux/cpumask.h:344 [inline]
      WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 io_sq_offload_start fs/io_uring.c:2244 [inline]
      WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 io_uring_create fs/io_uring.c:2864 [inline]
      WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 io_uring_setup+0x1108/0x15a0 fs/io_uring.c:2916
      Kernel panic - not syncing: panic_on_warn set ...
      CPU: 1 PID: 27601 Comm: syz-executor.0 Not tainted 5.1.0-rc7 #3
      Hardware name: linux,dummy-virt (DT)
      Call trace:
       dump_backtrace+0x0/0x2f0 include/linux/compiler.h:193
       show_stack+0x20/0x30 arch/arm64/kernel/traps.c:158
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x110/0x190 lib/dump_stack.c:113
       panic+0x384/0x68c kernel/panic.c:214
       __warn+0x2bc/0x2c0 kernel/panic.c:571
       report_bug+0x228/0x2d8 lib/bug.c:186
       bug_handler+0xa0/0x1a0 arch/arm64/kernel/traps.c:956
       call_break_hook arch/arm64/kernel/debug-monitors.c:301 [inline]
       brk_handler+0x1d4/0x388 arch/arm64/kernel/debug-monitors.c:316
       do_debug_exception+0x1a0/0x468 arch/arm64/mm/fault.c:831
       el1_dbg+0x18/0x8c
       cpu_max_bits_warn include/linux/cpumask.h:121 [inline]
       cpumask_check include/linux/cpumask.h:128 [inline]
       cpumask_test_cpu include/linux/cpumask.h:344 [inline]
       io_sq_offload_start fs/io_uring.c:2244 [inline]
       io_uring_create fs/io_uring.c:2864 [inline]
       io_uring_setup+0x1108/0x15a0 fs/io_uring.c:2916
       __do_sys_io_uring_setup fs/io_uring.c:2929 [inline]
       __se_sys_io_uring_setup fs/io_uring.c:2926 [inline]
       __arm64_sys_io_uring_setup+0x50/0x70 fs/io_uring.c:2926
       __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
       invoke_syscall arch/arm64/kernel/syscall.c:47 [inline]
       el0_svc_common.constprop.0+0x148/0x2e0 arch/arm64/kernel/syscall.c:83
       el0_svc_handler+0xdc/0x100 arch/arm64/kernel/syscall.c:129
       el0_svc+0x8/0xc arch/arm64/kernel/entry.S:948
      SMP: stopping secondary CPUs
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Kernel Offset: disabled
      CPU features: 0x002,23000438
      Memory Limit: none
      Rebooting in 1 seconds..
      
      Fixes: 917257daa0fea7a0 ("io_uring: only test SQPOLL cpu after we've verified it")
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-block@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      
      Simplied the logic
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      5425cccc
    • J
      io_uring: have submission side sqe errors post a cqe · b0195834
      Jens Axboe 提交于
      commit 5c8b0b54db22c54f2aec991b388f550d3a927f26 upstream.
      
      Currently we only post a cqe if we get an error OUTSIDE of submission.
      For submission, we return the error directly through io_uring_enter().
      This is a bit awkward for applications, and it makes more sense to
      always post a cqe with an error, if the error happens on behalf of an
      sqe.
      
      This changes submission behavior a bit. io_uring_enter() returns -ERROR
      for an error, and > 0 for number of sqes submitted. Before this change,
      if you wanted to submit 8 entries and had an error on the 5th entry,
      io_uring_enter() would return 4 (for number of entries successfully
      submitted) and rewind the sqring. The application would then have to
      peek at the sqring and figure out what was wrong with the head sqe, and
      then skip it itself. With this change, we'll return 5 since we did
      consume 5 sqes, and the last sqe (with the error) will result in a cqe
      being posted with the error.
      
      This makes the logic easier to handle in the application, and it cleans
      up the submission part.
      Suggested-by: NStefan Bühler <source@stbuehler.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      b0195834
    • S
      io_uring: remove unnecessary barrier after unsetting IORING_SQ_NEED_WAKEUP · 51097e80
      Stefan Bühler 提交于
      commit 62977281a6384d3904c02272a638cc3ac3bac54d upstream.
      
      There is no operation to order with afterwards, and removing the flag is
      not critical in any way.
      
      There will always be a "race condition" where the application will
      trigger IORING_ENTER_SQ_WAKEUP when it isn't actually needed.
      Signed-off-by: NStefan Bühler <source@stbuehler.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      51097e80
    • S
      io_uring: remove unnecessary barrier after incrementing dropped counter · 2af9e86a
      Stefan Bühler 提交于
      commit b841f19524a16cd93a39f9306191f85c549a2bc2 upstream.
      
      smp_store_release in io_commit_sqring already orders the store to
      dropped before the update to SQ head.
      Signed-off-by: NStefan Bühler <source@stbuehler.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      2af9e86a
    • S
      io_uring: remove unnecessary barrier before reading SQ tail · 27288aff
      Stefan Bühler 提交于
      commit 82ab082c0e2f8592c2ff6b2ab99a92d8406c8c2c upstream.
      
      There is no operation before to order with.
      Signed-off-by: NStefan Bühler <source@stbuehler.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      27288aff
    • S
      io_uring: remove unnecessary barrier after updating SQ head · ddc03f6d
      Stefan Bühler 提交于
      commit 9e4c15a3939448d2ea9b9bf59561183bbe3fdc49 upstream.
      
      There is no operation afterwards to order with.
      Signed-off-by: NStefan Bühler <source@stbuehler.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      ddc03f6d
    • S
      io_uring: remove unnecessary barrier before reading cq head · 45f8cbda
      Stefan Bühler 提交于
      commit 115e12e58dbc055e98c965e3255aed7b20214f95 upstream.
      
      The memory operations before reading cq head are unrelated and we
      don't care about their order.
      
      Document that the control dependency in combination with READ_ONCE and
      WRITE_ONCE forms a barrier we need.
      Signed-off-by: NStefan Bühler <source@stbuehler.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      45f8cbda
    • S
      io_uring: remove unnecessary barrier before wq_has_sleeper · 64956e14
      Stefan Bühler 提交于
      commit 4f7067c3fb7f2974363a28c597a41949d971af02 upstream.
      
      wq_has_sleeper has a full barrier internally. The smp_rmb barrier in
      io_uring_poll synchronizes with it.
      Signed-off-by: NStefan Bühler <source@stbuehler.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      64956e14
    • S
      io_uring: fix notes on barriers · 2286b79e
      Stefan Bühler 提交于
      commit 1e84b97b7377bd0198f87b49ad3e396e84bf0458 upstream.
      
      The application reading the CQ ring needs a barrier to pair with the
      smp_store_release in io_commit_cqring, not the barrier after it.
      
      Also a write barrier *after* writing something (but not *before*
      writing anything interesting) doesn't order anything, so an smp_wmb()
      after writing SQ tail is not needed.
      
      Additionally consider reading SQ head and writing CQ tail in the notes.
      
      Also add some clarifications how the various other fields in the ring
      buffers are used.
      Signed-off-by: NStefan Bühler <source@stbuehler.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      2286b79e
    • S
      io_uring: fix handling SQEs requesting NOWAIT · fb367b56
      Stefan Bühler 提交于
      commit 8449eedaa1da6a51d67190c905b1b54243e095f6 upstream.
      
      Not all request types set REQ_F_FORCE_NONBLOCK when they needed async
      punting; reverse logic instead and set REQ_F_NOWAIT if request mustn't
      be punted.
      Signed-off-by: NStefan Bühler <source@stbuehler.de>
      
      Merged with my previous patch for this.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      fb367b56
    • J
      io_uring: remove 'state' argument from io_{read,write} path · 54578c52
      Jens Axboe 提交于
      commit 8358e3a8264a228cf2dfb6f3a05c0328f4118f12 upstream.
      
      Since commit 09bb839434b we don't use the state argument for any sort
      of on-stack caching in the io read and write path. Remove the stale
      and unused argument from them, and bubble it up to __io_submit_sqe()
      and down to io_prep_rw().
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      54578c52
    • S
      io_uring: fix poll full SQ detection · af1b6490
      Stefan Bühler 提交于
      commit 0d7bae69c574c5f25802f8a71252e7d66933a3ab upstream.
      
      io_uring_poll shouldn't signal EPOLLOUT | EPOLLWRNORM if the queue is
      full; the old check would always signal EPOLLOUT | EPOLLWRNORM (unless
      there were U32_MAX - 1 entries in the SQ queue).
      Signed-off-by: NStefan Bühler <source@stbuehler.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      af1b6490
    • S
      io_uring: fix race condition when sq threads goes sleeping · 03bf0f6f
      Stefan Bühler 提交于
      commit 0d7bae69c574c5f25802f8a71252e7d66933a3ab upstream.
      
      Reading the SQ tail needs to come after setting IORING_SQ_NEED_WAKEUP in
      flags; there is no cheap barrier for ordering a store before a load, a
      full memory barrier is required.
      
      Userspace needs a full memory barrier between updating SQ tail and
      checking for the IORING_SQ_NEED_WAKEUP too.
      Signed-off-by: NStefan Bühler <source@stbuehler.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      03bf0f6f
    • S
      io_uring: fix race condition reading SQ entries · 95d7a8c9
      Stefan Bühler 提交于
      commit e523a29c4f2703bdb98f68ce1bb256e259fd8d5f upstream.
      
      A read memory barrier is required between reading SQ tail and reading
      the actual data belonging to the SQ entry.
      
      Userspace needs a matching write barrier between writing SQ entries and
      updating SQ tail (using smp_store_release to update tail will do).
      Signed-off-by: NStefan Bühler <source@stbuehler.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      95d7a8c9
    • J
      io_uring: fail io_uring_register(2) on a dying io_uring instance · 581b8394
      Jens Axboe 提交于
      commit 35fa71a030caa50458a043560d4814ea9bcd639f upstream.
      
      If we have multiple threads doing io_uring_register(2) on an io_uring
      fd, then we can potentially try and kill the percpu reference while
      someone else has already killed it.
      
      Prevent this race by failing io_uring_register(2) if the ref is marked
      dying. This is safe since we're inside the io_uring mutex.
      
      Fixes: b19062a56726 ("io_uring: fix possible deadlock between io_uring_{enter,register}")
      Reported-by: Nsyzbot <syzbot+10d25e23199614b7721f@syzkaller.appspotmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      581b8394
    • J
      io_uring: fix CQ overflow condition · eddff5bd
      Jens Axboe 提交于
      commit 74f464e97044da33b25aaed00213914b0edf1f2e upstream.
      
      This is a leftover from when the rings initially were not free flowing,
      and hence a test for tail + 1 == head would indicate full. Since we now
      let them wrap instead of mask them with the size, we need to check if
      they drift more than the ring size from each other.
      
      This fixes a case where we'd overwrite CQ ring entries, if the user
      failed to reap completions. Both cases would ultimately result in lost
      completions as the application violated the depth it asked for. The only
      difference is that before this fix we'd return invalid entries for the
      overflowed completions, instead of properly flagging it in the
      cq_ring->overflow variable.
      Reported-by: NStefan Bühler <source@stbuehler.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      eddff5bd
    • J
      io_uring: fix possible deadlock between io_uring_{enter,register} · ad0e85b1
      Jens Axboe 提交于
      commit b19062a567266ee1f10f6709325f766bbcc07d1c upstream.
      
      If we have multiple threads, one doing io_uring_enter() while the other
      is doing io_uring_register(), we can run into a deadlock between the
      two. io_uring_register() must wait for existing users of the io_uring
      instance to exit. But it does so while holding the io_uring mutex.
      Callers of io_uring_enter() may need this mutex to make progress (and
      eventually exit). If we wait for users to exit in io_uring_register(),
      we can't do so with the io_uring mutex held without potentially risking
      a deadlock.
      
      Drop the io_uring mutex while waiting for existing callers to exit. This
      is safe and guaranteed to make forward progress, since we already killed
      the percpu ref before doing so. Hence later callers of io_uring_enter()
      will be rejected.
      
      Reported-by: syzbot+16dc03452dee970a0c3e@syzkaller.appspotmail.com
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      ad0e85b1
    • J
      io_uring: drop io_file_put() 'file' argument · 52460705
      Jens Axboe 提交于
      commit 3d6770fbd9353988839611bab107e4e891506aad upstream.
      
      Since the fget/fput handling was reworked in commit 09bb839434bd, we
      never call io_file_put() with state == NULL (and hence file != NULL)
      anymore. Remove that case.
      Reported-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      52460705
    • J
      io_uring: only test SQPOLL cpu after we've verified it · e738eb4e
      Jens Axboe 提交于
      commit 917257daa0fea7a007102691c0e27d9216a96768 upstream.
      
      We currently call cpu_possible() even if we don't use the CPU. Move the
      test under the SQ_AFF branch, which is the only place where we'll use
      the value. Do the cpu_possible() test AFTER we've limited it to a max
      of NR_CPUS. This avoids triggering the following warning:
      
      WARNING: CPU: 1 PID: 7600 at include/linux/cpumask.h:121 cpu_max_bits_warn
      
      if CONFIG_DEBUG_PER_CPU_MAPS is enabled.
      
      While in there, also move the SQ thread idle period assignment inside
      SETUP_SQPOLL, as we don't use it otherwise either.
      
      Reported-by: syzbot+cd714a07c6de2bc34293@syzkaller.appspotmail.com
      Fixes: 6c271ce2f1d5 ("io_uring: add submission polling")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      e738eb4e
    • J
      io_uring: park SQPOLL thread if it's percpu · 35fea020
      Jens Axboe 提交于
      commit 06058632464845abb1af91521122fd04dd3daaec upstream.
      
      kthread expects this, or we can throw a warning on exit:
      
      WARNING: CPU: 0 PID: 7822 at kernel/kthread.c:399
      __kthread_bind_mask+0x3b/0xc0 kernel/kthread.c:399
      Kernel panic - not syncing: panic_on_warn set ...
      CPU: 0 PID: 7822 Comm: syz-executor030 Not tainted 5.1.0-rc4-next-20190412
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      Call Trace:
        __dump_stack lib/dump_stack.c:77 [inline]
        dump_stack+0x172/0x1f0 lib/dump_stack.c:113
        panic+0x2cb/0x72b kernel/panic.c:214
        __warn.cold+0x20/0x46 kernel/panic.c:576
        report_bug+0x263/0x2b0 lib/bug.c:186
        fixup_bug arch/x86/kernel/traps.c:179 [inline]
        fixup_bug arch/x86/kernel/traps.c:174 [inline]
        do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:272
        do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:291
        invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:973
      RIP: 0010:__kthread_bind_mask+0x3b/0xc0 kernel/kthread.c:399
      Code: 48 89 fb e8 f7 ab 24 00 4c 89 e6 48 89 df e8 ac e1 02 00 31 ff 49 89
      c4 48 89 c6 e8 7f ad 24 00 4d 85 e4 75 15 e8 d5 ab 24 00 <0f> 0b e8 ce ab
      24 00 5b 41 5c 41 5d 41 5e 5d c3 e8 c0 ab 24 00 4c
      RSP: 0018:ffff8880a89bfbb8 EFLAGS: 00010293
      RAX: ffff88808ca7a280 RBX: ffff8880a98e4380 RCX: ffffffff814bdd11
      RDX: 0000000000000000 RSI: ffffffff814bdd1b RDI: 0000000000000007
      RBP: ffff8880a89bfbd8 R08: ffff88808ca7a280 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      R13: ffffffff87691148 R14: ffff8880a98e43a0 R15: ffffffff81c91e10
        __kthread_bind kernel/kthread.c:412 [inline]
        kthread_unpark+0x123/0x160 kernel/kthread.c:480
        kthread_stop+0xfa/0x6c0 kernel/kthread.c:556
        io_sq_thread_stop fs/io_uring.c:2057 [inline]
        io_sq_thread_stop fs/io_uring.c:2052 [inline]
        io_finish_async+0xab/0x180 fs/io_uring.c:2064
        io_ring_ctx_free fs/io_uring.c:2534 [inline]
        io_ring_ctx_wait_and_kill+0x133/0x510 fs/io_uring.c:2591
        io_uring_release+0x42/0x50 fs/io_uring.c:2599
        __fput+0x2e5/0x8d0 fs/file_table.c:278
        ____fput+0x16/0x20 fs/file_table.c:309
        task_work_run+0x14a/0x1c0 kernel/task_work.c:113
        exit_task_work include/linux/task_work.h:22 [inline]
        do_exit+0x90a/0x2fa0 kernel/exit.c:876
        do_group_exit+0x135/0x370 kernel/exit.c:980
        __do_sys_exit_group kernel/exit.c:991 [inline]
        __se_sys_exit_group kernel/exit.c:989 [inline]
        __x64_sys_exit_group+0x44/0x50 kernel/exit.c:989
        do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Reported-by: syzbot+6d4a92619eb0ad08602b@syzkaller.appspotmail.com
      Fixes: 6c271ce2f1d5 ("io_uring: add submission polling")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      35fea020
    • J
      io_uring: restrict IORING_SETUP_SQPOLL to root · 29745a0d
      Jens Axboe 提交于
      commit 3ec482d15cb986bf08b923f9193eeddb3b9ca69f upstream.
      
      This options spawns a kernel side thread that will poll for submissions
      (and completions, if IORING_SETUP_IOPOLL is set). As this allows a user
      to potentially use more cycles outside of the normal hierarchy,
      restrict the use of this feature to root.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      29745a0d
    • J
      io_uring: fix double free in case of fileset regitration failure · b7e56f26
      Jens Axboe 提交于
      commit 25adf50fe25d506d3fc12070a5ff4be858a1ac1b upstream.
      
      Will Deacon reported the following KASAN complaint:
      
      [  149.890370] ==================================================================
      [  149.891266] BUG: KASAN: double-free or invalid-free in io_sqe_files_unregister+0xa8/0x140
      [  149.892218]
      [  149.892411] CPU: 113 PID: 3974 Comm: io_uring_regist Tainted: G    B             5.1.0-rc3-00012-g40b114779944 #3
      [  149.893623] Hardware name: linux,dummy-virt (DT)
      [  149.894169] Call trace:
      [  149.894539]  dump_backtrace+0x0/0x228
      [  149.895172]  show_stack+0x14/0x20
      [  149.895747]  dump_stack+0xe8/0x124
      [  149.896335]  print_address_description+0x60/0x258
      [  149.897148]  kasan_report_invalid_free+0x78/0xb8
      [  149.897936]  __kasan_slab_free+0x1fc/0x228
      [  149.898641]  kasan_slab_free+0x10/0x18
      [  149.899283]  kfree+0x70/0x1f8
      [  149.899798]  io_sqe_files_unregister+0xa8/0x140
      [  149.900574]  io_ring_ctx_wait_and_kill+0x190/0x3c0
      [  149.901402]  io_uring_release+0x2c/0x48
      [  149.902068]  __fput+0x18c/0x510
      [  149.902612]  ____fput+0xc/0x18
      [  149.903146]  task_work_run+0xf0/0x148
      [  149.903778]  do_notify_resume+0x554/0x748
      [  149.904467]  work_pending+0x8/0x10
      [  149.905060]
      [  149.905331] Allocated by task 3974:
      [  149.905934]  __kasan_kmalloc.isra.0.part.1+0x48/0xf8
      [  149.906786]  __kasan_kmalloc.isra.0+0xb8/0xd8
      [  149.907531]  kasan_kmalloc+0xc/0x18
      [  149.908134]  __kmalloc+0x168/0x248
      [  149.908724]  __arm64_sys_io_uring_register+0x2b8/0x15a8
      [  149.909622]  el0_svc_common+0x100/0x258
      [  149.910281]  el0_svc_handler+0x48/0xc0
      [  149.910928]  el0_svc+0x8/0xc
      [  149.911425]
      [  149.911696] Freed by task 3974:
      [  149.912242]  __kasan_slab_free+0x114/0x228
      [  149.912955]  kasan_slab_free+0x10/0x18
      [  149.913602]  kfree+0x70/0x1f8
      [  149.914118]  __arm64_sys_io_uring_register+0xc2c/0x15a8
      [  149.915009]  el0_svc_common+0x100/0x258
      [  149.915670]  el0_svc_handler+0x48/0xc0
      [  149.916317]  el0_svc+0x8/0xc
      [  149.916817]
      [  149.917101] The buggy address belongs to the object at ffff8004ce07ed00
      [  149.917101]  which belongs to the cache kmalloc-128 of size 128
      [  149.919197] The buggy address is located 0 bytes inside of
      [  149.919197]  128-byte region [ffff8004ce07ed00, ffff8004ce07ed80)
      [  149.921142] The buggy address belongs to the page:
      [  149.921953] page:ffff7e0013381f00 count:1 mapcount:0 mapping:ffff800503417c00 index:0x0 compound_mapcount: 0
      [  149.923595] flags: 0x1ffff00000010200(slab|head)
      [  149.924388] raw: 1ffff00000010200 dead000000000100 dead000000000200 ffff800503417c00
      [  149.925706] raw: 0000000000000000 0000000080400040 00000001ffffffff 0000000000000000
      [  149.927011] page dumped because: kasan: bad access detected
      [  149.927956]
      [  149.928224] Memory state around the buggy address:
      [  149.929054]  ffff8004ce07ec00: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
      [  149.930274]  ffff8004ce07ec80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [  149.931494] >ffff8004ce07ed00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  149.932712]                    ^
      [  149.933281]  ffff8004ce07ed80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [  149.934508]  ffff8004ce07ee00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [  149.935725] ==================================================================
      
      which is due to a failure in registrering a fileset. This frees the
      ctx->user_files pointer, but doesn't clear it. When the io_uring
      instance is later freed through the normal channels, we free this
      pointer again. At this point it's invalid.
      
      Ensure we clear the pointer when we free it for the error case.
      Reported-by: NWill Deacon <will.deacon@arm.com>
      Tested-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      b7e56f26
    • R
      io_uring: offload write to async worker in case of -EAGAIN · 691ad7b2
      Roman Penyaev 提交于
      commit 9bf7933fc3f306bc4ce74ad734f690a71670178a upstream.
      
      In case of direct write -EAGAIN will be returned if page cache was
      previously populated.  To avoid immediate completion of a request
      with -EAGAIN error write has to be offloaded to the async worker,
      like io_read() does.
      Signed-off-by: NRoman Penyaev <rpenyaev@suse.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-block@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      691ad7b2
    • A
      io_uring: fix big-endian compat signal mask handling · d895ee01
      Arnd Bergmann 提交于
      commit 9e75ad5d8f399a21c86271571aa630dd080223e2 upstream.
      
      On big-endian architectures, the signal masks are differnet
      between 32-bit and 64-bit tasks, so we have to use a different
      function for reading them from user space.
      
      io_cqring_wait() initially got this wrong, and always interprets
      this as a native structure. This is ok on x86 and most arm64,
      but not on s390, ppc64be, mips64be, sparc64 and parisc.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      d895ee01
    • J
      iov_iter: add ITER_BVEC_FLAG_NO_REF flag · 209cc5b5
      Jens Axboe 提交于
      commit 875f1d0769cdcfe1596ff0ca609b453359e42ec9 upstream.
      
      For ITER_BVEC, if we're holding on to kernel pages, the caller
      doesn't need to grab a reference to the bvec pages, and drop that
      same reference on IO completion. This is essentially safe for any
      ITER_BVEC, but some use cases end up reusing pages and uncondtionally
      dropping a page reference on completion. And example of that is
      sendfile(2), that ends up being a splice_in + splice_out on the
      pipe pages.
      
      Add a flag that tells us it's fine to not grab a page reference
      to the bvec pages, since that caller knows not to drop a reference
      when it's done with the pages.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      209cc5b5
    • J
      io_uring: retry bulk slab allocs as single allocs · ef1ee13b
      Jens Axboe 提交于
      commit fd6fab2cb78d3b6023c26ec53e0aa6f0b477d2f7 upstream.
      
      I've seen cases where bulk alloc fails, since the bulk alloc API
      is all-or-nothing - either we get the number we ask for, or it
      returns 0 as number of entries.
      
      If we fail a batch bulk alloc, retry a "normal" kmem_cache_alloc()
      and just use that instead of failing with -EAGAIN.
      
      While in there, ensure we use GFP_KERNEL. That was an oversight in
      the original code, when we switched away from GFP_ATOMIC.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      ef1ee13b
    • J
      io_uring: fix poll races · 845a60d5
      Jens Axboe 提交于
      commit 8c838788775a593527803786d376393b7c28f589 upstream.
      
      This is a straight port of Al's fix for the aio poll implementation,
      since the io_uring version is heavily based on that. The below
      description is almost straight from that patch, just modified to
      fit the io_uring situation.
      
      io_poll() has to cope with several unpleasant problems:
      	* requests that might stay around indefinitely need to
      be made visible for io_cancel(2); that must not be done to
      a request already completed, though.
      	* in cases when ->poll() has placed us on a waitqueue,
      wakeup might have happened (and request completed) before ->poll()
      returns.
      	* worse, in some early wakeup cases request might end
      up re-added into the queue later - we can't treat "woken up and
      currently not in the queue" as "it's not going to stick around
      indefinitely"
      	* ... moreover, ->poll() might have decided not to
      put it on any queues to start with, and that needs to be distinguished
      from the previous case
      	* ->poll() might have tried to put us on more than one queue.
      Only the first will succeed for io poll, so we might end up missing
      wakeups.  OTOH, we might very well notice that only after the
      wakeup hits and request gets completed (all before ->poll() gets
      around to the second poll_wait()).  In that case it's too late to
      decide that we have an error.
      
      req->woken was an attempt to deal with that.  Unfortunately, it was
      broken.  What we need to keep track of is not that wakeup has happened -
      the thing might come back after that.  It's that async reference is
      already gone and won't come back, so we can't (and needn't) put the
      request on the list of cancellables.
      
      The easiest case is "request hadn't been put on any waitqueues"; we
      can tell by seeing NULL apt.head, and in that case there won't be
      anything async.  We should either complete the request ourselves
      (if vfs_poll() reports anything of interest) or return an error.
      
      In all other cases we get exclusion with wakeups by grabbing the
      queue lock.
      
      If request is currently on queue and we have something interesting
      from vfs_poll(), we can steal it and complete the request ourselves.
      
      If it's on queue and vfs_poll() has not reported anything interesting,
      we either put it on the cancellable list, or, if we know that it
      hadn't been put on all queues ->poll() wanted it on, we steal it and
      return an error.
      
      If it's _not_ on queue, it's either been already dealt with (in which
      case we do nothing), or there's io_poll_complete_work() about to be
      executed.  In that case we either put it on the cancellable list,
      or, if we know it hadn't been put on all queues ->poll() wanted it on,
      simulate what cancel would've done.
      
      Fixes: 221c5eb23382 ("io_uring: add support for IORING_OP_POLL")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      845a60d5
    • J
      io_uring: fix fget/fput handling · 1a18e019
      Jens Axboe 提交于
      commit 09bb839434bd845c01da3d159b0c126fe7fa90da upstream.
      
      This isn't a straight port of commit 84c4e1f89fef for aio.c, since
      io_uring doesn't use files in exactly the same way. But it's pretty
      close. See the commit message for that commit.
      
      This essentially fixes a use-after-free with the poll command
      handling, but it takes cue from Linus's approach to just simplifying
      the file handling. We move the setup of the file into a higher level
      location, so the individual commands don't have to deal with it. And
      then we release the reference when we free the associated io_kiocb.
      
      Fixes: 221c5eb23382 ("io_uring: add support for IORING_OP_POLL")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      1a18e019
    • J
      io_uring: add prepped flag · 2904810b
      Jens Axboe 提交于
      commit d530a402a114efcf6d2b88d7f628856dade5b90b upstream.
      
      We currently use the fact that if ->ki_filp is already set, then we've
      done the prep. In preparation for moving the file assignment earlier,
      use a separate flag to tell whether the request has been prepped for
      IO or not.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      2904810b
    • J
      io_uring: make io_read/write return an integer · b586a15b
      Jens Axboe 提交于
      commit e0c5c576d5074b5bb7b1b4b59848c25ceb521331 upstream.
      
      The callers all convert to an integer, and we only return 0/-ERROR
      anyway.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      b586a15b
    • J
      io_uring: use regular request ref counts · 3a3cfffc
      Jens Axboe 提交于
      commit e65ef56db4945fb18a0d522e056c02ddf939e644 upstream.
      
      Get rid of the special casing of "normal" requests not having
      any references to the io_kiocb. We initialize the ref count to 2,
      one for the submission side, and one or the completion side.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      3a3cfffc
    • J
      io_uring: allow workqueue item to handle multiple buffered requests · af01af1b
      Jens Axboe 提交于
      commit 31b515106428b9717d2b6475b6f6182cf231b1e6 upstream.
      
      Right now we punt any buffered request that ends up triggering an
      -EAGAIN to an async workqueue. This works fine in terms of providing
      async execution of them, but it also can create quite a lot of work
      queue items. For sequentially buffered IO, it's advantageous to
      serialize the issue of them. For reads, the first one will trigger a
      read-ahead, and subsequent request merely end up waiting on later pages
      to complete. For writes, devices usually respond better to streamed
      sequential writes.
      
      Add state to track the last buffered request we punted to a work queue,
      and if the next one is sequential to the previous, attempt to get the
      previous work item to handle it. We limit the number of sequential
      add-ons to the a multiple (8) of the max read-ahead size of the file.
      This should be a good number for both reads and wries, as it defines the
      max IO size the device can do directly.
      
      This drastically cuts down on the number of context switches we need to
      handle buffered sequential IO, and a basic test case of copying a big
      file with io_uring sees a 5x speedup.
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      af01af1b
    • J
      io_uring: add support for IORING_OP_POLL · 51de0e8f
      Jens Axboe 提交于
      commit 221c5eb2338232f7340386de1c43decc32682e58 upstream.
      
      This is basically a direct port of bfe4037e, which implements a
      one-shot poll command through aio. Description below is based on that
      commit as well. However, instead of adding a POLL command and relying
      on io_cancel(2) to remove it, we mimic the epoll(2) interface of
      having a command to add a poll notification, IORING_OP_POLL_ADD,
      and one to remove it again, IORING_OP_POLL_REMOVE.
      
      To poll for a file descriptor the application should submit an sqe of
      type IORING_OP_POLL. It will poll the fd for the events specified in the
      poll_events field.
      
      Unlike poll or epoll without EPOLLONESHOT this interface always works in
      one shot mode, that is once the sqe is completed, it will have to be
      resubmitted.
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Based-on-code-from: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      51de0e8f
    • J
      io_uring: add io_kiocb ref count · 739ec170
      Jens Axboe 提交于
      commit c16361c1d805b6ea50c3c1fc5c314e944c71a984 upstream.
      
      We'll use this for the POLL implementation. Regular requests will
      NOT be using references, so initialize it to 0. Any real use of
      the io_kiocb ref will initialize it to at least 2.
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      739ec170
    • J
      io_uring: add submission polling · aa124ba8
      Jens Axboe 提交于
      commit 6c271ce2f1d572f7fa225700a13cfe7ced492434 upstream.
      
      This enables an application to do IO, without ever entering the kernel.
      By using the SQ ring to fill in new sqes and watching for completions
      on the CQ ring, we can submit and reap IOs without doing a single system
      call. The kernel side thread will poll for new submissions, and in case
      of HIPRI/polled IO, it'll also poll for completions.
      
      By default, we allow 1 second of active spinning. This can by changed
      by passing in a different grace period at io_uring_register(2) time.
      If the thread exceeds this idle time without having any work to do, it
      will set:
      
      sq_ring->flags |= IORING_SQ_NEED_WAKEUP.
      
      The application will have to call io_uring_enter() to start things back
      up again. If IO is kept busy, that will never be needed. Basically an
      application that has this feature enabled will guard it's
      io_uring_enter(2) call with:
      
      read_barrier();
      if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP)
      	io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP);
      
      instead of calling it unconditionally.
      
      It's mandatory to use fixed files with this feature. Failure to do so
      will result in the application getting an -EBADF CQ entry when
      submitting IO.
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      aa124ba8
    • J
      io_uring: add file set registration · 7bfbdad6
      Jens Axboe 提交于
      commit 6b06314c47e141031be043539900d80d2c7ba10f upstream.
      
      We normally have to fget/fput for each IO we do on a file. Even with
      the batching we do, the cost of the atomic inc/dec of the file usage
      count adds up.
      
      This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes
      for the io_uring_register(2) system call. The arguments passed in must
      be an array of __s32 holding file descriptors, and nr_args should hold
      the number of file descriptors the application wishes to pin for the
      duration of the io_uring instance (or until IORING_UNREGISTER_FILES is
      called).
      
      When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
      member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
      to the index in the array passed in to IORING_REGISTER_FILES.
      
      Files are automatically unregistered when the io_uring instance is torn
      down. An application need only unregister if it wishes to register a new
      set of fds.
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      7bfbdad6
    • J
      io_uring: add support for pre-mapped user IO buffers · a078ed69
      Jens Axboe 提交于
      commit edafccee56ff31678a091ddb7219aba9b28bc3cb upstream.
      
      If we have fixed user buffers, we can map them into the kernel when we
      setup the io_uring. That avoids the need to do get_user_pages() for
      each and every IO.
      
      To utilize this feature, the application must call io_uring_register()
      after having setup an io_uring instance, passing in
      IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to
      an iovec array, and the nr_args should contain how many iovecs the
      application wishes to map.
      
      If successful, these buffers are now mapped into the kernel, eligible
      for IO. To use these fixed buffers, the application must use the
      IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
      set sqe->index to the desired buffer index. sqe->addr..sqe->addr+seq->len
      must point to somewhere inside the indexed buffer.
      
      The application may register buffers throughout the lifetime of the
      io_uring instance. It can call io_uring_register() with
      IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
      buffers, and then register a new set. The application need not
      unregister buffers explicitly before shutting down the io_uring
      instance.
      
      It's perfectly valid to setup a larger buffer, and then sometimes only
      use parts of it for an IO. As long as the range is within the originally
      mapped region, it will work just fine.
      
      For now, buffers must not be file backed. If file backed buffers are
      passed in, the registration will fail with -1/EOPNOTSUPP. This
      restriction may be relaxed in the future.
      
      RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
      arbitrary 1G per buffer size is also imposed.
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      a078ed69
    • J
      io_uring: batch io_kiocb allocation · f25b8cbf
      Jens Axboe 提交于
      commit 2579f913d41a086563bb81762c519f3d62ddee37 upstream.
      
      Similarly to how we use the state->ios_left to know how many references
      to get to a file, we can use it to allocate the io_kiocb's we need in
      bulk.
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      f25b8cbf