1. 16 5月, 2019 4 次提交
    • J
      io_uring: use wait_event_interruptible for cq_wait conditional wait · fdb288a6
      Jackie Liu 提交于
      The previous patch has ensured that io_cqring_events contain
      smp_rmb memory barriers, Now we can use wait_event_interruptible
      to keep the code simple.
      Signed-off-by: NJackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      fdb288a6
    • J
      io_uring: adjust smp_rmb inside io_cqring_events · dc6ce4bc
      Jackie Liu 提交于
      Whenever smp_rmb is required to use io_cqring_events,
      keep smp_rmb inside the function io_cqring_events.
      Signed-off-by: NJackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      dc6ce4bc
    • R
      io_uring: fix infinite wait in khread_park() on io_finish_async() · 2bbcd6d3
      Roman Penyaev 提交于
      This fixes couple of races which lead to infinite wait of park completion
      with the following backtraces:
      
        [20801.303319] Call Trace:
        [20801.303321]  ? __schedule+0x284/0x650
        [20801.303323]  schedule+0x33/0xc0
        [20801.303324]  schedule_timeout+0x1bc/0x210
        [20801.303326]  ? schedule+0x3d/0xc0
        [20801.303327]  ? schedule_timeout+0x1bc/0x210
        [20801.303329]  ? preempt_count_add+0x79/0xb0
        [20801.303330]  wait_for_completion+0xa5/0x120
        [20801.303331]  ? wake_up_q+0x70/0x70
        [20801.303333]  kthread_park+0x48/0x80
        [20801.303335]  io_finish_async+0x2c/0x70
        [20801.303336]  io_ring_ctx_wait_and_kill+0x95/0x180
        [20801.303338]  io_uring_release+0x1c/0x20
        [20801.303339]  __fput+0xad/0x210
        [20801.303341]  task_work_run+0x8f/0xb0
        [20801.303342]  exit_to_usermode_loop+0xa0/0xb0
        [20801.303343]  do_syscall_64+0xe0/0x100
        [20801.303349]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        [20801.303380] Call Trace:
        [20801.303383]  ? __schedule+0x284/0x650
        [20801.303384]  schedule+0x33/0xc0
        [20801.303386]  io_sq_thread+0x38a/0x410
        [20801.303388]  ? __switch_to_asm+0x40/0x70
        [20801.303390]  ? wait_woken+0x80/0x80
        [20801.303392]  ? _raw_spin_lock_irqsave+0x17/0x40
        [20801.303394]  ? io_submit_sqes+0x120/0x120
        [20801.303395]  kthread+0x112/0x130
        [20801.303396]  ? kthread_create_on_node+0x60/0x60
        [20801.303398]  ret_from_fork+0x35/0x40
      
       o kthread_park() waits for park completion, so io_sq_thread() loop
         should check kthread_should_park() along with khread_should_stop(),
         otherwise if kthread_park() is called before prepare_to_wait()
         the following schedule() never returns:
      
         CPU#0                    CPU#1
      
         io_sq_thread_stop():     io_sq_thread():
      
                                     while(!kthread_should_stop() && !ctx->sqo_stop) {
      
            ctx->sqo_stop = 1;
            kthread_park()
      
      	                            prepare_to_wait();
                                          if (kthread_should_stop() {
      				    }
                                          schedule();   <<< nobody checks park flag,
      				                  <<< so schedule and never return
      
       o if the flag ctx->sqo_stop is observed by the io_sq_thread() loop
         it is quite possible, that kthread_should_park() check and the
         following kthread_parkme() is never called, because kthread_park()
         has not been yet called, but few moments later is is called and
         waits there for park completion, which never happens, because
         kthread has already exited:
      
         CPU#0                    CPU#1
      
         io_sq_thread_stop():     io_sq_thread():
      
            ctx->sqo_stop = 1;
                                     while(!kthread_should_stop() && !ctx->sqo_stop) {
                                         <<< observe sqo_stop and exit the loop
      			       }
      
      			       if (kthread_should_park())
      			           kthread_parkme();  <<< never called, since was
      					              <<< never parked
      
            kthread_park()           <<< waits forever for park completion
      
      In the current patch we quit the loop by only kthread_should_park()
      check (kthread_park() is synchronous, so kthread_should_stop() is
      never observed), and we abandon ->sqo_stop flag, since it is racy.
      At the end of the io_sq_thread() we unconditionally call parmke(),
      since we've exited the loop by the park flag.
      Signed-off-by: NRoman Penyaev <rpenyaev@suse.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-block@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2bbcd6d3
    • J
      io_uring: remove 'ev_flags' argument · c71ffb67
      Jens Axboe 提交于
      We always pass in 0 for the cqe flags argument, since the support for
      "this read hit page cache" hint was dropped.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c71ffb67
  2. 15 5月, 2019 2 次提交
    • J
      io_uring: fix failure to verify SQ_AFF cpu · 44a9bd18
      Jens Axboe 提交于
      The test case we have is rightfully failing with the current kernel:
      
      io_uring_setup(1, 0x7ffe2cafebe0), flags: IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF, resv: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000, sq_thread_cpu: 4
      expected -1, got 3
      
      This is in a vm, and CPU3 is the last valid one, hence asking for 4
      should fail the setup with -EINVAL, not succeed. The problem is that
      we're using array_index_nospec() with nr_cpu_ids as the index, hence we
      wrap and end up using CPU0 instead of CPU4. This makes the setup
      succeed where it should be failing.
      
      We don't need to use array_index_nospec() as we're not indexing any
      array with this. Instead just compare with nr_cpu_ids directly. This
      is fine as we're checking with cpu_online() afterwards.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      44a9bd18
    • I
      mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM · 932f4a63
      Ira Weiny 提交于
      Pach series "Add FOLL_LONGTERM to GUP fast and use it".
      
      HFI1, qib, and mthca, use get_user_pages_fast() due to its performance
      advantages.  These pages can be held for a significant time.  But
      get_user_pages_fast() does not protect against mapping FS DAX pages.
      
      Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which
      retains the performance while also adding the FS DAX checks.  XDP has also
      shown interest in using this functionality.[1]
      
      In addition we change get_user_pages() to use the new FOLL_LONGTERM flag
      and remove the specialized get_user_pages_longterm call.
      
      [1] https://lkml.org/lkml/2019/3/19/939
      
      "longterm" is a relative thing and at this point is probably a misnomer.
      This is really flagging a pin which is going to be given to hardware and
      can't move.  I've thought of a couple of alternative names but I think we
      have to settle on if we are going to use FL_LAYOUT or something else to
      solve the "longterm" problem.  Then I think we can change the flag to a
      better name.
      
      Secondly, it depends on how often you are registering memory.  I have
      spoken with some RDMA users who consider MR in the performance path...
      For the overall application performance.  I don't have the numbers as the
      tests for HFI1 were done a long time ago.  But there was a significant
      advantage.  Some of which is probably due to the fact that you don't have
      to hold mmap_sem.
      
      Finally, architecturally I think it would be good for everyone to use
      *_fast.  There are patches submitted to the RDMA list which would allow
      the use of *_fast (they reworking the use of mmap_sem) and as soon as they
      are accepted I'll submit a patch to convert the RDMA core as well.  Also
      to this point others are looking to use *_fast.
      
      As an aside, Jasons pointed out in my previous submission that *_fast and
      *_unlocked look very much the same.  I agree and I think further cleanup
      will be coming.  But I'm focused on getting the final solution for DAX at
      the moment.
      
      This patch (of 7):
      
      This patch starts a series which aims to support FOLL_LONGTERM in
      get_user_pages_fast().  Some callers who would like to do a longterm (user
      controlled pin) of pages with the fast variant of GUP for performance
      purposes.
      
      Rather than have a separate get_user_pages_longterm() call, introduce
      FOLL_LONGTERM and change the longterm callers to use it.
      
      This patch does not change any functionality.  In the short term
      "longterm" or user controlled pins are unsafe for Filesystems and FS DAX
      in particular has been blocked.  However, callers of get_user_pages_fast()
      were not "protected".
      
      FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it
      requires vmas to determine if DAX is in use.
      
      NOTE: In merging with the CMA changes we opt to change the
      get_user_pages() call in check_and_migrate_cma_pages() to a call of
      __get_user_pages_locked() on the newly migrated pages.  This makes the
      code read better in that we are calling __get_user_pages_locked() on the
      pages before and after a potential migration.
      
      As a side affect some of the interfaces are cleaned up but this is not the
      primary purpose of the series.
      
      In review[1] it was asked:
      
      <quote>
      > This I don't get - if you do lock down long term mappings performance
      > of the actual get_user_pages call shouldn't matter to start with.
      >
      > What do I miss?
      
      A couple of points.
      
      First "longterm" is a relative thing and at this point is probably a
      misnomer.  This is really flagging a pin which is going to be given to
      hardware and can't move.  I've thought of a couple of alternative names
      but I think we have to settle on if we are going to use FL_LAYOUT or
      something else to solve the "longterm" problem.  Then I think we can
      change the flag to a better name.
      
      Second, It depends on how often you are registering memory.  I have spoken
      with some RDMA users who consider MR in the performance path...  For the
      overall application performance.  I don't have the numbers as the tests
      for HFI1 were done a long time ago.  But there was a significant
      advantage.  Some of which is probably due to the fact that you don't have
      to hold mmap_sem.
      
      Finally, architecturally I think it would be good for everyone to use
      *_fast.  There are patches submitted to the RDMA list which would allow
      the use of *_fast (they reworking the use of mmap_sem) and as soon as they
      are accepted I'll submit a patch to convert the RDMA core as well.  Also
      to this point others are looking to use *_fast.
      
      As an asside, Jasons pointed out in my previous submission that *_fast and
      *_unlocked look very much the same.  I agree and I think further cleanup
      will be coming.  But I'm focused on getting the final solution for DAX at
      the moment.
      
      </quote>
      
      [1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965
      
      [ira.weiny@intel.com: v3]
        Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.comSigned-off-by: NIra Weiny <ira.weiny@intel.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Mike Marshall <hubcap@omnibond.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      932f4a63
  3. 13 5月, 2019 1 次提交
    • S
      io_uring: fix race condition reading SQE data · e2033e33
      Stefan Bühler 提交于
      When punting to workers the SQE gets copied after the initial try.
      There is a race condition between reading SQE data for the initial try
      and copying it for punting it to the workers.
      
      For example io_rw_done calls kiocb->ki_complete even if it was prepared
      for IORING_OP_FSYNC (and would be NULL).
      
      The easiest solution for now is to alway prepare again in the worker.
      
      req->file is safe to prepare though as long as it is checked before use.
      Signed-off-by: NStefan Bühler <source@stbuehler.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e2033e33
  4. 07 5月, 2019 2 次提交
    • S
      io_uring: use cpu_online() to check p->sq_thread_cpu instead of cpu_possible() · 7889f44d
      Shenghui Wang 提交于
      This issue is found by running liburing/test/io_uring_setup test.
      
      When test run, the testcase "attempt to bind to invalid cpu" would not
      pass with messages like:
         io_uring_setup(1, 0xbfc2f7c8), \
      flags: IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF, \
      resv: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000, \
      sq_thread_cpu: 2
         expected -1, got 3
         FAIL
      
      On my system, there is:
         CPU(s) possible : 0-3
         CPU(s) online   : 0-1
         CPU(s) offline  : 2-3
         CPU(s) present  : 0-1
      
      The sq_thread_cpu 2 is offline on my system, so the bind should fail.
      But cpu_possible() will pass the check. We shouldn't be able to bind
      to an offline cpu. Use cpu_online() to do the check.
      
      After the change, the testcase run as expected: EINVAL will be returned
      for cpu offlined.
      Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NShenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7889f44d
    • C
      io_uring: fix shadowed variable ret return code being not checked · efeb862b
      Colin Ian King 提交于
      Currently variable ret is declared in a while-loop code block that
      shadows another variable ret. When an error occurs in the while-loop
      the error return in ret is not being set in the outer code block and
      so the error check on ret is always going to be checking on the wrong
      ret variable resulting in check that is always going to be true and
      a premature return occurs.
      
      Fix this by removing the declaration of the inner while-loop variable
      ret so that shadowing does not occur.
      
      Addresses-Coverity: ("'Constant' variable guards dead code")
      Fixes: 6b06314c ("io_uring: add file set registration")
      Signed-off-by: NColin Ian King <colin.king@canonical.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      efeb862b
  5. 03 5月, 2019 4 次提交
  6. 02 5月, 2019 1 次提交
    • M
      io_uring: avoid page allocation warnings · d4ef6475
      Mark Rutland 提交于
      In io_sqe_buffer_register() we allocate a number of arrays based on the
      iov_len from the user-provided iov. While we limit iov_len to SZ_1G,
      we can still attempt to allocate arrays exceeding MAX_ORDER.
      
      On a 64-bit system with 4KiB pages, for an iov where iov_base = 0x10 and
      iov_len = SZ_1G, we'll calculate that nr_pages = 262145. When we try to
      allocate a corresponding array of (16-byte) bio_vecs, requiring 4194320
      bytes, which is greater than 4MiB. This results in SLUB warning that
      we're trying to allocate greater than MAX_ORDER, and failing the
      allocation.
      
      Avoid this by using kvmalloc() for allocations dependent on the
      user-provided iov_len. At the same time, fix a leak of imu->bvec when
      registration fails.
      
      Full splat from before this patch:
      
      WARNING: CPU: 1 PID: 2314 at mm/page_alloc.c:4595 __alloc_pages_nodemask+0x7ac/0x2938 mm/page_alloc.c:4595
      Kernel panic - not syncing: panic_on_warn set ...
      CPU: 1 PID: 2314 Comm: syz-executor326 Not tainted 5.1.0-rc7-dirty #4
      Hardware name: linux,dummy-virt (DT)
      Call trace:
       dump_backtrace+0x0/0x2f0 include/linux/compiler.h:193
       show_stack+0x20/0x30 arch/arm64/kernel/traps.c:158
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x110/0x190 lib/dump_stack.c:113
       panic+0x384/0x68c kernel/panic.c:214
       __warn+0x2bc/0x2c0 kernel/panic.c:571
       report_bug+0x228/0x2d8 lib/bug.c:186
       bug_handler+0xa0/0x1a0 arch/arm64/kernel/traps.c:956
       call_break_hook arch/arm64/kernel/debug-monitors.c:301 [inline]
       brk_handler+0x1d4/0x388 arch/arm64/kernel/debug-monitors.c:316
       do_debug_exception+0x1a0/0x468 arch/arm64/mm/fault.c:831
       el1_dbg+0x18/0x8c
       __alloc_pages_nodemask+0x7ac/0x2938 mm/page_alloc.c:4595
       alloc_pages_current+0x164/0x278 mm/mempolicy.c:2132
       alloc_pages include/linux/gfp.h:509 [inline]
       kmalloc_order+0x20/0x50 mm/slab_common.c:1231
       kmalloc_order_trace+0x30/0x2b0 mm/slab_common.c:1243
       kmalloc_large include/linux/slab.h:480 [inline]
       __kmalloc+0x3dc/0x4f0 mm/slub.c:3791
       kmalloc_array include/linux/slab.h:670 [inline]
       io_sqe_buffer_register fs/io_uring.c:2472 [inline]
       __io_uring_register fs/io_uring.c:2962 [inline]
       __do_sys_io_uring_register fs/io_uring.c:3008 [inline]
       __se_sys_io_uring_register fs/io_uring.c:2990 [inline]
       __arm64_sys_io_uring_register+0x9e0/0x1bc8 fs/io_uring.c:2990
       __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
       invoke_syscall arch/arm64/kernel/syscall.c:47 [inline]
       el0_svc_common.constprop.0+0x148/0x2e0 arch/arm64/kernel/syscall.c:83
       el0_svc_handler+0xdc/0x100 arch/arm64/kernel/syscall.c:129
       el0_svc+0x8/0xc arch/arm64/kernel/entry.S:948
      SMP: stopping secondary CPUs
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Kernel Offset: disabled
      CPU features: 0x002,23000438
      Memory Limit: none
      Rebooting in 1 seconds..
      
      Fixes: edafccee ("io_uring: add support for pre-mapped user IO buffers")
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-block@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d4ef6475
  7. 01 5月, 2019 4 次提交
    • J
      io_uring: drop req submit reference always in async punt · 817869d2
      Jens Axboe 提交于
      If we don't end up actually calling submit in io_sq_wq_submit_work(),
      we still need to drop the submit reference to the request. If we
      don't, then we can leak the request. This can happen if we race
      with ring shutdown while flushing the workqueue for requests that
      require use of the mm_struct.
      
      Fixes: e65ef56d ("io_uring: use regular request ref counts")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      817869d2
    • M
      io_uring: free allocated io_memory once · 52e04ef4
      Mark Rutland 提交于
      If io_allocate_scq_urings() fails to allocate an sq_* region, it will
      call io_mem_free() for any previously allocated regions, but leave
      dangling pointers to these regions in the ctx. Any regions which have
      not yet been allocated are left NULL. Note that when returning
      -EOVERFLOW, the previously allocated sq_ring is not freed, which appears
      to be an unintentional leak.
      
      When io_allocate_scq_urings() fails, io_uring_create() will call
      io_ring_ctx_wait_and_kill(), which calls io_mem_free() on all the sq_*
      regions, assuming the pointers are valid and not NULL.
      
      This can result in pages being freed multiple times, which has been
      observed to corrupt the page state, leading to subsequent fun. This can
      also result in virt_to_page() on NULL, resulting in the use of bogus
      page addresses, and yet more subsequent fun. The latter can be detected
      with CONFIG_DEBUG_VIRTUAL on arm64.
      
      Adding a cleanup path to io_allocate_scq_urings() complicates the logic,
      so let's leave it to io_ring_ctx_free() to consistently free these
      pointers, and simplify the io_allocate_scq_urings() error paths.
      
      Full splats from before this patch below. Note that the pointer logged
      by the DEBUG_VIRTUAL "non-linear address" warning has been hashed, and
      is actually NULL.
      
      [   26.098129] page:ffff80000e949a00 count:0 mapcount:-128 mapping:0000000000000000 index:0x0
      [   26.102976] flags: 0x63fffc000000()
      [   26.104373] raw: 000063fffc000000 ffff80000e86c188 ffff80000ea3df08 0000000000000000
      [   26.108917] raw: 0000000000000000 0000000000000001 00000000ffffff7f 0000000000000000
      [   26.137235] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
      [   26.143960] ------------[ cut here ]------------
      [   26.146020] kernel BUG at include/linux/mm.h:547!
      [   26.147586] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
      [   26.149163] Modules linked in:
      [   26.150287] Process syz-executor.21 (pid: 20204, stack limit = 0x000000000e9cefeb)
      [   26.153307] CPU: 2 PID: 20204 Comm: syz-executor.21 Not tainted 5.1.0-rc7-00004-g7d30b2ea43d6 #18
      [   26.156566] Hardware name: linux,dummy-virt (DT)
      [   26.158089] pstate: 40400005 (nZcv daif +PAN -UAO)
      [   26.159869] pc : io_mem_free+0x9c/0xa8
      [   26.161436] lr : io_mem_free+0x9c/0xa8
      [   26.162720] sp : ffff000013003d60
      [   26.164048] x29: ffff000013003d60 x28: ffff800025048040
      [   26.165804] x27: 0000000000000000 x26: ffff800025048040
      [   26.167352] x25: 00000000000000c0 x24: ffff0000112c2820
      [   26.169682] x23: 0000000000000000 x22: 0000000020000080
      [   26.171899] x21: ffff80002143b418 x20: ffff80002143b400
      [   26.174236] x19: ffff80002143b280 x18: 0000000000000000
      [   26.176607] x17: 0000000000000000 x16: 0000000000000000
      [   26.178997] x15: 0000000000000000 x14: 0000000000000000
      [   26.181508] x13: 00009178a5e077b2 x12: 0000000000000001
      [   26.183863] x11: 0000000000000000 x10: 0000000000000980
      [   26.186437] x9 : ffff000013003a80 x8 : ffff800025048a20
      [   26.189006] x7 : ffff8000250481c0 x6 : ffff80002ffe9118
      [   26.191359] x5 : ffff80002ffe9118 x4 : 0000000000000000
      [   26.193863] x3 : ffff80002ffefe98 x2 : 44c06ddd107d1f00
      [   26.196642] x1 : 0000000000000000 x0 : 000000000000003e
      [   26.198892] Call trace:
      [   26.199893]  io_mem_free+0x9c/0xa8
      [   26.201155]  io_ring_ctx_wait_and_kill+0xec/0x180
      [   26.202688]  io_uring_setup+0x6c4/0x6f0
      [   26.204091]  __arm64_sys_io_uring_setup+0x18/0x20
      [   26.205576]  el0_svc_common.constprop.0+0x7c/0xe8
      [   26.207186]  el0_svc_handler+0x28/0x78
      [   26.208389]  el0_svc+0x8/0xc
      [   26.209408] Code: aa0203e0 d0006861 9133a021 97fcdc3c (d4210000)
      [   26.211995] ---[ end trace bdb81cd43a21e50d ]---
      
      [   81.770626] ------------[ cut here ]------------
      [   81.825015] virt_to_phys used for non-linear address: 000000000d42f2c7 (          (null))
      [   81.827860] WARNING: CPU: 1 PID: 30171 at arch/arm64/mm/physaddr.c:15 __virt_to_phys+0x48/0x68
      [   81.831202] Modules linked in:
      [   81.832212] CPU: 1 PID: 30171 Comm: syz-executor.20 Not tainted 5.1.0-rc7-00004-g7d30b2ea43d6 #19
      [   81.835616] Hardware name: linux,dummy-virt (DT)
      [   81.836863] pstate: 60400005 (nZCv daif +PAN -UAO)
      [   81.838727] pc : __virt_to_phys+0x48/0x68
      [   81.840572] lr : __virt_to_phys+0x48/0x68
      [   81.842264] sp : ffff80002cf67c70
      [   81.843858] x29: ffff80002cf67c70 x28: ffff800014358e18
      [   81.846463] x27: 0000000000000000 x26: 0000000020000080
      [   81.849148] x25: 0000000000000000 x24: ffff80001bb01f40
      [   81.851986] x23: ffff200011db06c8 x22: ffff2000127e3c60
      [   81.854351] x21: ffff800014358cc0 x20: ffff800014358d98
      [   81.856711] x19: 0000000000000000 x18: 0000000000000000
      [   81.859132] x17: 0000000000000000 x16: 0000000000000000
      [   81.861586] x15: 0000000000000000 x14: 0000000000000000
      [   81.863905] x13: 0000000000000000 x12: ffff1000037603e9
      [   81.866226] x11: 1ffff000037603e8 x10: 0000000000000980
      [   81.868776] x9 : ffff80002cf67840 x8 : ffff80001bb02920
      [   81.873272] x7 : ffff1000037603e9 x6 : ffff80001bb01f47
      [   81.875266] x5 : ffff1000037603e9 x4 : dfff200000000000
      [   81.876875] x3 : ffff200010087528 x2 : ffff1000059ecf58
      [   81.878751] x1 : 44c06ddd107d1f00 x0 : 0000000000000000
      [   81.880453] Call trace:
      [   81.881164]  __virt_to_phys+0x48/0x68
      [   81.882919]  io_mem_free+0x18/0x110
      [   81.886585]  io_ring_ctx_wait_and_kill+0x13c/0x1f0
      [   81.891212]  io_uring_setup+0xa60/0xad0
      [   81.892881]  __arm64_sys_io_uring_setup+0x2c/0x38
      [   81.894398]  el0_svc_common.constprop.0+0xac/0x150
      [   81.896306]  el0_svc_handler+0x34/0x88
      [   81.897744]  el0_svc+0x8/0xc
      [   81.898715] ---[ end trace b4a703802243cbba ]---
      
      Fixes: 2b188cc1 ("Add io_uring IO interface")
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-block@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      52e04ef4
    • M
      io_uring: fix SQPOLL cpu validation · 975554b0
      Mark Rutland 提交于
      In io_sq_offload_start(), we call cpu_possible() on an unbounded cpu
      value from userspace. On v5.1-rc7 on arm64 with
      CONFIG_DEBUG_PER_CPU_MAPS, this results in a splat:
      
        WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpu_max_bits_warn include/linux/cpumask.h:121 [inline]
      
      There was an attempt to fix this in commit:
      
        917257da ("io_uring: only test SQPOLL cpu after we've verified it")
      
      ... by adding a check after the cpu value had been limited to NR_CPU_IDS
      using array_index_nospec(). However, this left an unbound check at the
      start of the function, for which the warning still fires.
      
      Let's fix this correctly by checking that the cpu value is bound by
      nr_cpu_ids before passing it to cpu_possible(). Note that only
      nr_cpu_ids of a cpumask are guaranteed to exist at runtime, and
      nr_cpu_ids can be significantly smaller than NR_CPUs. For example, an
      arm64 defconfig has NR_CPUS=256, while my test VM has 4 vCPUs.
      
      Following the intent from the commit message for 917257da, the
      check is moved under the SQ_AFF branch, which is the only branch where
      the cpu values is consumed. The check is performed before bounding the
      value with array_index_nospec() so that we don't silently accept bogus
      cpu values from userspace, where array_index_nospec() would force these
      values to 0.
      
      I suspect we can remove the array_index_nospec() call entirely, but I've
      conservatively left that in place, updated to use nr_cpu_ids to match
      the prior check.
      
      Tested on arm64 with the Syzkaller reproducer:
      
        https://syzkaller.appspot.com/bug?extid=cd714a07c6de2bc34293
        https://syzkaller.appspot.com/x/repro.syz?x=15d8b397200000
      
      Full splat from before this patch:
      
      WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpu_max_bits_warn include/linux/cpumask.h:121 [inline]
      WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpumask_check include/linux/cpumask.h:128 [inline]
      WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpumask_test_cpu include/linux/cpumask.h:344 [inline]
      WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 io_sq_offload_start fs/io_uring.c:2244 [inline]
      WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 io_uring_create fs/io_uring.c:2864 [inline]
      WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 io_uring_setup+0x1108/0x15a0 fs/io_uring.c:2916
      Kernel panic - not syncing: panic_on_warn set ...
      CPU: 1 PID: 27601 Comm: syz-executor.0 Not tainted 5.1.0-rc7 #3
      Hardware name: linux,dummy-virt (DT)
      Call trace:
       dump_backtrace+0x0/0x2f0 include/linux/compiler.h:193
       show_stack+0x20/0x30 arch/arm64/kernel/traps.c:158
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x110/0x190 lib/dump_stack.c:113
       panic+0x384/0x68c kernel/panic.c:214
       __warn+0x2bc/0x2c0 kernel/panic.c:571
       report_bug+0x228/0x2d8 lib/bug.c:186
       bug_handler+0xa0/0x1a0 arch/arm64/kernel/traps.c:956
       call_break_hook arch/arm64/kernel/debug-monitors.c:301 [inline]
       brk_handler+0x1d4/0x388 arch/arm64/kernel/debug-monitors.c:316
       do_debug_exception+0x1a0/0x468 arch/arm64/mm/fault.c:831
       el1_dbg+0x18/0x8c
       cpu_max_bits_warn include/linux/cpumask.h:121 [inline]
       cpumask_check include/linux/cpumask.h:128 [inline]
       cpumask_test_cpu include/linux/cpumask.h:344 [inline]
       io_sq_offload_start fs/io_uring.c:2244 [inline]
       io_uring_create fs/io_uring.c:2864 [inline]
       io_uring_setup+0x1108/0x15a0 fs/io_uring.c:2916
       __do_sys_io_uring_setup fs/io_uring.c:2929 [inline]
       __se_sys_io_uring_setup fs/io_uring.c:2926 [inline]
       __arm64_sys_io_uring_setup+0x50/0x70 fs/io_uring.c:2926
       __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
       invoke_syscall arch/arm64/kernel/syscall.c:47 [inline]
       el0_svc_common.constprop.0+0x148/0x2e0 arch/arm64/kernel/syscall.c:83
       el0_svc_handler+0xdc/0x100 arch/arm64/kernel/syscall.c:129
       el0_svc+0x8/0xc arch/arm64/kernel/entry.S:948
      SMP: stopping secondary CPUs
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Kernel Offset: disabled
      CPU features: 0x002,23000438
      Memory Limit: none
      Rebooting in 1 seconds..
      
      Fixes: 917257da ("io_uring: only test SQPOLL cpu after we've verified it")
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-block@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      
      Simplied the logic
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      975554b0
    • J
      io_uring: have submission side sqe errors post a cqe · 5c8b0b54
      Jens Axboe 提交于
      Currently we only post a cqe if we get an error OUTSIDE of submission.
      For submission, we return the error directly through io_uring_enter().
      This is a bit awkward for applications, and it makes more sense to
      always post a cqe with an error, if the error happens on behalf of an
      sqe.
      
      This changes submission behavior a bit. io_uring_enter() returns -ERROR
      for an error, and > 0 for number of sqes submitted. Before this change,
      if you wanted to submit 8 entries and had an error on the 5th entry,
      io_uring_enter() would return 4 (for number of entries successfully
      submitted) and rewind the sqring. The application would then have to
      peek at the sqring and figure out what was wrong with the head sqe, and
      then skip it itself. With this change, we'll return 5 since we did
      consume 5 sqes, and the last sqe (with the error) will result in a cqe
      being posted with the error.
      
      This makes the logic easier to handle in the application, and it cleans
      up the submission part.
      Suggested-by: NStefan Bühler <source@stbuehler.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5c8b0b54
  8. 30 4月, 2019 8 次提交
  9. 23 4月, 2019 5 次提交
  10. 18 4月, 2019 1 次提交
    • J
      io_uring: fix CQ overflow condition · 74f464e9
      Jens Axboe 提交于
      This is a leftover from when the rings initially were not free flowing,
      and hence a test for tail + 1 == head would indicate full. Since we now
      let them wrap instead of mask them with the size, we need to check if
      they drift more than the ring size from each other.
      
      This fixes a case where we'd overwrite CQ ring entries, if the user
      failed to reap completions. Both cases would ultimately result in lost
      completions as the application violated the depth it asked for. The only
      difference is that before this fix we'd return invalid entries for the
      overflowed completions, instead of properly flagging it in the
      cq_ring->overflow variable.
      Reported-by: NStefan Bühler <source@stbuehler.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      74f464e9
  11. 16 4月, 2019 1 次提交
    • J
      io_uring: fix possible deadlock between io_uring_{enter,register} · b19062a5
      Jens Axboe 提交于
      If we have multiple threads, one doing io_uring_enter() while the other
      is doing io_uring_register(), we can run into a deadlock between the
      two. io_uring_register() must wait for existing users of the io_uring
      instance to exit. But it does so while holding the io_uring mutex.
      Callers of io_uring_enter() may need this mutex to make progress (and
      eventually exit). If we wait for users to exit in io_uring_register(),
      we can't do so with the io_uring mutex held without potentially risking
      a deadlock.
      
      Drop the io_uring mutex while waiting for existing callers to exit. This
      is safe and guaranteed to make forward progress, since we already killed
      the percpu ref before doing so. Hence later callers of io_uring_enter()
      will be rejected.
      
      Reported-by: syzbot+16dc03452dee970a0c3e@syzkaller.appspotmail.com
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b19062a5
  12. 14 4月, 2019 3 次提交
    • J
      io_uring: drop io_file_put() 'file' argument · 3d6770fb
      Jens Axboe 提交于
      Since the fget/fput handling was reworked in commit 09bb8394, we
      never call io_file_put() with state == NULL (and hence file != NULL)
      anymore. Remove that case.
      Reported-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      3d6770fb
    • J
      io_uring: only test SQPOLL cpu after we've verified it · 917257da
      Jens Axboe 提交于
      We currently call cpu_possible() even if we don't use the CPU. Move the
      test under the SQ_AFF branch, which is the only place where we'll use
      the value. Do the cpu_possible() test AFTER we've limited it to a max
      of NR_CPUS. This avoids triggering the following warning:
      
      WARNING: CPU: 1 PID: 7600 at include/linux/cpumask.h:121 cpu_max_bits_warn
      
      if CONFIG_DEBUG_PER_CPU_MAPS is enabled.
      
      While in there, also move the SQ thread idle period assignment inside
      SETUP_SQPOLL, as we don't use it otherwise either.
      
      Reported-by: syzbot+cd714a07c6de2bc34293@syzkaller.appspotmail.com
      Fixes: 6c271ce2 ("io_uring: add submission polling")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      917257da
    • J
      io_uring: park SQPOLL thread if it's percpu · 06058632
      Jens Axboe 提交于
      kthread expects this, or we can throw a warning on exit:
      
      WARNING: CPU: 0 PID: 7822 at kernel/kthread.c:399
      __kthread_bind_mask+0x3b/0xc0 kernel/kthread.c:399
      Kernel panic - not syncing: panic_on_warn set ...
      CPU: 0 PID: 7822 Comm: syz-executor030 Not tainted 5.1.0-rc4-next-20190412
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      Call Trace:
        __dump_stack lib/dump_stack.c:77 [inline]
        dump_stack+0x172/0x1f0 lib/dump_stack.c:113
        panic+0x2cb/0x72b kernel/panic.c:214
        __warn.cold+0x20/0x46 kernel/panic.c:576
        report_bug+0x263/0x2b0 lib/bug.c:186
        fixup_bug arch/x86/kernel/traps.c:179 [inline]
        fixup_bug arch/x86/kernel/traps.c:174 [inline]
        do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:272
        do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:291
        invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:973
      RIP: 0010:__kthread_bind_mask+0x3b/0xc0 kernel/kthread.c:399
      Code: 48 89 fb e8 f7 ab 24 00 4c 89 e6 48 89 df e8 ac e1 02 00 31 ff 49 89
      c4 48 89 c6 e8 7f ad 24 00 4d 85 e4 75 15 e8 d5 ab 24 00 <0f> 0b e8 ce ab
      24 00 5b 41 5c 41 5d 41 5e 5d c3 e8 c0 ab 24 00 4c
      RSP: 0018:ffff8880a89bfbb8 EFLAGS: 00010293
      RAX: ffff88808ca7a280 RBX: ffff8880a98e4380 RCX: ffffffff814bdd11
      RDX: 0000000000000000 RSI: ffffffff814bdd1b RDI: 0000000000000007
      RBP: ffff8880a89bfbd8 R08: ffff88808ca7a280 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      R13: ffffffff87691148 R14: ffff8880a98e43a0 R15: ffffffff81c91e10
        __kthread_bind kernel/kthread.c:412 [inline]
        kthread_unpark+0x123/0x160 kernel/kthread.c:480
        kthread_stop+0xfa/0x6c0 kernel/kthread.c:556
        io_sq_thread_stop fs/io_uring.c:2057 [inline]
        io_sq_thread_stop fs/io_uring.c:2052 [inline]
        io_finish_async+0xab/0x180 fs/io_uring.c:2064
        io_ring_ctx_free fs/io_uring.c:2534 [inline]
        io_ring_ctx_wait_and_kill+0x133/0x510 fs/io_uring.c:2591
        io_uring_release+0x42/0x50 fs/io_uring.c:2599
        __fput+0x2e5/0x8d0 fs/file_table.c:278
        ____fput+0x16/0x20 fs/file_table.c:309
        task_work_run+0x14a/0x1c0 kernel/task_work.c:113
        exit_task_work include/linux/task_work.h:22 [inline]
        do_exit+0x90a/0x2fa0 kernel/exit.c:876
        do_group_exit+0x135/0x370 kernel/exit.c:980
        __do_sys_exit_group kernel/exit.c:991 [inline]
        __se_sys_exit_group kernel/exit.c:989 [inline]
        __x64_sys_exit_group+0x44/0x50 kernel/exit.c:989
        do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Reported-by: syzbot+6d4a92619eb0ad08602b@syzkaller.appspotmail.com
      Fixes: 6c271ce2 ("io_uring: add submission polling")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      06058632
  13. 09 4月, 2019 1 次提交
  14. 03 4月, 2019 1 次提交
    • J
      io_uring: fix double free in case of fileset regitration failure · 25adf50f
      Jens Axboe 提交于
      Will Deacon reported the following KASAN complaint:
      
      [  149.890370] ==================================================================
      [  149.891266] BUG: KASAN: double-free or invalid-free in io_sqe_files_unregister+0xa8/0x140
      [  149.892218]
      [  149.892411] CPU: 113 PID: 3974 Comm: io_uring_regist Tainted: G    B             5.1.0-rc3-00012-g40b114779944 #3
      [  149.893623] Hardware name: linux,dummy-virt (DT)
      [  149.894169] Call trace:
      [  149.894539]  dump_backtrace+0x0/0x228
      [  149.895172]  show_stack+0x14/0x20
      [  149.895747]  dump_stack+0xe8/0x124
      [  149.896335]  print_address_description+0x60/0x258
      [  149.897148]  kasan_report_invalid_free+0x78/0xb8
      [  149.897936]  __kasan_slab_free+0x1fc/0x228
      [  149.898641]  kasan_slab_free+0x10/0x18
      [  149.899283]  kfree+0x70/0x1f8
      [  149.899798]  io_sqe_files_unregister+0xa8/0x140
      [  149.900574]  io_ring_ctx_wait_and_kill+0x190/0x3c0
      [  149.901402]  io_uring_release+0x2c/0x48
      [  149.902068]  __fput+0x18c/0x510
      [  149.902612]  ____fput+0xc/0x18
      [  149.903146]  task_work_run+0xf0/0x148
      [  149.903778]  do_notify_resume+0x554/0x748
      [  149.904467]  work_pending+0x8/0x10
      [  149.905060]
      [  149.905331] Allocated by task 3974:
      [  149.905934]  __kasan_kmalloc.isra.0.part.1+0x48/0xf8
      [  149.906786]  __kasan_kmalloc.isra.0+0xb8/0xd8
      [  149.907531]  kasan_kmalloc+0xc/0x18
      [  149.908134]  __kmalloc+0x168/0x248
      [  149.908724]  __arm64_sys_io_uring_register+0x2b8/0x15a8
      [  149.909622]  el0_svc_common+0x100/0x258
      [  149.910281]  el0_svc_handler+0x48/0xc0
      [  149.910928]  el0_svc+0x8/0xc
      [  149.911425]
      [  149.911696] Freed by task 3974:
      [  149.912242]  __kasan_slab_free+0x114/0x228
      [  149.912955]  kasan_slab_free+0x10/0x18
      [  149.913602]  kfree+0x70/0x1f8
      [  149.914118]  __arm64_sys_io_uring_register+0xc2c/0x15a8
      [  149.915009]  el0_svc_common+0x100/0x258
      [  149.915670]  el0_svc_handler+0x48/0xc0
      [  149.916317]  el0_svc+0x8/0xc
      [  149.916817]
      [  149.917101] The buggy address belongs to the object at ffff8004ce07ed00
      [  149.917101]  which belongs to the cache kmalloc-128 of size 128
      [  149.919197] The buggy address is located 0 bytes inside of
      [  149.919197]  128-byte region [ffff8004ce07ed00, ffff8004ce07ed80)
      [  149.921142] The buggy address belongs to the page:
      [  149.921953] page:ffff7e0013381f00 count:1 mapcount:0 mapping:ffff800503417c00 index:0x0 compound_mapcount: 0
      [  149.923595] flags: 0x1ffff00000010200(slab|head)
      [  149.924388] raw: 1ffff00000010200 dead000000000100 dead000000000200 ffff800503417c00
      [  149.925706] raw: 0000000000000000 0000000080400040 00000001ffffffff 0000000000000000
      [  149.927011] page dumped because: kasan: bad access detected
      [  149.927956]
      [  149.928224] Memory state around the buggy address:
      [  149.929054]  ffff8004ce07ec00: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
      [  149.930274]  ffff8004ce07ec80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [  149.931494] >ffff8004ce07ed00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  149.932712]                    ^
      [  149.933281]  ffff8004ce07ed80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [  149.934508]  ffff8004ce07ee00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [  149.935725] ==================================================================
      
      which is due to a failure in registrering a fileset. This frees the
      ctx->user_files pointer, but doesn't clear it. When the io_uring
      instance is later freed through the normal channels, we free this
      pointer again. At this point it's invalid.
      
      Ensure we clear the pointer when we free it for the error case.
      Reported-by: NWill Deacon <will.deacon@arm.com>
      Tested-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      25adf50f
  15. 26 3月, 2019 2 次提交