1. 24 Jan, 2021 5 commits
  2. 17 Jan, 2021 2 commits
  3. 16 Jan, 2021 6 commits
  4. 14 Jan, 2021 5 commits
  5. 13 Jan, 2021 2 commits
    • io_uring: do sqo disable on install_fd error · 06585c49
      Committed by Pavel Begunkov
      WARNING: CPU: 0 PID: 8494 at fs/io_uring.c:8717
      	io_ring_ctx_wait_and_kill+0x4f2/0x600 fs/io_uring.c:8717
      Call Trace:
       io_uring_release+0x3e/0x50 fs/io_uring.c:8759
       __fput+0x283/0x920 fs/file_table.c:280
       task_work_run+0xdd/0x190 kernel/task_work.c:140
       tracehook_notify_resume include/linux/tracehook.h:189 [inline]
       exit_to_user_mode_loop kernel/entry/common.c:174 [inline]
       exit_to_user_mode_prepare+0x249/0x250 kernel/entry/common.c:201
       __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
       syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:302
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      A failed io_uring_install_fd() is a special case: we don't call
      io_ring_ctx_wait_and_kill() directly but defer it to fput, though we
      still need to call io_disable_sqo_submit() beforehand.
      
      note: it doesn't fix any real problem, just a warning. That's because
      sqring won't be available to the userspace in this case and so SQPOLL
      won't submit anything.
      
      Reported-by: syzbot+9c9c35374c0ecac06516@syzkaller.appspotmail.com
      Fixes: d9d05217 ("io_uring: stop SQPOLL submit on creator's death")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix null-deref in io_disable_sqo_submit · b4411616
      Committed by Pavel Begunkov
      general protection fault, probably for non-canonical address
      	0xdffffc0000000022: 0000 [#1] KASAN: null-ptr-deref
      	in range [0x0000000000000110-0x0000000000000117]
      RIP: 0010:io_ring_set_wakeup_flag fs/io_uring.c:6929 [inline]
      RIP: 0010:io_disable_sqo_submit+0xdb/0x130 fs/io_uring.c:8891
      Call Trace:
       io_uring_create fs/io_uring.c:9711 [inline]
       io_uring_setup+0x12b1/0x38e0 fs/io_uring.c:9739
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      io_disable_sqo_submit() might be called before the user rings were
      allocated; don't call io_ring_set_wakeup_flag() in those cases.
      
      Reported-by: syzbot+ab412638aeb652ded540@syzkaller.appspotmail.com
      Fixes: d9d05217 ("io_uring: stop SQPOLL submit on creator's death")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
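The guard described in the commit above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `model_ctx`, `model_rings`, `model_disable_sqo_submit()` and `MODEL_SQ_NEED_WAKEUP` are hypothetical stand-ins invented for illustration, not the kernel's definitions, and the real function does more than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the wakeup bit set by io_ring_set_wakeup_flag(). */
#define MODEL_SQ_NEED_WAKEUP 1u

struct model_rings {
    unsigned int sq_flags;
};

struct model_ctx {
    bool sqo_dead;
    struct model_rings *rings;   /* NULL until the user rings are allocated */
};

/* Mark the ring as dead, touching the rings only if they exist. */
static void model_disable_sqo_submit(struct model_ctx *ctx)
{
    assert(ctx != NULL);
    ctx->sqo_dead = true;
    if (ctx->rings)              /* the guard: skip the flag before allocation */
        ctx->rings->sq_flags |= MODEL_SQ_NEED_WAKEUP;
}
```

Calling `model_disable_sqo_submit()` on a context whose `rings` pointer is still NULL only sets `sqo_dead`; once rings are attached, the wakeup bit is set as well.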
  6. 11 Jan, 2021 12 commits
  7. 10 Jan, 2021 4 commits
    • io_uring: stop SQPOLL submit on creator's death · d9d05217
      Committed by Pavel Begunkov
      When the creator of an SQPOLL io_uring dies (i.e. sqo_task), we don't
      want its internals like ->files and ->mm to be poked by the SQPOLL task.
      It has never been nice and recently got racy. That can happen when the
      owner undergoes destruction and the SQPOLL task tries to submit new
      requests in parallel, and so calls io_sq_thread_acquire*().
      
      This patch halts SQPOLL submissions when sqo_task dies by introducing
      the sqo_dead flag. Once set, the SQPOLL task must not do any submission,
      which is synchronised by uring_lock as well as the new flag.
      
      The tricky part is making sure that disabling always happens: either
      the ring is discovered by the creator's do_exit() -> cancel path, or,
      if the final close() happens earlier, the disabling is done by the
      creator at that point. The latter is guaranteed by the fact that for
      SQPOLL the creator task, and only it, holds exactly one file note, so
      the note either pins the ring up to do_exit() or is removed by the
      creator on the final put in flush (see the comments in uring_flush()
      around file->f_count == 2).
      
      One more place that can trigger io_sq_thread_acquire_*() is
      __io_req_task_submit(). Shoot off requests on sqo_dead there, even
      though actually we don't need to. That's because cancellation of
      sqo_task should wait for the request before going any further.
      
      note 1: io_disable_sqo_submit() does io_ring_set_wakeup_flag() so the
      caller would enter the ring to get an error, but it still doesn't
      guarantee that the flag won't be cleared.
      
      note 2: if final __userspace__ close happens not from the creator
      task, the file note will pin the ring until the task dies.
      
      Fixes: b1b6b5a3 ("kernel/io_uring: cancel io_uring before task works")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
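The "once set, no more submission" rule from the commit above can be modelled in a few lines of userspace C. This is a single-threaded sketch under stated assumptions: `model_ring`, `model_mark_sqo_dead()` and `model_sqpoll_submit()` are hypothetical names, the locking (uring_lock in the commit text) is deliberately left out, and returning `-EOWNERDEAD` is this model's choice of error code rather than something stated in the message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct model_ring {
    bool sqo_dead;          /* set once the creator task is gone */
    unsigned int queued;    /* requests waiting to be submitted */
};

/* Called on the creator's exit path or on the final close. */
static void model_mark_sqo_dead(struct model_ring *ring)
{
    assert(ring != NULL);
    ring->sqo_dead = true;
}

/* Poller side: refuse to submit anything once the flag is set. */
static int model_sqpoll_submit(struct model_ring *ring)
{
    assert(ring != NULL);
    if (ring->sqo_dead)
        return -EOWNERDEAD;   /* assumption: error code chosen for the model */

    int submitted = (int)ring->queued;
    ring->queued = 0;
    return submitted;
}
```

In this model, requests queued after `model_mark_sqo_dead()` stay queued and every later `model_sqpoll_submit()` call reports an error instead of touching them.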
    • io_uring: add warn_once for io_uring_flush() · 6b5733eb
      Committed by Pavel Begunkov
      files_cancel() should cancel all relevant requests and drop file notes,
      so we should never have file notes after that, including on-exit fput
      and flush. Add a WARN_ONCE to be sure.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: inline io_uring_attempt_task_drop() · 4f793dc4
      Committed by Pavel Begunkov
      A simple preparation change inlining io_uring_attempt_task_drop() into
      io_uring_flush().
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: io_rw_reissue lockdep annotations · 55e6ac1e
      Committed by Pavel Begunkov
      We expect io_rw_reissue() to take place only during submission with
      uring_lock held. Add a lockdep annotation to check that invariant.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. 09 Jan, 2021 1 commit
    • poll: fix performance regression due to out-of-line __put_user() · ef0ba055
      Committed by Linus Torvalds
      The kernel test robot reported a -5.8% performance regression on the
      "poll2" test of will-it-scale, and bisected it to commit d55564cf
      ("x86: Make __put_user() generate an out-of-line call").
      
      I didn't expect an out-of-line __put_user() to matter, because no normal
      core code should use that non-checking legacy version of user access any
      more.  But I had overlooked the very odd poll() usage, which does a
      __put_user() to update the 'revents' values of the poll array.
      
      Now, Al Viro correctly points out that instead of updating just the
      'revents' field, it would be much simpler to just copy the _whole_
      pollfd entry, and then we could just use "copy_to_user()" on the whole
      array of entries, the same way we use "copy_from_user()" a few lines
      earlier to get the original values.
      
      But that is not what we've traditionally done, and I worry that threaded
      applications might be concurrently modifying the other fields of the
      pollfd array.  So while Al's suggestion is simpler - and perhaps worth
      trying in the future - this instead keeps the "just update revents"
      model.
      
      To fix the performance regression, use the modern "unsafe_put_user()"
      instead of __put_user(), with the proper "user_write_access_begin()"
      guarding in place. This improves code generation enormously.
      
      Link: https://lore.kernel.org/lkml/20210107134723.GA28532@xsang-OptiPlex-9020/
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Tested-by: Oliver Sang <oliver.sang@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
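The "just update revents" model kept by the commit above can be sketched in plain userspace C. This is an illustration only: `model_write_back_revents()` is a hypothetical helper that uses ordinary memory stores, whereas the commit message describes unsafe_put_user() inside a user_write_access_begin() section for the real user-space write.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>

/*
 * Copy only the revents field from the kernel-side snapshot back to the
 * caller's array, leaving fd and events exactly as the caller has them now.
 */
static void model_write_back_revents(struct pollfd *user,
                                     const struct pollfd *snapshot,
                                     size_t n)
{
    assert(user != NULL && snapshot != NULL);
    for (size_t i = 0; i < n; i++)
        user[i].revents = snapshot[i].revents;
}
```

If another thread changes `events` in the caller's array after the snapshot was taken, that change survives the write-back. Copying whole entries, as in the simpler alternative mentioned in the message, would overwrite it with the stale snapshot value.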
  9. 08 Jan, 2021 3 commits
    • btrfs: shrink delalloc pages instead of full inodes · e076ab2a
      Committed by Josef Bacik
      Commit 38d715f4 ("btrfs: use btrfs_start_delalloc_roots in
      shrink_delalloc") cleaned up how we do delalloc shrinking by utilizing
      some infrastructure we have in place to flush inodes that we use for
      device replace and snapshot.  However this introduced a pretty serious
      performance regression.  To reproduce, the user untarred the source
      tarball of Firefox (360MiB xz compressed/1.5GiB uncompressed) and saw
      it take anywhere from 5 to 20 times as long to untar in 5.10 compared
      to 5.9. This was observed on fast devices (SSD and better) and not on
      HDD.
      
      The root cause is that previously we would generally use the normal
      writeback path to reclaim delalloc space, and for this we would provide
      it with the number of pages we wanted to flush.  The referenced commit
      changed this to flush that many inodes, which drastically increased the
      amount of space we were flushing in certain cases, which severely
      affected performance.
      
      We cannot revert this patch unfortunately because of 3d45f221
      ("btrfs: fix deadlock when cloning inline extent and low on free
      metadata space") which requires the ability to skip flushing inodes that
      are being cloned in certain scenarios, which means we need to keep using
      our flushing infrastructure or risk re-introducing the deadlock.
      
      Instead, to fix this problem we can go back to providing
      btrfs_start_delalloc_roots with a number of pages to flush, and then set
      up a writeback_control and utilize sync_inode() to handle the flushing
      for us.  This gives us the same behavior we had prior to the fix, while
      still allowing us to avoid the deadlock that was fixed by Filipe.  I
      redid the user's original test and got the following results on one of
      our test machines (256GiB of ram, 56 cores, 2TiB Intel NVMe drive):
      
        5.9		0m54.258s
        5.10		1m26.212s
        5.10+patch	0m38.800s
      
      5.10+patch is significantly faster than plain 5.9 because of my patch
      series "Change data reservations to use the ticketing infra" which
      contained the patch that introduced the regression, but generally
      improved the overall ENOSPC flushing mechanisms.
      
      Additional testing on a consumer-grade SSD (8GiB ram, 8 CPU) confirms
      the results:
      
        5.10.5            4m00s
        5.10.5+patch      1m08s
        5.11-rc2          5m14s
        5.11-rc2+patch    1m30s
      
      Reported-by: René Rebe <rene@exactcode.de>
      Fixes: 38d715f4 ("btrfs: use btrfs_start_delalloc_roots in shrink_delalloc")
      CC: stable@vger.kernel.org # 5.10
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Tested-by: David Sterba <dsterba@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add my test results ]
      Signed-off-by: David Sterba <dsterba@suse.com>
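The difference between "flush that many inodes" and "flush that many pages" described in the commit above can be shown with a toy model. This is a sketch under stated assumptions: `model_inode`, `model_flush_nr_inodes()` and `model_flush_nr_pages()` are hypothetical, and the page budget only loosely mirrors what a writeback_control with a page limit would do in the real code.

```c
#include <assert.h>
#include <stddef.h>

struct model_inode {
    long dirty_pages;
};

/* Regressed behaviour as described: nr counts inodes, each flushed in full. */
static long model_flush_nr_inodes(struct model_inode *inodes, size_t count,
                                  long nr)
{
    long flushed = 0;

    assert(inodes != NULL);
    for (size_t i = 0; i < count && nr > 0; i++) {
        if (inodes[i].dirty_pages == 0)
            continue;
        flushed += inodes[i].dirty_pages;
        inodes[i].dirty_pages = 0;
        nr--;
    }
    return flushed;
}

/* Fixed behaviour as described: nr_to_write is a budget counted in pages. */
static long model_flush_nr_pages(struct model_inode *inodes, size_t count,
                                 long nr_to_write)
{
    long flushed = 0;

    assert(inodes != NULL);
    for (size_t i = 0; i < count && nr_to_write > 0; i++) {
        long n = inodes[i].dirty_pages < nr_to_write
                     ? inodes[i].dirty_pages : nr_to_write;
        inodes[i].dirty_pages -= n;
        nr_to_write -= n;
        flushed += n;
    }
    return flushed;
}
```

With three inodes holding 1000 dirty pages each, asking the first function for 2 flushes 2000 pages, while asking the second for 2 flushes exactly 2 pages. That gap is the over-flushing the message blames for the slowdown.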
    • block: pre-initialize struct block_device in bdev_alloc_inode · 2d2f6f1b
      Committed by Christoph Hellwig
      bdev_evict_inode and bdev_free_inode are also called for the root inode
      of bdevfs, for which bdev_alloc is never called.  Move the zeroing of
      struct block_device and the initialization of the bd_bdi field into
      bdev_alloc_inode to make sure they are initialized for the root inode
      as well.
      
      Fixes: e6cb5382 ("block: initialize struct block_device in bdev_alloc")
      Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • fs: Fix freeze_bdev()/thaw_bdev() accounting of bd_fsfreeze_sb · 04a6a536
      Committed by Satya Tangirala
      freeze/thaw_bdev() currently use bdev->bd_fsfreeze_count to infer
      whether or not bdev->bd_fsfreeze_sb is valid (it's valid iff
      bd_fsfreeze_count is non-zero). thaw_bdev() doesn't nullify
      bd_fsfreeze_sb.
      
      But this means a freeze_bdev() call followed by a thaw_bdev() call can
      leave bd_fsfreeze_sb with a non-null value, while bd_fsfreeze_count is
      zero. If freeze_bdev() is called again, and this time
      get_active_super() returns NULL (e.g. because the FS is unmounted),
      we'll end up with bd_fsfreeze_count > 0, but bd_fsfreeze_sb is
      *untouched* - it stays the same (now garbage) value. A subsequent
      thaw_bdev() will decide that the bd_fsfreeze_sb value is legitimate
      (since bd_fsfreeze_count > 0), and attempt to use it.
      
      Fix this by always setting bd_fsfreeze_sb to NULL when
      bd_fsfreeze_count is successfully decremented to 0 in thaw_bdev().
      Alternatively, we could set bd_fsfreeze_sb to whatever
      get_active_super() returns in freeze_bdev() whenever bd_fsfreeze_count
      is successfully incremented to 1 from 0 (which can be achieved cleanly
      by moving the line currently setting bd_fsfreeze_sb to immediately
      after the "sync:" label, but it might be a little too subtle/easily
      overlooked in future).
      
      This fixes the currently panicking xfstests generic/085.
      
      Fixes: 040f04bd ("fs: simplify freeze_bdev/thaw_bdev")
      Signed-off-by: Satya Tangirala <satyat@google.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
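The stale-pointer sequence described in the commit above (freeze, thaw, freeze with no active superblock, thaw) can be replayed with a small userspace model. This is a sketch under stated assumptions: `model_bdev`, `model_freeze()` and `model_thaw()` are hypothetical, the `clear_on_zero` switch exists only to compare the old and fixed behaviour side by side, and error handling is omitted.

```c
#include <assert.h>
#include <stddef.h>

struct model_bdev {
    int fsfreeze_count;
    const void *fsfreeze_sb;   /* meant to be valid only while count > 0 */
};

/* active_sb models what get_active_super() returned (may be NULL). */
static void model_freeze(struct model_bdev *bdev, const void *active_sb)
{
    assert(bdev != NULL);
    if (++bdev->fsfreeze_count > 1)
        return;
    if (active_sb)
        bdev->fsfreeze_sb = active_sb;   /* a NULL result leaves it untouched */
}

/* Returns the superblock pointer that the thaw would go on to use. */
static const void *model_thaw(struct model_bdev *bdev, int clear_on_zero)
{
    const void *sb;

    assert(bdev != NULL);
    if (bdev->fsfreeze_count == 0)
        return NULL;
    if (--bdev->fsfreeze_count > 0)
        return NULL;

    sb = bdev->fsfreeze_sb;
    if (clear_on_zero)
        bdev->fsfreeze_sb = NULL;        /* the fix: reset when count hits 0 */
    return sb;
}
```

Without the reset, the second thaw in the sequence hands back the pointer left over from the first freeze. With the reset, it returns NULL and there is nothing stale to dereference.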