1. 09 7月, 2020 1 次提交
  2. 01 7月, 2020 4 次提交
  3. 24 6月, 2020 1 次提交
  4. 18 6月, 2020 1 次提交
  5. 17 6月, 2020 1 次提交
    • J
      block: Fix use-after-free in blkdev_get() · 2d3a8e2d
      Jason Yan 提交于
      In blkdev_get() we call __blkdev_get() to do some internal jobs and if
      there is some errors in __blkdev_get(), the bdput() is called which
      means we have released the refcount of the bdev (actually the refcount of
      the bdev inode). This means we cannot access bdev after that point. But
      acctually bdev is still accessed in blkdev_get() after calling
      __blkdev_get(). This results in use-after-free if the refcount is the
      last one we released in __blkdev_get(). Let's take a look at the
      following scenerio:
      
        CPU0            CPU1                    CPU2
      blkdev_open     blkdev_open           Remove disk
                        bd_acquire
      		  blkdev_get
      		    __blkdev_get      del_gendisk
      					bdev_unhash_inode
        bd_acquire          bdev_get_gendisk
          bd_forget           failed because of unhashed
      	  bdput
      	              bdput (the last one)
      		        bdev_evict_inode
      
      	  	    access bdev => use after free
      
      [  459.350216] BUG: KASAN: use-after-free in __lock_acquire+0x24c1/0x31b0
      [  459.351190] Read of size 8 at addr ffff88806c815a80 by task syz-executor.0/20132
      [  459.352347]
      [  459.352594] CPU: 0 PID: 20132 Comm: syz-executor.0 Not tainted 4.19.90 #2
      [  459.353628] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      [  459.354947] Call Trace:
      [  459.355337]  dump_stack+0x111/0x19e
      [  459.355879]  ? __lock_acquire+0x24c1/0x31b0
      [  459.356523]  print_address_description+0x60/0x223
      [  459.357248]  ? __lock_acquire+0x24c1/0x31b0
      [  459.357887]  kasan_report.cold+0xae/0x2d8
      [  459.358503]  __lock_acquire+0x24c1/0x31b0
      [  459.359120]  ? _raw_spin_unlock_irq+0x24/0x40
      [  459.359784]  ? lockdep_hardirqs_on+0x37b/0x580
      [  459.360465]  ? _raw_spin_unlock_irq+0x24/0x40
      [  459.361123]  ? finish_task_switch+0x125/0x600
      [  459.361812]  ? finish_task_switch+0xee/0x600
      [  459.362471]  ? mark_held_locks+0xf0/0xf0
      [  459.363108]  ? __schedule+0x96f/0x21d0
      [  459.363716]  lock_acquire+0x111/0x320
      [  459.364285]  ? blkdev_get+0xce/0xbe0
      [  459.364846]  ? blkdev_get+0xce/0xbe0
      [  459.365390]  __mutex_lock+0xf9/0x12a0
      [  459.365948]  ? blkdev_get+0xce/0xbe0
      [  459.366493]  ? bdev_evict_inode+0x1f0/0x1f0
      [  459.367130]  ? blkdev_get+0xce/0xbe0
      [  459.367678]  ? destroy_inode+0xbc/0x110
      [  459.368261]  ? mutex_trylock+0x1a0/0x1a0
      [  459.368867]  ? __blkdev_get+0x3e6/0x1280
      [  459.369463]  ? bdev_disk_changed+0x1d0/0x1d0
      [  459.370114]  ? blkdev_get+0xce/0xbe0
      [  459.370656]  blkdev_get+0xce/0xbe0
      [  459.371178]  ? find_held_lock+0x2c/0x110
      [  459.371774]  ? __blkdev_get+0x1280/0x1280
      [  459.372383]  ? lock_downgrade+0x680/0x680
      [  459.373002]  ? lock_acquire+0x111/0x320
      [  459.373587]  ? bd_acquire+0x21/0x2c0
      [  459.374134]  ? do_raw_spin_unlock+0x4f/0x250
      [  459.374780]  blkdev_open+0x202/0x290
      [  459.375325]  do_dentry_open+0x49e/0x1050
      [  459.375924]  ? blkdev_get_by_dev+0x70/0x70
      [  459.376543]  ? __x64_sys_fchdir+0x1f0/0x1f0
      [  459.377192]  ? inode_permission+0xbe/0x3a0
      [  459.377818]  path_openat+0x148c/0x3f50
      [  459.378392]  ? kmem_cache_alloc+0xd5/0x280
      [  459.379016]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  459.379802]  ? path_lookupat.isra.0+0x900/0x900
      [  459.380489]  ? __lock_is_held+0xad/0x140
      [  459.381093]  do_filp_open+0x1a1/0x280
      [  459.381654]  ? may_open_dev+0xf0/0xf0
      [  459.382214]  ? find_held_lock+0x2c/0x110
      [  459.382816]  ? lock_downgrade+0x680/0x680
      [  459.383425]  ? __lock_is_held+0xad/0x140
      [  459.384024]  ? do_raw_spin_unlock+0x4f/0x250
      [  459.384668]  ? _raw_spin_unlock+0x1f/0x30
      [  459.385280]  ? __alloc_fd+0x448/0x560
      [  459.385841]  do_sys_open+0x3c3/0x500
      [  459.386386]  ? filp_open+0x70/0x70
      [  459.386911]  ? trace_hardirqs_on_thunk+0x1a/0x1c
      [  459.387610]  ? trace_hardirqs_off_caller+0x55/0x1c0
      [  459.388342]  ? do_syscall_64+0x1a/0x520
      [  459.388930]  do_syscall_64+0xc3/0x520
      [  459.389490]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  459.390248] RIP: 0033:0x416211
      [  459.390720] Code: 75 14 b8 02 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83
      04 19 00 00 c3 48 83 ec 08 e8 0a fa ff ff 48 89 04 24 b8 02 00 00 00 0f
         05 <48> 8b 3c 24 48 89 c2 e8 53 fa ff ff 48 89 d0 48 83 c4 08 48 3d
            01
      [  459.393483] RSP: 002b:00007fe45dfe9a60 EFLAGS: 00000293 ORIG_RAX: 0000000000000002
      [  459.394610] RAX: ffffffffffffffda RBX: 00007fe45dfea6d4 RCX: 0000000000416211
      [  459.395678] RDX: 00007fe45dfe9b0a RSI: 0000000000000002 RDI: 00007fe45dfe9b00
      [  459.396758] RBP: 000000000076bf20 R08: 0000000000000000 R09: 000000000000000a
      [  459.397930] R10: 0000000000000075 R11: 0000000000000293 R12: 00000000ffffffff
      [  459.399022] R13: 0000000000000bd9 R14: 00000000004cdb80 R15: 000000000076bf2c
      [  459.400168]
      [  459.400430] Allocated by task 20132:
      [  459.401038]  kasan_kmalloc+0xbf/0xe0
      [  459.401652]  kmem_cache_alloc+0xd5/0x280
      [  459.402330]  bdev_alloc_inode+0x18/0x40
      [  459.402970]  alloc_inode+0x5f/0x180
      [  459.403510]  iget5_locked+0x57/0xd0
      [  459.404095]  bdget+0x94/0x4e0
      [  459.404607]  bd_acquire+0xfa/0x2c0
      [  459.405113]  blkdev_open+0x110/0x290
      [  459.405702]  do_dentry_open+0x49e/0x1050
      [  459.406340]  path_openat+0x148c/0x3f50
      [  459.406926]  do_filp_open+0x1a1/0x280
      [  459.407471]  do_sys_open+0x3c3/0x500
      [  459.408010]  do_syscall_64+0xc3/0x520
      [  459.408572]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  459.409415]
      [  459.409679] Freed by task 1262:
      [  459.410212]  __kasan_slab_free+0x129/0x170
      [  459.410919]  kmem_cache_free+0xb2/0x2a0
      [  459.411564]  rcu_process_callbacks+0xbb2/0x2320
      [  459.412318]  __do_softirq+0x225/0x8ac
      
      Fix this by delaying bdput() to the end of blkdev_get() which means we
      have finished accessing bdev.
      
      Fixes: 77ea887e ("implement in-kernel gendisk events handling")
      Reported-by: NHulk Robot <hulkci@huawei.com>
      Signed-off-by: NJason Yan <yanaijie@huawei.com>
      Tested-by: NSedat Dilek <sedat.dilek@gmail.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDan Carpenter <dan.carpenter@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2d3a8e2d
  6. 03 6月, 2020 1 次提交
  7. 27 5月, 2020 1 次提交
    • D
      PM: hibernate: Restrict writes to the resume device · ad1e4f74
      Domenico Andreoli 提交于
      Hibernation via snapshot device requires write permission to the swap
      block device, the one that more often (but not necessarily) is used to
      store the hibernation image.
      
      With this patch, such permissions are granted iff:
      
       1) snapshot device config option is enabled
       2) swap partition is used as resume device
      
      In other circumstances the swap device is not writable from userspace.
      
      In order to achieve this, every write attempt to a swap device is
      checked against the device configured as part of the uswsusp API [0]
      using a pointer to the inode struct in memory. If the swap device being
      written was not configured for resuming, the write request is denied.
      
      NOTE: this implementation works only for swap block devices, where the
      inode configured by swapon (which sets S_SWAPFILE) is the same used
      by SNAPSHOT_SET_SWAP_AREA.
      
      In case of swap file, SNAPSHOT_SET_SWAP_AREA indeed receives the inode
      of the block device containing the filesystem where the swap file is
      located (+ offset in it) which is never passed to swapon and then has
      not set S_SWAPFILE.
      
      As result, the swap file itself (as a file) has never an option to be
      written from userspace. Instead it remains writable if accessed directly
      from the containing block device, which is always writeable from root.
      
      [0] Documentation/power/userland-swsusp.rst
      
      v2:
       - rename is_hibernate_snapshot_dev() to is_hibernate_resume_dev()
       - fix description so to correctly refer to the resume device
      Signed-off-by: NDomenico Andreoli <domenico.andreoli@linux.com>
      Acked-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      ad1e4f74
  8. 22 5月, 2020 1 次提交
  9. 21 5月, 2020 1 次提交
  10. 13 5月, 2020 1 次提交
  11. 24 4月, 2020 1 次提交
  12. 21 4月, 2020 3 次提交
  13. 20 4月, 2020 1 次提交
    • D
      bdev: Reduce time holding bd_mutex in sync in blkdev_close() · b849dd84
      Douglas Anderson 提交于
      While trying to "dd" to the block device for a USB stick, I
      encountered a hung task warning (blocked for > 120 seconds).  I
      managed to come up with an easy way to reproduce this on my system
      (where /dev/sdb is the block device for my USB stick) with:
      
        while true; do dd if=/dev/zero of=/dev/sdb bs=4M; done
      
      With my reproduction here are the relevant bits from the hung task
      detector:
      
       INFO: task udevd:294 blocked for more than 122 seconds.
       ...
       udevd           D    0   294      1 0x00400008
       Call trace:
        ...
        mutex_lock_nested+0x40/0x50
        __blkdev_get+0x7c/0x3d4
        blkdev_get+0x118/0x138
        blkdev_open+0x94/0xa8
        do_dentry_open+0x268/0x3a0
        vfs_open+0x34/0x40
        path_openat+0x39c/0xdf4
        do_filp_open+0x90/0x10c
        do_sys_open+0x150/0x3c8
        ...
      
       ...
       Showing all locks held in the system:
       ...
       1 lock held by dd/2798:
        #0: ffffff814ac1a3b8 (&bdev->bd_mutex){+.+.}, at: __blkdev_put+0x50/0x204
       ...
       dd              D    0  2798   2764 0x00400208
       Call trace:
        ...
        schedule+0x8c/0xbc
        io_schedule+0x1c/0x40
        wait_on_page_bit_common+0x238/0x338
        __lock_page+0x5c/0x68
        write_cache_pages+0x194/0x500
        generic_writepages+0x64/0xa4
        blkdev_writepages+0x24/0x30
        do_writepages+0x48/0xa8
        __filemap_fdatawrite_range+0xac/0xd8
        filemap_write_and_wait+0x30/0x84
        __blkdev_put+0x88/0x204
        blkdev_put+0xc4/0xe4
        blkdev_close+0x28/0x38
        __fput+0xe0/0x238
        ____fput+0x1c/0x28
        task_work_run+0xb0/0xe4
        do_notify_resume+0xfc0/0x14bc
        work_pending+0x8/0x14
      
      The problem appears related to the fact that my USB disk is terribly
      slow and that I have a lot of RAM in my system to cache things.
      Specifically my writes seem to be happening at ~15 MB/s and I've got
      ~4 GB of RAM in my system that can be used for buffering.  To write 4
      GB of buffer to disk thus takes ~4000 MB / ~15 MB/s = ~267 seconds.
      
      The 267 second number is a problem because in __blkdev_put() we call
      sync_blockdev() while holding the bd_mutex.  Any other callers who
      want the bd_mutex will be blocked for the whole time.
      
      The problem is made worse because I believe blkdev_put() specifically
      tells other tasks (namely udev) to go try to access the device at right
      around the same time we're going to hold the mutex for a long time.
      
      Putting some traces around this (after disabling the hung task detector),
      I could confirm:
       dd:    437.608600: __blkdev_put() right before sync_blockdev() for sdb
       udevd: 437.623901: blkdev_open() right before blkdev_get() for sdb
       dd:    661.468451: __blkdev_put() right after sync_blockdev() for sdb
       udevd: 663.820426: blkdev_open() right after blkdev_get() for sdb
      
      A simple fix for this is to realize that sync_blockdev() works fine if
      you're not holding the mutex.  Also, it's not the end of the world if
      you sync a little early (though it can have performance impacts).
      Thus we can make a guess that we're going to need to do the sync and
      then do it without holding the mutex.  We still do one last sync with
      the mutex but it should be much, much faster.
      
      With this, my hung task warnings for my test case are gone.
      Signed-off-by: NDouglas Anderson <dianders@chromium.org>
      Reviewed-by: NGuenter Roeck <groeck@chromium.org>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b849dd84
  14. 23 3月, 2020 1 次提交
  15. 18 3月, 2020 1 次提交
  16. 03 12月, 2019 1 次提交
  17. 14 11月, 2019 5 次提交
  18. 03 11月, 2019 2 次提交
  19. 20 8月, 2019 1 次提交
  20. 16 8月, 2019 1 次提交
    • J
      block: remove REQ_NOWAIT_INLINE · 7b6620d7
      Jens Axboe 提交于
      We had a few issues with this code, and there's still a problem around
      how we deal with error handling for chained/split bios. For now, just
      revert the code and we'll try again with a thoroug solution. This
      reverts commits:
      
      e15c2ffa ("block: fix O_DIRECT error handling for bio fragments")
      0eb6ddfb ("block: Fix __blkdev_direct_IO() for bio fragments")
      6a43074e ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
      893a1c97 ("blk-mq: allow REQ_NOWAIT to return an error inline")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7b6620d7
  21. 08 8月, 2019 2 次提交
    • J
      bdev: Fixup error handling in blkdev_get() · e91455ba
      Jan Kara 提交于
      Commit 89e524c0 ("loop: Fix mount(2) failure due to race with
      LOOP_SET_FD") converted blkdev_get() to use the new helpers for
      finishing claiming of a block device. However the conversion botched the
      error handling in blkdev_get() and thus the bdev has been marked as held
      even in case __blkdev_get() returned error. This led to occasional
      warnings with block/001 test from blktests like:
      
      kernel: WARNING: CPU: 5 PID: 907 at fs/block_dev.c:1899 __blkdev_put+0x396/0x3a0
      
      Correct the error handling.
      
      CC: stable@vger.kernel.org
      Fixes: 89e524c0 ("loop: Fix mount(2) failure due to race with LOOP_SET_FD")
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e91455ba
    • J
      block: fix O_DIRECT error handling for bio fragments · e15c2ffa
      Jens Axboe 提交于
      0eb6ddfb tried to fix this up, but introduced a use-after-free
      of dio. Additionally, we still had an issue with error handling,
      as reported by Darrick:
      
      "I noticed a regression in xfs/747 (an unreleased xfstest for the
      xfs_scrub media scanning feature) on 5.3-rc3.  I'll condense that down
      to a simpler reproducer:
      
      error-test: 0 209 linear 8:48 0
      error-test: 209 1 error
      error-test: 210 6446894 linear 8:48 210
      
      Basically we have a ~3G /dev/sdd and we set up device mapper to fail IO
      for sector 209 and to pass the io to the scsi device everywhere else.
      
      On 5.3-rc3, performing a directio pread of this range with a < 1M buffer
      (in other words, a request for fewer than MAX_BIO_PAGES bytes) yields
      EIO like you'd expect:
      
      pread64(3, 0x7f880e1c7000, 1048576, 0)  = -1 EIO (Input/output error)
      pread: Input/output error
      +++ exited with 0 +++
      
      But doing it with a larger buffer succeeds(!):
      
      pread64(3, "XFSB\0\0\20\0\0\0\0\0\0\fL\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1146880, 0) = 1146880
      read 1146880/1146880 bytes at offset 0
      1 MiB, 1 ops; 0.0009 sec (1.124 GiB/sec and 1052.6316 ops/sec)
      +++ exited with 0 +++
      
      (Note that the part of the buffer corresponding to the dm-error area is
      uninitialized)
      
      On 5.3-rc2, both commands would fail with EIO like you'd expect.  The
      only change between rc2 and rc3 is commit 0eb6ddfb ("block: Fix
      __blkdev_direct_IO() for bio fragments").
      
      AFAICT we end up in __blkdev_direct_IO with a 1120K buffer, which gets
      split into two bios: one for the first BIO_MAX_PAGES worth of data (1MB)
      and a second one for the 96k after that."
      
      Fix this by noting that it's always safe to dereference dio if we get
      BLK_QC_T_EAGAIN returned, as end_io hasn't been run for that case. So
      we can safely increment the dio size before calling submit_bio(), and
      then decrement it on failure (not that it really matters, as the bio
      and dio are going away).
      
      For error handling, return to the original method of just using 'ret'
      for tracking the error, and the size tracking in dio->size.
      
      Fixes: 0eb6ddfb ("block: Fix __blkdev_direct_IO() for bio fragments")
      Fixes: 6a43074e ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
      Reported-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e15c2ffa
  22. 02 8月, 2019 1 次提交
    • D
      block: Fix __blkdev_direct_IO() for bio fragments · 0eb6ddfb
      Damien Le Moal 提交于
      The recent fix to properly handle IOCB_NOWAIT for async O_DIRECT IO
      (patch 6a43074e) introduced two problems with BIO fragment handling
      for direct IOs:
      1) The dio size processed is calculated by incrementing the ret variable
      by the size of the bio fragment issued for the dio. However, this size
      is obtained directly from bio->bi_iter.bi_size AFTER the bio submission
      which may result in referencing the bi_size value after the bio
      completed, resulting in an incorrect value use.
      2) The ret variable is not incremented by the size of the last bio
      fragment issued for the bio, leading to an invalid IO size being
      returned to the user.
      
      Fix both problem by using dio->size (which is incremented before the bio
      submission) to update the value of ret after bio submissions, including
      for the last bio fragment issued.
      
      Fixes: 6a43074e ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
      Reported-by: NMasato Suzuki <masato.suzuki@wdc.com>
      Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0eb6ddfb
  23. 31 7月, 2019 1 次提交
    • J
      loop: Fix mount(2) failure due to race with LOOP_SET_FD · 89e524c0
      Jan Kara 提交于
      Commit 33ec3e53 ("loop: Don't change loop device under exclusive
      opener") made LOOP_SET_FD ioctl acquire exclusive block device reference
      while it updates loop device binding. However this can make perfectly
      valid mount(2) fail with EBUSY due to racing LOOP_SET_FD holding
      temporarily the exclusive bdev reference in cases like this:
      
      for i in {a..z}{a..z}; do
              dd if=/dev/zero of=$i.image bs=1k count=0 seek=1024
              mkfs.ext2 $i.image
              mkdir mnt$i
      done
      
      echo "Run"
      for i in {a..z}{a..z}; do
              mount -o loop -t ext2 $i.image mnt$i &
      done
      
      Fix the problem by not getting full exclusive bdev reference in
      LOOP_SET_FD but instead just mark the bdev as being claimed while we
      update the binding information. This just blocks new exclusive openers
      instead of failing them with EBUSY thus fixing the problem.
      
      Fixes: 33ec3e53 ("loop: Don't change loop device under exclusive opener")
      Cc: stable@vger.kernel.org
      Tested-by: NKai-Heng Feng <kai.heng.feng@canonical.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      89e524c0
  24. 22 7月, 2019 1 次提交
    • J
      block: properly handle IOCB_NOWAIT for async O_DIRECT IO · 6a43074e
      Jens Axboe 提交于
      A caller is supposed to pass in REQ_NOWAIT if we can't block for any
      given operation, but O_DIRECT for block devices just ignore this. Hence
      we'll block for various resource shortages on the block layer side,
      like having to wait for requests.
      
      Use the new REQ_NOWAIT_INLINE to ask for this error to be returned
      inline, so we can handle it appropriately and return -EAGAIN to the
      caller.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6a43074e
  25. 29 6月, 2019 2 次提交
  26. 27 5月, 2019 1 次提交
  27. 26 5月, 2019 2 次提交
    • D
      vfs: Convert bdev to use the new mount API · 9030d16e
      David Howells 提交于
      Convert the bdev filesystem to the new internal mount API as the old
      one will be obsoleted and removed.  This allows greater flexibility in
      communication of mount parameters between userspace, the VFS and the
      filesystem.
      
      See Documentation/filesystems/mount_api.txt for more information.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      cc: Jens Axboe <axboe@kernel.dk>
      cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      9030d16e
    • A
      mount_pseudo(): drop 'name' argument, switch to d_make_root() · 1f58bb18
      Al Viro 提交于
      Once upon a time we used to set ->d_name of e.g. pipefs root
      so that d_path() on pipes would work.  These days it's
      completely pointless - dentries of pipes are not even connected
      to pipefs root.  However, mount_pseudo() had set the root
      dentry name (passed as the second argument) and callers
      kept inventing names to pass to it.  Including those that
      didn't *have* any non-root dentries to start with...
      
      All of that had been pointless for about 8 years now; it's
      time to get rid of that cargo-culting...
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      1f58bb18