1. 27 5月, 2020 1 次提交
    • D
      PM: hibernate: Restrict writes to the resume device · ad1e4f74
      Domenico Andreoli 提交于
      Hibernation via snapshot device requires write permission to the swap
      block device, the one that more often (but not necessarily) is used to
      store the hibernation image.
      
      With this patch, such permissions are granted iff:
      
       1) snapshot device config option is enabled
       2) swap partition is used as resume device
      
      In other circumstances the swap device is not writable from userspace.
      
      In order to achieve this, every write attempt to a swap device is
      checked against the device configured as part of the uswsusp API [0]
      using a pointer to the inode struct in memory. If the swap device being
      written was not configured for resuming, the write request is denied.
      
      NOTE: this implementation works only for swap block devices, where the
      inode configured by swapon (which sets S_SWAPFILE) is the same used
      by SNAPSHOT_SET_SWAP_AREA.
      
      In case of swap file, SNAPSHOT_SET_SWAP_AREA indeed receives the inode
      of the block device containing the filesystem where the swap file is
      located (+ offset in it) which is never passed to swapon and then has
      not set S_SWAPFILE.
      
      As result, the swap file itself (as a file) has never an option to be
      written from userspace. Instead it remains writable if accessed directly
      from the containing block device, which is always writeable from root.
      
      [0] Documentation/power/userland-swsusp.rst
      
      v2:
       - rename is_hibernate_snapshot_dev() to is_hibernate_resume_dev()
       - fix description so to correctly refer to the resume device
      Signed-off-by: NDomenico Andreoli <domenico.andreoli@linux.com>
      Acked-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      ad1e4f74
  2. 21 4月, 2020 1 次提交
  3. 20 4月, 2020 1 次提交
    • D
      bdev: Reduce time holding bd_mutex in sync in blkdev_close() · b849dd84
      Douglas Anderson 提交于
      While trying to "dd" to the block device for a USB stick, I
      encountered a hung task warning (blocked for > 120 seconds).  I
      managed to come up with an easy way to reproduce this on my system
      (where /dev/sdb is the block device for my USB stick) with:
      
        while true; do dd if=/dev/zero of=/dev/sdb bs=4M; done
      
      With my reproduction here are the relevant bits from the hung task
      detector:
      
       INFO: task udevd:294 blocked for more than 122 seconds.
       ...
       udevd           D    0   294      1 0x00400008
       Call trace:
        ...
        mutex_lock_nested+0x40/0x50
        __blkdev_get+0x7c/0x3d4
        blkdev_get+0x118/0x138
        blkdev_open+0x94/0xa8
        do_dentry_open+0x268/0x3a0
        vfs_open+0x34/0x40
        path_openat+0x39c/0xdf4
        do_filp_open+0x90/0x10c
        do_sys_open+0x150/0x3c8
        ...
      
       ...
       Showing all locks held in the system:
       ...
       1 lock held by dd/2798:
        #0: ffffff814ac1a3b8 (&bdev->bd_mutex){+.+.}, at: __blkdev_put+0x50/0x204
       ...
       dd              D    0  2798   2764 0x00400208
       Call trace:
        ...
        schedule+0x8c/0xbc
        io_schedule+0x1c/0x40
        wait_on_page_bit_common+0x238/0x338
        __lock_page+0x5c/0x68
        write_cache_pages+0x194/0x500
        generic_writepages+0x64/0xa4
        blkdev_writepages+0x24/0x30
        do_writepages+0x48/0xa8
        __filemap_fdatawrite_range+0xac/0xd8
        filemap_write_and_wait+0x30/0x84
        __blkdev_put+0x88/0x204
        blkdev_put+0xc4/0xe4
        blkdev_close+0x28/0x38
        __fput+0xe0/0x238
        ____fput+0x1c/0x28
        task_work_run+0xb0/0xe4
        do_notify_resume+0xfc0/0x14bc
        work_pending+0x8/0x14
      
      The problem appears related to the fact that my USB disk is terribly
      slow and that I have a lot of RAM in my system to cache things.
      Specifically my writes seem to be happening at ~15 MB/s and I've got
      ~4 GB of RAM in my system that can be used for buffering.  To write 4
      GB of buffer to disk thus takes ~4000 MB / ~15 MB/s = ~267 seconds.
      
      The 267 second number is a problem because in __blkdev_put() we call
      sync_blockdev() while holding the bd_mutex.  Any other callers who
      want the bd_mutex will be blocked for the whole time.
      
      The problem is made worse because I believe blkdev_put() specifically
      tells other tasks (namely udev) to go try to access the device at right
      around the same time we're going to hold the mutex for a long time.
      
      Putting some traces around this (after disabling the hung task detector),
      I could confirm:
       dd:    437.608600: __blkdev_put() right before sync_blockdev() for sdb
       udevd: 437.623901: blkdev_open() right before blkdev_get() for sdb
       dd:    661.468451: __blkdev_put() right after sync_blockdev() for sdb
       udevd: 663.820426: blkdev_open() right after blkdev_get() for sdb
      
      A simple fix for this is to realize that sync_blockdev() works fine if
      you're not holding the mutex.  Also, it's not the end of the world if
      you sync a little early (though it can have performance impacts).
      Thus we can make a guess that we're going to need to do the sync and
      then do it without holding the mutex.  We still do one last sync with
      the mutex but it should be much, much faster.
      
      With this, my hung task warnings for my test case are gone.
      Signed-off-by: NDouglas Anderson <dianders@chromium.org>
      Reviewed-by: NGuenter Roeck <groeck@chromium.org>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b849dd84
  4. 23 3月, 2020 1 次提交
  5. 18 3月, 2020 1 次提交
  6. 03 12月, 2019 1 次提交
  7. 14 11月, 2019 5 次提交
  8. 03 11月, 2019 2 次提交
  9. 20 8月, 2019 1 次提交
  10. 16 8月, 2019 1 次提交
    • J
      block: remove REQ_NOWAIT_INLINE · 7b6620d7
      Jens Axboe 提交于
      We had a few issues with this code, and there's still a problem around
      how we deal with error handling for chained/split bios. For now, just
      revert the code and we'll try again with a thoroug solution. This
      reverts commits:
      
      e15c2ffa ("block: fix O_DIRECT error handling for bio fragments")
      0eb6ddfb ("block: Fix __blkdev_direct_IO() for bio fragments")
      6a43074e ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
      893a1c97 ("blk-mq: allow REQ_NOWAIT to return an error inline")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7b6620d7
  11. 08 8月, 2019 2 次提交
    • J
      bdev: Fixup error handling in blkdev_get() · e91455ba
      Jan Kara 提交于
      Commit 89e524c0 ("loop: Fix mount(2) failure due to race with
      LOOP_SET_FD") converted blkdev_get() to use the new helpers for
      finishing claiming of a block device. However the conversion botched the
      error handling in blkdev_get() and thus the bdev has been marked as held
      even in case __blkdev_get() returned error. This led to occasional
      warnings with block/001 test from blktests like:
      
      kernel: WARNING: CPU: 5 PID: 907 at fs/block_dev.c:1899 __blkdev_put+0x396/0x3a0
      
      Correct the error handling.
      
      CC: stable@vger.kernel.org
      Fixes: 89e524c0 ("loop: Fix mount(2) failure due to race with LOOP_SET_FD")
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e91455ba
    • J
      block: fix O_DIRECT error handling for bio fragments · e15c2ffa
      Jens Axboe 提交于
      0eb6ddfb tried to fix this up, but introduced a use-after-free
      of dio. Additionally, we still had an issue with error handling,
      as reported by Darrick:
      
      "I noticed a regression in xfs/747 (an unreleased xfstest for the
      xfs_scrub media scanning feature) on 5.3-rc3.  I'll condense that down
      to a simpler reproducer:
      
      error-test: 0 209 linear 8:48 0
      error-test: 209 1 error
      error-test: 210 6446894 linear 8:48 210
      
      Basically we have a ~3G /dev/sdd and we set up device mapper to fail IO
      for sector 209 and to pass the io to the scsi device everywhere else.
      
      On 5.3-rc3, performing a directio pread of this range with a < 1M buffer
      (in other words, a request for fewer than MAX_BIO_PAGES bytes) yields
      EIO like you'd expect:
      
      pread64(3, 0x7f880e1c7000, 1048576, 0)  = -1 EIO (Input/output error)
      pread: Input/output error
      +++ exited with 0 +++
      
      But doing it with a larger buffer succeeds(!):
      
      pread64(3, "XFSB\0\0\20\0\0\0\0\0\0\fL\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1146880, 0) = 1146880
      read 1146880/1146880 bytes at offset 0
      1 MiB, 1 ops; 0.0009 sec (1.124 GiB/sec and 1052.6316 ops/sec)
      +++ exited with 0 +++
      
      (Note that the part of the buffer corresponding to the dm-error area is
      uninitialized)
      
      On 5.3-rc2, both commands would fail with EIO like you'd expect.  The
      only change between rc2 and rc3 is commit 0eb6ddfb ("block: Fix
      __blkdev_direct_IO() for bio fragments").
      
      AFAICT we end up in __blkdev_direct_IO with a 1120K buffer, which gets
      split into two bios: one for the first BIO_MAX_PAGES worth of data (1MB)
      and a second one for the 96k after that."
      
      Fix this by noting that it's always safe to dereference dio if we get
      BLK_QC_T_EAGAIN returned, as end_io hasn't been run for that case. So
      we can safely increment the dio size before calling submit_bio(), and
      then decrement it on failure (not that it really matters, as the bio
      and dio are going away).
      
      For error handling, return to the original method of just using 'ret'
      for tracking the error, and the size tracking in dio->size.
      
      Fixes: 0eb6ddfb ("block: Fix __blkdev_direct_IO() for bio fragments")
      Fixes: 6a43074e ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
      Reported-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e15c2ffa
  12. 02 8月, 2019 1 次提交
    • D
      block: Fix __blkdev_direct_IO() for bio fragments · 0eb6ddfb
      Damien Le Moal 提交于
      The recent fix to properly handle IOCB_NOWAIT for async O_DIRECT IO
      (patch 6a43074e) introduced two problems with BIO fragment handling
      for direct IOs:
      1) The dio size processed is calculated by incrementing the ret variable
      by the size of the bio fragment issued for the dio. However, this size
      is obtained directly from bio->bi_iter.bi_size AFTER the bio submission
      which may result in referencing the bi_size value after the bio
      completed, resulting in an incorrect value use.
      2) The ret variable is not incremented by the size of the last bio
      fragment issued for the bio, leading to an invalid IO size being
      returned to the user.
      
      Fix both problem by using dio->size (which is incremented before the bio
      submission) to update the value of ret after bio submissions, including
      for the last bio fragment issued.
      
      Fixes: 6a43074e ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
      Reported-by: NMasato Suzuki <masato.suzuki@wdc.com>
      Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0eb6ddfb
  13. 31 7月, 2019 1 次提交
    • J
      loop: Fix mount(2) failure due to race with LOOP_SET_FD · 89e524c0
      Jan Kara 提交于
      Commit 33ec3e53 ("loop: Don't change loop device under exclusive
      opener") made LOOP_SET_FD ioctl acquire exclusive block device reference
      while it updates loop device binding. However this can make perfectly
      valid mount(2) fail with EBUSY due to racing LOOP_SET_FD holding
      temporarily the exclusive bdev reference in cases like this:
      
      for i in {a..z}{a..z}; do
              dd if=/dev/zero of=$i.image bs=1k count=0 seek=1024
              mkfs.ext2 $i.image
              mkdir mnt$i
      done
      
      echo "Run"
      for i in {a..z}{a..z}; do
              mount -o loop -t ext2 $i.image mnt$i &
      done
      
      Fix the problem by not getting full exclusive bdev reference in
      LOOP_SET_FD but instead just mark the bdev as being claimed while we
      update the binding information. This just blocks new exclusive openers
      instead of failing them with EBUSY thus fixing the problem.
      
      Fixes: 33ec3e53 ("loop: Don't change loop device under exclusive opener")
      Cc: stable@vger.kernel.org
      Tested-by: NKai-Heng Feng <kai.heng.feng@canonical.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      89e524c0
  14. 22 7月, 2019 1 次提交
    • J
      block: properly handle IOCB_NOWAIT for async O_DIRECT IO · 6a43074e
      Jens Axboe 提交于
      A caller is supposed to pass in REQ_NOWAIT if we can't block for any
      given operation, but O_DIRECT for block devices just ignore this. Hence
      we'll block for various resource shortages on the block layer side,
      like having to wait for requests.
      
      Use the new REQ_NOWAIT_INLINE to ask for this error to be returned
      inline, so we can handle it appropriately and return -EAGAIN to the
      caller.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6a43074e
  15. 29 6月, 2019 2 次提交
  16. 27 5月, 2019 1 次提交
  17. 26 5月, 2019 2 次提交
    • D
      vfs: Convert bdev to use the new mount API · 9030d16e
      David Howells 提交于
      Convert the bdev filesystem to the new internal mount API as the old
      one will be obsoleted and removed.  This allows greater flexibility in
      communication of mount parameters between userspace, the VFS and the
      filesystem.
      
      See Documentation/filesystems/mount_api.txt for more information.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      cc: Jens Axboe <axboe@kernel.dk>
      cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      9030d16e
    • A
      mount_pseudo(): drop 'name' argument, switch to d_make_root() · 1f58bb18
      Al Viro 提交于
      Once upon a time we used to set ->d_name of e.g. pipefs root
      so that d_path() on pipes would work.  These days it's
      completely pointless - dentries of pipes are not even connected
      to pipefs root.  However, mount_pseudo() had set the root
      dentry name (passed as the second argument) and callers
      kept inventing names to pass to it.  Including those that
      didn't *have* any non-root dentries to start with...
      
      All of that had been pointless for about 8 years now; it's
      time to get rid of that cargo-culting...
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      1f58bb18
  18. 21 5月, 2019 1 次提交
  19. 15 5月, 2019 1 次提交
  20. 02 5月, 2019 1 次提交
  21. 01 5月, 2019 1 次提交
  22. 30 4月, 2019 1 次提交
  23. 12 4月, 2019 1 次提交
    • J
      block: fix the return errno for direct IO · a89afe58
      Jason Yan 提交于
      If the last bio returned is not dio->bio, the status of the bio will
      not assigned to dio->bio if it is error. This will cause the whole IO
      status wrong.
      
          ksoftirqd/21-117   [021] ..s.  4017.966090:   8,0    C   N 4883648 [0]
                <idle>-0     [018] ..s.  4017.970888:   8,0    C  WS 4924800 + 1024 [0]
                <idle>-0     [018] ..s.  4017.970909:   8,0    D  WS 4935424 + 1024 [<idle>]
                <idle>-0     [018] ..s.  4017.970924:   8,0    D  WS 4936448 + 321 [<idle>]
          ksoftirqd/21-117   [021] ..s.  4017.995033:   8,0    C   R 4883648 + 336 [65475]
          ksoftirqd/21-117   [021] d.s.  4018.001988: myprobe1: (blkdev_bio_end_io+0x0/0x168) bi_status=7
          ksoftirqd/21-117   [021] d.s.  4018.001992: myprobe: (aio_complete_rw+0x0/0x148) x0=0xffff802f2595ad80 res=0x12a000 res2=0x0
      
      We always have to assign bio->bi_status to dio->bio.bi_status because we
      will only check dio->bio.bi_status when we return the whole IO to
      the upper layer.
      
      Fixes: 542ff7bf ("block: new direct I/O implementation")
      Cc: stable@vger.kernel.org
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJason Yan <yanaijie@huawei.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a89afe58
  24. 10 4月, 2019 1 次提交
  25. 19 3月, 2019 1 次提交
    • J
      block: add BIO_NO_PAGE_REF flag · 399254aa
      Jens Axboe 提交于
      If bio_iov_iter_get_pages() is called on an iov_iter that is flagged
      with NO_REF, then we don't need to add a page reference for the pages
      that we add.
      
      Add BIO_NO_PAGE_REF to track this in the bio, so IO completion knows
      not to drop a reference to these pages.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      399254aa
  26. 24 2月, 2019 2 次提交
  27. 15 2月, 2019 1 次提交
  28. 15 1月, 2019 1 次提交
    • J
      blockdev: Fix livelocks on loop device · 04906b2f
      Jan Kara 提交于
      bd_set_size() updates also block device's block size. This is somewhat
      unexpected from its name and at this point, only blkdev_open() uses this
      functionality. Furthermore, this can result in changing block size under
      a filesystem mounted on a loop device which leads to livelocks inside
      __getblk_gfp() like:
      
      Sending NMI from CPU 0 to CPUs 1:
      NMI backtrace for cpu 1
      CPU: 1 PID: 10863 Comm: syz-executor0 Not tainted 4.18.0-rc5+ #151
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
      01/01/2011
      RIP: 0010:__sanitizer_cov_trace_pc+0x3f/0x50 kernel/kcov.c:106
      ...
      Call Trace:
       init_page_buffers+0x3e2/0x530 fs/buffer.c:904
       grow_dev_page fs/buffer.c:947 [inline]
       grow_buffers fs/buffer.c:1009 [inline]
       __getblk_slow fs/buffer.c:1036 [inline]
       __getblk_gfp+0x906/0xb10 fs/buffer.c:1313
       __bread_gfp+0x2d/0x310 fs/buffer.c:1347
       sb_bread include/linux/buffer_head.h:307 [inline]
       fat12_ent_bread+0x14e/0x3d0 fs/fat/fatent.c:75
       fat_ent_read_block fs/fat/fatent.c:441 [inline]
       fat_alloc_clusters+0x8ce/0x16e0 fs/fat/fatent.c:489
       fat_add_cluster+0x7a/0x150 fs/fat/inode.c:101
       __fat_get_block fs/fat/inode.c:148 [inline]
      ...
      
      Trivial reproducer for the problem looks like:
      
      truncate -s 1G /tmp/image
      losetup /dev/loop0 /tmp/image
      mkfs.ext4 -b 1024 /dev/loop0
      mount -t ext4 /dev/loop0 /mnt
      losetup -c /dev/loop0
      l /mnt
      
      Fix the problem by moving initialization of a block device block size
      into a separate function and call it when needed.
      
      Thanks to Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> for help with
      debugging the problem.
      
      Reported-by: syzbot+9933e4476f365f5d5a1b@syzkaller.appspotmail.com
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      04906b2f
  29. 03 1月, 2019 1 次提交
    • L
      block: don't use un-ordered __set_current_state(TASK_UNINTERRUPTIBLE) · 1ac5cd49
      Linus Torvalds 提交于
      This mostly reverts commit 849a3700 ("block: avoid ordered task
      state change for polled IO").  It was wrongly claiming that the ordering
      wasn't necessary.  The memory barrier _is_ necessary.
      
      If something is truly polling and not going to sleep, it's the whole
      state setting that is unnecessary, not the memory barrier.  Whenever you
      set your state to a sleeping state, you absolutely need the memory
      barrier.
      
      Note that sometimes the memory barrier can be elsewhere.  For example,
      the ordering might be provided by an external lock, or by setting the
      process state to sleeping before adding yourself to the wait queue list
      that is used for waking up (where the wait queue lock itself will
      guarantee that any wakeup will correctly see the sleeping state).
      
      But none of those cases were true here.
      
      NOTE! Some of the polling paths may indeed be able to drop the state
      setting entirely, at which point the memory barrier also goes away.
      
      (Also note that this doesn't revert the TASK_RUNNING cases: there is no
      race between a wakeup and setting the process state to TASK_RUNNING,
      since the end result doesn't depend on ordering).
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1ac5cd49
  30. 29 12月, 2018 1 次提交
  31. 30 11月, 2018 1 次提交
    • C
      block: avoid extra bio reference for async O_DIRECT · 531724ab
      Christoph Hellwig 提交于
      The bio referencing has a trick that doesn't do any actual atomic
      inc/dec on the reference count until we have to elevator to > 1. For the
      async IO O_DIRECT case, we can't use the simple DIO variants, so we use
      __blkdev_direct_IO(). It always grabs an extra reference to the bio
      after allocation, which means we then enter the slower path of actually
      having to do atomic_inc/dec on the count.
      
      We don't need to do that for the async case, unless we end up going
      multi-bio, in which case we're already doing huge amounts of IO. For the
      smaller IO case (< BIO_MAX_PAGES), we can do without the extra ref.
      
      Based on an earlier patch (and commit log) from Jens Axboe.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      531724ab