提交 · c5638ab417660e2c5d681aed5ce76df992982dbb · openeuler / Kernel

16 7月, 2020 1 次提交

block: simplify the restart case in __blkdev_get · c5638ab4

由 Christoph Hellwig 提交于 7月 16, 2020

Insted of duplicating all the cleanup logic jump to the code that cleans
up anyway, and restart after that.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

c5638ab4

09 7月, 2020 1 次提交

block: remove flush_disk · 9a3ffbbc

由 Christoph Hellwig 提交于 7月 08, 2020

flush_disk has only two callers, so open code it there.  That also helps
clarifying the error message for the particular case, and allows to remove
setting bd_invalidated in check_disk_size_change, which will be cleared
again instantly.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

9a3ffbbc

01 7月, 2020 4 次提交

block: remove the all_bdevs list · 1008fe6d

由 Christoph Hellwig 提交于 6月 26, 2020

Instead just iterate over the inodes for the block device superblock.
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

1008fe6d

block: remove the bd_queue field from struct block_device · e556f6ba

由 Christoph Hellwig 提交于 6月 26, 2020

Just use bd_disk->queue instead.
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e556f6ba

block: remove the bd_block_size field from struct block_device · 6b7b181b

由 Christoph Hellwig 提交于 6月 26, 2020

We can trivially calculate the block size from the inodes i_blkbits
variable.  Use that instead of keeping two redundant copies of the
information in slightly different formats.
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

6b7b181b

block: simplify set_init_blocksize · 5ff9f192

由 Christoph Hellwig 提交于 6月 26, 2020

The loop to increase the initial block size doesn't really make any
sense, as the AND operation won't match for powers of two if it didn't
for the initial block size.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

5ff9f192

24 6月, 2020 1 次提交

block: mark bd_finish_claiming static · 764b23bd

由 Christoph Hellwig 提交于 6月 20, 2020

Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NBart Van Assche <bvanassche@acm.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

764b23bd

18 6月, 2020 1 次提交

block: make function 'kill_bdev' static · 3373a346

由 Zheng Bin 提交于 6月 18, 2020

kill_bdev does not have any external user, so make it static.
Signed-off-by: NZheng Bin <zhengbin13@huawei.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NBart Van Assche <bvanassche@acm.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

3373a346

17 6月, 2020 1 次提交

block: Fix use-after-free in blkdev_get() · 2d3a8e2d

由 Jason Yan 提交于 6月 16, 2020

In blkdev_get() we call __blkdev_get() to do some internal jobs and if
there is some errors in __blkdev_get(), the bdput() is called which
means we have released the refcount of the bdev (actually the refcount of
the bdev inode). This means we cannot access bdev after that point. But
acctually bdev is still accessed in blkdev_get() after calling
__blkdev_get(). This results in use-after-free if the refcount is the
last one we released in __blkdev_get(). Let's take a look at the
following scenerio:

  CPU0            CPU1                    CPU2
blkdev_open     blkdev_open           Remove disk
                  bd_acquire
		  blkdev_get
		    __blkdev_get      del_gendisk
					bdev_unhash_inode
  bd_acquire          bdev_get_gendisk
    bd_forget           failed because of unhashed
	  bdput
	              bdput (the last one)
		        bdev_evict_inode

	  	    access bdev => use after free

[  459.350216] BUG: KASAN: use-after-free in __lock_acquire+0x24c1/0x31b0
[  459.351190] Read of size 8 at addr ffff88806c815a80 by task syz-executor.0/20132
[  459.352347]
[  459.352594] CPU: 0 PID: 20132 Comm: syz-executor.0 Not tainted 4.19.90 #2
[  459.353628] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[  459.354947] Call Trace:
[  459.355337]  dump_stack+0x111/0x19e
[  459.355879]  ? __lock_acquire+0x24c1/0x31b0
[  459.356523]  print_address_description+0x60/0x223
[  459.357248]  ? __lock_acquire+0x24c1/0x31b0
[  459.357887]  kasan_report.cold+0xae/0x2d8
[  459.358503]  __lock_acquire+0x24c1/0x31b0
[  459.359120]  ? _raw_spin_unlock_irq+0x24/0x40
[  459.359784]  ? lockdep_hardirqs_on+0x37b/0x580
[  459.360465]  ? _raw_spin_unlock_irq+0x24/0x40
[  459.361123]  ? finish_task_switch+0x125/0x600
[  459.361812]  ? finish_task_switch+0xee/0x600
[  459.362471]  ? mark_held_locks+0xf0/0xf0
[  459.363108]  ? __schedule+0x96f/0x21d0
[  459.363716]  lock_acquire+0x111/0x320
[  459.364285]  ? blkdev_get+0xce/0xbe0
[  459.364846]  ? blkdev_get+0xce/0xbe0
[  459.365390]  __mutex_lock+0xf9/0x12a0
[  459.365948]  ? blkdev_get+0xce/0xbe0
[  459.366493]  ? bdev_evict_inode+0x1f0/0x1f0
[  459.367130]  ? blkdev_get+0xce/0xbe0
[  459.367678]  ? destroy_inode+0xbc/0x110
[  459.368261]  ? mutex_trylock+0x1a0/0x1a0
[  459.368867]  ? __blkdev_get+0x3e6/0x1280
[  459.369463]  ? bdev_disk_changed+0x1d0/0x1d0
[  459.370114]  ? blkdev_get+0xce/0xbe0
[  459.370656]  blkdev_get+0xce/0xbe0
[  459.371178]  ? find_held_lock+0x2c/0x110
[  459.371774]  ? __blkdev_get+0x1280/0x1280
[  459.372383]  ? lock_downgrade+0x680/0x680
[  459.373002]  ? lock_acquire+0x111/0x320
[  459.373587]  ? bd_acquire+0x21/0x2c0
[  459.374134]  ? do_raw_spin_unlock+0x4f/0x250
[  459.374780]  blkdev_open+0x202/0x290
[  459.375325]  do_dentry_open+0x49e/0x1050
[  459.375924]  ? blkdev_get_by_dev+0x70/0x70
[  459.376543]  ? __x64_sys_fchdir+0x1f0/0x1f0
[  459.377192]  ? inode_permission+0xbe/0x3a0
[  459.377818]  path_openat+0x148c/0x3f50
[  459.378392]  ? kmem_cache_alloc+0xd5/0x280
[  459.379016]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  459.379802]  ? path_lookupat.isra.0+0x900/0x900
[  459.380489]  ? __lock_is_held+0xad/0x140
[  459.381093]  do_filp_open+0x1a1/0x280
[  459.381654]  ? may_open_dev+0xf0/0xf0
[  459.382214]  ? find_held_lock+0x2c/0x110
[  459.382816]  ? lock_downgrade+0x680/0x680
[  459.383425]  ? __lock_is_held+0xad/0x140
[  459.384024]  ? do_raw_spin_unlock+0x4f/0x250
[  459.384668]  ? _raw_spin_unlock+0x1f/0x30
[  459.385280]  ? __alloc_fd+0x448/0x560
[  459.385841]  do_sys_open+0x3c3/0x500
[  459.386386]  ? filp_open+0x70/0x70
[  459.386911]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[  459.387610]  ? trace_hardirqs_off_caller+0x55/0x1c0
[  459.388342]  ? do_syscall_64+0x1a/0x520
[  459.388930]  do_syscall_64+0xc3/0x520
[  459.389490]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  459.390248] RIP: 0033:0x416211
[  459.390720] Code: 75 14 b8 02 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83
04 19 00 00 c3 48 83 ec 08 e8 0a fa ff ff 48 89 04 24 b8 02 00 00 00 0f
   05 <48> 8b 3c 24 48 89 c2 e8 53 fa ff ff 48 89 d0 48 83 c4 08 48 3d
      01
[  459.393483] RSP: 002b:00007fe45dfe9a60 EFLAGS: 00000293 ORIG_RAX: 0000000000000002
[  459.394610] RAX: ffffffffffffffda RBX: 00007fe45dfea6d4 RCX: 0000000000416211
[  459.395678] RDX: 00007fe45dfe9b0a RSI: 0000000000000002 RDI: 00007fe45dfe9b00
[  459.396758] RBP: 000000000076bf20 R08: 0000000000000000 R09: 000000000000000a
[  459.397930] R10: 0000000000000075 R11: 0000000000000293 R12: 00000000ffffffff
[  459.399022] R13: 0000000000000bd9 R14: 00000000004cdb80 R15: 000000000076bf2c
[  459.400168]
[  459.400430] Allocated by task 20132:
[  459.401038]  kasan_kmalloc+0xbf/0xe0
[  459.401652]  kmem_cache_alloc+0xd5/0x280
[  459.402330]  bdev_alloc_inode+0x18/0x40
[  459.402970]  alloc_inode+0x5f/0x180
[  459.403510]  iget5_locked+0x57/0xd0
[  459.404095]  bdget+0x94/0x4e0
[  459.404607]  bd_acquire+0xfa/0x2c0
[  459.405113]  blkdev_open+0x110/0x290
[  459.405702]  do_dentry_open+0x49e/0x1050
[  459.406340]  path_openat+0x148c/0x3f50
[  459.406926]  do_filp_open+0x1a1/0x280
[  459.407471]  do_sys_open+0x3c3/0x500
[  459.408010]  do_syscall_64+0xc3/0x520
[  459.408572]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  459.409415]
[  459.409679] Freed by task 1262:
[  459.410212]  __kasan_slab_free+0x129/0x170
[  459.410919]  kmem_cache_free+0xb2/0x2a0
[  459.411564]  rcu_process_callbacks+0xbb2/0x2320
[  459.412318]  __do_softirq+0x225/0x8ac

Fix this by delaying bdput() to the end of blkdev_get() which means we
have finished accessing bdev.

Fixes: 77ea887e ("implement in-kernel gendisk events handling")
Reported-by: NHulk Robot <hulkci@huawei.com>
Signed-off-by: NJason Yan <yanaijie@huawei.com>
Tested-by: NSedat Dilek <sedat.dilek@gmail.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDan Carpenter <dan.carpenter@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

2d3a8e2d

03 6月, 2020 1 次提交

fs: convert mpage_readpages to mpage_readahead · d4388340

由 Matthew Wilcox (Oracle) 提交于 6月 01, 2020

Implement the new readahead aop and convert all callers (block_dev,
exfat, ext2, fat, gfs2, hpfs, isofs, jfs, nilfs2, ocfs2, omfs, qnx6,
reiserfs & udf).

The callers are all trivial except for GFS2 & OCFS2.
Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> # ocfs2
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> # ocfs2
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NWilliam Kucharski <william.kucharski@oracle.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-17-willy@infradead.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d4388340

27 5月, 2020 1 次提交

PM: hibernate: Restrict writes to the resume device · ad1e4f74

由 Domenico Andreoli 提交于 5月 19, 2020

Hibernation via snapshot device requires write permission to the swap
block device, the one that more often (but not necessarily) is used to
store the hibernation image.

With this patch, such permissions are granted iff:

 1) snapshot device config option is enabled
 2) swap partition is used as resume device

In other circumstances the swap device is not writable from userspace.

In order to achieve this, every write attempt to a swap device is
checked against the device configured as part of the uswsusp API [0]
using a pointer to the inode struct in memory. If the swap device being
written was not configured for resuming, the write request is denied.

NOTE: this implementation works only for swap block devices, where the
inode configured by swapon (which sets S_SWAPFILE) is the same used
by SNAPSHOT_SET_SWAP_AREA.

In case of swap file, SNAPSHOT_SET_SWAP_AREA indeed receives the inode
of the block device containing the filesystem where the swap file is
located (+ offset in it) which is never passed to swapon and then has
not set S_SWAPFILE.

As result, the swap file itself (as a file) has never an option to be
written from userspace. Instead it remains writable if accessed directly
from the containing block device, which is always writeable from root.

[0] Documentation/power/userland-swsusp.rst

v2:
 - rename is_hibernate_snapshot_dev() to is_hibernate_resume_dev()
 - fix description so to correctly refer to the resume device
Signed-off-by: NDomenico Andreoli <domenico.andreoli@linux.com>
Acked-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

ad1e4f74

22 5月, 2020 1 次提交

block: remove the error_sector argument to blkdev_issue_flush · 9398554f

由 Christoph Hellwig 提交于 5月 13, 2020

The argument isn't used by any caller, and drivers don't fill out
bi_sector for flush requests either.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

9398554f

21 5月, 2020 1 次提交

block: remove ioctl_by_bdev · 3783daeb

由 Christoph Hellwig 提交于 5月 19, 2020

No callers left.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

3783daeb

13 5月, 2020 1 次提交

block: add blk_io_schedule() for avoiding task hung in sync dio · e6249cdd

由 Ming Lei 提交于 5月 03, 2020

Sync dio could be big, or may take long time in discard or in case of
IO failure.

We have prevented task hung in submit_bio_wait() and blk_execute_rq(),
so apply the same trick for prevent task hung from happening in sync dio.

Add helper of blk_io_schedule() and use io_schedule_timeout() to prevent
task hung warning.
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Reviewed-by: NBart Van Assche <bvanassche@acm.org>
Cc: Salman Qazi <sqazi@google.com>
Cc: Jesse Barnes <jsbarnes@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e6249cdd

24 4月, 2020 1 次提交

block: unexport bdev_read_page and bdev_write_page · 895d4775

由 Christoph Hellwig 提交于 4月 24, 2020

Each one just has two callers, both in always built-in code.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

895d4775

21 4月, 2020 3 次提交

block: remove unused header · d56deb1e

由 Ma, Jianpeng 提交于 4月 21, 2020

Dax related code already removed from this file.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJianpeng Ma <jianpeng.ma@intel.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

d56deb1e

block: fold bdev_unhash_inode into invalidate_partition · 9bc5c397

由 Christoph Hellwig 提交于 4月 14, 2020

invalidate_partition and bdev_unhash_inode are always paired, and
invalidate_partition already does an icache lookup for the block device
inode.  Piggy back on that to remove the inode from the hash.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

9bc5c397

block: remove the disk argument from blk_drop_partitions · d46430bf

由 Christoph Hellwig 提交于 4月 14, 2020

The gendisk can be trivially deducted from the block_device.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

d46430bf

20 4月, 2020 1 次提交

bdev: Reduce time holding bd_mutex in sync in blkdev_close() · b849dd84

由 Douglas Anderson 提交于 3月 24, 2020

While trying to "dd" to the block device for a USB stick, I
encountered a hung task warning (blocked for > 120 seconds).  I
managed to come up with an easy way to reproduce this on my system
(where /dev/sdb is the block device for my USB stick) with:

  while true; do dd if=/dev/zero of=/dev/sdb bs=4M; done

With my reproduction here are the relevant bits from the hung task
detector:

 INFO: task udevd:294 blocked for more than 122 seconds.
 ...
 udevd           D    0   294      1 0x00400008
 Call trace:
  ...
  mutex_lock_nested+0x40/0x50
  __blkdev_get+0x7c/0x3d4
  blkdev_get+0x118/0x138
  blkdev_open+0x94/0xa8
  do_dentry_open+0x268/0x3a0
  vfs_open+0x34/0x40
  path_openat+0x39c/0xdf4
  do_filp_open+0x90/0x10c
  do_sys_open+0x150/0x3c8
  ...

 ...
 Showing all locks held in the system:
 ...
 1 lock held by dd/2798:
  #0: ffffff814ac1a3b8 (&bdev->bd_mutex){+.+.}, at: __blkdev_put+0x50/0x204
 ...
 dd              D    0  2798   2764 0x00400208
 Call trace:
  ...
  schedule+0x8c/0xbc
  io_schedule+0x1c/0x40
  wait_on_page_bit_common+0x238/0x338
  __lock_page+0x5c/0x68
  write_cache_pages+0x194/0x500
  generic_writepages+0x64/0xa4
  blkdev_writepages+0x24/0x30
  do_writepages+0x48/0xa8
  __filemap_fdatawrite_range+0xac/0xd8
  filemap_write_and_wait+0x30/0x84
  __blkdev_put+0x88/0x204
  blkdev_put+0xc4/0xe4
  blkdev_close+0x28/0x38
  __fput+0xe0/0x238
  ____fput+0x1c/0x28
  task_work_run+0xb0/0xe4
  do_notify_resume+0xfc0/0x14bc
  work_pending+0x8/0x14

The problem appears related to the fact that my USB disk is terribly
slow and that I have a lot of RAM in my system to cache things.
Specifically my writes seem to be happening at ~15 MB/s and I've got
~4 GB of RAM in my system that can be used for buffering.  To write 4
GB of buffer to disk thus takes ~4000 MB / ~15 MB/s = ~267 seconds.

The 267 second number is a problem because in __blkdev_put() we call
sync_blockdev() while holding the bd_mutex.  Any other callers who
want the bd_mutex will be blocked for the whole time.

The problem is made worse because I believe blkdev_put() specifically
tells other tasks (namely udev) to go try to access the device at right
around the same time we're going to hold the mutex for a long time.

Putting some traces around this (after disabling the hung task detector),
I could confirm:
 dd:    437.608600: __blkdev_put() right before sync_blockdev() for sdb
 udevd: 437.623901: blkdev_open() right before blkdev_get() for sdb
 dd:    661.468451: __blkdev_put() right after sync_blockdev() for sdb
 udevd: 663.820426: blkdev_open() right after blkdev_get() for sdb

A simple fix for this is to realize that sync_blockdev() works fine if
you're not holding the mutex.  Also, it's not the end of the world if
you sync a little early (though it can have performance impacts).
Thus we can make a guess that we're going to need to do the sync and
then do it without holding the mutex.  We still do one last sync with
the mutex but it should be much, much faster.

With this, my hung task warnings for my test case are gone.
Signed-off-by: NDouglas Anderson <dianders@chromium.org>
Reviewed-by: NGuenter Roeck <groeck@chromium.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b849dd84

23 3月, 2020 1 次提交

hibernate: Allow uswsusp to write to swap · 56939e01

由 Domenico Andreoli 提交于 3月 23, 2020

It turns out that there is one use case for programs being able to
write to swap devices, and that is the userspace hibernation code.

Quick fix: disable the S_SWAPFILE check if hibernation is configured.

Fixes: dc617f29 ("vfs: don't allow writes to swap files")
Reported-by: NDomenico Andreoli <domenico.andreoli@linux.com>
Reported-by: NMarian Klein <mkleinsoft@gmail.com>
Signed-off-by: NDomenico Andreoli <domenico.andreoli@linux.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

56939e01

18 3月, 2020 1 次提交

block: fix a device invalidation regression · d981cb5b

由 Christoph Hellwig 提交于 3月 18, 2020

Historically we only set the capacity to zero for devices that support
partitions (independ of actually having partitions created).  Doing that
is rather inconsistent, but changing it broke legacy udisks polling for
legacy ide-cdrom devices.  Use the crude a crude check for devices that
either are non-removable or partitionable to get the sane behavior for
most device while not breaking userspace for this particular setup.

Fixes: a1548b67 ("block: move rescan_partitions to fs/block_dev.c")
Reported-by: NHe Zhe <zhe.he@windriver.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Tested-by: NHe Zhe <zhe.he@windriver.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

d981cb5b

03 12月, 2019 1 次提交

block: don't send uevent for empty disk when not invalidating · 490547ca

由 Eric Biggers 提交于 12月 02, 2019

Commit 6917d068 ("block: merge invalidate_partitions into
rescan_partitions") caused a regression where systemd-udevd spins
forever using max CPU starting at boot time.

It's caused by a behavior change where a KOBJ_CHANGE uevent is now sent
in a case where previously it wasn't.

Restore the old behavior.

Fixes: 6917d068 ("block: merge invalidate_partitions into rescan_partitions")
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NEric Biggers <ebiggers@google.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

490547ca

14 11月, 2019 5 次提交

block: move clearing bd_invalidated into check_disk_size_change · 979c690d

由 Christoph Hellwig 提交于 11月 14, 2019

Both callers of check_disk_size_change clear bd_invalidate directly
after the call, so move the clearing into check_disk_size_change
itself.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

979c690d

block: remove (__)blkdev_reread_part as an exported API · f0b870df

由 Christoph Hellwig 提交于 11月 14, 2019

In general drivers should never mess with partition tables directly.
Unfortunately s390 and loop do for somewhat historic reasons, but they
can use bdev_disk_changed directly instead when we export it as they
satisfy the sanity checks we have in __blkdev_reread_part.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>	[dasd]
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

f0b870df

block: fix bdev_disk_changed for non-partitioned devices · 142fe8f4

由 Christoph Hellwig 提交于 11月 14, 2019

We still have to set the capacity to 0 if invalidating or call
revalidate_disk if not even if the disk has no partitions.  Fix
that by merging rescan_partitions into bdev_disk_changed and just
stubbing out blk_add_partitions and blk_drop_partitions for
non-partitioned devices.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

142fe8f4

block: move rescan_partitions to fs/block_dev.c · a1548b67

由 Christoph Hellwig 提交于 11月 14, 2019

Large parts of rescan_partitions aren't about partitions, and
moving it to block_dev.c will allow for some further cleanups by
merging it into its only caller.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a1548b67

block: merge invalidate_partitions into rescan_partitions · 6917d068

由 Christoph Hellwig 提交于 11月 14, 2019

A lot of the logic in invalidate_partitions and rescan_partitions is
shared.  Merge the two functions to simplify things.  There is a small
behavior change in that we now send the kevent change notice also if we
were not invalidating but no partitions were found, which seems like
the right thing to do.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

6917d068

03 11月, 2019 2 次提交

bdev: Refresh bdev size for disks without partitioning · cba22d86

由 Jan Kara 提交于 10月 21, 2019

Currently, block device size in not updated on second and further open
for block devices where partition scan is disabled. This is particularly
annoying for example for DVD drives as that means block device size does
not get updated once the media is inserted into a drive if the device is
already open when inserting the media. This is actually always the case
for example when pktcdvd is in use.

Fix the problem by revalidating block device size on every open even for
devices with partition scan disabled.
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

cba22d86

bdev: Factor out bdev revalidation into a common helper · 731dc486

由 Jan Kara 提交于 10月 21, 2019

Factor out code handling revalidation of bdev on disk change into a
common helper.
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

731dc486

20 8月, 2019 1 次提交

vfs: don't allow writes to swap files · dc617f29

由 Darrick J. Wong 提交于 8月 20, 2019

Don't let userspace write to an active swap file because the kernel
effectively has a long term lease on the storage and things could get
seriously corrupted if we let this happen.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

dc617f29

16 8月, 2019 1 次提交

block: remove REQ_NOWAIT_INLINE · 7b6620d7

由 Jens Axboe 提交于 8月 15, 2019

We had a few issues with this code, and there's still a problem around
how we deal with error handling for chained/split bios. For now, just
revert the code and we'll try again with a thoroug solution. This
reverts commits:

e15c2ffa ("block: fix O_DIRECT error handling for bio fragments")
0eb6ddfb ("block: Fix __blkdev_direct_IO() for bio fragments")
6a43074e ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
893a1c97 ("blk-mq: allow REQ_NOWAIT to return an error inline")
Signed-off-by: NJens Axboe <axboe@kernel.dk>

7b6620d7

08 8月, 2019 2 次提交

bdev: Fixup error handling in blkdev_get() · e91455ba

由 Jan Kara 提交于 8月 07, 2019

Commit 89e524c0 ("loop: Fix mount(2) failure due to race with
LOOP_SET_FD") converted blkdev_get() to use the new helpers for
finishing claiming of a block device. However the conversion botched the
error handling in blkdev_get() and thus the bdev has been marked as held
even in case __blkdev_get() returned error. This led to occasional
warnings with block/001 test from blktests like:

kernel: WARNING: CPU: 5 PID: 907 at fs/block_dev.c:1899 __blkdev_put+0x396/0x3a0

Correct the error handling.

CC: stable@vger.kernel.org
Fixes: 89e524c0 ("loop: Fix mount(2) failure due to race with LOOP_SET_FD")
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e91455ba

block: fix O_DIRECT error handling for bio fragments · e15c2ffa

由 Jens Axboe 提交于 8月 06, 2019

0eb6ddfb tried to fix this up, but introduced a use-after-free
of dio. Additionally, we still had an issue with error handling,
as reported by Darrick:

"I noticed a regression in xfs/747 (an unreleased xfstest for the
xfs_scrub media scanning feature) on 5.3-rc3.  I'll condense that down
to a simpler reproducer:

error-test: 0 209 linear 8:48 0
error-test: 209 1 error
error-test: 210 6446894 linear 8:48 210

Basically we have a ~3G /dev/sdd and we set up device mapper to fail IO
for sector 209 and to pass the io to the scsi device everywhere else.

On 5.3-rc3, performing a directio pread of this range with a < 1M buffer
(in other words, a request for fewer than MAX_BIO_PAGES bytes) yields
EIO like you'd expect:

pread64(3, 0x7f880e1c7000, 1048576, 0)  = -1 EIO (Input/output error)
pread: Input/output error
+++ exited with 0 +++

But doing it with a larger buffer succeeds(!):

pread64(3, "XFSB\0\0\20\0\0\0\0\0\0\fL\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1146880, 0) = 1146880
read 1146880/1146880 bytes at offset 0
1 MiB, 1 ops; 0.0009 sec (1.124 GiB/sec and 1052.6316 ops/sec)
+++ exited with 0 +++

(Note that the part of the buffer corresponding to the dm-error area is
uninitialized)

On 5.3-rc2, both commands would fail with EIO like you'd expect.  The
only change between rc2 and rc3 is commit 0eb6ddfb ("block: Fix
__blkdev_direct_IO() for bio fragments").

AFAICT we end up in __blkdev_direct_IO with a 1120K buffer, which gets
split into two bios: one for the first BIO_MAX_PAGES worth of data (1MB)
and a second one for the 96k after that."

Fix this by noting that it's always safe to dereference dio if we get
BLK_QC_T_EAGAIN returned, as end_io hasn't been run for that case. So
we can safely increment the dio size before calling submit_bio(), and
then decrement it on failure (not that it really matters, as the bio
and dio are going away).

For error handling, return to the original method of just using 'ret'
for tracking the error, and the size tracking in dio->size.

Fixes: 0eb6ddfb ("block: Fix __blkdev_direct_IO() for bio fragments")
Fixes: 6a43074e ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
Reported-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e15c2ffa

02 8月, 2019 1 次提交

block: Fix __blkdev_direct_IO() for bio fragments · 0eb6ddfb

由 Damien Le Moal 提交于 8月 01, 2019

The recent fix to properly handle IOCB_NOWAIT for async O_DIRECT IO
(patch 6a43074e) introduced two problems with BIO fragment handling
for direct IOs:
1) The dio size processed is calculated by incrementing the ret variable
by the size of the bio fragment issued for the dio. However, this size
is obtained directly from bio->bi_iter.bi_size AFTER the bio submission
which may result in referencing the bi_size value after the bio
completed, resulting in an incorrect value use.
2) The ret variable is not incremented by the size of the last bio
fragment issued for the bio, leading to an invalid IO size being
returned to the user.

Fix both problem by using dio->size (which is incremented before the bio
submission) to update the value of ret after bio submissions, including
for the last bio fragment issued.

Fixes: 6a43074e ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
Reported-by: NMasato Suzuki <masato.suzuki@wdc.com>
Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

0eb6ddfb

31 7月, 2019 1 次提交

loop: Fix mount(2) failure due to race with LOOP_SET_FD · 89e524c0

由 Jan Kara 提交于 7月 30, 2019

Commit 33ec3e53 ("loop: Don't change loop device under exclusive
opener") made LOOP_SET_FD ioctl acquire exclusive block device reference
while it updates loop device binding. However this can make perfectly
valid mount(2) fail with EBUSY due to racing LOOP_SET_FD holding
temporarily the exclusive bdev reference in cases like this:

for i in {a..z}{a..z}; do
        dd if=/dev/zero of=$i.image bs=1k count=0 seek=1024
        mkfs.ext2 $i.image
        mkdir mnt$i
done

echo "Run"
for i in {a..z}{a..z}; do
        mount -o loop -t ext2 $i.image mnt$i &
done

Fix the problem by not getting full exclusive bdev reference in
LOOP_SET_FD but instead just mark the bdev as being claimed while we
update the binding information. This just blocks new exclusive openers
instead of failing them with EBUSY thus fixing the problem.

Fixes: 33ec3e53 ("loop: Don't change loop device under exclusive opener")
Cc: stable@vger.kernel.org
Tested-by: NKai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

89e524c0

22 7月, 2019 1 次提交

block: properly handle IOCB_NOWAIT for async O_DIRECT IO · 6a43074e

由 Jens Axboe 提交于 7月 16, 2019

A caller is supposed to pass in REQ_NOWAIT if we can't block for any
given operation, but O_DIRECT for block devices just ignore this. Hence
we'll block for various resource shortages on the block layer side,
like having to wait for requests.

Use the new REQ_NOWAIT_INLINE to ask for this error to be returned
inline, so we can handle it appropriately and return -EAGAIN to the
caller.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

6a43074e

29 6月, 2019 2 次提交

block_dev: use bio_release_pages in bio_unmap_user · 9fec4a21

由 Christoph Hellwig 提交于 6月 26, 2019

Use bio_release_pages instead of duplicating it.
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

9fec4a21

block_dev: use bio_release_pages in blkdev_bio_end_io · 57dfe3ce

由 Christoph Hellwig 提交于 6月 26, 2019

Use bio_release_pages instead of duplicating it.
Reviewed-by: NMinwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

57dfe3ce

27 5月, 2019 1 次提交

block: Don't revalidate bdev of hidden gendisk · 31cb1d64

由 Jan Kara 提交于 5月 15, 2019

When hidden gendisk is revalidated, there's no point in revalidating
associated block device as there's none. We would thus just create new
bdev inode, report "detected capacity change from 0 to XXX" message and
evict the bdev inode again. Avoid this pointless dance and confusing
message in the kernel log.
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

31cb1d64

26 5月, 2019 1 次提交

vfs: Convert bdev to use the new mount API · 9030d16e

由 David Howells 提交于 3月 25, 2019

Convert the bdev filesystem to the new internal mount API as the old
one will be obsoleted and removed.  This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.

See Documentation/filesystems/mount_api.txt for more information.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

9030d16e

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功