提交 · 0d20dcc277cfb50ec1de4db0d758f30e8f597d30 · openeuler / Kernel

01 8月, 2020 1 次提交

block: genhd: delete duplicated words · 0d20dcc2

由 Randy Dunlap 提交于 7月 30, 2020

Drop the repeated word "to" in multiple places.
Signed-off-by: NRandy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: NJens Axboe <axboe@kernel.dk>

0d20dcc2

18 7月, 2020 1 次提交

blk-cgroup: show global disk stats in root cgroup io.stat · ef45fe47

由 Boris Burkov 提交于 6月 01, 2020

In order to improve consistency and usability in cgroup stat accounting,
we would like to support the root cgroup's io.stat.

Since the root cgroup has processes doing io even if the system has no
explicitly created cgroups, we need to be careful to avoid overhead in
that case.  For that reason, the rstat algorithms don't handle the root
cgroup, so just turning the file on wouldn't give correct statistics.

To get around this, we simulate flushing the iostat struct by filling it
out directly from global disk stats. The result is a root cgroup io.stat
file consistent with both /proc/diskstats and io.stat.

Note that in order to collect the disk stats, we needed to iterate over
devices. To facilitate that, we had to change the linkage of a disk_type
to external so that it can be used from blk-cgroup.c to iterate over
disks.
Suggested-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NBoris Burkov <boris@bur.io>
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

ef45fe47

09 7月, 2020 1 次提交

md: switch to ->check_events for media change notifications · a564e23f

由 Christoph Hellwig 提交于 7月 08, 2020

md is the last driver using the legacy media_changed method.  Switch
it over to (not so) new ->clear_events approach, which also removes the
need for the ->revalidate_disk method.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
[axboe: remove unused 'bdops' variable in disk_clear_events()]
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a564e23f

24 6月, 2020 3 次提交

block: revert back to synchronous request_queue removal · e8c7d14a

由 Luis Chamberlain 提交于 6月 19, 2020

Commit dc9edc44 ("block: Fix a blk_exit_rl() regression") merged on
v4.12 moved the work behind blk_release_queue() into a workqueue after a
splat floated around which indicated some work on blk_release_queue()
could sleep in blk_exit_rl(). This splat would be possible when a driver
called blk_put_queue() or blk_cleanup_queue() (which calls blk_put_queue()
as its final call) from an atomic context.

blk_put_queue() decrements the refcount for the request_queue kobject, and
upon reaching 0 blk_release_queue() is called. Although blk_exit_rl() is
now removed through commit db6d9952 ("block: remove request_list code")
on v5.0, we reserve the right to be able to sleep within
blk_release_queue() context.

The last reference for the request_queue must not be called from atomic
context. *When* the last reference to the request_queue reaches 0 varies,
and so let's take the opportunity to document when that is expected to
happen and also document the context of the related calls as best as
possible so we can avoid future issues, and with the hopes that the
synchronous request_queue removal sticks.

We revert back to synchronous request_queue removal because asynchronous
removal creates a regression with expected userspace interaction with
several drivers. An example is when removing the loopback driver, one
uses ioctls from userspace to do so, but upon return and if successful,
one expects the device to be removed. Likewise if one races to add another
device the new one may not be added as it is still being removed. This was
expected behavior before and it now fails as the device is still present
and busy still. Moving to asynchronous request_queue removal could have
broken many scripts which relied on the removal to have been completed if
there was no error. Document this expectation as well so that this
doesn't regress userspace again.

Using asynchronous request_queue removal however has helped us find
other bugs. In the future we can test what could break with this
arrangement by enabling CONFIG_DEBUG_KOBJECT_RELEASE.

While at it, update the docs with the context expectations for the
request_queue / gendisk refcount decrement, and make these
expectations explicit by using might_sleep().

Fixes: dc9edc44 ("block: Fix a blk_exit_rl() regression")
Suggested-by: NNicolai Stange <nstange@suse.de>
Signed-off-by: NLuis Chamberlain <mcgrof@kernel.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NBart Van Assche <bvanassche@acm.org>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: yu kuai <yukuai3@huawei.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e8c7d14a

block: clarify context for refcount increment helpers · 763b5892

由 Luis Chamberlain 提交于 6月 19, 2020

Let us clarify the context under which the helpers to increment the
refcount for the gendisk and request_queue can be called under. We
make this explicit on the places where we may sleep with might_sleep().

We don't address the decrement context yet, as that needs some extra
work and fixes, but will be addressed in the next patch.
Signed-off-by: NLuis Chamberlain <mcgrof@kernel.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NBart Van Assche <bvanassche@acm.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

763b5892

block: add docs for gendisk / request_queue refcount helpers · b5bd357c

由 Luis Chamberlain 提交于 6月 19, 2020

This adds documentation for the gendisk / request_queue refcount
helpers.
Signed-off-by: NLuis Chamberlain <mcgrof@kernel.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NBart Van Assche <bvanassche@acm.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b5bd357c

27 5月, 2020 2 次提交

block: remove rcu_read_lock() from part_stat_lock() · 8ab1d40a

由 Konstantin Khlebnikov 提交于 5月 27, 2020

The RCU lock is required only in disk_map_sector_rcu() to lookup the
partition.  After that request holds reference to related hd_struct.

Replace get_cpu() with preempt_disable() - returned cpu index is unused.

[hch: rebased]
Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

8ab1d40a

block: always use a percpu variable for disk stats · 58d4f14f

由 Christoph Hellwig 提交于 5月 27, 2020

percpu variables have a perfectly fine working stub implementation
for UP kernels, so use that.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

58d4f14f

19 5月, 2020 2 次提交

block: merge part_{inc,dev}_in_flight into their only callers · 10ec5e86

由 Christoph Hellwig 提交于 5月 13, 2020

part_inc_in_flight and part_dec_in_flight only have one caller each, and
those callers are purely for bio based drivers.  Merge each function into
the only caller, and remove the superflous blk-mq checks.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

10ec5e86

block: move the blk-mq calls out of part_in_flight{,_rw} · b2f609e1

由 Christoph Hellwig 提交于 5月 13, 2020

Don't bother to call part_in_flight / part_in_flight_rw on blk-mq
devices, just call the blk-mq versions directly.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b2f609e1

13 5月, 2020 3 次提交

block: don't hold part0's refcount in IO path · 27eb3af9

由 Ming Lei 提交于 5月 08, 2020

gendisk can't be gone when there is IO activity, so not hold
part0's refcount in IO path.
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@infradead.org>
Cc: Yufen Yu <yuyufen@huawei.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hou Tao <houtao1@huawei.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

27eb3af9

block: only define 'nr_sects_seq' in hd_part for 32bit SMP · 07c4e1e8

由 Ming Lei 提交于 5月 08, 2020

The seqcount of 'nr_sects_seq' is only needed in case of 32bit SMP,
so define it just for 32bit SMP.
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@infradead.org>
Cc: Yufen Yu <yuyufen@huawei.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hou Tao <houtao1@huawei.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

07c4e1e8

block: fix use-after-free on cached last_lookup partition · b7d6c303

由 Ming Lei 提交于 5月 08, 2020

delete_partition() clears the cached last_lookup partition. However the
.last_lookup cache may be overwritten by one IO path after it is cleared
from delete_partition(). Then another IO path may use the cached deleting
partition after hd_struct_free() is called, then use-after-free is triggered
on the cached partition.

Fixes the issue by the following approach:

1) always get the partition's refcount via hd_struct_try_get() before
setting .last_lookup

2) move clearing .last_lookup from delete_partition() to hd_struct_free()
which is the release handle of the partition's percpu-refcount, so that no
IO path can cache deleteing partition via .last_lookup.

It is one candidate approach of Yufen's patch[1] which adds overhead
in fast path by indirect lookup which may introduce one extra cacheline
in IO path. Also this patch relies on percpu-refcount's protection, and
it is easier to understand and verify.

[1] https://lore.kernel.org/linux-block/20200109013551.GB9655@ming.t460p/T/#tReported-by: NYufen Yu <yuyufen@huawei.com>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hou Tao <houtao1@huawei.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b7d6c303

10 5月, 2020 1 次提交

bdi: remove bdi_register_owner · 3c5d202b

由 Christoph Hellwig 提交于 5月 04, 2020

Split out a new bdi_set_owner helper to set the owner, and move the policy
for creating the bdi name back into genhd.c, where it belongs.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJan Kara <jack@suse.cz>
Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: NBart Van Assche <bvanassche@acm.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

3c5d202b

21 4月, 2020 3 次提交

block: fold bdev_unhash_inode into invalidate_partition · 9bc5c397

由 Christoph Hellwig 提交于 4月 14, 2020

invalidate_partition and bdev_unhash_inode are always paired, and
invalidate_partition already does an icache lookup for the block device
inode.  Piggy back on that to remove the inode from the hash.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

9bc5c397

block: mark invalidate_partition static · 02d33b67

由 Christoph Hellwig 提交于 4月 14, 2020

invalidate_partition is only used in genhd.c, so mark it static.  Also
drop the return value given that is is always ignored.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

02d33b67

block: pass a hd_struct to delete_partition · cddae808

由 Christoph Hellwig 提交于 4月 14, 2020

All callers have the hd_struct at hand, so pass it instead of performing
another lookup.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

cddae808

27 3月, 2020 1 次提交

block: move the ->devnode callback to struct block_device_operations · 348e114b

由 Christoph Hellwig 提交于 3月 27, 2020

There really isn't any good reason to stash a method directly into
struct gendisk.  Move it together with the other block device
operations.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

348e114b

25 3月, 2020 7 次提交

block: unexport get_gendisk · 1b4d4dbd

由 Christoph Hellwig 提交于 3月 25, 2020

get_gendisk is not used by any modular code.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

1b4d4dbd

block: unexport disk_map_sector_rcu · a7818aed

由 Christoph Hellwig 提交于 3月 25, 2020

disk_map_sector_rcu is not used by any modular code.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a7818aed

block: unexport disk_get_part · 572e7fc8

由 Christoph Hellwig 提交于 3月 25, 2020

disk_get_part is not used by any modular code.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

572e7fc8

C
block: mark part_in_flight and part_in_flight_rw static · 6005771c
由 Christoph Hellwig 提交于 3月 25, 2020
```
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
```
6005771c

block: mark block_depr static · 31eb6186

由 Christoph Hellwig 提交于 3月 25, 2020

Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

31eb6186

block/diskstats: replace time_in_queue with sum of request times · 8cd5b8fc

由 Konstantin Khlebnikov 提交于 3月 25, 2020

Column "time_in_queue" in diskstats is supposed to show total waiting time
of all requests. I.e. value should be equal to the sum of times from other
columns. But this is not true, because column "time_in_queue" is counted
separately in jiffies rather than in nanoseconds as other times.

This patch removes redundant counter for "time_in_queue" and shows total
time of read, write, discard and flush requests.
Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

8cd5b8fc

block/diskstats: accumulate all per-cpu counters in one pass · ea18e0f0

由 Konstantin Khlebnikov 提交于 3月 25, 2020

Reading /proc/diskstats iterates over all cpus for summing each field.
It's faster to sum all fields in one pass.

Hammering /proc/diskstats with fio shows 2x performance improvement:

fio --name=test --numjobs=$JOBS --filename=/proc/diskstats \
    --size=1k --bs=1k --fallocate=none --create_on_open=1 \
    --time_based=1 --runtime=10 --invalidate=0 --group_report

	  JOBS=1	JOBS=10
Before:	  7k iops	64k iops
After:	 18k iops      120k iops

Also this way code is more compact:

add/remove: 1/0 grow/shrink: 0/2 up/down: 194/-1540 (-1346)
Function                                     old     new   delta
part_stat_read_all                             -     194    +194
diskstats_show                              1344     631    -713
part_stat_show                              1219     392    -827
Total: Before=14966947, After=14965601, chg -0.01%
Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

ea18e0f0

24 3月, 2020 3 次提交

block: move sysfs methods shared by disks and partitions to genhd.c · 3ad5cee5

由 Christoph Hellwig 提交于 3月 24, 2020

Move the sysfs _show methods that are used both on the full disk and
partition nodes to genhd.c instead of hiding them in the partitioning
code.  Also move the declaration for these methods to block/blk.h so
that we don't expose them to drivers.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

3ad5cee5

block: move disk_name and related helpers out of partition-generic.c · 5cbd28e3

由 Christoph Hellwig 提交于 3月 24, 2020

Thes functions aren't really related to partition support, so move them
to a more suitable place.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

5cbd28e3

block: remove the blk_lookup_devt export · d2332c5c

由 Christoph Hellwig 提交于 3月 24, 2020

This function is only used by init/do_mounts.c, which can't be modular.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

d2332c5c

19 3月, 2020 1 次提交

block/genhd: Notify udev about capacity change · e598a72f

由 Balbir Singh 提交于 3月 13, 2020

Allow block/genhd to notify user space (via udev) about disk size changes
using a new helper set_capacity_revalidate_and_notify(), which is a wrapper
on top of set_capacity(). set_capacity_revalidate_and_notify() will only
notify via udev if the current capacity or the target capacity is not zero
and iff the capacity changes.
Suggested-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NSomeswarudu Sangaraju <ssomesh@amazon.com>
Signed-off-by: NBalbir Singh <sblbir@amazon.com>
Reviewed-by: NBob Liu <bob.liu@oracle.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e598a72f

12 3月, 2020 1 次提交

block: Fix partition support for host aware zoned block devices · b53df2e7

由 Shin'ichiro Kawasaki 提交于 2月 21, 2020

Commit b7205307 ("block: allow partitions on host aware zone
devices") introduced the helper function disk_has_partitions() to check
if a given disk has valid partitions. However, since this function result
directly depends on the disk partition table length rather than the
actual existence of valid partitions in the table, it returns true even
after all partitions are removed from the disk. For host aware zoned
block devices, this results in zone management support to be kept
disabled even after removing all partitions.

Fix this by changing disk_has_partitions() to walk through the partition
table entries and return true if and only if a valid non-zero size
partition is found.

Fixes: b7205307 ("block: allow partitions on host aware zone devices")
Cc: stable@vger.kernel.org # 5.5
Reviewed-by: NDamien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NShin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b53df2e7

22 11月, 2019 1 次提交

block: add iostat counters for flush requests · b6866318

由 Konstantin Khlebnikov 提交于 11月 21, 2019

Requests that triggers flushing volatile writeback cache to disk (barriers)
have significant effect to overall performance.

Block layer has sophisticated engine for combining several flush requests
into one. But there is no statistics for actual flushes executed by disk.
Requests which trigger flushes usually are barriers - zero-size writes.

This patch adds two iostat counters into /sys/class/block/$dev/stat and
/proc/diskstats - count of completed flush requests and their total time.
Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b6866318

06 9月, 2019 1 次提交

block: Delay default elevator initialization · 737eb78e

由 Damien Le Moal 提交于 9月 05, 2019

When elevator_init_mq() is called from blk_mq_init_allocated_queue(),
the only information known about the device is the number of hardware
queues as the block device scan by the device driver is not completed
yet for most drivers. The device type and elevator required features
are not set yet, preventing to correctly select the default elevator
most suitable for the device.

This currently affects all multi-queue zoned block devices which default
to the "none" elevator instead of the required "mq-deadline" elevator.
These drives currently include host-managed SMR disks connected to a
smartpqi HBA and null_blk block devices with zoned mode enabled.
Upcoming NVMe Zoned Namespace devices will also be affected.

Fix this by adding the boolean elevator_init argument to
blk_mq_init_allocated_queue() to control the execution of
elevator_init_mq(). Two cases exist:
1) elevator_init = false is used for calls to
   blk_mq_init_allocated_queue() within blk_mq_init_queue(). In this
   case, a call to elevator_init_mq() is added to __device_add_disk(),
   resulting in the delayed initialization of the queue elevator
   after the device driver finished probing the device information. This
   effectively allows elevator_init_mq() access to more information
   about the device.
2) elevator_init = true preserves the current behavior of initializing
   the elevator directly from blk_mq_init_allocated_queue(). This case
   is used for the special request based DM devices where the device
   gendisk is created before the queue initialization and device
   information (e.g. queue limits) is already known when the queue
   initialization is executed.

Additionally, to make sure that the elevator initialization is never
done while requests are in-flight (there should be none when the device
driver calls device_add_disk()), freeze and quiesce the device request
queue before calling blk_mq_init_sched() in elevator_init_mq().
Reviewed-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

737eb78e

17 7月, 2019 1 次提交

block: fix sysfs module parameters directory path in comment · 1624b0b2

由 Akinobu Mita 提交于 7月 16, 2019

The runtime configurable module parameter files are located under
/sys/module/MODULENAME/parameters, not /sys/module/MODULENAME.

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

1624b0b2

15 6月, 2019 1 次提交

block: genhd: Use struct_size() helper · 78b90a2c

由 Gustavo A. R. Silva 提交于 5月 31, 2019

Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes, in particular in the
context in which this code is being used.

So, replace the following form:

sizeof(*new_ptbl) + target * sizeof(new_ptbl->part[0])

with:

struct_size(new_ptbl, part, target)

Also, notice that variable size is unnecessary, hence it is removed.

This code was detected with the help of Coccinelle.
Signed-off-by: NGustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

78b90a2c

01 6月, 2019 1 次提交

block: Convert blk_invalidate_devt() header into a non-kernel-doc header · 33c826ef

由 Bart Van Assche 提交于 5月 30, 2019

This patch avoids that the kernel-doc tool warns about this function
header when building with W=1.
Reviewed-by: NChaitanya Kulkarni <chiatanya.kulkarni@wdc.com>
Signed-off-by: NBart Van Assche <bvanassche@acm.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

33c826ef

01 5月, 2019 1 次提交

block: add SPDX tags to block layer files missing licensing information · 3dcf60bc

由 Christoph Hellwig 提交于 4月 30, 2019

Various block layer files do not have any licensing information at all.
Add SPDX tags for the default kernel GPLv2 license to those.
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

3dcf60bc

22 4月, 2019 1 次提交

block: fix use-after-free on gendisk · 6fcc44d1

由 Yufen Yu 提交于 4月 02, 2019

commit 2da78092 "block: Fix dev_t minor allocation lifetime"
specifically moved blk_free_devt(dev->devt) call to part_release()
to avoid reallocating device number before the device is fully
shutdown.

However, it can cause use-after-free on gendisk in get_gendisk().
We use md device as example to show the race scenes:

Process1		Worker			Process2
md_free
						blkdev_open
del_gendisk
  add delete_partition_work_fn() to wq
  						__blkdev_get
						get_gendisk
put_disk
  disk_release
    kfree(disk)
    						find part from ext_devt_idr
						get_disk_and_module(disk)
    					  	cause use after free

    			delete_partition_work_fn
			put_device(part)
    		  	part_release
		    	remove part from ext_devt_idr

Before <devt, hd_struct pointer> is removed from ext_devt_idr by
delete_partition_work_fn(), we can find the devt and then access
gendisk by hd_struct pointer. But, if we access the gendisk after
it have been freed, it can cause in use-after-freeon gendisk in
get_gendisk().

We fix this by adding a new helper blk_invalidate_devt() in
delete_partition() and del_gendisk(). It replaces hd_struct
pointer in idr with value 'NULL', and deletes the entry from
idr in part_release() as we do now.

Thanks to Jan Kara for providing the solution and more clear comments
for the code.

Fixes: 2da78092 ("block: Fix dev_t minor allocation lifetime")
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: NBart Van Assche <bvanassche@acm.org>
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Suggested-by: NJan Kara <jack@suse.cz>
Signed-off-by: NYufen Yu <yuyufen@huawei.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

6fcc44d1

16 4月, 2019 1 次提交

block: fix use-after-free on gendisk · 2c88e3c7

由 Yufen Yu 提交于 4月 02, 2019

commit 2da78092 "block: Fix dev_t minor allocation lifetime"
specifically moved blk_free_devt(dev->devt) call to part_release()
to avoid reallocating device number before the device is fully
shutdown.

However, it can cause use-after-free on gendisk in get_gendisk().
We use md device as example to show the race scenes:

Process1		Worker			Process2
md_free
						blkdev_open
del_gendisk
  add delete_partition_work_fn() to wq
  						__blkdev_get
						get_gendisk
put_disk
  disk_release
    kfree(disk)
    						find part from ext_devt_idr
						get_disk_and_module(disk)
    					  	cause use after free

    			delete_partition_work_fn
			put_device(part)
    		  	part_release
		    	remove part from ext_devt_idr

Before <devt, hd_struct pointer> is removed from ext_devt_idr by
delete_partition_work_fn(), we can find the devt and then access
gendisk by hd_struct pointer. But, if we access the gendisk after
it have been freed, it can cause in use-after-freeon gendisk in
get_gendisk().

We fix this by adding a new helper blk_invalidate_devt() in
delete_partition() and del_gendisk(). It replaces hd_struct
pointer in idr with value 'NULL', and deletes the entry from
idr in part_release() as we do now.

Thanks to Jan Kara for providing the solution and more clear comments
for the code.

Fixes: 2da78092 ("block: Fix dev_t minor allocation lifetime")
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: NBart Van Assche <bvanassche@acm.org>
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Suggested-by: NJan Kara <jack@suse.cz>
Signed-off-by: NYufen Yu <yuyufen@huawei.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

2c88e3c7

13 4月, 2019 2 次提交

block: check_events: don't bother with events if unsupported · cdf3e3de

由 Martin Wilck 提交于 3月 27, 2019

Drivers now report to the block layer if they support media change
events. If this is not the case, there's no need to allocate the event
structure, and all event handling code can effectively be skipped. This
simplifies code flow in particular for non-removable sd devices.

This effectively reverts commit 75e3f3ee ("block: always allocate
genhd->ev if check_events is implemented").

The sysfs files for the events are kept in place even if no events are
supported, as user space may rely on them being present. The only
difference is that an error code is now returned if the user tries to
set poll_msecs.
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NMartin Wilck <mwilck@suse.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

cdf3e3de

block: disk_events: introduce event flags · c92e2f04

由 Martin Wilck 提交于 3月 27, 2019

Currently, an empty disk->events field tells the block layer not to
forward media change events to user space. This was done in commit
7c88a168 ("block: don't propagate unlisted DISK_EVENTs to userland")
in order to avoid events from "fringe" drivers to be forwarded to user
space. By doing so, the block layer lost the information which events
were supported by a particular block device, and most importantly,
whether or not a given device supports media change events at all.

Prepare for not interpreting the "events" field this way in the future
any more. This is done by adding an additional field "event_flags" to
struct gendisk, and two flag bits that can be set to have the device
treated like one that had the "events" field set to a non-zero value
before. This applies only to the sd and sr drivers, which are changed to
set the new flags.

The new flags are DISK_EVENT_FLAG_POLL to enforce polling of the device
for synchronous events, and DISK_EVENT_FLAG_UEVENT to tell the
blocklayer to generate udev events from kernel events.

In order to add the event_flags field to struct gendisk, the events
field is converted to an "unsigned short"; it doesn't need to hold
values bigger than 2 anyway.

This patch doesn't change behavior.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NMartin Wilck <mwilck@suse.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

c92e2f04

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功