1. 24 May 2021, 9 commits
  2. 20 May 2021, 1 commit
  3. 14 May 2021, 3 commits
  4. 12 May 2021, 2 commits
    • block, bfq: avoid circular stable merges · 7ea96eef
      Authored by Paolo Valente
      BFQ may merge a new bfq_queue, stably, with the last bfq_queue
      created. In particular, BFQ first waits a little bit for some I/O to
      flow inside the new queue, say Q2, if this is needed to understand
      whether it is better or worse to merge Q2 with the last queue created,
      say Q1. This delayed stable merge is performed by assigning
      bic->stable_merge_bfqq = Q1, for the bic associated with Q1.
      
      Yet, while waiting for some I/O to flow in Q2, a non-stable queue
      merge of Q2 with Q1 may happen, causing the bic previously associated
      with Q2 to be associated with exactly Q1 (bic->bfqq = Q1). After that,
      Q2 and Q1 may happen to be split, and, in the split, Q1 may happen to
      be recycled as a non-shared bfq_queue. In that case, Q1 may then
      happen to undergo a stable merge with the bfq_queue pointed by
      bic->stable_merge_bfqq. Yet bic->stable_merge_bfqq still points to
      Q1. So Q1 would be merged with itself.
      
      This commit fixes this error by intercepting this situation and
      cancelling the scheduled stable merge (a sketch of the guard follows
      below).
      
      Fixes: 430a67f9 ("block, bfq: merge bursts of newly-created queues")
      Signed-off-by: Pietro Pedroni <pedroni.pietro.96@gmail.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Link: https://lore.kernel.org/r/20210512094352.85545-2-paolo.valente@linaro.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7ea96eef
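      A minimal C sketch of the guard described above (simplified stand-in
      types and a hypothetical helper name, not the actual bfq code): before
      performing a scheduled delayed stable merge, check whether the recorded
      merge target is the queue itself and, if so, cancel the merge.

      #include <stddef.h>

      /* Sketch only: simplified stand-ins for the bfq data structures. */
      struct bfq_queue;                            /* opaque here */

      struct bfq_io_cq {                           /* simplified "bic" */
              struct bfq_queue *bfqq;              /* currently associated queue */
              struct bfq_queue *stable_merge_bfqq; /* scheduled merge target */
      };

      /* Hypothetical helper: returns the queue to use after the check. */
      static struct bfq_queue *
      bfq_do_or_cancel_stable_merge(struct bfq_io_cq *bic, struct bfq_queue *bfqq)
      {
              if (bic->stable_merge_bfqq == bfqq) {
                      /* Q1 would be merged with itself: cancel the schedule. */
                      bic->stable_merge_bfqq = NULL;
                      return bfqq;
              }
              /* ... otherwise proceed with the delayed stable merge ... */
              return bic->stable_merge_bfqq;
      }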
    • blk-iocost: fix weight updates of inner active iocgs · e9f4eee9
      Authored by Tejun Heo
      When the weight of an active iocg is updated, weight_updated() is called,
      which in turn calls __propagate_weights() to update the active and inuse
      weights so that the effective hierarchical weights are updated accordingly.
      
      The current implementation is incorrect for inner active nodes. For an
      active leaf iocg, inuse can be any value between 1 and active and the
      difference represents how much the iocg is donating. When weight is updated,
      as long as inuse is clamped between 1 and the new weight, we're alright and
      this is what __propagate_weights() currently implements.
      
      However, that's not how an active inner node's inuse is set. An inner node's
      inuse is solely determined by the ratio between the sums of inuse's and
      active's of its children, i.e. they're the results of propagating the leaves'
      active and inuse weights upwards. __propagate_weights() incorrectly applies
      the same clamping as for a leaf when an active inner node's weight is
      updated. Consider a hierarchy which looks like the following with saturating
      workloads in AA and BB.
      
           R
         /   \
        A     B
        |     |
       AA     BB
      
      1. For both A and B, active=100, inuse=100, hwa=0.5, hwi=0.5.
      
      2. echo 200 > A/io.weight
      
      3. __propagate_weights() updates A's active to 200 and leaves inuse at 100 as
         it's already between 1 and the new active, making A:active=200,
         A:inuse=100. As R's active_sum is updated along with A's active,
         A:hwa=2/3, B:hwa=1/3. However, because the inuses didn't change, the
         hwi's remain unchanged at 0.5.
      
      4. The weight of A is now twice that of B but AA and BB still have the same
         hwi of 0.5 and thus are doing the same amount of IOs.
      
      Fix it by making __propagate_weights() always calculate the inuse of an
      active inner iocg based on the ratio of child_inuse_sum to child_active_sum
      (see the sketch below).
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Dan Schatzberg <dschatzberg@fb.com>
      Fixes: 7caa4715 ("blkcg: implement blk-iocost")
      Cc: stable@vger.kernel.org # v5.4+
      Link: https://lore.kernel.org/r/YJsxnLZV1MnBcqjj@slm.duckdns.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e9f4eee9
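      The corrected rule can be pictured with a small standalone C sketch
      (hypothetical helper name, not the kernel implementation): an active
      inner iocg's inuse is derived from its children's inuse/active sums
      rather than clamped like a leaf.

      #include <stdio.h>

      /* Hypothetical helper illustrating the corrected inner-node rule:
       * inuse = active * child_inuse_sum / child_active_sum. */
      static unsigned int inner_inuse(unsigned int active,
                                      unsigned int child_inuse_sum,
                                      unsigned int child_active_sum)
      {
              if (!child_active_sum)
                      return active;
              return active * child_inuse_sum / child_active_sum;
      }

      int main(void)
      {
              /* Step 2 of the example above: echo 200 > A/io.weight.
               * AA is saturating, so AA:active == AA:inuse == 100. */
              printf("A:inuse = %u\n", inner_inuse(200, 100, 100));
              /* Prints 200: A's inuse follows its children instead of
               * staying clamped at 100, so the hwi's now reflect 2:1. */
              return 0;
      }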
  5. 11 May 2021, 1 commit
    • kyber: fix out of bounds access when preempted · efed9a33
      Authored by Omar Sandoval
      __blk_mq_sched_bio_merge() gets the ctx and hctx for the current CPU and
      passes the hctx to ->bio_merge(). kyber_bio_merge() then gets the ctx
      for the current CPU again and uses that to get the corresponding Kyber
      context in the passed hctx. However, the thread may be preempted between
      the two calls to blk_mq_get_ctx(), and the ctx returned the second time
      may no longer correspond to the passed hctx. This "works" accidentally
      most of the time, but it can cause us to read garbage if the second ctx
      came from an hctx with more ctx's than the first one (i.e., if
      ctx->index_hw[hctx->type] > hctx->nr_ctx).
      
      This manifested as this UBSAN array index out of bounds error reported
      by Jakub:
      
      UBSAN: array-index-out-of-bounds in ../kernel/locking/qspinlock.c:130:9
      index 13106 is out of range for type 'long unsigned int [128]'
      Call Trace:
       dump_stack+0xa4/0xe5
       ubsan_epilogue+0x5/0x40
       __ubsan_handle_out_of_bounds.cold.13+0x2a/0x34
       queued_spin_lock_slowpath+0x476/0x480
       do_raw_spin_lock+0x1c2/0x1d0
       kyber_bio_merge+0x112/0x180
       blk_mq_submit_bio+0x1f5/0x1100
       submit_bio_noacct+0x7b0/0x870
       submit_bio+0xc2/0x3a0
       btrfs_map_bio+0x4f0/0x9d0
       btrfs_submit_data_bio+0x24e/0x310
       submit_one_bio+0x7f/0xb0
       submit_extent_page+0xc4/0x440
       __extent_writepage_io+0x2b8/0x5e0
       __extent_writepage+0x28d/0x6e0
       extent_write_cache_pages+0x4d7/0x7a0
       extent_writepages+0xa2/0x110
       do_writepages+0x8f/0x180
       __writeback_single_inode+0x99/0x7f0
       writeback_sb_inodes+0x34e/0x790
       __writeback_inodes_wb+0x9e/0x120
       wb_writeback+0x4d2/0x660
       wb_workfn+0x64d/0xa10
       process_one_work+0x53a/0xa80
       worker_thread+0x69/0x5b0
       kthread+0x20b/0x240
       ret_from_fork+0x1f/0x30
      
      Only Kyber uses the hctx, so fix it by passing the request_queue to
      ->bio_merge() instead. BFQ and mq-deadline just use that, and Kyber can
      map the queues itself to avoid the mismatch (a sketch of the changed
      callback follows below).
      
      Fixes: a6088845 ("block: kyber: make kyber more friendly with merging")
      Reported-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Link: https://lore.kernel.org/r/c7598605401a48d5cfeadebb678abd10af22b83f.1620691329.git.osandov@fb.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      efed9a33
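      A simplified sketch of the interface change described above (stand-in
      types, not the exact kernel declarations): ->bio_merge() now receives
      the request_queue, and the one scheduler that needs the (ctx, hctx)
      pair, Kyber, derives both itself on the CPU it is actually running on.

      #include <stdbool.h>

      struct request_queue;                        /* opaque stand-ins */
      struct bio;

      /* Sketch of the callback after the change; previously it took the
       * struct blk_mq_hw_ctx * chosen earlier by the caller, which could
       * be stale after a preemption. */
      struct bio_merge_ops_sketch {
              bool (*bio_merge)(struct request_queue *q, struct bio *bio,
                                unsigned int nr_segs);
      };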
  6. 09 May 2021, 1 commit
  7. 07 May 2021, 1 commit
  8. 04 May 2021, 1 commit
    • bio: limit bio max size · cd2c7545
      Authored by Changheun Lee
      A bio can grow up to 4GB when multi-page bvecs are enabled, but
      sometimes this leads to inefficient behavior. In the case of large-chunk
      direct I/O - a 32MB chunk read in user space - all pages for the 32MB
      are merged into one bio structure if the pages' physical addresses are
      contiguous. This delays submission until the merge is complete, so the
      bio max size should be limited to a proper size.
      
      When a 32MB chunk read with the direct I/O option comes from userspace,
      the current kernel behavior in the do_direct_IO() loop looks like the
      timeline below.
      
       | bio merge for 32MB. total 8,192 pages are merged.
       | total elapsed time is over 2ms.
       |------------------ ... ----------------------->|
                                                       | 8,192 pages merged a bio.
                                                       | at this time, first bio submit is done.
                                                       | 1 bio is split to 32 read request and issue.
                                                       |--------------->
                                                        |--------------->
                                                         |--------------->
                                                                    ......
                                                                         |--------------->
                                                                          |--------------->|
                                total 19ms elapsed to complete 32MB read done from device. |
      
      If the bio max size is limited to 1MB, the behavior changes as shown below.
      
       | bio merge for 1MB. 256 pages are merged for each bio.
       | total 32 bio will be made.
       | total elapsed time is over 2ms. it's same.
       | but, first bio submit timing is fast. about 100us.
       |--->|--->|--->|---> ... -->|--->|--->|--->|--->|
            | 256 pages merged a bio.
            | at this time, first bio submit is done.
            | and 1 read request is issued for 1 bio.
            |--------------->
                 |--------------->
                      |--------------->
                                            ......
                                                       |--------------->
                                                        |--------------->|
              total 17ms elapsed to complete 32MB read done from device. |
      
      As a result, read requests are issued sooner when the bio max size is
      limited. With the current kernel behavior and multi-page bvecs, a
      super-large bio can be created, which delays the issue of the first I/O
      request (a sketch of the size cap is shown below).
      Signed-off-by: Changheun Lee <nanich.lee@samsung.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Link: https://lore.kernel.org/r/20210503095203.29076-1-nanich.lee@samsung.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      cd2c7545
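      A minimal sketch of the size cap idea (hypothetical names and constant,
      not the kernel's actual merge path): stop adding pages to the current
      bio once a configured maximum would be exceeded, so the first bio gets
      submitted early instead of after the full 32MB merge.

      #include <stdbool.h>
      #include <stdint.h>

      #define BIO_MAX_BYTES_SKETCH (1u << 20)      /* assumed 1MB cap */

      /* Hypothetical helper: may another page_len bytes be merged into a
       * bio that currently holds bio_bytes?  If not, the caller submits
       * the bio and starts a new one instead of merging further. */
      static bool bio_may_grow(uint32_t bio_bytes, uint32_t page_len)
      {
              return bio_bytes + page_len <= BIO_MAX_BYTES_SKETCH;
      }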
  9. 01 May 2021, 1 commit
    • cgroup: rstat: punt root-level optimization to individual controllers · dc26532a
      Authored by Johannes Weiner
      Current users of the rstat code can source root-level statistics from
      the native counters of their respective subsystem, allowing them to
      forego aggregation at the root level.  This optimization is currently
      implemented inside the generic rstat code, which doesn't track the root
      cgroup and doesn't invoke the subsystem flush callbacks on it.
      
      However, the memory controller cannot do this optimization, because
      cgroup1 breaks out memory specifically for the local level, including at
      the root level.  In preparation for the memory controller switching to
      rstat, move the optimization from rstat core to the controllers.
      
      Afterwards, rstat will always track the root cgroup for changes and
      invoke the subsystem callbacks on it; and it's up to the subsystem to
      special-case and skip aggregation of the root cgroup if it can source
      this information through other, cheaper means.
      
      This is the case for the io controller and the cgroup base stats.  Their
      respective flush callbacks now check whether the parent is the root
      cgroup and, if so, skip the unnecessary upward propagation (see the
      sketch below).
      
      The extra cost of tracking the root cgroup is negligible: on stat
      changes, we actually remove a branch that checks for the root.  The
      queueing for a flush touches only per-cpu data, and only the first stat
      change since a flush requires a (per-cpu) lock.
      
      Link: https://lkml.kernel.org/r/20210209163304.77088-6-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc26532a
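      A simplified sketch of the per-controller special case described above
      (stand-in types and field names, not the actual rstat or iostat code):
      the flush callback still runs for every cgroup, but aggregation into
      the parent is skipped when that parent is the root, whose figures come
      from cheaper native counters.

      struct cgroup_sketch {
              struct cgroup_sketch *parent;        /* NULL for the root cgroup */
              unsigned long stat;                  /* flushed value */
              unsigned long stat_pending;          /* per-flush delta */
      };

      static void controller_flush_one(struct cgroup_sketch *cgrp)
      {
              struct cgroup_sketch *parent = cgrp->parent;

              cgrp->stat += cgrp->stat_pending;

              /* Skip the upward propagation when the parent is the root:
               * root-level stats are sourced from native counters instead. */
              if (parent && parent->parent)
                      parent->stat_pending += cgrp->stat_pending;

              cgrp->stat_pending = 0;
      }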
  10. 26 April 2021, 1 commit
  11. 22 April 2021, 1 commit
  12. 17 April 2021, 1 commit
    • blk-mq: Fix spurious debugfs directory creation during initialization · 1e91e28e
      Authored by Saravanan D
      blk_mq_debugfs_register_sched_hctx(), called from the
      device_add_disk()->elevator_init_mq()->blk_mq_init_sched()
      initialization sequence, does not have the relevant parent directory
      set up yet and thus spuriously attempts to create a "sched" directory
      under the root of the debugfs mount for every hw queue detected on the
      block device.
      
      dmesg
      ...
      debugfs: Directory 'sched' with parent '/' already present!
      debugfs: Directory 'sched' with parent '/' already present!
      .
      .
      debugfs: Directory 'sched' with parent '/' already present!
      ...
      
      The parent debugfs directory for hw queues gets properly set up by
      device_add_disk()->blk_register_queue()->blk_mq_debugfs_register()
      ->blk_mq_debugfs_register_hctx() later in the block device
      initialization sequence.
      
      A simple check for debugfs_dir has been added to thwart premature
      debugfs directory/file creation attempts (see the sketch below).
      Signed-off-by: Saravanan D <saravanand@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1e91e28e
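      The added check can be pictured with a short sketch (simplified struct
      and hypothetical function name; the real code works on the
      request_queue's debugfs fields): bail out when the parent debugfs
      directory has not been created yet, so nothing is ever created under
      the debugfs root.

      struct mq_debugfs_sketch {
              void *debugfs_dir;                   /* parent dir, NULL until the
                                                      queue is registered */
              void *sched_debugfs_dir;
      };

      static void register_sched_hctx_sketch(struct mq_debugfs_sketch *q)
      {
              if (!q->debugfs_dir)
                      return;                      /* too early: no parent yet */

              /* ... create the "sched" directory under q->debugfs_dir ... */
      }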
  13. 16 April 2021, 2 commits
    • bfq/mq-deadline: remove redundant check for passthrough request · 7687b38a
      Authored by Lin Feng
      Since commit 01e99aec ("blk-mq: insert passthrough request into
      hctx->dispatch directly"), passthrough requests should not appear in an
      IO scheduler any more, so the blk_rq_is_passthrough check in the add-on
      IO schedulers is redundant.
      
      (Note: this patch passes a generic IO load test with HDDs under a SAS
      controller and HDDs under an AHCI controller, but obviously does not
      cover everything. It is not certain whether a passthrough request can
      still escape into an IO scheduler from blk_mq_sched_insert_requests,
      which is used by blk_mq_flush_plug_list and has lots of indirect
      callers.)
      Signed-off-by: Lin Feng <linf@wangsu.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7687b38a
    • blk-mq: bypass IO scheduler's limit_depth for passthrough request · 8d663f34
      Authored by Lin Feng
      Commit 01e99aec ("blk-mq: insert passthrough request into
      hctx->dispatch directly") gives high priority to passthrough requests
      and bypasses the underlying IO scheduler. But when we allocate a tag for
      such a request, it still runs the IO scheduler's limit_depth callback,
      while what we really want is to give such requests the full sbitmap
      depth for acquiring an available tag.
      blktrace shows PC requests (dmraid -s -c -i) hitting bfq's limit_depth:
        8,0    2        0     0.000000000 39952 1,0  m   N bfq [bfq_limit_depth] wr_busy 0 sync 0 depth 8
        8,0    2        1     0.000008134 39952  D   R 4 [dmraid]
        8,0    2        2     0.000021538    24  C   R [0]
        8,0    2        0     0.000035442 39952 1,0  m   N bfq [bfq_limit_depth] wr_busy 0 sync 0 depth 8
        8,0    2        3     0.000038813 39952  D   R 24 [dmraid]
        8,0    2        4     0.000044356    24  C   R [0]
      
      This patch introduces a new wrapper to make the code less ugly (see the
      sketch below).
      Signed-off-by: Lin Feng <linf@wangsu.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20210415033920.213963-1-linf@wangsu.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8d663f34
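      A sketch of the kind of wrapper described above (hypothetical names,
      simplified types, not the kernel's tag-allocation code): only consult
      the scheduler's limit_depth callback for requests that actually go
      through the scheduler, leaving passthrough requests the full sbitmap
      depth.

      #include <stdbool.h>

      struct alloc_data_sketch {
              bool passthrough;                    /* e.g. a SCSI PC command */
              unsigned int depth;                  /* tag depth to allocate from */
      };

      typedef void (*limit_depth_fn)(struct alloc_data_sketch *data);

      /* Hypothetical wrapper: the scheduler may shrink the depth only for
       * scheduled requests; passthrough requests keep the full depth. */
      static void limit_depth_if_scheduled(limit_depth_fn limit_depth,
                                           struct alloc_data_sketch *data)
      {
              if (limit_depth && !data->passthrough)
                      limit_depth(data);
      }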
  14. 14 April 2021, 1 commit
  15. 12 April 2021, 3 commits
  16. 09 April 2021, 11 commits