Commits · 50e34d78815e474d410f342fbe783b18192ca518 · openeuler / Kernel

17 Jun, 2022 6 commits

block: disable the elevator in del_gendisk · 50e34d78

Christoph Hellwig authored on Jun 14, 2022

The elevator is only used for file system requests, which are stopped in
del_gendisk. Move disabling the elevator and freeing the scheduler tags
to the end of del_gendisk instead of doing that work in disk_release and
blk_cleanup_queue to avoid a use after free on q->tag_set from
disk_release as the tag_set might not be alive at that point.

Move the blk_qos_exit call as well, as it just depends on the elevator
exit and would be the only reason to keep the not exactly cheap queue
freeze in disk_release.

Fixes: e155b0c2 ("blk-mq: Use shared tags for shared sbitmap support")
Reported-by: syzbot+3e3f419f4a7816471838@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: syzbot+3e3f419f4a7816471838@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/20220614074827.458955-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

50e34d78

block/bfq: Enable I/O statistics · b96f3cab

Bart Van Assche authored on Jun 13, 2022

BFQ uses io_start_time_ns. That member variable is only set if I/O
statistics are enabled. Hence this patch that enables I/O statistics
at the time BFQ is associated with a request queue.

Compile-tested only.

Reported-by: Cixi Geng <cixi.geng1@unisoc.com>
Cc: Cixi Geng <cixi.geng1@unisoc.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Cc: Paolo Valente <paolo.valente@unimore.it>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b96f3cab

blk-mq: don't clear flush_rq from tags->rqs[] · 6cfeadbf

Ming Lei authored on Jun 16, 2022

Commit 364b6181 ("blk-mq: clearing flush request reference in
tags->rqs[]") was added to clear the to-be-freed flush request from
tags->rqs[] to avoid a use-after-free on the flush rq.

Yu Kuai reported that blk_mq_clear_flush_rq_mapping() slows down boot time
by ~8s, because the SCSI probe may create and remove lots of non-present
LUNs on megaraid-sas, which uses BLK_MQ_F_TAG_HCTX_SHARED, and each request
queue has lots of hw queues.

Improve the situation by not running blk_mq_clear_flush_rq_mapping() if
the disk hasn't been added, since no flush request can have been issued
in that case.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reported-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220616014401.817001-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6cfeadbf

blk-mq: avoid to touch q->elevator without any protection · 4d337ceb

Ming Lei authored on Jun 16, 2022

q->elevator is referenced in blk_mq_has_sqsched() without any protection:
no .q_usage_counter is held, and neither the queue srcu nor the rcu read
lock is held, so a potential use-after-free may be triggered.

Fix the issue by adding a queue flag for checking whether the elevator
uses single queue style dispatch. Meanwhile, the elevator feature flag
ELEVATOR_F_MQ_AWARE isn't needed any more.

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220616014401.817001-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4d337ceb

blk-mq: protect q->elevator by ->sysfs_lock in blk_mq_elv_switch_none · 5fd7a84a

Ming Lei authored on Jun 16, 2022

The elevator can be torn down by the sysfs switch interface or by disk
release, so hold ->sysfs_lock before referring to q->elevator; a potential
use-after-free can then be avoided.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220616014401.817001-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5fd7a84a

block: Fix handling of offline queues in blk_mq_alloc_request_hctx() · 14dc7a18

Bart Van Assche authored on Jun 15, 2022

This patch prevents test nvme/004 from triggering the following:

UBSAN: array-index-out-of-bounds in block/blk-mq.h:135:9
index 512 is out of range for type 'long unsigned int [512]'
Call Trace:
 show_stack+0x52/0x58
 dump_stack_lvl+0x49/0x5e
 dump_stack+0x10/0x12
 ubsan_epilogue+0x9/0x3b
 __ubsan_handle_out_of_bounds.cold+0x44/0x49
 blk_mq_alloc_request_hctx+0x304/0x310
 __nvme_submit_sync_cmd+0x70/0x200 [nvme_core]
 nvmf_connect_io_queue+0x23e/0x2a0 [nvme_fabrics]
 nvme_loop_connect_io_queues+0x8d/0xb0 [nvme_loop]
 nvme_loop_create_ctrl+0x58e/0x7d0 [nvme_loop]
 nvmf_create_ctrl+0x1d7/0x4d0 [nvme_fabrics]
 nvmf_dev_write+0xae/0x111 [nvme_fabrics]
 vfs_write+0x144/0x560
 ksys_write+0xb7/0x140
 __x64_sys_write+0x42/0x50
 do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Fixes: 20e4d813 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220615210004.1031820-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

14dc7a18
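
The message for 14dc7a18 shows only the symptom: index 512 used on a 512-entry bitmap. The following is a minimal Python sketch of one way such an index can arise. It rests on an assumption the message does not state: that a "first bit set in both masks" search returns the mask size as a "not found" value when a hardware queue has no online CPU. `first_and`, `pick_cpu` and `NR_CPUS` are hypothetical names for illustration, not kernel APIs.

```python
NR_CPUS = 512  # size of the bitmap in the UBSAN report


def first_and(mask_a, mask_b, nbits=NR_CPUS):
    """Index of the first bit set in both masks, or nbits if there is none."""
    both = mask_a & mask_b
    if both == 0:
        return nbits  # sentinel: one past the last valid index
    return (both & -both).bit_length() - 1


def pick_cpu(hctx_mask, online_mask):
    """Return a usable CPU index, or None when the queue has no online CPU."""
    cpu = first_and(hctx_mask, online_mask)
    if cpu >= NR_CPUS:  # guard: never use the sentinel as an index
        return None
    return cpu
```

Without the guard in `pick_cpu`, the sentinel would be used as an array index, which is the kind of off-by-one-past-the-end access the report describes.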

09 Jun, 2022 1 commit

block: remove bioset_init_from_src · d5a37b19

Christoph Hellwig authored on Jun 08, 2022

Unused now, and the interface never really made a whole lot of sense to
start with.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

d5a37b19

03 Jun, 2022 1 commit

block: Fix potential deadlock in blk_ia_range_sysfs_show() · 41e46b3c

Damien Le Moal authored on Jun 03, 2022

When being read, a sysfs attribute is already protected against removal
with the kobject node active reference counter. As a result, in
blk_ia_range_sysfs_show(), there is no need to take the queue sysfs
lock when reading the value of a range attribute. Using the queue sysfs
lock in this function creates a potential deadlock situation with the
disk removal, which lockdep signals with a splat when the
device is removed:

[  760.703551]  Possible unsafe locking scenario:
[  760.703551]
[  760.703554]        CPU0                    CPU1
[  760.703556]        ----                    ----
[  760.703558]   lock(&q->sysfs_lock);
[  760.703565]                                lock(kn->active#385);
[  760.703573]                                lock(&q->sysfs_lock);
[  760.703579]   lock(kn->active#385);
[  760.703587]
[  760.703587]  *** DEADLOCK ***

Solve this by removing the mutex_lock()/mutex_unlock() calls from
blk_ia_range_sysfs_show().

Fixes: a2247f19 ("block: Add independent access ranges support")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20220603021905.1441419-1-damien.lemoal@opensource.wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

41e46b3c
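
The lockdep scenario in 41e46b3c is a classic lock-order inversion. As a rough illustration only (this is not how lockdep is implemented), the Python sketch below records "taken while holding" edges between locks and reports whether they form a cycle. `has_lock_cycle` and the two example sequences are hypothetical; the lock names are copied from the splat.

```python
from collections import defaultdict


def has_lock_cycle(sequences):
    """Return True if the given acquisition orders can deadlock.

    Each sequence lists the locks one context takes, in order. An edge
    A -> B means "B was taken while A was held".
    """
    graph = defaultdict(set)
    for seq in sequences:
        for i, held in enumerate(seq):
            for later in seq[i + 1:]:
                graph[held].add(later)

    def reachable(src, dst, seen):
        if src == dst:
            return True
        seen.add(src)
        return any(reachable(n, dst, seen) for n in graph[src] if n not in seen)

    # A cycle exists if some edge A -> B can be followed back from B to A.
    return any(reachable(b, a, set()) for a in list(graph) for b in graph[a])


# Before the fix: one side holds sysfs_lock and waits on kn->active,
# the show() side holds kn->active and takes sysfs_lock.
before = [["q->sysfs_lock", "kn->active"], ["kn->active", "q->sysfs_lock"]]
# After the fix: show() no longer takes sysfs_lock.
after = [["q->sysfs_lock", "kn->active"], ["kn->active"]]
```

With the `before` orders the function reports a cycle; with the `after` orders it does not, which matches the effect of dropping the mutex calls from the show path.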

02 Jun, 2022 2 commits

block: fix bio_clone_blkg_association() to associate with proper blkcg_gq · 22b106e5

Jan Kara authored on Jun 02, 2022

Commit d92c370a ("block: really clone the block cgroup in
bio_clone_blkg_association") changed bio_clone_blkg_association() to
just clone bio->bi_blkg reference from source to destination bio. This
is however wrong if the source and destination bios are against
different block devices because struct blkcg_gq is different for each
bdev-blkcg pair. This will result in IOs being accounted (and throttled
as a result) multiple times against the same device (src bdev) while
throttling of the other device (dst bdev) is ignored. In case of BFQ the
inconsistency can even result in crashes in bfq_bic_update_cgroup().
Fix the problem by looking up the correct blkcg_gq for the cloned bio.

Reported-by: Logan Gunthorpe <logang@deltatee.com>
Reported-and-tested-by: Donald Buczek <buczek@molgen.mpg.de>
Fixes: d92c370a ("block: really clone the block cgroup in bio_clone_blkg_association")
CC: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220602081242.7731-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>

22b106e5

block: remove useless BUG_ON() in blk_mq_put_tag() · ff47dbd1

Damien Le Moal authored on Jun 02, 2022

Since the if condition in blk_mq_put_tag() checks that the tag to put is
not a reserved one, the BUG_ON() check in the else branch checking if
the tag is indeed a reserved one is useless. Remove it.

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20220602075159.1273366-1-damien.lemoal@opensource.wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ff47dbd1

30 May, 2022 1 commit

blk-mq: do not update io_ticks with passthrough requests · b81c14ca

Haisu Wang authored on May 30, 2022

Flush or passthrough requests are not accounted as normal IO on completion.
To reflect slow IO in iostat, io_ticks is updated based on the inflight
numbers when the stat is shown.
This may cause an inconsistent io_ticks calculation result.

So do not account passthrough requests when checking inflight.

Fixes: 86d73312 ("block: update io_ticks when io hang")
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Reviewed-by: samuelliao <samuelliao@tencent.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220530064059.1120058-1-haisuwang@tencent.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b81c14ca

29 May, 2022 1 commit

block: make bioset_exit() fully resilient against being called twice · 605f7415

Jens Axboe authored on May 29, 2022

Most of bioset_exit() is fine being called twice, as it clears the
various allocations etc when they are freed. The exception is
bio_alloc_cache_destroy(), which does not clear ->cache when it has
freed it.

This isn't necessarily a bug, but can be if buggy users call the
exit path more than once, or on a bioset that was only memset() and has
never been initialized. dm appears to be one such user.

Fixes: be4d234d ("bio: add allocation cache abstraction")
Link: https://lore.kernel.org/linux-block/YpK7m+14A+pZKs5k@casper.infradead.org/
Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

605f7415
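
The pattern described in 605f7415 (free, then clear the reference so that a repeated exit is harmless) can be sketched in Python as follows. This is a user-space illustration under stated assumptions, not the kernel code: `Cache` and `BioSet` are hypothetical stand-ins, and `None` plays the role of a zero-filled pointer.

```python
class Cache:
    """Stand-in for a resource that must be freed exactly once."""

    def __init__(self):
        self.freed = False

    def free(self):
        if self.freed:
            raise RuntimeError("double free")
        self.freed = True


class BioSet:
    """Hypothetical container; only the teardown pattern matters here."""

    def __init__(self):
        self.cache = None  # like a zero-filled structure: nothing allocated

    def init(self):
        self.cache = Cache()

    def exit(self):
        # Free only if something is there, then clear the reference so a
        # second exit(), or exit() on a never-initialized set, is a no-op.
        if self.cache is not None:
            self.cache.free()
            self.cache = None
```

Calling `exit()` before `init()`, or twice after it, never reaches `free()` a second time because the reference is cleared on the first pass.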

28 May, 2022 5 commits

blk-mq: remove the done argument to blk_execute_rq_nowait · e2e53086

Christoph Hellwig authored on May 24, 2022

Let the caller set it together with the end_io_data instead of passing
a pointless argument. Note that the target code did in fact already
set it and then just overrode it again by calling blk_execute_rq_nowait.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220524121530.943123-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e2e53086

blk-mq: avoid a mess of casts for blk_end_sync_rq · 32ac5a9b

Christoph Hellwig authored on May 24, 2022

Instead of trying to cast a __bitwise 32-bit integer to a larger integer
and then a pointer, just allocate a struct with the blk_status_t and the
completion on stack and set the end_io_data to that. Use the
opportunity to move the code to where it belongs and drop rather
confusing comments.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220524121530.943123-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

32ac5a9b
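
The idea in 32ac5a9b, passing one small structure that carries both the completion and the status as the callback data, can be illustrated with a Python analogue. This is only a sketch of the pattern: `WaitContext`, `end_io`, `execute_and_wait` and `fake_submit` are hypothetical names, and `threading.Event` stands in for a kernel completion.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class WaitContext:
    """Bundles the completion and the result in one caller-owned object."""
    done: threading.Event = field(default_factory=threading.Event)
    status: int = 0


def end_io(end_io_data, status):
    """Completion callback: record the status, then wake the waiter."""
    end_io_data.status = status
    end_io_data.done.set()


def execute_and_wait(submit):
    """Submit a request with a caller-owned context and wait for it."""
    ctx = WaitContext()   # lives in the caller's frame while it waits
    submit(end_io, ctx)   # the driver later calls end_io(ctx, status)
    ctx.done.wait()
    return ctx.status


def fake_submit(callback, data):
    # Hypothetical driver that completes asynchronously with status 5.
    threading.Thread(target=callback, args=(data, 5)).start()
```

Because the status travels in a named field, nothing has to be squeezed through a pointer-sized value and cast back.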

blk-mq: remove __blk_execute_rq_nowait · ae948fd6

Christoph Hellwig authored on May 24, 2022

We don't want to plug for synchronous execution where we immediately
wait for the request. Once that is done not a whole lot of code is
shared, so just remove __blk_execute_rq_nowait.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220524121530.943123-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ae948fd6

block: use bio_queue_enter instead of blk_queue_enter in bio_poll · ebd076bf

Christoph Hellwig authored on May 23, 2022

We want to have a valid live gendisk to call ->poll and not just a
request_queue, so call the right helper.

Fixes: 3e08773c ("block: switch polling to be bio based")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220523124302.526186-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ebd076bf

block: take destination bvec offsets into account in bio_copy_data_iter · 403d5034

Christoph Hellwig authored on May 24, 2022

Apparently bcache can copy into bios that do not just contain fresh
pages but can have offsets into the bio_vecs. Restore support for that
in bio_copy_data_iter.

Fixes: f8b679a0 ("block: rewrite bio_copy_data_iter to use bvec_kmap_local and memcpy_to_bvec")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220524143919.1155501-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

403d5034
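
To make the point of 403d5034 concrete, here is a small Python model of copying between two segment lists where both sides may start at a non-zero offset inside their buffer. It is a sketch, not the kernel routine: `copy_segments` and the `(buffer, offset, length)` tuple layout are assumptions chosen for illustration.

```python
def copy_segments(dst_segs, src_segs):
    """Copy bytes from source segments to destination segments.

    Each segment is (buffer, offset, length). Both the source and the
    destination offset are honoured; copying stops when either side runs
    out. Returns the number of bytes copied.
    """
    copied = 0
    di = si = 0        # current segment index on each side
    dpos = spos = 0    # bytes already consumed within the current segment
    while di < len(dst_segs) and si < len(src_segs):
        dbuf, doff, dlen = dst_segs[di]
        sbuf, soff, slen = src_segs[si]
        n = min(dlen - dpos, slen - spos)
        dbuf[doff + dpos:doff + dpos + n] = sbuf[soff + spos:soff + spos + n]
        copied += n
        dpos += n
        spos += n
        if dpos == dlen:
            di, dpos = di + 1, 0
        if spos == slen:
            si, spos = si + 1, 0
    return copied
```

A copy routine that ignored `doff` would always write at the start of the destination buffer, which is the kind of mistake the commit describes for destinations that do not consist of fresh pages.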

27 May, 2022 2 commits

block, loop: support partitions without scanning · b9684a71

Christoph Hellwig authored on May 27, 2022

Historically we did distinguish between a flag that suppressed partition
scanning, and a combination of the minors variable and another flag if
any partitions were supported. This was generally confusing and doesn't
make much sense, but some corner case uses of the loop driver actually
do want to support manually added partitions on a device that does not
actively scan for partitions. To make things worse the loop driver
also wants to dynamically toggle the scanning for partitions on a live
gendisk, which makes the disk->flags updates non-atomic.

Introduce a new GD_SUPPRESS_PART_SCAN bit in disk->state that disables
just scanning for partitions, and toggle that instead of GENHD_FL_NO_PART
in the loop driver.

Fixes: 1ebe2e5f ("block: remove GENHD_FL_EXT_DEVT")
Reported-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220527055806.1972352-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b9684a71

blk-iolatency: Fix inflight count imbalances and IO hangs on offline · 8a177a36

Tejun Heo authored on May 13, 2022

iolatency needs to track the number of inflight IOs per cgroup. As this
tracking can be expensive, it is disabled when no cgroup has iolatency
configured for the device. To ensure that the inflight counters stay
balanced, iolatency_set_limit() freezes the request_queue while manipulating
the enabled counter, which ensures that no IO is in flight and thus all
counters are zero.

Unfortunately, iolatency_set_limit() isn't the only place where the enabled
counter is manipulated. iolatency_pd_offline() can also dec the counter and
trigger disabling. As this disabling happens without freezing the q, this
can easily happen while some IOs are in flight and thus leak the counts.

This can be easily demonstrated by turning on iolatency on one empty
cgroup while IOs are in flight in other cgroups and then removing the
cgroup. Note that iolatency shouldn't have been enabled elsewhere in the
system to ensure that removing the cgroup disables iolatency for the whole
device.

The following keeps flipping on and off iolatency on sda:

  echo +io > /sys/fs/cgroup/cgroup.subtree_control
  while true; do
      mkdir -p /sys/fs/cgroup/test
      echo '8:0 target=100000' > /sys/fs/cgroup/test/io.latency
      sleep 1
      rmdir /sys/fs/cgroup/test
      sleep 1
  done

and there's concurrent fio generating direct rand reads:

  fio --name test --filename=/dev/sda --direct=1 --rw=randread \
      --runtime=600 --time_based --iodepth=256 --numjobs=4 --bs=4k

while monitoring with the following drgn script:

  import time

  while True:
    for css in css_for_each_descendant_pre(prog['blkcg_root'].css.address_of_()):
        for pos in hlist_for_each(container_of(css, 'struct blkcg', 'css').blkg_list):
            blkg = container_of(pos, 'struct blkcg_gq', 'blkcg_node')
            pd = blkg.pd[prog['blkcg_policy_iolatency'].plid]
            if pd.value_() == 0:
                continue
            iolat = container_of(pd, 'struct iolatency_grp', 'pd')
            inflight = iolat.rq_wait.inflight.counter.value_()
            if inflight:
                print(f'inflight={inflight} {disk_name(blkg.q.disk).decode("utf-8")} '
                      f'{cgroup_path(css.cgroup).decode("utf-8")}')
    time.sleep(1)

The monitoring output looks like the following:

  inflight=1 sda /user.slice
  inflight=1 sda /user.slice
  ...
  inflight=14 sda /user.slice
  inflight=13 sda /user.slice
  inflight=17 sda /user.slice
  inflight=15 sda /user.slice
  inflight=18 sda /user.slice
  inflight=17 sda /user.slice
  inflight=20 sda /user.slice
  inflight=19 sda /user.slice <- fio stopped, inflight stuck at 19
  inflight=19 sda /user.slice
  inflight=19 sda /user.slice

If a cgroup with stuck inflight ends up getting throttled, the throttled IOs
will never get issued as there's no completion event to wake it up leading
to an indefinite hang.

This patch fixes the bug by unifying enable handling into a work item which
is automatically kicked off from iolatency_set_min_lat_nsec() which is
called from both iolatency_set_limit() and iolatency_pd_offline() paths.
Punting to a work item is necessary as iolatency_pd_offline() is called
under spinlocks while freezing a request_queue requires a sleepable context.

This also simplifies the code reducing LOC sans the comments and avoids the
unnecessary freezes which were happening whenever a cgroup's latency target
is newly set or cleared.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Liu Bo <bo.liu@linux.alibaba.com>
Fixes: 8c772a9b ("blk-iolatency: fix IO hang due to negative inflight counter")
Cc: stable@vger.kernel.org # v5.0+
Link: https://lore.kernel.org/r/Yn9ScX6Nx2qIiQQi@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8a177a36

23 May, 2022 2 commits

blk-mq: don't touch ->tagset in blk_mq_get_sq_hctx · 5d05426e

Ming Lei authored on May 22, 2022

blk_mq_run_hw_queues() could be run when there is no queued request and
after the queue has been cleaned up. At that time the tagset may already be
freed, because the tagset lifetime is managed by the driver, and it is often
freed after blk_cleanup_queue() returns.

So figure out the current default hctx from the mapping built in the
request queue instead of touching ->tagset, which avoids the use-after-free
on the tagset. This should also be faster than retrieving the mapping from
the tagset.

Cc: "yukuai (C)" <yukuai3@huawei.com>
Cc: Jan Kara <jack@suse.cz>
Fixes: b6e68ee8 ("blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220522122350.743103-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5d05426e

block: add sync_blockdev_range() · 97d6fb1b

Yuezhang Mo authored on Apr 12, 2022

sync_blockdev_range() supports syncing multiple sectors
with as few block device requests as possible, which helps
the block device reach its full performance.

Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>

97d6fb1b

21 May, 2022 1 commit

blk-mq: fix typo in comment · 2aaf5160

Julia Lawall authored on May 21, 2022

Spelling mistake (triple letters) in comment.
Detected with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Link: https://lore.kernel.org/r/20220521111145.81697-29-Julia.Lawall@inria.fr
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2aaf5160

19 May, 2022 5 commits

bfq: Remove bfq_requeue_request_body() · a249ca7d

Jan Kara authored on May 19, 2022

The function has only a single caller and two lines. Just remove it
since it is pointless and just harming readability.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220519105235.31397-4-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a249ca7d

bfq: Remove superfluous conversion from RQ_BIC() · e79cf889

Jan Kara authored on May 19, 2022

We store struct bfq_io_cq pointer in rq->elv.priv[0] in bfq_init_rq().
Thus a call to icq_to_bic() in RQ_BIC() is wrong. Luckily it does no
harm currently because struct io_cq is the first member of struct
bfq_io_cq.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220519105235.31397-3-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e79cf889
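
Why the extra conversion in e79cf889 "does no harm" can be shown with a small `ctypes` model: converting from an embedded member back to its container subtracts the member's offset, and for the first member that offset is zero. The structure layouts and field names below are made up for illustration; only the "generic part comes first" property is taken from the commit message.

```python
import ctypes


class IoCq(ctypes.Structure):
    """Stand-in for the generic structure (fields are hypothetical)."""
    _fields_ = [("flags", ctypes.c_uint)]


class BfqIoCq(ctypes.Structure):
    """Stand-in for the scheduler wrapper; the generic part comes first."""
    _fields_ = [("icq", IoCq), ("saved_state", ctypes.c_int)]


def icq_to_bic_addr(icq_addr):
    """container_of-style conversion: subtract the member's offset."""
    return icq_addr - BfqIoCq.icq.offset


bic = BfqIoCq()
bic_addr = ctypes.addressof(bic)
```

Applying `icq_to_bic_addr` to an address that already points at the wrapper returns the same address here, but only because the offset is 0. If the member were ever moved, the redundant conversion would start producing a wrong pointer.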

bfq: Allow current waker to defend against a tentative one · c5ac56bb

Jan Kara authored on May 19, 2022

The code in bfq_check_waker() ignores wake up events from the current
waker. This makes it more likely we select a new tentative waker
although the current one is generating more wake up events. Treat
current waker the same way as any other process and allow it to reset
the waker detection logic.

Fixes: 71217df3 ("block, bfq: make waker-queue detection more robust")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220519105235.31397-2-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c5ac56bb

bfq: Relax waker detection for shared queues · f9506673

Jan Kara authored on May 19, 2022

Currently we look for waker only if current queue has no requests. This
makes sense for bfq queues with a single process. However, for shared
queues with a larger number of processes, the condition that the
queue has no requests is difficult to meet, because often at least one
process has some request in flight although all the others are waiting
for the waker to do the work, and this harms throughput. Relax the "no
queued request for bfq queue" condition to "the current task has no
queued requests yet". For this, we also need to start tracking number of
requests in flight for each task.

This patch (together with the following one) restores the performance
for dbench with 128 clients that regressed with commit c65e6fd4
("bfq: Do not let waker requests skip proper accounting") because
this commit makes requests of wakers properly enter BFQ queues and thus
these queues become ineligible for the old waker detection logic.
Dbench results:

        Vanilla 5.18-rc3      5.18-rc3 + revert     5.18-rc3 patched
Mean    1237.36 (  0.00%)     950.16 * 23.21%*      988.35 * 20.12%*

Numbers are time to complete workload so lower is better.

Fixes: c65e6fd4 ("bfq: Do not let waker requests skip proper accounting")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220519105235.31397-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f9506673
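
The percentages in the dbench results of f9506673 can be reproduced from the mean times. The short Python check below assumes they are the relative reduction in completion time against the vanilla kernel, which matches the quoted figures; `improvement_pct` is a hypothetical helper name.

```python
def improvement_pct(baseline, value):
    """Relative reduction in completion time versus the baseline, in percent."""
    return (baseline - value) / baseline * 100


# Mean completion times quoted in the commit message.
vanilla, reverted, patched = 1237.36, 950.16, 988.35

reverted_gain = improvement_pct(vanilla, reverted)  # about 23.21
patched_gain = improvement_pct(vanilla, patched)    # about 20.12
```

Read this way, the patched kernel recovers most, but not all, of the time lost relative to the revert.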

blk-cgroup: delete rcu_read_lock_held() WARN_ON_ONCE() · 1305e2c9

Jens Axboe authored on May 18, 2022

A previous commit got rid of unnecessary rcu_read_lock() inside the
IRQ disabling queue_lock, but this debug statement was left. It's now
firing since we are indeed not inside a RCU read lock, but we don't
need to be as we're still preempt safe.

Get rid of the check, as we have a lockdep assert for holding the
queue lock right after it anyway.

Link: https://lore.kernel.org/linux-block/46253c48-81cb-0787-20ad-9133afdd9e21@samsung.com/
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Fixes: 77c570a1 ("blk-cgroup: Remove unnecessary rcu_read_lock/unlock()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1305e2c9

18 May, 2022 1 commit

blk-throttle: Set BIO_THROTTLED when bio has been throttled · 5a011f88

Laibin Qiu authored on Mar 01, 2022

1. In the current process, every bio gets the BIO_THROTTLED flag set
after __blk_throtl_bio().

2. If a bio needs to be throttled, a timer is started and the bio is
not submitted directly. The bio is submitted from
blk_throtl_dispatch_work_fn() when the timer expires. But in
the current process, if a bio is throttled, BIO_THROTTLED
is set on the bio after the timer has been started. If the bio has
already completed by then, this may cause the use-after-free below.

BUG: KASAN: use-after-free in blk_throtl_bio+0x12f0/0x2c70
Read of size 2 at addr ffff88801b8902d4 by task fio/26380

 dump_stack+0x9b/0xce
 print_address_description.constprop.6+0x3e/0x60
 kasan_report.cold.9+0x22/0x3a
 blk_throtl_bio+0x12f0/0x2c70
 submit_bio_checks+0x701/0x1550
 submit_bio_noacct+0x83/0xc80
 submit_bio+0xa7/0x330
 mpage_readahead+0x380/0x500
 read_pages+0x1c1/0xbf0
 page_cache_ra_unbounded+0x471/0x6f0
 do_page_cache_ra+0xda/0x110
 ondemand_readahead+0x442/0xae0
 page_cache_async_ra+0x210/0x300
 generic_file_buffered_read+0x4d9/0x2130
 generic_file_read_iter+0x315/0x490
 blkdev_read_iter+0x113/0x1b0
 aio_read+0x2ad/0x450
 io_submit_one+0xc8e/0x1d60
 __se_sys_io_submit+0x125/0x350
 do_syscall_64+0x2d/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Allocated by task 26380:
 kasan_save_stack+0x19/0x40
 __kasan_kmalloc.constprop.2+0xc1/0xd0
 kmem_cache_alloc+0x146/0x440
 mempool_alloc+0x125/0x2f0
 bio_alloc_bioset+0x353/0x590
 mpage_alloc+0x3b/0x240
 do_mpage_readpage+0xddf/0x1ef0
 mpage_readahead+0x264/0x500
 read_pages+0x1c1/0xbf0
 page_cache_ra_unbounded+0x471/0x6f0
 do_page_cache_ra+0xda/0x110
 ondemand_readahead+0x442/0xae0
 page_cache_async_ra+0x210/0x300
 generic_file_buffered_read+0x4d9/0x2130
 generic_file_read_iter+0x315/0x490
 blkdev_read_iter+0x113/0x1b0
 aio_read+0x2ad/0x450
 io_submit_one+0xc8e/0x1d60
 __se_sys_io_submit+0x125/0x350
 do_syscall_64+0x2d/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Freed by task 0:
 kasan_save_stack+0x19/0x40
 kasan_set_track+0x1c/0x30
 kasan_set_free_info+0x1b/0x30
 __kasan_slab_free+0x111/0x160
 kmem_cache_free+0x94/0x460
 mempool_free+0xd6/0x320
 bio_free+0xe0/0x130
 bio_put+0xab/0xe0
 bio_endio+0x3a6/0x5d0
 blk_update_request+0x590/0x1370
 scsi_end_request+0x7d/0x400
 scsi_io_completion+0x1aa/0xe50
 scsi_softirq_done+0x11b/0x240
 blk_mq_complete_request+0xd4/0x120
 scsi_mq_done+0xf0/0x200
 virtscsi_vq_done+0xbc/0x150
 vring_interrupt+0x179/0x390
 __handle_irq_event_percpu+0xf7/0x490
 handle_irq_event_percpu+0x7b/0x160
 handle_irq_event+0xcc/0x170
 handle_edge_irq+0x215/0xb20
 common_interrupt+0x60/0x120
 asm_common_interrupt+0x1e/0x40

Fix this by moving the setting of BIO_THROTTLED under the queue_lock.

Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220301123919.2381579-1-qiulaibin@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5a011f88

17 May, 2022 5 commits

blk-cgroup: Remove unnecessary rcu_read_lock/unlock() · 77c570a1

Fanjun Kong authored on May 17, 2022

spin_lock_irq/spin_unlock_irq contains preempt_disable/enable(),
which can serve as an RCU read-side critical region, so remove
rcu_read_lock/unlock().

Signed-off-by: Fanjun Kong <bh1scw@gmail.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220516173930.159535-1-bh1scw@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

77c570a1

blk-cgroup: always terminate io.stat lines · 3607849d

Wolfgang Bumiller authored on Jan 11, 2022

With the removal of seq_get_buf in blkcg_print_one_stat, we
cannot make adding the newline conditional on there being
relevant stats because the name was already written out
unconditionally.
Otherwise we may end up with multiple device names in one
line which is confusing and doesn't follow the nested-keyed
file format.

Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Fixes: 252c651a ("blk-cgroup: stop using seq_get_buf")
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220111083159.42340-1-w.bumiller@proxmox.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3607849d
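
The formatting problem described in 3607849d can be reproduced with a tiny Python model. The function below is a sketch, not the kernel printer: `format_io_stat`, its `always_terminate` switch and the sample stats are hypothetical. It writes the device name unconditionally, as the commit message says the kernel does, and either terminates every line or only lines that carry stats.

```python
def format_io_stat(devices, always_terminate=True):
    """Render one line per device: the name, then any "key=value" pairs.

    `devices` maps a device name to a dict of stats (possibly empty).
    With always_terminate=False the newline is only emitted when there
    are stats, modelling the behaviour the commit describes as broken.
    """
    out = []
    for name, stats in devices.items():
        out.append(name)  # the name is written unconditionally
        for key, value in stats.items():
            out.append(f" {key}={value}")
        if stats or always_terminate:
            out.append("\n")
    return "".join(out)


devices = {"8:0": {}, "8:16": {"rbytes": 4096, "wbytes": 0}}
broken = format_io_stat(devices, always_terminate=False)
fixed = format_io_stat(devices)
```

In the `broken` output the two device names run together on one line; in the `fixed` output each device gets its own line, which keeps the nested-keyed format parseable.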

block, bfq: make bfq_has_work() more accurate · ddc25c86

Yu Kuai authored on May 13, 2022

bfq_has_work() is using busy_queues currently, which is not accurate
because a busy bfq_queue does not necessarily have requests. Since
bfqd already has a counter 'queued' to record how many requests are in
bfq, use it instead of busy_queues.

Note that bfq_has_work() can be called with 'bfqd->lock' held, thus the
lock can't be held in bfq_has_work() to protect 'bfqd->queued'.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220513023507.2625717-3-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ddc25c86

block, bfq: protect 'bfqd->queued' by 'bfqd->lock' · 181490d5

Yu Kuai authored on May 13, 2022

If bfq_schedule_dispatch() is called from bfq_idle_slice_timer_body(),
then 'bfqd->queued' is read without holding 'bfqd->lock'. This is
wrong since it can be written concurrently.

Fix the problem by holding 'bfqd->lock' in such case.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220513023507.2625717-2-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

181490d5

block: cleanup the VM accounting in submit_bio · a3e7689b

Christoph Hellwig authored on May 16, 2022

submit_bio uses some extremely convoluted checks and confusing comments
to only account REQ_OP_READ/REQ_OP_WRITE commands. Just switch to the
plain obvious checks instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220516063654.2782792-1-hch@lst.de
[axboe: fixup WRITE -> REQ_OP_WRITE]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a3e7689b

14 May, 2022 1 commit

block/mq-deadline: Set the fifo_time member also if inserting at head · 725f22a1

Bart Van Assche authored on May 13, 2022

Before commit 322cff70 the fifo_time member of requests on a dispatch
list was not used. Commit 322cff70 introduces code that reads the
fifo_time member of requests on dispatch lists. Hence this patch that sets
the fifo_time member when adding a request to a dispatch list.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Fixes: 322cff70 ("block/mq-deadline: Prioritize high-priority requests")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220513171307.32564-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

725f22a1

12 May, 2022 2 commits

blk-mq: fix passthrough plugging · a327c341

Ming Lei authored on May 12, 2022

First, we can't add a request to the plug list in
blk_mq_request_bypass_insert(), which may be called when flushing the plug
list, because that causes a nested plug.

Second, if a polled passthrough request is inserted via blk_execute_rq(),
it can't be added to the plug list either, since io polling needs the
request to be issued to the driver.

Fix both by moving plugging into blk_execute_rq_nowait().

Cc: Christoph Hellwig <hch@lst.de>
Fixes: 1c2d2fff ("block: wire-up support for passthrough plugging")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220512140010.1458645-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a327c341

blk-iocost: combine local_stat and desc_stat to stat · 2a371f7d

Chengming Zhou authored on May 10, 2022

When we flush usage, wait, indebt stat in iocg_flush_stat(), we use
local_stat and desc_stat, which has no point since the leaf iocg
only has local_stat and the inner iocg only has desc_stat. Also
we don't need to flush percpu abs_vusage for these inner iocgs.

This patch combines local_stat and desc_stat into stat, only flushes
percpu abs_vusage for active leaf iocgs, and then builds the inner walk
list to propagate.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220510034757.21761-1-zhouchengming@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2a371f7d

11 May, 2022 1 commit

block: wire-up support for passthrough plugging · 1c2d2fff

Jens Axboe authored on May 11, 2022

Add support for plugging in passthrough path. When plugging is enabled, the
requests are added to a plug instead of getting dispatched to the driver.
And when the plug is finished, the whole batch gets dispatched via
->queue_rqs which turns out to be more efficient. Otherwise dispatching
used to happen via ->queue_rq, one request at a time.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220511054750.20432-3-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1c2d2fff
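
The difference between per-request dispatch and plugged batch dispatch in 1c2d2fff can be modelled in a few lines of Python. This is a toy model under stated assumptions: `Driver`, `Plug` and `submit` are hypothetical classes and functions, and the method names only echo the `->queue_rq` and `->queue_rqs` callbacks named in the message.

```python
class Driver:
    """Hypothetical driver that counts how often it is entered."""

    def __init__(self):
        self.calls = 0
        self.handled = []

    def queue_rq(self, rq):      # one request per call
        self.calls += 1
        self.handled.append(rq)

    def queue_rqs(self, rqs):    # a whole batch per call
        self.calls += 1
        self.handled.extend(rqs)


class Plug:
    """Collects requests and hands them over as one batch when finished."""

    def __init__(self, driver):
        self.driver = driver
        self.pending = []

    def add(self, rq):
        self.pending.append(rq)

    def finish(self):
        if self.pending:
            self.driver.queue_rqs(self.pending)
            self.pending = []


def submit(driver, requests, plug=None):
    """Dispatch directly, or park requests on the plug when one is active."""
    for rq in requests:
        if plug is not None:
            plug.add(rq)
        else:
            driver.queue_rq(rq)
```

Three unplugged requests enter the driver three times; the same three requests submitted under a plug enter it once, when the plug is finished.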

10 May, 2022 1 commit

fs: Convert block_read_full_page() to block_read_full_folio() · 2c69e205

Matthew Wilcox (Oracle) authored on Apr 29, 2022

This function is NOT converted to handle large folios, so include
an assert that the filesystem isn't passing one in. Otherwise, use
the folio functions instead of the page functions, where they exist.
Convert all filesystems which use block_read_full_page().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

2c69e205

09 May, 2022 2 commits

fs: Remove flags parameter from aops->write_begin · 9d6b0cd7

Matthew Wilcox (Oracle) authored on Feb 22, 2022

There are no more aop flags left, so remove the parameter.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

9d6b0cd7

fs: Remove aop flags parameter from block_write_begin() · b3992d1e

Matthew Wilcox (Oracle) authored on Feb 22, 2022

There are no more aop flags left, so remove the parameter.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

b3992d1e

openeuler / Kernel synced successfully more than 1 year ago