- 14 Jul 2021, 1 commit
-
Submitted by Peter Zijlstra
mainline inclusion
from mainline-v5.11-rc1
commit 545b8c8d
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZV2C
CVE: NA

-------------------------------------------------

Get rid of the __call_single_node union and clean up the API a little to avoid external code relying on the structure layout as much.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

conflict:
  kernel/debug/debug_core.c
  kernel/sched/core.c
  kernel/smp.c: fix csd_lock_wait_getcpu() csd->node.dst

Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
Reviewed-by: Chen Wandun <chenwandun@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 06 Jul 2021, 2 commits
-
Submitted by Ming Lei
mainline inclusion
from mainline-5.11-rc1
commit 7aa390ec
category: bugfix
bugzilla: 108493
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7aa390ec2d9db0cd6677d95d0b8f307f9c086770

---------------------------

This reverts commit b3c6a599.

Now that the nvme-loop lockdep warning of 'possible recursive locking' is avoided by nvme-loop's own lock class, there is no need to apply a dynamically allocated lock class key, so revert commit b3c6a599 ("block: Fix a lockdep complaint triggered by request queue flushing").

This fixes a horrible SCSI probe delay issue on megaraid_sas; it was reported that the whole probe could take more than half an hour.

Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reported-by: Qian Cai <cai@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Ming Lei
mainline inclusion
from mainline-5.11-rc1
commit fb01a293
category: bugfix
bugzilla: 108493
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fb01a2932e81a1fb2273f87ff92dc8172b8880ee

---------------------------

flush_end_io() may be called recursively from some drivers, such as nvme-loop, so lockdep may complain about 'possible recursive locking'. Commit b3c6a599 ("block: Fix a lockdep complaint triggered by request queue flushing") tried to address this issue by assigning a dynamically allocated per-flush-queue lock class. That solution adds a synchronize_rcu() to each hctx's release handler and causes a horrible SCSI MQ probe delay (more than half an hour on megaraid_sas).

Add the new API blk_mq_hctx_set_fq_lock_class() for these drivers, so we just need to use a driver-specific lock class to avoid the lockdep warning of 'possible recursive locking'.

Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reported-by: Qian Cai <cai@redhat.com>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
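A minimal sketch of how a driver might consume this API; the nvme-loop hook name and setup details below are assumptions based on the description above, not the exact driver patch:

```c
/* Sketch: one shared lock class for all of this driver's flush queues,
 * so lockdep distinguishes them from the backing device's flush queue
 * and stops reporting 'possible recursive locking'. */
static struct lock_class_key loop_hctx_fq_lock_key;

static int nvme_loop_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
			       unsigned int hctx_idx)
{
	/* ... driver-specific hctx setup elided ... */
	blk_mq_hctx_set_fq_lock_class(hctx, &loop_hctx_fq_lock_key);
	return 0;
}
```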
-
- 03 Jul 2021, 6 commits
-
Submitted by Yufen Yu
hulk inclusion
category: bugfix
bugzilla: 168631
CVE: NA

-------------------------------------------------

If the disk has been deleted, we should fail the ioctl BLKPG_DEL_PARTITION. Otherwise, invalid symlink files may be left behind in /sys/class/block.

The race is as follows:

    blkdev_open                       del_gendisk
                                        disk->flags &= ~GENHD_FL_UP;
                                        blk_drop_partitions
      blkpg_ioctl
        bdev_add_partition
          add_partition
            device_add
              device_add_class_symlinks

The ioctl may call add_partition() after del_gendisk() has already tried to delete the partitions. Then the symlink files are created.

Link: https://lore.kernel.org/linux-block/20210608092707.1062259-1-yuyufen@huawei.com/
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Yufen Yu
hulk inclusion
category: bugfix
bugzilla: 168631
CVE: NA

-------------------------------------------------

For now, there is no mechanism that prevents an ioctl from calling add_partition() after del_gendisk() has called delete_partition(). Invalid symlink files may then be created in /sys/class/block.

We try to fix this problem by clearing GENHD_FL_UP early in del_gendisk() and checking the flag before adding partitions, much like the mainline kernel does. Since both paths are covered by bdev->bd_mutex, either add_partition() succeeds but its partition is then deleted by del_gendisk(), or add_partition() fails and returns because GENHD_FL_UP has already been cleared.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
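A hedged sketch of the check described above; the exact 5.10 signatures and error code are illustrative:

```c
/* Sketch: refuse to add a partition once del_gendisk() has cleared
 * GENHD_FL_UP; bd_mutex orders this check against partition removal. */
int bdev_add_partition(struct block_device *bdev, int partno,
		       sector_t start, sector_t length)
{
	struct hd_struct *part;

	mutex_lock(&bdev->bd_mutex);
	if (!(bdev->bd_disk->flags & GENHD_FL_UP)) {
		mutex_unlock(&bdev->bd_mutex);
		return -ENXIO;		/* disk is going away */
	}
	part = add_partition(bdev->bd_disk, partno, start, length,
			     ADDPART_FLAG_NONE, NULL);
	mutex_unlock(&bdev->bd_mutex);
	return PTR_ERR_OR_ZERO(part);
}
```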
-
Submitted by Christoph Hellwig
mainline inclusion
from mainline-v5.13-rc1
commit c76f48eb
category: bugfix
bugzilla: 168631
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c76f48eb5c084b1e15c931ae8cc1826cd771d70d

--------------------------------

There is nothing preventing an ioctl from trying to delete a partition concurrently with del_gendisk, so take open_mutex to serialize against that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210406062303.811835-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflict:
  block/genhd.c
  block/partitions/core.c
  [yufen: del_gendisk didn't call blk_drop_partitions().]

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Yufen Yu
hulk inclusion
category: bugfix
bugzilla: 168625
CVE: NA

-------------------------------------------------

After calling add_disk(), we have registered a new kobj map for the sd device; we can then remove the old unused kobj map which was probed by sd_remove.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Jeffle Xu
mainline inclusion
from mainline-5.11-rc1
commit b0d97557
category: bugfix
bugzilla: 108592
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b0d97557ebfc9d5ba5f2939339a9fdd267abafeb

---------------------------

The inflight count of partition 0 doesn't include inflight IOs to all sub-partitions, since mq currently calculates the inflight count of a specific partition by simply comparing the partition pointer. Thus the following case is possible:

    $ cat /sys/block/vda/inflight
           0        0
    $ cat /sys/block/vda/vda1/inflight
           0      128

A single queue device (on a previous version, e.g. v3.10) does not have this issue:

    $ cat /sys/block/sda/sda3/inflight
           0       33
    $ cat /sys/block/sda/inflight
           0       33

Partition 0 should be handled specially since it represents the whole disk. This issue has existed since commit bf0ddaba ("blk-mq: fix sysfs inflight counter").

Besides, this patch also fixes the inflight statistics of part 0 in /proc/diskstats. Before this patch, the inflight statistics of part 0 didn't include those of the sub-partitions. (The 'inflight' field is marked with asterisks.)

    $ cat /proc/diskstats
    259 0 nvme0n1 45974469 0 367814768 6445794 1 0 1 0 *0* 111062 6445794 0 0 0 0 0 0
    259 2 nvme0n1p1 45974058 0 367797952 6445727 0 0 0 0 *33* 111001 6445727 0 0 0 0 0 0

This has existed since commit f299b7c7 ("blk-mq: provide internal in-flight variant").

Fixes: bf0ddaba ("blk-mq: fix sysfs inflight counter")
Fixes: f299b7c7 ("blk-mq: provide internal in-flight variant")
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[axboe: adapt for 5.11 partition change]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
  block/blk-mq.c

Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
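A hedged sketch of the corrected accounting predicate; field names approximate the 5.10 structures:

```c
/* Sketch: partition 0 stands for the whole disk, so count a request
 * against it regardless of which sub-partition the request targets;
 * other partitions still require an exact pointer match. */
static bool rq_counts_for_part(struct request *rq, struct hd_struct *part)
{
	return part->partno == 0 || rq->part == part;
}
```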
-
Submitted by Jia Cheng Hu
mainline inclusion
from mainline-v5.12-rc1
commit d4fc3640
category: bugfix
bugzilla: 107810
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d4fc3640ff361a09e359867e0bca898abd2b7ecb

-----------------------------------------------

Since commit c5089591c3ba ("block, bfq: detect wakers and unconditionally inject their I/O"), when the in-service bfq_queue, say Q, is temporarily empty, BFQ checks whether there are I/O requests to inject (also) from the waker bfq_queue for Q. To this goal, the value pointed by bfqq->waker_bfqq->next_rq must be controlled. However, the current implementation mistakenly looks at bfqq->next_rq, which instead points to the next request of the currently served queue.

This mistake evidently causes losses of throughput in scenarios with waker bfq_queues.

This commit corrects this mistake.

Fixes: c5089591c3ba ("block, bfq: detect wakers and unconditionally inject their I/O")
Signed-off-by: Jia Cheng Hu <jia.jiachenghu@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 03 Jun 2021, 5 commits
-
Submitted by Bart Van Assche
stable inclusion
from stable-5.10.38
commit 3a96437f6bf85fa64e933cc100445f9278cee1ff
bugzilla: 51875
CVE: NA

--------------------------------

[ Upstream commit 630ef623 ]

If a tag set is shared across request queues (e.g. SCSI LUNs) then the block layer core keeps track of the number of active request queues in tags->active_queues. blk_mq_tag_busy() and blk_mq_tag_idle() update that atomic counter if the hctx flag BLK_MQ_F_TAG_QUEUE_SHARED is set. Make sure that blk_mq_exit_queue() calls blk_mq_tag_idle() before that flag is cleared by blk_mq_del_queue_tag_set().

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Fixes: 0d2602ca ("blk-mq: improve support for shared tags maps")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210513171529.7977-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Ming Lei
stable inclusion
from stable-5.10.38
commit c9c1ed08c174c2fa88fe1badbb876a7317a8224f
bugzilla: 51875
CVE: NA

--------------------------------

[ Upstream commit 03f26d8f ]

In case of a shared sbitmap, requests won't be held in the plug list any more since commit 32bc15af ("blk-mq: Facilitate a shared sbitmap per tagset"). This makes request merging from the flush plug list and batched submission impossible, causing a performance regression.

Yanhui reports a performance regression when running a sequential IO test (libaio, 16 jobs, 8 depth for each job) in a VM, where the VM disk is emulated with an image stored on xfs/megaraid_sas.

Fix the issue by restoring the original behavior of holding requests in the plug list.

Cc: Yanhui Ma <yama@redhat.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: kashyap.desai@broadcom.com
Fixes: 32bc15af ("blk-mq: Facilitate a shared sbitmap per tagset")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210514022052.1047665-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Omar Sandoval
stable inclusion
from stable-5.10.38
commit 54dbe2d2c1fcabf650c7a8b747601da355cd7f9f
bugzilla: 51875
CVE: NA

--------------------------------

[ Upstream commit efed9a33 ]

__blk_mq_sched_bio_merge() gets the ctx and hctx for the current CPU and passes the hctx to ->bio_merge(). kyber_bio_merge() then gets the ctx for the current CPU again and uses that to get the corresponding Kyber context in the passed hctx. However, the thread may be preempted between the two calls to blk_mq_get_ctx(), and the ctx returned the second time may no longer correspond to the passed hctx. This "works" accidentally most of the time, but it can cause us to read garbage if the second ctx came from an hctx with more ctx's than the first one (i.e., if ctx->index_hw[hctx->type] > hctx->nr_ctx).

This manifested as this UBSAN array index out of bounds error reported by Jakub:

    UBSAN: array-index-out-of-bounds in ../kernel/locking/qspinlock.c:130:9
    index 13106 is out of range for type 'long unsigned int [128]'
    Call Trace:
      dump_stack+0xa4/0xe5
      ubsan_epilogue+0x5/0x40
      __ubsan_handle_out_of_bounds.cold.13+0x2a/0x34
      queued_spin_lock_slowpath+0x476/0x480
      do_raw_spin_lock+0x1c2/0x1d0
      kyber_bio_merge+0x112/0x180
      blk_mq_submit_bio+0x1f5/0x1100
      submit_bio_noacct+0x7b0/0x870
      submit_bio+0xc2/0x3a0
      btrfs_map_bio+0x4f0/0x9d0
      btrfs_submit_data_bio+0x24e/0x310
      submit_one_bio+0x7f/0xb0
      submit_extent_page+0xc4/0x440
      __extent_writepage_io+0x2b8/0x5e0
      __extent_writepage+0x28d/0x6e0
      extent_write_cache_pages+0x4d7/0x7a0
      extent_writepages+0xa2/0x110
      do_writepages+0x8f/0x180
      __writeback_single_inode+0x99/0x7f0
      writeback_sb_inodes+0x34e/0x790
      __writeback_inodes_wb+0x9e/0x120
      wb_writeback+0x4d2/0x660
      wb_workfn+0x64d/0xa10
      process_one_work+0x53a/0xa80
      worker_thread+0x69/0x5b0
      kthread+0x20b/0x240
      ret_from_fork+0x1f/0x30

Only Kyber uses the hctx, so fix it by passing the request_queue to ->bio_merge() instead. BFQ and mq-deadline just use that, and Kyber can map the queues itself to avoid the mismatch.

Fixes: a6088845 ("block: kyber: make kyber more friendly with merging")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Link: https://lore.kernel.org/r/c7598605401a48d5cfeadebb678abd10af22b83f.1620691329.git.osandov@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Tejun Heo
stable inclusion
from stable-5.10.38
commit 70748bba55658f4bf61ba1686fec9879ca6559c9
bugzilla: 51875
CVE: NA

--------------------------------

commit e9f4eee9 upstream.

When the weight of an active iocg is updated, weight_updated() is called, which in turn calls __propagate_weights() to update the active and inuse weights so that the effective hierarchical weights are updated accordingly.

The current implementation is incorrect for inner active nodes. For an active leaf iocg, inuse can be any value between 1 and active, and the difference represents how much the iocg is donating. When the weight is updated, as long as inuse is clamped between 1 and the new weight, we're alright, and this is what __propagate_weights() currently implements.

However, that's not how an active inner node's inuse is set. An inner node's inuse is solely determined by the ratio between the sums of inuse's and active's of its children - ie. they're results of propagating the leaves' active and inuse weights upwards. __propagate_weights() incorrectly applies the same clamping as for a leaf when an active inner node's weight is updated. Consider a hierarchy which looks like the following with saturating workloads in AA and BB.

         R
        / \
       A   B
       |   |
      AA   BB

1. For both A and B, active=100, inuse=100, hwa=0.5, hwi=0.5.

2. echo 200 > A/io.weight

3. __propagate_weights() updates A's active to 200 and leaves inuse at 100, as it's already between 1 and the new active, making A:active=200, A:inuse=100. As R's active_sum is updated along with A's active, A:hwa=2/3, B:hwa=1/3. However, because the inuses didn't change, the hwi's remain unchanged at 0.5.

4. The weight of A is now twice that of B, but AA and BB still have the same hwi of 0.5 and thus are doing the same amount of IOs.

Fix it by making __propagate_weights() always calculate the inuse of an active inner iocg based on the ratio of child_inuse_sum to child_active_sum.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dan Schatzberg <dschatzberg@fb.com>
Fixes: 7caa4715 ("blkcg: implement blk-iocost")
Cc: stable@vger.kernel.org # v5.4+
Link: https://lore.kernel.org/r/YJsxnLZV1MnBcqjj@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
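A hedged sketch of the corrected rule in __propagate_weights(); with the worked example above, after 'echo 200 > A/io.weight' we get child_inuse_sum = child_active_sum = 100, so A:inuse = 200 * 100 / 100 = 200 and A:hwi follows A:hwa to 2/3. The inner-node test is close to the upstream shape but should be read as illustrative:

```c
/* Sketch: an active inner iocg derives inuse purely from its
 * children's inuse/active ratio; only leaves get the 1..active clamp. */
if (list_empty(&iocg->active_list) && iocg->child_active_sum) {
	/* inner node: inuse = active * child_inuse_sum / child_active_sum */
	inuse = DIV64_U64_ROUND_UP(active * iocg->child_inuse_sum,
				   iocg->child_active_sum);
} else {
	/* leaf node: inuse may sit anywhere between 1 and active */
	inuse = clamp_t(u32, inuse, 1, active);
}
```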
-
Submitted by Christoph Hellwig
stable inclusion
from stable-5.10.33
commit fc2454cc0c4bbf3ab7556c8b38e042c6c7651e42
bugzilla: 51834
CVE: NA

--------------------------------

[ Upstream commit 68e6582e ]

The switch to go through blkdev_get_by_dev means we now ignore the return value from bdev_disk_changed in __blkdev_get. Add a manual check to restore the old semantics.

Fixes: 4601b4b1 ("block: reopen the device in blkdev_reread_part")
Reported-by: Karel Zak <kzak@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210421160502.447418-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 26 Apr 2021, 1 commit
-
Submitted by Yufen Yu
stable inclusion
from stable-5.10.31
commit 1d2310d95fb8e29e69ebfc038919c968fbbdcb64
bugzilla: 51792

--------------------------------

[ Upstream commit 3edf5346 ]

For multiple split bios, if one of the bios fails, the whole IO should return an error to the application. But we found there is a race between bio_integrity_verify_fn and bio completion, which returns success to the application after one of the bios failed.

The race is as follows:

    split bio(READ)          kworker

    nvme_complete_rq
      blk_update_request            // split, error=0
        bio_endio
          bio_integrity_endio
            queue_work(kintegrityd_wq, &bip->bip_work);

                             bio_integrity_verify_fn
                               bio_endio             // split bio
                                 __bio_chain_endio
                                   if (!parent->bi_status)

                                     <interrupt entry>
                                     nvme_irq
                                       blk_update_request   // parent, error=7
                                         req_bio_endio
                                           bio->bi_status = 7  // parent bio
                                     <interrupt exit>

                                   parent->bi_status = 0
                             parent->bi_end_io()     // returns bi_status=0

The bio has been split in two: split and parent. When the split bio completes, it depends on a kworker to do the endio, while bio_integrity_verify_fn has been interrupted by the parent bio's completion irq handler. Then parent bio->bi_status, which was set in the irq handler, is overwritten by the kworker.

In fact, even without the above race, we also need to consider the concurrency between multiple split bio completions updating the same parent bi_status. Normally, multiple split bios will be issued to the same hctx and complete from the same irq vector. But if we have updated the queue map between multiple split bios, these bios may complete on different hw queues and different irq vectors. Then the concurrent updates of the parent bi_status may produce a wrong final status.

Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210331115359.1125679-1-yuyufen@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
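A hedged sketch of the fixed propagation in __bio_chain_endio(): read the child's status first and never overwrite an error already recorded in the parent:

```c
static struct bio *__bio_chain_endio(struct bio *bio)
{
	struct bio *parent = bio->bi_private;

	/* Propagate a child's error, but never clobber an error that
	 * another completion path has already stored in the parent. */
	if (bio->bi_status && !parent->bi_status)
		parent->bi_status = bio->bi_status;
	bio_put(bio);
	return parent;
}
```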
-
- 19 Apr 2021, 4 commits
-
Submitted by David Jeffery
stable inclusion
from stable-5.10.27
commit fc062d21c011dc9e9e49f20e26fb5930fa24c720
bugzilla: 51493

--------------------------------

[ Upstream commit a958937f ]

When a stacked block device inserts a request into another block device using blk_insert_cloned_request, the request's nr_phys_segments field gets recalculated by a call to blk_recalc_rq_segments in blk_cloned_rq_check_limits. But blk_recalc_rq_segments does not know how to handle multi-segment discards. For disk types which can handle multi-segment discards, like nvme, this results in discard requests which claim a single segment when they should report several, triggering a warning in nvme and causing nvme to fail the discard from the invalid state.

    WARNING: CPU: 5 PID: 191 at drivers/nvme/host/core.c:700 nvme_setup_discard+0x170/0x1e0 [nvme_core]
    ...
      nvme_setup_cmd+0x217/0x270 [nvme_core]
      nvme_loop_queue_rq+0x51/0x1b0 [nvme_loop]
      __blk_mq_try_issue_directly+0xe7/0x1b0
      blk_mq_request_issue_directly+0x41/0x70
      ? blk_account_io_start+0x40/0x50
      dm_mq_queue_rq+0x200/0x3e0
      blk_mq_dispatch_rq_list+0x10a/0x7d0
      ? __sbitmap_queue_get+0x25/0x90
      ? elv_rb_del+0x1f/0x30
      ? deadline_remove_request+0x55/0xb0
      ? dd_dispatch_request+0x181/0x210
      __blk_mq_do_dispatch_sched+0x144/0x290
      ? bio_attempt_discard_merge+0x134/0x1f0
      __blk_mq_sched_dispatch_requests+0x129/0x180
      blk_mq_sched_dispatch_requests+0x30/0x60
      __blk_mq_run_hw_queue+0x47/0xe0
      __blk_mq_delay_run_hw_queue+0x15b/0x170
      blk_mq_sched_insert_requests+0x68/0xe0
      blk_mq_flush_plug_list+0xf0/0x170
      blk_finish_plug+0x36/0x50
      xlog_cil_committed+0x19f/0x290 [xfs]
      xlog_cil_process_committed+0x57/0x80 [xfs]
      xlog_state_do_callback+0x1e0/0x2a0 [xfs]
      xlog_ioend_work+0x2f/0x80 [xfs]
      process_one_work+0x1b6/0x350
      worker_thread+0x53/0x3e0
      ? process_one_work+0x350/0x350
      kthread+0x11b/0x140
      ? __kthread_bind_mask+0x60/0x60
      ret_from_fork+0x22/0x30

This patch fixes blk_recalc_rq_segments to be aware of devices which can have multi-segment discards. It calculates the correct discard segment count by counting the number of bios, as each discard bio is considered its own segment.

Fixes: 1e739730 ("block: optionally merge discontiguous discard bios into a single request")
Signed-off-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Link: https://lore.kernel.org/r/20210211143807.GA115624@redhat
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
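A hedged sketch of the recalculation fix: for discards that may span several bios, count one segment per bio instead of falling through to the normal bvec walk:

```c
/* Inside blk_recalc_rq_segments(), before the regular bvec loop. */
switch (bio_op(rq->bio)) {
case REQ_OP_DISCARD:
case REQ_OP_SECURE_ERASE:
	if (queue_max_discard_segments(rq->q) > 1) {
		struct bio *bio = rq->bio;
		unsigned int nr_phys_segs = 0;

		for_each_bio(bio)	/* each discard bio is one segment */
			nr_phys_segs++;
		return nr_phys_segs;
	}
	return 1;
default:
	break;
}
```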
-
Submitted by Daniel Wagner
stable inclusion
from stable-5.10.27
commit 07feac84efc65c7d0a4ad44096334766bbe68dcb
bugzilla: 51493

--------------------------------

[ Upstream commit 9ec49144 ]

register_disk() suppresses uevents for devices with the GENHD_FL_HIDDEN flag but enables uevents at the end again in order to announce the disk after possible partitions are created.

When the device is removed, the uevents are still on, and user land sees 'remove' messages for devices which were never 'add'ed to the system.

    KERNEL[95481.571887] remove /devices/virtual/nvme-fabrics/ctl/nvme5/nvme0c5n1 (block)

Let's suppress the uevents for GENHD_FL_HIDDEN by not enabling the uevents at all.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin Wilck <mwilck@suse.com>
Link: https://lore.kernel.org/r/20210311151917.136091-1-dwagner@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Damien Le Moal
stable inclusion
from stable-5.10.27
commit d27b0964ade97211fa7a8cd0010ddc8737a054a5
bugzilla: 51493

--------------------------------

[ Upstream commit faa44c69 ]

Similarly to a single zone reset operation (REQ_OP_ZONE_RESET), execute REQ_OP_ZONE_RESET_ALL operations with REQ_SYNC set.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Xunlei Pang
stable inclusion
from stable-5.10.27
commit 71b996c9b883313be4320954c902e84031399fd9
bugzilla: 51493

--------------------------------

[ Upstream commit 4f44657d ]

The current blkio.throttle.io_service_bytes_recursive doesn't work correctly.

As an example, for the following blkcg hierarchy (made 1GB READ in test1, 512MB READ in test2):

         test
        /    \
    test1    test2

    $ head -n 1 test/test1/blkio.throttle.io_service_bytes_recursive
    8:0 Read 1073684480
    $ head -n 1 test/test2/blkio.throttle.io_service_bytes_recursive
    8:0 Read 537448448
    $ head -n 1 test/blkio.throttle.io_service_bytes_recursive
    8:0 Read 537448448

Clearly, the above data for "test" reflects "test2", not "test1"+"test2".

Do the correct summary in blkg_rwstat_recursive_sum().

Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 09 Apr 2021, 5 commits
-
Submitted by Shin'ichiro Kawasaki
stable inclusion
from stable-5.10.24
commit a53477849286c518232231e8983629d33d0499a8
bugzilla: 51348

--------------------------------

commit e5113505 upstream.

When a zone reset ioctl and a data read race for the same zone on a zoned block device, the data read leaves a stale page cache even though the zone reset ioctl zero-clears all the zone data on the device. To avoid non-zero data being read from the stale page cache after the zone reset, discard the page cache of the reset target zones in blkdev_zone_mgmt_ioctl(). Introduce the helper function blkdev_truncate_zone_range() to discard the page cache. Ensure the page cache is discarded by calling the helper function before and after the zone reset, in the same manner as fallocate does.

This patch can be applied back to the stable kernel version v5.10.y. Rework is needed for older stable kernels.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 3ed05a98 ("blk-zoned: implement ioctls")
Cc: <stable@vger.kernel.org> # 5.10+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210311072546.678999-1-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
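A hedged sketch of the helper introduced by this patch; it mirrors what fallocate on a block device already does:

```c
/* Drop the page cache over the byte range covered by the zone range,
 * so a racing read cannot leave stale data behind after a reset. */
static int blkdev_truncate_zone_range(struct block_device *bdev,
				      fmode_t mode,
				      const struct blk_zone_range *zrange)
{
	loff_t start, end;

	if (zrange->sector + zrange->nr_sectors <= zrange->sector ||
	    zrange->sector + zrange->nr_sectors > get_capacity(bdev->bd_disk))
		return -EINVAL;	/* overflow or past end of device */

	start = zrange->sector << SECTOR_SHIFT;
	end = ((zrange->sector + zrange->nr_sectors) << SECTOR_SHIFT) - 1;

	return truncate_bdev_range(bdev, mode, start, end);
}
```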
-
Submitted by Mikulas Patocka
stable inclusion
from stable-5.10.20
commit 556c513e6bac619b19edbfc63bfa3fc0217a83ea
bugzilla: 50608

--------------------------------

commit 97f433c3 upstream.

We get I/O errors when we run md-raid1 on the top of dm-integrity on the top of ramdisk.

    device-mapper: integrity: Bio not aligned on 8 sectors: 0xff00, 0xff
    device-mapper: integrity: Bio not aligned on 8 sectors: 0xff00, 0xff
    device-mapper: integrity: Bio not aligned on 8 sectors: 0xffff, 0x1
    device-mapper: integrity: Bio not aligned on 8 sectors: 0xffff, 0x1
    device-mapper: integrity: Bio not aligned on 8 sectors: 0x8048, 0xff
    device-mapper: integrity: Bio not aligned on 8 sectors: 0x8147, 0xff
    device-mapper: integrity: Bio not aligned on 8 sectors: 0x8246, 0xff
    device-mapper: integrity: Bio not aligned on 8 sectors: 0x8345, 0xbb

The ramdisk device has logical_block_size 512 and max_sectors 255. The dm-integrity device uses logical_block_size 4096 and it doesn't affect the "max_sectors" value - thus, it inherits 255 from the ramdisk. So, we have a device with max_sectors not aligned on logical_block_size.

The md-raid device sees that the underlying leg has max_sectors 255 and it will split the bios on a 255-sector boundary, making the bios unaligned on logical_block_size.

In order to fix the bug, we round down max_sectors to logical_block_size.

Cc: stable@vger.kernel.org
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
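A hedged sketch of the rounding applied when the limit is set (SECTOR_SHIFT is 9, so the logical block size is first converted into 512-byte sectors):

```c
/* Keep max_sectors a whole multiple of the logical block size so a
 * split can never produce a bio the device cannot address. */
struct queue_limits *limits = &q->limits;
unsigned int lbs_sectors = limits->logical_block_size >> SECTOR_SHIFT;

limits->max_sectors = round_down(limits->max_sectors, lbs_sectors);
```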
-
Submitted by Christoph Hellwig
stable inclusion
from stable-5.10.20
commit cc88a819a14c2d2948090b6ddb2db2eee2904efb
bugzilla: 50608

--------------------------------

[ Upstream commit 4601b4b1 ]

Historically the BLKRRPART ioctls called into the now defunct ->revalidate method, which caused the sd driver to check if any media is present. When the ->revalidate method was removed, this revalidation was lost, leading to lots of I/O errors when using the eject command. Fix this by reopening the device to rescan the partitions, and thus calling the revalidation logic in the sd driver.

Fixes: 471bd0af ("sd: use bdev_check_media_change")
Reported-by: Tom Seewald <tseewald@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Tom Seewald <tseewald@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Pan Bian
stable inclusion
from stable-5.10.20
commit 1c7b7d476e6aedd2a72ee08e3e59a14cf120946f
bugzilla: 50608

--------------------------------

[ Upstream commit 0f7b4bc6 ]

Free the request rq before returning an error code.

Fixes: 972248e9 ("scsi: bsg-lib: handle bidi requests without block layer help")
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Jan Kara
stable inclusion
from stable-5.10.20
commit 89e3d1a85df80de70239582f44c91ed943f50006
bugzilla: 50608

--------------------------------

commit 41e76c85 upstream.

bfq_setup_cooperator() uses bfqd->in_serv_last_pos to detect whether it makes sense to merge the current bfq queue with the in-service queue. However, if the in-service queue is freshly scheduled and didn't dispatch any requests yet, bfqd->in_serv_last_pos is stale and contains a value from the previously scheduled bfq queue, which can thus result in a bogus decision that the two queues should be merged. This bug can be observed for example with the following fio jobfile:

    [global]
    direct=0
    ioengine=sync
    invalidate=1
    size=1g
    rw=read

    [reader]
    numjobs=4
    directory=/mnt

where the 4 processes will end up in the one shared bfq queue although they do IO to physically very distant files (for some reason I was able to observe this only with the slice_idle=1ms setting).

Fix the problem by invalidating bfqd->in_serv_last_pos when switching the in-service queue.

Fixes: 058fdecc ("block, bfq: fix in-service-queue check for queue merging")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
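A hedged sketch of the one-line invalidation, with the surrounding bookkeeping elided:

```c
static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
				       struct bfq_queue *bfqq)
{
	/* ... existing budget/weight bookkeeping elided ... */
	bfqd->in_service_queue = bfqq;
	bfqd->in_serv_last_pos = 0;	/* forget the previous queue's position */
}
```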
-
- 09 Mar 2021, 2 commits
-
Submitted by Lin Feng
stable inclusion
from stable-5.10.17
commit d93178df8f754b8ae5b5c804edcd6d4b64aad5a7
bugzilla: 48169

--------------------------------

[ Upstream commit 388c705b ]

This reverts commit 6d4d2735.

bfq.limit_depth passes word_depths[] as shallow_depth down to the sbitmap core's sbitmap_get_shallow, which uses just the number to limit the scan depth of each bitmap word, formula:

    scan_percentage_for_each_word = shallow_depth / (1 << sbitmap->shift) * 100%

That means the comments' percentiles of 50%, 75%, 18%, 37% in bfq are correct. But after the patch 'bfq: Fix computation of shallow depth', we use sbitmap.depth instead. As an example, in the following case:

    sbitmap.depth = 256, map_nr = 4, shift = 6; sbitmap_word.depth = 64.

the results of the computed bfqd->word_depths[] are {128, 192, 48, 96}, and three of the numbers exceed the core driver's sbitmap_word.depth of 64, so they limit nothing.

Signed-off-by: Lin Feng <linf@wangsu.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
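A hedged sketch of the restored computation: word_depths[] must be scaled by the per-word capacity (1 << shift), not the whole map depth, because sbitmap_get_shallow() applies the limit per word:

```c
/* With shift = 6 each word holds 64 bits; these evaluate to 32, 48,
 * 12 and 24, i.e. the 50%, 75%, 18% and 37% quoted above. Scaling by
 * depth = 256 instead would give 128, 192, 48 and 96, three of which
 * exceed the per-word limit of 64 and therefore limit nothing. */
unsigned int per_word = 1U << bt->sb.shift;

bfqd->word_depths[0][0] = max(per_word >> 1, 1U);
bfqd->word_depths[0][1] = max((per_word * 3) >> 2, 1U);
bfqd->word_depths[1][0] = max((per_word * 3) >> 4, 1U);
bfqd->word_depths[1][1] = max((per_word * 6) >> 4, 1U);
```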
-
Submitted by Baolin Wang
stable inclusion
from stable-5.10.16
commit fb8f9b2f7d229a8bc74db1cfa53814a0d6b42b7f
bugzilla: 48168

--------------------------------

[ Upstream commit 6c635cae ]

On a !PREEMPT kernel, we can get the softlockup below when doing stress testing that repeatedly creates and destroys block cgroups. The reason is that it may take a long time to acquire the queue's lock in the loop of blkcg_destroy_blkgs(), or the system can accumulate a huge number of blkgs in pathological cases. We can add a need_resched() check on each loop and, if true, release the locks and do cond_resched() to avoid this issue, since blkcg_destroy_blkgs() is not called from atomic contexts.

    [ 4757.010308] watchdog: BUG: soft lockup - CPU#11 stuck for 94s!
    [ 4757.010698] Call trace:
    [ 4757.010700]  blkcg_destroy_blkgs+0x68/0x150
    [ 4757.010701]  cgwb_release_workfn+0x104/0x158
    [ 4757.010702]  process_one_work+0x1bc/0x3f0
    [ 4757.010704]  worker_thread+0x164/0x468
    [ 4757.010705]  kthread+0x108/0x138

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
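A hedged sketch of the fixed loop, close to the upstream shape:

```c
void blkcg_destroy_blkgs(struct blkcg *blkcg)
{
	spin_lock_irq(&blkcg->lock);
	while (!hlist_empty(&blkcg->blkg_list)) {
		struct blkcg_gq *blkg = hlist_entry(blkcg->blkg_list.first,
						    struct blkcg_gq,
						    blkcg_node);
		struct request_queue *q = blkg->q;

		if (need_resched() || !spin_trylock(&q->queue_lock)) {
			/* Drop everything, let the scheduler breathe,
			 * then start over; avoids the softlockup above. */
			spin_unlock_irq(&blkcg->lock);
			cond_resched();
			spin_lock_irq(&blkcg->lock);
			continue;
		}
		blkg_destroy(blkg);
		spin_unlock(&q->queue_lock);
	}
	spin_unlock_irq(&blkcg->lock);
}
```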
-
- 09 Feb 2021, 1 commit
-
Submitted by Ming Lei
stable inclusion
from stable-5.10.13
commit 20786fdd2fb0c648e8c4895d3839d57b1d78375f
bugzilla: 47995

--------------------------------

commit 2569063c upstream.

In case of blk_mq_is_sbitmap_shared(), we should test QUEUE_FLAG_HCTX_ACTIVE against q->queue_flags instead of BLK_MQ_S_TAG_ACTIVE. So fix it.

Cc: John Garry <john.garry@huawei.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Fixes: f1b49fdc ("blk-mq: Record active_queues_shared_sbitmap per tag_set for when using shared sbitmap")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
-
- 28 Jan 2021, 2 commits
-
Submitted by John Garry
stable inclusion
from stable-5.10.9
commit 847c76518c41ba45ec02742a5d03065ebd4b3c39
bugzilla: 47457

--------------------------------

[ Upstream commit 02f938e9 ]

Showing the hctx flags for when BLK_MQ_F_TAG_HCTX_SHARED is set gives something like:

    root@debian:/home/john# more /sys/kernel/debug/block/sda/hctx0/flags
    alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED|3

Add the decoding for that flag.

Fixes: 32bc15af ("blk-mq: Facilitate a shared sbitmap per tagset")
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
-
Submitted by Jan Kara
stable inclusion
from stable-5.10.9
commit 7fdaca86fc9b853c44e0104919989b6cb387cdc2
bugzilla: 47457

--------------------------------

[ Upstream commit 6d4d2735 ]

BFQ computes the number of tags it allows to be allocated for each request type based on the tag bitmap. However, it uses 1 << bitmap.shift as the number of available tags, which is wrong. 'shift' is just an internal bitmap value containing the logarithm of how many bits the bitmap uses in each bitmap word. Thus the number of tags allowed for some request types can be far too low. Use the proper bitmap.depth, which holds the number of tags, instead.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
-
- 27 Jan 2021, 6 commits
-
Submitted by Ming Lei
stable inclusion
from stable-5.10.8
commit 481097d6617414167c0018f1ece1bfb8e117f62f
bugzilla: 47450

--------------------------------

commit aebf5db9 upstream.

Make sure that bdgrab() is done on the 'block_device' instance before referring to it, to avoid use-after-free.

Cc: <stable@vger.kernel.org>
Reported-by: syzbot+825f0f9657d4e528046e@syzkaller.appspotmail.com
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
-
Submitted by Tejun Heo
stable inclusion
from stable-5.10.7
commit cafc6e70a63c5ca30b1cc9ae1bb492fcc54bfd62
bugzilla: 47429

--------------------------------

commit d16baa3f upstream.

When initializing iocost for a queue, its rqos should be registered before the blkcg policy is activated to allow policy data initialization to look up the associated ioc. This unfortunately means that the rqos methods can be called on bios before iocgs are attached to all existing blkgs.

While the race is theoretically possible on ioc_rqos_throttle(), it mostly happened in ioc_rqos_merge() due to the difference in how they look up the ioc. The former determines it from the passed-in @rqos and then bails before dereferencing the iocg if the looked-up ioc is disabled, which most likely is the case if initialization is still in progress. The latter looked up the ioc by dereferencing the possibly NULL iocg, making it a lot more prone to actually triggering the bug.

* Make ioc_rqos_merge() use the same method as ioc_rqos_throttle() to look up the ioc, for consistency.

* Make ioc_rqos_throttle() and ioc_rqos_merge() test for a NULL iocg before dereferencing it.

* Explain the danger of NULL iocgs in blk_iocost_init().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jonathan Lemon <bsd@fb.com>
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
-
Submitted by Alan Stern
stable inclusion
from stable-5.10.7
commit d55d15a332ec651ccb49c42a8a10c03447fdf418
bugzilla: 47429

--------------------------------

[ Upstream commit 52abca64 ]

blk_queue_enter() accepts BLK_MQ_REQ_PM requests independent of the runtime power management state. Now that SCSI domain validation no longer depends on this behavior, modify the behavior of blk_queue_enter() as follows:

- Do not accept any requests while suspended.
- Only process power management requests while suspending or resuming.

Submitting BLK_MQ_REQ_PM requests to a device that is runtime suspended causes runtime-suspended devices not to resume as they should. The request which should cause a runtime resume instead gets issued directly, without resuming the device first. Of course the device can't handle it properly, the I/O fails, and the device remains suspended.

The problem is fixed by checking that the queue's runtime-PM status isn't RPM_SUSPENDED before allowing a request to be issued, and queuing a runtime-resume request if it is. In particular, the inline blk_pm_request_resume() routine is renamed blk_pm_resume_queue() and the code is unified by merging the surrounding checks into the routine. If the queue isn't set up for runtime PM, or there currently is no restriction on allowed requests, the request is allowed. Likewise if the BLK_MQ_REQ_PM flag is set and the status isn't RPM_SUSPENDED. Otherwise a runtime resume is queued and the request is blocked until conditions are more suitable.

[ bvanassche: modified commit message and removed Cc: stable because without the previous patches from this series this patch would break parallel SCSI domain validation + introduced queue_rpm_status() ]

Link: https://lore.kernel.org/r/20201209052951.16136-9-bvanassche@acm.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Can Guo <cang@codeaurora.org>
Cc: Stanley Chu <stanley.chu@mediatek.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reported-and-tested-by: Martin Kepplinger <martin.kepplinger@puri.sm>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
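A hedged sketch of the renamed helper and the gating it implements, very close to the shape described above:

```c
static bool blk_pm_resume_queue(const bool pm, struct request_queue *q)
{
	if (!q->dev || !blk_queue_pm_only(q))
		return true;	/* nothing to do: no PM restriction */
	if (pm && q->rpm_status != RPM_SUSPENDED)
		return true;	/* PM request while suspending/resuming */

	/* Anything else must wait: queue a runtime resume and block the
	 * caller until conditions are more suitable. */
	pm_request_resume(q->dev);
	return false;
}
```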
-
Submitted by Bart Van Assche
stable inclusion
from stable-5.10.7
commit 782c9ef2ac059a25d6afbac344319574414258db
bugzilla: 47429

--------------------------------

[ Upstream commit a4d34da7 ]

Remove flag RQF_PREEMPT and BLK_MQ_REQ_PREEMPT since these are no longer used by any kernel code.

Link: https://lore.kernel.org/r/20201209052951.16136-8-bvanassche@acm.org
Cc: Can Guo <cang@codeaurora.org>
Cc: Stanley Chu <stanley.chu@mediatek.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Martin Kepplinger <martin.kepplinger@puri.sm>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
-
Submitted by Bart Van Assche
stable inclusion
from stable-5.10.7
commit 8ed46b329d4e62a1d0c7b17361c0e364eaf4a9da
bugzilla: 47429

--------------------------------

[ Upstream commit 0854bcdc ]

Introduce the BLK_MQ_REQ_PM flag. This flag makes the request allocation functions set RQF_PM. This is the first step towards removing BLK_MQ_REQ_PREEMPT.

Link: https://lore.kernel.org/r/20201209052951.16136-3-bvanassche@acm.org
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Stanley Chu <stanley.chu@mediatek.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Can Guo <cang@codeaurora.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
-
Submitted by Andres Freund
stable inclusion
from stable-5.10.7
commit bfb39e6d67a5fb3875e0cfb2e108e4bcc56d7747
bugzilla: 47429

--------------------------------

[ Upstream commit dc304326 ]

This was missed in 021a2446. It leads to the numeric value of QUEUE_FLAG_NOWAIT (i.e. 29) showing up in /sys/kernel/debug/block/*/state.

Fixes: 021a2446
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
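A hedged sketch of the one-line fix in the debugfs name table:

```c
/* block/blk-mq-debugfs.c: without this entry the raw bit number (29)
 * leaks into the 'state' output instead of a symbolic name. */
static const char *const blk_queue_flag_name[] = {
	/* ... earlier entries elided ... */
	QUEUE_FLAG_NAME(RQ_ALLOC_TIME),
	QUEUE_FLAG_NAME(HCTX_ACTIVE),
	QUEUE_FLAG_NAME(NOWAIT),	/* the missing entry */
};
```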
-
- 18 Jan 2021, 1 commit
-
Submitted by Bart Van Assche
stable inclusion
from stable-5.10.5
commit 092898b070e0fa53df6e598a5a5f1ea8f35476f1
bugzilla: 46931

--------------------------------

commit fa4d0f19 upstream.

With the current implementation the following race can happen:

* blk_pre_runtime_suspend() calls blk_freeze_queue_start() and blk_mq_unfreeze_queue().
* blk_queue_enter() calls blk_queue_pm_only() and that function returns true.
* blk_queue_enter() calls blk_pm_request_resume() and that function does not call pm_request_resume() because the queue runtime status is RPM_ACTIVE.
* blk_pre_runtime_suspend() changes the queue status into RPM_SUSPENDING.

Fix this race by changing the queue runtime status into RPM_SUSPENDING before switching q_usage_counter to atomic mode.

Link: https://lore.kernel.org/r/20201209052951.16136-2-bvanassche@acm.org
Fixes: 986d413b ("blk-mq: Enable support for runtime power management")
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Stanley Chu <stanley.chu@mediatek.com>
Co-developed-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
-
- 05 Dec 2020, 1 commit
-
Submitted by Mike Snitzer
Commit 882ec4e6 ("dm table: stack 'chunk_sectors' limit to account for target-specific splitting") caused a couple of regressions:

1) Using lcm_not_zero() when stacking chunk_sectors was a bug because chunk_sectors must reflect the most limited of all devices in the IO stack.

2) DM targets that set max_io_len but that do _not_ provide an .iterate_devices method no longer had their IO split properly.

And commit 5091cdec ("dm: change max_io_len() to use blk_max_size_offset()") also caused a regression where DM no longer supported varied (per target) IO splitting. The implication being the potential for severely reduced performance for IO stacks that use a DM target like dm-cache to hide performance limitations of a slower device (e.g. one that requires 4K IO splitting).

Coming full circle: fix all these issues by discontinuing stacking chunk_sectors up using ti->max_io_len in dm_calculate_queue_limits(), adding an optional chunk_sectors override argument to blk_max_size_offset(), and updating DM's max_io_len() to pass ti->max_io_len to its blk_max_size_offset() call.

Passing in an optional chunk_sectors override to blk_max_size_offset() allows for code reuse of block's centralized calculation for the max IO size based on the provided offset and split boundary.

Fixes: 882ec4e6 ("dm table: stack 'chunk_sectors' limit to account for target-specific splitting")
Fixes: 5091cdec ("dm: change max_io_len() to use blk_max_size_offset()")
Cc: stable@vger.kernel.org
Reported-by: John Dorminy <jdorminy@redhat.com>
Reported-by: Bruce Johnston <bjohnsto@redhat.com>
Reported-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: John Dorminy <jdorminy@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
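A hedged sketch of blk_max_size_offset() with the optional override described above; it follows the upstream shape but should be read as illustrative for this tree:

```c
static inline unsigned int blk_max_size_offset(struct request_queue *q,
					       sector_t offset,
					       unsigned int chunk_sectors)
{
	if (!chunk_sectors) {
		if (q->limits.chunk_sectors)
			chunk_sectors = q->limits.chunk_sectors;
		else
			return q->limits.max_sectors;	/* no boundary */
	}

	/* Sectors left until the next chunk boundary after 'offset'. */
	if (likely(is_power_of_2(chunk_sectors)))
		chunk_sectors -= offset & (chunk_sectors - 1);
	else
		chunk_sectors -= sector_div(offset, chunk_sectors);

	return min(q->limits.max_sectors, chunk_sectors);
}
```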
-
- 02 Dec 2020, 1 commit
-
Submitted by Mike Snitzer
commit 22ada802 ("block: use lcm_not_zero() when stacking chunk_sectors") broke chunk_sectors limit stacking. chunk_sectors must reflect the most limited of all devices in the IO stack. Otherwise malformed IO may result. E.g.: prior to this fix, ->chunk_sectors = lcm_not_zero(8, 128) would result in blk_max_size_offset() splitting IO at 128 sectors rather than the required, more restrictive 8 sectors.

And since commit 07d098e6 ("block: allow 'chunk_sectors' to be non-power-of-2") care must be taken to properly stack chunk_sectors to be compatible with the possibility that a non-power-of-2 chunk_sectors may be stacked. This is why gcd() is used instead of reverting back to using min_not_zero().

Fixes: 22ada802 ("block: use lcm_not_zero() when stacking chunk_sectors")
Fixes: 07d098e6 ("block: allow 'chunk_sectors' to be non-power-of-2")
Reported-by: John Dorminy <jdorminy@redhat.com>
Reported-by: Bruce Johnston <bjohnsto@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: John Dorminy <jdorminy@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
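A hedged sketch of the stacking change in blk_stack_limits(), with the worked numbers from the message above: lcm_not_zero(8, 128) = 128 would split at 128 sectors, while gcd(8, 128) = 8 preserves the required 8-sector boundary:

```c
/* The combined chunk_sectors must satisfy every device in the stack,
 * including non-power-of-2 values, so use gcd() rather than
 * lcm_not_zero() or min_not_zero(). */
t->chunk_sectors = gcd(t->chunk_sectors, b->chunk_sectors);
```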
-
- 21 Nov 2020, 1 commit
-
Submitted by Eric Biggers
If there is only one keyslot, then blk_ksm_init() computes slot_hashtable_size=1 and log_slot_ht_size=0. This causes blk_ksm_find_keyslot() to crash later because it uses hash_ptr(key, log_slot_ht_size) to find the hash bucket containing the key, and hash_ptr() doesn't support the bits == 0 case.

Fix this by making the hash table always have at least 2 buckets.

Tested by running:

    kvm-xfstests -c ext4 -g encrypt -m inlinecrypt \
        -o blk-crypto-fallback.num_keyslots=1

Fixes: 1b262839 ("block: Keyslot Manager for Inline Encryption")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
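A hedged sketch of the fix in blk_ksm_init():

```c
unsigned int slot_hashtable_size;

slot_hashtable_size = roundup_pow_of_two(num_slots);
/* hash_ptr() assumes bits != 0, so make sure the hash table has at
 * least 2 buckets even when there is a single keyslot. */
if (slot_hashtable_size < 2)
	slot_hashtable_size = 2;
ksm->log_slot_ht_size = ilog2(slot_hashtable_size);
```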
-
- 15 Nov 2020, 1 commit
-
Submitted by Christoph Hellwig
disk_get_part needs to be paired with a disk_put_part.

Cc: stable@vger.kernel.org
Fixes: ef45fe47 ("blk-cgroup: show global disk stats in root cgroup io.stat")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
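A hedged sketch of the pairing rule the fix restores in the root-cgroup stats path (the loop details are illustrative):

```c
struct hd_struct *part = disk_get_part(disk, partno);

if (part) {
	/* ... accumulate this partition's stats ... */
	disk_put_part(part);	/* drop the reference disk_get_part() took */
}
```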
-