Commits · fa01b1e9733fd59ecb8b5b6d85dfb481d2025fbf · openeuler / Kernel

25 Sep, 2020 5 commits

block: add a bdev_is_partition helper · fa01b1e9

Authored by Christoph Hellwig on Sep 03, 2020

Add a little helper to make the somewhat arcane bd_contains checks a
little more obvious.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fa01b1e9
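
The idea of replacing an open-coded bd_contains comparison with a named helper can be sketched in user space as follows. This is only an illustration under stated assumptions: struct mock_bdev and mock_bdev_is_partition() are hypothetical stand-ins, not the kernel's struct block_device or the helper added by this commit, whose actual body may differ.

```c
#include <stdbool.h>

/* Hypothetical user-space stand-in for a block device (not the kernel struct). */
struct mock_bdev {
	struct mock_bdev *bd_contains;	/* whole-disk device; a whole disk points to itself */
	int bd_partno;			/* 0 for a whole disk (illustrative only) */
};

/*
 * Named helper wrapping the "arcane" comparison: a device is a partition
 * when the device that contains it is not the device itself.
 */
static inline bool mock_bdev_is_partition(const struct mock_bdev *bdev)
{
	return bdev->bd_contains != bdev;
}
```

A caller can then write mock_bdev_is_partition(bdev) instead of repeating the pointer comparison at every call site.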

bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag · 1cb039f3

Authored by Christoph Hellwig on Sep 24, 2020

The BDI_CAP_STABLE_WRITES is one of the few bits of information in the
backing_dev_info shared between the block drivers and the writeback code.
To help untangle the dependency, replace it with a queue flag and a
superblock flag derived from it.  This also helps with the case of e.g.
a file system requiring stable writes due to its own checksumming, but
not forcing it on other users of the block device like the swap code.

One downside is that we can't support the stable_pages_required bdi
attribute in sysfs anymore.  It is replaced with a queue attribute which
also is writable for easier testing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1cb039f3

bdi: remove BDI_CAP_CGROUP_WRITEBACK · ed7b6b4f

Authored by Christoph Hellwig on Sep 24, 2020

Just checking SB_I_CGROUPWB for cgroup writeback support is enough.
Either the file system allocates its own bdi (e.g. btrfs), in which case
it is known to support cgroup writeback, or the bdi comes from the block
layer, which always supports cgroup writeback.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ed7b6b4f

block: lift setting the readahead size into the block layer · c2e4cd57

Authored by Christoph Hellwig on Sep 24, 2020

Drivers shouldn't really mess with the readahead size, as that is a VM
concept.  Instead set it based on the optimal I/O size by lifting the
algorithm from the md driver when registering the disk.  Also set
bdi->io_pages there as well by applying the same scheme based on
max_sectors.  To ensure the limits work well for stacking drivers a
new helper is added to update the readahead limits from the block
limits, which is also called from disk_stack_limits.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c2e4cd57
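
A minimal user-space sketch of deriving both values from the block limits is shown below. It assumes that the readahead size is twice the optimal I/O size with a 128 KiB floor, and that io_pages is max_sectors converted from 512-byte sectors to pages. The 4 KiB page size, the struct names and mock_update_readahead() are assumptions made for illustration and are not taken from the commit text.

```c
#define MOCK_PAGE_SHIFT		12	/* assumed 4 KiB pages */
#define MOCK_PAGE_SIZE		(1UL << MOCK_PAGE_SHIFT)
/* Assumed default readahead floor of 128 KiB, expressed in pages. */
#define MOCK_VM_READAHEAD_PAGES	((128UL * 1024) / MOCK_PAGE_SIZE)

/* Hypothetical subset of the block limits. */
struct mock_limits {
	unsigned int io_opt;		/* optimal I/O size in bytes, 0 if unknown */
	unsigned int max_sectors;	/* maximum request size in 512-byte sectors */
};

/* Hypothetical subset of the backing_dev_info. */
struct mock_bdi {
	unsigned long ra_pages;
	unsigned long io_pages;
};

/* Derive the VM-facing values from the block limits in one place. */
static inline void mock_update_readahead(const struct mock_limits *lim,
					 struct mock_bdi *bdi)
{
	unsigned long ra = (unsigned long)lim->io_opt * 2 / MOCK_PAGE_SIZE;

	/* Never go below the default readahead window. */
	bdi->ra_pages = ra > MOCK_VM_READAHEAD_PAGES ? ra : MOCK_VM_READAHEAD_PAGES;
	/* Sectors are 512 bytes (2^9), so shift by the difference to get pages. */
	bdi->io_pages = lim->max_sectors >> (MOCK_PAGE_SHIFT - 9);
}
```

Because a stacking driver would call the same routine after combining the limits of its component devices, the readahead window follows the stacked io_opt rather than a value picked by each driver.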

bdi: initialize ->ra_pages and ->io_pages in bdi_init · 55b2598e

Authored by Christoph Hellwig on Sep 24, 2020

Set up a readahead size by default, as very few users have a good
reason to change it.  This means coda, ecryptfs, and orangefs now
set up the values while they were previously missing it, while ubifs,
mtd and vboxsf manually set it to 0 to avoid readahead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Acked-by: Richard Weinberger <richard@nod.at> [ubifs, mtd]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

55b2598e

24 Sep, 2020 5 commits

block: cleanup blkdev_bszset · 47816282

Authored by Christoph Hellwig on Sep 21, 2020

Use blkdev_get_by_dev instead of bdgrab + blkdev_get.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

47816282

block: cleanup partition scanning in register_disk · 9301fe73

Authored by Christoph Hellwig on Sep 21, 2020

Use blkdev_get_by_dev instead of open coding it using bdget_disk +
blkdev_get, and split the code to read the partition table into a
separate helper to make it a little more obvious.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9301fe73

block: move the NEED_PART_SCAN flag to struct gendisk · 38430f08

Authored by Christoph Hellwig on Sep 21, 2020

We can only scan for partitions on the whole disk, so move the flag
from struct block_device to struct gendisk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

38430f08

block: allow 'chunk_sectors' to be non-power-of-2 · 07d098e6

Authored by Mike Snitzer on Sep 21, 2020

It is possible, albeit more unlikely, for a block device to have a non
power-of-2 for chunk_sectors (e.g. 10+2 RAID6 with 128K chunk_sectors,
which results in a full-stripe size of 1280K. This causes the RAID6's
io_opt to be advertised as 1280K, and a stacked device _could_ then be
made to use a blocksize, aka chunk_sectors, that matches non power-of-2
io_opt of underlying RAID6 -- resulting in stacked device's
chunk_sectors being a non power-of-2).

Update blk_queue_chunk_sectors() and blk_max_size_offset() to
accommodate drivers that need a non power-of-2 chunk_sectors.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

07d098e6
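
The difference between the power-of-2 and non-power-of-2 cases comes down to how the position inside the current chunk is computed: a bit mask only works for powers of two, while a modulo works for any chunk size. The following user-space sketch illustrates this under stated assumptions; mock_max_size_offset() and its parameters are hypothetical and only mirror the role of blk_max_size_offset(), not its exact code.

```c
#include <stdbool.h>
#include <stdint.h>

static inline bool mock_is_power_of_2(uint32_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Return how many sectors an I/O starting at 'offset' may span without
 * crossing a chunk boundary, capped at 'max_sectors'.
 */
static inline uint32_t mock_max_size_offset(uint32_t max_sectors,
					    uint32_t chunk_sectors,
					    uint64_t offset)
{
	uint32_t in_chunk, left;

	if (!chunk_sectors)	/* no chunking configured */
		return max_sectors;

	if (mock_is_power_of_2(chunk_sectors))
		in_chunk = (uint32_t)(offset & (chunk_sectors - 1));	/* cheap mask */
	else
		in_chunk = (uint32_t)(offset % chunk_sectors);		/* general case */

	left = chunk_sectors - in_chunk;
	return left < max_sectors ? left : max_sectors;
}
```

For the 10+2 RAID6 example, a 1280K stripe is 2560 sectors, so an I/O starting at sector 3000 sits 440 sectors into its chunk and may span at most 2120 sectors before reaching the boundary.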

block: use lcm_not_zero() when stacking chunk_sectors · 22ada802

Authored by Mike Snitzer on Sep 21, 2020

Like 'io_opt', blk_stack_limits() should stack 'chunk_sectors' using
lcm_not_zero() rather than min_not_zero() -- otherwise the final
'chunk_sectors' could result in sub-optimal alignment of IO to
component devices in the IO stack.

Also, if 'chunk_sectors' isn't a multiple of 'physical_block_size'
then it is a bug in the driver and the device should be flagged as
'misaligned'.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

22ada802
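
To see why the least common multiple is the better stacking rule, compare it with the minimum for two component devices whose chunk sizes differ. The helpers below are a user-space sketch with hypothetical names (mock_lcm_not_zero(), mock_min_not_zero(), mock_chunk_misaligned()); they are written from the description above and are not copied from the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

static inline uint64_t mock_gcd(uint64_t a, uint64_t b)
{
	while (b) {
		uint64_t t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/* LCM that treats 0 as "not set" and returns the other value. */
static inline uint64_t mock_lcm_not_zero(uint64_t a, uint64_t b)
{
	if (!a)
		return b;
	if (!b)
		return a;
	return a / mock_gcd(a, b) * b;
}

/* Minimum that treats 0 as "not set" and returns the other value. */
static inline uint64_t mock_min_not_zero(uint64_t a, uint64_t b)
{
	if (!a)
		return b;
	if (!b)
		return a;
	return a < b ? a : b;
}

/* A chunk (in 512-byte sectors) must be a multiple of the physical block size (bytes). */
static inline bool mock_chunk_misaligned(uint32_t chunk_sectors,
					 uint32_t physical_block_size)
{
	return (((uint64_t)chunk_sectors << 9) % physical_block_size) != 0;
}
```

With chunk sizes of 384 and 256 sectors, the minimum (256) is not a multiple of 384, so I/O sized to it would straddle chunk boundaries on the first device, whereas the LCM (768) is aligned for both.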

23 Sep, 2020 2 commits

block: fix bmd->is_null_mapped initialization · 03859717

Authored by Christoph Hellwig on Sep 23, 2020

bmd is allocated using kmalloc in bio_alloc_map_data, so make sure
is_null_mapped is properly initialized to false for the !null_mapped
case.

Fixes: f3256075 ("block: remove the BIO_NULL_MAPPED flag")
Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

03859717

block: drop double zeroing · f952eefe

Authored by Julia Lawall on Sep 20, 2020

sg_init_table zeroes its first argument, so the allocation of that argument
doesn't have to.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression x;
@@

x =
- kzalloc
+ kmalloc
 (...)
...
sg_init_table(x,...)
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f952eefe

15 Sep, 2020 6 commits

blk-throttle: Avoid checking bps/iops limitation if bps or iops is unlimited · 87fbeb88

Authored by Baolin Wang on Sep 07, 2020

There is no need to check the bps or iops limit if bps or iops is unlimited.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

87fbeb88

blk-throttle: Avoid calculating bps/iops limitation repeatedly · 4599ea49

Authored by Baolin Wang on Sep 07, 2020

The tg_may_dispatch() will call tg_with_in_bps_limit() and
tg_with_in_iops_limit() to check if we can dispatch a bio or
not, which will calculate bps/iops limitation multiple times.
But tg_may_dispatch() is always called under queue lock, which
means the bps/iops limitation will not change in tg_may_dispatch().

So we can calculate the bps/iops limitation only once, and pass
them to tg_with_in_bps_limit() and tg_with_in_iops_limit() to
avoid calculating bps/iops limitation repeatedly.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4599ea49

blk-throttle: Define readable macros instead of static variables · e675df2a

Authored by Baolin Wang on Sep 07, 2020

The 'throtl_grp_quantum' and 'throtl_quantum' are both read-only
variables, so it is better to define them as readable macros instead of
static variables, which also saves some space in the .bss area.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e675df2a

blk-throttle: Use readable READ/WRITE macros · ff8b22c0

Authored by Baolin Wang on Sep 07, 2020

Use readable READ/WRITE macros instead of magic numbers.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ff8b22c0

blk-throttle: Fix some comments' typos · b53b072c

Authored by Baolin Wang on Sep 07, 2020

Fix some comments' typos.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b53b072c

iocost: fix infinite loop bug in adjust_inuse_and_calc_cost() · aa67db24

Authored by Tejun Heo on Sep 14, 2020

adjust_inuse_and_calc_cost() is responsible for reducing the amount of
donated weights dynamically in period as the budget runs low. Because we
don't want to do full donation calculation in period, we keep latching up
inuse by INUSE_ADJ_STEP_PCT of the active weight of the cgroup until the
resulting hweight_inuse is satisfactory.

Unfortunately, the adj_step calculation was reading the active weight before
acquiring ioc->lock. Because the current thread could have lost the race to
activate the iocg to another thread before entering this function, it may
read the active weight as zero before acquiring ioc->lock. When this
happens, the adj_step is calculated as zero and the incremental adjustment
loop becomes an infinite one.

Fix it by fetching the active weight after acquiring ioc->lock.

Fixes: b0853ab4 ("blk-iocost: revamp in-period donation snapbacks")
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

aa67db24

12 Sep, 2020 2 commits

blk-iocost: fix divide-by-zero in transfer_surpluses() · 769b628d

Authored by Tejun Heo on Sep 11, 2020

Conceptually, root_iocg->hweight_donating must be less than WEIGHT_ONE but
all hweight calculations round up and thus it may end up >= WEIGHT_ONE
triggering divide-by-zero and other issues. Bound the value to avoid
surprises.

Fixes: e08d02aa ("blk-iocost: implement Andy's method for donation weight updates")
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

769b628d
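
The hazard and the bound can be illustrated with a small user-space sketch. It assumes that some later calculation divides by the difference between WEIGHT_ONE and the donating weight, and that WEIGHT_ONE is a fixed-point constant of 1 << 16; both are assumptions for illustration, and mock_bound_donating() and mock_scale() are hypothetical names rather than functions from blk-iocost.

```c
#include <stdint.h>

#define MOCK_WEIGHT_ONE	(1u << 16)	/* assumed fixed-point value for 100% */

/* Clamp a rounded-up weight so it stays strictly below MOCK_WEIGHT_ONE. */
static inline uint32_t mock_bound_donating(uint32_t hweight_donating)
{
	return hweight_donating < MOCK_WEIGHT_ONE ?
		hweight_donating : MOCK_WEIGHT_ONE - 1;
}

/*
 * Hypothetical calculation that divides by the non-donating share.
 * Thanks to the clamp the divisor is always at least 1.
 */
static inline uint32_t mock_scale(uint32_t numerator, uint32_t hweight_donating)
{
	uint32_t divisor = MOCK_WEIGHT_ONE - mock_bound_donating(hweight_donating);

	return numerator / divisor;
}
```

Without the clamp, a donating weight that rounds up to exactly MOCK_WEIGHT_ONE would make the divisor zero, and a larger value would wrap around in unsigned arithmetic.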

block: introduce part_[begin|end]_io_acct · 7b26410b

Authored by Song Liu on Aug 31, 2020

These functions can be used to enable iostat for partitions on devices
like md, bcache.
Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7b26410b

11 Sep, 2020 2 commits

blk-mq: always allow reserved allocation in hctx_may_queue · 28500850

Authored by Ming Lei on Sep 11, 2020

NVMe shares tagset between fabric queue and admin queue or between
connect_q and NS queue, so hctx_may_queue() can be called to allocate
request for these queues.

Tags can be reserved in these tagsets. Before error recovery, there are
often lots of in-flight requests which can't be completed, and a new
reserved request may be needed in the error recovery path. However,
hctx_may_queue() can always return false because there are too many
in-flight requests which can't be completed during error handling.
Finally, nothing can proceed.

Fix this issue by always allowing reserved tag allocation in
hctx_may_queue(). This is reasonable because reserved tags are supposed
to always be available.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: David Milburn <dmilburn@redhat.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

28500850
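
The effect of the fix can be sketched in user space: the fairness check is skipped for reserved allocations, so error recovery can still obtain a request even when the in-flight count has reached the queue's share. The struct and function names below (mock_tag_hctx, mock_fair_share_ok(), mock_may_get_tag()) are hypothetical, and the fairness rule is deliberately simplified to a single threshold.

```c
#include <stdbool.h>

/* Hypothetical per-hardware-queue accounting. */
struct mock_tag_hctx {
	unsigned int nr_active;		/* requests currently in flight */
	unsigned int fair_share;	/* simplified per-queue share of the tag depth */
};

/* Simplified fairness check: refuse once the share is used up. */
static inline bool mock_fair_share_ok(const struct mock_tag_hctx *hctx)
{
	return hctx->nr_active < hctx->fair_share;
}

/* Reserved allocations bypass the fairness check entirely. */
static inline bool mock_may_get_tag(const struct mock_tag_hctx *hctx, bool reserved)
{
	if (reserved)
		return true;
	return mock_fair_share_ok(hctx);
}
```

In the stuck scenario described above, nr_active never drops because the in-flight requests cannot complete, so only the reserved path can make progress.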

block: remove duplicate include statement in scsi_ioctl.c · 84ed2573

Authored by Tian Tao on Sep 11, 2020

scsi/sg.h is included more than once. Remove the one that isn't
necessary.
Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

84ed2573

10 Sep, 2020 2 commits

blkcg: add plugging support for punt bio · 192f1c6b

Authored by Xianting Tian on Sep 10, 2020

The test and the explanation of the patch are as below.

Before test we added more debug code in blkg_async_bio_workfn():
	int count = 0;
	if (bios.head && bios.head->bi_next) {
		need_plug = true;
		blk_start_plug(&plug);
	}
	while ((bio = bio_list_pop(&bios))) {
		/* io_punt is a sysctl user interface to control the print */
		if (io_punt) {
			printk("[%s:%d] bio start,size:%llu,%d count=%d plug?%d\n",
				current->comm, current->pid, bio->bi_iter.bi_sector,
				(bio->bi_iter.bi_size)>>9, count++, need_plug);
		}
		submit_bio(bio);
	}
	if (need_plug)
		blk_finish_plug(&plug);

Steps that need to be set to trigger *PUNT* io before testing:
	mount -t btrfs -o compress=lzo /dev/sda6 /btrfs
	mount -t cgroup2 nodev /cgroup2
	mkdir /cgroup2/cg3
	echo "+io" > /cgroup2/cgroup.subtree_control
	echo "8:0 wbps=1048576000" > /cgroup2/cg3/io.max #1000M/s
	echo $$ > /cgroup2/cg3/cgroup.procs

Then use dd command to test btrfs PUNT io in current shell:
	dd if=/dev/zero of=/btrfs/file bs=64K count=100000

Test hardware environment as below:
	[root@localhost btrfs]# lscpu
	Architecture:          x86_64
	CPU op-mode(s):        32-bit, 64-bit
	Byte Order:            Little Endian
	CPU(s):                32
	On-line CPU(s) list:   0-31
	Thread(s) per core:    2
	Core(s) per socket:    8
	Socket(s):             2
	NUMA node(s):          2
	Vendor ID:             GenuineIntel

With above debug code, test command and test environment, I did the
tests under 3 different system loads, which are triggered by stress:
1, Run 64 threads by command "stress -c 64 &"
	[53615.975974] [kworker/u66:18:1490] bio start,size:45583056,8 count=0 plug?1
	[53615.975980] [kworker/u66:18:1490] bio start,size:45583064,8 count=1 plug?1
	[53615.975984] [kworker/u66:18:1490] bio start,size:45583072,8 count=2 plug?1
	[53615.975987] [kworker/u66:18:1490] bio start,size:45583080,8 count=3 plug?1
	[53615.975990] [kworker/u66:18:1490] bio start,size:45583088,8 count=4 plug?1
	[53615.975993] [kworker/u66:18:1490] bio start,size:45583096,8 count=5 plug?1
	... ...
	[53615.977041] [kworker/u66:18:1490] bio start,size:45585480,8 count=303 plug?1
	[53615.977044] [kworker/u66:18:1490] bio start,size:45585488,8 count=304 plug?1
	[53615.977047] [kworker/u66:18:1490] bio start,size:45585496,8 count=305 plug?1
	[53615.977050] [kworker/u66:18:1490] bio start,size:45585504,8 count=306 plug?1
	[53615.977053] [kworker/u66:18:1490] bio start,size:45585512,8 count=307 plug?1
	[53615.977056] [kworker/u66:18:1490] bio start,size:45585520,8 count=308 plug?1
	[53615.977058] [kworker/u66:18:1490] bio start,size:45585528,8 count=309 plug?1

2, Run 32 threads by command "stress -c 32 &"
	[50586.290521] [kworker/u66:6:32351] bio start,size:45806496,8 count=0 plug?1
	[50586.290526] [kworker/u66:6:32351] bio start,size:45806504,8 count=1 plug?1
	[50586.290529] [kworker/u66:6:32351] bio start,size:45806512,8 count=2 plug?1
	[50586.290531] [kworker/u66:6:32351] bio start,size:45806520,8 count=3 plug?1
	[50586.290533] [kworker/u66:6:32351] bio start,size:45806528,8 count=4 plug?1
	[50586.290535] [kworker/u66:6:32351] bio start,size:45806536,8 count=5 plug?1
	... ...
	[50586.299640] [kworker/u66:5:32350] bio start,size:45808576,8 count=252 plug?1
	[50586.299643] [kworker/u66:5:32350] bio start,size:45808584,8 count=253 plug?1
	[50586.299646] [kworker/u66:5:32350] bio start,size:45808592,8 count=254 plug?1
	[50586.299649] [kworker/u66:5:32350] bio start,size:45808600,8 count=255 plug?1
	[50586.299652] [kworker/u66:5:32350] bio start,size:45808608,8 count=256 plug?1
	[50586.299663] [kworker/u66:5:32350] bio start,size:45808616,8 count=257 plug?1
	[50586.299665] [kworker/u66:5:32350] bio start,size:45808624,8 count=258 plug?1
	[50586.299668] [kworker/u66:5:32350] bio start,size:45808632,8 count=259 plug?1

3, Without running any stress threads
	[50861.355246] [kworker/u66:19:32376] bio start,size:13544504,8 count=0 plug?0
	[50861.355288] [kworker/u66:19:32376] bio start,size:13544512,8 count=0 plug?0
	[50861.355322] [kworker/u66:19:32376] bio start,size:13544520,8 count=0 plug?0
	[50861.355353] [kworker/u66:19:32376] bio start,size:13544528,8 count=0 plug?0
	[50861.355392] [kworker/u66:19:32376] bio start,size:13544536,8 count=0 plug?0
	[50861.355431] [kworker/u66:19:32376] bio start,size:13544544,8 count=0 plug?0
	[50861.355468] [kworker/u66:19:32376] bio start,size:13544552,8 count=0 plug?0
	[50861.355499] [kworker/u66:19:32376] bio start,size:13544560,8 count=0 plug?0
	[50861.355532] [kworker/u66:19:32376] bio start,size:13544568,8 count=0 plug?0
	[50861.355575] [kworker/u66:19:32376] bio start,size:13544576,8 count=0 plug?0
	[50861.355618] [kworker/u66:19:32376] bio start,size:13544584,8 count=0 plug?0
	[50861.355659] [kworker/u66:19:32376] bio start,size:13544592,8 count=0 plug?0
	[50861.355740] [kworker/u66:0:32346] bio start,size:13544600,8 count=0 plug?1
	[50861.355748] [kworker/u66:0:32346] bio start,size:13544608,8 count=1 plug?1
	[50861.355962] [kworker/u66:2:32347] bio start,size:13544616,8 count=0 plug?0
	[50861.356272] [kworker/u66:7:31962] bio start,size:13544624,8 count=0 plug?0
	[50861.356446] [kworker/u66:7:31962] bio start,size:13544632,8 count=0 plug?0
	[50861.356567] [kworker/u66:7:31962] bio start,size:13544640,8 count=0 plug?0
	[50861.356707] [kworker/u66:19:32376] bio start,size:13544648,8 count=0 plug?0
	[50861.356748] [kworker/u66:15:32355] bio start,size:13544656,8 count=0 plug?0
	[50861.356825] [kworker/u66:17:31970] bio start,size:13544664,8 count=0 plug?0

Analysis of the above 3 test results with different system loads:
From the above tests, we can see that more and more continuous bios can be
plugged as the system load increases. When running "stress -c 64 &", 310
continuous bios are plugged; when running "stress -c 32 &", 260 continuous
bios are plugged; when not running stress, at most 2 continuous bios are
plugged, and in most cases bio_list contains only a single bio.

How to explain the above phenomenon:
In submit_bio(), if the bio is a REQ_CGROUP_PUNT io, it will
queue a work item to the workqueue blkcg_punt_bio_wq. When that work item
actually runs depends on the system load.  When the system load is low, the
work item is scheduled quickly, and the bios in bio_list are
processed quickly in blkg_async_bio_workfn(), so there is less chance
that the same io submit thread can add multiple continuous bios to
bio_list before the work item runs. This analysis matches
test "3" above.
When the system load is high, there is some delay before the work item
runs; the higher the system load, the greater the delay.
So there is more chance that the same io submit thread can add multiple
continuous bios to bio_list. Then, when the work item runs,
there are more continuous bios in bio_list, which will be processed in
blkg_async_bio_workfn(). This analysis matches tests "1" and "2" above.

According to the tests, io performance improves with the patch,
especially when the system load is higher. Another optimization is to use
the plug only when bio_list contains at least 2 bios.
Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

192f1c6b

block: add a bdev_check_media_change helper · 95f6f3a4

Authored by Christoph Hellwig on Sep 08, 2020

Like check_disk_changed, except that it does not call ->revalidate_disk
but leaves that to the caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

95f6f3a4

08 Sep, 2020 4 commits

block: make QUEUE_SYSFS_BIT_FNS more useful · fc93fe14

Authored by Christoph Hellwig on Sep 03, 2020

Switch to the naming used by the other entries so that we can use the
QUEUE_RW_ENTRY helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fc93fe14

block: add helper macros for queue sysfs entries · 35626147

Authored by Christoph Hellwig on Sep 03, 2020

Add two helper macros to avoid boilerplate code for the queue sysfs
entries.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

35626147

block: Remove unused blk_mq_sched_free_hctx_data() · ddfb8b0b

Authored by Baolin Wang on Sep 07, 2020

Now we usually free the hctx->sched_data by e->type->ops.exit_hctx(),
and no users will use blk_mq_sched_free_hctx_data() function.
Remove it.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ddfb8b0b

block: Do not discard buffers under a mounted filesystem · 384d87ef

Authored by Jan Kara on Sep 04, 2020

Discarding blocks and buffers under a mounted filesystem is hardly
anything an admin wants to do. Usually it will confuse the filesystem and
sometimes the loss of buffer_head state (including b_private field) can
even cause crashes like:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
PGD 0 P4D 0
Oops: 0002 [#1] SMP PTI
CPU: 4 PID: 203778 Comm: jbd2/dm-3-8 Kdump: loaded Tainted: G O     --------- -  - 4.18.0-147.5.0.5.h126.eulerosv2r9.x86_64 #1
Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 1.57 08/11/2015
RIP: 0010:jbd2_journal_grab_journal_head+0x1b/0x40 [jbd2]
...
Call Trace:
 __jbd2_journal_insert_checkpoint+0x23/0x70 [jbd2]
 jbd2_journal_commit_transaction+0x155f/0x1b60 [jbd2]
 kjournald2+0xbd/0x270 [jbd2]

So if we don't have the block device open with O_EXCL already, claim the
block device while we truncate the buffer cache. This makes sure any
exclusive block device user (such as a filesystem) cannot operate on the
device while we are discarding the buffer cache.
Reported-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[axboe: fix !CONFIG_BLOCK error in truncate_bdev_range()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

384d87ef

04 Sep, 2020 10 commits

blk-mq, elevator: Count requests per hctx to improve performance · b445547e

Authored by Kashyap Desai on Aug 19, 2020

High CPU utilization on "native_queued_spin_lock_slowpath" due to lock
contention is possible for mq-deadline and bfq IO schedulers
when nr_hw_queues is more than one.

It is because kblockd work queue can submit IO from all online CPUs
(through blk_mq_run_hw_queues()) even though only one hctx has pending
commands.

The elevator callback .has_work for the mq-deadline and bfq schedulers
reports pending work if there are any IOs on the request queue, but it does
not take the hctx context into account.

Add a per-hctx 'elevator_queued' count to the hctx to avoid triggering
the elevator even though there are no requests queued.

[jpg: Relocated atomic_dec() in dd_dispatch_request(), update commit message per Kashyap]
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b445547e
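
A per-hctx counter of this kind can be sketched in user space with C11 atomics. The names below (mock_sched, mock_sched_hctx, mock_insert(), mock_dispatch(), mock_has_work()) are hypothetical, and the scheduler state is reduced to a single queue-wide count; the sketch only illustrates the early exit and does not reproduce the mq-deadline or bfq code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical queue-wide scheduler state shared by all hardware queues. */
struct mock_sched {
	int total_queued;
};

/* Hypothetical hardware queue context with its own queued-request count. */
struct mock_sched_hctx {
	atomic_int elevator_queued;
};

static inline void mock_insert(struct mock_sched *s, struct mock_sched_hctx *h)
{
	s->total_queued++;
	atomic_fetch_add(&h->elevator_queued, 1);
}

static inline void mock_dispatch(struct mock_sched *s, struct mock_sched_hctx *h)
{
	s->total_queued--;
	atomic_fetch_sub(&h->elevator_queued, 1);
}

/* Old behaviour: any queued request makes every hctx look busy. */
static inline bool mock_has_work_queue_wide(const struct mock_sched *s)
{
	return s->total_queued > 0;
}

/* New behaviour: bail out early when this hctx has nothing queued. */
static inline bool mock_has_work(const struct mock_sched *s,
				 struct mock_sched_hctx *h)
{
	if (atomic_load(&h->elevator_queued) == 0)
		return false;
	return s->total_queued > 0;
}
```

With the early exit, hardware queues that have nothing queued no longer enter the scheduler's dispatch path and contend on its lock just because another hctx has pending requests.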

blk-mq: Record active_queues_shared_sbitmap per tag_set for when using shared sbitmap · f1b49fdc

Authored by John Garry on Aug 19, 2020

When using a shared sbitmap, the number of active request queues per hctx
should no longer be relied on when judging how to share the tag
bitmap.

Instead maintain the number of active request queues per tag_set, and make
the judgement based on that.

Originally-from: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace <don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f1b49fdc

blk-mq: Record nr_active_requests per queue for when using shared sbitmap · bccf5e26

Authored by John Garry on Aug 19, 2020

The per-hctx nr_active value can no longer be used to fairly assign a share
of tag depth per request queue for when using a shared sbitmap, as it does
not consider that the tags are shared tags over all hctx's.

For this case, record the nr_active_requests per request_queue, and make
the judgement based on that value.

Co-developed-with: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace <don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bccf5e26

blk-mq: Relocate hctx_may_queue() · a0235d23

Authored by John Garry on Aug 19, 2020

blk-mq.h and blk-mq-tag.h include each other, which is less than ideal.

Relocate hctx_may_queue() to blk-mq.h, as it is not really tag-specific code.

In this way, we can drop the blk-mq-tag.h include of blk-mq.h.
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a0235d23

blk-mq: Facilitate a shared sbitmap per tagset · 32bc15af

Authored by John Garry on Aug 19, 2020

Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
multiple reply queues with single hostwide tags.

In addition, these drivers want to use interrupt assignment in
pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
CPU hotplug may cause in-flight IO completion to not be serviced when an
interrupt is shutdown. That problem is solved in commit bf0beec0
("blk-mq: drain I/O when all CPUs in a hctx are offline").

However, to take advantage of that blk-mq feature, the HBA HW queues are
required to be mapped to that of the blk-mq hctx's; to do that, the HBA HW
queues need to be exposed to the upper layer.

In making that transition, the per-SCSI command request tags are no
longer unique per Scsi host - they are just unique per hctx. As such, the
HBA LLDD would have to generate this tag internally, which has a certain
performance overhead.

However another problem is that blk-mq assumes the host may accept
(Scsi_host.can_queue * #hw queue) commands. In commit 6eb045e0 ("scsi:
core: avoid host-wide host_busy counter for scsi_mq"), the Scsi host busy
counter was removed, which would stop the LLDD being sent more than
.can_queue commands; however, it should still be ensured that the block
layer does not issue more than .can_queue commands to the Scsi host.

To solve this problem, introduce a shared sbitmap per blk_mq_tag_set,
which may be requested at init time.

New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
tagset to indicate whether the shared sbitmap should be used.

Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and requests
are still allocated per hctx; the reason for this is that if tags and
requests were only allocated for a single hctx - like hctx0 - it may break
block drivers which expect a request be associated with a specific hctx,
i.e. not always hctx0. This will introduce extra memory usage.

This change is based on work originally from Ming Lei in [1] and from
Bart's suggestion in [2].

[0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
[1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
[2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace <don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

32bc15af

blk-mq: Use pointers for blk_mq_tags bitmap tags · 222a5ae0

Authored by John Garry on Aug 19, 2020

Introduce pointers for the blk_mq_tags regular and reserved bitmap tags,
with the goal of later being able to use a common shared tag bitmap across
all HW contexts in a set.
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace <don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

222a5ae0

blk-mq: Pass flags for tag init/free · 1c0706a7

Authored by John Garry on Aug 19, 2020

Pass hctx/tagset flags argument down to blk_mq_init_tags() and
blk_mq_free_tags() for selective init/free.

For now, make it include the alloc policy flag, which can be evaluated
when needed (in blk_mq_init_tags()).
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1c0706a7

blk-mq: Free tags in blk_mq_init_tags() upon error · 4d063237

Authored by Hannes Reinecke on Aug 19, 2020

Since the tags are allocated in blk_mq_init_tags(), it's better practice
to free in that same function upon error, rather than a callee which is to
init the bitmap tags (blk_mq_init_bitmap_tags()).

[jpg: Split from an earlier patch with a new commit message]
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4d063237

blk-mq: Rename blk_mq_update_tag_set_depth() · 655ac300

Authored by Hannes Reinecke on Aug 19, 2020

The function does not set the depth, but rather transitions from
shared to non-shared queues and vice versa.

So rename it to blk_mq_update_tag_set_shared() to better reflect
its purpose.

[jpg: take out some unrelated changes in blk_mq_init_bitmap_tags()]
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

655ac300

blk-mq: Rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED · 51db1c37

Authored by Ming Lei on Aug 19, 2020

BLK_MQ_F_TAG_SHARED actually means that tags are shared among request
queues, all of which should belong to LUNs attached to the same HBA.

So rename it to make the point explicitly.

[jpg: rebase a few times, add rnbd-clt.c change]
Suggested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

51db1c37

02 Sep, 2020 2 commits

block: use revalidate_disk_size in set_capacity_revalidate_and_notify · b8086d3f

Authored by Christoph Hellwig on Sep 01, 2020

Only virtio_blk and xen-blkfront set the revalidate argument to true,
and both do not implement the ->revalidate_disk method.  So switch
to the helper that just updates the size instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b8086d3f

block: rename bd_invalidated · f4ad06f2

Authored by Christoph Hellwig on Sep 01, 2020

Replace bd_invalidated with a new BDEV_NEED_PART_SCAN flag in a bd_flags
variable to better describe the condition.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f4ad06f2
