Commits · 867d89676f7dd292f1c9bc605b992a876667a233 · openanolis / cloud-kernel

02 Sep, 2020 36 commits

alinux: blk: export sector and len fields for iohang · 867d8967

Authored by Jeffle Xu on Jul 30, 2020

fix #29612968

The sector addresses of all bios in a single request should be guaranteed
to be contiguous, except for DISCARD requests. We could get the whole sector
range of the request from blk_rq_pos() and blk_rq_bytes() for normal
read/write requests, but here we still print the sector range of every bio
for code simplicity. Since this is a low-frequency operation, the design
incurs no performance penalty.

Also squash the 'if(bio)' and 'while(1)' into one single
'while(bio)'.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
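
A minimal user-space sketch of the loop shape described in this entry, not the kernel code. `struct mock_bio`, its fields, and `dump_bio_ranges()` are hypothetical stand-ins introduced here for illustration. It walks the bios chained in one request with a single `while (bio)` loop and records each bio's start sector and length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel's struct bio; only what the sketch needs. */
struct mock_bio {
    unsigned long long sector;  /* start sector of this bio */
    unsigned int len;           /* length in bytes */
    struct mock_bio *next;      /* next bio in the same request, or NULL */
};

/*
 * Write "sector+len" for every bio into buf, one per line.
 * Returns the number of bios that were recorded.
 */
int dump_bio_ranges(const struct mock_bio *bio, char *buf, size_t size)
{
    size_t used = 0;
    int count = 0;

    assert(buf != NULL && size > 0);
    buf[0] = '\0';

    /* One loop replaces the older "if (bio) { while (1) { ... } }" shape. */
    while (bio) {
        int n = snprintf(buf + used, size - used, "%llu+%u\n",
                         bio->sector, bio->len);
        if (n < 0 || (size_t)n >= size - used) {
            buf[used] = '\0';   /* drop the partial line and stop */
            break;
        }
        used += (size_t)n;
        count++;
        bio = bio->next;
    }
    return count;
}
```

Passing a NULL bio simply yields an empty string and a count of 0, which is the case the separate `if (bio)` used to guard.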

867d8967

alinux: block: don't decrement nr_phys_segments for physically contigous segments · e6e31258

Authored by Baolin Wang on May 21, 2019

fix #29327388

This patch is inspired by commit eded341c085b ("don't decrement
nr_phys_segments for physically contigous segments"). Currently,
ll_merge_requests_fn() reduces nr_phys_segments by one if the last
segment of the previous request and the first segment of the next
request are contiguous. But that mismatches the segment count when
handling discard requests for virtio_block and nvme; moreover, if the
nvme queue selects an IO scheduler, it can also mistrigger the
single segment optimization in the nvme-pci driver. Thus we should
not decrement nr_phys_segments in ll_merge_requests_fn().

But in __blk_recalc_rq_segments(), if we re-calculate the segments of
one request, we may get a different segment count, because it merges
physically contiguous segments there. In principle we should simply remove
the bvec merging optimization under the assumption that most users already
build good enough bvecs, but we do not want to touch so much core logic
here, which may introduce more issues. So we simply disallow merging
contiguous segments between 2 bios to fix this issue. This is not in
the fast path, so it will not affect performance.

Fixes: dff824b2aadb ("nvme-pci: optimize mapping of small single segment requests").
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

e6e31258

block: Fix blk_mq_*_map_queues() kernel-doc headers · b4b4ab05

Authored by Bart Van Assche on May 30, 2019

to #28991349

commit 0542cd57d266074114d70791ab245e18f750cd32 upstream

This patch avoids that the kernel-doc script complains about these
function headers when building with W=1.

Cc: Hannes Reinecke <hare@suse.com>
Cc: Keith Busch <keith.busch@intel.com>
Fixes: ed76e329d74a ("blk-mq: abstract out queue map") # v5.0.
Fixes: e42b3867de4b ("blk-mq-rdma: pass in queue map to blk_mq_rdma_map_queues") # v5.0.
Reviewed-by: Chaitanya Kulkarni <chiatanya.kulkarni@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

b4b4ab05

blk-mq: only dispatch to non-defauly queue maps if they have queues · 0d7b3c30

Authored by Christoph Hellwig on Dec 17, 2018

to #28991349

commit 5aceaeb26394538858a9dbae5830d628469a44cf upstream

We should check if a given queue map actually has queues enabled before
dispatching to it.  This allows drivers to not initialize optional but
not used map types, which subsequently will allow fixing problems with
queue map rebuilds for that case.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

0d7b3c30

blk-mq: skip zero-queue maps in blk_mq_map_swqueue · 01062597

Authored by Ming Lei on Dec 18, 2018

to #28991349

commit e5edd5f298fafda28284bafb8371e6f0b7681035 upstream

From 7e849dd9cf37 ("nvme-pci: don't share queue maps"), the mapping
table won't be initialized actually if map->nr_queues is zero, so
we can't use blk_mq_map_queue_type() to retrieve hctx any more.

This way still may cause broken mapping, fix it by skipping zero-queues
maps in blk_mq_map_swqueue().

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

01062597

blk-mq: Keep set->nr_hw_queues and set->map[].nr_queues in sync · 48f487df

Authored by Bart Van Assche on Mar 09, 2020

to #28991349

commit 6e66b49392419f3fe134e1be583323ef75da1e4b upstream

blk_mq_map_queues() and multiple .map_queues() implementations expect that
set->map[HCTX_TYPE_DEFAULT].nr_queues is set to the number of hardware
queues. Hence set .nr_queues before calling these functions. This patch
fixes the following kernel warning:

WARNING: CPU: 0 PID: 2501 at include/linux/cpumask.h:137
Call Trace:
 blk_mq_run_hw_queue+0x19d/0x350 block/blk-mq.c:1508
 blk_mq_run_hw_queues+0x112/0x1a0 block/blk-mq.c:1525
 blk_mq_requeue_work+0x502/0x780 block/blk-mq.c:775
 process_one_work+0x9af/0x1740 kernel/workqueue.c:2269
 worker_thread+0x98/0xe40 kernel/workqueue.c:2415
 kthread+0x361/0x430 kernel/kthread.c:255

Fixes: ed76e329d74a ("blk-mq: abstract out queue map") # v5.0
Reported-by: syzbot+d44e1b26ce5c3e77458d@syzkaller.appspotmail.com
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Cc: Johannes Thumshirn <jth@kernel.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

48f487df

blk-mq: re-build queue map in case of kdump kernel · 6ec12c5f

Authored by Ming Lei on Dec 07, 2018

to #28991349

commit 5938870247be4453ef6602c7ce467bebb48113c8 upstream

Now almost all .map_queues() implementations based on managed irq
affinity don't update the queue mapping and just retrieve the
previously built mapping, so if nr_hw_queues is changed, the mapping table
includes stale mappings. Only blk_mq_map_queues() may rebuild
the mapping table.

One case is that we limit .nr_hw_queues to 1 in case of kdump kernel.
However, drivers often build the queue mapping before allocating the tagset
via pci_alloc_irq_vectors_affinity(), but set->nr_hw_queues can be set
to 1 in case of kdump kernel, so the wrong queue mapping is used, and
kernel panic[1] is observed during booting.

This patch fixes the kernel panic triggered on nvme by rebuilding the
mapping table via blk_mq_map_queues().

[1] kernel panic log
[    4.438371] nvme nvme0: 16/0/0 default/read/poll queues
[    4.443277] BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
[    4.444681] PGD 0 P4D 0
[    4.445367] Oops: 0000 [#1] SMP NOPTI
[    4.446342] CPU: 3 PID: 201 Comm: kworker/u33:10 Not tainted 4.20.0-rc5-00664-g5eb02f7ee1eb-dirty #459
[    4.447630] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
[    4.448689] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[    4.449368] RIP: 0010:blk_mq_map_swqueue+0xfb/0x222
[    4.450596] Code: 04 f5 20 28 ef 81 48 89 c6 39 55 30 76 93 89 d0 48 c1 e0 04 48 03 83 f8 05 00 00 48 8b 00 42 8b 3c 28 48 8b 43 58 48 8b 04 f8 <48> 8b b8 98 00 00 00 4c 0f a3 37 72 42 f0 4c 0f ab 37 66 8b b8 f6
[    4.453132] RSP: 0018:ffffc900023b3cd8 EFLAGS: 00010286
[    4.454061] RAX: 0000000000000000 RBX: ffff888174448000 RCX: 0000000000000001
[    4.456480] RDX: 0000000000000001 RSI: ffffe8feffc506c0 RDI: 0000000000000001
[    4.458750] RBP: ffff88810722d008 R08: ffff88817647a880 R09: 0000000000000002
[    4.464580] R10: ffffc900023b3c10 R11: 0000000000000004 R12: ffff888174448538
[    4.467803] R13: 0000000000000004 R14: 0000000000000001 R15: 0000000000000001
[    4.469220] FS:  0000000000000000(0000) GS:ffff88817bac0000(0000) knlGS:0000000000000000
[    4.471554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    4.472464] CR2: 0000000000000098 CR3: 0000000174e4e001 CR4: 0000000000760ee0
[    4.474264] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    4.476007] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    4.477061] PKRU: 55555554
[    4.477464] Call Trace:
[    4.478731]  blk_mq_init_allocated_queue+0x36a/0x3ad
[    4.479595]  blk_mq_init_queue+0x32/0x4e
[    4.480178]  nvme_validate_ns+0x98/0x623 [nvme_core]
[    4.480963]  ? nvme_submit_sync_cmd+0x1b/0x20 [nvme_core]
[    4.481685]  ? nvme_identify_ctrl.isra.8+0x70/0xa0 [nvme_core]
[    4.482601]  nvme_scan_work+0x23a/0x29b [nvme_core]
[    4.483269]  ? _raw_spin_unlock_irqrestore+0x25/0x38
[    4.483930]  ? try_to_wake_up+0x38d/0x3b3
[    4.484478]  ? process_one_work+0x179/0x2fc
[    4.485118]  process_one_work+0x1d3/0x2fc
[    4.485655]  ? rescuer_thread+0x2ae/0x2ae
[    4.486196]  worker_thread+0x1e9/0x2be
[    4.486841]  kthread+0x115/0x11d
[    4.487294]  ? kthread_park+0x76/0x76
[    4.487784]  ret_from_fork+0x3a/0x50
[    4.488322] Modules linked in: nvme nvme_core qemu_fw_cfg virtio_scsi ip_tables
[    4.489428] Dumping ftrace buffer:
[    4.489939]    (ftrace buffer empty)
[    4.490492] CR2: 0000000000000098
[    4.491052] ---[ end trace 03cd268ad5a86ff7 ]---

Cc: Christoph Hellwig <hch@lst.de>
Cc: linux-nvme@lists.infradead.org
Cc: David Milburn <dmilburn@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

6ec12c5f

block: move queues types to the block layer · cd87b6e5

Authored by Christoph Hellwig on Dec 02, 2018

to #28991349

commit e20ba6e1da029136ded295f33076483d65ddf50a upstream

Having another indirect call in the fast path doesn't really help
in our post-spectre world.  Also having too many queue types is just
going to create confusion, so I'd rather manage them centrally.

Note that the queue type naming and ordering changes a bit - the
first index now is the default queue for everything not explicitly
marked, the optional ones are read and poll queues.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

cd87b6e5

blk-mq: fix allocation for queue mapping table · 0c1a14f6

Authored by Ming Lei on Dec 17, 2018

to #28991349

commit 07b35eb5a364fa59f88f65e6c786192f2c9163be upstream

The type of each element in the queue mapping table is 'unsigned int',
instead of 'struct blk_mq_queue_map', so fix it.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

0c1a14f6

blk-mq: balance mapping between present CPUs and queues · 1944434f

Authored by Ming Lei on Jul 25, 2019

to #28991349

commit 556f36e90dbe7dded81f4fac084d2bc8a2458330 upstream

Spread queues among present CPUs first, then build the mapping on the
other, non-present CPUs.

This minimizes the number of dead queues, i.e. queues which are mapped
only by non-present CPUs. It avoids the bad IO performance caused by an
unbalanced mapping between present CPUs and queues.

A similar policy has been applied to managed IRQ affinity.

Cc: Yi Zhang <yi.zhang@redhat.com>
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

1944434f

blk-mq: initial support for multiple queue maps · c454176d

Authored by Jens Axboe on Oct 24, 2018

to #28991349

commit 843477d4cc5c4bb4e346c561ecd3b9d0bd67e8c8 upstream

Add a queue offset to the tag map. This enables users to map
iteratively, for each queue map type they support.

Bump maximum number of supported maps to 2, we're now fully
able to support more than 1 map.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

c454176d

blk-mq: improve plug list sorting · ba2986d9

Authored by Jens Axboe on Oct 30, 2018

to #28991349

commit 3110fc79606fb6003949246c6fb325dd43445273 upstream

Currently we only look at the software queue, but with support
for multiple maps, we should also look at the hardware queue.
This is important since we'll flush out the request list if
either the software queue or hardware queue don't match.

This sorts by software queue first, then hardware queue if
that differs. Finally we sort by request location like before.
This minimizes the flush points per plug list.

Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
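
The three-level ordering described in this entry (software queue, then hardware queue, then request location) can be sketched as a comparator. This is a user-space illustration that sorts an array with `qsort()`; the kernel sorts a linked list instead. `struct mock_rq`, `plug_cmp()`, and `sort_plug_list()` are hypothetical names for this sketch.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, reduced request: ids stand in for the queue pointers. */
struct mock_rq {
    int ctx;                 /* software queue id */
    int hctx;                /* hardware queue id */
    unsigned long long pos;  /* start sector of the request */
};

/* Order by software queue, then hardware queue, then start sector. */
int plug_cmp(const void *a, const void *b)
{
    const struct mock_rq *ra = a, *rb = b;

    if (ra->ctx != rb->ctx)
        return ra->ctx < rb->ctx ? -1 : 1;
    if (ra->hctx != rb->hctx)
        return ra->hctx < rb->hctx ? -1 : 1;
    if (ra->pos != rb->pos)
        return ra->pos < rb->pos ? -1 : 1;
    return 0;
}

void sort_plug_list(struct mock_rq *rqs, size_t n)
{
    assert(rqs != NULL);
    qsort(rqs, n, sizeof(*rqs), plug_cmp);
}
```

After sorting, requests that share both queues sit next to each other, so a walk over the list changes (ctx, hctx) as few times as possible.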

ba2986d9

blk-mq: cleanup and improve list insertion · 239c0d8b

Authored by Jens Axboe on Oct 30, 2018

to #28991349

commit 67cae4c948a5311121905a2a8740c50daf7f6478 upstream

It's somewhat strange to have a list insertion function that
relies on the fact that the caller has mapped things correctly.
Pass in the hardware queue directly for insertion, which makes
for a much cleaner interface and implementation.

Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

239c0d8b

blk-mq: cache request hardware queue mapping · 107174b9

Authored by Jens Axboe on Oct 29, 2018

to #28991349

commit ea4f995ee8b8f0578b3319949f2edd5d812fdb0a upstream

We call blk_mq_map_queue() a lot, at least two times for each
request per IO, sometimes more. Since we now have an indirect
call as well in that function, cache the mapping so we don't
have to re-call blk_mq_map_queue() for the same request
multiple times.

Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

107174b9

blk-mq: separate number of hardware queues from nr_cpu_ids · 3eea3213

Authored by Jens Axboe on Oct 29, 2018

to #28991349

commit 392546aed22009060911f76b6ea24520e2f8b50f upstream

With multiple maps, nr_cpu_ids is no longer the maximum number of
hardware queues we support on a given device. The initializer of
the tag_set can have set ->nr_hw_queues larger than the available
number of CPUs, since we can exceed that with multiple queue maps.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

3eea3213

blk-mq: support multiple hctx maps · 2ca20c32

Authored by Jens Axboe on Oct 30, 2018

to #28991349

commit b3c661b15d5ab11d982e58bee23e05c1780528a1 upstream

Add support for the tag set carrying multiple queue maps, and
for the driver to inform blk-mq how many it wishes to support
through setting set->nr_maps.

This adds an mq_ops helper for drivers that support more than 1
map, mq_ops->rq_flags_to_type(). The function takes request/bio
flags and CPU, and returns a queue map index for that. We then
use the type information in blk_mq_map_queue() to index the map
set.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

2ca20c32

blk-mq: add 'type' attribute to the sysfs hctx directory · 37567ba0

Authored by Jens Axboe on Oct 25, 2018

to #28991349

commit a783b81820fe3532809c98371ec904dfdb0ea9e5 upstream

It can be useful for a user to verify what type a given hardware
queue is, so expose this information in sysfs.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

37567ba0

blk-mq: allow software queue to map to multiple hardware queues · d78f5292

Authored by Jens Axboe on Oct 29, 2018

to #28991349

commit f31967f0e455d08d3ea1d2f849bf62dafc92dbf4 upstream

The mapping used to be dependent on just the CPU location, but
now it's a tuple of (type, cpu) instead. This is a prep patch
for allowing a single software queue to map to multiple hardware
queues. No functional changes in this patch.

This changes the software queue count to an unsigned short
to save a bit of space. We can still support 64K-1 CPUs,
which should be enough. Add a check to catch a wrap.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

d78f5292

block: don't lose track of REQ_INTEGRITY flag · c7abd0f1

Authored by Ming Lei on Jan 16, 2019

to #28991349

commit 7809167da5c86fd6bf309b33dee7a797e263342f upstream

We need to pass bio->bi_opf after bio integrity preparing, otherwise
the REQ_INTEGRITY flag may not be set on the allocated request, which
breaks block integrity.

Fixes: f9afca4d367b ("blk-mq: pass in request/bio flags to queue mapping")
Cc: Hannes Reinecke <hare@suse.com>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

c7abd0f1

blk-mq: pass in request/bio flags to queue mapping · b8617081

Authored by Jens Axboe on Oct 29, 2018

to #28991349

commit f9afca4d367b8c915f28d29fcaba7460640403ff upstream

Prep patch for being able to place request based not just on
CPU location, but also on the type of request.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

b8617081

blk-mq: provide dummy blk_mq_map_queue_type() helper · 67c25121

Authored by Jens Axboe on Oct 29, 2018

to #28991349

commit ff2c56609d9b1f0739ae3a3bfdb78191d01e4192 upstream

Doesn't do anything right now, but it's needed as a prep patch
to get the interfaces right.

While in there, correct the blk_mq_map_queue() CPU type to an unsigned
int.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

67c25121

blk-mq: abstract out queue map · f63859ea

Authored by Jens Axboe on Oct 29, 2018

to #28991349

commit ed76e329d74a4b15ac0f5fd3adbd52ec0178a134 upstream

This is in preparation for allowing multiple sets of maps per
queue, if so desired.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

f63859ea

blk-mq: kill q->mq_map · 7c058cfc

Authored by Jens Axboe on Oct 16, 2018

to #28991349

commit a8908939af569ce2419f43fd56eeaf003bc3d85d upstream

It's just a pointer to set->mq_map, use that instead. Move the
assignment a bit earlier, so we always know it's valid.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

7c058cfc

Revert "blk-mq: balance mapping between present CPUs and queues" · 783899aa

Authored by Xiaoguang Wang on Jun 25, 2020

to #28991349

This reverts commit a3d72a0c79fac0e113bbeb85e1e19b3b3568e2f5.

Previously we backported this patch only partially; now we revert
it temporarily and will backport it properly in later patches.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

783899aa

iocost: protect iocg->abs_vdebt with iocg->waitq.lock · 1ec0deaf

Authored by Tejun Heo on May 04, 2020

to #29361128

commit 0b80f9866e6bbfb905140ed8787ff2af03652c0c upstream.

abs_vdebt is an atomic_64 which tracks how much over budget a given cgroup
is and controls the activation of use_delay mechanism. Once a cgroup goes
over budget from forced IOs, it has to pay it back with its future budget.
The progress guarantee on debt paying comes from the iocg being active -
active iocgs are processed by the periodic timer, which ensures that as time
passes the debts dissipate and the iocg returns to normal operation.

However, both iocg activation and vdebt handling are asynchronous and a
sequence like the following may happen.

1. The iocg is in the process of being deactivated by the periodic timer.

2. A bio enters ioc_rqos_throttle(), calls iocg_activate() which returns
without anything because it still sees that the iocg is already active.

3. The iocg is deactivated.

4. The bio from #2 is over budget but needs to be forced. It increases
abs_vdebt and goes over the threshold and enables use_delay.

5. IO control is enabled for the iocg's subtree and now IOs are attributed
to the descendant cgroups and the iocg itself no longer issues IOs.

This leaves the iocg with stuck abs_vdebt - it has debt but is inactive and
has no further IOs which can activate it. This can end up unduly punishing
all the descendant cgroups.

The usual throttling path has the same issue - the iocg must be active while
throttled to ensure that future event will wake it up - and solves the
problem by synchronizing the throttling path with a spinlock. abs_vdebt
handling is another form of overage handling and shares a lot of
characteristics including the fact that it isn't in the hottest path.

This patch fixes the above and other possible races by strictly
synchronizing abs_vdebt and use_delay handling with iocg->waitq.lock.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Vlad Dmitriev <vvd@fb.com>
Cc: stable@vger.kernel.org # v5.4+
Fixes: e1518f63f246 ("blk-iocost: Don't let merges push vtime into the future")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

1ec0deaf

blk-iocost: account for IO size when testing latencies · 8c8fb141

Authored by Tejun Heo on Apr 13, 2020

to #29361128

commit cd006509b0a93cb7ee9d9fd50ae274098997a460 upstream.

On each IO completion, iocost decides whether the IO met or missed its latency
target. Currently, the targets are fixed numbers per IO type. While this can be
good enough for loose latency targets way higher than typical completion
latencies, the effect of IO size makes it difficult to tighten the latency
target - a target adequate for 4k IOs might be too tight for 512k IOs and
vice-versa.

iocost already has all the necessary information to account for different IO
sizes when testing whether the latency target is met as iocost can calculate the
size vtime cost of a given IO. This patch updates the completion path to
calculate the size vtime cost of the IO, deduct the nsec equivalent from the
observed latency and use the adjusted value to decide whether the target is met.

This makes latency targets independent from IO size and enables determining
adequate latency targets with fixed size fio runs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

8c8fb141

block: make rq sector size accessible for block stats · 9bdcaff2

Authored by Hou Tao on May 21, 2019

to #29361128

commit 3d24430694077313c75c6b89f618db09943621e4 upstream.

Currently rq->data_len will be decreased by partial completion or
zeroed by completion, so when blk_stat_add() is invoked, data_len
will be zero and there will never be samples in poll_cb because
blk_mq_poll_stats_bkt() will return -1 if data_len is zero.

We could move blk_stat_add() back to __blk_mq_complete_request(),
but that would make the effort of trying to call ktime_get_ns()
once in vain. Instead we can reuse throtl_size field, and use
it for both block stats and block throttle, and adjust the
logic in blk_mq_poll_stats_bkt() accordingly.

Fixes: 4bc6339a ("block: move blk_stat_add() to __blk_mq_end_request()")
Tested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

9bdcaff2

blk-iocost: switch to fixed non-auto-decaying use_delay · fc94dc72

Authored by Tejun Heo on Apr 13, 2020

to #29361128

commit 54c52e10dc9b939084a7e6e3d32ce8fd8dee7898 upstream.

The use_delay mechanism was introduced by blk-iolatency to hold memory
allocators accountable for the reclaim and other shared IOs they cause. The
duration of the delay is dynamically balanced between iolatency increasing the
value on each target miss and it auto-decaying as time passes and threads get
delayed on it.

While this works well for iolatency, iocost's control model isn't compatible
with it. There are no repeated "violation" events which can be balanced against
auto-decaying. iocost instead knows how much a given cgroup is over budget and
wants to prevent that cgroup from issuing IOs while over budget. Until now,
iocost has been adding the cost of force-issued IOs. However, this doesn't
reflect the amount which is already over budget and is simply not enough to
counter the auto-decaying, allowing an anon-memory leaking low priority cgroup
to go over its allotted share of IOs.

As auto-decaying doesn't make much sense for iocost, this patch introduces a
different mode of operation for use_delay - when blkcg_set_delay() is used
instead of blkcg_add/use_delay(), the delay duration is not auto-decayed until it
is explicitly cleared with blkcg_clear_delay(). iocost is updated to keep the
delay duration synchronized to the budget overage amount.

With this change, iocost can effectively police cgroups which generate
significant amount of force-issued IOs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

fc94dc72

blk-iocost: Fix error on iocost_ioc_vrate_adj · d14b5329

Authored by Waiman Long on Apr 21, 2020

to #29361128

commit d6c8e949a35d6906d6c03a50e9a9cdf4e494528a upstream.

Systemtap 4.2 is unable to correctly interpret the "u32 (*missed_ppm)[2]"
argument of the iocost_ioc_vrate_adj trace entry defined in
include/trace/events/iocost.h leading to the following error:

  /tmp/stapAcz0G0/stap_c89c58b83cea1724e26395efa9ed4939_6321_aux_6.c:78:8:
  error: expected ‘;’, ‘,’ or ‘)’ before ‘*’ token
   , u32[]* __tracepoint_arg_missed_ppm

That argument type is indeed rather complex and hard to read. Looking
at block/blk-iocost.c, it is just a 2-entry u32 array. By simplifying
the argument to a simple "u32 *missed_ppm" and adjusting the trace
entry accordingly, the compilation error went away.

Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

d14b5329

blk-iocost: remove duplicated lines in comments · 4b3109d5

Authored by Weiping Zhang on Feb 27, 2020

to #29361128

commit fa800d73c8d0d36b1f5929198371f421b69e610e upstream.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

4b3109d5

blk-iocost: fix incorrect vtime comparison in iocg_is_idle() · e7c5f028

Authored by Tejun Heo on Mar 10, 2020

to #29361128

commit dcd6589b11d3b1e71f516a87a7b9646ed356b4c0 upstream.

vtimes may wrap and time_before/after64() should be used to determine
whether a given vtime is before or after another. iocg_is_idle() was
incorrectly using plain "<" comparison to determine whether done_vtime
is before vtime. Here, the only thing we're interested in is whether
done_vtime matches vtime which indicates that there's nothing in
flight. Let's test for inequality instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
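
Why a plain "<" is unsafe for a wrapping 64-bit counter can be shown in a few lines of user-space C. `vtime_before()` mirrors the signed-difference idea behind the kernel's time_before64(), and `nothing_in_flight()` is the equality test the entry settles on; both function names are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Wrap-safe "a is before b": look at the sign of the difference. */
bool vtime_before(uint64_t a, uint64_t b)
{
    return (int64_t)(a - b) < 0;
}

/* Nothing is in flight when the two counters match exactly. */
bool nothing_in_flight(uint64_t done_vtime, uint64_t vtime)
{
    return done_vtime == vtime;
}
```

Take one value just below the 64-bit limit and a second value just past the wrap: the signed-difference test still orders the first before the second, while plain "<" reports the opposite.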

e7c5f028

iocost: over-budget forced IOs should schedule async delay · 9ab225fe

Authored by Tejun Heo on Dec 16, 2019

to #29361128

commit d7bd15a138aef3be227818aad9c501e43c89c8c5 upstream.

When over-budget IOs are force-issued through root cgroup,
iocg_kick_delay() adjusts the async delay accordingly but doesn't
actually schedule async throttle for the issuing task.  This bug is
pretty well masked because sooner or later the offending threads are
gonna get directly throttled on regular IOs or have async delay
scheduled by mem_cgroup_throttle_swaprate().

However, it can affect control quality on filesystem metadata heavy
operations.  Let's fix it by invoking blkcg_schedule_throttle() when
iocg_kick_delay() says async delay is needed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
Cc: stable@vger.kernel.org
Reported-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

9ab225fe

alinux: block: initialize io hang counter · 5d951856

Authored by Xiaoguang Wang on Jul 21, 2020

fix #29420707

Otherwise we'll get a stale io hang counter.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

5d951856

alinux: blk-iocost: bypass IOs earlier if disabled · e166c200

Committed by Joseph Qi on Jul 16, 2020

to #29357063

The blkg lookup or create logic may add significant overhead even when
iocost is disabled. So bypass it earlier in that case.

Fixes: 9da41925 ("alinux: iocost: fix NULL pointer dereference in ioc_rqos_throttle")
Reported-by: Hongnan Li <hongnan.li@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>

alinux: block-throttle: only do io statistics if needed · b8a94ed8

Committed by Xiaoguang Wang on Jul 04, 2020

task #29063222

The current blk throttle code always collects io statistics even when
users don't specify valid throttle rules. This introduces significant
overhead for applications that don't use blk throttle, and it is
worse on arm. See the perf data below, captured on arm:

sudo taskset -c 66 fio -ioengine=io_uring -sqthread_poll=1 -hipri=1
-sqthread_poll_cpu=65 -registerfiles=1 -fixedbufs=1 -direct=1
-filename=/dev/nvme0n1 -bs=4k -iodepth=8 -rw=randwrite  -time_based
-ramp_time=30 -runtime=60  -name="test"

Samples: 25K of event 'cycles', Event count (approx.): 16586974662
Overhead  Command      Shared Object      Symbol
   3.54%  io_uring-sq  [kernel.kallsyms]  [k] throtl_stats_update_completion
   0.89%  io_uring-sq  [kernel.kallsyms]  [k] throtl_bio_end_io
   0.66%  io_uring-sq  [kernel.kallsyms]  [k] blk_throtl_bio
   0.05%  io_uring-sq  [kernel.kallsyms]  [k] blk_throtl_stat_add
   0.05%  io_uring-sq  [kernel.kallsyms]  [k] throtl_track_latency
   0.01%  io_uring-sq  [kernel.kallsyms]  [k] blk_throtl_bio_endio

Samples: 25K of event 'cycles', Event count (approx.): 16586974662
Overhead  Command      Shared Object      Symbol
   1.62%  io_uring-sq  [kernel.kallsyms]  [k] io_submit_sqes
   1.06%  io_uring-sq  [kernel.kallsyms]  [k] io_issue_sqe
   0.32%  io_uring-sq  [kernel.kallsyms]  [k] __io_queue_sqe
   0.06%  io_uring-sq  [kernel.kallsyms]  [k] io_queue_sqe

The above test doesn't set valid blk throttle rules, but the overhead
introduced by blk throttle is even bigger than that of many io_uring
framework functions, which is not acceptable.

To fix this, only do io statistics if users specify valid blk throttle
rules. This also improves performance.

Before this patch:
clat (usec): min=5, max=6871, avg=18.70, stdev=17.89
 lat (usec): min=9, max=6871, avg=18.84, stdev=17.89
WRITE: bw=1618MiB/s (1697MB/s), 1618MiB/s-1618MiB/s (1697MB/s-1697MB/s),
io=94.8GiB (102GB), run=60001-60001msec

With this patch:
clat (usec): min=5, max=7554, avg=17.49, stdev=18.24
 lat (usec): min=9, max=7554, avg=17.62, stdev=18.24
WRITE: bw=1727MiB/s (1810MB/s), 1727MiB/s-1727MiB/s (1810MB/s-1810MB/s),
io=101GiB (109GB), run=60001-60001msec

About 6.6% bps improvement and 6.4% latency reduction.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
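
The summary figures for this commit can be re-derived from the fio output quoted in its message. The helper below is only an illustrative sketch (pct_change() is a hypothetical name, not part of any patch); it computes the relative change between a "before" and an "after" value.

```c
/*
 * Relative change from 'before' to 'after', in percent.
 * Positive means an increase, negative a reduction.
 */
static double pct_change(double before, double after)
{
    return (after - before) / before * 100.0;
}
```

Using the quoted numbers, pct_change(1618.0, 1727.0) gives roughly a 6.7% bandwidth gain and pct_change(18.84, 17.62) gives roughly a 6.5% drop in average latency, close to the figures stated in the commit message.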

blk-iolatency: only call ktime_get() if needed · d02cca18

Committed by Hongnan Li on Jul 06, 2020

to #29139300

commit 6e2fa4dd683a22a7697e7ff51dad499406094d28 upstream

ktime_to_ns(ktime_get()), which is expensive, does not need to be called
if blk_iolatency_enabled() returns false in blkcg_iolatency_done_bio().
Postponing ktime_to_ns(ktime_get()) execution reduces the CPU usage when
blk_iolatency is disabled.

Signed-off-by: Hongnan Li <hongnan.li@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
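
The pattern behind this change, checking the cheap "enabled" condition before paying for a clock read, can be sketched in user space. This is an assumption-laden illustration, not the kernel code: struct lat_stats, done_io() and now_ns() are hypothetical names, and clock_gettime() stands in for ktime_get().

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical per-device latency accounting state. */
struct lat_stats {
    bool enabled;          /* is latency tracking switched on? */
    uint64_t samples;      /* completions accounted so far */
    uint64_t total_ns;     /* sum of measured latencies */
    unsigned clock_reads;  /* how often the clock was actually read */
};

/* Stand-in for ktime_to_ns(ktime_get()): a comparatively costly call. */
static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Completion hook: bail out before touching the clock when disabled. */
static void done_io(struct lat_stats *st, uint64_t start_ns)
{
    uint64_t now;

    if (!st->enabled)
        return;

    now = now_ns();
    st->clock_reads++;
    st->samples++;
    st->total_ns += now - start_ns;
}
```

With tracking disabled, done_io() returns without ever calling now_ns(), which is the saving the commit describes.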

Jun 29, 2020 · 4 commits

blk-mq: use plug for devices that implement ->commit_rqs() · 92339373

Committed by Jens Axboe on Nov 29, 2018

fix #28871358

commit b2c5d16b72df1116f05c9be16a630ac939d34101 upstream

If we have that hook, we know the driver handles bd->last == true in
a smart fashion. If it does, even for multiple hardware queues, it's
a good idea to flush batches of requests to the device, if we have
batches of requests from the submitter.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

blk-mq: use bd->last == true for list inserts · 9e166ffa

Committed by Jens Axboe on Nov 24, 2018

fix #28871358

commit be94f058f2bde6f0b0ee9059a35daa8e15be308f upstream

If we are issuing a list of requests, we know if we're at the last one.
If we fail issuing, ensure that we call ->commit_rqs() to flush any
potential previous requests.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

blk-mq: add mq_ops->commit_rqs() · 0111cff3

Committed by Jens Axboe on Nov 27, 2018

fix #28871358

commit d666ba98f849ad44c4405ecc2180390ebe80f4f9 upstream

blk-mq passes information to the hardware about any given request being
the last that we will issue in this sequence. The point is that hardware
can defer costly doorbell type writes to the last request. But if we run
into errors issuing a sequence of requests, we may never send the request
with bd->last == true set. For that case, we need a hook that tells the
hardware that nothing else is coming right now.

For failures returned by the driver's ->queue_rq() hook, the driver is
responsible for flushing pending requests, if it uses bd->last to
optimize that part. This works like before, no changes there.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
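
The interplay between bd->last and the commit hook can be modeled with a toy device. Everything below is a hypothetical user-space sketch (the toy_* names and the budget parameter are assumptions, not blk-mq API): requests are queued cheaply, the doorbell is written only for the last request, and if the core stops issuing early it calls the commit hook so queued requests are not left stranded.

```c
#include <stdbool.h>

/* Toy device: queued requests reach "hardware" only on a doorbell write. */
struct toy_dev {
    int pending;    /* queued, not yet signalled to the device */
    int submitted;  /* made visible to the device */
    int doorbells;  /* number of costly doorbell writes */
};

/* Write the doorbell if anything is waiting. */
static void toy_ring(struct toy_dev *d)
{
    if (d->pending) {
        d->submitted += d->pending;
        d->pending = 0;
        d->doorbells++;
    }
}

/* Analogue of ->queue_rq(): defer the doorbell unless this is the last. */
static void toy_queue_rq(struct toy_dev *d, bool last)
{
    d->pending++;
    if (last)
        toy_ring(d);
}

/* Analogue of ->commit_rqs(): nothing else is coming right now. */
static void toy_commit_rqs(struct toy_dev *d)
{
    toy_ring(d);
}

/*
 * Issue up to n requests; 'budget' models the core running out of
 * resources before reaching the end of the list.
 * Returns the number of requests actually issued.
 */
static int toy_issue_list(struct toy_dev *d, int n, int budget)
{
    int issued = 0;

    while (issued < n && issued < budget) {
        toy_queue_rq(d, issued == n - 1);
        issued++;
    }
    if (issued < n)  /* stopped early: "last" was never signalled */
        toy_commit_rqs(d);
    return issued;
}
```

In this model a full list of four requests costs one doorbell write, and a list cut short after two requests still ends with both of them submitted thanks to the commit call.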

block: improve logic around when to sort a plug list · 26702d43

Committed by Jens Axboe on Nov 27, 2018

fix #28871358

Only do it if we have requests for multiple queues in the same
plug.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
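
The idea of sorting a plug list only when it holds requests for more than one queue can be sketched as follows. This is a hypothetical model rather than the block layer code: the toy_* types, the fixed-size array, and the sort key (queue id, then sector) are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdlib.h>

#define TOY_PLUG_MAX 16

struct toy_rq {
    int queue_id;  /* which queue the request targets */
    int sector;    /* start position on the device */
};

struct toy_plug {
    struct toy_rq rqs[TOY_PLUG_MAX];
    size_t count;
    bool multiple_queues;  /* set once two different queues are seen */
    unsigned sorts;        /* how many times flush had to sort */
};

/* Add a request and remember whether the plug now spans several queues. */
static bool toy_plug_add(struct toy_plug *p, struct toy_rq rq)
{
    if (p->count == TOY_PLUG_MAX)
        return false;
    if (p->count && p->rqs[p->count - 1].queue_id != rq.queue_id)
        p->multiple_queues = true;
    p->rqs[p->count++] = rq;
    return true;
}

/* Order by queue first, then by sector within a queue. */
static int toy_rq_cmp(const void *a, const void *b)
{
    const struct toy_rq *x = a, *y = b;

    if (x->queue_id != y->queue_id)
        return x->queue_id < y->queue_id ? -1 : 1;
    return (x->sector > y->sector) - (x->sector < y->sector);
}

/* Flush the plug; skip the sort when only one queue is involved. */
static void toy_plug_flush(struct toy_plug *p)
{
    if (p->multiple_queues) {
        qsort(p->rqs, p->count, sizeof(p->rqs[0]), toy_rq_cmp);
        p->sorts++;
    }
    /* A real implementation would dispatch the requests here. */
    p->count = 0;
    p->multiple_queues = false;
}
```

Tracking the flag at insertion time keeps the common single-queue case free of any sorting cost at flush time.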

openanolis / cloud-kernel · synced successfully over 1 year ago
