提交 · 1eb17f5e15b73669df635fb07df2853cb1244a69 · openeuler / Kernel

29 11月, 2021 6 次提交

由 Jan Kara 提交于 11月 25, 2021

Waker - wakee relationships are important in deciding whether one queue
can preempt the other one. Print information about detected waker-wakee
relationships so that scheduling decisions can be better understood from
block traces.
Acked-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-7-jack@suse.czSigned-off-by: NJens Axboe <axboe@kernel.dk>

1eb17f5e

bfq: Limit waker detection in time · 1f18b700

由 Jan Kara 提交于 11月 25, 2021

Currently, when process A starts issuing requests shortly after process
B has completed some IO three times in a row, we decide that B is a
"waker" of A meaning that completing IO of B is needed for A to make
progress and generally stop separating A's and B's IO much. This logic
is useful to avoid unnecessary idling and thus throughput loss for cases
where workload needs to switch e.g. between the process and the
journaling thread doing IO. However the detection heuristic tends to
frequently give false positives when A and B are fighting IO bandwidth
and other processes aren't doing much IO as we are basically deemed to
eventually accumulate three occurences of a situation where one process
starts issuing requests after the other has completed some IO. To reduce
these false positives, cancel the waker detection also if we didn't
accumulate three detected wakeups within given timeout. The rationale is
that if wakeups are really rare, the pointless idling doesn't hurt
throughput that much anyway.

This significantly reduces false waker detection for workload like:

[global]
directory=/mnt/repro/
rw=write
size=8g
time_based
runtime=30
ramp_time=10
blocksize=1m
direct=0
ioengine=sync

[slowwriter]
numjobs=1
fsync=200

[fastwriter]
numjobs=1
fsync=200
Acked-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-5-jack@suse.czSigned-off-by: NJens Axboe <axboe@kernel.dk>

1f18b700

bfq: Limit number of requests consumed by each cgroup · 76f1df88

由 Jan Kara 提交于 11月 25, 2021

When cgroup IO scheduling is used with BFQ it does not really provide
service differentiation if the cgroup drives a big IO depth. That for
example happens with writeback which asynchronously submits lots of IO
but it can happen with AIO as well. The problem is that if we have two
cgroups that submit IO with different weights, the cgroup with higher
weight properly gets more IO time and is able to dispatch more IO.
However this causes lower weight cgroup to accumulate more requests
inside BFQ and eventually lower weight cgroup consumes most of IO
scheduler tags. At that point higher weight cgroup stops getting better
service as it is mostly blocked waiting for a scheduler tag while its
queues inside BFQ are empty and thus lower weight cgroup gets served.

Check how many requests submitting cgroup has allocated in
bfq_limit_depth() and if it consumes more requests than what would
correspond to its weight limit available depth to 1 so that the cgroup
cannot consume many more requests. With this limitation the higher
weight cgroup gets proper service even with writeback.
Reviewed-by: NMichal Koutný <mkoutny@suse.com>
Acked-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-4-jack@suse.czSigned-off-by: NJens Axboe <axboe@kernel.dk>

76f1df88

bfq: Store full bitmap depth in bfq_data · 44dfa279

由 Jan Kara 提交于 11月 25, 2021

Store bitmap depth shift inside bfq_data so that we can use it in
bfq_limit_depth() for proportioning when limiting number of available
request tags for a cgroup.
Acked-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-3-jack@suse.czSigned-off-by: NJens Axboe <axboe@kernel.dk>

44dfa279

bfq: Track number of allocated requests in bfq_entity · 98f04499

由 Jan Kara 提交于 11月 25, 2021

When we want to limit number of requests used by each bfqq and also
cgroup, we need to track also number of requests used by each cgroup.
So track number of allocated requests for each bfq_entity.
Acked-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-2-jack@suse.czSigned-off-by: NJens Axboe <axboe@kernel.dk>

98f04499

block: move io_context creation into where it's needed · 5a9d041b

由 Jens Axboe 提交于 11月 13, 2021

The only user of the io_context for IO is BFQ, yet we put the checking
and logic of it into the normal IO path.

Put the creation into blk_mq_sched_assign_ioc(), and have BFQ use that
helper.
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

5a9d041b

18 10月, 2021 2 次提交

blk-mq: Stop using pointers for blk_mq_tags bitmap tags · ae0f1a73

由 John Garry 提交于 10月 05, 2021

Now that we use shared tags for shared sbitmap support, we don't require
the tags sbitmap pointers, so drop them.

This essentially reverts commit 222a5ae0 ("blk-mq: Use pointers for
blk_mq_tags bitmap tags").

Function blk_mq_init_bitmap_tags() is removed also, since it would be only
a wrappper for blk_mq_init_bitmaps().
Reviewed-by: NMing Lei <ming.lei@redhat.com>
Reviewed-by: NHannes Reinecke <hare@suse.de>
Signed-off-by: NJohn Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/1633429419-228500-14-git-send-email-john.garry@huawei.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

ae0f1a73

block: move elevator.h to block/ · 2e9bc346

由 Christoph Hellwig 提交于 9月 20, 2021

Except for the features passed to blk_queue_required_elevator_features,
elevator.h is only needed internally to the block layer. Move the
ELEVATOR_F_* definitions to blkdev.h, and the move elevator.h to
block/, dropping all the spurious includes outside of that.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-13-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

2e9bc346

28 9月, 2021 1 次提交

Revert "block, bfq: honor already-setup queue merges" · ebc69e89

由 Jens Axboe 提交于 9月 28, 2021

This reverts commit 2d52c58b.

We have had several folks complain that this causes hangs for them, which
is especially problematic as the commit has also hit stable already.

As no resolution seems to be forthcoming right now, revert the patch.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=214503
Fixes: 2d52c58b ("block, bfq: honor already-setup queue merges")
Signed-off-by: NJens Axboe <axboe@kernel.dk>

ebc69e89

02 9月, 2021 1 次提交

block, bfq: honor already-setup queue merges · 2d52c58b

由 Paolo Valente 提交于 8月 02, 2021

The function bfq_setup_merge prepares the merging between two
bfq_queues, say bfqq and new_bfqq. To this goal, it assigns
bfqq->new_bfqq = new_bfqq. Then, each time some I/O for bfqq arrives,
the process that generated that I/O is disassociated from bfqq and
associated with new_bfqq (merging is actually a redirection). In this
respect, bfq_setup_merge increases new_bfqq->ref in advance, adding
the number of processes that are expected to be associated with
new_bfqq.

Unfortunately, the stable-merging mechanism interferes with this
setup. After bfqq->new_bfqq has been set by bfq_setup_merge, and
before all the expected processes have been associated with
bfqq->new_bfqq, bfqq may happen to be stably merged with a different
queue than the current bfqq->new_bfqq. In this case, bfqq->new_bfqq
gets changed. So, some of the processes that have been already
accounted for in the ref counter of the previous new_bfqq will not be
associated with that queue. This creates an unbalance, because those
references will never be decremented.

This commit fixes this issue by reestablishing the previous, natural
behaviour: once bfqq->new_bfqq has been set, it will not be changed
until all expected redirections have occurred.
Signed-off-by: NDavide Zini <davidezini2@gmail.com>
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20210802141352.74353-2-paolo.valente@linaro.orgSigned-off-by: NJens Axboe <axboe@kernel.dk>

2d52c58b

24 8月, 2021 1 次提交

block: add an explicit ->disk backpointer to the request_queue · d152c682

由 Christoph Hellwig 提交于 8月 16, 2021

Replace the magic lookup through the kobject tree with an explicit
backpointer, given that the device model links are set up and torn
down at times when I/O is still possible, leading to potential
NULL or invalid pointer dereferences.

Fixes: edb0872f ("block: move the bdi from the request_queue to the gendisk")
Reported-by: Nsyzbot <syzbot+aa0801b6b32dca9dda82@syzkaller.appspotmail.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Tested-by: NSven Schnelle <svens@linux.ibm.com>
Link: https://lore.kernel.org/r/20210816134624.GA24234@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

d152c682

18 8月, 2021 3 次提交

block: fix default IO priority handling · e70344c0

由 Damien Le Moal 提交于 8月 11, 2021

The default IO priority is the best effort (BE) class with the
normal priority level IOPRIO_NORM (4). However, get_task_ioprio()
returns IOPRIO_CLASS_NONE/IOPRIO_NORM as the default priority and
get_current_ioprio() returns IOPRIO_CLASS_NONE/0. Let's be consistent
with the defined default and have both of these functions return the
default priority IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_NORM) when
the user did not define another default IO priority for the task.

In include/uapi/linux/ioprio.h, introduce the IOPRIO_BE_NORM macro as
an alias to IOPRIO_NORM to clarify that this default level applies to
the BE priotity class. In include/linux/ioprio.h, define the macro
IOPRIO_DEFAULT as IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_BE_NORM)
and use this new macro when setting a priority to the default.
Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
Link: https://lore.kernel.org/r/20210811033702.368488-7-damien.lemoal@wdc.com
[axboe: drop unnecessary lightnvm change]
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e70344c0

block: Introduce IOPRIO_NR_LEVELS · 202bc942

由 Damien Le Moal 提交于 8月 11, 2021

The BFQ scheduler and ioprio_check_cap() both assume that the RT
priority class (IOPRIO_CLASS_RT) can have up to 8 different priority
levels, similarly to the BE class (IOPRIO_CLASS_iBE). This is
controlled using the IOPRIO_BE_NR macro , which is badly named as the
number of levels also applies to the RT class.

Introduce the class independent IOPRIO_NR_LEVELS macro, defined to 8,
to make things clear. Keep the old IOPRIO_BE_NR macro definition as an
alias for IOPRIO_NR_LEVELS.
Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
Link: https://lore.kernel.org/r/20210811033702.368488-6-damien.lemoal@wdc.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

202bc942

block: bfq: fix bfq_set_next_ioprio_data() · a680dd72

由 Damien Le Moal 提交于 8月 11, 2021

For a request that has a priority level equal to or larger than
IOPRIO_BE_NR, bfq_set_next_ioprio_data() prints a critical warning but
defaults to setting the request new_ioprio field to IOPRIO_BE_NR. This
is not consistent with the warning and the allowed values for priority
levels. Fix this by setting the request new_ioprio field to
IOPRIO_BE_NR - 1, the lowest priority level allowed.

Cc: <stable@vger.kernel.org>
Fixes: aee69d78 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: NHannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210811033702.368488-2-damien.lemoal@wdc.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

a680dd72

10 8月, 2021 2 次提交

block: return ELEVATOR_DISCARD_MERGE if possible · 866663b7

由 Ming Lei 提交于 7月 29, 2021

When merging one bio to request, if they are discard IO and the queue
supports multi-range discard, we need to return ELEVATOR_DISCARD_MERGE
because both block core and related drivers(nvme, virtio-blk) doesn't
handle mixed discard io merge(traditional IO merge together with
discard merge) well.

Fix the issue by returning ELEVATOR_DISCARD_MERGE in this situation,
so both blk-mq and drivers just need to handle multi-range discard.
Reported-by: NOleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
Fixes: 2705dfb2 ("block: fix discard request merge")
Link: https://lore.kernel.org/r/20210729034226.1591070-1-ming.lei@redhat.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

866663b7

block: move the bdi from the request_queue to the gendisk · edb0872f

由 Christoph Hellwig 提交于 8月 09, 2021

The backing device information only makes sense for file system I/O,
and thus belongs into the gendisk and not the lower level request_queue
structure. Move it there.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210809141744.1203023-5-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

edb0872f

25 6月, 2021 2 次提交

blk: Fix lock inversion between ioc lock and bfqd lock · fd2ef39c

由 Jan Kara 提交于 6月 23, 2021

Lockdep complains about lock inversion between ioc->lock and bfqd->lock:

bfqd -> ioc:
 put_io_context+0x33/0x90 -> ioc->lock grabbed
 blk_mq_free_request+0x51/0x140
 blk_put_request+0xe/0x10
 blk_attempt_req_merge+0x1d/0x30
 elv_attempt_insert_merge+0x56/0xa0
 blk_mq_sched_try_insert_merge+0x4b/0x60
 bfq_insert_requests+0x9e/0x18c0 -> bfqd->lock grabbed
 blk_mq_sched_insert_requests+0xd6/0x2b0
 blk_mq_flush_plug_list+0x154/0x280
 blk_finish_plug+0x40/0x60
 ext4_writepages+0x696/0x1320
 do_writepages+0x1c/0x80
 __filemap_fdatawrite_range+0xd7/0x120
 sync_file_range+0xac/0xf0

ioc->bfqd:
 bfq_exit_icq+0xa3/0xe0 -> bfqd->lock grabbed
 put_io_context_active+0x78/0xb0 -> ioc->lock grabbed
 exit_io_context+0x48/0x50
 do_exit+0x7e9/0xdd0
 do_group_exit+0x54/0xc0

To avoid this inversion we change blk_mq_sched_try_insert_merge() to not
free the merged request but rather leave that upto the caller similarly
to blk_mq_sched_try_merge(). And in bfq_insert_requests() we make sure
to free all the merged requests after dropping bfqd->lock.

Fixes: aee69d78 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
Reviewed-by: NMing Lei <ming.lei@redhat.com>
Acked-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210623093634.27879-3-jack@suse.czSigned-off-by: NJens Axboe <axboe@kernel.dk>

fd2ef39c

bfq: Remove merged request already in bfq_requests_merged() · a921c655

由 Jan Kara 提交于 6月 23, 2021

Currently, bfq does very little in bfq_requests_merged() and handles all
the request cleanup in bfq_finish_requeue_request() called from
blk_mq_free_request(). That is currently safe only because
blk_mq_free_request() is called shortly after bfq_requests_merged()
while bfqd->lock is still held. However to fix a lock inversion between
bfqd->lock and ioc->lock, we need to call blk_mq_free_request() after
dropping bfqd->lock. That would mean that already merged request could
be seen by other processes inside bfq queues and possibly dispatched to
the device which is wrong. So move cleanup of the request from
bfq_finish_requeue_request() to bfq_requests_merged().
Acked-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210623093634.27879-2-jack@suse.czSigned-off-by: NJens Axboe <axboe@kernel.dk>

a921c655

22 6月, 2021 7 次提交

block, bfq: reset waker pointer with shared queues · 9a2ac41b

由 Paolo Valente 提交于 6月 19, 2021

Commit 85686d0d ("block, bfq: keep shared queues out of the waker
mechanism") leaves shared bfq_queues out of the waker-detection
mechanism. It attains this goal by not updating the pointer
last_completed_rq_bfqq, if the last request completed belongs to a
shared bfq_queue (so that the pointer will not point to the shared
bfq_queue).

Yet this has a side effect: the pointer last_completed_rq_bfqq keeps
pointing, deceptively, to a bfq_queue that actually is not the last
one to have had a request completed. As a consequence, such a
bfq_queue may deceptively be considered as a waker of some bfq_queue,
even of some shared bfq_queue.

To address this issue, reset last_completed_rq_bfqq if the last
request completed belongs to a shared queue.

Fixes: 85686d0d ("block, bfq: keep shared queues out of the waker mechanism")
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20210619140948.98712-8-paolo.valente@linaro.orgSigned-off-by: NJens Axboe <axboe@kernel.dk>

9a2ac41b

block, bfq: check waker only for queues with no in-flight I/O · efc72524

由 Paolo Valente 提交于 6月 19, 2021

Consider two bfq_queues, say Q1 and Q2, with Q2 empty. If a request of
Q1 gets completed shortly before a new request arrives for Q2, then
BFQ flags Q1 as a candidate waker for Q2. Yet, the arrival of this new
request may have a different cause, in the following case. If also Q2
has requests in flight while waiting for the arrival of a new request,
then the completion of its own requests may be the actual cause of the
awakening of the process that sends I/O to Q2. So Q1 may be flagged
wrongly as a candidate waker.

This commit avoids this deceptive flagging, by disabling
candidate-waker flagging for Q2, if Q2 has in-flight I/O.
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20210619140948.98712-7-paolo.valente@linaro.orgSigned-off-by: NJens Axboe <axboe@kernel.dk>

efc72524

block, bfq: avoid delayed merge of async queues · bd3664b3

由 Paolo Valente 提交于 6月 19, 2021

Since commit 430a67f9 ("block, bfq: merge bursts of newly-created
queues"), BFQ may schedule a merge between a newly created sync
bfq_queue, say Q2, and the last sync bfq_queue created, say Q1. To this
goal, BFQ stores the address of Q1 in the field bic->stable_merge_bfqq
of the bic associated with Q2. So, when the time for the possible merge
arrives, BFQ knows which bfq_queue to merge Q2 with. In particular,
BFQ checks for possible merges on request arrivals.

Yet the same bic may also be associated with an async bfq_queue, say
Q3. So, if a request for Q3 arrives, then the above check may happen
to be executed while the bfq_queue at hand is Q3, instead of Q2. In
this case, Q1 happens to be merged with an async bfq_queue. This is
not only a conceptual mistake, because async queues are to be kept out
of queue merging, but also a bug that leads to inconsistent states.

This commits simply filters async queues out of delayed merges.

Fixes: 430a67f9 ("block, bfq: merge bursts of newly-created queues")
Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20210619140948.98712-6-paolo.valente@linaro.orgSigned-off-by: NJens Axboe <axboe@kernel.dk>

bd3664b3

block, bfq: boost throughput by extending queue-merging times · 7812472f

由 Pietro Pedroni 提交于 6月 19, 2021

One of the methods with which bfq boosts throughput is by merging queues.
One of the merging variants in bfq is the stable merge.
This mechanism is activated between two queues only if they are created
within a certain maximum time T1 from each other.
Merging can happen soon or be delayed. In the second case, before
merging, bfq needs to evaluate a throughput-boost parameter that
indicates whether the queue generates a high throughput is served alone.
Merging occurs when this throughput-boost is not high enough.
In particular, this parameter is evaluated and late merging may occur
only after at least a time T2 from the creation of the queue.

Currently T1 and T2 are set to 180ms and 200ms, respectively.
In this way the merging mechanism rarely occurs because time is not
enough. This results in a noticeable lowering of the overall throughput
with some workloads (see the example below).

This commit introduces two constants bfq_activation_stable_merging and
bfq_late_stable_merging in order to increase the duration of T1 and T2.
Both the stable merging activation time and the late merging
time are set to 600ms. This value has been experimentally evaluated
using sqlite benchmark in the Phoronix Test Suite on a HDD.
The duration of the benchmark before this fix was 111.02s, while now
it has reached 97.02s, a better result than that of all the other
schedulers.
Signed-off-by: NPietro Pedroni <pedroni.pietro.96@gmail.com>
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20210619140948.98712-5-paolo.valente@linaro.orgSigned-off-by: NJens Axboe <axboe@kernel.dk>

7812472f

block, bfq: consider also creation time in delayed stable merge · d4f49983

由 Paolo Valente 提交于 6月 19, 2021

Since commit 430a67f9 ("block, bfq: merge bursts of newly-created
queues"), BFQ may schedule a merge between a newly created sync
bfq_queue and the last sync bfq_queue created. Such a merging is not
performed immediately, because BFQ needs first to find out whether the
newly created queue actually reaches a higher throughput if not merged
at all (and in that case BFQ will not perform any stable merging). To
check that, a little time must be waited after the creation of the new
queue, so that some I/O can flow in the queue, and statistics on such
I/O can be computed.

Yet, to evaluate the above waiting time, the last split time is
considered as start time, instead of the creation time of the
queue. This is a mistake, because considering the split time is
correct only in the following scenario.

The queue undergoes a non-stable merges on the arrival of its very
first I/O request, due to close I/O with some other queue. While the
queue is merged for close I/O, stable merging is not considered. Yet
the queue may then happen to be split, if the close I/O finishes (or
happens to be a false positive). From this time on, the queue can
again be considered for stable merging. But, again, a little time must
elapse, to let some new I/O flow in the queue and to get updated
statistics. To wait for this time, the split time is to be taken into
account.

Yet, if the queue does not undergo a non-stable merge on the arrival
of its very first request, then BFQ immediately checks whether the
stable merge is to be performed. It happens because the split time for
a queue is initialized to minus infinity when the queue is created.

This commit fixes this mistake by adding the missing condition. Now
the check for delayed stable-merge is performed after a little time is
elapsed not only from the last queue split time, but also from the
creation time of the queue.

Fixes: 430a67f9 ("block, bfq: merge bursts of newly-created queues")
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20210619140948.98712-4-paolo.valente@linaro.orgSigned-off-by: NJens Axboe <axboe@kernel.dk>

d4f49983

block, bfq: fix delayed stable merge check · e03f2ab7

由 Luca Mariotti 提交于 6月 19, 2021

When attempting to schedule a merge of a given bfq_queue with the currently
in-service bfq_queue or with a cooperating bfq_queue among the scheduled
bfq_queues, delayed stable merge is checked for rotational or non-queueing
devs. For this stable merge to be performed, some conditions must be met.
If the current bfq_queue underwent some split from some merged bfq_queue,
one of these conditions is that two hundred milliseconds must elapse from
split, otherwise this condition is always met.

Unfortunately, by mistake, time_is_after_jiffies() was written instead of
time_is_before_jiffies() for this check, verifying that less than two
hundred milliseconds have elapsed instead of verifying that at least two
hundred milliseconds have elapsed.

Fix this issue by replacing time_is_after_jiffies() with
time_is_before_jiffies().
Signed-off-by: NLuca Mariotti <mariottiluca1@hotmail.it>
Signed-off-by: NPaolo Valente <paolo.valente@unimore.it>
Signed-off-by: NPietro Pedroni <pedroni.pietro.96@gmail.com>
Link: https://lore.kernel.org/r/20210619140948.98712-3-paolo.valente@linaro.orgSigned-off-by: NJens Axboe <axboe@kernel.dk>

e03f2ab7

block, bfq: let also stably merged queues enjoy weight raising · 511a2699

由 Paolo Valente 提交于 6月 19, 2021

Merged bfq_queues are kept out of weight-raising (low-latency)
mechanisms. The reason is that these queues are usually created for
non-interactive and non-soft-real-time tasks. Yet this is not the case
for stably-merged queues. These queues are merged just because they
are created shortly after each other. So they may easily serve the I/O
of an interactive or soft-real time application, if the application
happens to spawn multiple processes.

To address this issue, this commits lets also stably-merged queued
enjoy weight raising.
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20210619140948.98712-2-paolo.valente@linaro.orgSigned-off-by: NJens Axboe <axboe@kernel.dk>

511a2699

12 5月, 2021 1 次提交

block, bfq: avoid circular stable merges · 7ea96eef

由 Paolo Valente 提交于 5月 12, 2021

BFQ may merge a new bfq_queue, stably, with the last bfq_queue
created. In particular, BFQ first waits a little bit for some I/O to
flow inside the new queue, say Q2, if this is needed to understand
whether it is better or worse to merge Q2 with the last queue created,
say Q1. This delayed stable merge is performed by assigning
bic->stable_merge_bfqq = Q1, for the bic associated with Q1.

Yet, while waiting for some I/O to flow in Q2, a non-stable queue
merge of Q2 with Q1 may happen, causing the bic previously associated
with Q2 to be associated with exactly Q1 (bic->bfqq = Q1). After that,
Q2 and Q1 may happen to be split, and, in the split, Q1 may happen to
be recycled as a non-shared bfq_queue. In that case, Q1 may then
happen to undergo a stable merge with the bfq_queue pointed by
bic->stable_merge_bfqq. Yet bic->stable_merge_bfqq still points to
Q1. So Q1 would be merged with itself.

This commit fixes this error by intercepting this situation, and
canceling the schedule of the stable merge.

Fixes: 430a67f9 ("block, bfq: merge bursts of newly-created queues")
Signed-off-by: NPietro Pedroni <pedroni.pietro.96@gmail.com>
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20210512094352.85545-2-paolo.valente@linaro.orgSigned-off-by: NJens Axboe <axboe@kernel.dk>

7ea96eef

11 5月, 2021 1 次提交

kyber: fix out of bounds access when preempted · efed9a33

由 Omar Sandoval 提交于 5月 10, 2021

__blk_mq_sched_bio_merge() gets the ctx and hctx for the current CPU and
passes the hctx to ->bio_merge(). kyber_bio_merge() then gets the ctx
for the current CPU again and uses that to get the corresponding Kyber
context in the passed hctx. However, the thread may be preempted between
the two calls to blk_mq_get_ctx(), and the ctx returned the second time
may no longer correspond to the passed hctx. This "works" accidentally
most of the time, but it can cause us to read garbage if the second ctx
came from an hctx with more ctx's than the first one (i.e., if
ctx->index_hw[hctx->type] > hctx->nr_ctx).

This manifested as this UBSAN array index out of bounds error reported
by Jakub:

UBSAN: array-index-out-of-bounds in ../kernel/locking/qspinlock.c:130:9
index 13106 is out of range for type 'long unsigned int [128]'
Call Trace:
 dump_stack+0xa4/0xe5
 ubsan_epilogue+0x5/0x40
 __ubsan_handle_out_of_bounds.cold.13+0x2a/0x34
 queued_spin_lock_slowpath+0x476/0x480
 do_raw_spin_lock+0x1c2/0x1d0
 kyber_bio_merge+0x112/0x180
 blk_mq_submit_bio+0x1f5/0x1100
 submit_bio_noacct+0x7b0/0x870
 submit_bio+0xc2/0x3a0
 btrfs_map_bio+0x4f0/0x9d0
 btrfs_submit_data_bio+0x24e/0x310
 submit_one_bio+0x7f/0xb0
 submit_extent_page+0xc4/0x440
 __extent_writepage_io+0x2b8/0x5e0
 __extent_writepage+0x28d/0x6e0
 extent_write_cache_pages+0x4d7/0x7a0
 extent_writepages+0xa2/0x110
 do_writepages+0x8f/0x180
 __writeback_single_inode+0x99/0x7f0
 writeback_sb_inodes+0x34e/0x790
 __writeback_inodes_wb+0x9e/0x120
 wb_writeback+0x4d2/0x660
 wb_workfn+0x64d/0xa10
 process_one_work+0x53a/0xa80
 worker_thread+0x69/0x5b0
 kthread+0x20b/0x240
 ret_from_fork+0x1f/0x30

Only Kyber uses the hctx, so fix it by passing the request_queue to
->bio_merge() instead. BFQ and mq-deadline just use that, and Kyber can
map the queues itself to avoid the mismatch.

Fixes: a6088845 ("block: kyber: make kyber more friendly with merging")
Reported-by: NJakub Kicinski <kuba@kernel.org>
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Link: https://lore.kernel.org/r/c7598605401a48d5cfeadebb678abd10af22b83f.1620691329.git.osandov@fb.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

efed9a33

16 4月, 2021 1 次提交

bfq/mq-deadline: remove redundant check for passthrough request · 7687b38a

由 Lin Feng 提交于 4月 15, 2021

Since commit 01e99aec 'blk-mq: insert passthrough request into
hctx->dispatch directly', passthrough request should not appear in
IO-scheduler any more, so blk_rq_is_passthrough checking in addon IO
schedulers is redundant.

(Notes: this patch passes generic IO load test with hdds under SAS
controller and hdds under AHCI controller but obviously not covers all.
Not sure if passthrough request can still escape into IO scheduler from
blk_mq_sched_insert_requests, which is used by blk_mq_flush_plug_list and
has lots of indirect callers.)
Signed-off-by: NLin Feng <linf@wangsu.com>
Reviewed-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

7687b38a

26 3月, 2021 6 次提交

block, bfq: merge bursts of newly-created queues · 430a67f9

由 Paolo Valente 提交于 3月 04, 2021

Many throughput-sensitive workloads are made of several parallel I/O
flows, with all flows generated by the same application, or more
generically by the same task (e.g., system boot). The most
counterproductive action with these workloads is plugging I/O dispatch
when one of the bfq_queues associated with these flows remains
temporarily empty.

To avoid this plugging, BFQ has been using a burst-handling mechanism
for years now. This mechanism has proven effective for throughput, and
not detrimental for service guarantees. This commit pushes this
mechanism a little bit further, basing on the following two facts.

First, all the I/O flows of a the same application or task contribute
to the execution/completion of that common application or task. So the
performance figures that matter are total throughput of the flows and
task-wide I/O latency. In particular, these flows do not need to be
protected from each other, in terms of individual bandwidth or
latency.

Second, the above fact holds regardless of the number of flows.

Putting these two facts together, this commits merges stably the
bfq_queues associated with these I/O flows, i.e., with the processes
that generate these IO/ flows, regardless of how many the involved
processes are.

To decide whether a set of bfq_queues is actually associated with the
I/O flows of a common application or task, and to merge these queues
stably, this commit operates as follows: given a bfq_queue, say Q2,
currently being created, and the last bfq_queue, say Q1, created
before Q2, Q2 is merged stably with Q1 if
- very little time has elapsed since when Q1 was created
- Q2 has the same ioprio as Q1
- Q2 belongs to the same group as Q1

Merging bfq_queues also reduces scheduling overhead. A fio test with
ten random readers on /dev/nullb shows a throughput boost of 40%, with
a quadcore. Since BFQ's execution time amounts to ~50% of the total
per-request processing time, the above throughput boost implies that
BFQ's overhead is reduced by more than 50%.
Tested-by: NJan Kara <jack@suse.cz>
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
Link: https://lore.kernel.org/r/20210304174627.161-7-paolo.valente@linaro.orgSigned-off-by: NJens Axboe <axboe@kernel.dk>

430a67f9

block, bfq: keep shared queues out of the waker mechanism · 85686d0d

由 Paolo Valente 提交于 3月 04, 2021

Shared queues are likely to receive I/O at a high rate. This may
deceptively let them be considered as wakers of other queues. But a
false waker will unjustly steal bandwidth to its supposedly woken
queue. So considering also shared queues in the waking mechanism may
cause more control troubles than throughput benefits. This commit
keeps shared queues out of the waker-detection mechanism.
Tested-by: NJan Kara <jack@suse.cz>
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
Link: https://lore.kernel.org/r/20210304174627.161-6-paolo.valente@linaro.orgSigned-off-by: NJens Axboe <axboe@kernel.dk>

85686d0d

block, bfq: fix weight-raising resume with !low_latency · 8c544770

由 Paolo Valente 提交于 3月 04, 2021

When the io_latency heuristic is off, bfq_queues must not start to be
weight-raised. Unfortunately, by mistake, this may happen when the
state of a previously weight-raised bfq_queue is resumed after a queue
split. This commit fixes this error.
Tested-by: NJan Kara <jack@suse.cz>
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
Link: https://lore.kernel.org/r/20210304174627.161-5-paolo.valente@linaro.orgSigned-off-by: NJens Axboe <axboe@kernel.dk>

8c544770

block, bfq: make shared queues inherit wakers · 8ef3fc3a

由 Paolo Valente 提交于 3月 04, 2021

Consider a bfq_queue bfqq that is about to be merged with another
bfq_queue new_bfqq. The processes associated with bfqq are cooperators
of the processes associated with new_bfqq. So, if bfqq has a waker,
then it is reasonable (and beneficial for throughput) to assume that
all these processes will be happy to let bfqq's waker freely inject
I/O when they have no I/O. So this commit makes new_bfqq inherit
bfqq's waker.
Tested-by: NJan Kara <jack@suse.cz>
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
Link: https://lore.kernel.org/r/20210304174627.161-4-paolo.valente@linaro.orgSigned-off-by: NJens Axboe <axboe@kernel.dk>

8ef3fc3a

block, bfq: put reqs of waker and woken in dispatch list · 7cc4ffc5

由 Paolo Valente 提交于 3月 04, 2021

Consider a new I/O request that arrives for a bfq_queue bfqq. If, when
this happens, the only active bfq_queues are bfqq and either its waker
bfq_queue or one of its woken bfq_queues, then there is no point in
queueing this new I/O request in bfqq for service. In fact, the
in-service queue and bfqq agree on serving this new I/O request as
soon as possible. So this commit puts this new I/O request directly
into the dispatch list.
Tested-by: NJan Kara <jack@suse.cz>
Acked-by: NJan Kara <jack@suse.cz>
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
Link: https://lore.kernel.org/r/20210304174627.161-3-paolo.valente@linaro.orgSigned-off-by: NJens Axboe <axboe@kernel.dk>

7cc4ffc5

block, bfq: always inject I/O of queues blocked by wakers · 2ec5a5c4

由 Paolo Valente 提交于 3月 04, 2021

Suppose that I/O dispatch is plugged, to wait for new I/O for the
in-service bfq-queue, say bfqq. Suppose then that there is a further
bfq_queue woken by bfqq, and that this woken queue has pending I/O. A
woken queue does not steal bandwidth from bfqq, because it remains
soon without I/O if bfqq is not served. So there is virtually no risk
of loss of bandwidth for bfqq if this woken queue has I/O dispatched
while bfqq is waiting for new I/O. In contrast, this extra I/O
injection boosts throughput. This commit performs this extra
injection.
Tested-by: NJan Kara <jack@suse.cz>
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
Link: https://lore.kernel.org/r/20210304174627.161-2-paolo.valente@linaro.orgSigned-off-by: NJens Axboe <axboe@kernel.dk>

2ec5a5c4

03 3月, 2021 1 次提交

block/bfq: update comments and default value in docs for fifo_expire · 4168a8d2

由 Joseph Qi 提交于 2月 23, 2021

Correct the comments since bfq_fifo_expire[0] is for async request,
while bfq_fifo_expire[1] is for sync request.
Also update docs, according the source code, the default
fifo_expire_async is 250ms, and fifo_expire_sync is 125ms.
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

4168a8d2

22 2月, 2021 1 次提交

block: get rid of the trace rq insert wrapper · b357e4a6

由 Chaitanya Kulkarni 提交于 2月 21, 2021

Get rid of the wrapper for trace_block_rq_insert() and call the function
directly.
Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: NDamien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b357e4a6

03 2月, 2021 1 次提交

bfq-iosched: Revert "bfq: Fix computation of shallow depth" · 388c705b

由 Lin Feng 提交于 2月 02, 2021

This reverts commit 6d4d2735.

bfq.limit_depth passes word_depths[] as shallow_depth down to sbitmap core
sbitmap_get_shallow, which uses just the number to limit the scan depth of
each bitmap word, formula:
scan_percentage_for_each_word = shallow_depth / (1 << sbimap->shift) * 100%

That means the comments's percentiles 50%, 75%, 18%, 37% of bfq are correct.
But after commit patch 'bfq: Fix computation of shallow depth', we use
sbitmap.depth instead, as a example in following case:

sbitmap.depth = 256, map_nr = 4, shift = 6; sbitmap_word.depth = 64.
The resulsts of computed bfqd->word_depths[] are {128, 192, 48, 96}, and
three of the numbers exceed core dirver's 'sbitmap_word.depth=64' limit
nothing.
Signed-off-by: NLin Feng <linf@wangsu.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

388c705b

28 1月, 2021 3 次提交

bfq: Use only idle IO periods for think time calculations · 7684fbde

由 Jan Kara 提交于 6月 05, 2020

Currently whenever bfq queue has a request queued we add now -
last_completion_time to the think time statistics. This is however
misleading in case the process is able to submit several requests in
parallel because e.g. if the queue has request completed at time T0 and
then queues new requests at times T1, T2, then we will add T1-T0 and
T2-T0 to think time statistics which just doesn't make any sence (the
queue's think time is penalized by the queue being able to submit more
IO). So add to think time statistics only time intervals when the queue
had no IO pending.
Signed-off-by: NJan Kara <jack@suse.cz>
Acked-by: NPaolo Valente <paolo.valente@linaro.org>
[axboe: fix whitespace on empty line]
Signed-off-by: NJens Axboe <axboe@kernel.dk>

7684fbde

bfq: Use 'ttime' local variable · 28c6def0

由 Jan Kara 提交于 6月 05, 2020

Use local variable 'ttime' instead of dereferencing bfqq.
Signed-off-by: NJan Kara <jack@suse.cz>
Acked-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

28c6def0

bfq: Avoid false bfq queue merging · 41e76c85

由 Jan Kara 提交于 6月 05, 2020

bfq_setup_cooperator() uses bfqd->in_serv_last_pos so detect whether it
makes sense to merge current bfq queue with the in-service queue.
However if the in-service queue is freshly scheduled and didn't dispatch
any requests yet, bfqd->in_serv_last_pos is stale and contains value
from the previously scheduled bfq queue which can thus result in a bogus
decision that the two queues should be merged. This bug can be observed
for example with the following fio jobfile:

[global]
direct=0
ioengine=sync
invalidate=1
size=1g
rw=read

[reader]
numjobs=4
directory=/mnt

where the 4 processes will end up in the one shared bfq queue although
they do IO to physically very distant files (for some reason I was able to
observe this only with slice_idle=1ms setting).

Fix the problem by invalidating bfqd->in_serv_last_pos when switching
in-service queue.

Fixes: 058fdecc ("block, bfq: fix in-service-queue check for queue merging")
CC: stable@vger.kernel.org
Signed-off-by: NJan Kara <jack@suse.cz>
Acked-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

41e76c85

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功