提交 · d322f355e9368045ff89b7a3df9324fe0c33839b · openeuler / Kernel

23 8月, 2022 2 次提交

block, bfq: remove useless parameter for bfq_add/del_bfqq_busy() · d322f355

由 Yu Kuai 提交于 8月 16, 2022

'bfqd' can be accessed through 'bfqq->bfqd', there is no need to pass
it as a parameter separately.
Signed-off-by: NYu Kuai <yukuai3@huawei.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220816015631.1323948-4-yukuai1@huaweicloud.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

d322f355

block, bfq: remove unused functions · c2090eab

由 Yu Kuai 提交于 8月 16, 2022

While doing code coverage testing(CONFIG_BFQ_CGROUP_DEBUG is disabled), we
found that some functions doesn't have caller, thus remove them.
Signed-off-by: NYu Kuai <yukuai3@huawei.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220816015631.1323948-2-yukuai1@huaweicloud.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

c2090eab

15 7月, 2022 1 次提交

block/bfq: Use the new blk_opf_t type · dc469ba2

由 Bart Van Assche 提交于 7月 14, 2022

Use the new blk_opf_t type for arguments and variables that represent
request flags or a bitwise combination of a request operation and
request flags. Rename those variables from 'op' into 'opf'.

This patch does not change any functionality.

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: NBart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-8-bvanassche@acm.orgSigned-off-by: NJens Axboe <axboe@kernel.dk>

dc469ba2

19 5月, 2022 1 次提交

bfq: Relax waker detection for shared queues · f9506673

由 Jan Kara 提交于 5月 19, 2022

Currently we look for waker only if current queue has no requests. This
makes sense for bfq queues with a single process however for shared
queues when there is a larger number of processes the condition that
queue has no requests is difficult to meet because often at least one
process has some request in flight although all the others are waiting
for the waker to do the work and this harms throughput. Relax the "no
queued request for bfq queue" condition to "the current task has no
queued requests yet". For this, we also need to start tracking number of
requests in flight for each task.

This patch (together with the following one) restores the performance
for dbench with 128 clients that regressed with commit c65e6fd4
("bfq: Do not let waker requests skip proper accounting") because
this commit makes requests of wakers properly enter BFQ queues and thus
these queues become ineligible for the old waker detection logic.
Dbench results:

Vanilla 5.18-rc3 5.18-rc3 + revert 5.18-rc3 patched
Mean 1237.36 ( 0.00%) 950.16 * 23.21%* 988.35 * 20.12%*

Numbers are time to complete workload so lower is better.

Fixes: c65e6fd4 ("bfq: Do not let waker requests skip proper accounting")
Signed-off-by: NJan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220519105235.31397-1-jack@suse.czSigned-off-by: NJens Axboe <axboe@kernel.dk>

f9506673

03 5月, 2022 1 次提交

blktrace: cleanup the __trace_note_message interface · f4a6a61c

由 Christoph Hellwig 提交于 4月 20, 2022

Pass the cgroup_subsys_state instead of a the blkg so that blktrace
doesn't need to poke into blk-cgroup internals, and give the name a
blk prefix as the current name is way too generic for a public
interface.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Acked-by: NTejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220420042723.1010598-9-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

f4a6a61c

18 4月, 2022 3 次提交

bfq: Get rid of __bio_blkcg() usage · 4e54a249

由 Jan Kara 提交于 4月 01, 2022

BFQ usage of __bio_blkcg() is a relict from the past. Furthermore if bio
would not be associated with any blkcg, the usage of __bio_blkcg() in
BFQ is prone to races with the task being migrated between cgroups as
__bio_blkcg() calls at different places could return different blkcgs.

Convert BFQ to the new situation where bio->bi_blkg is initialized in
bio_set_dev() and thus practically always valid. This allows us to save
blkcg_gq lookup and noticeably simplify the code.

CC: stable@vger.kernel.org
Fixes: 0fe061b9 ("blkcg: fix ref count issue with bio_blkcg() using task_css")
Tested-by: N"yukuai (C)" <yukuai3@huawei.com>
Signed-off-by: NJan Kara <jack@suse.cz>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220401102752.8599-8-jack@suse.czSigned-off-by: NJens Axboe <axboe@kernel.dk>

4e54a249

bfq: Track whether bfq_group is still online · 09f87186

由 Jan Kara 提交于 4月 01, 2022

Track whether bfq_group is still online. We cannot rely on
blkcg_gq->online because that gets cleared only after all policies are
offlined and we need something that gets updated already under
bfqd->lock when we are cleaning up our bfq_group to be able to guarantee
that when we see online bfq_group, it will stay online while we are
holding bfqd->lock lock.

CC: stable@vger.kernel.org
Tested-by: N"yukuai (C)" <yukuai3@huawei.com>
Signed-off-by: NJan Kara <jack@suse.cz>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220401102752.8599-7-jack@suse.czSigned-off-by: NJens Axboe <axboe@kernel.dk>

09f87186

bfq: Split shared queues on move between cgroups · 3bc5e683

由 Jan Kara 提交于 4月 01, 2022

When bfqq is shared by multiple processes it can happen that one of the
processes gets moved to a different cgroup (or just starts submitting IO
for different cgroup). In case that happens we need to split the merged
bfqq as otherwise we will have IO for multiple cgroups in one bfqq and
we will just account IO time to wrong entities etc.

Similarly if the bfqq is scheduled to merge with another bfqq but the
merge didn't happen yet, cancel the merge as it need not be valid
anymore.

CC: stable@vger.kernel.org
Fixes: e21b7a0b ("block, bfq: add full hierarchical scheduling and cgroups support")
Tested-by: N"yukuai (C)" <yukuai3@huawei.com>
Signed-off-by: NJan Kara <jack@suse.cz>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220401102752.8599-3-jack@suse.czSigned-off-by: NJens Axboe <axboe@kernel.dk>

3bc5e683

18 2月, 2022 1 次提交

block, bfq: cleanup bfq_bfqq_to_bfqg() · 43a4b1fe

由 Yu Kuai 提交于 1月 29, 2022

Use bfq_group() instead, which do the same thing.
Signed-off-by: NYu Kuai <yukuai3@huawei.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Acked-by: NPaolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220129015924.3958918-2-yukuai3@huawei.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

43a4b1fe

12 2月, 2022 1 次提交

block: partition include/linux/blk-cgroup.h · 672fdcf0

由 Ming Lei 提交于 2月 11, 2022

Partition include/linux/blk-cgroup.h into two parts: one is public part,
the other is block layer private part.

Suggested by Christoph Hellwig.
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220211101149.2368042-4-ming.lei@redhat.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

672fdcf0

29 11月, 2021 4 次提交

bfq: Provide helper to generate bfqq name · 582f04e1

由 Jan Kara 提交于 11月 25, 2021

Instead of having helper formating bfqq pid, provide a helper to
generate full bfqq name as used in the traces. It saves some code
duplication and will save more in the coming tracepoints.
Acked-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-6-jack@suse.czSigned-off-by: NJens Axboe <axboe@kernel.dk>

582f04e1

bfq: Limit waker detection in time · 1f18b700

由 Jan Kara 提交于 11月 25, 2021

Currently, when process A starts issuing requests shortly after process
B has completed some IO three times in a row, we decide that B is a
"waker" of A meaning that completing IO of B is needed for A to make
progress and generally stop separating A's and B's IO much. This logic
is useful to avoid unnecessary idling and thus throughput loss for cases
where workload needs to switch e.g. between the process and the
journaling thread doing IO. However the detection heuristic tends to
frequently give false positives when A and B are fighting IO bandwidth
and other processes aren't doing much IO as we are basically deemed to
eventually accumulate three occurences of a situation where one process
starts issuing requests after the other has completed some IO. To reduce
these false positives, cancel the waker detection also if we didn't
accumulate three detected wakeups within given timeout. The rationale is
that if wakeups are really rare, the pointless idling doesn't hurt
throughput that much anyway.

This significantly reduces false waker detection for workload like:

[global]
directory=/mnt/repro/
rw=write
size=8g
time_based
runtime=30
ramp_time=10
blocksize=1m
direct=0
ioengine=sync

[slowwriter]
numjobs=1
fsync=200

[fastwriter]
numjobs=1
fsync=200
Acked-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-5-jack@suse.czSigned-off-by: NJens Axboe <axboe@kernel.dk>

1f18b700

bfq: Store full bitmap depth in bfq_data · 44dfa279

由 Jan Kara 提交于 11月 25, 2021

Store bitmap depth shift inside bfq_data so that we can use it in
bfq_limit_depth() for proportioning when limiting number of available
request tags for a cgroup.
Acked-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-3-jack@suse.czSigned-off-by: NJens Axboe <axboe@kernel.dk>

44dfa279

bfq: Track number of allocated requests in bfq_entity · 98f04499

由 Jan Kara 提交于 11月 25, 2021

When we want to limit number of requests used by each bfqq and also
cgroup, we need to track also number of requests used by each cgroup.
So track number of allocated requests for each bfq_entity.
Acked-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-2-jack@suse.czSigned-off-by: NJens Axboe <axboe@kernel.dk>

98f04499

25 8月, 2021 1 次提交

block, bfq: cleanup the repeated declaration · 1e294970

由 Shaokun Zhang 提交于 8月 25, 2021

Function 'bfq_entity_to_bfqq' is declared twice, so remove the
repeated declaration and blank line.

Cc: Paolo Valente <paolo.valente@linaro.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: NShaokun Zhang <zhangshaokun@hisilicon.com>
Link: https://lore.kernel.org/r/1629872391-46399-1-git-send-email-zhangshaokun@hisilicon.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

1e294970

18 8月, 2021 1 次提交

block: Introduce IOPRIO_NR_LEVELS · 202bc942

由 Damien Le Moal 提交于 8月 11, 2021

The BFQ scheduler and ioprio_check_cap() both assume that the RT
priority class (IOPRIO_CLASS_RT) can have up to 8 different priority
levels, similarly to the BE class (IOPRIO_CLASS_iBE). This is
controlled using the IOPRIO_BE_NR macro , which is badly named as the
number of levels also applies to the RT class.

Introduce the class independent IOPRIO_NR_LEVELS macro, defined to 8,
to make things clear. Keep the old IOPRIO_BE_NR macro definition as an
alias for IOPRIO_NR_LEVELS.
Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
Link: https://lore.kernel.org/r/20210811033702.368488-6-damien.lemoal@wdc.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

202bc942

26 3月, 2021 1 次提交

block, bfq: merge bursts of newly-created queues · 430a67f9

由 Paolo Valente 提交于 3月 04, 2021

Many throughput-sensitive workloads are made of several parallel I/O
flows, with all flows generated by the same application, or more
generically by the same task (e.g., system boot). The most
counterproductive action with these workloads is plugging I/O dispatch
when one of the bfq_queues associated with these flows remains
temporarily empty.

To avoid this plugging, BFQ has been using a burst-handling mechanism
for years now. This mechanism has proven effective for throughput, and
not detrimental for service guarantees. This commit pushes this
mechanism a little bit further, basing on the following two facts.

First, all the I/O flows of a the same application or task contribute
to the execution/completion of that common application or task. So the
performance figures that matter are total throughput of the flows and
task-wide I/O latency. In particular, these flows do not need to be
protected from each other, in terms of individual bandwidth or
latency.

Second, the above fact holds regardless of the number of flows.

Putting these two facts together, this commits merges stably the
bfq_queues associated with these I/O flows, i.e., with the processes
that generate these IO/ flows, regardless of how many the involved
processes are.

To decide whether a set of bfq_queues is actually associated with the
I/O flows of a common application or task, and to merge these queues
stably, this commit operates as follows: given a bfq_queue, say Q2,
currently being created, and the last bfq_queue, say Q1, created
before Q2, Q2 is merged stably with Q1 if
- very little time has elapsed since when Q1 was created
- Q2 has the same ioprio as Q1
- Q2 belongs to the same group as Q1

Merging bfq_queues also reduces scheduling overhead. A fio test with
ten random readers on /dev/nullb shows a throughput boost of 40%, with
a quadcore. Since BFQ's execution time amounts to ~50% of the total
per-request processing time, the above throughput boost implies that
BFQ's overhead is reduced by more than 50%.
Tested-by: NJan Kara <jack@suse.cz>
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
Link: https://lore.kernel.org/r/20210304174627.161-7-paolo.valente@linaro.orgSigned-off-by: NJens Axboe <axboe@kernel.dk>

430a67f9

26 1月, 2021 4 次提交

block, bfq: make waker-queue detection more robust · 71217df3

由 Paolo Valente 提交于 1月 25, 2021

In the presence of many parallel I/O flows, the detection of waker
bfq_queues suffers from false positives. This commits addresses this
issue by making the filtering of actual wakers more selective. In more
detail, a candidate waker must be found to meet waker requirements
three times before being promoted to actual waker.
Tested-by: NJan Kara <jack@suse.cz>
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

71217df3

block, bfq: save also injection state on queue merging · 5a5436b9

由 Paolo Valente 提交于 1月 25, 2021

To prevent injection information from being lost on bfq_queue merging,
also the amount of service that a bfq_queue receives must be saved and
restored when the bfq_queue is merged and split, respectively.
Tested-by: NJan Kara <jack@suse.cz>
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

5a5436b9

block, bfq: save also weight-raised service on queue merging · e673914d

由 Paolo Valente 提交于 1月 25, 2021

To prevent weight-raising information from being lost on bfq_queue merging,
also the amount of service that a bfq_queue receives must be saved and
restored when the bfq_queue is merged and split, respectively.
Tested-by: NJan Kara <jack@suse.cz>
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e673914d

block, bfq: replace mechanism for evaluating I/O intensity · eb2fd80f

由 Paolo Valente 提交于 1月 25, 2021

Some BFQ mechanisms make their decisions on a bfq_queue basing also on
whether the bfq_queue is I/O bound. In this respect, the current logic
for evaluating whether a bfq_queue is I/O bound is rather rough. This
commits replaces this logic with a more effective one.

The new logic measures the percentage of time during which a bfq_queue
is active, and marks the bfq_queue as I/O bound if the latter if this
percentage is above a fixed threshold.
Tested-by: NJan Kara <jack@suse.cz>
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

eb2fd80f

18 8月, 2020 1 次提交

bfq: fix blkio cgroup leakage v4 · 2de791ab

由 Dmitry Monakhov 提交于 8月 11, 2020

Changes from v1:
    - update commit description with proper ref-accounting justification

commit db37a34c ("block, bfq: get a ref to a group when adding it to a service tree")
introduce leak forbfq_group and blkcg_gq objects because of get/put
imbalance.
In fact whole idea of original commit is wrong because bfq_group entity
can not dissapear under us because it is referenced by child bfq_queue's
entities from here:
 -> bfq_init_entity()
    ->bfqg_and_blkg_get(bfqg);
    ->entity->parent = bfqg->my_entity

 -> bfq_put_queue(bfqq)
    FINAL_PUT
    ->bfqg_and_blkg_put(bfqq_group(bfqq))
    ->kmem_cache_free(bfq_pool, bfqq);

So parent entity can not disappear while child entity is in tree,
and child entities already has proper protection.
This patch revert commit db37a34c ("block, bfq: get a ref to a group when adding it to a service tree")

bfq_group leak trace caused by bad commit:
-> blkg_alloc
   -> bfq_pq_alloc
     -> bfqg_get (+1)
->bfq_activate_bfqq
  ->bfq_activate_requeue_entity
    -> __bfq_activate_entity
       ->bfq_get_entity
         ->bfqg_and_blkg_get (+1)  <==== : Note1
->bfq_del_bfqq_busy
  ->bfq_deactivate_entity+0x53/0xc0 [bfq]
    ->__bfq_deactivate_entity+0x1b8/0x210 [bfq]
      -> bfq_forget_entity(is_in_service = true)
	 entity->on_st_or_in_serv = false   <=== :Note2
	 if (is_in_service)
	     return;  ==> do not touch reference
-> blkcg_css_offline
 -> blkcg_destroy_blkgs
  -> blkg_destroy
   -> bfq_pd_offline
    -> __bfq_deactivate_entity
         if (!entity->on_st_or_in_serv) /* true, because (Note2)
		return false;
 -> bfq_pd_free
    -> bfqg_put() (-1, byt bfqg->ref == 2) because of (Note2)
So bfq_group and blkcg_gq  will leak forever, see test-case below.

##TESTCASE_BEGIN:
#!/bin/bash

max_iters=${1:-100}
#prep cgroup mounts
mount -t tmpfs cgroup_root /sys/fs/cgroup
mkdir /sys/fs/cgroup/blkio
mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

# Prepare blkdev
grep blkio /proc/cgroups
truncate -s 1M img
losetup /dev/loop0 img
echo bfq > /sys/block/loop0/queue/scheduler

grep blkio /proc/cgroups
for ((i=0;i<max_iters;i++))
do
    mkdir -p /sys/fs/cgroup/blkio/a
    echo 0 > /sys/fs/cgroup/blkio/a/cgroup.procs
    dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null
    echo 0 > /sys/fs/cgroup/blkio/cgroup.procs
    rmdir /sys/fs/cgroup/blkio/a
    grep blkio /proc/cgroups
done
##TESTCASE_END:

Fixes: db37a34c ("block, bfq: get a ref to a group when adding it to a service tree")
Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: NDmitry Monakhov <dmtrmonakhov@yandex-team.ru>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

2de791ab

22 3月, 2020 1 次提交

block, bfq: turn put_queue into release_process_ref in __bfq_bic_change_cgroup · c8997736

由 Paolo Valente 提交于 3月 21, 2020

A bfq_put_queue() may be invoked in __bfq_bic_change_cgroup(). The
goal of this put is to release a process reference to a bfq_queue. But
process-reference releases may trigger also some extra operation, and,
to this goal, are handled through bfq_release_process_ref(). So, turn
the invocation of bfq_put_queue() into an invocation of
bfq_release_process_ref().

Tested-by: cki-project@redhat.com
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

c8997736

03 2月, 2020 3 次提交

block, bfq: get a ref to a group when adding it to a service tree · db37a34c

由 Paolo Valente 提交于 2月 03, 2020

BFQ schedules generic entities, which may represent either bfq_queues
or groups of bfq_queues. When an entity is inserted into a service
tree, a reference must be taken, to make sure that the entity does not
disappear while still referred in the tree. Unfortunately, such a
reference is mistakenly taken only if the entity represents a
bfq_queue. This commit takes a reference also in case the entity
represents a group.
Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: NChris Evich <cevich@redhat.com>
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

db37a34c

block, bfq: remove ifdefs from around gets/puts of bfq groups · 4d8340d0

由 Paolo Valente 提交于 2月 03, 2020

ifdefs around gets and puts of bfq groups reduce readability, remove them.
Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
Reported-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

4d8340d0

block, bfq: extend incomplete name of field on_st · 33a16a98

由 Paolo Valente 提交于 2月 03, 2020

The flag on_st in the bfq_entity data structure is true if the entity
is on a service tree or is in service. Yet the name of the field,
confusingly, does not mention the second, very important case. Extend
the name to mention the second case too.
Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

33a16a98

21 11月, 2019 1 次提交

block,bfq: Skip tracing hooks if possible · 40d47c15

由 Dmitry Monakhov 提交于 11月 01, 2019

In most cases blk_tracing is not active, but  bfq_log_bfqq macro
generate pid_str unconditionally, which result in significant overhead.

## Test
modprobe null_blk
echo bfq > /sys/block/nullb0/queue/scheduler
fio --name=t --ioengine=libaio --direct=1 --filename=/dev/nullb0 \
   --runtime=30 --time_based=1 --rw=write --iodepth=128 --bs=4k

# Results
|        | baseline | w/ patch | gain |
| iops   | 113.19K  | 126.42K  | +11% |
Acked-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NDmitry Monakhov <dmonakhov@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

40d47c15

08 11月, 2019 2 次提交

blk-cgroup: separate out blkg_rwstat under CONFIG_BLK_CGROUP_RWSTAT · 1d156646

由 Tejun Heo 提交于 11月 07, 2019

blkg_rwstat is now only used by bfq-iosched and blk-throtl when on
cgroup1.  Let's move it into its own files and gate it behind a config
option.
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

1d156646

bfq-iosched: stop using blkg->stat_bytes and ->stat_ios · fd41e603

由 Tejun Heo 提交于 11月 07, 2019

When used on cgroup1, bfq uses the blkg->stat_bytes and ->stat_ios
from blk-cgroup core to populate six stat knobs.  blk-cgroup core is
moving away from blkg_rwstat to improve scalability and won't be able
to support this usage.

It isn't like the sharing gains all that much.  Let's break it out to
dedicated rwstat counters which are updated when on cgroup1.  This
makes use of bfqg_*rwstat*() helpers outside of
CONFIG_BFQ_CGROUP_DEBUG.  Move them out.

v2: Compile fix when !CONFIG_BFQ_CGROUP_DEBUG.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

fd41e603

07 9月, 2019 1 次提交

bfq: Add per-device weight · 795fe54c

由 Fam Zheng 提交于 8月 28, 2019

This adds to BFQ the missing per-device weight interfaces:
blkio.bfq.weight_device on legacy and io.bfq.weight on unified. The
implementation pretty closely resembles what we had in CFQ and the parsing code
is basically reused.

Tests
=====

Using two cgroups and three block devices, having weights setup as:

Cgroup          test1           test2
============================================
default         100             500
sda             500             100
sdb             default         default
sdc             200             200

cgroup v1 runs
--------------

    sda.test1.out:   READ: bw=913MiB/s
    sda.test2.out:   READ: bw=183MiB/s

    sdb.test1.out:   READ: bw=213MiB/s
    sdb.test2.out:   READ: bw=1054MiB/s

    sdc.test1.out:   READ: bw=650MiB/s
    sdc.test2.out:   READ: bw=650MiB/s

cgroup v2 runs
--------------

    sda.test1.out:   READ: bw=915MiB/s
    sda.test2.out:   READ: bw=184MiB/s

    sdb.test1.out:   READ: bw=216MiB/s
    sdb.test2.out:   READ: bw=1069MiB/s

    sdc.test1.out:   READ: bw=621MiB/s
    sdc.test2.out:   READ: bw=622MiB/s
Signed-off-by: NFam Zheng <zhengfeiran@bytedance.com>
Acked-by: NTejun Heo <tj@kernel.org>
Reviewed-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

795fe54c

25 6月, 2019 1 次提交

block, bfq: detect wakers and unconditionally inject their I/O · 13a857a4

由 Paolo Valente 提交于 6月 25, 2019

A bfq_queue Q may happen to be synchronized with another
bfq_queue Q2, i.e., the I/O of Q2 may need to be completed for Q to
receive new I/O. We call Q2 "waker queue".

If I/O plugging is being performed for Q, and Q is not receiving any
more I/O because of the above synchronization, then, thanks to BFQ's
injection mechanism, the waker queue is likely to get served before
the I/O-plugging timeout fires.

Unfortunately, this fact may not be sufficient to guarantee a high
throughput during the I/O plugging, because the inject limit for Q may
be too low to guarantee a lot of injected I/O. In addition, the
duration of the plugging, i.e., the time before Q finally receives new
I/O, may not be minimized, because the waker queue may happen to be
served only after other queues.

To address these issues, this commit introduces the explicit detection
of the waker queue, and the unconditional injection of a pending I/O
request of the waker queue on each invocation of
bfq_dispatch_request().

One may be concerned that this systematic injection of I/O from the
waker queue delays the service of Q's I/O. Fortunately, it doesn't. On
the contrary, next Q's I/O is brought forward dramatically, for it is
not blocked for milliseconds.
Reported-by: NSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Tested-by: NSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

13a857a4

21 6月, 2019 2 次提交

block: rename CONFIG_DEBUG_BLK_CGROUP to CONFIG_BFQ_CGROUP_DEBUG · 8060c47b

由 Christoph Hellwig 提交于 6月 06, 2019

This option is entirely bfq specific, give it an appropinquate name.

Also make it depend on CONFIG_BFQ_GROUP_IOSCHED in Kconfig, as all
the functionality already does so anyway.
Acked-by: NTejun Heo <tj@kernel.org>
Acked-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

8060c47b

blk-cgroup: move struct blkg_stat to bfq · c0ce79dc

由 Christoph Hellwig 提交于 6月 06, 2019

This structure and assorted infrastructure is only used by the bfq I/O
scheduler.  Move it there instead of bloating the common code.
Acked-by: NTejun Heo <tj@kernel.org>
Acked-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

c0ce79dc

01 5月, 2019 1 次提交

block: switch all files cleared marked as GPLv2 or later to SPDX tags · a497ee34

由 Christoph Hellwig 提交于 4月 30, 2019

All these files have some form of the usual GPLv2 or later boilerplate.
Switch them to use SPDX tags instead.
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a497ee34

10 4月, 2019 1 次提交

block, bfq: fix use after free in bfq_bfqq_expire · eed47d19

由 Paolo Valente 提交于 4月 10, 2019

The function bfq_bfqq_expire() invokes the function
__bfq_bfqq_expire(), and the latter may free the in-service bfq-queue.
If this happens, then no other instruction of bfq_bfqq_expire() must
be executed, or a use-after-free will occur.

Basing on the assumption that __bfq_bfqq_expire() invokes
bfq_put_queue() on the in-service bfq-queue exactly once, the queue is
assumed to be freed if its refcounter is equal to one right before
invoking __bfq_bfqq_expire().

But, since commit 9dee8b3b ("block, bfq: fix queue removal from
weights tree") this assumption is false. __bfq_bfqq_expire() may also
invoke bfq_weights_tree_remove() and, since commit 9dee8b3b
("block, bfq: fix queue removal from weights tree"), also
the latter function may invoke bfq_put_queue(). So __bfq_bfqq_expire()
may invoke bfq_put_queue() twice, and this is the actual case where
the in-service queue may happen to be freed.

To address this issue, this commit moves the check on the refcounter
of the queue right around the last bfq_put_queue() that may be invoked
on the queue.

Fixes: 9dee8b3b ("block, bfq: fix queue removal from weights tree")
Reported-by: NDmitrii Tcvetkov <demfloro@demfloro.ru>
Reported-by: NDouglas Anderson <dianders@chromium.org>
Tested-by: NDmitrii Tcvetkov <demfloro@demfloro.ru>
Tested-by: NDouglas Anderson <dianders@chromium.org>
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

eed47d19

09 4月, 2019 1 次提交

block, bfq: fix some typos in comments · 636b8fe8

由 Angelo Ruocco 提交于 4月 08, 2019

Some of the comments in the bfq files had typos. This patch fixes them.
Signed-off-by: NAngelo Ruocco <angeloruocco90@gmail.com>
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

636b8fe8

01 4月, 2019 4 次提交

block, bfq: save & resume weight on a queue merge/split · fffca087

由 Francesco Pollicino 提交于 3月 12, 2019

bfq saves the state of a queue each time a merge occurs, to be
able to resume such a state when the queue is associated again
with its original process, on a split.

Unfortunately bfq does not save & restore also the weight of the
queue. If the weight is not correctly resumed when the queue is
recycled, then the weight of the recycled queue could differ
from the weight of the original queue.

This commit adds the missing save & resume of the weight.
Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: NFrancesco Pollicino <fra.fra.800@gmail.com>
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

fffca087

block, bfq: print SHARED instead of pid for shared queues in logs · 1e66413c

由 Francesco Pollicino 提交于 3月 12, 2019

The function "bfq_log_bfqq" prints the pid of the process
associated with the queue passed as input.

Unfortunately, if the queue is shared, then more than one process
is associated with the queue. The pid that gets printed in this
case is the pid of one of the associated processes.
Which process gets printed depends on the exact sequence of merge
events the queue underwent. So printing such a pid is rather
useless and above all is often rather confusing because it
reports a random pid between those of the associated processes.

This commit addresses this issue by printing SHARED instead of a pid
if the queue is shared.
Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: NFrancesco Pollicino <fra.fra.800@gmail.com>
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

1e66413c

block, bfq: do not merge queues on flash storage with queueing · 8cacc5ab

由 Paolo Valente 提交于 3月 12, 2019

To boost throughput with a set of processes doing interleaved I/O
(i.e., a set of processes whose individual I/O is random, but whose
merged cumulative I/O is sequential), BFQ merges the queues associated
with these processes, i.e., redirects the I/O of these processes into a
common, shared queue. In the shared queue, I/O requests are ordered by
their position on the medium, thus sequential I/O gets dispatched to
the device when the shared queue is served.

Queue merging costs execution time, because, to detect which queues to
merge, BFQ must maintain a list of the head I/O requests of active
queues, ordered by request positions. Measurements showed that this
costs about 10% of BFQ's total per-request processing time.

Request processing time becomes more and more critical as the speed of
the underlying storage device grows. Yet, fortunately, queue merging
is basically useless on the very devices that are so fast to make
request processing time critical. To reach a high throughput, these
devices must have many requests queued at the same time. But, in this
configuration, the internal scheduling algorithms of these devices do
also the job of queue merging: they reorder requests so as to obtain
as much as possible a sequential I/O pattern. As a consequence, with
processes doing interleaved I/O, the throughput reached by one such
device is likely to be the same, with and without queue merging.

In view of this fact, this commit disables queue merging, and all
related housekeeping, for non-rotational devices with internal
queueing. The total, single-lock-protected, per-request processing
time of BFQ drops to, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz
(time measured with simple code instrumentation, and using the
throughput-sync.sh script of the S suite [1], in performance-profiling
mode). To put this result into context, the total,
single-lock-protected, per-request execution time of the lightest I/O
scheduler available in blk-mq, mq-deadline, is 0.7 us (mq-deadline is
~800 LOC, against ~10500 LOC for BFQ).

Disabling merging provides a further, remarkable benefit in terms of
throughput. Merging tends to make many workloads artificially more
uneven, mainly because of shared queues remaining non empty for
incomparably more time than normal queues. So, if, e.g., one of the
queues in a set of merged queues has a higher weight than a normal
queue, then the shared queue may inherit such a high weight and, by
staying almost always active, may force BFQ to perform I/O plugging
most of the time. This evidently makes it harder for BFQ to let the
device reach a high throughput.

As a practical example of this problem, and of the benefits of this
commit, we measured again the throughput in the nasty scenario
considered in previous commit messages: dbench test (in the Phoronix
suite), with 6 clients, on a filesystem with journaling, and with the
journaling daemon enjoying a higher weight than normal processes. With
this commit, the throughput grows from ~150 MB/s to ~200 MB/s on a
PLEXTOR PX-256M5 SSD. This is the same peak throughput reached by any
of the other I/O schedulers. As such, this is also likely to be the
maximum possible throughput reachable with this workload on this
device, because I/O is mostly random, and the other schedulers
basically just pass I/O requests to the drive as fast as possible.

[1] https://github.com/Algodev-github/STested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: NFrancesco Pollicino <fra.fra.800@gmail.com>
Signed-off-by: NAlessio Masola <alessio.masola@gmail.com>
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

8cacc5ab

block, bfq: tune service injection basing on request service times · 2341d662

由 Paolo Valente 提交于 3月 12, 2019

The processes associated with a bfq_queue, say Q, may happen to
generate their cumulative I/O at a lower rate than the rate at which
the device could serve the same I/O. This is rather probable, e.g., if
only one process is associated with Q and the device is an SSD. It
results in Q becoming often empty while in service. If BFQ is not
allowed to switch to another queue when Q becomes empty, then, during
the service of Q, there will be frequent "service holes", i.e., time
intervals during which Q gets empty and the device can only consume
the I/O already queued in its hardware queues. This easily causes
considerable losses of throughput.

To counter this problem, BFQ implements a request injection mechanism,
which tries to fill the above service holes with I/O requests taken
from other bfq_queues. The hard part in this mechanism is finding the
right amount of I/O to inject, so as to both boost throughput and not
break Q's bandwidth and latency guarantees. To this goal, the current
version of this mechanism measures the bandwidth enjoyed by Q while it
is being served, and tries to inject the maximum possible amount of
extra service that does not cause Q's bandwidth to decrease too
much.

This solution has an important shortcoming. For bandwidth measurements
to be stable and reliable, Q must remain in service for a much longer
time than that needed to serve a single I/O request. Unfortunately,
this does not hold with many workloads. This commit addresses this
issue by changing the way the amount of injection allowed is
dynamically computed. It tunes injection as a function of the service
times of single I/O requests of Q, instead of Q's
bandwidth. Single-request service times are evidently meaningful even
if Q gets very few I/O requests completed while it is in service.

As a testbed for this new solution, we measured the throughput reached
by BFQ for one of the nastiest workloads and configurations for this
scheduler: the workload generated by the dbench test (in the Phoronix
suite), with 6 clients, on a filesystem with journaling, and with the
journaling daemon enjoying a higher weight than normal processes.
With this commit, the throughput grows from ~100 MB/s to ~150 MB/s on
a PLEXTOR PX-256M5.
Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: NFrancesco Pollicino <fra.fra.800@gmail.com>
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

2341d662

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功