Commits · 43f57d41edc70c6abdd48065c946e8a25152d8f1 · openanolis / cloud-kernel

17 Jan, 2020 1 commit

alinux: iocost: fix format mismatch build warning · 43f57d41

Committed by Joseph Qi on Jan 03, 2020

This fixes the following format build warning:

    block/blk-iocost.c: In function 'ioc_stat_prfill':
    block/blk-iocost.c:2506:17: warning: format '%llu' expects argument of type 'long long unsigned int', but argument 9 has type 'long int' [-Wformat=]

Reported-by: kbuild test robot <lkp@intel.com>
Fixes: 0670363c ("alinux: iocost: add ioc_gq stat")
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

43f57d41

15 Jan, 2020 2 commits

alinux: iocost: rename weight to cost.weight to avoid conflict with cfq · 7d59e27f
Committed by Jiufei Xue on Dec 05, 2019

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

7d59e27f

alinux: iocost: add ioc_gq stat · 0670363c

Committed by Jiufei Xue on Nov 26, 2019

Add a stat file to monitor the ioc_gq stat.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

0670363c

27 Dec, 2019 18 commits

iocost: check active_list of all the ancestors in iocg_activate() · 9fe84dc5

Committed by Jiufei Xue on Nov 13, 2019

commit 8b37bc277fb459fa100808880a9d4e0641fff444 upstream.

iocg_activate() has a bug: it checks the same active_list over and over
again.  The intention of the code was to check whether all the
ancestors and self have already been activated.  So fix it.

Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

9fe84dc5
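
A minimal Python sketch of the kind of bug described in this commit, not the kernel code. The `Node` class, its fields, and both helper functions are hypothetical stand-ins: the buggy helper loops once per level but tests the same node every time, while the fixed one tests the node that sits at each level of the path.

```python
class Node:
    """Hypothetical stand-in for an iocg: an 'active' flag plus the path from root to self."""

    def __init__(self, name, active=False):
        self.name = name
        self.active = active
        self.path = [self]  # root first, self last; extended by link()


def link(child, parent):
    """Attach child under parent by extending the parent's path."""
    child.path = parent.path + [child]


def all_active_buggy(node):
    # Iterates over every level but keeps testing the same node.
    return all(node.active for _ in node.path)


def all_active_fixed(node):
    # Tests the node that actually sits at each level.
    return all(level.active for level in node.path)


root = Node("root")
mid = Node("mid")
leaf = Node("leaf", active=True)
link(mid, root)
link(leaf, mid)

print(all_active_buggy(leaf))  # only the leaf's own flag is consulted
print(all_active_fixed(leaf))  # inactive ancestors are noticed
```

With an active leaf under inactive ancestors, the two helpers disagree, which is the symptom of checking one list repeatedly instead of walking the hierarchy.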

iocost: don't nest spin_lock_irq in ioc_weight_write() · 779d625e

Committed by Dan Carpenter on Oct 31, 2019

commit 41591a51f00d2dc7bb9dc6e9bedf56c5cf6f2392 upstream.

This code causes a static analysis warning:

    block/blk-iocost.c:2113 ioc_weight_write() error: double lock 'irq'

We disable IRQs in blkg_conf_prep() and re-enable them in
blkg_conf_finish().  IRQ disable/enable should not be nested because
that means the IRQs will be enabled at the first unlock instead of the
second one.

Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

779d625e
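
The nesting problem can be illustrated with a toy Python model of a single CPU's interrupt-enable flag. This is a sketch, not kernel code: the `Cpu` class and `run()` helper are hypothetical, and the only behaviour modelled is that the `_irq` lock variants disable and re-enable interrupts unconditionally. Using the plain lock for the inner section is assumed here as one way to avoid the problem.

```python
class Cpu:
    """Toy model of one CPU's interrupt-enable flag."""

    def __init__(self):
        self.irqs_enabled = True

    def spin_lock_irq(self):
        self.irqs_enabled = False  # unconditionally disable IRQs

    def spin_unlock_irq(self):
        self.irqs_enabled = True   # unconditionally re-enable IRQs

    def spin_lock(self):
        pass                       # plain lock: IRQ state untouched

    def spin_unlock(self):
        pass


def run(inner_uses_irq_variant):
    """Return (IRQ state after the inner unlock, IRQ state after the outer unlock)."""
    cpu = Cpu()
    cpu.spin_lock_irq()            # outer section begins, IRQs off
    if inner_uses_irq_variant:
        cpu.spin_lock_irq()
        cpu.spin_unlock_irq()      # re-enables IRQs too early
    else:
        cpu.spin_lock()
        cpu.spin_unlock()
    enabled_inside_outer = cpu.irqs_enabled
    cpu.spin_unlock_irq()          # outer section ends
    return enabled_inside_outer, cpu.irqs_enabled


print(run(True))   # nested _irq variant
print(run(False))  # plain inner lock
```

In the nested case interrupts are already back on while the outer section is still running, which is what the commit message means by "enabled at the first unlock instead of the second one".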

alinux: iocost: fix a deadlock in ioc_rqos_throttle() · 573ddb46

Committed by Jiufei Xue on Nov 01, 2019

ioc_rqos_throttle() may be called with queue_lock held.
We should unlock the queue_lock before going to sleep.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

573ddb46

iocost: don't let vrate run wild while there's no saturation signal · c08d3e4b

Committed by Tejun Heo on Oct 14, 2019

When the QoS targets are met and nothing is being throttled, there's
no way to tell how saturated the underlying device is - it could be
almost entirely idle, at the cusp of saturation or anywhere in between.
Given that there's no information, it's best to keep vrate as-is in
this state.  Before 7cd806a9a953 ("iocost: improve nr_lagging
handling"), this was the case - if the device isn't missing QoS
targets and nothing is being throttled, busy_level was reset to zero.

While fixing nr_lagging handling, 7cd806a9a953 ("iocost: improve
nr_lagging handling") broke this.  Now, while the device is hitting
QoS targets and nothing is being throttled, vrate keeps getting
adjusted according to the existing busy_level.

This led to vrate climbing until it hit max when there's an IO
issuer with limited request concurrency and vrate started low.
vrate starts getting adjusted upwards until the issuer can issue IOs
w/o being throttled.  From then on, QoS targets keep getting met,
nothing on the system needs throttling, and vrate keeps getting
increased due to the existing busy_level.

This patch makes the following changes to the busy_level logic.

* Reset busy_level if nr_shortages is zero to avoid the above
  scenario.

* Make non-zero nr_lagging block lowering busy_level but still clear
  positive busy_level if there's clear non-saturation signal - QoS
  targets are met and nr_shortages is non-zero.  nr_lagging's role is
  preventing adjusting vrate upwards while there are long-running
  commands and it shouldn't keep busy_level positive while there's
  clear non-saturation signal.

* Restructure code for clarity and add comments.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Andy Newell <newella@fb.com>
Fixes: 7cd806a9a953 ("iocost: improve nr_lagging handling")
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

c08d3e4b
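
The busy_level rules listed in this commit can be sketched as a small Python function. This is a simplified reading of the commit message, not the kernel logic: `next_busy_level()` and its boolean `qos_met` input are hypothetical, and the branch for missed QoS targets is an assumption, since the message does not spell it out.

```python
def next_busy_level(busy_level, qos_met, nr_shortages, nr_lagging):
    """Return the busy_level for the next period under the rules in the message."""
    if not qos_met:
        # Assumption: missing QoS targets pushes busy_level upwards.
        return max(busy_level, 0) + 1

    if nr_shortages == 0:
        # Targets met and nothing throttled: no saturation signal either way,
        # so reset instead of letting an old busy_level keep moving vrate.
        return 0

    # Targets met while something is throttled: clear non-saturation signal,
    # so any positive busy_level is cleared.
    busy_level = min(busy_level, 0)

    # Lagging IOs block lowering busy_level any further.
    if nr_lagging == 0:
        busy_level -= 1
    return busy_level


print(next_busy_level(-3, qos_met=True, nr_shortages=0, nr_lagging=0))
print(next_busy_level(5, qos_met=True, nr_shortages=2, nr_lagging=1))
print(next_busy_level(-2, qos_met=True, nr_shortages=2, nr_lagging=0))
```

The first call shows the reset that stops vrate from running wild; the second shows lagging IOs leaving room to clear a positive busy_level without pushing it negative.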

iocost: bump up default latency targets for hard disks · 54c73cd5

Committed by Tejun Heo on Sep 25, 2019

commit 7afcccafa59fb63b58f863a6c5e603a34625955b upstream.

The default hard disk param sets latency targets at 50ms.  As the
default target percentiles are zero, these don't directly regulate
vrate; however, they're still used to calculate the period length -
100ms in this case.

This is excessively low.  A SATA drive with QD32 saturated with random
IOs can easily reach avg completion latency of several hundred msecs.
A period duration which is substantially lower than avg completion
latency can lead to wildly fluctuating vrate.

Let's bump up the default latency targets to 250ms so that the period
duration is sufficiently long.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

54c73cd5
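
The arithmetic behind this change can be sketched in Python. The 2x multiplier is inferred from the 50ms target and 100ms period quoted in the message and is an assumption, as are the `period_ms()` helper and the 300ms average latency, which is only an illustrative pick for "several hundred msecs". Any clamping the kernel applies is not modelled.

```python
def period_ms(read_target_ms, write_target_ms, multiplier=2):
    """Period length derived from the larger latency target (assumed 2x)."""
    return multiplier * max(read_target_ms, write_target_ms)


AVG_COMPLETION_MS = 300  # hypothetical saturated-HDD average latency

old_period = period_ms(50, 50)
new_period = period_ms(250, 250)

print(old_period, old_period >= AVG_COMPLETION_MS)
print(new_period, new_period >= AVG_COMPLETION_MS)
```

Under these assumptions the old period is shorter than a single average completion, while the new one comfortably covers it.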

iocost: improve nr_lagging handling · f63e7224

Committed by Tejun Heo on Sep 25, 2019

commit 7cd806a9a953f234b9865c30028f47fd738ce375 upstream.

Some IOs may span multiple periods.  As latencies are collected on
completion, the in-between periods won't register them and may
incorrectly decide to increase vrate.  nr_lagging tracks these IOs to
avoid those situations.  Currently, whenever there are IOs which are
spanning from the previous period, busy_level is reset to 0 if
negative thus suppressing vrate increase.

This has the following two problems.

* When latency target percentiles aren't set, vrate adjustment should
  only be governed by queue depth depletion; however, the current code
  keeps nr_lagging active which pulls in latency results and can keep
  down vrate unexpectedly.

* When lagging condition is detected, it resets the entire negative
  busy_level.  This turned out to be way too aggressive on some
  devices which sometimes experience extended latencies on a small
  subset of commands.  In addition, a lagging IO will be accounted as
  latency target miss on completion anyway and resetting busy_level
  amplifies its impact unnecessarily.

This patch fixes the above two problems by disabling nr_lagging
counting when latency target percentiles aren't set and blocking vrate
increases when there are lagging IOs while leaving busy_level as-is.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

f63e7224

iocost: better trace vrate changes · c017dc0a

Committed by Tejun Heo on Sep 25, 2019

commit 25d41e4aadb0788b4fae8a8fca90f437b9ebd727 upstream.

vrate_adj tracepoint traces vrate changes; however, it does so only
when busy_level is non-zero.  busy_level turning to zero can sometimes
be as interesting an event.  This patch also enables vrate_adj
tracepoint on other vrate related events - busy_level changes and
non-zero nr_lagging.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

c017dc0a

alinux: iocost: fix NULL pointer dereference in ioc_rqos_throttle · 9da41925

Committed by Jiufei Xue on Sep 30, 2019

Bios are not associated with a blkg before entering the iocost
controller.  Do it in ioc_rqos_throttle() as well as ioc_rqos_merge().

Considering that there are so many chances to create the blkg before
ioc_rqos_merge(), we just look up the blkg here and, if it does not
exist, return rather than create it.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

9da41925

alinux: iocost: add legacy interface file · e2b73f3e

Committed by Jiufei Xue on Sep 30, 2019

Add a legacy interface file to support cgroup v1.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

e2b73f3e

iocost_monitor: Report debt · 0748d0f9

Committed by Tejun Heo on Sep 04, 2019

commit 7c1ee704a1d6450f92372d57f5b76a458b51c1d4 upstream.

Report debt and rename del_ms row to delay for consistency.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

0748d0f9

blk-iocost: Don't let merges push vtime into the future · 318de196

Committed by Tejun Heo on Sep 04, 2019

commit e1518f63f246831af222758ead022cd40e79fab8 upstream.

Merges have the same problem that forced-bios had which is fixed by
the previous patch.  The cost of a merge is calculated at the time of
issue and force-advances vtime into the future.  Until global vtime
catches up, how the cgroup's hweight changes in the meantime doesn't
matter and it often leads to situations where the cost is calculated
at one hweight and paid at a very different one.  See the previous
patch for more details.

Fix it by never advancing vtime into the future for merges.  If budget
is available, vtime is advanced.  Otherwise, the cost is charged as
debt.

This brings merge cost handling in line with issue cost handling in
ioc_rqos_throttle().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

318de196
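
The merge rule from this commit ("if budget is available, vtime is advanced; otherwise, the cost is charged as debt") can be sketched in Python. The `Group` class, `charge_merge()`, and the numbers are hypothetical; units are arbitrary and this is not the kernel implementation.

```python
class Group:
    """Hypothetical per-cgroup state: local vtime and debt in absolute vtime."""

    def __init__(self, vtime):
        self.vtime = vtime
        self.abs_debt = 0


def charge_merge(group, global_vtime, scaled_cost, abs_cost):
    """Charge a merge without ever pushing local vtime past global vtime.

    scaled_cost is the cost at the group's current hweight,
    abs_cost is the same cost in absolute vtime.
    """
    if group.vtime + scaled_cost <= global_vtime:
        group.vtime += scaled_cost      # budget available: advance vtime
    else:
        group.abs_debt += abs_cost      # no budget: remember it as debt


g = Group(vtime=100)
charge_merge(g, global_vtime=150, scaled_cost=30, abs_cost=3)
charge_merge(g, global_vtime=150, scaled_cost=30, abs_cost=3)
print(g.vtime, g.abs_debt)
```

The first merge fits in the budget and advances vtime; the second would overshoot global vtime, so it is recorded as debt and vtime stays put.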

blk-iocost: Account force-charged overage in absolute vtime · e9e10067

Committed by Tejun Heo on Sep 04, 2019

commit 36a524814ff3e5d5385f42d30152fe8c5e1fd2c1 upstream.

Currently, when a bio needs to be force-charged and there isn't enough
budget, vtime is simply pushed into the future.  This means that the
cost of the whole bio is scaled using the current hweight and then
charged immediately.  Until the global vtime advances beyond this
future vtime, the cgroup won't be allowed to issue normal IOs.

This is incorrect and can lead to, for example, exploding vrate or
extended stalls if vrate range is constrained.  Consider the following
scenario.

1. A cgroup with a very low hweight runs out of budget.

2. A storm of swap-out happens on it.  All of them are scaled
   according to the current low hweight and charged to vtime pushing
   it to a far future.

3. All other cgroups go idle and now the above cgroup has access to
   the whole device.  However, because vtime is already wound using
   the past low hweight, what its current hweight is doesn't matter
   until global vtime catches up to the local vtime.

4. As a result, either vrate gets ramped up extremely or the IOs stall
   while the underlying device is idle.

This is because the hweight the overage is calculated at is different
from the hweight that it's being paid at.

Fix it by remembering the overage in absolute vtime and continuously
paying with the actual budget according to the current hweight at each
period.

Note that non-forced bios which wait already remember the cost in
absolute vtime.  This brings forced-bio accounting in line.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

e9e10067
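
A toy Python comparison of the two accounting schemes described in this commit. Both functions, the percentage representation of hweight, and all numbers are hypothetical; the sketch only shows why scaling the overage once at a low hweight differs from paying an absolute debt at each period's current hweight.

```python
def periods_to_clear_old(abs_cost, hweight_pct_at_charge, vtime_per_period):
    """Old scheme: scale the cost once at charge time, then wait for global vtime."""
    scaled_cost = abs_cost * 100 // hweight_pct_at_charge
    return -(-scaled_cost // vtime_per_period)  # ceiling division


def periods_to_clear_new(abs_debt, hweight_pct_per_period, vtime_per_period):
    """New scheme: keep the debt absolute and pay it at each period's hweight."""
    periods = 0
    while abs_debt > 0:
        # Reuse the last listed hweight once the list runs out.
        idx = min(periods, len(hweight_pct_per_period) - 1)
        payment = vtime_per_period * hweight_pct_per_period[idx] // 100
        if payment <= 0:
            raise ValueError("hweight too small to make progress in this toy model")
        abs_debt -= payment
        periods += 1
    return periods


# A cgroup at 1% hweight is force-charged 100 units, then gets the whole device.
print(periods_to_clear_old(100, 1, 100))
print(periods_to_clear_new(100, [1, 100], 100))
```

In the old scheme the charge is frozen at the 1% hweight and takes 100 periods to clear; in the new scheme the debt is cleared as soon as the cgroup's hweight rises.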

blk-iocost: Fix incorrect operation order during iocg free · e808b3c2

Committed by Tejun Heo on Sep 10, 2019

commit e036c4cabaa8d24375262ced3a191819a8077b74 upstream.

ioc_pd_free() first cancels the hrtimers and then deactivates the
iocg.  However, the iocg timer can run in between and reschedule the
hrtimers which will end up running after the iocg is freed leading to
crashes like the following.

  general protection fault: 0000 [#1] SMP
  ...
  RIP: 0010:iocg_kick_delay+0xbe/0x1b0
  RSP: 0018:ffffc90003598ea0 EFLAGS: 00010046
  RAX: 1cee00fd69512b54 RBX: ffff8881bba48400 RCX: 00000000000003e8
  RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8881bba48400
  RBP: 0000000000004e20 R08: 0000000000000002 R09: 00000000000003e8
  R10: 0000000000000000 R11: 0000000000000000 R12: ffffc90003598ef0
  R13: 00979f3810ad461f R14: ffff8881bba4b400 R15: 25439f950d26e1d1
  FS:  0000000000000000(0000) GS:ffff88885f800000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f64328c7e40 CR3: 0000000002409005 CR4: 00000000003606e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <IRQ>
   iocg_delay_timer_fn+0x3d/0x60
   __hrtimer_run_queues+0xfe/0x270
   hrtimer_interrupt+0xf4/0x210
   smp_apic_timer_interrupt+0x5e/0x120
   apic_timer_interrupt+0xf/0x20
   </IRQ>

Fix it by canceling hrtimers after deactivating the iocg.

Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

e808b3c2
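
The ordering issue in this commit can be modelled with a deterministic Python sketch. The `Iocg` class and both teardown functions are hypothetical stand-ins, and the race is reduced to a flag saying whether the timer runs between the two teardown steps.

```python
class Iocg:
    """Hypothetical iocg state: whether it is active and whether its hrtimer is armed."""

    def __init__(self):
        self.active = True
        self.timer_armed = True

    def timer_runs(self):
        # The timer only reschedules the hrtimer for an active iocg.
        if self.active:
            self.timer_armed = True


def free_cancel_then_deactivate(iocg, timer_runs_in_between):
    iocg.timer_armed = False          # step 1: cancel hrtimers
    if timer_runs_in_between:
        iocg.timer_runs()             # still active, so it re-arms
    iocg.active = False               # step 2: deactivate
    return iocg.timer_armed           # True means a timer outlives the iocg


def free_deactivate_then_cancel(iocg, timer_runs_in_between):
    iocg.active = False               # step 1: deactivate
    if timer_runs_in_between:
        iocg.timer_runs()             # inactive, so nothing is re-armed
    iocg.timer_armed = False          # step 2: cancel hrtimers
    return iocg.timer_armed


print(free_cancel_then_deactivate(Iocg(), timer_runs_in_between=True))
print(free_deactivate_then_cancel(Iocg(), timer_runs_in_between=True))
```

Only the cancel-first ordering can leave a timer armed at the point where the iocg would be freed.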

blkcg: add missing NULL check in ioc_cpd_alloc() · 9f8427b6

Committed by Tejun Heo on Aug 30, 2019

commit e916ad29d96485e5aa3d3237bfeab1522c713d5e upstream.

ioc_cpd_alloc() forgot to check NULL return from kzalloc().  Add it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

9f8427b6

blkcg: fix missing free on error path of blk_iocost_init() · 016c3524

Committed by Tejun Heo on Aug 29, 2019

commit 3532e7227243beb0b782266dc05c40b6184ad051 upstream.

blk_iocost_init() forgot to free its percpu stat on the error path.
Fix it.

Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
Reported-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

016c3524

blkcg: add tools/cgroup/iocost_coef_gen.py · 0581637c

Committed by Tejun Heo on Aug 28, 2019

commit 8504dea783b044cab620acbaef87b86ee84646fe upstream.

Add a script which can be used to generate device-specific iocost
linear model coefficients.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

0581637c

blkcg: add tools/cgroup/iocost_monitor.py · 5ec3f278

Committed by Tejun Heo on Aug 28, 2019

commit 6954ff185ee0811cdd2e0f388ff4dd7df17f11af upstream.

Instead of mucking with debugfs and ->pd_stat(), add a drgn-based
monitoring script.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

5ec3f278

blkcg: implement blk-iocost · e383d72b

Committed by Tejun Heo on Aug 28, 2019

commit 7caa47151ab2e644dd221f741ec7578d9532c9a3 upstream.

This patchset implements an IO cost model based work-conserving
proportional controller.

While io.latency provides the capability to comprehensively prioritize
and protect IOs depending on the cgroups, its protection is binary -
the lowest latency target cgroup which is suffering is protected at
the cost of all others.  In many use cases including stacking multiple
workload containers in a single system, it's necessary to distribute
IO capacity with better granularity.

One challenge of controlling IO resources is the lack of trivially
observable cost metric.  The most common metrics - bandwidth and iops
- can be off by orders of magnitude depending on the device type and
IO pattern.  However, the cost isn't a complete mystery.  Given
several key attributes, we can make fairly reliable predictions on how
expensive a given stream of IOs would be, at least compared to other
IO patterns.

The function which determines the cost of a given IO is the IO cost
model for the device.  This controller distributes IO capacity based
on the costs estimated by such a model.  The more accurate the cost
model the better, but the controller adapts based on IO completion
latency and as long as the relative costs across different IO
patterns are consistent and sensible, it'll adapt to the actual
performance of the device.

Currently, the only implemented cost model is a simple linear one with
a few sets of default parameters for different classes of device.
This covers most common devices reasonably well.  All the
infrastructure to tune and add different cost models is already in
place and a later patch will also allow using bpf progs for cost
models.

Please see the top comment in blk-iocost.c and documentation for
more details.

v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix
    for a divide-by-zero bug in current_hweight() triggered by zero
    inuse_sum.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[Joseph: fix conflicts with ioc_rqos_throttle()]
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

e383d72b
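
As a rough illustration of what a "simple linear" cost model could look like, here is a Python sketch that charges a base cost depending on whether the IO is sequential or random, plus a per-page cost. The exact form, the `LinearCoefs` fields, and the coefficient values are assumptions made for illustration and are not taken from blk-iocost.c.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearCoefs:
    """Hypothetical coefficients for one IO direction, in arbitrary cost units."""
    seq_io: int     # base cost of a sequential IO
    rand_io: int    # base cost of a random IO
    per_page: int   # additional cost per page transferred


def io_cost(coefs, is_sequential, nr_pages):
    """Estimate the cost of one IO as base cost plus a size-proportional part."""
    base = coefs.seq_io if is_sequential else coefs.rand_io
    return base + coefs.per_page * nr_pages


# Made-up numbers for a device where seeks dominate.
hdd_like = LinearCoefs(seq_io=10, rand_io=100, per_page=2)

print(io_cost(hdd_like, is_sequential=True, nr_pages=8))
print(io_cost(hdd_like, is_sequential=False, nr_pages=8))
```

What matters for the controller, per the message above, is that relative costs across IO patterns are consistent; absolute accuracy is compensated for by adapting to completion latency.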

openanolis / cloud-kernel: last synced successfully more than 1 year ago
