提交 · dbae2c551377b6533a00c11fc7ede370100ab404 · openeuler / Kernel

18 7月, 2018 5 次提交

block: Define and use STAT_READ and STAT_WRITE · dbae2c55

由 Michael Callahan 提交于 7月 18, 2018

Add defines for STAT_READ and STAT_WRITE for indexing the partition
stat entries. This clarifies some fs/ code which has hardcoded 1 for
STAT_WRITE and will make it easier to extend the stats with additional
fields.

tj: Refreshed on top of v4.17.
Signed-off-by: NMichael Callahan <michaelcallahan@fb.com>
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

dbae2c55

block: Add part_stat_read_accum to read across field entries. · 59767fbd

由 Michael Callahan 提交于 7月 18, 2018

Add a part_stat_read_accum macro to genhd.h to read and sum across
field entries.  For example to sum up the number read and write
sectors completed.  In addition to being ar reasonable cleanup by
itself this will make it easier to add new stat fields in the future.

tj: Refreshed on top of v4.17.
Signed-off-by: NMichael Callahan <michaelcallahan@fb.com>
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

59767fbd

block: make bdev_ops->rw_page() take a REQ_OP instead of bool · 3f289dcb

由 Tejun Heo 提交于 7月 18, 2018

c11f0c0b ("block/mm: make bdev_ops->rw_page() take a bool for
read/write") replaced @OP with boolean @is_write, which limited the
amount of information going into ->rw_page() and more importantly
page_endio(), which removed the need to expose block internals to mm.

Unfortunately, we want to track discards separately and @is_write
isn't enough information.  This patch updates bdev_ops->rw_page() to
take REQ_OP instead but leaves page_endio() to take bool @is_write.
This allows the block part of operations to have enough information
while not leaking it to mm.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

3f289dcb

pktcdvd: remove assignment in if condition · ada94973

由 RAGHU Halharvi 提交于 7月 17, 2018

* Remove checkpatch errors caused due to assignment operation in if
condition
Signed-off-by: NRAGHU Halharvi <raghuhack78@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

ada94973

blk-mq: issue directly if hw queue isn't busy in case of 'none' · 6ce3dd6e

由 Ming Lei 提交于 7月 10, 2018

In case of 'none' io scheduler, when hw queue isn't busy, it isn't
necessary to enqueue request to sw queue and dequeue it from
sw queue because request may be submitted to hw queue asap without
extra cost, meantime there shouldn't be much request in sw queue,
and we don't need to worry about effect on IO merge.

There are still some single hw queue SCSI HBAs(HPSA, megaraid_sas, ...)
which may connect high performance devices, so 'none' is often required
for obtaining good performance.

This patch improves IOPS and decreases CPU unilization on megaraid_sas,
per Kashyap's test.

Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Reported-by: NKashyap Desai <kashyap.desai@broadcom.com>
Tested-by: NKashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

6ce3dd6e

17 7月, 2018 2 次提交

blk-iolatency: truncate our current time · 71e9690b

由 Josef Bacik 提交于 7月 16, 2018

In our longer tests we noticed that some boxes would degrade to the
point of uselessness. This is because we truncate the current time when
saving it in our bio, but I was using the raw current time to subtract
from. So once the box had been up a certain amount of time it would
appear as if our IO's were taking several years to complete. Fix this
by truncating the current time so it matches the issue time. Verified
this worked by running with this patch for a week on our test tier.
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

71e9690b

blk-iolatency: don't change the latency window · d607eefa

由 Josef Bacik 提交于 7月 16, 2018

Early versions of these patches had us waiting for seconds at a time
during submission, so we had to adjust the timing window we monitored
for latency.  Now we don't do things like that so this is unnecessary
code.
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

d607eefa

13 7月, 2018 12 次提交

lightnvm: pblk: assume that chunks are closed on 1.2 devices · f6352103

由 Hans Holmberg 提交于 7月 13, 2018

We can't know if a block is closed or not on 1.2 devices, so assume
closed state to make sure that blocks are erased before writing.

Fixes: 32ef9412 ("lightnvm: pblk: implement get log report chunk")
Signed-off-by: NHans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: NMatias Bjørling <mb@lightnvm.io>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

f6352103

lightnvm: pblk: add asynchronous partial read · 11f6ad69

由 Heiner Litz 提交于 7月 13, 2018

In the read path, partial reads are currently performed synchronously
which affects performance for workloads that generate many partial
reads. This patch adds an asynchronous partial read path as well as
the required partial read ctx.
Signed-off-by: NHeiner Litz <hlitz@ucsc.edu>
Reviewed-by: NIgor Konopko <igor.j.konopko@intel.com>
Signed-off-by: NMatias Bjørling <mb@lightnvm.io>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

11f6ad69

lightnvm: pblk: mark expected switch fall-through · 884b031b

由 Gustavo A. R. Silva 提交于 7月 13, 2018

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Signed-off-by: NGustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: NMatias Bjørling <mb@lightnvm.io>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

884b031b

lightnvm: pblk: expose generic disk name on pr_* msgs · 4e495a46

由 Matias Bjørling 提交于 7月 13, 2018

The error messages in pblk does not say which pblk instance that
a message occurred from. Update each error message to reflect the
instance it belongs to, and also prefix it with pblk, so we know
the message comes from the pblk module.
Signed-off-by: NMatias Bjørling <mb@lightnvm.io>
Reviewed-by: NJavier González <javier@cnexlabs.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

4e495a46

lightnvm: limit get chunk meta request size · 59a8f43b

由 Matias Bjørling 提交于 7月 13, 2018

For devices that does not specify a limit on its transfer size, the
get_chk_meta command may send down a single I/O retrieving the full
chunk metadata table. Resulting in large 2-4MB I/O requests. Instead,
split up the I/Os to a maximum of 256KB and issue them separately to
reduce memory requirements.
Signed-off-by: NMatias Bjørling <mb@lightnvm.io>
Reviewed-by: NJavier González <javier@cnexlabs.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

59a8f43b

lightnvm: pblk: fix read_bitmap for 32bit archs · 921aebfa

由 Matias Bjørling 提交于 7月 13, 2018

If using pblk on a 32bit architecture, and there is a need to
perform a partial read, the partial read bitmap will only have
allocated 32 entries, where as 64 are needed.

Make sure that the read_bitmap is initialized to 64bits on 32bit
architectures as well.
Signed-off-by: NMatias Bjørling <mb@lightnvm.io>
Reviewed-by: NIgor Konopko <igor.j.konopko@intel.com>
Reviewed-by: NJavier González <javier@cnexlabs.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

921aebfa

lightnvm: Remove redundant rq->__data_len initialization · 242e461f

由 Bart Van Assche 提交于 7月 13, 2018

Since both blk_old_get_request() and blk_mq_alloc_request() initialize
rq->__data_len to zero, it is not necessary to initialize that member
in nvme_nvm_alloc_request(). Hence remove the rq->__data_len
initialization from nvme_nvm_alloc_request().
Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: NMatias Bjørling <mb@lightnvm.io>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

242e461f

lightnvm: pblk: enable line minor version detection · 99b8dad1

由 Matias Bjørling 提交于 7月 13, 2018

When recovering a line, an extra check was added when debugging was
active, such that minor version where also checked. Unfortunately,
this used the ifdef NVM_DEBUG, which is not correct.

Instead use the proper DEBUG def, and now that it compiles, also fix
the variable.
Signed-off-by: NMatias Bjørling <mb@lightnvm.io>
Fixes: d0ab0b1a ("lightnvm: pblk: check data lines version on recovery")
Reviewed-by: NJavier González <javier@cnexlabs.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

99b8dad1

lightnvm: move NVM_DEBUG to pblk · 880eda54

由 Matias Bjørling 提交于 7月 13, 2018

There is no users of CONFIG_NVM_DEBUG in the LightNVM subsystem. All
users are in pblk. Rename NVM_DEBUG to NVM_PBLK_DEBUG and enable
only for pblk.

Also fix up the CONFIG_NVM_PBLK entry to follow the code style for
Kconfig files.
Signed-off-by: NMatias Bjørling <mb@lightnvm.io>
Reviewed-by: NJavier González <javier@cnexlabs.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

880eda54

lightnvm: pblk: handle case when mw_cunits equals to 0 · ffc03fb7

由 Marcin Dziegielewski 提交于 7月 13, 2018

Some devices can expose mw_cunits equal to 0, it can cause the
creation of too small write buffer and cause performance to drop
on write workloads.

Additionally, write buffer size must cover write data requirements,
such as WS_MIN and MW_CUNITS - it must be greater than or equal to
the larger one multiplied by the number of PUs. However, for
performance reasons, use the WS_OPT value to calculation instead of
WS_MIN.

Because the place where buffer size is calculated was changed, this
patch also removes pgs_in_buffer filed in pblk structure.
Signed-off-by: NMarcin Dziegielewski <marcin.dziegielewski@intel.com>
Signed-off-by: NIgor Konopko <igor.j.konopko@intel.com>
Reviewed-by: NJavier González <javier@cnexlabs.com>
Signed-off-by: NMatias Bjørling <mb@lightnvm.io>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

ffc03fb7

block: remove blkdev_entry_to_request() macro · 05814a10

由 Vladimir Zapolskiy 提交于 7月 13, 2018

Remove blkdev_entry_to_request() macro, which remained unused through
the observable history, also note that it repeats list_entry_rq() macro
verbatim.
Signed-off-by: NVladimir Zapolskiy <vz@mleia.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

05814a10

block: skd: Use %pad printk format for dma_addr_t values · ea870bb2

由 Helge Deller 提交于 7月 12, 2018

Use the existing %pad printk format to print dma_addr_t values.
This avoids the following warnings when compiling on the parisc64 platform:

drivers/block/skd_main.c: In function 'skd_preop_sg_list':
drivers/block/skd_main.c:660:4: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 6 has type 'dma_addr_t {aka unsigned int}' [-Wformat=]
Reviewed-by: NBart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: NHelge Deller <deller@gmx.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

ea870bb2

12 7月, 2018 1 次提交

bsg: remove read/write support · 28519c89

由 Christoph Hellwig 提交于 7月 12, 2018

The code poses a security risk due to user memory access in ->release
and had an API that can't be used reliably.  As far as we know it was
never used for real, but if that turns out wrong we'll have to revert
this commit and come up with a band aid.

Jann Horn did look software archives for users of this interface,
and the only users found were example code in sg3_utils, and optional
support in an optional module of the tgt user space iscsi target,
which looks like a proof of concept extension of the /dev/sg
read/write support.

Tony Battersby chimes in that the code is basically unsafe to use in
general:

  The read/write interface on /dev/bsg is impossible to use safely
  because the list of completed commands is per-device (bd->done_list)
  rather than per-fd like it is with /dev/sg.  So if program A and
  program B are both using the write/read interface on the same bsg
  device, then their command responses will get mixed up, and program
  A will read() some command results from program B and vice versa.
  So no, I don't use read/write on /dev/bsg.  From a security standpoint,
  it should definitely be fixed or removed.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

28519c89

11 7月, 2018 2 次提交

blk-iolatency: fix max_depth comparisons · a284390b

由 Josef Bacik 提交于 7月 11, 2018

max_depth used to be a u64, but I changed it to a unsigned int but
didn't convert my comparisons over everywhere.  Fix by using UINT_MAX
everywhere instead of (u64)-1.
Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a284390b

block: iolatency: avoid 64-bit division · 88b7210c

由 Arnd Bergmann 提交于 7月 10, 2018

On 32-bit architectures, dividing a 64-bit number needs to use the
do_div() function or something like it to avoid a link failure:

block/blk-iolatency.o: In function `iolatency_prfill_limit':
blk-iolatency.c:(.text+0x8cc): undefined reference to `__aeabi_uldivmod'

Using div_u64() gives us the best output and avoids the need for an
explicit cast.

Fixes: d7067512 ("block: introduce blk-iolatency io controller")
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

88b7210c

09 7月, 2018 18 次提交

block/DAC960.c: fix defined but not used build warnings · 3993e501

由 Randy Dunlap 提交于 7月 06, 2018

Fix build warnings in DAC960.c when CONFIG_PROC_FS is not enabled
by marking the unused functions as __maybe_unused.

../drivers/block/DAC960.c:6429:12: warning: 'dac960_proc_show' defined but not used [-Wunused-function]
../drivers/block/DAC960.c:6449:12: warning: 'dac960_initial_status_proc_show' defined but not used [-Wunused-function]
../drivers/block/DAC960.c:6456:12: warning: 'dac960_current_status_proc_show' defined but not used [-Wunused-function]
Signed-off-by: NRandy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: NJens Axboe <axboe@kernel.dk>

3993e501

null_blk: add zone support · ca4b2a01

由 Matias Bjørling 提交于 7月 06, 2018

Adds support for exposing a null_blk device through the zone device
interface.

The interface is managed with the parameters zoned and zone_size.
If zoned is set, the null_blk instance registers as a zoned block
device. The zone_size parameter defines how big each zone will be.
Signed-off-by: NMatias Bjørling <matias.bjorling@wdc.com>
Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

ca4b2a01

null_blk: move shared definitions to header file · 6dad38d3

由 Matias Bjørling 提交于 7月 06, 2018

Split the null_blk device driver, such that it can prepare for
zoned block interface support.
Signed-off-by: NMatias Bjørling <matias.bjorling@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

6dad38d3

block: Add default switch case to blk_pm_allow_request() to kill warning · e9a83853

由 Geert Uytterhoeven 提交于 7月 06, 2018

With gcc 4.9.0 and 7.3.0:

    block/blk-core.c: In function 'blk_pm_allow_request':
    block/blk-core.c:2747:2: warning: enumeration value 'RPM_ACTIVE' not handled in switch [-Wswitch]
      switch (rq->q->rpm_status) {
      ^

Convert the return statement below the switch() block into a default
case to fix this.

Fixes: e4f36b24 ("block: fix peeking requests during PM")
Signed-off-by: NGeert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e9a83853

block: fix infinite loop if the device loses discard capability · b88aef36

由 Mikulas Patocka 提交于 7月 03, 2018

If __blkdev_issue_discard is in progress and a device mapper device is
reloaded with a table that doesn't support discard,
q->limits.max_discard_sectors is set to zero. This results in infinite
loop in __blkdev_issue_discard.

This patch checks if max_discard_sectors is zero and aborts with
-EOPNOTSUPP.
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Tested-by: NZdenek Kabelac <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b88aef36

block, mm: remove unnecessary __GFP_HIGH flag · c137969b

由 Shakeel Butt 提交于 7月 03, 2018

The flag GFP_ATOMIC already contains __GFP_HIGH. There is no need to
explicitly or __GFP_HIGH again. So, just remove unnecessary __GFP_HIGH.
Signed-off-by: NShakeel Butt <shakeelb@google.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

c137969b

null_blk: remove NULLB_DEV_FL_CONFIGURED on turning off nullb device · 00a8cdb8

由 Liu Bo 提交于 7月 06, 2018

Currently mbps knob could only be set once before switching power knob to
on, after power knob has been set at least once, there is no way to set
mbps knob again due to -EBUSY.

As nullb is mainly used for testing, in order to make it flexible, this
removes the flag NULLB_DEV_FL_CONFIGURED so that mbps knob can be reset
when power knob is off, e.g.

echo 0 > /config/nullb/a/power
echo 40 > /config/nullb/a/mbps
echo 1 > /config/nullb/a/power

So does other knobs under /config/nullb/a.
Signed-off-by: NLiu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

00a8cdb8

mm: skip readahead if the cgroup is congested · ca47e8c7

由 Josef Bacik 提交于 7月 03, 2018

We noticed in testing we'd get pretty bad latency stalls under heavy
pressure because read ahead would try to do its thing while the cgroup
was under severe pressure.  If we're under this much pressure we want to
do as little IO as possible so we can still make progress on real work
if we're a throttled cgroup, so just skip readahead if our group is
under pressure.
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Acked-by: NTejun Heo <tj@kernel.org>
Acked-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

ca47e8c7

Documentation: add a doc for blk-iolatency · b351f0c7

由 Josef Bacik 提交于 7月 03, 2018

A basic documentation to describe the interface, statistics, and
behavior of io.latency.
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b351f0c7

block: introduce blk-iolatency io controller · d7067512

由 Josef Bacik 提交于 7月 03, 2018

Current IO controllers for the block layer are less than ideal for our
use case.  The io.max controller is great at hard limiting, but it is
not work conserving.  This patch introduces io.latency.  You provide a
latency target for your group and we monitor the io in short windows to
make sure we are not exceeding those latency targets.  This makes use of
the rq-qos infrastructure and works much like the wbt stuff.  There are
a few differences from wbt

 - It's bio based, so the latency covers the whole block layer in addition to
   the actual io.
 - We will throttle all IO types that comes in here if we need to.
 - We use the mean latency over the 100ms window.  This is because writes can
   be particularly fast, which could give us a false sense of the impact of
   other workloads on our protected workload.
 - By default there's no throttling, we set the queue_depth to INT_MAX so that
   we can have as many outstanding bio's as we're allowed to.  Only at
   throttle time do we pay attention to the actual queue depth.
 - We backcharge cgroups for root cg issued IO and induce artificial
   delays in order to deal with cases like metadata only or swap heavy
   workloads.

In testing this has worked out relatively well.  Protected workloads
will throttle noisy workloads down to 1 io at time if they are doing
normal IO on their own, or induce up to a 1 second delay per syscall if
they are doing a lot of root issued IO (metadata/swap IO).

Our testing has revolved mostly around our production web servers where
we have hhvm (the web server application) in a protected group and
everything else in another group.  We see slightly higher requests per
second (RPS) on the test tier vs the control tier, and much more stable
RPS across all machines in the test tier vs the control tier.

Another test we run is a slow memory allocator in the unprotected group.
Before this would eventually push us into swap and cause the whole box
to die and not recover at all.  With these patches we see slight RPS
drops (usually 10-15%) before the memory consumer is properly killed and
things recover within seconds.
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

d7067512

rq-qos: introduce dio_bio callback · 67b42d0b

由 Josef Bacik 提交于 7月 03, 2018

wbt cares only about request completion time, but controllers may need
information that is on the bio itself, so add a done_bio callback for
rq-qos so things like blk-iolatency can use it to have the bio when it
completes.
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

67b42d0b

block: remove external dependency on wbt_flags · c1c80384

由 Josef Bacik 提交于 7月 03, 2018

We don't really need to save this stuff in the core block code, we can
just pass the bio back into the helpers later on to derive the same
flags and update the rq->wbt_flags appropriately.
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

c1c80384

blk-rq-qos: refactor out common elements of blk-wbt · a7905043

由 Josef Bacik 提交于 7月 03, 2018

blkcg-qos is going to do essentially what wbt does, only on a cgroup
basis.  Break out the common code that will be shared between blkcg-qos
and wbt into blk-rq-qos.* so they can both utilize the same
infrastructure.
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a7905043

blk-stat: export helpers for modifying blk_rq_stat · 2ecbf456

由 Josef Bacik 提交于 7月 03, 2018

We need to use blk_rq_stat in the blkcg qos stuff, so export some of
these helpers so they can be used by other things.
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

2ecbf456

memcontrol: schedule throttling if we are congested · 2cf85583

由 Tejun Heo 提交于 7月 03, 2018

Memory allocations can induce swapping via kswapd or direct reclaim.  If
we are having IO done for us by kswapd and don't actually go into direct
reclaim we may never get scheduled for throttling.  So instead check to
see if our cgroup is congested, and if so schedule the throttling.
Before we return to user space the throttling stuff will only throttle
if we actually required it.
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

2cf85583

blkcg: add generic throttling mechanism · d09d8df3

由 Josef Bacik 提交于 7月 03, 2018

Since IO can be issued from literally anywhere it's almost impossible to
do throttling without having some sort of adverse effect somewhere else
in the system because of locking or other dependencies.  The best way to
solve this is to do the throttling when we know we aren't holding any
other kernel resources.  Do this by tracking throttling in a per-blkg
basis, and if we require throttling flag the task that it needs to check
before it returns to user space and possibly sleep there.

This is to address the case where a process is doing work that is
generating IO that can't be throttled, whether that is directly with a
lot of REQ_META IO, or indirectly by allocating so much memory that it
is swamping the disk with REQ_SWAP.  We can't use task_add_work as we
don't want to induce a memory allocation in the IO path, so simply
saving the request queue in the task and flagging it to do the
notify_resume thing achieves the same result without the overhead of a
memory allocation.
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

d09d8df3

swap,blkcg: issue swap io with the appropriate context · 0d3bd88d

由 Tejun Heo 提交于 7月 03, 2018

For backcharging we need to know who the page belongs to when swapping
it out.  We don't worry about things that do ->rw_page (zram etc) at the
moment, we're only worried about pages that actually go to a block
device.
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

0d3bd88d

blk: introduce REQ_SWAP · 0d1e0c7c

由 Josef Bacik 提交于 7月 03, 2018

Just like REQ_META, it's important to know the IO coming down is swap
in order to guard against potential IO priority inversion issues with
cgroups.  Add REQ_SWAP and use it for all swap IO, and add it to our
bio_issue_as_root_blkg helper.
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

0d1e0c7c

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功