提交 4b925432 编写于 作者: J Jens Axboe

Merge branch 'for-4.21/block' into for-4.21/aio

* for-4.21/block: (351 commits)
  blk-mq: enable IO poll if .nr_queues of type poll > 0
  blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()
  blk-mq: skip zero-queue maps in blk_mq_map_swqueue
  block: fix blk-iolatency accounting underflow
  blk-mq: fix dispatch from sw queue
  block: mq-deadline: Fix write completion handling
  nvme-pci: don't share queue maps
  blk-mq: only dispatch to non-defauly queue maps if they have queues
  blk-mq: export hctx->type in debugfs instead of sysfs
  blk-mq: fix allocation for queue mapping table
  blk-wbt: export internal state via debugfs
  blk-mq-debugfs: support rq_qos
  block: update sysfs documentation
  block: loop: check error using IS_ERR instead of IS_ERR_OR_NULL in loop_add()
  aoe: add __exit annotation
  block: clear REQ_HIPRI if polling is not supported
  blk-mq: replace and kill blk_mq_request_issue_directly
  blk-mq: issue directly with bypass 'false' in blk_mq_sched_insert_requests
  blk-mq: refactor the code of issue request directly
  block: remove the bio_integrity_advance export
  ...
......@@ -244,7 +244,7 @@ Description:
What: /sys/block/<disk>/queue/zoned
Date: September 2016
Contact: Damien Le Moal <damien.lemoal@hgst.com>
Contact: Damien Le Moal <damien.lemoal@wdc.com>
Description:
zoned indicates if the device is a zoned block device
and the zone model of the device if it is indeed zoned.
......@@ -259,6 +259,14 @@ Description:
zone commands, they will be treated as regular block
devices and zoned will report "none".
What: /sys/block/<disk>/queue/nr_zones
Date: November 2018
Contact: Damien Le Moal <damien.lemoal@wdc.com>
Description:
nr_zones indicates the total number of zones of a zoned block
device ("host-aware" or "host-managed" zone model). For regular
block devices, the value is always 0.
What: /sys/block/<disk>/queue/chunk_sectors
Date: September 2016
Contact: Hannes Reinecke <hare@suse.com>
......@@ -268,6 +276,6 @@ Description:
indicates the size in 512B sectors of the RAID volume
stripe segment. For a zoned block device, either
host-aware or host-managed, chunk_sectors indicates the
size of 512B sectors of the zones of the device, with
size in 512B sectors of the zones of the device, with
the eventual exception of the last zone of the device
which may be smaller.
......@@ -1879,8 +1879,10 @@ following two functions.
wbc_init_bio(@wbc, @bio)
Should be called for each bio carrying writeback data and
associates the bio with the inode's owner cgroup. Can be
called anytime between bio allocation and submission.
associates the bio with the inode's owner cgroup and the
corresponding request queue. This must be called after
a queue (device) has been associated with the bio and
before submission.
wbc_account_io(@wbc, @page, @bytes)
Should be called for each data segment being written out.
......@@ -1899,7 +1901,7 @@ the configuration, the bio may be executed at a lower priority and if
the writeback session is holding shared resources, e.g. a journal
entry, may lead to priority inversion. There is no one easy solution
for the problem. Filesystems can try to work around specific problem
cases by skipping wbc_init_bio() or using bio_associate_blkcg()
cases by skipping wbc_init_bio() and using bio_associate_blkg()
directly.
......
......@@ -65,7 +65,6 @@ Description of Contents:
3.2.3 I/O completion
3.2.4 Implications for drivers that do not interpret bios (don't handle
multiple segments)
3.2.5 Request command tagging
3.3 I/O submission
4. The I/O scheduler
5. Scalability related changes
......@@ -708,93 +707,6 @@ is crossed on completion of a transfer. (The end*request* functions should
be used if only if the request has come down from block/bio path, not for
direct access requests which only specify rq->buffer without a valid rq->bio)
3.2.5 Generic request command tagging
3.2.5.1 Tag helpers
Block now offers some simple generic functionality to help support command
queueing (typically known as tagged command queueing), ie manage more than
one outstanding command on a queue at any given time.
blk_queue_init_tags(struct request_queue *q, int depth)
Initialize internal command tagging structures for a maximum
depth of 'depth'.
blk_queue_free_tags((struct request_queue *q)
Teardown tag info associated with the queue. This will be done
automatically by block if blk_queue_cleanup() is called on a queue
that is using tagging.
The above are initialization and exit management, the main helpers during
normal operations are:
blk_queue_start_tag(struct request_queue *q, struct request *rq)
Start tagged operation for this request. A free tag number between
0 and 'depth' is assigned to the request (rq->tag holds this number),
and 'rq' is added to the internal tag management. If the maximum depth
for this queue is already achieved (or if the tag wasn't started for
some other reason), 1 is returned. Otherwise 0 is returned.
blk_queue_end_tag(struct request_queue *q, struct request *rq)
End tagged operation on this request. 'rq' is removed from the internal
book keeping structures.
To minimize struct request and queue overhead, the tag helpers utilize some
of the same request members that are used for normal request queue management.
This means that a request cannot both be an active tag and be on the queue
list at the same time. blk_queue_start_tag() will remove the request, but
the driver must remember to call blk_queue_end_tag() before signalling
completion of the request to the block layer. This means ending tag
operations before calling end_that_request_last()! For an example of a user
of these helpers, see the IDE tagged command queueing support.
3.2.5.2 Tag info
Some block functions exist to query current tag status or to go from a
tag number to the associated request. These are, in no particular order:
blk_queue_tagged(q)
Returns 1 if the queue 'q' is using tagging, 0 if not.
blk_queue_tag_request(q, tag)
Returns a pointer to the request associated with tag 'tag'.
blk_queue_tag_depth(q)
Return current queue depth.
blk_queue_tag_queue(q)
Returns 1 if the queue can accept a new queued command, 0 if we are
at the maximum depth already.
blk_queue_rq_tagged(rq)
Returns 1 if the request 'rq' is tagged.
3.2.5.2 Internal structure
Internally, block manages tags in the blk_queue_tag structure:
struct blk_queue_tag {
struct request **tag_index; /* array or pointers to rq */
unsigned long *tag_map; /* bitmap of free tags */
struct list_head busy_list; /* fifo list of busy tags */
int busy; /* queue depth */
int max_depth; /* max queue depth */
};
Most of the above is simple and straight forward, however busy_list may need
a bit of explaining. Normally we don't care too much about request ordering,
but in the event of any barrier requests in the tag queue we need to ensure
that requests are restarted in the order they were queue.
3.3 I/O Submission
The routine submit_bio() is used to submit a single io. Higher level i/o
......
CFQ (Complete Fairness Queueing)
===============================
The main aim of CFQ scheduler is to provide a fair allocation of the disk
I/O bandwidth for all the processes which requests an I/O operation.
CFQ maintains the per process queue for the processes which request I/O
operation(synchronous requests). In case of asynchronous requests, all the
requests from all the processes are batched together according to their
process's I/O priority.
CFQ ioscheduler tunables
========================
slice_idle
----------
This specifies how long CFQ should idle for next request on certain cfq queues
(for sequential workloads) and service trees (for random workloads) before
queue is expired and CFQ selects next queue to dispatch from.
By default slice_idle is a non-zero value. That means by default we idle on
queues/service trees. This can be very helpful on highly seeky media like
single spindle SATA/SAS disks where we can cut down on overall number of
seeks and see improved throughput.
Setting slice_idle to 0 will remove all the idling on queues/service tree
level and one should see an overall improved throughput on faster storage
devices like multiple SATA/SAS disks in hardware RAID configuration. The down
side is that isolation provided from WRITES also goes down and notion of
IO priority becomes weaker.
So depending on storage and workload, it might be useful to set slice_idle=0.
In general I think for SATA/SAS disks and software RAID of SATA/SAS disks
keeping slice_idle enabled should be useful. For any configurations where
there are multiple spindles behind single LUN (Host based hardware RAID
controller or for storage arrays), setting slice_idle=0 might end up in better
throughput and acceptable latencies.
back_seek_max
-------------
This specifies, given in Kbytes, the maximum "distance" for backward seeking.
The distance is the amount of space from the current head location to the
sectors that are backward in terms of distance.
This parameter allows the scheduler to anticipate requests in the "backward"
direction and consider them as being the "next" if they are within this
distance from the current head location.
back_seek_penalty
-----------------
This parameter is used to compute the cost of backward seeking. If the
backward distance of request is just 1/back_seek_penalty from a "front"
request, then the seeking cost of two requests is considered equivalent.
So scheduler will not bias toward one or the other request (otherwise scheduler
will bias toward front request). Default value of back_seek_penalty is 2.
fifo_expire_async
-----------------
This parameter is used to set the timeout of asynchronous requests. Default
value of this is 248ms.
fifo_expire_sync
----------------
This parameter is used to set the timeout of synchronous requests. Default
value of this is 124ms. In case to favor synchronous requests over asynchronous
one, this value should be decreased relative to fifo_expire_async.
group_idle
-----------
This parameter forces idling at the CFQ group level instead of CFQ
queue level. This was introduced after a bottleneck was observed
in higher end storage due to idle on sequential queue and allow dispatch
from a single queue. The idea with this parameter is that it can be run with
slice_idle=0 and group_idle=8, so that idling does not happen on individual
queues in the group but happens overall on the group and thus still keeps the
IO controller working.
Not idling on individual queues in the group will dispatch requests from
multiple queues in the group at the same time and achieve higher throughput
on higher end storage.
Default value for this parameter is 8ms.
low_latency
-----------
This parameter is used to enable/disable the low latency mode of the CFQ
scheduler. If enabled, CFQ tries to recompute the slice time for each process
based on the target_latency set for the system. This favors fairness over
throughput. Disabling low latency (setting it to 0) ignores target latency,
allowing each process in the system to get a full time slice.
By default low latency mode is enabled.
target_latency
--------------
This parameter is used to calculate the time slice for a process if cfq's
latency mode is enabled. It will ensure that sync requests have an estimated
latency. But if sequential workload is higher(e.g. sequential read),
then to meet the latency constraints, throughput may decrease because of less
time for each process to issue I/O request before the cfq queue is switched.
Though this can be overcome by disabling the latency_mode, it may increase
the read latency for some applications. This parameter allows for changing
target_latency through the sysfs interface which can provide the balanced
throughput and read latency.
Default value for target_latency is 300ms.
slice_async
-----------
This parameter is same as of slice_sync but for asynchronous queue. The
default value is 40ms.
slice_async_rq
--------------
This parameter is used to limit the dispatching of asynchronous request to
device request queue in queue's slice time. The maximum number of request that
are allowed to be dispatched also depends upon the io priority. Default value
for this is 2.
slice_sync
----------
When a queue is selected for execution, the queues IO requests are only
executed for a certain amount of time(time_slice) before switching to another
queue. This parameter is used to calculate the time slice of synchronous
queue.
time_slice is computed using the below equation:-
time_slice = slice_sync + (slice_sync/5 * (4 - prio)). To increase the
time_slice of synchronous queue, increase the value of slice_sync. Default
value is 100ms.
quantum
-------
This specifies the number of request dispatched to the device queue. In a
queue's time slice, a request will not be dispatched if the number of request
in the device exceeds this parameter. This parameter is used for synchronous
request.
In case of storage with several disk, this setting can limit the parallel
processing of request. Therefore, increasing the value can improve the
performance although this can cause the latency of some I/O to increase due
to more number of requests.
CFQ Group scheduling
====================
CFQ supports blkio cgroup and has "blkio." prefixed files in each
blkio cgroup directory. It is weight-based and there are four knobs
for configuration - weight[_device] and leaf_weight[_device].
Internal cgroup nodes (the ones with children) can also have tasks in
them, so the former two configure how much proportion the cgroup as a
whole is entitled to at its parent's level while the latter two
configure how much proportion the tasks in the cgroup have compared to
its direct children.
Another way to think about it is assuming that each internal node has
an implicit leaf child node which hosts all the tasks whose weight is
configured by leaf_weight[_device]. Let's assume a blkio hierarchy
composed of five cgroups - root, A, B, AA and AB - with the following
weights where the names represent the hierarchy.
weight leaf_weight
root : 125 125
A : 500 750
B : 250 500
AA : 500 500
AB : 1000 500
root never has a parent making its weight is meaningless. For backward
compatibility, weight is always kept in sync with leaf_weight. B, AA
and AB have no child and thus its tasks have no children cgroup to
compete with. They always get 100% of what the cgroup won at the
parent level. Considering only the weights which matter, the hierarchy
looks like the following.
root
/ | \
A B leaf
500 250 125
/ | \
AA AB leaf
500 1000 750
If all cgroups have active IOs and competing with each other, disk
time will be distributed like the following.
Distribution below root. The total active weight at this level is
A:500 + B:250 + C:125 = 875.
root-leaf : 125 / 875 =~ 14%
A : 500 / 875 =~ 57%
B(-leaf) : 250 / 875 =~ 28%
A has children and further distributes its 57% among the children and
the implicit leaf node. The total active weight at this level is
AA:500 + AB:1000 + A-leaf:750 = 2250.
A-leaf : ( 750 / 2250) * A =~ 19%
AA(-leaf) : ( 500 / 2250) * A =~ 12%
AB(-leaf) : (1000 / 2250) * A =~ 25%
CFQ IOPS Mode for group scheduling
===================================
Basic CFQ design is to provide priority based time slices. Higher priority
process gets bigger time slice and lower priority process gets smaller time
slice. Measuring time becomes harder if storage is fast and supports NCQ and
it would be better to dispatch multiple requests from multiple cfq queues in
request queue at a time. In such scenario, it is not possible to measure time
consumed by single queue accurately.
What is possible though is to measure number of requests dispatched from a
single queue and also allow dispatch from multiple cfq queue at the same time.
This effectively becomes the fairness in terms of IOPS (IO operations per
second).
If one sets slice_idle=0 and if storage supports NCQ, CFQ internally switches
to IOPS mode and starts providing fairness in terms of number of requests
dispatched. Note that this mode switching takes effect only for group
scheduling. For non-cgroup users nothing should change.
CFQ IO scheduler Idling Theory
===============================
Idling on a queue is primarily about waiting for the next request to come
on same queue after completion of a request. In this process CFQ will not
dispatch requests from other cfq queues even if requests are pending there.
The rationale behind idling is that it can cut down on number of seeks
on rotational media. For example, if a process is doing dependent
sequential reads (next read will come on only after completion of previous
one), then not dispatching request from other queue should help as we
did not move the disk head and kept on dispatching sequential IO from
one queue.
CFQ has following service trees and various queues are put on these trees.
sync-idle sync-noidle async
All cfq queues doing synchronous sequential IO go on to sync-idle tree.
On this tree we idle on each queue individually.
All synchronous non-sequential queues go on sync-noidle tree. Also any
synchronous write request which is not marked with REQ_IDLE goes on this
service tree. On this tree we do not idle on individual queues instead idle
on the whole group of queues or the tree. So if there are 4 queues waiting
for IO to dispatch we will idle only once last queue has dispatched the IO
and there is no more IO on this service tree.
All async writes go on async service tree. There is no idling on async
queues.
CFQ has some optimizations for SSDs and if it detects a non-rotational
media which can support higher queue depth (multiple requests at in
flight at a time), then it cuts down on idling of individual queues and
all the queues move to sync-noidle tree and only tree idle remains. This
tree idling provides isolation with buffered write queues on async tree.
FAQ
===
Q1. Why to idle at all on queues not marked with REQ_IDLE.
A1. We only do tree idle (all queues on sync-noidle tree) on queues not marked
with REQ_IDLE. This helps in providing isolation with all the sync-idle
queues. Otherwise in presence of many sequential readers, other
synchronous IO might not get fair share of disk.
For example, if there are 10 sequential readers doing IO and they get
100ms each. If a !REQ_IDLE request comes in, it will be scheduled
roughly after 1 second. If after completion of !REQ_IDLE request we
do not idle, and after a couple of milli seconds a another !REQ_IDLE
request comes in, again it will be scheduled after 1second. Repeat it
and notice how a workload can lose its disk share and suffer due to
multiple sequential readers.
fsync can generate dependent IO where bunch of data is written in the
context of fsync, and later some journaling data is written. Journaling
data comes in only after fsync has finished its IO (atleast for ext4
that seemed to be the case). Now if one decides not to idle on fsync
thread due to !REQ_IDLE, then next journaling write will not get
scheduled for another second. A process doing small fsync, will suffer
badly in presence of multiple sequential readers.
Hence doing tree idling on threads using !REQ_IDLE flag on requests
provides isolation from multiple sequential readers and at the same
time we do not idle on individual threads.
Q2. When to specify REQ_IDLE
A2. I would think whenever one is doing synchronous write and expecting
more writes to be dispatched from same context soon, should be able
to specify REQ_IDLE on writes and that probably should work well for
most of the cases.
......@@ -64,7 +64,7 @@ guess, the kernel will put the process issuing IO to sleep for an amount
of time, before entering a classic poll loop. This mode might be a
little slower than pure classic polling, but it will be more efficient.
If set to a value larger than 0, the kernel will put the process issuing
IO to sleep for this amont of microseconds before entering classic
IO to sleep for this amount of microseconds before entering classic
polling.
iostats (RW)
......@@ -194,4 +194,31 @@ blk-throttle makes decision based on the samplings. Lower time means cgroups
have more smooth throughput, but higher CPU overhead. This exists only when
CONFIG_BLK_DEV_THROTTLING_LOW is enabled.
zoned (RO)
----------
This indicates if the device is a zoned block device and the zone model of the
device if it is indeed zoned. The possible values indicated by zoned are
"none" for regular block devices and "host-aware" or "host-managed" for zoned
block devices. The characteristics of host-aware and host-managed zoned block
devices are described in the ZBC (Zoned Block Commands) and ZAC
(Zoned Device ATA Command Set) standards. These standards also define the
"drive-managed" zone model. However, since drive-managed zoned block devices
do not support zone commands, they will be treated as regular block devices
and zoned will report "none".
nr_zones (RO)
-------------
For zoned block devices (zoned attribute indicating "host-managed" or
"host-aware"), this indicates the total number of zones of the device.
This is always 0 for regular block devices.
chunk_sectors (RO)
------------------
This has different meaning depending on the type of the block device.
For a RAID device (dm-raid), chunk_sectors indicates the size in 512B sectors
of the RAID volume stripe segment. For a zoned block device, either host-aware
or host-managed, chunk_sectors indicates the size in 512B sectors of the zones
of the device, with the eventual exception of the last zone of the device which
may be smaller.
Jens Axboe <jens.axboe@oracle.com>, February 2009
......@@ -97,11 +97,6 @@ parameters may be changed at runtime by the command
allowing boot to proceed. none ignores them, expecting
user space to do the scan.
scsi_mod.use_blk_mq=
[SCSI] use blk-mq I/O path by default
See SCSI_MQ_DEFAULT in drivers/scsi/Kconfig.
Format: <y/n>
sim710= [SCSI,HW]
See header of drivers/scsi/sim710.c.
......
......@@ -155,12 +155,6 @@ config BLK_CGROUP_IOLATENCY
Note, this is an experimental interface and could be changed someday.
config BLK_WBT_SQ
bool "Single queue writeback throttling"
depends on BLK_WBT
---help---
Enable writeback throttling by default on legacy single queue devices
config BLK_WBT_MQ
bool "Multiqueue writeback throttling"
default y
......
......@@ -3,67 +3,6 @@ if BLOCK
menu "IO Schedulers"
config IOSCHED_NOOP
bool
default y
---help---
The no-op I/O scheduler is a minimal scheduler that does basic merging
and sorting. Its main uses include non-disk based block devices like
memory devices, and specialised software or hardware environments
that do their own scheduling and require only minimal assistance from
the kernel.
config IOSCHED_DEADLINE
tristate "Deadline I/O scheduler"
default y
---help---
The deadline I/O scheduler is simple and compact. It will provide
CSCAN service with FIFO expiration of requests, switching to
a new point in the service tree and doing a batch of IO from there
in case of expiry.
config IOSCHED_CFQ
tristate "CFQ I/O scheduler"
default y
---help---
The CFQ I/O scheduler tries to distribute bandwidth equally
among all processes in the system. It should provide a fair
and low latency working environment, suitable for both desktop
and server systems.
This is the default I/O scheduler.
config CFQ_GROUP_IOSCHED
bool "CFQ Group Scheduling support"
depends on IOSCHED_CFQ && BLK_CGROUP
---help---
Enable group IO scheduling in CFQ.
choice
prompt "Default I/O scheduler"
default DEFAULT_CFQ
help
Select the I/O scheduler which will be used by default for all
block devices.
config DEFAULT_DEADLINE
bool "Deadline" if IOSCHED_DEADLINE=y
config DEFAULT_CFQ
bool "CFQ" if IOSCHED_CFQ=y
config DEFAULT_NOOP
bool "No-op"
endchoice
config DEFAULT_IOSCHED
string
default "deadline" if DEFAULT_DEADLINE
default "cfq" if DEFAULT_CFQ
default "noop" if DEFAULT_NOOP
config MQ_IOSCHED_DEADLINE
tristate "MQ deadline I/O scheduler"
default y
......
......@@ -3,7 +3,7 @@
# Makefile for the kernel block layer
#
obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-sysfs.o \
blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
......@@ -18,9 +18,6 @@ obj-$(CONFIG_BLK_DEV_BSGLIB) += bsg-lib.o
obj-$(CONFIG_BLK_CGROUP) += blk-cgroup.o
obj-$(CONFIG_BLK_DEV_THROTTLING) += blk-throttle.o
obj-$(CONFIG_BLK_CGROUP_IOLATENCY) += blk-iolatency.o
obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o
obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
obj-$(CONFIG_MQ_IOSCHED_DEADLINE) += mq-deadline.o
obj-$(CONFIG_MQ_IOSCHED_KYBER) += kyber-iosched.o
bfq-y := bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
......
......@@ -334,7 +334,7 @@ static void bfqg_stats_xfer_dead(struct bfq_group *bfqg)
parent = bfqg_parent(bfqg);
lockdep_assert_held(bfqg_to_blkg(bfqg)->q->queue_lock);
lockdep_assert_held(&bfqg_to_blkg(bfqg)->q->queue_lock);
if (unlikely(!parent))
return;
......@@ -642,7 +642,7 @@ void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio)
uint64_t serial_nr;
rcu_read_lock();
serial_nr = bio_blkcg(bio)->css.serial_nr;
serial_nr = __bio_blkcg(bio)->css.serial_nr;
/*
* Check whether blkcg has changed. The condition may trigger
......@@ -651,7 +651,7 @@ void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio)
if (unlikely(!bfqd) || likely(bic->blkcg_serial_nr == serial_nr))
goto out;
bfqg = __bfq_bic_change_cgroup(bfqd, bic, bio_blkcg(bio));
bfqg = __bfq_bic_change_cgroup(bfqd, bic, __bio_blkcg(bio));
/*
* Update blkg_path for bfq_log_* functions. We cache this
* path, and update it here, for the following
......
......@@ -399,9 +399,9 @@ static struct bfq_io_cq *bfq_bic_lookup(struct bfq_data *bfqd,
unsigned long flags;
struct bfq_io_cq *icq;
spin_lock_irqsave(q->queue_lock, flags);
spin_lock_irqsave(&q->queue_lock, flags);
icq = icq_to_bic(ioc_lookup_icq(ioc, q));
spin_unlock_irqrestore(q->queue_lock, flags);
spin_unlock_irqrestore(&q->queue_lock, flags);
return icq;
}
......@@ -4066,7 +4066,7 @@ static void bfq_update_dispatch_stats(struct request_queue *q,
* In addition, the following queue lock guarantees that
* bfqq_group(bfqq) exists as well.
*/
spin_lock_irq(q->queue_lock);
spin_lock_irq(&q->queue_lock);
if (idle_timer_disabled)
/*
* Since the idle timer has been disabled,
......@@ -4085,7 +4085,7 @@ static void bfq_update_dispatch_stats(struct request_queue *q,
bfqg_stats_set_start_empty_time(bfqg);
bfqg_stats_update_io_remove(bfqg, rq->cmd_flags);
}
spin_unlock_irq(q->queue_lock);
spin_unlock_irq(&q->queue_lock);
}
#else
static inline void bfq_update_dispatch_stats(struct request_queue *q,
......@@ -4416,7 +4416,7 @@ static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
rcu_read_lock();
bfqg = bfq_find_set_group(bfqd, bio_blkcg(bio));
bfqg = bfq_find_set_group(bfqd, __bio_blkcg(bio));
if (!bfqg) {
bfqq = &bfqd->oom_bfqq;
goto out;
......@@ -4669,11 +4669,11 @@ static void bfq_update_insert_stats(struct request_queue *q,
* In addition, the following queue lock guarantees that
* bfqq_group(bfqq) exists as well.
*/
spin_lock_irq(q->queue_lock);
spin_lock_irq(&q->queue_lock);
bfqg_stats_update_io_add(bfqq_group(bfqq), bfqq, cmd_flags);
if (idle_timer_disabled)
bfqg_stats_update_idle_time(bfqq_group(bfqq));
spin_unlock_irq(q->queue_lock);
spin_unlock_irq(&q->queue_lock);
}
#else
static inline void bfq_update_insert_stats(struct request_queue *q,
......@@ -5414,9 +5414,9 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
}
eq->elevator_data = bfqd;
spin_lock_irq(q->queue_lock);
spin_lock_irq(&q->queue_lock);
q->elevator = eq;
spin_unlock_irq(q->queue_lock);
spin_unlock_irq(&q->queue_lock);
/*
* Our fallback bfqq if bfq_find_alloc_queue() runs into OOM issues.
......@@ -5756,7 +5756,7 @@ static struct elv_fs_entry bfq_attrs[] = {
};
static struct elevator_type iosched_bfq_mq = {
.ops.mq = {
.ops = {
.limit_depth = bfq_limit_depth,
.prepare_request = bfq_prepare_request,
.requeue_request = bfq_finish_requeue_request,
......@@ -5777,7 +5777,6 @@ static struct elevator_type iosched_bfq_mq = {
.exit_sched = bfq_exit_queue,
},
.uses_mq = true,
.icq_size = sizeof(struct bfq_io_cq),
.icq_align = __alignof__(struct bfq_io_cq),
.elevator_attrs = bfq_attrs,
......
......@@ -390,7 +390,6 @@ void bio_integrity_advance(struct bio *bio, unsigned int bytes_done)
bip->bip_iter.bi_sector += bytes_done >> 9;
bvec_iter_advance(bip->bip_vec, &bip->bip_iter, bytes);
}
EXPORT_SYMBOL(bio_integrity_advance);
/**
* bio_integrity_trim - Trim integrity vector
......@@ -460,7 +459,6 @@ void bioset_integrity_free(struct bio_set *bs)
mempool_exit(&bs->bio_integrity_pool);
mempool_exit(&bs->bvec_integrity_pool);
}
EXPORT_SYMBOL(bioset_integrity_free);
void __init bio_integrity_init(void)
{
......
......@@ -244,7 +244,7 @@ struct bio_vec *bvec_alloc(gfp_t gfp_mask, int nr, unsigned long *idx,
void bio_uninit(struct bio *bio)
{
bio_disassociate_task(bio);
bio_disassociate_blkg(bio);
}
EXPORT_SYMBOL(bio_uninit);
......@@ -571,14 +571,13 @@ void bio_put(struct bio *bio)
}
EXPORT_SYMBOL(bio_put);
inline int bio_phys_segments(struct request_queue *q, struct bio *bio)
int bio_phys_segments(struct request_queue *q, struct bio *bio)
{
if (unlikely(!bio_flagged(bio, BIO_SEG_VALID)))
blk_recount_segments(q, bio);
return bio->bi_phys_segments;
}
EXPORT_SYMBOL(bio_phys_segments);
/**
* __bio_clone_fast - clone a bio that shares the original bio's biovec
......@@ -610,7 +609,8 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
bio->bi_iter = bio_src->bi_iter;
bio->bi_io_vec = bio_src->bi_io_vec;
bio_clone_blkcg_association(bio, bio_src);
bio_clone_blkg_association(bio, bio_src);
blkcg_bio_issue_init(bio);
}
EXPORT_SYMBOL(__bio_clone_fast);
......@@ -901,7 +901,6 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
return 0;
}
EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);
static void submit_bio_wait_endio(struct bio *bio)
{
......@@ -1592,7 +1591,6 @@ void bio_set_pages_dirty(struct bio *bio)
set_page_dirty_lock(bvec->bv_page);
}
}
EXPORT_SYMBOL_GPL(bio_set_pages_dirty);
static void bio_release_pages(struct bio *bio)
{
......@@ -1662,17 +1660,33 @@ void bio_check_pages_dirty(struct bio *bio)
spin_unlock_irqrestore(&bio_dirty_lock, flags);
schedule_work(&bio_dirty_work);
}
EXPORT_SYMBOL_GPL(bio_check_pages_dirty);
void update_io_ticks(struct hd_struct *part, unsigned long now)
{
unsigned long stamp;
again:
stamp = READ_ONCE(part->stamp);
if (unlikely(stamp != now)) {
if (likely(cmpxchg(&part->stamp, stamp, now) == stamp)) {
__part_stat_add(part, io_ticks, 1);
}
}
if (part->partno) {
part = &part_to_disk(part)->part0;
goto again;
}
}
void generic_start_io_acct(struct request_queue *q, int op,
unsigned long sectors, struct hd_struct *part)
{
const int sgrp = op_stat_group(op);
int cpu = part_stat_lock();
part_round_stats(q, cpu, part);
part_stat_inc(cpu, part, ios[sgrp]);
part_stat_add(cpu, part, sectors[sgrp], sectors);
part_stat_lock();
update_io_ticks(part, jiffies);
part_stat_inc(part, ios[sgrp]);
part_stat_add(part, sectors[sgrp], sectors);
part_inc_in_flight(q, part, op_is_write(op));
part_stat_unlock();
......@@ -1682,12 +1696,15 @@ EXPORT_SYMBOL(generic_start_io_acct);
void generic_end_io_acct(struct request_queue *q, int req_op,
struct hd_struct *part, unsigned long start_time)
{
unsigned long duration = jiffies - start_time;
unsigned long now = jiffies;
unsigned long duration = now - start_time;
const int sgrp = op_stat_group(req_op);
int cpu = part_stat_lock();
part_stat_add(cpu, part, nsecs[sgrp], jiffies_to_nsecs(duration));
part_round_stats(q, cpu, part);
part_stat_lock();
update_io_ticks(part, now);
part_stat_add(part, nsecs[sgrp], jiffies_to_nsecs(duration));
part_stat_add(part, time_in_queue, duration);
part_dec_in_flight(q, part, op_is_write(req_op));
part_stat_unlock();
......@@ -1957,102 +1974,133 @@ EXPORT_SYMBOL(bioset_init_from_src);
#ifdef CONFIG_BLK_CGROUP
#ifdef CONFIG_MEMCG
/**
* bio_associate_blkcg_from_page - associate a bio with the page's blkcg
* bio_disassociate_blkg - puts back the blkg reference if associated
* @bio: target bio
* @page: the page to lookup the blkcg from
*
* Associate @bio with the blkcg from @page's owning memcg. This works like
* every other associate function wrt references.
* Helper to disassociate the blkg from @bio if a blkg is associated.
*/
int bio_associate_blkcg_from_page(struct bio *bio, struct page *page)
void bio_disassociate_blkg(struct bio *bio)
{
struct cgroup_subsys_state *blkcg_css;
if (unlikely(bio->bi_css))
return -EBUSY;
if (!page->mem_cgroup)
return 0;
blkcg_css = cgroup_get_e_css(page->mem_cgroup->css.cgroup,
&io_cgrp_subsys);
bio->bi_css = blkcg_css;
return 0;
if (bio->bi_blkg) {
blkg_put(bio->bi_blkg);
bio->bi_blkg = NULL;
}
}
#endif /* CONFIG_MEMCG */
EXPORT_SYMBOL_GPL(bio_disassociate_blkg);
/**
* bio_associate_blkcg - associate a bio with the specified blkcg
* __bio_associate_blkg - associate a bio with the a blkg
* @bio: target bio
* @blkcg_css: css of the blkcg to associate
* @blkg: the blkg to associate
*
* Associate @bio with the blkcg specified by @blkcg_css. Block layer will
* treat @bio as if it were issued by a task which belongs to the blkcg.
* This tries to associate @bio with the specified @blkg. Association failure
* is handled by walking up the blkg tree. Therefore, the blkg associated can
* be anything between @blkg and the root_blkg. This situation only happens
* when a cgroup is dying and then the remaining bios will spill to the closest
* alive blkg.
*
* This function takes an extra reference of @blkcg_css which will be put
* when @bio is released. The caller must own @bio and is responsible for
* synchronizing calls to this function.
* A reference will be taken on the @blkg and will be released when @bio is
* freed.
*/
int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css)
static void __bio_associate_blkg(struct bio *bio, struct blkcg_gq *blkg)
{
if (unlikely(bio->bi_css))
return -EBUSY;
css_get(blkcg_css);
bio->bi_css = blkcg_css;
return 0;
bio_disassociate_blkg(bio);
bio->bi_blkg = blkg_tryget_closest(blkg);
}
EXPORT_SYMBOL_GPL(bio_associate_blkcg);
/**
* bio_associate_blkg - associate a bio with the specified blkg
* bio_associate_blkg_from_css - associate a bio with a specified css
* @bio: target bio
* @blkg: the blkg to associate
* @css: target css
*
* Associate @bio with the blkg specified by @blkg. This is the queue specific
* blkcg information associated with the @bio, a reference will be taken on the
* @blkg and will be freed when the bio is freed.
* Associate @bio with the blkg found by combining the css's blkg and the
* request_queue of the @bio. This falls back to the queue's root_blkg if
* the association fails with the css.
*/
int bio_associate_blkg(struct bio *bio, struct blkcg_gq *blkg)
void bio_associate_blkg_from_css(struct bio *bio,
struct cgroup_subsys_state *css)
{
if (unlikely(bio->bi_blkg))
return -EBUSY;
if (!blkg_try_get(blkg))
return -ENODEV;
bio->bi_blkg = blkg;
return 0;
struct request_queue *q = bio->bi_disk->queue;
struct blkcg_gq *blkg;
rcu_read_lock();
if (!css || !css->parent)
blkg = q->root_blkg;
else
blkg = blkg_lookup_create(css_to_blkcg(css), q);
__bio_associate_blkg(bio, blkg);
rcu_read_unlock();
}
EXPORT_SYMBOL_GPL(bio_associate_blkg_from_css);
#ifdef CONFIG_MEMCG
/**
* bio_disassociate_task - undo bio_associate_current()
* bio_associate_blkg_from_page - associate a bio with the page's blkg
* @bio: target bio
* @page: the page to lookup the blkcg from
*
* Associate @bio with the blkg from @page's owning memcg and the respective
* request_queue. If cgroup_e_css returns %NULL, fall back to the queue's
* root_blkg.
*/
void bio_disassociate_task(struct bio *bio)
void bio_associate_blkg_from_page(struct bio *bio, struct page *page)
{
if (bio->bi_ioc) {
put_io_context(bio->bi_ioc);
bio->bi_ioc = NULL;
}
if (bio->bi_css) {
css_put(bio->bi_css);
bio->bi_css = NULL;
}
if (bio->bi_blkg) {
blkg_put(bio->bi_blkg);
bio->bi_blkg = NULL;
}
struct cgroup_subsys_state *css;
if (!page->mem_cgroup)
return;
rcu_read_lock();
css = cgroup_e_css(page->mem_cgroup->css.cgroup, &io_cgrp_subsys);
bio_associate_blkg_from_css(bio, css);
rcu_read_unlock();
}
#endif /* CONFIG_MEMCG */
/**
* bio_associate_blkg - associate a bio with a blkg
* @bio: target bio
*
* Associate @bio with the blkg found from the bio's css and request_queue.
* If one is not found, bio_lookup_blkg() creates the blkg. If a blkg is
* already associated, the css is reused and association redone as the
* request_queue may have changed.
*/
void bio_associate_blkg(struct bio *bio)
{
struct cgroup_subsys_state *css;
rcu_read_lock();
if (bio->bi_blkg)
css = &bio_blkcg(bio)->css;
else
css = blkcg_css();
bio_associate_blkg_from_css(bio, css);
rcu_read_unlock();
}
EXPORT_SYMBOL_GPL(bio_associate_blkg);
/**
* bio_clone_blkcg_association - clone blkcg association from src to dst bio
* bio_clone_blkg_association - clone blkg association from src to dst bio
* @dst: destination bio
* @src: source bio
*/
void bio_clone_blkcg_association(struct bio *dst, struct bio *src)
void bio_clone_blkg_association(struct bio *dst, struct bio *src)
{
if (src->bi_css)
WARN_ON(bio_associate_blkcg(dst, src->bi_css));
if (src->bi_blkg)
__bio_associate_blkg(dst, src->bi_blkg);
}
EXPORT_SYMBOL_GPL(bio_clone_blkcg_association);
EXPORT_SYMBOL_GPL(bio_clone_blkg_association);
#endif /* CONFIG_BLK_CGROUP */
static void __init biovec_init_slabs(void)
......
......@@ -76,14 +76,42 @@ static void blkg_free(struct blkcg_gq *blkg)
if (blkg->pd[i])
blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
if (blkg->blkcg != &blkcg_root)
blk_exit_rl(blkg->q, &blkg->rl);
blkg_rwstat_exit(&blkg->stat_ios);
blkg_rwstat_exit(&blkg->stat_bytes);
kfree(blkg);
}
static void __blkg_release(struct rcu_head *rcu)
{
struct blkcg_gq *blkg = container_of(rcu, struct blkcg_gq, rcu_head);
percpu_ref_exit(&blkg->refcnt);
/* release the blkcg and parent blkg refs this blkg has been holding */
css_put(&blkg->blkcg->css);
if (blkg->parent)
blkg_put(blkg->parent);
wb_congested_put(blkg->wb_congested);
blkg_free(blkg);
}
/*
* A group is RCU protected, but having an rcu lock does not mean that one
* can access all the fields of blkg and assume these are valid. For
* example, don't try to follow throtl_data and request queue links.
*
* Having a reference to blkg under an rcu allows accesses to only values
* local to groups like group stats and group rate limits.
*/
static void blkg_release(struct percpu_ref *ref)
{
struct blkcg_gq *blkg = container_of(ref, struct blkcg_gq, refcnt);
call_rcu(&blkg->rcu_head, __blkg_release);
}
/**
* blkg_alloc - allocate a blkg
* @blkcg: block cgroup the new blkg is associated with
......@@ -110,14 +138,6 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
blkg->q = q;
INIT_LIST_HEAD(&blkg->q_node);
blkg->blkcg = blkcg;
atomic_set(&blkg->refcnt, 1);
/* root blkg uses @q->root_rl, init rl only for !root blkgs */
if (blkcg != &blkcg_root) {
if (blk_init_rl(&blkg->rl, q, gfp_mask))
goto err_free;
blkg->rl.blkg = blkg;
}
for (i = 0; i < BLKCG_MAX_POLS; i++) {
struct blkcg_policy *pol = blkcg_policy[i];
......@@ -157,7 +177,7 @@ struct blkcg_gq *blkg_lookup_slowpath(struct blkcg *blkcg,
blkg = radix_tree_lookup(&blkcg->blkg_tree, q->id);
if (blkg && blkg->q == q) {
if (update_hint) {
lockdep_assert_held(q->queue_lock);
lockdep_assert_held(&q->queue_lock);
rcu_assign_pointer(blkcg->blkg_hint, blkg);
}
return blkg;
......@@ -180,7 +200,13 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
int i, ret;
WARN_ON_ONCE(!rcu_read_lock_held());
lockdep_assert_held(q->queue_lock);
lockdep_assert_held(&q->queue_lock);
/* request_queue is dying, do not create/recreate a blkg */
if (blk_queue_dying(q)) {
ret = -ENODEV;
goto err_free_blkg;
}
/* blkg holds a reference to blkcg */
if (!css_tryget_online(&blkcg->css)) {
......@@ -217,6 +243,11 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
blkg_get(blkg->parent);
}
ret = percpu_ref_init(&blkg->refcnt, blkg_release, 0,
GFP_NOWAIT | __GFP_NOWARN);
if (ret)
goto err_cancel_ref;
/* invoke per-policy init */
for (i = 0; i < BLKCG_MAX_POLS; i++) {
struct blkcg_policy *pol = blkcg_policy[i];
......@@ -249,6 +280,8 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
blkg_put(blkg);
return ERR_PTR(ret);
err_cancel_ref:
percpu_ref_exit(&blkg->refcnt);
err_put_congested:
wb_congested_put(wb_congested);
err_put_css:
......@@ -259,7 +292,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
}
/**
* blkg_lookup_create - lookup blkg, try to create one if not there
* __blkg_lookup_create - lookup blkg, try to create one if not there
* @blkcg: blkcg of interest
* @q: request_queue of interest
*
......@@ -268,24 +301,16 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
* that all non-root blkg's have access to the parent blkg. This function
* should be called under RCU read lock and @q->queue_lock.
*
* Returns pointer to the looked up or created blkg on success, ERR_PTR()
* value on error. If @q is dead, returns ERR_PTR(-EINVAL). If @q is not
* dead and bypassing, returns ERR_PTR(-EBUSY).
* Returns the blkg or the closest blkg if blkg_create() fails as it walks
* down from root.
*/
struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
struct request_queue *q)
struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
struct request_queue *q)
{
struct blkcg_gq *blkg;
WARN_ON_ONCE(!rcu_read_lock_held());
lockdep_assert_held(q->queue_lock);
/*
* This could be the first entry point of blkcg implementation and
* we shouldn't allow anything to go through for a bypassing queue.
*/
if (unlikely(blk_queue_bypass(q)))
return ERR_PTR(blk_queue_dying(q) ? -ENODEV : -EBUSY);
lockdep_assert_held(&q->queue_lock);
blkg = __blkg_lookup(blkcg, q, true);
if (blkg)
......@@ -293,30 +318,62 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
/*
* Create blkgs walking down from blkcg_root to @blkcg, so that all
* non-root blkgs have access to their parents.
* non-root blkgs have access to their parents. Returns the closest
* blkg to the intended blkg should blkg_create() fail.
*/
while (true) {
struct blkcg *pos = blkcg;
struct blkcg *parent = blkcg_parent(blkcg);
while (parent && !__blkg_lookup(parent, q, false)) {
struct blkcg_gq *ret_blkg = q->root_blkg;
while (parent) {
blkg = __blkg_lookup(parent, q, false);
if (blkg) {
/* remember closest blkg */
ret_blkg = blkg;
break;
}
pos = parent;
parent = blkcg_parent(parent);
}
blkg = blkg_create(pos, q, NULL);
if (pos == blkcg || IS_ERR(blkg))
if (IS_ERR(blkg))
return ret_blkg;
if (pos == blkcg)
return blkg;
}
}
/**
* blkg_lookup_create - find or create a blkg
* @blkcg: target block cgroup
* @q: target request_queue
*
* This looks up or creates the blkg representing the unique pair
* of the blkcg and the request_queue.
*/
struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
struct request_queue *q)
{
struct blkcg_gq *blkg = blkg_lookup(blkcg, q);
if (unlikely(!blkg)) {
spin_lock_irq(&q->queue_lock);
blkg = __blkg_lookup_create(blkcg, q);
spin_unlock_irq(&q->queue_lock);
}
return blkg;
}
static void blkg_destroy(struct blkcg_gq *blkg)
{
struct blkcg *blkcg = blkg->blkcg;
struct blkcg_gq *parent = blkg->parent;
int i;
lockdep_assert_held(blkg->q->queue_lock);
lockdep_assert_held(&blkg->q->queue_lock);
lockdep_assert_held(&blkcg->lock);
/* Something wrong if we are trying to remove same group twice */
......@@ -353,7 +410,7 @@ static void blkg_destroy(struct blkcg_gq *blkg)
* Put the reference taken at the time of creation so that when all
* queues are gone, group can be destroyed.
*/
blkg_put(blkg);
percpu_ref_kill(&blkg->refcnt);
}
/**
......@@ -366,8 +423,7 @@ static void blkg_destroy_all(struct request_queue *q)
{
struct blkcg_gq *blkg, *n;
lockdep_assert_held(q->queue_lock);
spin_lock_irq(&q->queue_lock);
list_for_each_entry_safe(blkg, n, &q->blkg_list, q_node) {
struct blkcg *blkcg = blkg->blkcg;
......@@ -377,7 +433,7 @@ static void blkg_destroy_all(struct request_queue *q)
}
q->root_blkg = NULL;
q->root_rl.blkg = NULL;
spin_unlock_irq(&q->queue_lock);
}
/*
......@@ -403,41 +459,6 @@ void __blkg_release_rcu(struct rcu_head *rcu_head)
}
EXPORT_SYMBOL_GPL(__blkg_release_rcu);
/*
* The next function used by blk_queue_for_each_rl(). It's a bit tricky
* because the root blkg uses @q->root_rl instead of its own rl.
*/
struct request_list *__blk_queue_next_rl(struct request_list *rl,
struct request_queue *q)
{
struct list_head *ent;
struct blkcg_gq *blkg;
/*
* Determine the current blkg list_head. The first entry is
* root_rl which is off @q->blkg_list and mapped to the head.
*/
if (rl == &q->root_rl) {
ent = &q->blkg_list;
/* There are no more block groups, hence no request lists */
if (list_empty(ent))
return NULL;
} else {
blkg = container_of(rl, struct blkcg_gq, rl);
ent = &blkg->q_node;
}
/* walk to the next list_head, skip root blkcg */
ent = ent->next;
if (ent == &q->root_blkg->q_node)
ent = ent->next;
if (ent == &q->blkg_list)
return NULL;
blkg = container_of(ent, struct blkcg_gq, q_node);
return &blkg->rl;
}
static int blkcg_reset_stats(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 val)
{
......@@ -477,7 +498,6 @@ const char *blkg_dev_name(struct blkcg_gq *blkg)
return dev_name(blkg->q->backing_dev_info->dev);
return NULL;
}
EXPORT_SYMBOL_GPL(blkg_dev_name);
/**
* blkcg_print_blkgs - helper for printing per-blkg data
......@@ -508,10 +528,10 @@ void blkcg_print_blkgs(struct seq_file *sf, struct blkcg *blkcg,
rcu_read_lock();
hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
spin_lock_irq(blkg->q->queue_lock);
spin_lock_irq(&blkg->q->queue_lock);
if (blkcg_policy_enabled(blkg->q, pol))
total += prfill(sf, blkg->pd[pol->plid], data);
spin_unlock_irq(blkg->q->queue_lock);
spin_unlock_irq(&blkg->q->queue_lock);
}
rcu_read_unlock();
......@@ -709,7 +729,7 @@ u64 blkg_stat_recursive_sum(struct blkcg_gq *blkg,
struct cgroup_subsys_state *pos_css;
u64 sum = 0;
lockdep_assert_held(blkg->q->queue_lock);
lockdep_assert_held(&blkg->q->queue_lock);
rcu_read_lock();
blkg_for_each_descendant_pre(pos_blkg, pos_css, blkg) {
......@@ -752,7 +772,7 @@ struct blkg_rwstat blkg_rwstat_recursive_sum(struct blkcg_gq *blkg,
struct blkg_rwstat sum = { };
int i;
lockdep_assert_held(blkg->q->queue_lock);
lockdep_assert_held(&blkg->q->queue_lock);
rcu_read_lock();
blkg_for_each_descendant_pre(pos_blkg, pos_css, blkg) {
......@@ -783,18 +803,10 @@ static struct blkcg_gq *blkg_lookup_check(struct blkcg *blkcg,
struct request_queue *q)
{
WARN_ON_ONCE(!rcu_read_lock_held());
lockdep_assert_held(q->queue_lock);
lockdep_assert_held(&q->queue_lock);
if (!blkcg_policy_enabled(q, pol))
return ERR_PTR(-EOPNOTSUPP);
/*
* This could be the first entry point of blkcg implementation and
* we shouldn't allow anything to go through for a bypassing queue.
*/
if (unlikely(blk_queue_bypass(q)))
return ERR_PTR(blk_queue_dying(q) ? -ENODEV : -EBUSY);
return __blkg_lookup(blkcg, q, true /* update_hint */);
}
......@@ -812,7 +824,7 @@ static struct blkcg_gq *blkg_lookup_check(struct blkcg *blkcg,
*/
int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
char *input, struct blkg_conf_ctx *ctx)
__acquires(rcu) __acquires(disk->queue->queue_lock)
__acquires(rcu) __acquires(&disk->queue->queue_lock)
{
struct gendisk *disk;
struct request_queue *q;
......@@ -840,7 +852,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
q = disk->queue;
rcu_read_lock();
spin_lock_irq(q->queue_lock);
spin_lock_irq(&q->queue_lock);
blkg = blkg_lookup_check(blkcg, pol, q);
if (IS_ERR(blkg)) {
......@@ -867,7 +879,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
}
/* Drop locks to do new blkg allocation with GFP_KERNEL. */
spin_unlock_irq(q->queue_lock);
spin_unlock_irq(&q->queue_lock);
rcu_read_unlock();
new_blkg = blkg_alloc(pos, q, GFP_KERNEL);
......@@ -877,7 +889,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
}
rcu_read_lock();
spin_lock_irq(q->queue_lock);
spin_lock_irq(&q->queue_lock);
blkg = blkg_lookup_check(pos, pol, q);
if (IS_ERR(blkg)) {
......@@ -905,7 +917,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
return 0;
fail_unlock:
spin_unlock_irq(q->queue_lock);
spin_unlock_irq(&q->queue_lock);
rcu_read_unlock();
fail:
put_disk_and_module(disk);
......@@ -921,7 +933,6 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
}
return ret;
}
EXPORT_SYMBOL_GPL(blkg_conf_prep);
/**
* blkg_conf_finish - finish up per-blkg config update
......@@ -931,13 +942,12 @@ EXPORT_SYMBOL_GPL(blkg_conf_prep);
* with blkg_conf_prep().
*/
void blkg_conf_finish(struct blkg_conf_ctx *ctx)
__releases(ctx->disk->queue->queue_lock) __releases(rcu)
__releases(&ctx->disk->queue->queue_lock) __releases(rcu)
{
spin_unlock_irq(ctx->disk->queue->queue_lock);
spin_unlock_irq(&ctx->disk->queue->queue_lock);
rcu_read_unlock();
put_disk_and_module(ctx->disk);
}
EXPORT_SYMBOL_GPL(blkg_conf_finish);
static int blkcg_print_stat(struct seq_file *sf, void *v)
{
......@@ -967,7 +977,7 @@ static int blkcg_print_stat(struct seq_file *sf, void *v)
*/
off += scnprintf(buf+off, size-off, "%s ", dname);
spin_lock_irq(blkg->q->queue_lock);
spin_lock_irq(&blkg->q->queue_lock);
rwstat = blkg_rwstat_recursive_sum(blkg, NULL,
offsetof(struct blkcg_gq, stat_bytes));
......@@ -981,7 +991,7 @@ static int blkcg_print_stat(struct seq_file *sf, void *v)
wios = atomic64_read(&rwstat.aux_cnt[BLKG_RWSTAT_WRITE]);
dios = atomic64_read(&rwstat.aux_cnt[BLKG_RWSTAT_DISCARD]);
spin_unlock_irq(blkg->q->queue_lock);
spin_unlock_irq(&blkg->q->queue_lock);
if (rbytes || wbytes || rios || wios) {
has_stats = true;
......@@ -1102,9 +1112,9 @@ void blkcg_destroy_blkgs(struct blkcg *blkcg)
struct blkcg_gq, blkcg_node);
struct request_queue *q = blkg->q;
if (spin_trylock(q->queue_lock)) {
if (spin_trylock(&q->queue_lock)) {
blkg_destroy(blkg);
spin_unlock(q->queue_lock);
spin_unlock(&q->queue_lock);
} else {
spin_unlock_irq(&blkcg->lock);
cpu_relax();
......@@ -1225,36 +1235,31 @@ int blkcg_init_queue(struct request_queue *q)
/* Make sure the root blkg exists. */
rcu_read_lock();
spin_lock_irq(q->queue_lock);
spin_lock_irq(&q->queue_lock);
blkg = blkg_create(&blkcg_root, q, new_blkg);
if (IS_ERR(blkg))
goto err_unlock;
q->root_blkg = blkg;
q->root_rl.blkg = blkg;
spin_unlock_irq(q->queue_lock);
spin_unlock_irq(&q->queue_lock);
rcu_read_unlock();
if (preloaded)
radix_tree_preload_end();
ret = blk_iolatency_init(q);
if (ret) {
spin_lock_irq(q->queue_lock);
blkg_destroy_all(q);
spin_unlock_irq(q->queue_lock);
return ret;
}
if (ret)
goto err_destroy_all;
ret = blk_throtl_init(q);
if (ret) {
spin_lock_irq(q->queue_lock);
blkg_destroy_all(q);
spin_unlock_irq(q->queue_lock);
}
return ret;
if (ret)
goto err_destroy_all;
return 0;
err_destroy_all:
blkg_destroy_all(q);
return ret;
err_unlock:
spin_unlock_irq(q->queue_lock);
spin_unlock_irq(&q->queue_lock);
rcu_read_unlock();
if (preloaded)
radix_tree_preload_end();
......@@ -1269,7 +1274,7 @@ int blkcg_init_queue(struct request_queue *q)
*/
void blkcg_drain_queue(struct request_queue *q)
{
lockdep_assert_held(q->queue_lock);
lockdep_assert_held(&q->queue_lock);
/*
* @q could be exiting and already have destroyed all blkgs as
......@@ -1289,10 +1294,7 @@ void blkcg_drain_queue(struct request_queue *q)
*/
void blkcg_exit_queue(struct request_queue *q)
{
spin_lock_irq(q->queue_lock);
blkg_destroy_all(q);
spin_unlock_irq(q->queue_lock);
blk_throtl_exit(q);
}
......@@ -1396,10 +1398,8 @@ int blkcg_activate_policy(struct request_queue *q,
if (blkcg_policy_enabled(q, pol))
return 0;
if (q->mq_ops)
if (queue_is_mq(q))
blk_mq_freeze_queue(q);
else
blk_queue_bypass_start(q);
pd_prealloc:
if (!pd_prealloc) {
pd_prealloc = pol->pd_alloc_fn(GFP_KERNEL, q->node);
......@@ -1409,7 +1409,7 @@ int blkcg_activate_policy(struct request_queue *q,
}
}
spin_lock_irq(q->queue_lock);
spin_lock_irq(&q->queue_lock);
list_for_each_entry(blkg, &q->blkg_list, q_node) {
struct blkg_policy_data *pd;
......@@ -1421,7 +1421,7 @@ int blkcg_activate_policy(struct request_queue *q,
if (!pd)
swap(pd, pd_prealloc);
if (!pd) {
spin_unlock_irq(q->queue_lock);
spin_unlock_irq(&q->queue_lock);
goto pd_prealloc;
}
......@@ -1435,12 +1435,10 @@ int blkcg_activate_policy(struct request_queue *q,
__set_bit(pol->plid, q->blkcg_pols);
ret = 0;
spin_unlock_irq(q->queue_lock);
spin_unlock_irq(&q->queue_lock);
out_bypass_end:
if (q->mq_ops)
if (queue_is_mq(q))
blk_mq_unfreeze_queue(q);
else
blk_queue_bypass_end(q);
if (pd_prealloc)
pol->pd_free_fn(pd_prealloc);
return ret;
......@@ -1463,12 +1461,10 @@ void blkcg_deactivate_policy(struct request_queue *q,
if (!blkcg_policy_enabled(q, pol))
return;
if (q->mq_ops)
if (queue_is_mq(q))
blk_mq_freeze_queue(q);
else
blk_queue_bypass_start(q);
spin_lock_irq(q->queue_lock);
spin_lock_irq(&q->queue_lock);
__clear_bit(pol->plid, q->blkcg_pols);
......@@ -1481,12 +1477,10 @@ void blkcg_deactivate_policy(struct request_queue *q,
}
}
spin_unlock_irq(q->queue_lock);
spin_unlock_irq(&q->queue_lock);
if (q->mq_ops)
if (queue_is_mq(q))
blk_mq_unfreeze_queue(q);
else
blk_queue_bypass_end(q);
}
EXPORT_SYMBOL_GPL(blkcg_deactivate_policy);
......@@ -1748,8 +1742,7 @@ void blkcg_maybe_throttle_current(void)
blkg = blkg_lookup(blkcg, q);
if (!blkg)
goto out;
blkg = blkg_try_get(blkg);
if (!blkg)
if (!blkg_tryget(blkg))
goto out;
rcu_read_unlock();
......@@ -1761,7 +1754,6 @@ void blkcg_maybe_throttle_current(void)
rcu_read_unlock();
blk_put_queue(q);
}
EXPORT_SYMBOL_GPL(blkcg_maybe_throttle_current);
/**
* blkcg_schedule_throttle - this task needs to check for throttling
......@@ -1795,7 +1787,6 @@ void blkcg_schedule_throttle(struct request_queue *q, bool use_memdelay)
current->use_memdelay = use_memdelay;
set_notify_resume(current);
}
EXPORT_SYMBOL_GPL(blkcg_schedule_throttle);
/**
* blkcg_add_delay - add delay to this blkg
......@@ -1810,7 +1801,6 @@ void blkcg_add_delay(struct blkcg_gq *blkg, u64 now, u64 delta)
blkcg_scale_delay(blkg, now);
atomic64_add(delta, &blkg->delay_nsec);
}
EXPORT_SYMBOL_GPL(blkcg_add_delay);
module_param(blkcg_debug_stats, bool, 0644);
MODULE_PARM_DESC(blkcg_debug_stats, "True if you want debug stats, false if not");
此差异已折叠。
......@@ -48,8 +48,6 @@ void blk_execute_rq_nowait(struct request_queue *q, struct gendisk *bd_disk,
struct request *rq, int at_head,
rq_end_io_fn *done)
{
int where = at_head ? ELEVATOR_INSERT_FRONT : ELEVATOR_INSERT_BACK;
WARN_ON(irqs_disabled());
WARN_ON(!blk_rq_is_passthrough(rq));
......@@ -60,23 +58,7 @@ void blk_execute_rq_nowait(struct request_queue *q, struct gendisk *bd_disk,
* don't check dying flag for MQ because the request won't
* be reused after dying flag is set
*/
if (q->mq_ops) {
blk_mq_sched_insert_request(rq, at_head, true, false);
return;
}
spin_lock_irq(q->queue_lock);
if (unlikely(blk_queue_dying(q))) {
rq->rq_flags |= RQF_QUIET;
__blk_end_request_all(rq, BLK_STS_IOERR);
spin_unlock_irq(q->queue_lock);
return;
}
__elv_add_request(q, rq, where);
__blk_run_queue(q);
spin_unlock_irq(q->queue_lock);
blk_mq_sched_insert_request(rq, at_head, true, false);
}
EXPORT_SYMBOL_GPL(blk_execute_rq_nowait);
......
......@@ -93,7 +93,7 @@ enum {
FLUSH_PENDING_TIMEOUT = 5 * HZ,
};
static bool blk_kick_flush(struct request_queue *q,
static void blk_kick_flush(struct request_queue *q,
struct blk_flush_queue *fq, unsigned int flags);
static unsigned int blk_flush_policy(unsigned long fflags, struct request *rq)
......@@ -132,18 +132,9 @@ static void blk_flush_restore_request(struct request *rq)
rq->end_io = rq->flush.saved_end_io;
}
static bool blk_flush_queue_rq(struct request *rq, bool add_front)
static void blk_flush_queue_rq(struct request *rq, bool add_front)
{
if (rq->q->mq_ops) {
blk_mq_add_to_requeue_list(rq, add_front, true);
return false;
} else {
if (add_front)
list_add(&rq->queuelist, &rq->q->queue_head);
else
list_add_tail(&rq->queuelist, &rq->q->queue_head);
return true;
}
blk_mq_add_to_requeue_list(rq, add_front, true);
}
/**
......@@ -157,18 +148,17 @@ static bool blk_flush_queue_rq(struct request *rq, bool add_front)
* completion and trigger the next step.
*
* CONTEXT:
* spin_lock_irq(q->queue_lock or fq->mq_flush_lock)
* spin_lock_irq(fq->mq_flush_lock)
*
* RETURNS:
* %true if requests were added to the dispatch queue, %false otherwise.
*/
static bool blk_flush_complete_seq(struct request *rq,
static void blk_flush_complete_seq(struct request *rq,
struct blk_flush_queue *fq,
unsigned int seq, blk_status_t error)
{
struct request_queue *q = rq->q;
struct list_head *pending = &fq->flush_queue[fq->flush_pending_idx];
bool queued = false, kicked;
unsigned int cmd_flags;
BUG_ON(rq->flush.seq & seq);
......@@ -191,7 +181,7 @@ static bool blk_flush_complete_seq(struct request *rq,
case REQ_FSEQ_DATA:
list_move_tail(&rq->flush.list, &fq->flush_data_in_flight);
queued = blk_flush_queue_rq(rq, true);
blk_flush_queue_rq(rq, true);
break;
case REQ_FSEQ_DONE:
......@@ -204,42 +194,34 @@ static bool blk_flush_complete_seq(struct request *rq,
BUG_ON(!list_empty(&rq->queuelist));
list_del_init(&rq->flush.list);
blk_flush_restore_request(rq);
if (q->mq_ops)
blk_mq_end_request(rq, error);
else
__blk_end_request_all(rq, error);
blk_mq_end_request(rq, error);
break;
default:
BUG();
}
kicked = blk_kick_flush(q, fq, cmd_flags);
return kicked | queued;
blk_kick_flush(q, fq, cmd_flags);
}
static void flush_end_io(struct request *flush_rq, blk_status_t error)
{
struct request_queue *q = flush_rq->q;
struct list_head *running;
bool queued = false;
struct request *rq, *n;
unsigned long flags = 0;
struct blk_flush_queue *fq = blk_get_flush_queue(q, flush_rq->mq_ctx);
struct blk_mq_hw_ctx *hctx;
if (q->mq_ops) {
struct blk_mq_hw_ctx *hctx;
/* release the tag's ownership to the req cloned from */
spin_lock_irqsave(&fq->mq_flush_lock, flags);
hctx = blk_mq_map_queue(q, flush_rq->mq_ctx->cpu);
if (!q->elevator) {
blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq);
flush_rq->tag = -1;
} else {
blk_mq_put_driver_tag_hctx(hctx, flush_rq);
flush_rq->internal_tag = -1;
}
/* release the tag's ownership to the req cloned from */
spin_lock_irqsave(&fq->mq_flush_lock, flags);
hctx = flush_rq->mq_hctx;
if (!q->elevator) {
blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq);
flush_rq->tag = -1;
} else {
blk_mq_put_driver_tag_hctx(hctx, flush_rq);
flush_rq->internal_tag = -1;
}
running = &fq->flush_queue[fq->flush_running_idx];
......@@ -248,35 +230,16 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
/* account completion of the flush request */
fq->flush_running_idx ^= 1;
if (!q->mq_ops)
elv_completed_request(q, flush_rq);
/* and push the waiting requests to the next stage */
list_for_each_entry_safe(rq, n, running, flush.list) {
unsigned int seq = blk_flush_cur_seq(rq);
BUG_ON(seq != REQ_FSEQ_PREFLUSH && seq != REQ_FSEQ_POSTFLUSH);
queued |= blk_flush_complete_seq(rq, fq, seq, error);
blk_flush_complete_seq(rq, fq, seq, error);
}
/*
* Kick the queue to avoid stall for two cases:
* 1. Moving a request silently to empty queue_head may stall the
* queue.
* 2. When flush request is running in non-queueable queue, the
* queue is hold. Restart the queue after flush request is finished
* to avoid stall.
* This function is called from request completion path and calling
* directly into request_fn may confuse the driver. Always use
* kblockd.
*/
if (queued || fq->flush_queue_delayed) {
WARN_ON(q->mq_ops);
blk_run_queue_async(q);
}
fq->flush_queue_delayed = 0;
if (q->mq_ops)
spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
}
/**
......@@ -289,12 +252,10 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
* Please read the comment at the top of this file for more info.
*
* CONTEXT:
* spin_lock_irq(q->queue_lock or fq->mq_flush_lock)
* spin_lock_irq(fq->mq_flush_lock)
*
* RETURNS:
* %true if flush was issued, %false otherwise.
*/
static bool blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
unsigned int flags)
{
struct list_head *pending = &fq->flush_queue[fq->flush_pending_idx];
......@@ -304,7 +265,7 @@ static bool blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
/* C1 described at the top of this file */
if (fq->flush_pending_idx != fq->flush_running_idx || list_empty(pending))
return false;
return;
/* C2 and C3
*
......@@ -312,11 +273,10 @@ static bool blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
* assigned to empty flushes, and we deadlock if we are expecting
* other requests to make progress. Don't defer for that case.
*/
if (!list_empty(&fq->flush_data_in_flight) &&
!(q->mq_ops && q->elevator) &&
if (!list_empty(&fq->flush_data_in_flight) && q->elevator &&
time_before(jiffies,
fq->flush_pending_since + FLUSH_PENDING_TIMEOUT))
return false;
return;
/*
* Issue flush and toggle pending_idx. This makes pending_idx
......@@ -334,19 +294,15 @@ static bool blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
* In case of IO scheduler, flush rq need to borrow scheduler tag
* just for cheating put/get driver tag.
*/
if (q->mq_ops) {
struct blk_mq_hw_ctx *hctx;
flush_rq->mq_ctx = first_rq->mq_ctx;
if (!q->elevator) {
fq->orig_rq = first_rq;
flush_rq->tag = first_rq->tag;
hctx = blk_mq_map_queue(q, first_rq->mq_ctx->cpu);
blk_mq_tag_set_rq(hctx, first_rq->tag, flush_rq);
} else {
flush_rq->internal_tag = first_rq->internal_tag;
}
flush_rq->mq_ctx = first_rq->mq_ctx;
flush_rq->mq_hctx = first_rq->mq_hctx;
if (!q->elevator) {
fq->orig_rq = first_rq;
flush_rq->tag = first_rq->tag;
blk_mq_tag_set_rq(flush_rq->mq_hctx, first_rq->tag, flush_rq);
} else {
flush_rq->internal_tag = first_rq->internal_tag;
}
flush_rq->cmd_flags = REQ_OP_FLUSH | REQ_PREFLUSH;
......@@ -355,62 +311,17 @@ static bool blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
flush_rq->rq_disk = first_rq->rq_disk;
flush_rq->end_io = flush_end_io;
return blk_flush_queue_rq(flush_rq, false);
}
static void flush_data_end_io(struct request *rq, blk_status_t error)
{
struct request_queue *q = rq->q;
struct blk_flush_queue *fq = blk_get_flush_queue(q, NULL);
lockdep_assert_held(q->queue_lock);
/*
* Updating q->in_flight[] here for making this tag usable
* early. Because in blk_queue_start_tag(),
* q->in_flight[BLK_RW_ASYNC] is used to limit async I/O and
* reserve tags for sync I/O.
*
* More importantly this way can avoid the following I/O
* deadlock:
*
* - suppose there are 40 fua requests comming to flush queue
* and queue depth is 31
* - 30 rqs are scheduled then blk_queue_start_tag() can't alloc
* tag for async I/O any more
* - all the 30 rqs are completed before FLUSH_PENDING_TIMEOUT
* and flush_data_end_io() is called
* - the other rqs still can't go ahead if not updating
* q->in_flight[BLK_RW_ASYNC] here, meantime these rqs
* are held in flush data queue and make no progress of
* handling post flush rq
* - only after the post flush rq is handled, all these rqs
* can be completed
*/
elv_completed_request(q, rq);
/* for avoiding double accounting */
rq->rq_flags &= ~RQF_STARTED;
/*
* After populating an empty queue, kick it to avoid stall. Read
* the comment in flush_end_io().
*/
if (blk_flush_complete_seq(rq, fq, REQ_FSEQ_DATA, error))
blk_run_queue_async(q);
blk_flush_queue_rq(flush_rq, false);
}
static void mq_flush_data_end_io(struct request *rq, blk_status_t error)
{
struct request_queue *q = rq->q;
struct blk_mq_hw_ctx *hctx;
struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
struct blk_mq_ctx *ctx = rq->mq_ctx;
unsigned long flags;
struct blk_flush_queue *fq = blk_get_flush_queue(q, ctx);
hctx = blk_mq_map_queue(q, ctx->cpu);
if (q->elevator) {
WARN_ON(rq->tag < 0);
blk_mq_put_driver_tag_hctx(hctx, rq);
......@@ -443,9 +354,6 @@ void blk_insert_flush(struct request *rq)
unsigned int policy = blk_flush_policy(fflags, rq);
struct blk_flush_queue *fq = blk_get_flush_queue(q, rq->mq_ctx);
if (!q->mq_ops)
lockdep_assert_held(q->queue_lock);
/*
* @policy now records what operations need to be done. Adjust
* REQ_PREFLUSH and FUA for the driver.
......@@ -468,10 +376,7 @@ void blk_insert_flush(struct request *rq)
* complete the request.
*/
if (!policy) {
if (q->mq_ops)
blk_mq_end_request(rq, 0);
else
__blk_end_request(rq, 0, 0);
blk_mq_end_request(rq, 0);
return;
}
......@@ -484,10 +389,7 @@ void blk_insert_flush(struct request *rq)
*/
if ((policy & REQ_FSEQ_DATA) &&
!(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
if (q->mq_ops)
blk_mq_request_bypass_insert(rq, false);
else
list_add_tail(&rq->queuelist, &q->queue_head);
blk_mq_request_bypass_insert(rq, false);
return;
}
......@@ -499,17 +401,12 @@ void blk_insert_flush(struct request *rq)
INIT_LIST_HEAD(&rq->flush.list);
rq->rq_flags |= RQF_FLUSH_SEQ;
rq->flush.saved_end_io = rq->end_io; /* Usually NULL */
if (q->mq_ops) {
rq->end_io = mq_flush_data_end_io;
spin_lock_irq(&fq->mq_flush_lock);
blk_flush_complete_seq(rq, fq, REQ_FSEQ_ACTIONS & ~policy, 0);
spin_unlock_irq(&fq->mq_flush_lock);
return;
}
rq->end_io = flush_data_end_io;
rq->end_io = mq_flush_data_end_io;
spin_lock_irq(&fq->mq_flush_lock);
blk_flush_complete_seq(rq, fq, REQ_FSEQ_ACTIONS & ~policy, 0);
spin_unlock_irq(&fq->mq_flush_lock);
}
/**
......@@ -575,8 +472,7 @@ struct blk_flush_queue *blk_alloc_flush_queue(struct request_queue *q,
if (!fq)
goto fail;
if (q->mq_ops)
spin_lock_init(&fq->mq_flush_lock);
spin_lock_init(&fq->mq_flush_lock);
rq_sz = round_up(rq_sz + cmd_size, cache_line_size());
fq->flush_rq = kzalloc_node(rq_sz, flags, node);
......
......@@ -28,7 +28,6 @@ void get_io_context(struct io_context *ioc)
BUG_ON(atomic_long_read(&ioc->refcount) <= 0);
atomic_long_inc(&ioc->refcount);
}
EXPORT_SYMBOL(get_io_context);
static void icq_free_icq_rcu(struct rcu_head *head)
{
......@@ -48,10 +47,8 @@ static void ioc_exit_icq(struct io_cq *icq)
if (icq->flags & ICQ_EXITED)
return;
if (et->uses_mq && et->ops.mq.exit_icq)
et->ops.mq.exit_icq(icq);
else if (!et->uses_mq && et->ops.sq.elevator_exit_icq_fn)
et->ops.sq.elevator_exit_icq_fn(icq);
if (et->ops.exit_icq)
et->ops.exit_icq(icq);
icq->flags |= ICQ_EXITED;
}
......@@ -113,9 +110,9 @@ static void ioc_release_fn(struct work_struct *work)
struct io_cq, ioc_node);
struct request_queue *q = icq->q;
if (spin_trylock(q->queue_lock)) {
if (spin_trylock(&q->queue_lock)) {
ioc_destroy_icq(icq);
spin_unlock(q->queue_lock);
spin_unlock(&q->queue_lock);
} else {
spin_unlock_irqrestore(&ioc->lock, flags);
cpu_relax();
......@@ -162,7 +159,6 @@ void put_io_context(struct io_context *ioc)
if (free_ioc)
kmem_cache_free(iocontext_cachep, ioc);
}
EXPORT_SYMBOL(put_io_context);
/**
* put_io_context_active - put active reference on ioc
......@@ -173,7 +169,6 @@ EXPORT_SYMBOL(put_io_context);
*/
void put_io_context_active(struct io_context *ioc)
{
struct elevator_type *et;
unsigned long flags;
struct io_cq *icq;
......@@ -187,25 +182,12 @@ void put_io_context_active(struct io_context *ioc)
* reverse double locking. Read comment in ioc_release_fn() for
* explanation on the nested locking annotation.
*/
retry:
spin_lock_irqsave_nested(&ioc->lock, flags, 1);
hlist_for_each_entry(icq, &ioc->icq_list, ioc_node) {
if (icq->flags & ICQ_EXITED)
continue;
et = icq->q->elevator->type;
if (et->uses_mq) {
ioc_exit_icq(icq);
} else {
if (spin_trylock(icq->q->queue_lock)) {
ioc_exit_icq(icq);
spin_unlock(icq->q->queue_lock);
} else {
spin_unlock_irqrestore(&ioc->lock, flags);
cpu_relax();
goto retry;
}
}
ioc_exit_icq(icq);
}
spin_unlock_irqrestore(&ioc->lock, flags);
......@@ -232,7 +214,7 @@ static void __ioc_clear_queue(struct list_head *icq_list)
while (!list_empty(icq_list)) {
struct io_cq *icq = list_entry(icq_list->next,
struct io_cq, q_node);
struct io_cq, q_node);
struct io_context *ioc = icq->ioc;
spin_lock_irqsave(&ioc->lock, flags);
......@@ -251,16 +233,11 @@ void ioc_clear_queue(struct request_queue *q)
{
LIST_HEAD(icq_list);
spin_lock_irq(q->queue_lock);
spin_lock_irq(&q->queue_lock);
list_splice_init(&q->icq_list, &icq_list);
spin_unlock_irq(&q->queue_lock);
if (q->mq_ops) {
spin_unlock_irq(q->queue_lock);
__ioc_clear_queue(&icq_list);
} else {
__ioc_clear_queue(&icq_list);
spin_unlock_irq(q->queue_lock);
}
__ioc_clear_queue(&icq_list);
}
int create_task_io_context(struct task_struct *task, gfp_t gfp_flags, int node)
......@@ -336,7 +313,6 @@ struct io_context *get_task_io_context(struct task_struct *task,
return NULL;
}
EXPORT_SYMBOL(get_task_io_context);
/**
* ioc_lookup_icq - lookup io_cq from ioc
......@@ -350,7 +326,7 @@ struct io_cq *ioc_lookup_icq(struct io_context *ioc, struct request_queue *q)
{
struct io_cq *icq;
lockdep_assert_held(q->queue_lock);
lockdep_assert_held(&q->queue_lock);
/*
* icq's are indexed from @ioc using radix tree and hint pointer,
......@@ -409,16 +385,14 @@ struct io_cq *ioc_create_icq(struct io_context *ioc, struct request_queue *q,
INIT_HLIST_NODE(&icq->ioc_node);
/* lock both q and ioc and try to link @icq */
spin_lock_irq(q->queue_lock);
spin_lock_irq(&q->queue_lock);
spin_lock(&ioc->lock);
if (likely(!radix_tree_insert(&ioc->icq_tree, q->id, icq))) {
hlist_add_head(&icq->ioc_node, &ioc->icq_list);
list_add(&icq->q_node, &q->icq_list);
if (et->uses_mq && et->ops.mq.init_icq)
et->ops.mq.init_icq(icq);
else if (!et->uses_mq && et->ops.sq.elevator_init_icq_fn)
et->ops.sq.elevator_init_icq_fn(icq);
if (et->ops.init_icq)
et->ops.init_icq(icq);
} else {
kmem_cache_free(et->icq_cache, icq);
icq = ioc_lookup_icq(ioc, q);
......@@ -427,7 +401,7 @@ struct io_cq *ioc_create_icq(struct io_context *ioc, struct request_queue *q,
}
spin_unlock(&ioc->lock);
spin_unlock_irq(q->queue_lock);
spin_unlock_irq(&q->queue_lock);
radix_tree_preload_end();
return icq;
}
......
......@@ -262,29 +262,25 @@ static inline void iolat_update_total_lat_avg(struct iolatency_grp *iolat,
stat->rqs.mean);
}
static inline bool iolatency_may_queue(struct iolatency_grp *iolat,
wait_queue_entry_t *wait,
bool first_block)
static void iolat_cleanup_cb(struct rq_wait *rqw, void *private_data)
{
struct rq_wait *rqw = &iolat->rq_wait;
atomic_dec(&rqw->inflight);
wake_up(&rqw->wait);
}
if (first_block && waitqueue_active(&rqw->wait) &&
rqw->wait.head.next != &wait->entry)
return false;
static bool iolat_acquire_inflight(struct rq_wait *rqw, void *private_data)
{
struct iolatency_grp *iolat = private_data;
return rq_wait_inc_below(rqw, iolat->rq_depth.max_depth);
}
static void __blkcg_iolatency_throttle(struct rq_qos *rqos,
struct iolatency_grp *iolat,
spinlock_t *lock, bool issue_as_root,
bool issue_as_root,
bool use_memdelay)
__releases(lock)
__acquires(lock)
{
struct rq_wait *rqw = &iolat->rq_wait;
unsigned use_delay = atomic_read(&lat_to_blkg(iolat)->use_delay);
DEFINE_WAIT(wait);
bool first_block = true;
if (use_delay)
blkcg_schedule_throttle(rqos->q, use_memdelay);
......@@ -301,27 +297,7 @@ static void __blkcg_iolatency_throttle(struct rq_qos *rqos,
return;
}
if (iolatency_may_queue(iolat, &wait, first_block))
return;
do {
prepare_to_wait_exclusive(&rqw->wait, &wait,
TASK_UNINTERRUPTIBLE);
if (iolatency_may_queue(iolat, &wait, first_block))
break;
first_block = false;
if (lock) {
spin_unlock_irq(lock);
io_schedule();
spin_lock_irq(lock);
} else {
io_schedule();
}
} while (1);
finish_wait(&rqw->wait, &wait);
rq_qos_wait(rqw, iolat, iolat_acquire_inflight, iolat_cleanup_cb);
}
#define SCALE_DOWN_FACTOR 2
......@@ -478,38 +454,15 @@ static void check_scale_change(struct iolatency_grp *iolat)
scale_change(iolat, direction > 0);
}
static void blkcg_iolatency_throttle(struct rq_qos *rqos, struct bio *bio,
spinlock_t *lock)
static void blkcg_iolatency_throttle(struct rq_qos *rqos, struct bio *bio)
{
struct blk_iolatency *blkiolat = BLKIOLATENCY(rqos);
struct blkcg *blkcg;
struct blkcg_gq *blkg;
struct request_queue *q = rqos->q;
struct blkcg_gq *blkg = bio->bi_blkg;
bool issue_as_root = bio_issue_as_root_blkg(bio);
if (!blk_iolatency_enabled(blkiolat))
return;
rcu_read_lock();
blkcg = bio_blkcg(bio);
bio_associate_blkcg(bio, &blkcg->css);
blkg = blkg_lookup(blkcg, q);
if (unlikely(!blkg)) {
if (!lock)
spin_lock_irq(q->queue_lock);
blkg = blkg_lookup_create(blkcg, q);
if (IS_ERR(blkg))
blkg = NULL;
if (!lock)
spin_unlock_irq(q->queue_lock);
}
if (!blkg)
goto out;
bio_issue_init(&bio->bi_issue, bio_sectors(bio));
bio_associate_blkg(bio, blkg);
out:
rcu_read_unlock();
while (blkg && blkg->parent) {
struct iolatency_grp *iolat = blkg_to_lat(blkg);
if (!iolat) {
......@@ -518,7 +471,7 @@ static void blkcg_iolatency_throttle(struct rq_qos *rqos, struct bio *bio,
}
check_scale_change(iolat);
__blkcg_iolatency_throttle(rqos, iolat, lock, issue_as_root,
__blkcg_iolatency_throttle(rqos, iolat, issue_as_root,
(bio->bi_opf & REQ_SWAP) == REQ_SWAP);
blkg = blkg->parent;
}
......@@ -640,7 +593,7 @@ static void blkcg_iolatency_done_bio(struct rq_qos *rqos, struct bio *bio)
bool enabled = false;
blkg = bio->bi_blkg;
if (!blkg)
if (!blkg || !bio_flagged(bio, BIO_TRACKED))
return;
iolat = blkg_to_lat(bio->bi_blkg);
......@@ -730,7 +683,7 @@ static void blkiolatency_timer_fn(struct timer_list *t)
* We could be exiting, don't access the pd unless we have a
* ref on the blkg.
*/
if (!blkg_try_get(blkg))
if (!blkg_tryget(blkg))
continue;
iolat = blkg_to_lat(blkg);
......
......@@ -389,7 +389,6 @@ void blk_recount_segments(struct request_queue *q, struct bio *bio)
bio_set_flag(bio, BIO_SEG_VALID);
}
EXPORT_SYMBOL(blk_recount_segments);
static int blk_phys_contig_segment(struct request_queue *q, struct bio *bio,
struct bio *nxt)
......@@ -596,17 +595,6 @@ int ll_front_merge_fn(struct request_queue *q, struct request *req,
return ll_new_hw_segment(q, req, bio);
}
/*
* blk-mq uses req->special to carry normal driver per-request payload, it
* does not indicate a prepared command that we cannot merge with.
*/
static bool req_no_special_merge(struct request *req)
{
struct request_queue *q = req->q;
return !q->mq_ops && req->special;
}
static bool req_attempt_discard_merge(struct request_queue *q, struct request *req,
struct request *next)
{
......@@ -632,13 +620,6 @@ static int ll_merge_requests_fn(struct request_queue *q, struct request *req,
unsigned int seg_size =
req->biotail->bi_seg_back_size + next->bio->bi_seg_front_size;
/*
* First check if the either of the requests are re-queued
* requests. Can't merge them if they are.
*/
if (req_no_special_merge(req) || req_no_special_merge(next))
return 0;
if (req_gap_back_merge(req, next->bio))
return 0;
......@@ -703,12 +684,10 @@ static void blk_account_io_merge(struct request *req)
{
if (blk_do_io_stat(req)) {
struct hd_struct *part;
int cpu;
cpu = part_stat_lock();
part_stat_lock();
part = req->part;
part_round_stats(req->q, cpu, part);
part_dec_in_flight(req->q, part, rq_data_dir(req));
hd_struct_put(part);
......@@ -731,7 +710,8 @@ static inline bool blk_discard_mergable(struct request *req)
return false;
}
enum elv_merge blk_try_req_merge(struct request *req, struct request *next)
static enum elv_merge blk_try_req_merge(struct request *req,
struct request *next)
{
if (blk_discard_mergable(req))
return ELEVATOR_DISCARD_MERGE;
......@@ -748,9 +728,6 @@ enum elv_merge blk_try_req_merge(struct request *req, struct request *next)
static struct request *attempt_merge(struct request_queue *q,
struct request *req, struct request *next)
{
if (!q->mq_ops)
lockdep_assert_held(q->queue_lock);
if (!rq_mergeable(req) || !rq_mergeable(next))
return NULL;
......@@ -758,8 +735,7 @@ static struct request *attempt_merge(struct request_queue *q,
return NULL;
if (rq_data_dir(req) != rq_data_dir(next)
|| req->rq_disk != next->rq_disk
|| req_no_special_merge(next))
|| req->rq_disk != next->rq_disk)
return NULL;
if (req_op(req) == REQ_OP_WRITE_SAME &&
......@@ -773,6 +749,9 @@ static struct request *attempt_merge(struct request_queue *q,
if (req->write_hint != next->write_hint)
return NULL;
if (req->ioprio != next->ioprio)
return NULL;
/*
* If we are allowed to merge, then append bio list
* from next to rq and release next. merge_requests_fn
......@@ -828,10 +807,6 @@ static struct request *attempt_merge(struct request_queue *q,
*/
blk_account_io_merge(next);
req->ioprio = ioprio_best(req->ioprio, next->ioprio);
if (blk_rq_cpu_valid(next))
req->cpu = next->cpu;
/*
* ownership of bio passed from next to req, return 'next' for
* the caller to free
......@@ -863,16 +838,11 @@ struct request *attempt_front_merge(struct request_queue *q, struct request *rq)
int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
struct request *next)
{
struct elevator_queue *e = q->elevator;
struct request *free;
if (!e->uses_mq && e->type->ops.sq.elevator_allow_rq_merge_fn)
if (!e->type->ops.sq.elevator_allow_rq_merge_fn(q, rq, next))
return 0;
free = attempt_merge(q, rq, next);
if (free) {
__blk_put_request(q, free);
blk_put_request(free);
return 1;
}
......@@ -891,8 +861,8 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
if (bio_data_dir(bio) != rq_data_dir(rq))
return false;
/* must be same device and not a special request */
if (rq->rq_disk != bio->bi_disk || req_no_special_merge(rq))
/* must be same device */
if (rq->rq_disk != bio->bi_disk)
return false;
/* only merge integrity protected bio into ditto rq */
......@@ -911,6 +881,9 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
if (rq->write_hint != bio->bi_write_hint)
return false;
if (rq->ioprio != bio_prio(bio))
return false;
return true;
}
......
此差异已折叠。
此差异已折叠。
......@@ -31,6 +31,10 @@ void blk_mq_debugfs_unregister_sched(struct request_queue *q);
int blk_mq_debugfs_register_sched_hctx(struct request_queue *q,
struct blk_mq_hw_ctx *hctx);
void blk_mq_debugfs_unregister_sched_hctx(struct blk_mq_hw_ctx *hctx);
int blk_mq_debugfs_register_rqos(struct rq_qos *rqos);
void blk_mq_debugfs_unregister_rqos(struct rq_qos *rqos);
void blk_mq_debugfs_unregister_queue_rqos(struct request_queue *q);
#else
static inline int blk_mq_debugfs_register(struct request_queue *q)
{
......@@ -78,6 +82,19 @@ static inline int blk_mq_debugfs_register_sched_hctx(struct request_queue *q,
static inline void blk_mq_debugfs_unregister_sched_hctx(struct blk_mq_hw_ctx *hctx)
{
}
static inline int blk_mq_debugfs_register_rqos(struct rq_qos *rqos)
{
return 0;
}
static inline void blk_mq_debugfs_unregister_rqos(struct rq_qos *rqos)
{
}
static inline void blk_mq_debugfs_unregister_queue_rqos(struct request_queue *q)
{
}
#endif
#ifdef CONFIG_BLK_DEBUG_FS_ZONED
......
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
......@@ -145,6 +145,11 @@ static inline void blk_stat_activate_nsecs(struct blk_stat_callback *cb,
mod_timer(&cb->timer, jiffies + nsecs_to_jiffies(nsecs));
}
static inline void blk_stat_deactivate(struct blk_stat_callback *cb)
{
del_timer_sync(&cb->timer);
}
/**
* blk_stat_activate_msecs() - Gather block statistics during a time window in
* milliseconds.
......
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册