Commit d3c81080 authored by Linus Torvalds

Merge tag 'for-5.17/block-2022-01-11' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - Unify where the struct request handling code is located in the blk-mq
   code (Christoph)

 - Header cleanups (Christoph)

 - Clean up the io_context handling code (Christoph, me)

 - Get rid of ->rq_disk in struct request (Christoph)

 - Error handling fix for add_disk() (Christoph)

 - request allocation cleanups (Christoph)

 - Documentation updates (Eric, Matthew)

 - Remove trivial crypto unregister helper (Eric)

 - Reduce shared tag overhead (John)

 - Reduce poll_stats memory overhead (me)

 - Known indirect function call for dio (me)

 - Use atomic references for struct request (me)

 - Support request list issue for block and NVMe (me)

 - Improve queue dispatch pinning (Ming)

 - Improve the direct list issue code (Keith)

 - BFQ improvements (Jan)

 - Direct completion helper and use it in mmc block (Sebastian)

 - Use raw spinlock for the blktrace code (Wander)

 - fsync error handling fix (Ye)

 - Various fixes and cleanups (Lukas, Randy, Yang, Tetsuo, Ming, me)

* tag 'for-5.17/block-2022-01-11' of git://git.kernel.dk/linux-block: (132 commits)
  MAINTAINERS: add entries for block layer documentation
  docs: block: remove queue-sysfs.rst
  docs: sysfs-block: document virt_boundary_mask
  docs: sysfs-block: document stable_writes
  docs: sysfs-block: fill in missing documentation from queue-sysfs.rst
  docs: sysfs-block: add contact for nomerges
  docs: sysfs-block: sort alphabetically
  docs: sysfs-block: move to stable directory
  block: don't protect submit_bio_checks by q_usage_counter
  block: fix old-style declaration
  nvme-pci: fix queue_rqs list splitting
  block: introduce rq_list_move
  block: introduce rq_list_for_each_safe macro
  block: move rq_list macros to blk-mq.h
  block: drop needless assignment in set_task_ioprio()
  block: remove unnecessary trailing '\'
  bio.h: fix kernel-doc warnings
  block: check minor range in device_add_disk()
  block: use "unsigned long" for blk_validate_block_size().
  block: fix error unwinding in device_add_disk
  ...
......@@ -20,7 +20,6 @@ Block
kyber-iosched
null_blk
pr
queue-sysfs
request
stat
switching-sched
......
=================
Queue sysfs files
=================
This text file will detail the queue files that are located in the sysfs tree
for each block device. Note that stacked devices typically do not export
any settings, since their queue merely functions as a remapping target.
These files are the ones found in the /sys/block/xxx/queue/ directory.
Files denoted with an RO postfix are read-only, while files with an RW
postfix are read-write.
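As a minimal user-space sketch (the device name sda is only an example), any of
these attributes can be read like an ordinary file::

   #include <stdio.h>

   int main(void)
   {
           /* Assumed device; any entry under /sys/block works the same way. */
           FILE *f = fopen("/sys/block/sda/queue/logical_block_size", "r");
           char buf[64];

           if (!f)
                   return 1;
           if (fgets(buf, sizeof(buf), f))
                   printf("logical_block_size: %s", buf);
           fclose(f);
           return 0;
   }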
add_random (RW)
---------------
This file allows one to turn off the disk entropy contribution. The default
value of this file is '1' (on).
chunk_sectors (RO)
------------------
This has different meaning depending on the type of the block device.
For a RAID device (dm-raid), chunk_sectors indicates the size in 512B sectors
of the RAID volume stripe segment. For a zoned block device, either host-aware
or host-managed, chunk_sectors indicates the size in 512B sectors of the zones
of the device, with the possible exception of the last zone of the device,
which may be smaller.
dax (RO)
--------
This file indicates whether the device supports Direct Access (DAX),
used by CPU-addressable storage to bypass the pagecache. It shows '1'
if true, '0' if not.
discard_granularity (RO)
------------------------
This shows the size of internal allocation of the device in bytes, if
reported by the device. A value of '0' means the device does not support
the discard functionality.
discard_max_hw_bytes (RO)
-------------------------
Devices that support discard functionality may have internal limits on
the number of bytes that can be trimmed or unmapped in a single operation.
The `discard_max_hw_bytes` parameter is set by the device driver to the
maximum number of bytes that can be discarded in a single operation.
Discard requests issued to the device must not exceed this limit.
A `discard_max_hw_bytes` value of 0 means that the device does not support
discard functionality.
discard_max_bytes (RW)
----------------------
While discard_max_hw_bytes is the hardware limit for the device, this
setting is the software limit. Some devices exhibit large latencies when
large discards are issued; setting this value lower will make Linux issue
smaller discards and potentially help reduce the latencies induced by large
discard operations.
discard_zeroes_data (RO)
------------------------
Obsolete. Always zero.
fua (RO)
--------
Whether or not the block driver supports the FUA flag for write requests.
FUA stands for Force Unit Access. If the FUA flag is set, write requests
must bypass the volatile cache of the storage device.
hw_sector_size (RO)
-------------------
This is the hardware sector size of the device, in bytes.
io_poll (RW)
------------
When read, this file shows whether polling is enabled (1) or disabled
(0). Writing '0' to this file will disable polling for this device.
Writing any non-zero value will enable this feature.
io_poll_delay (RW)
------------------
If polling is enabled, this controls what kind of polling will be
performed. It defaults to -1, which is classic polling. In this mode,
the CPU will repeatedly ask for completions without giving up any time.
If set to 0, a hybrid polling mode is used, where the kernel will attempt
to make an educated guess at when the IO will complete. Based on this
guess, the kernel will put the process issuing IO to sleep for an amount
of time, before entering a classic poll loop. This mode might be a
little slower than pure classic polling, but it will be more efficient.
If set to a value larger than 0, the kernel will put the process issuing
IO to sleep for this number of microseconds before entering classic
polling.
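For illustration only, a small sketch (device name sda assumed; requires
sufficient privileges) that enables polling and selects hybrid polling by
writing to these two files::

   #include <stdio.h>

   /* Write a string value to a queue sysfs attribute of the assumed device. */
   static int write_attr(const char *attr, const char *val)
   {
           char path[128];
           FILE *f;

           snprintf(path, sizeof(path), "/sys/block/sda/queue/%s", attr);
           f = fopen(path, "w");
           if (!f)
                   return -1;
           fputs(val, f);
           return fclose(f);
   }

   int main(void)
   {
           write_attr("io_poll", "1");       /* enable polling */
           write_attr("io_poll_delay", "0"); /* use hybrid polling */
           return 0;
   }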
io_timeout (RW)
---------------
io_timeout is the request timeout in milliseconds. If a request does not
complete in this time then the block driver timeout handler is invoked.
That timeout handler can decide to retry the request, to fail it or to start
a device recovery strategy.
iostats (RW)
-------------
This file is used to control (on/off) the iostats accounting of the
disk.
logical_block_size (RO)
-----------------------
This is the logical block size of the device, in bytes.
max_discard_segments (RO)
-------------------------
The maximum number of DMA scatter/gather entries in a discard request.
max_hw_sectors_kb (RO)
----------------------
This is the maximum number of kilobytes supported in a single data transfer.
max_integrity_segments (RO)
---------------------------
Maximum number of elements in a DMA scatter/gather list with integrity
data that will be submitted by the block layer core to the associated
block driver.
max_active_zones (RO)
---------------------
For zoned block devices (zoned attribute indicating "host-managed" or
"host-aware"), the sum of zones belonging to any of the zone states:
EXPLICIT OPEN, IMPLICIT OPEN or CLOSED, is limited by this value.
If this value is 0, there is no limit.
If the host attempts to exceed this limit, the driver should report this error
with BLK_STS_ZONE_ACTIVE_RESOURCE, which user space may see as the EOVERFLOW
errno.
max_open_zones (RO)
-------------------
For zoned block devices (zoned attribute indicating "host-managed" or
"host-aware"), the sum of zones belonging to any of the zone states:
EXPLICIT OPEN or IMPLICIT OPEN, is limited by this value.
If this value is 0, there is no limit.
If the host attempts to exceed this limit, the driver should report this error
with BLK_STS_ZONE_OPEN_RESOURCE, which user space may see as the ETOOMANYREFS
errno.
max_sectors_kb (RW)
-------------------
This is the maximum number of kilobytes that the block layer will allow
for a filesystem request. Must be smaller than or equal to the maximum
size allowed by the hardware.
max_segments (RO)
-----------------
Maximum number of elements in a DMA scatter/gather list that is submitted
to the associated block driver.
max_segment_size (RO)
---------------------
Maximum size in bytes of a single element in a DMA scatter/gather list.
minimum_io_size (RO)
--------------------
This is the smallest preferred IO size reported by the device.
nomerges (RW)
-------------
This enables the user to disable the lookup logic involved with IO request
merging in the block layer. By default (0) all merges are
enabled. When set to 1 only simple one-hit merges will be tried. When
set to 2 no merge algorithms will be tried (including one-hit or more
complex tree/hash lookups).
nr_requests (RW)
----------------
This controls how many requests may be allocated in the block layer for
read or write requests. Note that the total allocated number may be twice
this amount, since it applies only to reads or writes (not the accumulated
sum).
To avoid priority inversion through request starvation, a request
queue maintains a separate request pool for each cgroup when
CONFIG_BLK_CGROUP is enabled, and this parameter applies to each such
per-block-cgroup request pool. IOW, if there are N block cgroups,
each request queue may have up to N request pools, each independently
regulated by nr_requests.
nr_zones (RO)
-------------
For zoned block devices (zoned attribute indicating "host-managed" or
"host-aware"), this indicates the total number of zones of the device.
This is always 0 for regular block devices.
optimal_io_size (RO)
--------------------
This is the optimal IO size reported by the device.
physical_block_size (RO)
------------------------
This is the physical block size of the device, in bytes.
read_ahead_kb (RW)
------------------
Maximum number of kilobytes to read-ahead for filesystems on this block
device.
rotational (RW)
---------------
This file is used to indicate whether the device is of rotational or
non-rotational type.
rq_affinity (RW)
----------------
If this option is '1', the block layer will migrate request completions to the
cpu "group" that originally submitted the request. For some workloads this
provides a significant reduction in CPU cycles due to caching effects.
For storage configurations that need to maximize the distribution of completion
processing, setting this option to '2' forces the completion to run on the
requesting cpu (bypassing the "group" aggregation logic).
scheduler (RW)
--------------
When read, this file will display the current and available IO schedulers
for this block device. The currently active IO scheduler will be enclosed
in [] brackets. Writing an IO scheduler name to this file will switch
control of this block device to that new IO scheduler. Note that writing
an IO scheduler name to this file will attempt to load that IO scheduler
module, if it isn't already present in the system.
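As an illustrative sketch (device name sda assumed), the active scheduler can
be queried and then switched to bfq with plain file I/O; per the description
above, the write also attempts to load the scheduler module if needed::

   #include <stdio.h>

   int main(void)
   {
           const char *path = "/sys/block/sda/queue/scheduler"; /* assumed */
           char line[256];
           FILE *f = fopen(path, "r");

           if (f && fgets(line, sizeof(line), f))
                   printf("schedulers: %s", line); /* active one is in [] */
           if (f)
                   fclose(f);

           f = fopen(path, "w");
           if (!f)
                   return 1;
           fputs("bfq", f); /* switch this queue to the bfq scheduler */
           fclose(f);
           return 0;
   }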
write_cache (RW)
----------------
When read, this file will display whether the device has write back
caching enabled or not. It will return "write back" for the former
case, and "write through" for the latter. Writing to this file can
change the kernel's view of the device, but it doesn't alter the
device state. This means that it might not be safe to toggle the
setting from "write back" to "write through", since that will also
eliminate cache flushes issued by the kernel.
write_same_max_bytes (RO)
-------------------------
This is the number of bytes the device can write in a single write-same
command. A value of '0' means write-same is not supported by this
device.
wbt_lat_usec (RW)
-----------------
If the device is registered for writeback throttling, then this file shows
the target minimum read latency. If this latency is exceeded in a given
window of time (see wb_window_usec), then the writeback throttling will start
scaling back writes. Writing a value of '0' to this file disables the
feature. Writing a value of '-1' to this file resets the value to the
default setting.
throttle_sample_time (RW)
-------------------------
This is the time window in which blk-throttle samples data, in milliseconds.
blk-throttle makes decisions based on these samples. A lower time means cgroups
have smoother throughput, but higher CPU overhead. This exists only when
CONFIG_BLK_DEV_THROTTLING_LOW is enabled.
write_zeroes_max_bytes (RO)
---------------------------
For block drivers that support REQ_OP_WRITE_ZEROES, the maximum number of
bytes that can be zeroed at once. The value 0 means that REQ_OP_WRITE_ZEROES
is not supported.
zone_append_max_bytes (RO)
--------------------------
This is the maximum number of bytes that can be written to a sequential
zone of a zoned block device using a zone append write operation
(REQ_OP_ZONE_APPEND). This value is always 0 for regular block devices.
zoned (RO)
----------
This indicates if the device is a zoned block device and the zone model of the
device if it is indeed zoned. The possible values indicated by zoned are
"none" for regular block devices and "host-aware" or "host-managed" for zoned
block devices. The characteristics of host-aware and host-managed zoned block
devices are described in the ZBC (Zoned Block Commands) and ZAC
(Zoned Device ATA Command Set) standards. These standards also define the
"drive-managed" zone model. However, since drive-managed zoned block devices
do not support zone commands, they will be treated as regular block devices
and zoned will report "none".
zone_write_granularity (RO)
---------------------------
This indicates the alignment constraint, in bytes, for write operations in
sequential zones of zoned block devices (devices with a zoned attribute
that reports "host-managed" or "host-aware"). This value is always 0 for
regular block devices.
independent_access_ranges (RO)
------------------------------
The presence of this sub-directory of the /sys/block/xxx/queue/ directory
indicates that the device is capable of executing requests targeting
different sector ranges in parallel. For instance, single LUN multi-actuator
hard-disks will have an independent_access_ranges directory if the device
correctly advertises the sector ranges of its actuators.
The independent_access_ranges directory contains one directory per access
range, with each range described using the sector (RO) attribute file to
indicate the first sector of the range and the nr_sectors (RO) attribute file
to indicate the total number of sectors in the range starting from the first
sector of the range. For example, a dual-actuator hard-disk will have the
following independent_access_ranges entries::

   $ tree /sys/block/<device>/queue/independent_access_ranges/
   /sys/block/<device>/queue/independent_access_ranges/
   |-- 0
   |   |-- nr_sectors
   |   `-- sector
   `-- 1
       |-- nr_sectors
       `-- sector
The sector and nr_sectors attributes use 512B sector unit, regardless of
the actual block size of the device. Independent access ranges do not
overlap and include all sectors within the device capacity. The access
ranges are numbered in increasing order of the range start sector,
that is, the sector attribute of range 0 always has the value 0.
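As a rough sketch under the same assumptions (device name sda is an example),
the ranges can also be enumerated from user space with ordinary directory and
file reads::

   #include <dirent.h>
   #include <stdio.h>

   static long read_range_attr(const char *range, const char *attr)
   {
           char path[256];
           long v = -1;
           FILE *f;

           snprintf(path, sizeof(path),
                    "/sys/block/sda/queue/independent_access_ranges/%s/%s",
                    range, attr);
           f = fopen(path, "r");
           if (f) {
                   if (fscanf(f, "%ld", &v) != 1)
                           v = -1;
                   fclose(f);
           }
           return v;
   }

   int main(void)
   {
           DIR *d = opendir("/sys/block/sda/queue/independent_access_ranges");
           struct dirent *e;

           if (!d)
                   return 1;
           while ((e = readdir(d)) != NULL) {
                   if (e->d_name[0] == '.')
                           continue;
                   printf("range %s: sector %ld, nr_sectors %ld\n", e->d_name,
                          read_range_attr(e->d_name, "sector"),
                          read_range_attr(e->d_name, "nr_sectors"));
           }
           closedir(d);
           return 0;
   }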
Jens Axboe <jens.axboe@oracle.com>, February 2009
......@@ -294,9 +294,6 @@ Block Devices
.. kernel-doc:: block/blk-settings.c
:export:
.. kernel-doc:: block/blk-exec.c
:export:
.. kernel-doc:: block/blk-flush.c
:export:
......
......@@ -292,8 +292,6 @@ block/blk-sysfs.c
block/blk-settings.c
block/blk-exec.c
block/blk-flush.c
block/blk-lib.c
......
......@@ -3416,6 +3416,8 @@ M: Jens Axboe <axboe@kernel.dk>
L: linux-block@vger.kernel.org
S: Maintained
T: git git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git
F: Documentation/ABI/stable/sysfs-block
F: Documentation/block/
F: block/
F: drivers/block/
F: include/linux/blk*
......
......@@ -35,6 +35,9 @@ config BLK_CGROUP_RWSTAT
config BLK_DEV_BSG_COMMON
tristate
config BLK_ICQ
bool
config BLK_DEV_BSGLIB
bool "Block layer SG support v4 helper lib"
select BLK_DEV_BSG_COMMON
......
......@@ -18,6 +18,7 @@ config MQ_IOSCHED_KYBER
config IOSCHED_BFQ
tristate "BFQ I/O scheduler"
select BLK_ICQ
help
BFQ I/O scheduler for BLK-MQ. BFQ distributes the bandwidth of
the device among all processes according to their weights,
......
......@@ -5,7 +5,7 @@
obj-y := bdev.o fops.o bio.o elevator.o blk-core.o blk-sysfs.o \
blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
blk-exec.o blk-merge.o blk-timeout.o \
blk-merge.o blk-timeout.o \
blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \
genhd.o ioprio.o badblocks.o partitions/ blk-rq-qos.o \
......
......@@ -665,7 +665,7 @@ static void blkdev_flush_mapping(struct block_device *bdev)
static int blkdev_get_whole(struct block_device *bdev, fmode_t mode)
{
struct gendisk *disk = bdev->bd_disk;
int ret = 0;
int ret;
if (disk->fops->open) {
ret = disk->fops->open(bdev, mode);
......@@ -750,14 +750,6 @@ struct block_device *blkdev_get_no_open(dev_t dev)
if (!kobject_get_unless_zero(&bdev->bd_device.kobj))
bdev = NULL;
iput(inode);
if (!bdev)
return NULL;
if ((bdev->bd_disk->flags & GENHD_FL_HIDDEN)) {
put_device(&bdev->bd_device);
return NULL;
}
return bdev;
}
......@@ -837,7 +829,7 @@ struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode, void *holder)
* used in blkdev_get/put().
*/
if ((mode & FMODE_WRITE) && !bdev->bd_write_holder &&
(disk->flags & GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE)) {
(disk->event_flags & DISK_EVENT_FLAG_BLOCK_ON_EXCL_WRITE)) {
bdev->bd_write_holder = true;
unblock_events = false;
}
......@@ -963,15 +955,15 @@ void blkdev_put(struct block_device *bdev, fmode_t mode)
EXPORT_SYMBOL(blkdev_put);
/**
* lookup_bdev - lookup a struct block_device by name
* @pathname: special file representing the block device
* @dev: return value of the block device's dev_t
* lookup_bdev() - Look up a struct block_device by name.
* @pathname: Name of the block device in the filesystem.
* @dev: Pointer to the block device's dev_t, if found.
*
* Lookup the block device's dev_t at @pathname in the current
* namespace if possible and return it by @dev.
* namespace if possible and return it in @dev.
*
* RETURNS:
* 0 if succeeded, errno otherwise.
* Context: May sleep.
* Return: 0 if succeeded, negative errno otherwise.
*/
int lookup_bdev(const char *pathname, dev_t *dev)
{
......
......@@ -433,26 +433,21 @@ static struct bfq_io_cq *icq_to_bic(struct io_cq *icq)
/**
* bfq_bic_lookup - search into @ioc a bic associated to @bfqd.
* @bfqd: the lookup key.
* @ioc: the io_context of the process doing I/O.
* @q: the request queue.
*/
static struct bfq_io_cq *bfq_bic_lookup(struct bfq_data *bfqd,
struct io_context *ioc,
struct request_queue *q)
static struct bfq_io_cq *bfq_bic_lookup(struct request_queue *q)
{
if (ioc) {
unsigned long flags;
struct bfq_io_cq *icq;
struct bfq_io_cq *icq;
unsigned long flags;
spin_lock_irqsave(&q->queue_lock, flags);
icq = icq_to_bic(ioc_lookup_icq(ioc, q));
spin_unlock_irqrestore(&q->queue_lock, flags);
if (!current->io_context)
return NULL;
return icq;
}
spin_lock_irqsave(&q->queue_lock, flags);
icq = icq_to_bic(ioc_lookup_icq(q));
spin_unlock_irqrestore(&q->queue_lock, flags);
return NULL;
return icq;
}
/*
......@@ -565,26 +560,134 @@ static struct request *bfq_choose_req(struct bfq_data *bfqd,
}
}
#define BFQ_LIMIT_INLINE_DEPTH 16
#ifdef CONFIG_BFQ_GROUP_IOSCHED
static bool bfqq_request_over_limit(struct bfq_queue *bfqq, int limit)
{
struct bfq_data *bfqd = bfqq->bfqd;
struct bfq_entity *entity = &bfqq->entity;
struct bfq_entity *inline_entities[BFQ_LIMIT_INLINE_DEPTH];
struct bfq_entity **entities = inline_entities;
int depth, level;
int class_idx = bfqq->ioprio_class - 1;
struct bfq_sched_data *sched_data;
unsigned long wsum;
bool ret = false;
if (!entity->on_st_or_in_serv)
return false;
/* +1 for bfqq entity, root cgroup not included */
depth = bfqg_to_blkg(bfqq_group(bfqq))->blkcg->css.cgroup->level + 1;
if (depth > BFQ_LIMIT_INLINE_DEPTH) {
entities = kmalloc_array(depth, sizeof(*entities), GFP_NOIO);
if (!entities)
return false;
}
spin_lock_irq(&bfqd->lock);
sched_data = entity->sched_data;
/* Gather our ancestors as we need to traverse them in reverse order */
level = 0;
for_each_entity(entity) {
/*
* If at some level entity is not even active, allow request
* queueing so that BFQ knows there's work to do and activate
* entities.
*/
if (!entity->on_st_or_in_serv)
goto out;
/* Uh, more parents than cgroup subsystem thinks? */
if (WARN_ON_ONCE(level >= depth))
break;
entities[level++] = entity;
}
WARN_ON_ONCE(level != depth);
for (level--; level >= 0; level--) {
entity = entities[level];
if (level > 0) {
wsum = bfq_entity_service_tree(entity)->wsum;
} else {
int i;
/*
* For bfqq itself we take into account service trees
* of all higher priority classes and multiply their
* weights so that low prio queue from higher class
* gets more requests than high prio queue from lower
* class.
*/
wsum = 0;
for (i = 0; i <= class_idx; i++) {
wsum = wsum * IOPRIO_BE_NR +
sched_data->service_tree[i].wsum;
}
}
limit = DIV_ROUND_CLOSEST(limit * entity->weight, wsum);
if (entity->allocated >= limit) {
bfq_log_bfqq(bfqq->bfqd, bfqq,
"too many requests: allocated %d limit %d level %d",
entity->allocated, limit, level);
ret = true;
break;
}
}
out:
spin_unlock_irq(&bfqd->lock);
if (entities != inline_entities)
kfree(entities);
return ret;
}
#else
static bool bfqq_request_over_limit(struct bfq_queue *bfqq, int limit)
{
return false;
}
#endif
/*
* Async I/O can easily starve sync I/O (both sync reads and sync
* writes), by consuming all tags. Similarly, storms of sync writes,
* such as those that sync(2) may trigger, can starve sync reads.
* Limit depths of async I/O and sync writes so as to counter both
* problems.
*
* Also if a bfq queue or its parent cgroup consume more tags than would be
* appropriate for their weight, we trim the available tag depth to 1. This
* avoids a situation where one cgroup can starve another cgroup from tags and
* thus block service differentiation among cgroups. Note that because the
* queue / cgroup already has many requests allocated and queued, this does not
* significantly affect service guarantees coming from the BFQ scheduling
* algorithm.
*/
static void bfq_limit_depth(unsigned int op, struct blk_mq_alloc_data *data)
{
struct bfq_data *bfqd = data->q->elevator->elevator_data;
struct bfq_io_cq *bic = bfq_bic_lookup(data->q);
struct bfq_queue *bfqq = bic ? bic_to_bfqq(bic, op_is_sync(op)) : NULL;
int depth;
unsigned limit = data->q->nr_requests;
/* Sync reads have full depth available */
if (op_is_sync(op) && !op_is_write(op)) {
depth = 0;
} else {
depth = bfqd->word_depths[!!bfqd->wr_busy_queues][op_is_sync(op)];
limit = (limit * depth) >> bfqd->full_depth_shift;
}
if (op_is_sync(op) && !op_is_write(op))
return;
data->shallow_depth =
bfqd->word_depths[!!bfqd->wr_busy_queues][op_is_sync(op)];
/*
* Does queue (or any parent entity) exceed number of requests that
* should be available to it? Heavily limit depth so that it cannot
* consume more available requests and thus starve other entities.
*/
if (bfqq && bfqq_request_over_limit(bfqq, limit))
depth = 1;
bfq_log(bfqd, "[%s] wr_busy %d sync %d depth %u",
__func__, bfqd->wr_busy_queues, op_is_sync(op),
data->shallow_depth);
__func__, bfqd->wr_busy_queues, op_is_sync(op), depth);
if (depth)
data->shallow_depth = depth;
}
static struct bfq_queue *
......@@ -1113,7 +1216,8 @@ bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_data *bfqd,
static int bfqq_process_refs(struct bfq_queue *bfqq)
{
return bfqq->ref - bfqq->allocated - bfqq->entity.on_st_or_in_serv -
return bfqq->ref - bfqq->entity.allocated -
bfqq->entity.on_st_or_in_serv -
(bfqq->weight_counter != NULL) - bfqq->stable_ref;
}
......@@ -1982,20 +2086,19 @@ static void bfq_update_io_intensity(struct bfq_queue *bfqq, u64 now_ns)
* aspect, see the comments on the choice of the queue for injection
* in bfq_select_queue().
*
* Turning back to the detection of a waker queue, a queue Q is deemed
* as a waker queue for bfqq if, for three consecutive times, bfqq
* happens to become non empty right after a request of Q has been
* completed. In this respect, even if bfqq is empty, we do not check
* for a waker if it still has some in-flight I/O. In fact, in this
* case bfqq is actually still being served by the drive, and may
* receive new I/O on the completion of some of the in-flight
* requests. In particular, on the first time, Q is tentatively set as
* a candidate waker queue, while on the third consecutive time that Q
* is detected, the field waker_bfqq is set to Q, to confirm that Q is
* a waker queue for bfqq. These detection steps are performed only if
* bfqq has a long think time, so as to make it more likely that
* bfqq's I/O is actually being blocked by a synchronization. This
* last filter, plus the above three-times requirement, make false
* Turning back to the detection of a waker queue, a queue Q is deemed as a
* waker queue for bfqq if, for three consecutive times, bfqq happens to become
* non empty right after a request of Q has been completed within given
* timeout. In this respect, even if bfqq is empty, we do not check for a waker
* if it still has some in-flight I/O. In fact, in this case bfqq is actually
* still being served by the drive, and may receive new I/O on the completion
* of some of the in-flight requests. In particular, on the first time, Q is
* tentatively set as a candidate waker queue, while on the third consecutive
* time that Q is detected, the field waker_bfqq is set to Q, to confirm that Q
* is a waker queue for bfqq. These detection steps are performed only if bfqq
* has a long think time, so as to make it more likely that bfqq's I/O is
* actually being blocked by a synchronization. This last filter, plus the
* above three-times requirement and time limit for detection, make false
* positives less likely.
*
* NOTE
......@@ -2019,6 +2122,8 @@ static void bfq_update_io_intensity(struct bfq_queue *bfqq, u64 now_ns)
static void bfq_check_waker(struct bfq_data *bfqd, struct bfq_queue *bfqq,
u64 now_ns)
{
char waker_name[MAX_BFQQ_NAME_LENGTH];
if (!bfqd->last_completed_rq_bfqq ||
bfqd->last_completed_rq_bfqq == bfqq ||
bfq_bfqq_has_short_ttime(bfqq) ||
......@@ -2027,8 +2132,16 @@ static void bfq_check_waker(struct bfq_data *bfqd, struct bfq_queue *bfqq,
bfqd->last_completed_rq_bfqq == bfqq->waker_bfqq)
return;
/*
* We reset waker detection logic also if too much time has passed
* since the first detection. If wakeups are rare, pointless idling
* doesn't hurt throughput that much. The condition below makes sure
* we do not uselessly idle blocking waker in more than 1/64 cases.
*/
if (bfqd->last_completed_rq_bfqq !=
bfqq->tentative_waker_bfqq) {
bfqq->tentative_waker_bfqq ||
now_ns > bfqq->waker_detection_started +
128 * (u64)bfqd->bfq_slice_idle) {
/*
* First synchronization detected with a
* candidate waker queue, or with a different
......@@ -2037,12 +2150,19 @@ static void bfq_check_waker(struct bfq_data *bfqd, struct bfq_queue *bfqq,
bfqq->tentative_waker_bfqq =
bfqd->last_completed_rq_bfqq;
bfqq->num_waker_detections = 1;
bfqq->waker_detection_started = now_ns;
bfq_bfqq_name(bfqq->tentative_waker_bfqq, waker_name,
MAX_BFQQ_NAME_LENGTH);
bfq_log_bfqq(bfqd, bfqq, "set tenative waker %s", waker_name);
} else /* Same tentative waker queue detected again */
bfqq->num_waker_detections++;
if (bfqq->num_waker_detections == 3) {
bfqq->waker_bfqq = bfqd->last_completed_rq_bfqq;
bfqq->tentative_waker_bfqq = NULL;
bfq_bfqq_name(bfqq->waker_bfqq, waker_name,
MAX_BFQQ_NAME_LENGTH);
bfq_log_bfqq(bfqd, bfqq, "set waker %s", waker_name);
/*
* If the waker queue disappears, then
......@@ -2332,7 +2452,7 @@ static bool bfq_bio_merge(struct request_queue *q, struct bio *bio,
* returned by bfq_bic_lookup does not go away before
* bfqd->lock is taken.
*/
struct bfq_io_cq *bic = bfq_bic_lookup(bfqd, current->io_context, q);
struct bfq_io_cq *bic = bfq_bic_lookup(q);
bool ret;
spin_lock_irq(&bfqd->lock);
......@@ -5878,6 +5998,22 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
}
}
static void bfqq_request_allocated(struct bfq_queue *bfqq)
{
struct bfq_entity *entity = &bfqq->entity;
for_each_entity(entity)
entity->allocated++;
}
static void bfqq_request_freed(struct bfq_queue *bfqq)
{
struct bfq_entity *entity = &bfqq->entity;
for_each_entity(entity)
entity->allocated--;
}
/* returns true if it causes the idle timer to be disabled */
static bool __bfq_insert_request(struct bfq_data *bfqd, struct request *rq)
{
......@@ -5891,8 +6027,8 @@ static bool __bfq_insert_request(struct bfq_data *bfqd, struct request *rq)
* Release the request's reference to the old bfqq
* and make sure one is taken to the shared queue.
*/
new_bfqq->allocated++;
bfqq->allocated--;
bfqq_request_allocated(new_bfqq);
bfqq_request_freed(bfqq);
new_bfqq->ref++;
/*
* If the bic associated with the process
......@@ -5991,48 +6127,7 @@ static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
spin_lock_irq(&bfqd->lock);
bfqq = bfq_init_rq(rq);
/*
* Reqs with at_head or passthrough flags set are to be put
* directly into dispatch list. Additional case for putting rq
* directly into the dispatch queue: the only active
* bfq_queues are bfqq and either its waker bfq_queue or one
* of its woken bfq_queues. The rationale behind this
* additional condition is as follows:
* - consider a bfq_queue, say Q1, detected as a waker of
* another bfq_queue, say Q2
* - by definition of a waker, Q1 blocks the I/O of Q2, i.e.,
* some I/O of Q1 needs to be completed for new I/O of Q2
* to arrive. A notable example of waker is journald
* - so, Q1 and Q2 are in any respect the queues of two
* cooperating processes (or of two cooperating sets of
* processes): the goal of Q1's I/O is doing what needs to
* be done so that new Q2's I/O can finally be
* issued. Therefore, if the service of Q1's I/O is delayed,
* then Q2's I/O is delayed too. Conversely, if Q2's I/O is
* delayed, the goal of Q1's I/O is hindered.
* - as a consequence, if some I/O of Q1/Q2 arrives while
* Q2/Q1 is the only queue in service, there is absolutely
* no point in delaying the service of such an I/O. The
* only possible result is a throughput loss
* - so, when the above condition holds, the best option is to
* have the new I/O dispatched as soon as possible
* - the most effective and efficient way to attain the above
* goal is to put the new I/O directly in the dispatch
* list
* - as an additional restriction, Q1 and Q2 must be the only
* busy queues for this commit to put the I/O of Q2/Q1 in
* the dispatch list. This is necessary, because, if also
* other queues are waiting for service, then putting new
* I/O directly in the dispatch list may evidently cause a
* violation of service guarantees for the other queues
*/
if (!bfqq ||
(bfqq != bfqd->in_service_queue &&
bfqd->in_service_queue != NULL &&
bfq_tot_busy_queues(bfqd) == 1 + bfq_bfqq_busy(bfqq) &&
(bfqq->waker_bfqq == bfqd->in_service_queue ||
bfqd->in_service_queue->waker_bfqq == bfqq)) || at_head) {
if (!bfqq || at_head) {
if (at_head)
list_add(&rq->queuelist, &bfqd->dispatch);
else
......@@ -6059,7 +6154,6 @@ static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
* merge).
*/
cmd_flags = rq->cmd_flags;
spin_unlock_irq(&bfqd->lock);
bfq_update_insert_stats(q, bfqq, idle_timer_disabled,
......@@ -6251,8 +6345,7 @@ static void bfq_completed_request(struct bfq_queue *bfqq, struct bfq_data *bfqd)
static void bfq_finish_requeue_request_body(struct bfq_queue *bfqq)
{
bfqq->allocated--;
bfqq_request_freed(bfqq);
bfq_put_queue(bfqq);
}
......@@ -6476,6 +6569,16 @@ static void bfq_finish_requeue_request(struct request *rq)
rq->elv.priv[1] = NULL;
}
static void bfq_finish_request(struct request *rq)
{
bfq_finish_requeue_request(rq);
if (rq->elv.icq) {
put_io_context(rq->elv.icq->ioc);
rq->elv.icq = NULL;
}
}
/*
* Removes the association between the current task and bfqq, assuming
* that bic points to the bfq iocontext of the task.
......@@ -6573,6 +6676,8 @@ static struct bfq_queue *bfq_get_bfqq_handle_split(struct bfq_data *bfqd,
*/
static void bfq_prepare_request(struct request *rq)
{
rq->elv.icq = ioc_find_get_icq(rq->q);
/*
* Regardless of whether we have an icq attached, we have to
* clear the scheduler pointers, as they might point to
......@@ -6672,7 +6777,7 @@ static struct bfq_queue *bfq_init_rq(struct request *rq)
}
}
bfqq->allocated++;
bfqq_request_allocated(bfqq);
bfqq->ref++;
bfq_log_bfqq(bfqd, bfqq, "get_request %p: bfqq %p, %d",
rq, bfqq, bfqq->ref);
......@@ -6835,11 +6940,11 @@ void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg)
* See the comments on bfq_limit_depth for the purpose of
* the depths set in the function. Return minimum shallow depth we'll use.
*/
static unsigned int bfq_update_depths(struct bfq_data *bfqd,
struct sbitmap_queue *bt)
static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
{
unsigned int i, j, min_shallow = UINT_MAX;
unsigned int depth = 1U << bt->sb.shift;
bfqd->full_depth_shift = bt->sb.shift;
/*
* In-word depths if no bfq_queue is being weight-raised:
* leaving 25% of tags only for sync reads.
......@@ -6851,13 +6956,13 @@ static unsigned int bfq_update_depths(struct bfq_data *bfqd,
* limit 'something'.
*/
/* no more than 50% of tags for async I/O */
bfqd->word_depths[0][0] = max((1U << bt->sb.shift) >> 1, 1U);
bfqd->word_depths[0][0] = max(depth >> 1, 1U);
/*
* no more than 75% of tags for sync writes (25% extra tags
* w.r.t. async I/O, to prevent async I/O from starving sync
* writes)
*/
bfqd->word_depths[0][1] = max(((1U << bt->sb.shift) * 3) >> 2, 1U);
bfqd->word_depths[0][1] = max((depth * 3) >> 2, 1U);
/*
* In-word depths in case some bfq_queue is being weight-
......@@ -6867,25 +6972,18 @@ static unsigned int bfq_update_depths(struct bfq_data *bfqd,
* shortage.
*/
/* no more than ~18% of tags for async I/O */
bfqd->word_depths[1][0] = max(((1U << bt->sb.shift) * 3) >> 4, 1U);
bfqd->word_depths[1][0] = max((depth * 3) >> 4, 1U);
/* no more than ~37% of tags for sync writes (~20% extra tags) */
bfqd->word_depths[1][1] = max(((1U << bt->sb.shift) * 6) >> 4, 1U);
for (i = 0; i < 2; i++)
for (j = 0; j < 2; j++)
min_shallow = min(min_shallow, bfqd->word_depths[i][j]);
return min_shallow;
bfqd->word_depths[1][1] = max((depth * 6) >> 4, 1U);
}
static void bfq_depth_updated(struct blk_mq_hw_ctx *hctx)
{
struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
struct blk_mq_tags *tags = hctx->sched_tags;
unsigned int min_shallow;
min_shallow = bfq_update_depths(bfqd, &tags->bitmap_tags);
sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, min_shallow);
bfq_update_depths(bfqd, &tags->bitmap_tags);
sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, 1);
}
static int bfq_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int index)
......@@ -7300,7 +7398,7 @@ static struct elevator_type iosched_bfq_mq = {
.limit_depth = bfq_limit_depth,
.prepare_request = bfq_prepare_request,
.requeue_request = bfq_finish_requeue_request,
.finish_request = bfq_finish_requeue_request,
.finish_request = bfq_finish_request,
.exit_icq = bfq_exit_icq,
.insert_requests = bfq_insert_requests,
.dispatch_request = bfq_dispatch_request,
......
......@@ -25,7 +25,7 @@
#define BFQ_DEFAULT_GRP_IOPRIO 0
#define BFQ_DEFAULT_GRP_CLASS IOPRIO_CLASS_BE
#define MAX_PID_STR_LENGTH 12
#define MAX_BFQQ_NAME_LENGTH 16
/*
* Soft real-time applications are extremely more latency sensitive
......@@ -170,6 +170,9 @@ struct bfq_entity {
/* budget, used also to calculate F_i: F_i = S_i + @budget / @weight */
int budget;
/* Number of requests allocated in the subtree of this entity */
int allocated;
/* device weight, if non-zero, it overrides the default weight of
* bfq_group_data */
int dev_weight;
......@@ -266,8 +269,6 @@ struct bfq_queue {
struct request *next_rq;
/* number of sync and async requests queued */
int queued[2];
/* number of requests currently allocated */
int allocated;
/* number of pending metadata requests */
int meta_pending;
/* fifo list of requests in sort_list */
......@@ -387,6 +388,8 @@ struct bfq_queue {
struct bfq_queue *tentative_waker_bfqq;
/* number of times the same tentative waker has been detected */
unsigned int num_waker_detections;
/* time when we started considering this waker */
u64 waker_detection_started;
/* node for woken_list, see below */
struct hlist_node woken_list_node;
......@@ -768,6 +771,7 @@ struct bfq_data {
* function)
*/
unsigned int word_depths[2][2];
unsigned int full_depth_shift;
};
enum bfqq_state_flags {
......@@ -1079,26 +1083,27 @@ void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq);
/* --------------- end of interface of B-WF2Q+ ---------------- */
/* Logging facilities. */
static inline void bfq_pid_to_str(int pid, char *str, int len)
static inline void bfq_bfqq_name(struct bfq_queue *bfqq, char *str, int len)
{
if (pid != -1)
snprintf(str, len, "%d", pid);
char type = bfq_bfqq_sync(bfqq) ? 'S' : 'A';
if (bfqq->pid != -1)
snprintf(str, len, "bfq%d%c", bfqq->pid, type);
else
snprintf(str, len, "SHARED-");
snprintf(str, len, "bfqSHARED-%c", type);
}
#ifdef CONFIG_BFQ_GROUP_IOSCHED
struct bfq_group *bfqq_group(struct bfq_queue *bfqq);
#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) do { \
char pid_str[MAX_PID_STR_LENGTH]; \
char pid_str[MAX_BFQQ_NAME_LENGTH]; \
if (likely(!blk_trace_note_message_enabled((bfqd)->queue))) \
break; \
bfq_pid_to_str((bfqq)->pid, pid_str, MAX_PID_STR_LENGTH); \
bfq_bfqq_name((bfqq), pid_str, MAX_BFQQ_NAME_LENGTH); \
blk_add_cgroup_trace_msg((bfqd)->queue, \
bfqg_to_blkg(bfqq_group(bfqq))->blkcg, \
"bfq%s%c " fmt, pid_str, \
bfq_bfqq_sync((bfqq)) ? 'S' : 'A', ##args); \
"%s " fmt, pid_str, ##args); \
} while (0)
#define bfq_log_bfqg(bfqd, bfqg, fmt, args...) do { \
......@@ -1109,13 +1114,11 @@ struct bfq_group *bfqq_group(struct bfq_queue *bfqq);
#else /* CONFIG_BFQ_GROUP_IOSCHED */
#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) do { \
char pid_str[MAX_PID_STR_LENGTH]; \
char pid_str[MAX_BFQQ_NAME_LENGTH]; \
if (likely(!blk_trace_note_message_enabled((bfqd)->queue))) \
break; \
bfq_pid_to_str((bfqq)->pid, pid_str, MAX_PID_STR_LENGTH); \
blk_add_trace_msg((bfqd)->queue, "bfq%s%c " fmt, pid_str, \
bfq_bfqq_sync((bfqq)) ? 'S' : 'A', \
##args); \
bfq_bfqq_name((bfqq), pid_str, MAX_BFQQ_NAME_LENGTH); \
blk_add_trace_msg((bfqd)->queue, "%s " fmt, pid_str, ##args); \
} while (0)
#define bfq_log_bfqg(bfqd, bfqg, fmt, args...) do {} while (0)
......
......@@ -26,7 +26,7 @@
#include "blk-rq-qos.h"
struct bio_alloc_cache {
struct bio_list free_list;
struct bio *free_list;
unsigned int nr;
};
......@@ -630,7 +630,8 @@ static void bio_alloc_cache_prune(struct bio_alloc_cache *cache,
unsigned int i = 0;
struct bio *bio;
while ((bio = bio_list_pop(&cache->free_list)) != NULL) {
while ((bio = cache->free_list) != NULL) {
cache->free_list = bio->bi_next;
cache->nr--;
bio_free(bio);
if (++i == nr)
......@@ -689,7 +690,8 @@ void bio_put(struct bio *bio)
bio_uninit(bio);
cache = per_cpu_ptr(bio->bi_pool->cache, get_cpu());
bio_list_add_head(&cache->free_list, bio);
bio->bi_next = cache->free_list;
cache->free_list = bio;
if (++cache->nr > ALLOC_CACHE_MAX + ALLOC_CACHE_SLACK)
bio_alloc_cache_prune(cache, ALLOC_CACHE_SLACK);
put_cpu();
......@@ -1704,8 +1706,9 @@ struct bio *bio_alloc_kiocb(struct kiocb *kiocb, unsigned short nr_vecs,
return bio_alloc_bioset(GFP_KERNEL, nr_vecs, bs);
cache = per_cpu_ptr(bs->cache, get_cpu());
bio = bio_list_pop(&cache->free_list);
if (bio) {
if (cache->free_list) {
bio = cache->free_list;
cache->free_list = bio->bi_next;
cache->nr--;
put_cpu();
bio_init(bio, nr_vecs ? bio->bi_inline_vecs : NULL, nr_vecs);
......
......@@ -30,6 +30,7 @@
#include <linux/blk-cgroup.h>
#include <linux/tracehook.h>
#include <linux/psi.h>
#include <linux/part_stat.h>
#include "blk.h"
#include "blk-ioprio.h"
#include "blk-throttle.h"
......
......@@ -16,7 +16,6 @@
#include <linux/module.h>
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/blk-pm.h>
#include <linux/blk-integrity.h>
#include <linux/highmem.h>
......@@ -40,6 +39,7 @@
#include <linux/debugfs.h>
#include <linux/bpf.h>
#include <linux/psi.h>
#include <linux/part_stat.h>
#include <linux/sched/sysctl.h>
#include <linux/blk-crypto.h>
......@@ -47,7 +47,6 @@
#include <trace/events/block.h>
#include "blk.h"
#include "blk-mq.h"
#include "blk-mq-sched.h"
#include "blk-pm.h"
#include "blk-throttle.h"
......@@ -67,6 +66,7 @@ DEFINE_IDA(blk_queue_ida);
* For queue allocation
*/
struct kmem_cache *blk_requestq_cachep;
struct kmem_cache *blk_requestq_srcu_cachep;
/*
* Controlling structure to kblockd
......@@ -109,23 +109,6 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q)
}
EXPORT_SYMBOL_GPL(blk_queue_flag_test_and_set);
void blk_rq_init(struct request_queue *q, struct request *rq)
{
memset(rq, 0, sizeof(*rq));
INIT_LIST_HEAD(&rq->queuelist);
rq->q = q;
rq->__sector = (sector_t) -1;
INIT_HLIST_NODE(&rq->hash);
RB_CLEAR_NODE(&rq->rb_node);
rq->tag = BLK_MQ_NO_TAG;
rq->internal_tag = BLK_MQ_NO_TAG;
rq->start_time_ns = ktime_get_ns();
rq->part = NULL;
blk_crypto_rq_set_defaults(rq);
}
EXPORT_SYMBOL(blk_rq_init);
#define REQ_OP_NAME(name) [REQ_OP_##name] = #name
static const char *const blk_op_name[] = {
REQ_OP_NAME(READ),
......@@ -216,38 +199,15 @@ int blk_status_to_errno(blk_status_t status)
}
EXPORT_SYMBOL_GPL(blk_status_to_errno);
void blk_print_req_error(struct request *req, blk_status_t status)
const char *blk_status_to_str(blk_status_t status)
{
int idx = (__force int)status;
if (WARN_ON_ONCE(idx >= ARRAY_SIZE(blk_errors)))
return;
printk_ratelimited(KERN_ERR
"%s error, dev %s, sector %llu op 0x%x:(%s) flags 0x%x "
"phys_seg %u prio class %u\n",
blk_errors[idx].name,
req->rq_disk ? req->rq_disk->disk_name : "?",
blk_rq_pos(req), req_op(req), blk_op_str(req_op(req)),
req->cmd_flags & ~REQ_OP_MASK,
req->nr_phys_segments,
IOPRIO_PRIO_CLASS(req->ioprio));
return "<null>";
return blk_errors[idx].name;
}
void blk_dump_rq_flags(struct request *rq, char *msg)
{
printk(KERN_INFO "%s: dev %s: flags=%llx\n", msg,
rq->rq_disk ? rq->rq_disk->disk_name : "?",
(unsigned long long) rq->cmd_flags);
printk(KERN_INFO " sector %llu, nr/cnr %u/%u\n",
(unsigned long long)blk_rq_pos(rq),
blk_rq_sectors(rq), blk_rq_cur_sectors(rq));
printk(KERN_INFO " bio %p, biotail %p, len %u\n",
rq->bio, rq->biotail, blk_rq_bytes(rq));
}
EXPORT_SYMBOL(blk_dump_rq_flags);
/**
* blk_sync_queue - cancel any pending callbacks on a queue
* @q: the queue
......@@ -478,21 +438,27 @@ static void blk_timeout_work(struct work_struct *work)
{
}
struct request_queue *blk_alloc_queue(int node_id)
struct request_queue *blk_alloc_queue(int node_id, bool alloc_srcu)
{
struct request_queue *q;
int ret;
q = kmem_cache_alloc_node(blk_requestq_cachep,
GFP_KERNEL | __GFP_ZERO, node_id);
q = kmem_cache_alloc_node(blk_get_queue_kmem_cache(alloc_srcu),
GFP_KERNEL | __GFP_ZERO, node_id);
if (!q)
return NULL;
if (alloc_srcu) {
blk_queue_flag_set(QUEUE_FLAG_HAS_SRCU, q);
if (init_srcu_struct(q->srcu) != 0)
goto fail_q;
}
q->last_merge = NULL;
q->id = ida_simple_get(&blk_queue_ida, 0, 0, GFP_KERNEL);
if (q->id < 0)
goto fail_q;
goto fail_srcu;
ret = bioset_init(&q->bio_split, BIO_POOL_SIZE, 0, 0);
if (ret)
......@@ -549,8 +515,11 @@ struct request_queue *blk_alloc_queue(int node_id)
bioset_exit(&q->bio_split);
fail_id:
ida_simple_remove(&blk_queue_ida, q->id);
fail_srcu:
if (alloc_srcu)
cleanup_srcu_struct(q->srcu);
fail_q:
kmem_cache_free(blk_requestq_cachep, q);
kmem_cache_free(blk_get_queue_kmem_cache(alloc_srcu), q);
return NULL;
}
......@@ -594,7 +563,7 @@ static int __init setup_fail_make_request(char *str)
}
__setup("fail_make_request=", setup_fail_make_request);
static bool should_fail_request(struct block_device *part, unsigned int bytes)
bool should_fail_request(struct block_device *part, unsigned int bytes)
{
return part->bd_make_it_fail && should_fail(&fail_make_request, bytes);
}
......@@ -608,15 +577,6 @@ static int __init fail_make_request_debugfs(void)
}
late_initcall(fail_make_request_debugfs);
#else /* CONFIG_FAIL_MAKE_REQUEST */
static inline bool should_fail_request(struct block_device *part,
unsigned int bytes)
{
return false;
}
#endif /* CONFIG_FAIL_MAKE_REQUEST */
static inline bool bio_check_ro(struct bio *bio)
......@@ -802,15 +762,6 @@ noinline_for_stack bool submit_bio_checks(struct bio *bio)
break;
}
/*
* Various block parts want %current->io_context, so allocate it up
* front rather than dealing with lots of pain to allocate it only
* where needed. This may fail and the block layer knows how to live
* with it.
*/
if (unlikely(!current->io_context))
create_task_io_context(current, GFP_ATOMIC, q->node);
if (blk_throtl_bio(bio))
return false;
......@@ -836,17 +787,21 @@ noinline_for_stack bool submit_bio_checks(struct bio *bio)
static void __submit_bio_fops(struct gendisk *disk, struct bio *bio)
{
if (unlikely(bio_queue_enter(bio) != 0))
return;
if (submit_bio_checks(bio) && blk_crypto_bio_prep(&bio))
disk->fops->submit_bio(bio);
blk_queue_exit(disk->queue);
if (blk_crypto_bio_prep(&bio)) {
if (likely(bio_queue_enter(bio) == 0)) {
disk->fops->submit_bio(bio);
blk_queue_exit(disk->queue);
}
}
}
static void __submit_bio(struct bio *bio)
{
struct gendisk *disk = bio->bi_bdev->bd_disk;
if (unlikely(!submit_bio_checks(bio)))
return;
if (!disk->fops->submit_bio)
blk_mq_submit_bio(bio);
else
......@@ -1090,135 +1045,7 @@ int iocb_bio_iopoll(struct kiocb *kiocb, struct io_comp_batch *iob,
}
EXPORT_SYMBOL_GPL(iocb_bio_iopoll);
/**
* blk_cloned_rq_check_limits - Helper function to check a cloned request
* for the new queue limits
* @q: the queue
* @rq: the request being checked
*
* Description:
* @rq may have been made based on weaker limitations of upper-level queues
* in request stacking drivers, and it may violate the limitation of @q.
* Since the block layer and the underlying device driver trust @rq
* after it is inserted to @q, it should be checked against @q before
* the insertion using this generic function.
*
* Request stacking drivers like request-based dm may change the queue
* limits when retrying requests on other queues. Those requests need
* to be checked against the new queue limits again during dispatch.
*/
static blk_status_t blk_cloned_rq_check_limits(struct request_queue *q,
struct request *rq)
{
unsigned int max_sectors = blk_queue_get_max_sectors(q, req_op(rq));
if (blk_rq_sectors(rq) > max_sectors) {
/*
* SCSI device does not have a good way to return if
* Write Same/Zero is actually supported. If a device rejects
* a non-read/write command (discard, write same,etc.) the
* low-level device driver will set the relevant queue limit to
* 0 to prevent blk-lib from issuing more of the offending
* operations. Commands queued prior to the queue limit being
* reset need to be completed with BLK_STS_NOTSUPP to avoid I/O
* errors being propagated to upper layers.
*/
if (max_sectors == 0)
return BLK_STS_NOTSUPP;
printk(KERN_ERR "%s: over max size limit. (%u > %u)\n",
__func__, blk_rq_sectors(rq), max_sectors);
return BLK_STS_IOERR;
}
/*
* The queue settings related to segment counting may differ from the
* original queue.
*/
rq->nr_phys_segments = blk_recalc_rq_segments(rq);
if (rq->nr_phys_segments > queue_max_segments(q)) {
printk(KERN_ERR "%s: over max segments limit. (%hu > %hu)\n",
__func__, rq->nr_phys_segments, queue_max_segments(q));
return BLK_STS_IOERR;
}
return BLK_STS_OK;
}
/**
* blk_insert_cloned_request - Helper for stacking drivers to submit a request
* @q: the queue to submit the request
* @rq: the request being queued
*/
blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *rq)
{
blk_status_t ret;
ret = blk_cloned_rq_check_limits(q, rq);
if (ret != BLK_STS_OK)
return ret;
if (rq->rq_disk &&
should_fail_request(rq->rq_disk->part0, blk_rq_bytes(rq)))
return BLK_STS_IOERR;
if (blk_crypto_insert_cloned_request(rq))
return BLK_STS_IOERR;
blk_account_io_start(rq);
/*
* Since we have a scheduler attached on the top device,
* bypass a potential scheduler on the bottom device for
* insert.
*/
return blk_mq_request_issue_directly(rq, true);
}
EXPORT_SYMBOL_GPL(blk_insert_cloned_request);
/**
* blk_rq_err_bytes - determine number of bytes till the next failure boundary
* @rq: request to examine
*
* Description:
* A request could be merge of IOs which require different failure
* handling. This function determines the number of bytes which
* can be failed from the beginning of the request without
* crossing into area which need to be retried further.
*
* Return:
* The number of bytes to fail.
*/
unsigned int blk_rq_err_bytes(const struct request *rq)
{
unsigned int ff = rq->cmd_flags & REQ_FAILFAST_MASK;
unsigned int bytes = 0;
struct bio *bio;
if (!(rq->rq_flags & RQF_MIXED_MERGE))
return blk_rq_bytes(rq);
/*
* Currently the only 'mixing' which can happen is between
* different fastfail types. We can safely fail portions
* which have all the failfast bits that the first one has -
* the ones which are at least as eager to fail as the first
* one.
*/
for (bio = rq->bio; bio; bio = bio->bi_next) {
if ((bio->bi_opf & ff) != ff)
break;
bytes += bio->bi_iter.bi_size;
}
/* this could lead to infinite loop */
BUG_ON(blk_rq_bytes(rq) && !bytes);
return bytes;
}
EXPORT_SYMBOL_GPL(blk_rq_err_bytes);
static void update_io_ticks(struct block_device *part, unsigned long now,
bool end)
void update_io_ticks(struct block_device *part, unsigned long now, bool end)
{
unsigned long stamp;
again:
......@@ -1233,30 +1060,6 @@ static void update_io_ticks(struct block_device *part, unsigned long now,
}
}
void __blk_account_io_done(struct request *req, u64 now)
{
const int sgrp = op_stat_group(req_op(req));
part_stat_lock();
update_io_ticks(req->part, jiffies, true);
part_stat_inc(req->part, ios[sgrp]);
part_stat_add(req->part, nsecs[sgrp], now - req->start_time_ns);
part_stat_unlock();
}
void __blk_account_io_start(struct request *rq)
{
/* passthrough requests can hold bios that do not have ->bi_bdev set */
if (rq->bio && rq->bio->bi_bdev)
rq->part = rq->bio->bi_bdev;
else
rq->part = rq->rq_disk->part0;
part_stat_lock();
update_io_ticks(rq->part, jiffies, false);
part_stat_unlock();
}
static unsigned long __part_start_io_acct(struct block_device *part,
unsigned int sectors, unsigned int op)
{
......@@ -1320,46 +1123,6 @@ void disk_end_io_acct(struct gendisk *disk, unsigned int op,
}
EXPORT_SYMBOL(disk_end_io_acct);
/*
* Steal bios from a request and add them to a bio list.
* The request must not have been partially completed before.
*/
void blk_steal_bios(struct bio_list *list, struct request *rq)
{
if (rq->bio) {
if (list->tail)
list->tail->bi_next = rq->bio;
else
list->head = rq->bio;
list->tail = rq->biotail;
rq->bio = NULL;
rq->biotail = NULL;
}
rq->__data_len = 0;
}
EXPORT_SYMBOL_GPL(blk_steal_bios);
#if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
/**
* rq_flush_dcache_pages - Helper function to flush all pages in a request
* @rq: the request to be flushed
*
* Description:
* Flush all pages in @rq.
*/
void rq_flush_dcache_pages(struct request *rq)
{
struct req_iterator iter;
struct bio_vec bvec;
rq_for_each_segment(bvec, rq, iter)
flush_dcache_page(bvec.bv_page);
}
EXPORT_SYMBOL_GPL(rq_flush_dcache_pages);
#endif
/**
* blk_lld_busy - Check if underlying low-level drivers of a device are busy
* @q : the queue of the device being checked
......@@ -1388,93 +1151,6 @@ int blk_lld_busy(struct request_queue *q)
}
EXPORT_SYMBOL_GPL(blk_lld_busy);
/**
* blk_rq_unprep_clone - Helper function to free all bios in a cloned request
* @rq: the clone request to be cleaned up
*
* Description:
* Free all bios in @rq for a cloned request.
*/
void blk_rq_unprep_clone(struct request *rq)
{
struct bio *bio;
while ((bio = rq->bio) != NULL) {
rq->bio = bio->bi_next;
bio_put(bio);
}
}
EXPORT_SYMBOL_GPL(blk_rq_unprep_clone);
/**
* blk_rq_prep_clone - Helper function to setup clone request
* @rq: the request to be setup
* @rq_src: original request to be cloned
* @bs: bio_set that bios for clone are allocated from
* @gfp_mask: memory allocation mask for bio
* @bio_ctr: setup function to be called for each clone bio.
* Returns %0 for success, non %0 for failure.
* @data: private data to be passed to @bio_ctr
*
* Description:
* Clones bios in @rq_src to @rq, and copies attributes of @rq_src to @rq.
* Also, pages which the original bios are pointing to are not copied
* and the cloned bios just point same pages.
* So cloned bios must be completed before original bios, which means
* the caller must complete @rq before @rq_src.
*/
int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
struct bio_set *bs, gfp_t gfp_mask,
int (*bio_ctr)(struct bio *, struct bio *, void *),
void *data)
{
struct bio *bio, *bio_src;
if (!bs)
bs = &fs_bio_set;
__rq_for_each_bio(bio_src, rq_src) {
bio = bio_clone_fast(bio_src, gfp_mask, bs);
if (!bio)
goto free_and_out;
if (bio_ctr && bio_ctr(bio, bio_src, data))
goto free_and_out;
if (rq->bio) {
rq->biotail->bi_next = bio;
rq->biotail = bio;
} else {
rq->bio = rq->biotail = bio;
}
bio = NULL;
}
/* Copy attributes of the original request to the clone request. */
rq->__sector = blk_rq_pos(rq_src);
rq->__data_len = blk_rq_bytes(rq_src);
if (rq_src->rq_flags & RQF_SPECIAL_PAYLOAD) {
rq->rq_flags |= RQF_SPECIAL_PAYLOAD;
rq->special_vec = rq_src->special_vec;
}
rq->nr_phys_segments = rq_src->nr_phys_segments;
rq->ioprio = rq_src->ioprio;
if (rq->bio && blk_crypto_rq_bio_prep(rq, rq->bio, gfp_mask) < 0)
goto free_and_out;
return 0;
free_and_out:
if (bio)
bio_put(bio);
blk_rq_unprep_clone(rq);
return -ENOMEM;
}
EXPORT_SYMBOL_GPL(blk_rq_prep_clone);
int kblockd_schedule_work(struct work_struct *work)
{
return queue_work(kblockd_workqueue, work);
......@@ -1639,6 +1315,9 @@ int __init blk_dev_init(void)
sizeof_field(struct request, cmd_flags));
BUILD_BUG_ON(REQ_OP_BITS + REQ_FLAG_BITS > 8 *
sizeof_field(struct bio, bi_opf));
BUILD_BUG_ON(ALIGN(offsetof(struct request_queue, srcu),
__alignof__(struct request_queue)) !=
sizeof(struct request_queue));
/* used for unplugging and affects IO latency/throughput - HIGHPRI */
kblockd_workqueue = alloc_workqueue("kblockd",
......@@ -1649,6 +1328,10 @@ int __init blk_dev_init(void)
blk_requestq_cachep = kmem_cache_create("request_queue",
sizeof(struct request_queue), 0, SLAB_PANIC, NULL);
blk_requestq_srcu_cachep = kmem_cache_create("request_queue_srcu",
sizeof(struct request_queue) +
sizeof(struct srcu_struct), 0, SLAB_PANIC, NULL);
blk_debugfs_root = debugfs_create_dir("block", NULL);
return 0;
......
......@@ -463,11 +463,6 @@ bool blk_crypto_register(struct blk_crypto_profile *profile,
}
EXPORT_SYMBOL_GPL(blk_crypto_register);
void blk_crypto_unregister(struct request_queue *q)
{
q->crypto_profile = NULL;
}
/**
* blk_crypto_intersect_capabilities() - restrict supported crypto capabilities
* by child device
......
// SPDX-License-Identifier: GPL-2.0
/*
* Functions related to setting various queue properties from drivers
*/
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/sched/sysctl.h>
#include "blk.h"
#include "blk-mq-sched.h"
/**
* blk_end_sync_rq - executes a completion event on a request
* @rq: request to complete
* @error: end I/O status of the request
*/
static void blk_end_sync_rq(struct request *rq, blk_status_t error)
{
struct completion *waiting = rq->end_io_data;
rq->end_io_data = (void *)(uintptr_t)error;
/*
* complete last, if this is a stack request the process (and thus
* the rq pointer) could be invalid right after this complete()
*/
complete(waiting);
}
/**
* blk_execute_rq_nowait - insert a request to I/O scheduler for execution
* @bd_disk: matching gendisk
* @rq: request to insert
* @at_head: insert request at head or tail of queue
* @done: I/O completion handler
*
* Description:
* Insert a fully prepared request at the back of the I/O scheduler queue
* for execution. Don't wait for completion.
*
* Note:
* This function will invoke @done directly if the queue is dead.
*/
void blk_execute_rq_nowait(struct gendisk *bd_disk, struct request *rq,
int at_head, rq_end_io_fn *done)
{
WARN_ON(irqs_disabled());
WARN_ON(!blk_rq_is_passthrough(rq));
rq->rq_disk = bd_disk;
rq->end_io = done;
blk_account_io_start(rq);
/*
* don't check dying flag for MQ because the request won't
* be reused after dying flag is set
*/
blk_mq_sched_insert_request(rq, at_head, true, false);
}
EXPORT_SYMBOL_GPL(blk_execute_rq_nowait);
static bool blk_rq_is_poll(struct request *rq)
{
if (!rq->mq_hctx)
return false;
if (rq->mq_hctx->type != HCTX_TYPE_POLL)
return false;
if (WARN_ON_ONCE(!rq->bio))
return false;
return true;
}
static void blk_rq_poll_completion(struct request *rq, struct completion *wait)
{
do {
bio_poll(rq->bio, NULL, 0);
cond_resched();
} while (!completion_done(wait));
}
/**
* blk_execute_rq - insert a request into queue for execution
* @bd_disk: matching gendisk
* @rq: request to insert
* @at_head: insert request at head or tail of queue
*
* Description:
* Insert a fully prepared request at the back of the I/O scheduler queue
* for execution and wait for completion.
* Return: The blk_status_t result provided to blk_mq_end_request().
*/
blk_status_t blk_execute_rq(struct gendisk *bd_disk, struct request *rq, int at_head)
{
DECLARE_COMPLETION_ONSTACK(wait);
unsigned long hang_check;
rq->end_io_data = &wait;
blk_execute_rq_nowait(bd_disk, rq, at_head, blk_end_sync_rq);
/* Prevent hang_check timer from firing at us during very long I/O */
hang_check = sysctl_hung_task_timeout_secs;
if (blk_rq_is_poll(rq))
blk_rq_poll_completion(rq, &wait);
else if (hang_check)
while (!wait_for_completion_io_timeout(&wait, hang_check * (HZ/2)));
else
wait_for_completion_io(&wait);
return (blk_status_t)(uintptr_t)rq->end_io_data;
}
EXPORT_SYMBOL(blk_execute_rq);
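The synchronous path above parks the caller on an on-stack completion that blk_end_sync_rq() fires, with a dedicated spin loop for polled queues. A minimal, hypothetical caller of this API (function and arguments are placeholders; note that a later patch in this same series drops the gendisk argument, as the bsg-lib hunk further down already reflects):

	/* Hypothetical sketch: allocate a passthrough request and execute it
	 * synchronously via blk_execute_rq() as declared in the hunk above. */
	static blk_status_t demo_sync_passthrough(struct request_queue *q,
						  struct gendisk *disk)
	{
		struct request *rq;
		blk_status_t status;

		rq = blk_mq_alloc_request(q, REQ_OP_DRV_IN, 0);
		if (IS_ERR(rq))
			return BLK_STS_RESOURCE;

		/* Waits on DECLARE_COMPLETION_ONSTACK(); polled hctx types
		 * instead spin in blk_rq_poll_completion(). */
		status = blk_execute_rq(disk, rq, 0);

		blk_mq_free_request(rq);
		return status;
	}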
......@@ -69,6 +69,7 @@
#include <linux/blkdev.h>
#include <linux/gfp.h>
#include <linux/blk-mq.h>
#include <linux/part_stat.h>
#include "blk.h"
#include "blk-mq.h"
......@@ -95,6 +96,12 @@ enum {
static void blk_kick_flush(struct request_queue *q,
struct blk_flush_queue *fq, unsigned int flags);
static inline struct blk_flush_queue *
blk_get_flush_queue(struct request_queue *q, struct blk_mq_ctx *ctx)
{
return blk_mq_map_queue(q, REQ_OP_FLUSH, ctx)->fq;
}
static unsigned int blk_flush_policy(unsigned long fflags, struct request *rq)
{
unsigned int policy = 0;
......@@ -138,7 +145,7 @@ static void blk_flush_queue_rq(struct request *rq, bool add_front)
static void blk_account_io_flush(struct request *rq)
{
struct block_device *part = rq->rq_disk->part0;
struct block_device *part = rq->q->disk->part0;
part_stat_lock();
part_stat_inc(part, ios[STAT_FLUSH]);
......@@ -222,7 +229,7 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
/* release the tag's ownership to the req cloned from */
spin_lock_irqsave(&fq->mq_flush_lock, flags);
if (!refcount_dec_and_test(&flush_rq->ref)) {
if (!req_ref_put_and_test(flush_rq)) {
fq->rq_status = error;
spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
return;
......@@ -235,8 +242,10 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
* avoiding use-after-free.
*/
WRITE_ONCE(flush_rq->state, MQ_RQ_IDLE);
if (fq->rq_status != BLK_STS_OK)
if (fq->rq_status != BLK_STS_OK) {
error = fq->rq_status;
fq->rq_status = BLK_STS_OK;
}
if (!q->elevator) {
flush_rq->tag = BLK_MQ_NO_TAG;
......@@ -332,7 +341,6 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
flush_rq->cmd_flags = REQ_OP_FLUSH | REQ_PREFLUSH;
flush_rq->cmd_flags |= (flags & REQ_DRV) | (flags & REQ_FAILFAST_MASK);
flush_rq->rq_flags |= RQF_FLUSH_SEQ;
flush_rq->rq_disk = first_rq->rq_disk;
flush_rq->end_io = flush_end_io;
/*
* Order WRITE ->end_io and WRITE rq->ref, and its pair is the one
......@@ -341,7 +349,7 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
* and READ flush_rq->end_io
*/
smp_wmb();
refcount_set(&flush_rq->ref, 1);
req_ref_set(flush_rq, 1);
blk_flush_queue_rq(flush_rq, false);
}
......
......@@ -411,7 +411,7 @@ void blk_integrity_register(struct gendisk *disk, struct blk_integrity *template
#ifdef CONFIG_BLK_INLINE_ENCRYPTION
if (disk->queue->crypto_profile) {
pr_warn("blk-integrity: Integrity and hardware inline encryption are not supported together. Disabling hardware inline encryption.\n");
blk_crypto_unregister(disk->queue);
disk->queue->crypto_profile = NULL;
}
#endif
}
......
......@@ -8,22 +8,25 @@
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/slab.h>
#include <linux/security.h>
#include <linux/sched/task.h>
#include "blk.h"
#include "blk-mq-sched.h"
/*
* For io context allocations
*/
static struct kmem_cache *iocontext_cachep;
#ifdef CONFIG_BLK_ICQ
/**
* get_io_context - increment reference count to io_context
* @ioc: io_context to get
*
* Increment reference count to @ioc.
*/
void get_io_context(struct io_context *ioc)
static void get_io_context(struct io_context *ioc)
{
BUG_ON(atomic_long_read(&ioc->refcount) <= 0);
atomic_long_inc(&ioc->refcount);
......@@ -53,6 +56,16 @@ static void ioc_exit_icq(struct io_cq *icq)
icq->flags |= ICQ_EXITED;
}
static void ioc_exit_icqs(struct io_context *ioc)
{
struct io_cq *icq;
spin_lock_irq(&ioc->lock);
hlist_for_each_entry(icq, &ioc->icq_list, ioc_node)
ioc_exit_icq(icq);
spin_unlock_irq(&ioc->lock);
}
/*
* Release an icq. Called with ioc locked for blk-mq, and with both ioc
* and queue locked for legacy.
......@@ -132,68 +145,74 @@ static void ioc_release_fn(struct work_struct *work)
kmem_cache_free(iocontext_cachep, ioc);
}
/**
* put_io_context - put a reference of io_context
* @ioc: io_context to put
*
* Decrement reference count of @ioc and release it if the count reaches
* zero.
/*
* Releasing icqs requires reverse order double locking and we may already be
* holding a queue_lock. Do it asynchronously from a workqueue.
*/
void put_io_context(struct io_context *ioc)
static bool ioc_delay_free(struct io_context *ioc)
{
unsigned long flags;
bool free_ioc = false;
if (ioc == NULL)
return;
BUG_ON(atomic_long_read(&ioc->refcount) <= 0);
/*
* Releasing ioc requires reverse order double locking and we may
* already be holding a queue_lock. Do it asynchronously from wq.
*/
if (atomic_long_dec_and_test(&ioc->refcount)) {
spin_lock_irqsave(&ioc->lock, flags);
if (!hlist_empty(&ioc->icq_list))
queue_work(system_power_efficient_wq,
&ioc->release_work);
else
free_ioc = true;
spin_lock_irqsave(&ioc->lock, flags);
if (!hlist_empty(&ioc->icq_list)) {
queue_work(system_power_efficient_wq, &ioc->release_work);
spin_unlock_irqrestore(&ioc->lock, flags);
return true;
}
if (free_ioc)
kmem_cache_free(iocontext_cachep, ioc);
spin_unlock_irqrestore(&ioc->lock, flags);
return false;
}
/**
* put_io_context_active - put active reference on ioc
* @ioc: ioc of interest
* ioc_clear_queue - break any ioc association with the specified queue
* @q: request_queue being cleared
*
* Undo get_io_context_active(). If active reference reaches zero after
* put, @ioc can never issue further IOs and ioscheds are notified.
* Walk @q->icq_list and exit all io_cq's.
*/
void put_io_context_active(struct io_context *ioc)
void ioc_clear_queue(struct request_queue *q)
{
struct io_cq *icq;
LIST_HEAD(icq_list);
if (!atomic_dec_and_test(&ioc->active_ref)) {
put_io_context(ioc);
return;
}
spin_lock_irq(&q->queue_lock);
list_splice_init(&q->icq_list, &icq_list);
spin_unlock_irq(&q->queue_lock);
spin_lock_irq(&ioc->lock);
hlist_for_each_entry(icq, &ioc->icq_list, ioc_node) {
if (icq->flags & ICQ_EXITED)
continue;
rcu_read_lock();
while (!list_empty(&icq_list)) {
struct io_cq *icq =
list_entry(icq_list.next, struct io_cq, q_node);
ioc_exit_icq(icq);
spin_lock_irq(&icq->ioc->lock);
if (!(icq->flags & ICQ_DESTROYED))
ioc_destroy_icq(icq);
spin_unlock_irq(&icq->ioc->lock);
}
spin_unlock_irq(&ioc->lock);
rcu_read_unlock();
}
#else /* CONFIG_BLK_ICQ */
static inline void ioc_exit_icqs(struct io_context *ioc)
{
}
static inline bool ioc_delay_free(struct io_context *ioc)
{
return false;
}
#endif /* CONFIG_BLK_ICQ */
put_io_context(ioc);
/**
* put_io_context - put a reference of io_context
* @ioc: io_context to put
*
* Decrement reference count of @ioc and release it if the count reaches
* zero.
*/
void put_io_context(struct io_context *ioc)
{
BUG_ON(atomic_long_read(&ioc->refcount) <= 0);
if (atomic_long_dec_and_test(&ioc->refcount) && !ioc_delay_free(ioc))
kmem_cache_free(iocontext_cachep, ioc);
}
EXPORT_SYMBOL_GPL(put_io_context);
/* Called by the exiting task */
void exit_io_context(struct task_struct *task)
......@@ -205,132 +224,109 @@ void exit_io_context(struct task_struct *task)
task->io_context = NULL;
task_unlock(task);
atomic_dec(&ioc->nr_tasks);
put_io_context_active(ioc);
}
static void __ioc_clear_queue(struct list_head *icq_list)
{
unsigned long flags;
rcu_read_lock();
while (!list_empty(icq_list)) {
struct io_cq *icq = list_entry(icq_list->next,
struct io_cq, q_node);
struct io_context *ioc = icq->ioc;
spin_lock_irqsave(&ioc->lock, flags);
if (icq->flags & ICQ_DESTROYED) {
spin_unlock_irqrestore(&ioc->lock, flags);
continue;
}
ioc_destroy_icq(icq);
spin_unlock_irqrestore(&ioc->lock, flags);
if (atomic_dec_and_test(&ioc->active_ref)) {
ioc_exit_icqs(ioc);
put_io_context(ioc);
}
rcu_read_unlock();
}
/**
* ioc_clear_queue - break any ioc association with the specified queue
* @q: request_queue being cleared
*
* Walk @q->icq_list and exit all io_cq's.
*/
void ioc_clear_queue(struct request_queue *q)
{
LIST_HEAD(icq_list);
spin_lock_irq(&q->queue_lock);
list_splice_init(&q->icq_list, &icq_list);
spin_unlock_irq(&q->queue_lock);
__ioc_clear_queue(&icq_list);
}
int create_task_io_context(struct task_struct *task, gfp_t gfp_flags, int node)
static struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
{
struct io_context *ioc;
int ret;
ioc = kmem_cache_alloc_node(iocontext_cachep, gfp_flags | __GFP_ZERO,
node);
if (unlikely(!ioc))
return -ENOMEM;
return NULL;
/* initialize */
atomic_long_set(&ioc->refcount, 1);
atomic_set(&ioc->nr_tasks, 1);
atomic_set(&ioc->active_ref, 1);
#ifdef CONFIG_BLK_ICQ
spin_lock_init(&ioc->lock);
INIT_RADIX_TREE(&ioc->icq_tree, GFP_ATOMIC);
INIT_HLIST_HEAD(&ioc->icq_list);
INIT_WORK(&ioc->release_work, ioc_release_fn);
#endif
return ioc;
}
/*
* Try to install. ioc shouldn't be installed if someone else
* already did or @task, which isn't %current, is exiting. Note
* that we need to allow ioc creation on exiting %current as exit
* path may issue IOs from e.g. exit_files(). The exit path is
* responsible for not issuing IO after exit_io_context().
*/
task_lock(task);
if (!task->io_context &&
(task == current || !(task->flags & PF_EXITING)))
task->io_context = ioc;
else
kmem_cache_free(iocontext_cachep, ioc);
int set_task_ioprio(struct task_struct *task, int ioprio)
{
int err;
const struct cred *cred = current_cred(), *tcred;
ret = task->io_context ? 0 : -EBUSY;
rcu_read_lock();
tcred = __task_cred(task);
if (!uid_eq(tcred->uid, cred->euid) &&
!uid_eq(tcred->uid, cred->uid) && !capable(CAP_SYS_NICE)) {
rcu_read_unlock();
return -EPERM;
}
rcu_read_unlock();
task_unlock(task);
err = security_task_setioprio(task, ioprio);
if (err)
return err;
return ret;
}
task_lock(task);
if (unlikely(!task->io_context)) {
struct io_context *ioc;
/**
* get_task_io_context - get io_context of a task
* @task: task of interest
* @gfp_flags: allocation flags, used if allocation is necessary
* @node: allocation node, used if allocation is necessary
*
* Return io_context of @task. If it doesn't exist, it is created with
* @gfp_flags and @node. The returned io_context has its reference count
* incremented.
*
* This function always goes through task_lock() and it's better to use
* %current->io_context + get_io_context() for %current.
*/
struct io_context *get_task_io_context(struct task_struct *task,
gfp_t gfp_flags, int node)
{
struct io_context *ioc;
task_unlock(task);
might_sleep_if(gfpflags_allow_blocking(gfp_flags));
ioc = alloc_io_context(GFP_ATOMIC, NUMA_NO_NODE);
if (!ioc)
return -ENOMEM;
do {
task_lock(task);
ioc = task->io_context;
if (likely(ioc)) {
get_io_context(ioc);
task_unlock(task);
return ioc;
if (task->flags & PF_EXITING) {
err = -ESRCH;
kmem_cache_free(iocontext_cachep, ioc);
goto out;
}
task_unlock(task);
} while (!create_task_io_context(task, gfp_flags, node));
if (task->io_context)
kmem_cache_free(iocontext_cachep, ioc);
else
task->io_context = ioc;
}
task->io_context->ioprio = ioprio;
out:
task_unlock(task);
return err;
}
EXPORT_SYMBOL_GPL(set_task_ioprio);
return NULL;
int __copy_io(unsigned long clone_flags, struct task_struct *tsk)
{
struct io_context *ioc = current->io_context;
/*
* Share io context with parent, if CLONE_IO is set
*/
if (clone_flags & CLONE_IO) {
atomic_inc(&ioc->active_ref);
tsk->io_context = ioc;
} else if (ioprio_valid(ioc->ioprio)) {
tsk->io_context = alloc_io_context(GFP_KERNEL, NUMA_NO_NODE);
if (!tsk->io_context)
return -ENOMEM;
tsk->io_context->ioprio = ioc->ioprio;
}
return 0;
}
#ifdef CONFIG_BLK_ICQ
/**
* ioc_lookup_icq - lookup io_cq from ioc
* @ioc: the associated io_context
* @q: the associated request_queue
*
* Look up io_cq associated with @ioc - @q pair from @ioc. Must be called
* with @q->queue_lock held.
*/
struct io_cq *ioc_lookup_icq(struct io_context *ioc, struct request_queue *q)
struct io_cq *ioc_lookup_icq(struct request_queue *q)
{
struct io_context *ioc = current->io_context;
struct io_cq *icq;
lockdep_assert_held(&q->queue_lock);
......@@ -359,9 +355,7 @@ EXPORT_SYMBOL(ioc_lookup_icq);
/**
* ioc_create_icq - create and link io_cq
* @ioc: io_context of interest
* @q: request_queue of interest
* @gfp_mask: allocation mask
*
* Make sure io_cq linking @ioc and @q exists. If icq doesn't exist, they
* will be created using @gfp_mask.
......@@ -369,19 +363,19 @@ EXPORT_SYMBOL(ioc_lookup_icq);
* The caller is responsible for ensuring @ioc won't go away and @q is
* alive and will stay alive until this function returns.
*/
struct io_cq *ioc_create_icq(struct io_context *ioc, struct request_queue *q,
gfp_t gfp_mask)
static struct io_cq *ioc_create_icq(struct request_queue *q)
{
struct io_context *ioc = current->io_context;
struct elevator_type *et = q->elevator->type;
struct io_cq *icq;
/* allocate stuff */
icq = kmem_cache_alloc_node(et->icq_cache, gfp_mask | __GFP_ZERO,
icq = kmem_cache_alloc_node(et->icq_cache, GFP_ATOMIC | __GFP_ZERO,
q->node);
if (!icq)
return NULL;
if (radix_tree_maybe_preload(gfp_mask) < 0) {
if (radix_tree_maybe_preload(GFP_ATOMIC) < 0) {
kmem_cache_free(et->icq_cache, icq);
return NULL;
}
......@@ -402,7 +396,7 @@ struct io_cq *ioc_create_icq(struct io_context *ioc, struct request_queue *q,
et->ops.init_icq(icq);
} else {
kmem_cache_free(et->icq_cache, icq);
icq = ioc_lookup_icq(ioc, q);
icq = ioc_lookup_icq(q);
if (!icq)
printk(KERN_ERR "cfq: icq link failed!\n");
}
......@@ -413,6 +407,46 @@ struct io_cq *ioc_create_icq(struct io_context *ioc, struct request_queue *q,
return icq;
}
struct io_cq *ioc_find_get_icq(struct request_queue *q)
{
struct io_context *ioc = current->io_context;
struct io_cq *icq = NULL;
if (unlikely(!ioc)) {
ioc = alloc_io_context(GFP_ATOMIC, q->node);
if (!ioc)
return NULL;
task_lock(current);
if (current->io_context) {
kmem_cache_free(iocontext_cachep, ioc);
ioc = current->io_context;
} else {
current->io_context = ioc;
}
get_io_context(ioc);
task_unlock(current);
} else {
get_io_context(ioc);
spin_lock_irq(&q->queue_lock);
icq = ioc_lookup_icq(q);
spin_unlock_irq(&q->queue_lock);
}
if (!icq) {
icq = ioc_create_icq(q);
if (!icq) {
put_io_context(ioc);
return NULL;
}
}
return icq;
}
EXPORT_SYMBOL_GPL(ioc_find_get_icq);
#endif /* CONFIG_BLK_ICQ */
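ioc_find_get_icq() bundles what the removed blk_mq_sched_assign_ioc() used to do: allocate current->io_context on demand, take a reference, and look up or create the icq for the queue. A hypothetical elevator ->prepare_request hook (names are illustrative) would use it roughly as:

	/* Hypothetical sketch: attach an icq to a request at prepare time.
	 * ioc_find_get_icq() returns with a reference on icq->ioc held, or
	 * NULL if the allocation failed. */
	static void demo_prepare_request(struct request *rq)
	{
		rq->elv.icq = ioc_find_get_icq(rq->q);
	}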
static int __init blk_ioc_init(void)
{
iocontext_cachep = kmem_cache_create("blkdev_ioc",
......
......@@ -62,6 +62,7 @@ struct ioprio_blkg {
struct ioprio_blkcg {
struct blkcg_policy_data cpd;
enum prio_policy prio_policy;
bool prio_set;
};
static inline struct ioprio_blkg *pd_to_ioprio(struct blkg_policy_data *pd)
......@@ -112,7 +113,7 @@ static ssize_t ioprio_set_prio_policy(struct kernfs_open_file *of, char *buf,
if (ret < 0)
return ret;
blkcg->prio_policy = ret;
blkcg->prio_set = true;
return nbytes;
}
......@@ -190,6 +191,10 @@ static void blkcg_ioprio_track(struct rq_qos *rqos, struct request *rq,
struct bio *bio)
{
struct ioprio_blkcg *blkcg = ioprio_blkcg_from_bio(bio);
u16 prio;
if (!blkcg->prio_set)
return;
/*
* Except for IOPRIO_CLASS_NONE, higher I/O priority numbers
......@@ -199,8 +204,10 @@ static void blkcg_ioprio_track(struct rq_qos *rqos, struct request *rq,
* bio I/O priority is not modified. If the bio I/O priority equals
* IOPRIO_CLASS_NONE, the cgroup I/O priority is assigned to the bio.
*/
bio->bi_ioprio = max_t(u16, bio->bi_ioprio,
IOPRIO_PRIO_VALUE(blkcg->prio_policy, 0));
prio = max_t(u16, bio->bi_ioprio,
IOPRIO_PRIO_VALUE(blkcg->prio_policy, 0));
if (prio > bio->bi_ioprio)
bio->bi_ioprio = prio;
}
static void blkcg_ioprio_exit(struct rq_qos *rqos)
......
......@@ -8,10 +8,12 @@
#include <linux/blkdev.h>
#include <linux/blk-integrity.h>
#include <linux/scatterlist.h>
#include <linux/part_stat.h>
#include <trace/events/block.h>
#include "blk.h"
#include "blk-mq-sched.h"
#include "blk-rq-qos.h"
#include "blk-throttle.h"
......@@ -775,8 +777,7 @@ static struct request *attempt_merge(struct request_queue *q,
if (req_op(req) != req_op(next))
return NULL;
if (rq_data_dir(req) != rq_data_dir(next)
|| req->rq_disk != next->rq_disk)
if (rq_data_dir(req) != rq_data_dir(next))
return NULL;
if (req_op(req) == REQ_OP_WRITE_SAME &&
......@@ -903,10 +904,6 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
if (bio_data_dir(bio) != rq_data_dir(rq))
return false;
/* must be same device */
if (rq->rq_disk != bio->bi_bdev->bd_disk)
return false;
/* only merge integrity protected bio into ditto rq */
if (blk_integrity_merge_bio(rq->q, rq, bio) == false)
return false;
......@@ -1067,7 +1064,6 @@ static enum bio_merge_status blk_attempt_bio_merge(struct request_queue *q,
* @q: request_queue new bio is being queued at
* @bio: new bio being queued
* @nr_segs: number of segments in @bio
* @same_queue_rq: output value, will be true if there's an existing request
* from the passed in @q already in the plug list
*
* Determine whether @bio being queued on @q can be merged with the previous
......@@ -1084,7 +1080,7 @@ static enum bio_merge_status blk_attempt_bio_merge(struct request_queue *q,
* Caller must ensure !blk_queue_nomerges(q) beforehand.
*/
bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
unsigned int nr_segs, bool *same_queue_rq)
unsigned int nr_segs)
{
struct blk_plug *plug;
struct request *rq;
......@@ -1096,12 +1092,6 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
/* check the previously added entry for a quick merge attempt */
rq = rq_list_peek(&plug->mq_list);
if (rq->q == q) {
/*
* Only blk-mq multiple hardware queues case checks the rq in
* the same queue, there should be only one such rq in a queue
*/
*same_queue_rq = true;
if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
BIO_MERGE_OK)
return true;
......
......@@ -11,6 +11,7 @@
#include "blk.h"
#include "blk-mq.h"
#include "blk-mq-debugfs.h"
#include "blk-mq-sched.h"
#include "blk-mq-tag.h"
#include "blk-rq-qos.h"
......@@ -29,6 +30,9 @@ static int queue_poll_stat_show(void *data, struct seq_file *m)
struct request_queue *q = data;
int bucket;
if (!q->poll_stat)
return 0;
for (bucket = 0; bucket < (BLK_MQ_POLL_STATS_BKTS / 2); bucket++) {
seq_printf(m, "read (%d Bytes): ", 1 << (9 + bucket));
print_stat(m, &q->poll_stat[2 * bucket]);
......@@ -122,7 +126,6 @@ static const char *const blk_queue_flag_name[] = {
QUEUE_FLAG_NAME(FUA),
QUEUE_FLAG_NAME(DAX),
QUEUE_FLAG_NAME(STATS),
QUEUE_FLAG_NAME(POLL_STATS),
QUEUE_FLAG_NAME(REGISTERED),
QUEUE_FLAG_NAME(QUIESCED),
QUEUE_FLAG_NAME(PCI_P2PDMA),
......
......@@ -18,32 +18,6 @@
#include "blk-mq-tag.h"
#include "blk-wbt.h"
void blk_mq_sched_assign_ioc(struct request *rq)
{
struct request_queue *q = rq->q;
struct io_context *ioc;
struct io_cq *icq;
/*
* May not have an IO context if it's a passthrough request
*/
ioc = current->io_context;
if (!ioc)
return;
spin_lock_irq(&q->queue_lock);
icq = ioc_lookup_icq(ioc, q);
spin_unlock_irq(&q->queue_lock);
if (!icq) {
icq = ioc_create_icq(ioc, q, GFP_ATOMIC);
if (!icq)
return;
}
get_io_context(icq->ioc);
rq->elv.icq = icq;
}
/*
* Mark a hardware queue as needing a restart. For shared queues, maintain
* a count of how many hardware queues are marked for restart.
......@@ -501,7 +475,8 @@ void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
* us one extra enqueue & dequeue to sw queue.
*/
if (!hctx->dispatch_busy && !run_queue_async) {
blk_mq_try_issue_list_directly(hctx, list);
blk_mq_run_dispatch_ops(hctx->queue,
blk_mq_try_issue_list_directly(hctx, list));
if (list_empty(list))
goto out;
}
......
......@@ -8,8 +8,6 @@
#define MAX_SCHED_RQ (16 * BLKDEV_DEFAULT_RQ)
void blk_mq_sched_assign_ioc(struct request *rq);
bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
unsigned int nr_segs, struct request **merged_request);
bool blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio,
......
......@@ -36,8 +36,6 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj)
struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
kobj);
if (hctx->flags & BLK_MQ_F_BLOCKING)
cleanup_srcu_struct(hctx->srcu);
blk_free_flush_queue(hctx->fq);
sbitmap_free(&hctx->ctx_map);
free_cpumask_var(hctx->cpumask);
......
......@@ -215,7 +215,8 @@ void blk_mq_put_tags(struct blk_mq_tags *tags, int *tag_array, int nr_tags)
struct bt_iter_data {
struct blk_mq_hw_ctx *hctx;
busy_iter_fn *fn;
struct request_queue *q;
busy_tag_iter_fn *fn;
void *data;
bool reserved;
};
......@@ -228,7 +229,7 @@ static struct request *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
spin_lock_irqsave(&tags->lock, flags);
rq = tags->rqs[bitnr];
if (!rq || rq->tag != bitnr || !refcount_inc_not_zero(&rq->ref))
if (!rq || rq->tag != bitnr || !req_ref_inc_not_zero(rq))
rq = NULL;
spin_unlock_irqrestore(&tags->lock, flags);
return rq;
......@@ -238,11 +239,18 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
{
struct bt_iter_data *iter_data = data;
struct blk_mq_hw_ctx *hctx = iter_data->hctx;
struct blk_mq_tags *tags = hctx->tags;
struct request_queue *q = iter_data->q;
struct blk_mq_tag_set *set = q->tag_set;
bool reserved = iter_data->reserved;
struct blk_mq_tags *tags;
struct request *rq;
bool ret = true;
if (blk_mq_is_shared_tags(set->flags))
tags = set->shared_tags;
else
tags = hctx->tags;
if (!reserved)
bitnr += tags->nr_reserved_tags;
/*
......@@ -253,8 +261,8 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
if (!rq)
return true;
if (rq->q == hctx->queue && rq->mq_hctx == hctx)
ret = iter_data->fn(hctx, rq, iter_data->data, reserved);
if (rq->q == q && (!hctx || rq->mq_hctx == hctx))
ret = iter_data->fn(rq, iter_data->data, reserved);
blk_mq_put_rq_ref(rq);
return ret;
}
......@@ -262,6 +270,7 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
/**
* bt_for_each - iterate over the requests associated with a hardware queue
* @hctx: Hardware queue to examine.
* @q: Request queue to examine.
* @bt: sbitmap to examine. This is either the breserved_tags member
* or the bitmap_tags member of struct blk_mq_tags.
* @fn: Pointer to the function that will be called for each request
......@@ -273,14 +282,16 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
* @reserved: Indicates whether @bt is the breserved_tags member or the
* bitmap_tags member of struct blk_mq_tags.
*/
static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt,
busy_iter_fn *fn, void *data, bool reserved)
static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct request_queue *q,
struct sbitmap_queue *bt, busy_tag_iter_fn *fn,
void *data, bool reserved)
{
struct bt_iter_data iter_data = {
.hctx = hctx,
.fn = fn,
.data = data,
.reserved = reserved,
.q = q,
};
sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data);
......@@ -457,12 +468,9 @@ EXPORT_SYMBOL(blk_mq_tagset_wait_completed_request);
* called for all requests on all queues that share that tag set and not only
* for requests associated with @q.
*/
void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_tag_iter_fn *fn,
void *priv)
{
struct blk_mq_hw_ctx *hctx;
int i;
/*
* __blk_mq_update_nr_hw_queues() updates nr_hw_queues and queue_hw_ctx
* while the queue is frozen. So we can use q_usage_counter to avoid
......@@ -471,19 +479,34 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
if (!percpu_ref_tryget(&q->q_usage_counter))
return;
queue_for_each_hw_ctx(q, hctx, i) {
struct blk_mq_tags *tags = hctx->tags;
/*
* If no software queues are currently mapped to this
* hardware queue, there's nothing to check
*/
if (!blk_mq_hw_queue_mapped(hctx))
continue;
if (blk_mq_is_shared_tags(q->tag_set->flags)) {
struct blk_mq_tags *tags = q->tag_set->shared_tags;
struct sbitmap_queue *bresv = &tags->breserved_tags;
struct sbitmap_queue *btags = &tags->bitmap_tags;
if (tags->nr_reserved_tags)
bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
bt_for_each(NULL, q, bresv, fn, priv, true);
bt_for_each(NULL, q, btags, fn, priv, false);
} else {
struct blk_mq_hw_ctx *hctx;
int i;
queue_for_each_hw_ctx(q, hctx, i) {
struct blk_mq_tags *tags = hctx->tags;
struct sbitmap_queue *bresv = &tags->breserved_tags;
struct sbitmap_queue *btags = &tags->bitmap_tags;
/*
* If no software queues are currently mapped to this
* hardware queue, there's nothing to check
*/
if (!blk_mq_hw_queue_mapped(hctx))
continue;
if (tags->nr_reserved_tags)
bt_for_each(hctx, q, bresv, fn, priv, true);
bt_for_each(hctx, q, btags, fn, priv, false);
}
}
blk_queue_exit(q);
}
......
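With this rework blk_mq_queue_tag_busy_iter() takes a busy_tag_iter_fn, i.e. the callback no longer receives an hctx, matching the tagset-wide iterators and handling shared tags in one place. A hedged sketch of a caller (helper names are made up):

	/* Hypothetical sketch: count in-flight requests on a queue using the
	 * busy_tag_iter_fn-based iterator. */
	static bool demo_count_busy(struct request *rq, void *data, bool reserved)
	{
		unsigned int *count = data;

		(*count)++;
		return true;	/* keep iterating */
	}

	static unsigned int demo_busy_requests(struct request_queue *q)
	{
		unsigned int count = 0;

		blk_mq_queue_tag_busy_iter(q, demo_count_busy, &count);
		return count;
	}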
......@@ -28,7 +28,7 @@ extern void blk_mq_tag_resize_shared_tags(struct blk_mq_tag_set *set,
extern void blk_mq_tag_update_sched_shared_tags(struct request_queue *q);
extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_tag_iter_fn *fn,
void *priv);
void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
void *priv);
......
This diff has been collapsed.
......@@ -65,9 +65,6 @@ void blk_mq_request_bypass_insert(struct request *rq, bool at_head,
bool run_queue);
void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
struct list_head *list);
/* Used by blk_insert_cloned_request() to issue request directly */
blk_status_t blk_mq_request_issue_directly(struct request *rq, bool last);
void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
struct list_head *list);
......@@ -377,5 +374,24 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
return __blk_mq_active_requests(hctx) < depth;
}
/* run the code block in @dispatch_ops with rcu/srcu read lock held */
#define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops) \
do { \
if (!blk_queue_has_srcu(q)) { \
rcu_read_lock(); \
(dispatch_ops); \
rcu_read_unlock(); \
} else { \
int srcu_idx; \
\
might_sleep_if(check_sleep); \
srcu_idx = srcu_read_lock((q)->srcu); \
(dispatch_ops); \
srcu_read_unlock((q)->srcu, srcu_idx); \
} \
} while (0)
#define blk_mq_run_dispatch_ops(q, dispatch_ops) \
__blk_mq_run_dispatch_ops(q, true, dispatch_ops) \
#endif
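blk_mq_run_dispatch_ops() lets dispatch paths stop caring whether the queue uses plain RCU or SRCU (BLK_MQ_F_BLOCKING): the macro picks the right read-side protection around the expression it is given. A minimal sketch mirroring the blk-mq-sched.c hunk above:

	/* Hypothetical sketch: issue a list of requests directly under the
	 * protection selected by blk_mq_run_dispatch_ops(). */
	static void demo_direct_dispatch(struct blk_mq_hw_ctx *hctx,
					 struct list_head *list)
	{
		blk_mq_run_dispatch_ops(hctx->queue,
				blk_mq_try_issue_list_directly(hctx, list));
	}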
......@@ -15,7 +15,7 @@
struct blk_queue_stats {
struct list_head callbacks;
spinlock_t lock;
bool enable_accounting;
int accounting;
};
void blk_rq_stat_init(struct blk_rq_stat *stat)
......@@ -161,7 +161,7 @@ void blk_stat_remove_callback(struct request_queue *q,
spin_lock_irqsave(&q->stats->lock, flags);
list_del_rcu(&cb->list);
if (list_empty(&q->stats->callbacks) && !q->stats->enable_accounting)
if (list_empty(&q->stats->callbacks) && !q->stats->accounting)
blk_queue_flag_clear(QUEUE_FLAG_STATS, q);
spin_unlock_irqrestore(&q->stats->lock, flags);
......@@ -184,13 +184,24 @@ void blk_stat_free_callback(struct blk_stat_callback *cb)
call_rcu(&cb->rcu, blk_stat_free_callback_rcu);
}
void blk_stat_disable_accounting(struct request_queue *q)
{
unsigned long flags;
spin_lock_irqsave(&q->stats->lock, flags);
if (!--q->stats->accounting)
blk_queue_flag_clear(QUEUE_FLAG_STATS, q);
spin_unlock_irqrestore(&q->stats->lock, flags);
}
EXPORT_SYMBOL_GPL(blk_stat_disable_accounting);
void blk_stat_enable_accounting(struct request_queue *q)
{
unsigned long flags;
spin_lock_irqsave(&q->stats->lock, flags);
q->stats->enable_accounting = true;
blk_queue_flag_set(QUEUE_FLAG_STATS, q);
if (!q->stats->accounting++)
blk_queue_flag_set(QUEUE_FLAG_STATS, q);
spin_unlock_irqrestore(&q->stats->lock, flags);
}
EXPORT_SYMBOL_GPL(blk_stat_enable_accounting);
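Turning ->enable_accounting into a counter makes enable/disable nest, so independent users can each take and drop their own reference without clearing QUEUE_FLAG_STATS underneath someone else. A hedged sketch of a paired user (names are illustrative):

	/* Hypothetical sketch: a scheduler enabling stats for its lifetime. */
	static int demo_sched_init(struct request_queue *q)
	{
		blk_stat_enable_accounting(q);	/* ++q->stats->accounting */
		return 0;
	}

	static void demo_sched_exit(struct request_queue *q)
	{
		/* QUEUE_FLAG_STATS is only cleared once the last user drops. */
		blk_stat_disable_accounting(q);
	}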
......@@ -205,7 +216,7 @@ struct blk_queue_stats *blk_alloc_queue_stats(void)
INIT_LIST_HEAD(&stats->callbacks);
spin_lock_init(&stats->lock);
stats->enable_accounting = false;
stats->accounting = 0;
return stats;
}
......@@ -219,3 +230,21 @@ void blk_free_queue_stats(struct blk_queue_stats *stats)
kfree(stats);
}
bool blk_stats_alloc_enable(struct request_queue *q)
{
struct blk_rq_stat *poll_stat;
poll_stat = kcalloc(BLK_MQ_POLL_STATS_BKTS, sizeof(*poll_stat),
GFP_ATOMIC);
if (!poll_stat)
return false;
if (cmpxchg(&q->poll_stat, NULL, poll_stat) != NULL) {
kfree(poll_stat);
return true;
}
blk_stat_add_callback(q, q->poll_cb);
return false;
}
......@@ -64,11 +64,13 @@ struct blk_stat_callback {
struct blk_queue_stats *blk_alloc_queue_stats(void);
void blk_free_queue_stats(struct blk_queue_stats *);
bool blk_stats_alloc_enable(struct request_queue *q);
void blk_stat_add(struct request *rq, u64 now);
/* record time/size info in request but not add a callback */
void blk_stat_enable_accounting(struct request_queue *q);
void blk_stat_disable_accounting(struct request_queue *q);
/**
* blk_stat_alloc_callback() - Allocate a block statistics callback.
......
......@@ -16,6 +16,7 @@
#include "blk.h"
#include "blk-mq.h"
#include "blk-mq-debugfs.h"
#include "blk-mq-sched.h"
#include "blk-wbt.h"
#include "blk-throttle.h"
......@@ -734,7 +735,8 @@ static void blk_free_queue_rcu(struct rcu_head *rcu_head)
{
struct request_queue *q = container_of(rcu_head, struct request_queue,
rcu_head);
kmem_cache_free(blk_requestq_cachep, q);
kmem_cache_free(blk_get_queue_kmem_cache(blk_queue_has_srcu(q)), q);
}
/* Unconfigure the I/O scheduler and dissociate from the cgroup controller. */
......@@ -747,7 +749,7 @@ static void blk_exit_queue(struct request_queue *q)
*/
if (q->elevator) {
ioc_clear_queue(q);
__elevator_exit(q, q->elevator);
elevator_exit(q);
}
/*
......@@ -785,14 +787,15 @@ static void blk_release_queue(struct kobject *kobj)
might_sleep();
if (test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags))
if (q->poll_stat)
blk_stat_remove_callback(q, q->poll_cb);
blk_stat_free_callback(q->poll_cb);
blk_free_queue_stats(q->stats);
blk_exit_queue(q);
blk_free_queue_stats(q->stats);
kfree(q->poll_stat);
blk_queue_free_zone_bitmaps(q);
if (queue_is_mq(q))
......
......@@ -13,6 +13,7 @@
#include <linux/blk-cgroup.h>
#include "blk.h"
#include "blk-cgroup-rwstat.h"
#include "blk-stat.h"
#include "blk-throttle.h"
/* Max dispatch from a group in 1 round */
......
......@@ -2,15 +2,10 @@
#ifndef BLK_INTERNAL_H
#define BLK_INTERNAL_H
#include <linux/idr.h>
#include <linux/blk-mq.h>
#include <linux/part_stat.h>
#include <linux/blk-crypto.h>
#include <linux/memblock.h> /* for max_pfn/max_low_pfn */
#include <xen/xen.h>
#include "blk-crypto-internal.h"
#include "blk-mq.h"
#include "blk-mq-sched.h"
struct elevator_type;
......@@ -32,15 +27,10 @@ struct blk_flush_queue {
};
extern struct kmem_cache *blk_requestq_cachep;
extern struct kmem_cache *blk_requestq_srcu_cachep;
extern struct kobj_type blk_queue_ktype;
extern struct ida blk_queue_ida;
static inline struct blk_flush_queue *
blk_get_flush_queue(struct request_queue *q, struct blk_mq_ctx *ctx)
{
return blk_mq_map_queue(q, REQ_OP_FLUSH, ctx)->fq;
}
static inline void __blk_get_queue(struct request_queue *q)
{
kobject_get(&q->kobj);
......@@ -250,16 +240,13 @@ static inline void blk_integrity_del(struct gendisk *disk)
unsigned long blk_rq_timeout(unsigned long timeout);
void blk_add_timer(struct request *req);
void blk_print_req_error(struct request *req, blk_status_t status);
const char *blk_status_to_str(blk_status_t status);
bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
unsigned int nr_segs, bool *same_queue_rq);
unsigned int nr_segs);
bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
struct bio *bio, unsigned int nr_segs);
void __blk_account_io_start(struct request *req);
void __blk_account_io_done(struct request *req, u64 now);
/*
* Plug flush limits
*/
......@@ -275,19 +262,10 @@ void blk_insert_flush(struct request *rq);
int elevator_switch_mq(struct request_queue *q,
struct elevator_type *new_e);
void __elevator_exit(struct request_queue *, struct elevator_queue *);
void elevator_exit(struct request_queue *q);
int elv_register_queue(struct request_queue *q, bool uevent);
void elv_unregister_queue(struct request_queue *q);
static inline void elevator_exit(struct request_queue *q,
struct elevator_queue *e)
{
lockdep_assert_held(&q->sysfs_lock);
blk_mq_sched_free_rqs(q);
__elevator_exit(q, e);
}
ssize_t part_size_show(struct device *dev, struct device_attribute *attr,
char *buf);
ssize_t part_stat_show(struct device *dev, struct device_attribute *attr,
......@@ -347,26 +325,10 @@ int blk_dev_init(void);
*/
static inline bool blk_do_io_stat(struct request *rq)
{
return (rq->rq_flags & RQF_IO_STAT) && rq->rq_disk;
}
static inline void blk_account_io_done(struct request *req, u64 now)
{
/*
* Account IO completion. flush_rq isn't accounted as a
* normal IO on queueing nor completion. Accounting the
* containing request is enough.
*/
if (blk_do_io_stat(req) && req->part &&
!(req->rq_flags & RQF_FLUSH_SEQ))
__blk_account_io_done(req, now);
return (rq->rq_flags & RQF_IO_STAT) && rq->q->disk;
}
static inline void blk_account_io_start(struct request *req)
{
if (blk_do_io_stat(req))
__blk_account_io_start(req);
}
void update_io_ticks(struct block_device *part, unsigned long now, bool end);
static inline void req_set_nomerge(struct request_queue *q, struct request *req)
{
......@@ -402,13 +364,15 @@ static inline unsigned int bio_aligned_discard_max_sectors(
/*
* Internal io_context interface
*/
void get_io_context(struct io_context *ioc);
struct io_cq *ioc_lookup_icq(struct io_context *ioc, struct request_queue *q);
struct io_cq *ioc_create_icq(struct io_context *ioc, struct request_queue *q,
gfp_t gfp_mask);
struct io_cq *ioc_find_get_icq(struct request_queue *q);
struct io_cq *ioc_lookup_icq(struct request_queue *q);
#ifdef CONFIG_BLK_ICQ
void ioc_clear_queue(struct request_queue *q);
int create_task_io_context(struct task_struct *task, gfp_t gfp_mask, int node);
#else
static inline void ioc_clear_queue(struct request_queue *q)
{
}
#endif /* CONFIG_BLK_ICQ */
#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
extern ssize_t blk_throtl_sample_time_show(struct request_queue *q, char *page);
......@@ -467,7 +431,15 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
struct page *page, unsigned int len, unsigned int offset,
unsigned int max_sectors, bool *same_page);
struct request_queue *blk_alloc_queue(int node_id);
static inline struct kmem_cache *blk_get_queue_kmem_cache(bool srcu)
{
if (srcu)
return blk_requestq_srcu_cachep;
return blk_requestq_cachep;
}
struct request_queue *blk_alloc_queue(int node_id, bool alloc_srcu);
int disk_scan_partitions(struct gendisk *disk, fmode_t mode);
int disk_alloc_events(struct gendisk *disk);
void disk_add_events(struct gendisk *disk);
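blk_alloc_queue() now takes an alloc_srcu flag so only queues that actually need to sleep in dispatch pay for the embedded SRCU domain; blk_get_queue_kmem_cache() then selects the matching slab both at allocation time and in blk_free_queue_rcu(). A hedged sketch of a caller (the blk-mq setup path passes the BLK_MQ_F_BLOCKING bit, roughly as below):

	/* Hypothetical sketch: allocate a queue, with SRCU only when the
	 * tag set is marked BLK_MQ_F_BLOCKING. */
	static struct request_queue *demo_alloc_queue(struct blk_mq_tag_set *set)
	{
		return blk_alloc_queue(set->numa_node,
				       set->flags & BLK_MQ_F_BLOCKING);
	}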
......@@ -493,4 +465,45 @@ int disk_register_independent_access_ranges(struct gendisk *disk,
struct blk_independent_access_ranges *new_iars);
void disk_unregister_independent_access_ranges(struct gendisk *disk);
#ifdef CONFIG_FAIL_MAKE_REQUEST
bool should_fail_request(struct block_device *part, unsigned int bytes);
#else /* CONFIG_FAIL_MAKE_REQUEST */
static inline bool should_fail_request(struct block_device *part,
unsigned int bytes)
{
return false;
}
#endif /* CONFIG_FAIL_MAKE_REQUEST */
/*
* Optimized request reference counting. Ideally we'd make timeouts be more
* clever, as that's the only reason we need references at all... But until
* this happens, this is faster than using refcount_t. Also see:
*
* abc54d634334 ("io_uring: switch to atomic_t for io_kiocb reference count")
*/
#define req_ref_zero_or_close_to_overflow(req) \
((unsigned int) atomic_read(&(req->ref)) + 127u <= 127u)
static inline bool req_ref_inc_not_zero(struct request *req)
{
return atomic_inc_not_zero(&req->ref);
}
static inline bool req_ref_put_and_test(struct request *req)
{
WARN_ON_ONCE(req_ref_zero_or_close_to_overflow(req));
return atomic_dec_and_test(&req->ref);
}
static inline void req_ref_set(struct request *req, int value)
{
atomic_set(&req->ref, value);
}
static inline int req_ref_read(struct request *req)
{
return atomic_read(&req->ref);
}
#endif /* BLK_INTERNAL_H */
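The req_ref helpers keep the lock-free lookup pattern working: a reader that may race with completion only takes a reference if the count is still non-zero, as blk_mq_find_and_get_req() does in the blk-mq-tag.c hunk above. A simplified, hypothetical variant of that pattern (locking and tag validation omitted):

	/* Hypothetical sketch: conditionally grab a request reference. The
	 * caller must drop it with blk_mq_put_rq_ref() when done. */
	static struct request *demo_get_req(struct blk_mq_tags *tags,
					    unsigned int tag)
	{
		struct request *rq = tags->rqs[tag];

		if (rq && req_ref_inc_not_zero(rq))
			return rq;
		return NULL;
	}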
......@@ -92,7 +92,7 @@ static int bsg_transport_sg_io_fn(struct request_queue *q, struct sg_io_v4 *hdr,
goto out_unmap_bidi_rq;
bio = rq->bio;
blk_execute_rq(NULL, rq, !(hdr->flags & BSG_FLAG_Q_AT_TAIL));
blk_execute_rq(rq, !(hdr->flags & BSG_FLAG_Q_AT_TAIL));
/*
* The assignments below don't make much sense, but are kept for
......
(The diffs for the remaining files in this merge have been collapsed.)