Commit 36869cb9, author: Linus Torvalds

Merge branch 'for-4.10/block' of git://git.kernel.dk/linux-block

Pull block layer updates from Jens Axboe:
 "This is the main block pull request this series. Contrary to previous
  release, I've kept the core and driver changes in the same branch. We
  always ended up having dependencies between the two for obvious
  reasons, so makes more sense to keep them together. That said, I'll
  probably try and keep more topical branches going forward, especially
  for cycles that end up being as busy as this one.

  The major parts of this pull request are:

   - Improved support for O_DIRECT on block devices, with a small
     private implementation instead of using the pig that is
     fs/direct-io.c. From Christoph.

   - Request completion tracking in a scalable fashion. This is utilized
     by two components in this pull, the new hybrid polling and the
     writeback queue throttling code.

   - Improved support for polling with O_DIRECT, adding a hybrid mode
     that combines pure polling with an initial sleep. From me.

   - Support for automatic throttling of writeback queues on the block
     side. This uses feedback from the device completion latencies to
     scale the queue on the block side up or down. From me.

   - Support for SMR drives in the block layer and for SD. From Hannes
     and Shaun.

   - Multi-connection support for nbd. From Josef.

   - Cleanup of request and bio flags, so we have a clear split between
     which are bio (or rq) private, and which ones are shared. From
     Christoph.

   - A set of patches from Bart, that improve how we handle queue
     stopping and starting in blk-mq.

   - Support for WRITE_ZEROES from Chaitanya.

   - Lightnvm updates from Javier/Matias.

   - Support for FC for the nvme-over-fabrics code. From James Smart.

   - A bunch of fixes from a whole slew of people, too many to name
     here"

* 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
  blk-stat: fix a few cases of missing batch flushing
  blk-flush: run the queue when inserting blk-mq flush
  elevator: make the rqhash helpers exported
  blk-mq: abstract out blk_mq_dispatch_rq_list() helper
  blk-mq: add blk_mq_start_stopped_hw_queue()
  block: improve handling of the magic discard payload
  blk-wbt: don't throttle discard or write zeroes
  nbd: use dev_err_ratelimited in io path
  nbd: reset the setup task for NBD_CLEAR_SOCK
  nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
  nvme-fabrics: Add target support for FC transport
  nvme-fabrics: Add host support for FC transport
  nvme-fabrics: Add FC transport LLDD api definitions
  nvme-fabrics: Add FC transport FC-NVME definitions
  nvme-fabrics: Add FC transport error codes to nvme.h
  Add type 0x28 NVME type code to scsi fc headers
  nvme-fabrics: patch target code in prep for FC transport support
  nvme-fabrics: set sqe.command_id in core not transports
  parser: add u64 number parser
  nvme-rdma: align to generic ib_event logging helper
  ...
......@@ -235,3 +235,45 @@ Description:
write_same_max_bytes is 0, write same is not supported
by the device.
What: /sys/block/<disk>/queue/write_zeroes_max_bytes
Date: November 2016
Contact: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Description:
Devices that support write zeroes operation in which a
single request can be issued to zero out the range of
contiguous blocks on storage without having any payload
in the request. This can be used to optimize writing zeroes
to the devices. write_zeroes_max_bytes indicates how many
bytes can be written in a single write zeroes command. If
write_zeroes_max_bytes is 0, write zeroes is not supported
by the device.
What: /sys/block/<disk>/queue/zoned
Date: September 2016
Contact: Damien Le Moal <damien.lemoal@hgst.com>
Description:
zoned indicates if the device is a zoned block device
and the zone model of the device if it is indeed zoned.
The possible values indicated by zoned are "none" for
regular block devices and "host-aware" or "host-managed"
for zoned block devices. The characteristics of
host-aware and host-managed zoned block devices are
described in the ZBC (Zoned Block Commands) and ZAC
(Zoned Device ATA Command Set) standards. These standards
also define the "drive-managed" zone model. However,
since drive-managed zoned block devices do not support
zone commands, they will be treated as regular block
devices and zoned will report "none".
What: /sys/block/<disk>/queue/chunk_sectors
Date: September 2016
Contact: Hannes Reinecke <hare@suse.com>
Description:
chunk_sectors has a different meaning depending on the type
of the disk. For a RAID device (dm-raid), chunk_sectors
indicates the size in 512B sectors of the RAID volume
stripe segment. For a zoned block device, either
host-aware or host-managed, chunk_sectors indicates the
size in 512B sectors of the zones of the device, with
the possible exception of the last zone of the device,
which may be smaller.
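As an illustration of how user space might interpret the two attributes described above, here is a minimal sketch. The enum and function names are hypothetical (they are not kernel identifiers), and it assumes the attribute text may carry a trailing newline.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical enum for illustration; not the kernel's own type. */
enum zone_model {
	ZONE_MODEL_NONE,
	ZONE_MODEL_HOST_AWARE,
	ZONE_MODEL_HOST_MANAGED,
	ZONE_MODEL_UNKNOWN,
};

/* Return nonzero if the first len bytes of text are exactly the word. */
static int word_equals(const char *text, size_t len, const char *word)
{
	return len == strlen(word) && strncmp(text, word, len) == 0;
}

/* Map the text read from queue/zoned to a model, ignoring a trailing newline. */
static enum zone_model parse_zoned(const char *text)
{
	size_t len;

	assert(text != NULL);
	len = strcspn(text, "\n");

	if (word_equals(text, len, "none"))
		return ZONE_MODEL_NONE;
	if (word_equals(text, len, "host-aware"))
		return ZONE_MODEL_HOST_AWARE;
	if (word_equals(text, len, "host-managed"))
		return ZONE_MODEL_HOST_MANAGED;
	return ZONE_MODEL_UNKNOWN;
}

/* chunk_sectors is expressed in 512-byte sectors; convert it to bytes. */
static unsigned long long chunk_bytes(unsigned long long chunk_sectors)
{
	return chunk_sectors * 512ULL;
}
```

Per the description above, a drive-managed device would be reported as "none", so the sketch has no separate value for it.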
......@@ -348,7 +348,7 @@ Drivers can now specify a request prepare function (q->prep_rq_fn) that the
block layer would invoke to pre-build device commands for a given request,
or perform other preparatory processing for the request. This routine is
called by elv_next_request(), i.e. typically just before servicing a request.
(The prepare function would not be called for requests that have REQ_DONTPREP
(The prepare function would not be called for requests that have RQF_DONTPREP
enabled)
Aside:
......@@ -553,8 +553,8 @@ struct request {
struct request_list *rl;
}
See the rq_flag_bits definitions for an explanation of the various flags
available. Some bits are used by the block layer or i/o scheduler.
See the req_ops and req_flag_bits definitions for an explanation of the various
flags available. Some bits are used by the block layer or i/o scheduler.
The behaviour of the various sector counts are almost the same as before,
except that since we have multi-segment bios, current_nr_sectors refers
......
......@@ -240,11 +240,11 @@ All cfq queues doing synchronous sequential IO go on to sync-idle tree.
On this tree we idle on each queue individually.
All synchronous non-sequential queues go on sync-noidle tree. Also any
request which are marked with REQ_NOIDLE go on this service tree. On this
tree we do not idle on individual queues instead idle on the whole group
of queues or the tree. So if there are 4 queues waiting for IO to dispatch
we will idle only once last queue has dispatched the IO and there is
no more IO on this service tree.
synchronous write request which is not marked with REQ_IDLE goes on this
service tree. On this tree we do not idle on individual queues instead idle
on the whole group of queues or the tree. So if there are 4 queues waiting
for IO to dispatch we will idle only once last queue has dispatched the IO
and there is no more IO on this service tree.
All async writes go on async service tree. There is no idling on async
queues.
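The service-tree rules described in this hunk can be summarized in a small decision function. This is only a sketch of the documented behaviour, not CFQ's actual code: the names are hypothetical, and it assumes every non-synchronous request is treated as an async write.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the three service trees described in the text. */
enum service_tree {
	TREE_SYNC_IDLE,   /* idle on each queue individually */
	TREE_SYNC_NOIDLE, /* idle once on the whole tree */
	TREE_ASYNC,       /* no idling at all */
};

/*
 * Pick a tree following the documented rules:
 *  - async IO goes to the async tree;
 *  - synchronous non-sequential IO goes to sync-noidle;
 *  - a synchronous write not marked REQ_IDLE also goes to sync-noidle;
 *  - remaining synchronous sequential IO goes to sync-idle.
 */
static enum service_tree pick_tree(bool sync, bool sequential,
				   bool is_write, bool req_idle)
{
	if (!sync)
		return TREE_ASYNC;
	if (!sequential)
		return TREE_SYNC_NOIDLE;
	if (is_write && !req_idle)
		return TREE_SYNC_NOIDLE;
	return TREE_SYNC_IDLE;
}
```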
......@@ -257,17 +257,17 @@ tree idling provides isolation with buffered write queues on async tree.
FAQ
===
Q1. Why to idle at all on queues marked with REQ_NOIDLE.
Q1. Why to idle at all on queues not marked with REQ_IDLE.
A1. We only do tree idle (all queues on sync-noidle tree) on queues marked
with REQ_NOIDLE. This helps in providing isolation with all the sync-idle
A1. We only do tree idle (all queues on sync-noidle tree) on queues not marked
with REQ_IDLE. This helps in providing isolation with all the sync-idle
queues. Otherwise in presence of many sequential readers, other
synchronous IO might not get fair share of disk.
For example, if there are 10 sequential readers doing IO and they get
100ms each. If a REQ_NOIDLE request comes in, it will be scheduled
roughly after 1 second. If after completion of REQ_NOIDLE request we
do not idle, and after a couple of milliseconds another REQ_NOIDLE
100ms each. If a !REQ_IDLE request comes in, it will be scheduled
roughly after 1 second. If after completion of !REQ_IDLE request we
do not idle, and after a couple of milliseconds another !REQ_IDLE
request comes in, again it will be scheduled after 1 second. Repeat it
and notice how a workload can lose its disk share and suffer due to
multiple sequential readers.
......@@ -276,16 +276,16 @@ A1. We only do tree idle (all queues on sync-noidle tree) on queues marked
context of fsync, and later some journaling data is written. Journaling
data comes in only after fsync has finished its IO (at least for ext4
that seemed to be the case). Now if one decides not to idle on fsync
thread due to REQ_NOIDLE, then next journaling write will not get
thread due to !REQ_IDLE, then next journaling write will not get
scheduled for another second. A process doing small fsync, will suffer
badly in presence of multiple sequential readers.
Hence doing tree idling on threads using REQ_NOIDLE flag on requests
Hence doing tree idling on threads using !REQ_IDLE flag on requests
provides isolation from multiple sequential readers and at the same
time we do not idle on individual threads.
Q2. When to specify REQ_NOIDLE
A2. I would think whenever one is doing synchronous write and not expecting
Q2. When to specify REQ_IDLE
A2. I would think whenever one is doing synchronous write and expecting
more writes to be dispatched from same context soon, should be able
to specify REQ_NOIDLE on writes and that probably should work well for
to specify REQ_IDLE on writes and that probably should work well for
most of the cases.
......@@ -72,4 +72,4 @@ use_per_node_hctx=[0/1]: Default: 0
queue for each CPU node in the system.
use_lightnvm=[0/1]: Default: 0
Register device with LightNVM. Requires blk-mq to be used.
Register device with LightNVM. Requires blk-mq and CONFIG_NVM to be enabled.
......@@ -58,6 +58,20 @@ When read, this file shows the total number of block IO polls and how
many returned success. Writing '0' to this file will disable polling
for this device. Writing any non-zero value will enable this feature.
io_poll_delay (RW)
------------------
If polling is enabled, this controls what kind of polling will be
performed. It defaults to -1, which is classic polling. In this mode,
the CPU will repeatedly ask for completions without giving up any time.
If set to 0, a hybrid polling mode is used, where the kernel will attempt
to make an educated guess at when the IO will complete. Based on this
guess, the kernel will put the process issuing IO to sleep for an amount
of time, before entering a classic poll loop. This mode might be a
little slower than pure classic polling, but it will be more efficient.
If set to a value larger than 0, the kernel will put the process issuing
IO to sleep for this amount of microseconds before entering classic
polling.
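The three io_poll_delay settings described above map onto three behaviours. A minimal sketch of that mapping follows; the enum and function names are hypothetical and only mirror the documented values (-1, 0, and values larger than 0).

```c
#include <assert.h>

/* Hypothetical names for illustration; not kernel identifiers. */
enum poll_mode {
	POLL_CLASSIC,      /* -1: poll for completions without sleeping */
	POLL_HYBRID_AUTO,  /*  0: kernel estimates how long to sleep first */
	POLL_HYBRID_FIXED, /* >0: sleep this many microseconds, then poll */
};

/* Classify an io_poll_delay value as documented in the text above. */
static enum poll_mode poll_mode_from_delay(long delay_usec)
{
	/* The text only documents -1, 0 and positive values. */
	assert(delay_usec >= -1);

	if (delay_usec == -1)
		return POLL_CLASSIC;
	if (delay_usec == 0)
		return POLL_HYBRID_AUTO;
	return POLL_HYBRID_FIXED;
}
```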
iostats (RW)
-------------
This file is used to control (on/off) the iostats accounting of the
......@@ -169,5 +183,14 @@ This is the number of bytes the device can write in a single write-same
command. A value of '0' means write-same is not supported by this
device.
wb_lat_usec (RW)
----------------
If the device is registered for writeback throttling, then this file shows
the target minimum read latency. If this latency is exceeded in a given
window of time (see wb_window_usec), then the writeback throttling will start
scaling back writes. Writing a value of '0' to this file disables the
feature. Writing a value of '-1' to this file resets the value to the
default setting.
Jens Axboe <jens.axboe@oracle.com>, February 2009
......@@ -8766,6 +8766,16 @@ L: linux-nvme@lists.infradead.org
S: Supported
F: drivers/nvme/target/
NVM EXPRESS FC TRANSPORT DRIVERS
M: James Smart <james.smart@broadcom.com>
L: linux-nvme@lists.infradead.org
S: Supported
F: include/linux/nvme-fc.h
F: include/linux/nvme-fc-driver.h
F: drivers/nvme/host/fc.c
F: drivers/nvme/target/fc.c
F: drivers/nvme/target/fcloop.c
NVMEM FRAMEWORK
M: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
M: Maxime Ripard <maxime.ripard@free-electrons.com>
......@@ -9656,8 +9666,8 @@ F: arch/mips/boot/dts/pistachio/
F: arch/mips/configs/pistachio*_defconfig
PKTCDVD DRIVER
M: Jiri Kosina <jikos@kernel.org>
S: Maintained
S: Orphan
M: linux-block@vger.kernel.org
F: drivers/block/pktcdvd.c
F: include/linux/pktcdvd.h
F: include/uapi/linux/pktcdvd.h
......
......@@ -25,7 +25,6 @@
#include <linux/string.h>
#include <linux/types.h>
#include <linux/blk_types.h>
#include <asm/byteorder.h>
#include <asm/memory.h>
#include <asm-generic/pci_iomap.h>
......
......@@ -22,7 +22,6 @@
#ifdef __KERNEL__
#include <linux/types.h>
#include <linux/blk_types.h>
#include <asm/byteorder.h>
#include <asm/barrier.h>
......
......@@ -5,6 +5,7 @@ menuconfig BLOCK
bool "Enable the block layer" if EXPERT
default y
select SBITMAP
select SRCU
help
Provide block layer support for the kernel.
......@@ -89,6 +90,14 @@ config BLK_DEV_INTEGRITY
T10/SCSI Data Integrity Field or the T13/ATA External Path
Protection. If in doubt, say N.
config BLK_DEV_ZONED
bool "Zoned block device support"
---help---
Block layer zoned block device support. This option enables
support for ZAC/ZBC host-managed and host-aware zoned block devices.
Say yes here if you have a ZAC or ZBC storage device.
config BLK_DEV_THROTTLING
bool "Block layer bio throttling support"
depends on BLK_CGROUP=y
......@@ -112,6 +121,32 @@ config BLK_CMDLINE_PARSER
See Documentation/block/cmdline-partition.txt for more information.
config BLK_WBT
bool "Enable support for block device writeback throttling"
default n
---help---
Enabling this option enables the block layer to throttle buffered
background writeback from the VM, making it more smooth and having
less impact on foreground operations. The throttling is done
dynamically on an algorithm loosely based on CoDel, factoring in
the realtime performance of the disk.
config BLK_WBT_SQ
bool "Single queue writeback throttling"
default n
depends on BLK_WBT
---help---
Enable writeback throttling by default on legacy single queue devices
config BLK_WBT_MQ
bool "Multiqueue writeback throttling"
default y
depends on BLK_WBT
---help---
Enable writeback throttling by default on multiqueue devices.
Multiqueue currently doesn't have support for IO scheduling,
enabling this option is recommended.
menu "Partition Types"
source "block/partitions/Kconfig"
......
......@@ -5,7 +5,7 @@
obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
blk-lib.o blk-mq.o blk-mq-tag.o \
blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
blk-mq-sysfs.o blk-mq-cpumap.o ioctl.o \
genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
badblocks.o partitions/
......@@ -23,3 +23,5 @@ obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
obj-$(CONFIG_BLK_CMDLINE_PARSER) += cmdline-parser.o
obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o blk-integrity.o t10-pi.o
obj-$(CONFIG_BLK_MQ_PCI) += blk-mq-pci.o
obj-$(CONFIG_BLK_DEV_ZONED) += blk-zoned.o
obj-$(CONFIG_BLK_WBT) += blk-wbt.o
......@@ -172,7 +172,7 @@ bool bio_integrity_enabled(struct bio *bio)
{
struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
if (!bio_is_rw(bio))
if (bio_op(bio) != REQ_OP_READ && bio_op(bio) != REQ_OP_WRITE)
return false;
/* Already protected? */
......
......@@ -270,11 +270,15 @@ static void bio_free(struct bio *bio)
}
}
void bio_init(struct bio *bio)
void bio_init(struct bio *bio, struct bio_vec *table,
unsigned short max_vecs)
{
memset(bio, 0, sizeof(*bio));
atomic_set(&bio->__bi_remaining, 1);
atomic_set(&bio->__bi_cnt, 1);
bio->bi_io_vec = table;
bio->bi_max_vecs = max_vecs;
}
EXPORT_SYMBOL(bio_init);
......@@ -480,7 +484,7 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
return NULL;
bio = p + front_pad;
bio_init(bio);
bio_init(bio, NULL, 0);
if (nr_iovecs > inline_vecs) {
unsigned long idx = 0;
......@@ -670,6 +674,7 @@ struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
switch (bio_op(bio)) {
case REQ_OP_DISCARD:
case REQ_OP_SECURE_ERASE:
case REQ_OP_WRITE_ZEROES:
break;
case REQ_OP_WRITE_SAME:
bio->bi_io_vec[bio->bi_vcnt++] = bio_src->bi_io_vec[0];
......@@ -847,6 +852,55 @@ int bio_add_page(struct bio *bio, struct page *page,
}
EXPORT_SYMBOL(bio_add_page);
/**
* bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
* @bio: bio to add pages to
* @iter: iov iterator describing the region to be mapped
*
* Pins as many pages from *iter as possible and appends them to @bio's bvec
* array. The pages will have to be released using put_page() when done.
*/
int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
{
unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
struct page **pages = (struct page **)bv;
size_t offset, diff;
ssize_t size;
size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
if (unlikely(size <= 0))
return size ? size : -EFAULT;
nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE;
/*
* Deep magic below: We need to walk the pinned pages backwards
* because we are abusing the space allocated for the bio_vecs
* for the page array. Because the bio_vecs are larger than the
* page pointers by definition this will always work. But it also
* means we can't use bio_add_page, so any changes to its semantics
* need to be reflected here as well.
*/
bio->bi_iter.bi_size += size;
bio->bi_vcnt += nr_pages;
diff = (nr_pages * PAGE_SIZE - offset) - size;
while (nr_pages--) {
bv[nr_pages].bv_page = pages[nr_pages];
bv[nr_pages].bv_len = PAGE_SIZE;
bv[nr_pages].bv_offset = 0;
}
bv[0].bv_offset += offset;
bv[0].bv_len -= offset;
if (diff)
bv[bio->bi_vcnt - 1].bv_len -= diff;
iov_iter_advance(iter, size);
return 0;
}
EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);
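The "deep magic" comment in bio_iov_iter_get_pages() describes reusing the bio_vec storage as a temporary page-pointer array and then expanding it in place from the last entry to the first. The following user-space sketch illustrates that idea under stated assumptions: `struct vec`, `SKETCH_PAGE_SIZE`, and both helpers are hypothetical stand-ins, and the pointers are read with memcpy() so the example stays well-defined outside the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size for the sketch */

/* Stand-in for struct bio_vec; it is larger than a single pointer. */
struct vec {
	void *page;
	unsigned int len;
	unsigned int offset;
};

/* Store a page pointer into slot i of the pointer array overlaid on bv. */
static void put_page_ptr(struct vec *bv, size_t i, void *page)
{
	memcpy((unsigned char *)bv + i * sizeof(void *), &page, sizeof(page));
}

/*
 * Expand the overlaid pointer array into full vec entries, in place.
 * Walking backwards guarantees that slot i is read before any write
 * can overwrite it, because entry j >= i starts at or after slot i.
 * Returns the number of entries produced.
 */
static size_t expand_in_place(struct vec *bv, size_t offset, size_t size)
{
	size_t nr = (size + offset + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
	size_t diff = nr * SKETCH_PAGE_SIZE - offset - size;
	size_t i = nr;

	assert(sizeof(struct vec) >= sizeof(void *));
	assert(size > 0 && offset < SKETCH_PAGE_SIZE);

	while (i--) {
		void *page;

		memcpy(&page, (unsigned char *)bv + i * sizeof(void *),
		       sizeof(page));
		bv[i].page = page;
		bv[i].len = SKETCH_PAGE_SIZE;
		bv[i].offset = 0;
	}

	/* Trim the first entry by the start offset and the last by the tail. */
	bv[0].offset += (unsigned int)offset;
	bv[0].len -= (unsigned int)offset;
	bv[nr - 1].len -= (unsigned int)diff;
	return nr;
}
```

A forward walk would not work here: writing entry 0 would overwrite pointer slots that have not been read yet.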
struct submit_bio_ret {
struct completion event;
int error;
......@@ -1786,15 +1840,7 @@ struct bio *bio_split(struct bio *bio, int sectors,
BUG_ON(sectors <= 0);
BUG_ON(sectors >= bio_sectors(bio));
/*
* Discards need a mutable bio_vec to accommodate the payload
* required by the DSM TRIM and UNMAP commands.
*/
if (bio_op(bio) == REQ_OP_DISCARD || bio_op(bio) == REQ_OP_SECURE_ERASE)
split = bio_clone_bioset(bio, gfp, bs);
else
split = bio_clone_fast(bio, gfp, bs);
split = bio_clone_fast(bio, gfp, bs);
if (!split)
return NULL;
......
......@@ -185,7 +185,8 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
}
wb_congested = wb_congested_get_create(&q->backing_dev_info,
blkcg->css.id, GFP_NOWAIT);
blkcg->css.id,
GFP_NOWAIT | __GFP_NOWARN);
if (!wb_congested) {
ret = -ENOMEM;
goto err_put_css;
......@@ -193,7 +194,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
/* allocate */
if (!new_blkg) {
new_blkg = blkg_alloc(blkcg, q, GFP_NOWAIT);
new_blkg = blkg_alloc(blkcg, q, GFP_NOWAIT | __GFP_NOWARN);
if (unlikely(!new_blkg)) {
ret = -ENOMEM;
goto err_put_congested;
......@@ -1022,7 +1023,7 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
}
spin_lock_init(&blkcg->lock);
INIT_RADIX_TREE(&blkcg->blkg_tree, GFP_NOWAIT);
INIT_RADIX_TREE(&blkcg->blkg_tree, GFP_NOWAIT | __GFP_NOWARN);
INIT_HLIST_HEAD(&blkcg->blkg_list);
#ifdef CONFIG_CGROUP_WRITEBACK
INIT_LIST_HEAD(&blkcg->cgwb_list);
......@@ -1240,7 +1241,7 @@ int blkcg_activate_policy(struct request_queue *q,
if (blkg->pd[pol->plid])
continue;
pd = pol->pd_alloc_fn(GFP_NOWAIT, q->node);
pd = pol->pd_alloc_fn(GFP_NOWAIT | __GFP_NOWARN, q->node);
if (!pd)
swap(pd, pd_prealloc);
if (!pd) {
......
This diff is collapsed.
......@@ -72,7 +72,7 @@ void blk_execute_rq_nowait(struct request_queue *q, struct gendisk *bd_disk,
spin_lock_irq(q->queue_lock);
if (unlikely(blk_queue_dying(q))) {
rq->cmd_flags |= REQ_QUIET;
rq->rq_flags |= RQF_QUIET;
rq->errors = -ENXIO;
__blk_end_request_all(rq, rq->errors);
spin_unlock_irq(q->queue_lock);
......
......@@ -56,7 +56,7 @@
* Once while executing DATA and again after the whole sequence is
* complete. The first completion updates the contained bio but doesn't
* finish it so that the bio submitter is notified only after the whole
* sequence is complete. This is implemented by testing REQ_FLUSH_SEQ in
* sequence is complete. This is implemented by testing RQF_FLUSH_SEQ in
* req_bio_endio().
*
* The above peculiarity requires that each FLUSH/FUA request has only one
......@@ -127,17 +127,14 @@ static void blk_flush_restore_request(struct request *rq)
rq->bio = rq->biotail;
/* make @rq a normal request */
rq->cmd_flags &= ~REQ_FLUSH_SEQ;
rq->rq_flags &= ~RQF_FLUSH_SEQ;
rq->end_io = rq->flush.saved_end_io;
}
static bool blk_flush_queue_rq(struct request *rq, bool add_front)
{
if (rq->q->mq_ops) {
struct request_queue *q = rq->q;
blk_mq_add_to_requeue_list(rq, add_front);
blk_mq_kick_requeue_list(q);
blk_mq_add_to_requeue_list(rq, add_front, true);
return false;
} else {
if (add_front)
......@@ -330,7 +327,8 @@ static bool blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq)
}
flush_rq->cmd_type = REQ_TYPE_FS;
req_set_op_attrs(flush_rq, REQ_OP_FLUSH, WRITE_FLUSH | REQ_FLUSH_SEQ);
flush_rq->cmd_flags = REQ_OP_FLUSH | REQ_PREFLUSH;
flush_rq->rq_flags |= RQF_FLUSH_SEQ;
flush_rq->rq_disk = first_rq->rq_disk;
flush_rq->end_io = flush_end_io;
......@@ -368,7 +366,7 @@ static void flush_data_end_io(struct request *rq, int error)
elv_completed_request(q, rq);
/* for avoiding double accounting */
rq->cmd_flags &= ~REQ_STARTED;
rq->rq_flags &= ~RQF_STARTED;
/*
* After populating an empty queue, kick it to avoid stall. Read
......@@ -425,6 +423,13 @@ void blk_insert_flush(struct request *rq)
if (!(fflags & (1UL << QUEUE_FLAG_FUA)))
rq->cmd_flags &= ~REQ_FUA;
/*
* REQ_PREFLUSH|REQ_FUA implies REQ_SYNC, so if we clear any
* of those flags, we have to set REQ_SYNC to avoid skewing
* the request accounting.
*/
rq->cmd_flags |= REQ_SYNC;
/*
* An empty flush handed down from a stacking driver may
* translate into nothing if the underlying device does not
......@@ -449,7 +454,7 @@ void blk_insert_flush(struct request *rq)
if ((policy & REQ_FSEQ_DATA) &&
!(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
if (q->mq_ops) {
blk_mq_insert_request(rq, false, false, true);
blk_mq_insert_request(rq, false, true, false);
} else
list_add_tail(&rq->queuelist, &q->queue_head);
return;
......@@ -461,7 +466,7 @@ void blk_insert_flush(struct request *rq)
*/
memset(&rq->flush, 0, sizeof(rq->flush));
INIT_LIST_HEAD(&rq->flush.list);
rq->cmd_flags |= REQ_FLUSH_SEQ;
rq->rq_flags |= RQF_FLUSH_SEQ;
rq->flush.saved_end_io = rq->end_io; /* Usually NULL */
if (q->mq_ops) {
rq->end_io = mq_flush_data_end_io;
......@@ -513,7 +518,7 @@ int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
bio = bio_alloc(gfp_mask, 0);
bio->bi_bdev = bdev;
bio_set_op_attrs(bio, REQ_OP_WRITE, WRITE_FLUSH);
bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
ret = submit_bio_wait(bio);
......
......@@ -29,7 +29,7 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
struct request_queue *q = bdev_get_queue(bdev);
struct bio *bio = *biop;
unsigned int granularity;
enum req_op op;
unsigned int op;
int alignment;
sector_t bs_mask;
......@@ -80,7 +80,7 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
req_sects = end_sect - sector;
}
bio = next_bio(bio, 1, gfp_mask);
bio = next_bio(bio, 0, gfp_mask);
bio->bi_iter.bi_sector = sector;
bio->bi_bdev = bdev;
bio_set_op_attrs(bio, op, 0);
......@@ -137,24 +137,24 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
EXPORT_SYMBOL(blkdev_issue_discard);
/**
* blkdev_issue_write_same - queue a write same operation
* __blkdev_issue_write_same - generate number of bios with same page
* @bdev: target blockdev
* @sector: start sector
* @nr_sects: number of sectors to write
* @gfp_mask: memory allocation flags (for bio_alloc)
* @page: page containing data to write
* @biop: pointer to anchor bio
*
* Description:
* Issue a write same request for the sectors in question.
* Generate and issue number of bios(REQ_OP_WRITE_SAME) with same page.
*/
int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask,
struct page *page)
static int __blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, struct page *page,
struct bio **biop)
{
struct request_queue *q = bdev_get_queue(bdev);
unsigned int max_write_same_sectors;
struct bio *bio = NULL;
int ret = 0;
struct bio *bio = *biop;
sector_t bs_mask;
if (!q)
......@@ -164,6 +164,9 @@ int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
if ((sector | nr_sects) & bs_mask)
return -EINVAL;
if (!bdev_write_same(bdev))
return -EOPNOTSUPP;
/* Ensure that max_write_same_sectors doesn't overflow bi_size */
max_write_same_sectors = UINT_MAX >> 9;
......@@ -185,32 +188,112 @@ int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
bio->bi_iter.bi_size = nr_sects << 9;
nr_sects = 0;
}
cond_resched();
}
if (bio) {
*biop = bio;
return 0;
}
/**
* blkdev_issue_write_same - queue a write same operation
* @bdev: target blockdev
* @sector: start sector
* @nr_sects: number of sectors to write
* @gfp_mask: memory allocation flags (for bio_alloc)
* @page: page containing data
*
* Description:
* Issue a write same request for the sectors in question.
*/
int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask,
struct page *page)
{
struct bio *bio = NULL;
struct blk_plug plug;
int ret;
blk_start_plug(&plug);
ret = __blkdev_issue_write_same(bdev, sector, nr_sects, gfp_mask, page,
&bio);
if (ret == 0 && bio) {
ret = submit_bio_wait(bio);
bio_put(bio);
}
blk_finish_plug(&plug);
return ret;
}
EXPORT_SYMBOL(blkdev_issue_write_same);
/**
* blkdev_issue_zeroout - generate number of zero-filled write bios
* __blkdev_issue_write_zeroes - generate number of bios with WRITE ZEROES
* @bdev: blockdev to issue
* @sector: start sector
* @nr_sects: number of sectors to write
* @gfp_mask: memory allocation flags (for bio_alloc)
* @biop: pointer to anchor bio
*
* Description:
* Generate and issue number of bios with zero-filled pages.
* Generate and issue number of bios(REQ_OP_WRITE_ZEROES) with zero-filled pages.
*/
static int __blkdev_issue_write_zeroes(struct block_device *bdev,
sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
struct bio **biop)
{
struct bio *bio = *biop;
unsigned int max_write_zeroes_sectors;
struct request_queue *q = bdev_get_queue(bdev);
if (!q)
return -ENXIO;
/* Ensure that max_write_zeroes_sectors doesn't overflow bi_size */
max_write_zeroes_sectors = bdev_write_zeroes_sectors(bdev);
if (max_write_zeroes_sectors == 0)
return -EOPNOTSUPP;
while (nr_sects) {
bio = next_bio(bio, 0, gfp_mask);
bio->bi_iter.bi_sector = sector;
bio->bi_bdev = bdev;
bio_set_op_attrs(bio, REQ_OP_WRITE_ZEROES, 0);
static int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask)
if (nr_sects > max_write_zeroes_sectors) {
bio->bi_iter.bi_size = max_write_zeroes_sectors << 9;
nr_sects -= max_write_zeroes_sectors;
sector += max_write_zeroes_sectors;
} else {
bio->bi_iter.bi_size = nr_sects << 9;
nr_sects = 0;
}
cond_resched();
}
*biop = bio;
return 0;
}
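The loop in __blkdev_issue_write_zeroes() splits the requested range into bios of at most max_write_zeroes_sectors each. A minimal sketch of the same chunking arithmetic, using a hypothetical `chunk_plan` structure and function name, could look like this:

```c
#include <assert.h>

/* Hypothetical summary of how a range is split; not a kernel type. */
struct chunk_plan {
	unsigned long long nr_bios;      /* number of chunks generated */
	unsigned long long last_sectors; /* size of the final chunk */
};

/*
 * Split nr_sects sectors into chunks of at most max_sectors each,
 * mirroring the shape of the while (nr_sects) loop above.
 */
static struct chunk_plan plan_chunks(unsigned long long nr_sects,
				     unsigned int max_sectors)
{
	struct chunk_plan plan = { 0, 0 };

	/* The kernel code returns -EOPNOTSUPP instead when the limit is 0. */
	assert(max_sectors > 0);

	while (nr_sects) {
		unsigned long long this_sects =
			nr_sects > max_sectors ? max_sectors : nr_sects;

		plan.nr_bios++;
		plan.last_sectors = this_sects;
		nr_sects -= this_sects;
	}
	return plan;
}
```

Each chunk's byte size is its sector count shifted left by 9, as in the `bi_size` assignments above.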
/**
* __blkdev_issue_zeroout - generate number of zero-filled write bios
* @bdev: blockdev to issue
* @sector: start sector
* @nr_sects: number of sectors to write
* @gfp_mask: memory allocation flags (for bio_alloc)
* @biop: pointer to anchor bio
* @discard: discard flag
*
* Description:
* Generate and issue number of bios with zero-filled pages.
*/
int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
bool discard)
{
int ret;
struct bio *bio = NULL;
int bi_size = 0;
struct bio *bio = *biop;
unsigned int sz;
sector_t bs_mask;
......@@ -218,6 +301,24 @@ static int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
if ((sector | nr_sects) & bs_mask)
return -EINVAL;
if (discard) {
ret = __blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask,
BLKDEV_DISCARD_ZERO, biop);
if (ret == 0 || (ret && ret != -EOPNOTSUPP))
goto out;
}
ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, gfp_mask,
biop);
if (ret == 0 || (ret && ret != -EOPNOTSUPP))
goto out;
ret = __blkdev_issue_write_same(bdev, sector, nr_sects, gfp_mask,
ZERO_PAGE(0), biop);
if (ret == 0 || (ret && ret != -EOPNOTSUPP))
goto out;
ret = 0;
while (nr_sects != 0) {
bio = next_bio(bio, min(nr_sects, (sector_t)BIO_MAX_PAGES),
gfp_mask);
......@@ -227,21 +328,20 @@ static int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
while (nr_sects != 0) {
sz = min((sector_t) PAGE_SIZE >> 9 , nr_sects);
ret = bio_add_page(bio, ZERO_PAGE(0), sz << 9, 0);
nr_sects -= ret >> 9;
sector += ret >> 9;
if (ret < (sz << 9))
bi_size = bio_add_page(bio, ZERO_PAGE(0), sz << 9, 0);
nr_sects -= bi_size >> 9;
sector += bi_size >> 9;
if (bi_size < (sz << 9))
break;
}
cond_resched();
}
if (bio) {
ret = submit_bio_wait(bio);
bio_put(bio);
return ret;
}
return 0;
*biop = bio;
out:
return ret;
}
EXPORT_SYMBOL(__blkdev_issue_zeroout);
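The new __blkdev_issue_zeroout() tries discard, WRITE ZEROES, and WRITE SAME in turn, moving on only when a method returns -EOPNOTSUPP, and finally falls back to plain zero-page writes. Note that the condition `ret == 0 || (ret && ret != -EOPNOTSUPP)` is logically the same as `ret != -EOPNOTSUPP`. The sketch below shows that fallback pattern in user space; the strategy functions and the `calls` counter are hypothetical stand-ins for the real helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* A strategy returns 0, -EOPNOTSUPP, or another negative errno. */
typedef int (*zero_strategy_fn)(int *calls);

/* Hypothetical strategies used only to exercise the fallback logic. */
static int strategy_unsupported(int *calls)
{
	(*calls)++;
	return -EOPNOTSUPP;
}

static int strategy_ok(int *calls)
{
	(*calls)++;
	return 0;
}

static int strategy_io_error(int *calls)
{
	(*calls)++;
	return -EIO;
}

/*
 * Try each strategy in order. Stop on success or on a hard error;
 * continue to the next strategy only on -EOPNOTSUPP.
 */
static int run_zero_strategies(const zero_strategy_fn *strategies,
			       size_t count, int *calls)
{
	int ret = -EOPNOTSUPP;
	size_t i;

	assert(strategies != NULL && calls != NULL);

	for (i = 0; i < count; i++) {
		ret = strategies[i](calls);
		if (ret != -EOPNOTSUPP)
			break;
	}
	return ret;
}
```

In the kernel function the last step (regular writes of ZERO_PAGE) is always available, so the chain ends with a method that does not return -EOPNOTSUPP.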
/**
* blkdev_issue_zeroout - zero-fill a block range
......@@ -258,26 +358,27 @@ static int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
* the discard request fail, if the discard flag is not set, or if
* discard_zeroes_data is not supported, this function will resort to
* zeroing the blocks manually, thus provisioning (allocating,
* anchoring) them. If the block device supports the WRITE SAME command
* blkdev_issue_zeroout() will use it to optimize the process of
* anchoring) them. If the block device supports WRITE ZEROES or WRITE SAME
* command(s), blkdev_issue_zeroout() will use it to optimize the process of
* clearing the block range. Otherwise the zeroing will be performed
* using regular WRITE calls.
*/
int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, bool discard)
{
if (discard) {
if (!blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask,
BLKDEV_DISCARD_ZERO))
return 0;
}
int ret;
struct bio *bio = NULL;
struct blk_plug plug;
if (bdev_write_same(bdev) &&
blkdev_issue_write_same(bdev, sector, nr_sects, gfp_mask,
ZERO_PAGE(0)) == 0)
return 0;
blk_start_plug(&plug);
ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask,
&bio, discard);
if (ret == 0 && bio) {
ret = submit_bio_wait(bio);
bio_put(bio);
}
blk_finish_plug(&plug);
return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
return ret;
}
EXPORT_SYMBOL(blkdev_issue_zeroout);
......@@ -16,6 +16,8 @@
int blk_rq_append_bio(struct request *rq, struct bio *bio)
{
if (!rq->bio) {
rq->cmd_flags &= REQ_OP_MASK;
rq->cmd_flags |= (bio->bi_opf & REQ_OP_MASK);
blk_rq_bio_prep(rq->q, rq, bio);
} else {
if (!ll_back_merge_fn(rq->q, rq, bio))
......@@ -138,7 +140,7 @@ int blk_rq_map_user_iov(struct request_queue *q, struct request *rq,
} while (iov_iter_count(&i));
if (!bio_flagged(bio, BIO_USER_MAPPED))
rq->cmd_flags |= REQ_COPY_USER;
rq->rq_flags |= RQF_COPY_USER;
return 0;
unmap_rq:
......@@ -236,7 +238,7 @@ int blk_rq_map_kern(struct request_queue *q, struct request *rq, void *kbuf,
bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
if (do_copy)
rq->cmd_flags |= REQ_COPY_USER;
rq->rq_flags |= RQF_COPY_USER;
ret = blk_rq_append_bio(rq, bio);
if (unlikely(ret)) {
......
......@@ -199,6 +199,10 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
case REQ_OP_SECURE_ERASE:
split = blk_bio_discard_split(q, *bio, bs, &nsegs);
break;
case REQ_OP_WRITE_ZEROES:
split = NULL;
nsegs = (*bio)->bi_phys_segments;
break;
case REQ_OP_WRITE_SAME:
split = blk_bio_write_same_split(q, *bio, bs, &nsegs);
break;
......@@ -237,15 +241,14 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
if (!bio)
return 0;
/*
* This should probably be returning 0, but blk_add_request_payload()
* (Christoph!!!!)
*/
if (bio_op(bio) == REQ_OP_DISCARD || bio_op(bio) == REQ_OP_SECURE_ERASE)
return 1;
if (bio_op(bio) == REQ_OP_WRITE_SAME)
switch (bio_op(bio)) {
case REQ_OP_DISCARD:
case REQ_OP_SECURE_ERASE:
case REQ_OP_WRITE_ZEROES:
return 0;
case REQ_OP_WRITE_SAME:
return 1;
}
fbio = bio;
cluster = blk_queue_cluster(q);
......@@ -402,38 +405,21 @@ __blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
*bvprv = *bvec;
}
static inline int __blk_bvec_map_sg(struct request_queue *q, struct bio_vec bv,
struct scatterlist *sglist, struct scatterlist **sg)
{
*sg = sglist;
sg_set_page(*sg, bv.bv_page, bv.bv_len, bv.bv_offset);
return 1;
}
static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
struct scatterlist *sglist,
struct scatterlist **sg)
{
struct bio_vec bvec, bvprv = { NULL };
struct bvec_iter iter;
int nsegs, cluster;
nsegs = 0;
cluster = blk_queue_cluster(q);
switch (bio_op(bio)) {
case REQ_OP_DISCARD:
case REQ_OP_SECURE_ERASE:
/*
* This is a hack - drivers should be neither modifying the
* biovec, nor relying on bi_vcnt - but because of
* blk_add_request_payload(), a discard bio may or may not have
* a payload we need to set up here (thank you Christoph) and
* bi_vcnt is really the only way of telling if we need to.
*/
if (!bio->bi_vcnt)
return 0;
/* Fall through */
case REQ_OP_WRITE_SAME:
*sg = sglist;
bvec = bio_iovec(bio);
sg_set_page(*sg, bvec.bv_page, bvec.bv_len, bvec.bv_offset);
return 1;
default:
break;
}
int cluster = blk_queue_cluster(q), nsegs = 0;
for_each_bio(bio)
bio_for_each_segment(bvec, bio, iter)
......@@ -453,10 +439,14 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
struct scatterlist *sg = NULL;
int nsegs = 0;
if (rq->bio)
if (rq->rq_flags & RQF_SPECIAL_PAYLOAD)
nsegs = __blk_bvec_map_sg(q, rq->special_vec, sglist, &sg);
else if (rq->bio && bio_op(rq->bio) == REQ_OP_WRITE_SAME)
nsegs = __blk_bvec_map_sg(q, bio_iovec(rq->bio), sglist, &sg);
else if (rq->bio)
nsegs = __blk_bios_map_sg(q, rq->bio, sglist, &sg);
if (unlikely(rq->cmd_flags & REQ_COPY_USER) &&
if (unlikely(rq->rq_flags & RQF_COPY_USER) &&
(blk_rq_bytes(rq) & q->dma_pad_mask)) {
unsigned int pad_len =
(q->dma_pad_mask & ~blk_rq_bytes(rq)) + 1;
......@@ -486,12 +476,19 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
* Something must have been wrong if the figured number of
* segment is bigger than number of req's physical segments
*/
WARN_ON(nsegs > rq->nr_phys_segments);
WARN_ON(nsegs > blk_rq_nr_phys_segments(rq));
return nsegs;
}
EXPORT_SYMBOL(blk_rq_map_sg);
static void req_set_nomerge(struct request_queue *q, struct request *req)
{
req->cmd_flags |= REQ_NOMERGE;
if (req == q->last_merge)
q->last_merge = NULL;
}
static inline int ll_new_hw_segment(struct request_queue *q,
struct request *req,
struct bio *bio)
......@@ -512,9 +509,7 @@ static inline int ll_new_hw_segment(struct request_queue *q,
return 1;
no_merge:
req->cmd_flags |= REQ_NOMERGE;
if (req == q->last_merge)
q->last_merge = NULL;
req_set_nomerge(q, req);
return 0;
}
......@@ -528,9 +523,7 @@ int ll_back_merge_fn(struct request_queue *q, struct request *req,
return 0;
if (blk_rq_sectors(req) + bio_sectors(bio) >
blk_rq_get_max_sectors(req, blk_rq_pos(req))) {
req->cmd_flags |= REQ_NOMERGE;
if (req == q->last_merge)
q->last_merge = NULL;
req_set_nomerge(q, req);
return 0;
}
if (!bio_flagged(req->biotail, BIO_SEG_VALID))
......@@ -552,9 +545,7 @@ int ll_front_merge_fn(struct request_queue *q, struct request *req,
return 0;
if (blk_rq_sectors(req) + bio_sectors(bio) >
blk_rq_get_max_sectors(req, bio->bi_iter.bi_sector)) {
req->cmd_flags |= REQ_NOMERGE;
if (req == q->last_merge)
q->last_merge = NULL;
req_set_nomerge(q, req);
return 0;
}
if (!bio_flagged(bio, BIO_SEG_VALID))
......@@ -634,7 +625,7 @@ void blk_rq_set_mixed_merge(struct request *rq)
unsigned int ff = rq->cmd_flags & REQ_FAILFAST_MASK;
struct bio *bio;
if (rq->cmd_flags & REQ_MIXED_MERGE)
if (rq->rq_flags & RQF_MIXED_MERGE)
return;
/*
......@@ -647,7 +638,7 @@ void blk_rq_set_mixed_merge(struct request *rq)
(bio->bi_opf & REQ_FAILFAST_MASK) != ff);
bio->bi_opf |= ff;
}
rq->cmd_flags |= REQ_MIXED_MERGE;
rq->rq_flags |= RQF_MIXED_MERGE;
}
static void blk_account_io_merge(struct request *req)
......@@ -709,7 +700,7 @@ static int attempt_merge(struct request_queue *q, struct request *req,
* makes sure that all involved bios have mixable attributes
* set properly.
*/
if ((req->cmd_flags | next->cmd_flags) & REQ_MIXED_MERGE ||
if (((req->rq_flags | next->rq_flags) & RQF_MIXED_MERGE) ||
(req->cmd_flags & REQ_FAILFAST_MASK) !=
(next->cmd_flags & REQ_FAILFAST_MASK)) {
blk_rq_set_mixed_merge(req);
......
......@@ -259,6 +259,47 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
return ret;
}
static void blk_mq_stat_clear(struct blk_mq_hw_ctx *hctx)
{
struct blk_mq_ctx *ctx;
unsigned int i;
hctx_for_each_ctx(hctx, ctx, i) {
blk_stat_init(&ctx->stat[BLK_STAT_READ]);
blk_stat_init(&ctx->stat[BLK_STAT_WRITE]);
}
}
static ssize_t blk_mq_hw_sysfs_stat_store(struct blk_mq_hw_ctx *hctx,
const char *page, size_t count)
{
blk_mq_stat_clear(hctx);
return count;
}
static ssize_t print_stat(char *page, struct blk_rq_stat *stat, const char *pre)
{
return sprintf(page, "%s samples=%llu, mean=%lld, min=%lld, max=%lld\n",
pre, (long long) stat->nr_samples,
(long long) stat->mean, (long long) stat->min,
(long long) stat->max);
}
static ssize_t blk_mq_hw_sysfs_stat_show(struct blk_mq_hw_ctx *hctx, char *page)
{
struct blk_rq_stat stat[2];
ssize_t ret;
blk_stat_init(&stat[BLK_STAT_READ]);
blk_stat_init(&stat[BLK_STAT_WRITE]);
blk_hctx_stat_get(hctx, stat);
ret = print_stat(page, &stat[BLK_STAT_READ], "read :");
ret += print_stat(page + ret, &stat[BLK_STAT_WRITE], "write:");
return ret;
}
static struct blk_mq_ctx_sysfs_entry blk_mq_sysfs_dispatched = {
.attr = {.name = "dispatched", .mode = S_IRUGO },
.show = blk_mq_sysfs_dispatched_show,
......@@ -317,6 +358,11 @@ static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_poll = {
.show = blk_mq_hw_sysfs_poll_show,
.store = blk_mq_hw_sysfs_poll_store,
};
static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_stat = {
.attr = {.name = "stats", .mode = S_IRUGO | S_IWUSR },
.show = blk_mq_hw_sysfs_stat_show,
.store = blk_mq_hw_sysfs_stat_store,
};
static struct attribute *default_hw_ctx_attrs[] = {
&blk_mq_hw_sysfs_queued.attr,
......@@ -327,6 +373,7 @@ static struct attribute *default_hw_ctx_attrs[] = {
&blk_mq_hw_sysfs_cpus.attr,
&blk_mq_hw_sysfs_active.attr,
&blk_mq_hw_sysfs_poll.attr,
&blk_mq_hw_sysfs_stat.attr,
NULL,
};
......
This diff is collapsed.
#ifndef INT_BLK_MQ_H
#define INT_BLK_MQ_H
#include "blk-stat.h"
struct blk_mq_tag_set;
struct blk_mq_ctx {
......@@ -18,6 +20,7 @@ struct blk_mq_ctx {
/* incremented at completion time */
unsigned long ____cacheline_aligned_in_smp rq_completed[2];
struct blk_rq_stat stat[2];
struct request_queue *queue;
struct kobject kobj;
......@@ -28,6 +31,7 @@ void blk_mq_freeze_queue(struct request_queue *q);
void blk_mq_free_queue(struct request_queue *q);
int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
void blk_mq_wake_waiters(struct request_queue *q);
bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *, struct list_head *);
/*
* CPU hotplug helpers
......@@ -100,6 +104,11 @@ static inline void blk_mq_set_alloc_data(struct blk_mq_alloc_data *data,
data->hctx = hctx;
}
static inline bool blk_mq_hctx_stopped(struct blk_mq_hw_ctx *hctx)
{
return test_bit(BLK_MQ_S_STOPPED, &hctx->state);
}
static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
{
return hctx->nr_ctx && hctx->tags;
......
......@@ -13,6 +13,7 @@
#include <linux/gfp.h>
#include "blk.h"
#include "blk-wbt.h"
unsigned long blk_max_low_pfn;
EXPORT_SYMBOL(blk_max_low_pfn);
......@@ -95,6 +96,7 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->max_dev_sectors = 0;
lim->chunk_sectors = 0;
lim->max_write_same_sectors = 0;
lim->max_write_zeroes_sectors = 0;
lim->max_discard_sectors = 0;
lim->max_hw_discard_sectors = 0;
lim->discard_granularity = 0;
......@@ -107,6 +109,7 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->io_opt = 0;
lim->misaligned = 0;
lim->cluster = 1;
lim->zoned = BLK_ZONED_NONE;
}
EXPORT_SYMBOL(blk_set_default_limits);
......@@ -130,6 +133,7 @@ void blk_set_stacking_limits(struct queue_limits *lim)
lim->max_sectors = UINT_MAX;
lim->max_dev_sectors = UINT_MAX;
lim->max_write_same_sectors = UINT_MAX;
lim->max_write_zeroes_sectors = UINT_MAX;
}
EXPORT_SYMBOL(blk_set_stacking_limits);
......@@ -298,6 +302,19 @@ void blk_queue_max_write_same_sectors(struct request_queue *q,
}
EXPORT_SYMBOL(blk_queue_max_write_same_sectors);
/**
* blk_queue_max_write_zeroes_sectors - set max sectors for a single
* write zeroes
* @q: the request queue for the device
* @max_write_zeroes_sectors: maximum number of sectors to write per command
**/
void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
unsigned int max_write_zeroes_sectors)
{
q->limits.max_write_zeroes_sectors = max_write_zeroes_sectors;
}
EXPORT_SYMBOL(blk_queue_max_write_zeroes_sectors);
/**
* blk_queue_max_segments - set max hw segments for a request for this queue
* @q: the request queue for the device
......@@ -526,6 +543,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
t->max_dev_sectors = min_not_zero(t->max_dev_sectors, b->max_dev_sectors);
t->max_write_same_sectors = min(t->max_write_same_sectors,
b->max_write_same_sectors);
t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
b->max_write_zeroes_sectors);
t->bounce_pfn = min_not_zero(t->bounce_pfn, b->bounce_pfn);
t->seg_boundary_mask = min_not_zero(t->seg_boundary_mask,
......@@ -631,6 +650,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
t->discard_granularity;
}
if (b->chunk_sectors)
t->chunk_sectors = min_not_zero(t->chunk_sectors,
b->chunk_sectors);
return ret;
}
EXPORT_SYMBOL(blk_stack_limits);
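The stacking hunks above combine `max_write_zeroes_sectors` with plain `min()` but `chunk_sectors` with `min_not_zero()`. A minimal userspace sketch of the difference follows. The `sketch_` helpers are hypothetical stand-ins written for illustration and are not the kernel macros themselves.

```c
#include <limits.h>

/* Plain minimum: a zero on either side wins. */
static unsigned int sketch_min(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Minimum that treats zero as "no limit set", in the spirit of the
 * kernel's min_not_zero(): if one side is zero, the other is returned.
 */
static unsigned int sketch_min_not_zero(unsigned int a, unsigned int b)
{
	if (a == 0)
		return b;
	if (b == 0)
		return a;
	return sketch_min(a, b);
}
```

With plain `min()`, a stacked device that starts at `UINT_MAX` (as `blk_set_stacking_limits()` sets it) drops to 0 as soon as one underlying device reports 0 write-zeroes sectors, which reads as "unsupported". With `min_not_zero()`, an unset `chunk_sectors` of 0 on one side does not erase a real value on the other.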
......@@ -832,6 +855,19 @@ void blk_queue_flush_queueable(struct request_queue *q, bool queueable)
}
EXPORT_SYMBOL_GPL(blk_queue_flush_queueable);
/**
* blk_set_queue_depth - tell the block layer about the device queue depth
* @q: the request queue for the device
* @depth: queue depth
*
*/
void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
{
q->queue_depth = depth;
wbt_set_queue_depth(q->rq_wb, depth);
}
EXPORT_SYMBOL(blk_set_queue_depth);
/**
* blk_queue_write_cache - configure queue's write cache
* @q: the request queue for the device
......@@ -852,6 +888,8 @@ void blk_queue_write_cache(struct request_queue *q, bool wc, bool fua)
else
queue_flag_clear(QUEUE_FLAG_FUA, q);
spin_unlock_irq(q->queue_lock);
wbt_set_write_cache(q->rq_wb, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
}
EXPORT_SYMBOL_GPL(blk_queue_write_cache);
......
/*
* Block stat tracking code
*
* Copyright (C) 2016 Jens Axboe
*/
#include <linux/kernel.h>
#include <linux/blk-mq.h>
#include "blk-stat.h"
#include "blk-mq.h"
static void blk_stat_flush_batch(struct blk_rq_stat *stat)
{
const s32 nr_batch = READ_ONCE(stat->nr_batch);
const s32 nr_samples = READ_ONCE(stat->nr_samples);
if (!nr_batch)
return;
if (!nr_samples)
stat->mean = div64_s64(stat->batch, nr_batch);
else {
stat->mean = div64_s64((stat->mean * nr_samples) +
stat->batch,
nr_batch + nr_samples);
}
stat->nr_samples += nr_batch;
stat->nr_batch = stat->batch = 0;
}
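The batching in `blk_stat_flush_batch()` can be tried out in userspace. The following is a minimal sketch, assuming a simplified hypothetical struct (`sketch_stat`) that keeps only the four fields involved; it omits `READ_ONCE()` and the min/max/time tracking of the real `struct blk_rq_stat`.

```c
#include <stdint.h>

/* Hypothetical stand-in for struct blk_rq_stat; only the fields used here. */
struct sketch_stat {
	int64_t mean;       /* running mean over nr_samples */
	int64_t nr_samples; /* samples already folded into mean */
	int64_t batch;      /* sum of samples not yet folded in */
	int64_t nr_batch;   /* number of samples in batch */
};

/* Record one latency sample; the mean is only touched on flush. */
static void sketch_add(struct sketch_stat *s, int64_t value)
{
	s->batch += value;
	s->nr_batch++;
}

/* Fold the pending batch into the running mean, then reset the batch. */
static void sketch_flush_batch(struct sketch_stat *s)
{
	if (!s->nr_batch)
		return;
	if (!s->nr_samples)
		s->mean = s->batch / s->nr_batch;
	else
		s->mean = (s->mean * s->nr_samples + s->batch) /
			  (s->nr_batch + s->nr_samples);
	s->nr_samples += s->nr_batch;
	s->nr_batch = 0;
	s->batch = 0;
}
```

The point of the design is that the per-sample path is just an add and an increment; the 64-bit division happens once per batch rather than once per completed request.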
static void blk_stat_sum(struct blk_rq_stat *dst, struct blk_rq_stat *src)
{
if (!src->nr_samples)
return;
blk_stat_flush_batch(src);
dst->min = min(dst->min, src->min);
dst->max = max(dst->max, src->max);
if (!dst->nr_samples)
dst->mean = src->mean;
else {
dst->mean = div64_s64((src->mean * src->nr_samples) +
(dst->mean * dst->nr_samples),
dst->nr_samples + src->nr_samples);
}
dst->nr_samples += src->nr_samples;
}
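`blk_stat_sum()` merges per-context statistics by weighting each mean with its sample count. A minimal sketch of that merge follows, assuming a hypothetical `sketch_agg` struct and using `INT64_MAX` as the "empty" minimum (the kernel initialises `min` to `-1ULL` on an unsigned field instead).

```c
#include <stdint.h>

/* Hypothetical aggregate; mirrors the fields blk_stat_sum() touches. */
struct sketch_agg {
	int64_t mean;
	int64_t nr_samples;
	int64_t min;
	int64_t max;
};

/* Merge src into dst: widen min/max and combine means by sample weight. */
static void sketch_sum(struct sketch_agg *dst, const struct sketch_agg *src)
{
	if (!src->nr_samples)
		return;
	if (src->min < dst->min)
		dst->min = src->min;
	if (src->max > dst->max)
		dst->max = src->max;
	if (!dst->nr_samples)
		dst->mean = src->mean;
	else
		dst->mean = (src->mean * src->nr_samples +
			     dst->mean * dst->nr_samples) /
			    (dst->nr_samples + src->nr_samples);
	dst->nr_samples += src->nr_samples;
}
```

For example, merging a context with mean 10 over 2 samples and one with mean 40 over 1 sample gives (10*2 + 40*1) / 3 = 20.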
static void blk_mq_stat_get(struct request_queue *q, struct blk_rq_stat *dst)
{
struct blk_mq_hw_ctx *hctx;
struct blk_mq_ctx *ctx;
uint64_t latest = 0;
int i, j, nr;
blk_stat_init(&dst[BLK_STAT_READ]);
blk_stat_init(&dst[BLK_STAT_WRITE]);
nr = 0;
do {
uint64_t newest = 0;
queue_for_each_hw_ctx(q, hctx, i) {
hctx_for_each_ctx(hctx, ctx, j) {
blk_stat_flush_batch(&ctx->stat[BLK_STAT_READ]);
blk_stat_flush_batch(&ctx->stat[BLK_STAT_WRITE]);
if (!ctx->stat[BLK_STAT_READ].nr_samples &&
!ctx->stat[BLK_STAT_WRITE].nr_samples)
continue;
if (ctx->stat[BLK_STAT_READ].time > newest)
newest = ctx->stat[BLK_STAT_READ].time;
if (ctx->stat[BLK_STAT_WRITE].time > newest)
newest = ctx->stat[BLK_STAT_WRITE].time;
}
}
/*
* No samples
*/
if (!newest)
break;
if (newest > latest)
latest = newest;
queue_for_each_hw_ctx(q, hctx, i) {
hctx_for_each_ctx(hctx, ctx, j) {
if (ctx->stat[BLK_STAT_READ].time == newest) {
blk_stat_sum(&dst[BLK_STAT_READ],
&ctx->stat[BLK_STAT_READ]);
nr++;
}
if (ctx->stat[BLK_STAT_WRITE].time == newest) {
blk_stat_sum(&dst[BLK_STAT_WRITE],
&ctx->stat[BLK_STAT_WRITE]);
nr++;
}
}
}
/*
* If we race on finding an entry, just loop back again.
* Should be very rare.
*/
} while (!nr);
dst[BLK_STAT_READ].time = dst[BLK_STAT_WRITE].time = latest;
}
void blk_queue_stat_get(struct request_queue *q, struct blk_rq_stat *dst)
{
if (q->mq_ops)
blk_mq_stat_get(q, dst);
else {
blk_stat_flush_batch(&q->rq_stats[BLK_STAT_READ]);
blk_stat_flush_batch(&q->rq_stats[BLK_STAT_WRITE]);
memcpy(&dst[BLK_STAT_READ], &q->rq_stats[BLK_STAT_READ],
sizeof(struct blk_rq_stat));
memcpy(&dst[BLK_STAT_WRITE], &q->rq_stats[BLK_STAT_WRITE],
sizeof(struct blk_rq_stat));
}
}
void blk_hctx_stat_get(struct blk_mq_hw_ctx *hctx, struct blk_rq_stat *dst)
{
struct blk_mq_ctx *ctx;
unsigned int i, nr;
nr = 0;
do {
uint64_t newest = 0;
hctx_for_each_ctx(hctx, ctx, i) {
blk_stat_flush_batch(&ctx->stat[BLK_STAT_READ]);
blk_stat_flush_batch(&ctx->stat[BLK_STAT_WRITE]);
if (!ctx->stat[BLK_STAT_READ].nr_samples &&
!ctx->stat[BLK_STAT_WRITE].nr_samples)
continue;
if (ctx->stat[BLK_STAT_READ].time > newest)
newest = ctx->stat[BLK_STAT_READ].time;
if (ctx->stat[BLK_STAT_WRITE].time > newest)
newest = ctx->stat[BLK_STAT_WRITE].time;
}
if (!newest)
break;
hctx_for_each_ctx(hctx, ctx, i) {
if (ctx->stat[BLK_STAT_READ].time == newest) {
blk_stat_sum(&dst[BLK_STAT_READ],
&ctx->stat[BLK_STAT_READ]);
nr++;
}
if (ctx->stat[BLK_STAT_WRITE].time == newest) {
blk_stat_sum(&dst[BLK_STAT_WRITE],
&ctx->stat[BLK_STAT_WRITE]);
nr++;
}
}
/*
* If we race on finding an entry, just loop back again.
* Should be very rare, as the window is only updated
* occasionally
*/
} while (!nr);
}
static void __blk_stat_init(struct blk_rq_stat *stat, s64 time_now)
{
stat->min = -1ULL;
stat->max = stat->nr_samples = stat->mean = 0;
stat->batch = stat->nr_batch = 0;
stat->time = time_now & BLK_STAT_NSEC_MASK;
}
void blk_stat_init(struct blk_rq_stat *stat)
{
__blk_stat_init(stat, ktime_to_ns(ktime_get()));
}
static bool __blk_stat_is_current(struct blk_rq_stat *stat, s64 now)
{
return (now & BLK_STAT_NSEC_MASK) == (stat->time & BLK_STAT_NSEC_MASK);
}
bool blk_stat_is_current(struct blk_rq_stat *stat)
{
return __blk_stat_is_current(stat, ktime_to_ns(ktime_get()));
}
void blk_stat_add(struct blk_rq_stat *stat, struct request *rq)
{
s64 now, value;
now = __blk_stat_time(ktime_to_ns(ktime_get()));
if (now < blk_stat_time(&rq->issue_stat))
return;
if (!__blk_stat_is_current(stat, now))
__blk_stat_init(stat, now);
value = now - blk_stat_time(&rq->issue_stat);
if (value > stat->max)
stat->max = value;
if (value < stat->min)
stat->min = value;
if (stat->batch + value < stat->batch ||
stat->nr_batch + 1 == BLK_RQ_STAT_BATCH)
blk_stat_flush_batch(stat);
stat->batch += value;
stat->nr_batch++;
}
void blk_stat_clear(struct request_queue *q)
{
if (q->mq_ops) {
struct blk_mq_hw_ctx *hctx;
struct blk_mq_ctx *ctx;
int i, j;
queue_for_each_hw_ctx(q, hctx, i) {
hctx_for_each_ctx(hctx, ctx, j) {
blk_stat_init(&ctx->stat[BLK_STAT_READ]);
blk_stat_init(&ctx->stat[BLK_STAT_WRITE]);
}
}
} else {
blk_stat_init(&q->rq_stats[BLK_STAT_READ]);
blk_stat_init(&q->rq_stats[BLK_STAT_WRITE]);
}
}
void blk_stat_set_issue_time(struct blk_issue_stat *stat)
{
stat->time = (stat->time & BLK_STAT_MASK) |
(ktime_to_ns(ktime_get()) & BLK_STAT_TIME_MASK);
}
/*
* Enable stat tracking, return whether it was enabled
*/
bool blk_stat_enable(struct request_queue *q)
{
if (!test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
set_bit(QUEUE_FLAG_STATS, &q->queue_flags);
return false;
}
return true;
}
#ifndef BLK_STAT_H
#define BLK_STAT_H
/*
* ~0.13s window as a power-of-2 (2^27 nsecs)
*/
#define BLK_STAT_NSEC 134217728ULL
#define BLK_STAT_NSEC_MASK ~(BLK_STAT_NSEC - 1)
/*
* Upper 3 bits can be used elsewhere
*/
#define BLK_STAT_RES_BITS 3
#define BLK_STAT_SHIFT (64 - BLK_STAT_RES_BITS)
#define BLK_STAT_TIME_MASK ((1ULL << BLK_STAT_SHIFT) - 1)
#define BLK_STAT_MASK ~BLK_STAT_TIME_MASK
enum {
BLK_STAT_READ = 0,
BLK_STAT_WRITE,
};
void blk_stat_add(struct blk_rq_stat *, struct request *);
void blk_hctx_stat_get(struct blk_mq_hw_ctx *, struct blk_rq_stat *);
void blk_queue_stat_get(struct request_queue *, struct blk_rq_stat *);
void blk_stat_clear(struct request_queue *);
void blk_stat_init(struct blk_rq_stat *);
bool blk_stat_is_current(struct blk_rq_stat *);
void blk_stat_set_issue_time(struct blk_issue_stat *);
bool blk_stat_enable(struct request_queue *);
static inline u64 __blk_stat_time(u64 time)
{
return time & BLK_STAT_TIME_MASK;
}
static inline u64 blk_stat_time(struct blk_issue_stat *stat)
{
return __blk_stat_time(stat->time);
}
#endif
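The masks in this header do two separate jobs: `BLK_STAT_NSEC_MASK` buckets timestamps into 2^27 ns windows, and `BLK_STAT_TIME_MASK` keeps the timestamp in the low 61 bits so the upper 3 bits stay free for flags. A minimal userspace sketch, with constants copied from the header under hypothetical `SKETCH_` names:

```c
#include <stdint.h>
#include <stdbool.h>

#define SKETCH_NSEC       134217728ULL              /* 2^27 ns, ~0.13 s */
#define SKETCH_NSEC_MASK  (~(SKETCH_NSEC - 1))
#define SKETCH_RES_BITS   3
#define SKETCH_SHIFT      (64 - SKETCH_RES_BITS)
#define SKETCH_TIME_MASK  ((1ULL << SKETCH_SHIFT) - 1)

/* Two timestamps share a window when they agree on all bits from 27 up. */
static bool sketch_same_window(uint64_t a_ns, uint64_t b_ns)
{
	return (a_ns & SKETCH_NSEC_MASK) == (b_ns & SKETCH_NSEC_MASK);
}

/* Store a timestamp in the low 61 bits, preserving the upper 3 bits. */
static uint64_t sketch_set_time(uint64_t word, uint64_t now_ns)
{
	return (word & ~SKETCH_TIME_MASK) | (now_ns & SKETCH_TIME_MASK);
}

/* Read the timestamp back, ignoring the reserved upper bits. */
static uint64_t sketch_get_time(uint64_t word)
{
	return word & SKETCH_TIME_MASK;
}
```

Because the window is a power of two, the "is this stat still current" check is a single mask-and-compare with no division.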
......@@ -13,6 +13,7 @@
#include "blk.h"
#include "blk-mq.h"
#include "blk-wbt.h"
struct queue_sysfs_entry {
struct attribute attr;
......@@ -41,6 +42,19 @@ queue_var_store(unsigned long *var, const char *page, size_t count)
return count;
}
static ssize_t queue_var_store64(s64 *var, const char *page)
{
int err;
s64 v;
err = kstrtos64(page, 10, &v);
if (err < 0)
return err;
*var = v;
return 0;
}
static ssize_t queue_requests_show(struct request_queue *q, char *page)
{
return queue_var_show(q->nr_requests, (page));
......@@ -130,6 +144,11 @@ static ssize_t queue_physical_block_size_show(struct request_queue *q, char *pag
return queue_var_show(queue_physical_block_size(q), page);
}
static ssize_t queue_chunk_sectors_show(struct request_queue *q, char *page)
{
return queue_var_show(q->limits.chunk_sectors, page);
}
static ssize_t queue_io_min_show(struct request_queue *q, char *page)
{
return queue_var_show(queue_io_min(q), page);
......@@ -192,6 +211,11 @@ static ssize_t queue_write_same_max_show(struct request_queue *q, char *page)
(unsigned long long)q->limits.max_write_same_sectors << 9);
}
static ssize_t queue_write_zeroes_max_show(struct request_queue *q, char *page)
{
return sprintf(page, "%llu\n",
(unsigned long long)q->limits.max_write_zeroes_sectors << 9);
}
static ssize_t
queue_max_sectors_store(struct request_queue *q, const char *page, size_t count)
......@@ -258,6 +282,18 @@ QUEUE_SYSFS_BIT_FNS(random, ADD_RANDOM, 0);
QUEUE_SYSFS_BIT_FNS(iostats, IO_STAT, 0);
#undef QUEUE_SYSFS_BIT_FNS
static ssize_t queue_zoned_show(struct request_queue *q, char *page)
{
switch (blk_queue_zoned_model(q)) {
case BLK_ZONED_HA:
return sprintf(page, "host-aware\n");
case BLK_ZONED_HM:
return sprintf(page, "host-managed\n");
default:
return sprintf(page, "none\n");
}
}
static ssize_t queue_nomerges_show(struct request_queue *q, char *page)
{
return queue_var_show((blk_queue_nomerges(q) << 1) |
......@@ -320,6 +356,38 @@ queue_rq_affinity_store(struct request_queue *q, const char *page, size_t count)
return ret;
}
static ssize_t queue_poll_delay_show(struct request_queue *q, char *page)
{
int val;
if (q->poll_nsec == -1)
val = -1;
else
val = q->poll_nsec / 1000;
return sprintf(page, "%d\n", val);
}
static ssize_t queue_poll_delay_store(struct request_queue *q, const char *page,
size_t count)
{
int err, val;
if (!q->mq_ops || !q->mq_ops->poll)
return -EINVAL;
err = kstrtoint(page, 10, &val);
if (err < 0)
return err;
if (val == -1)
q->poll_nsec = -1;
else
q->poll_nsec = val * 1000;
return count;
}
static ssize_t queue_poll_show(struct request_queue *q, char *page)
{
return queue_var_show(test_bit(QUEUE_FLAG_POLL, &q->queue_flags), page);
......@@ -348,6 +416,50 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page,
return ret;
}
static ssize_t queue_wb_lat_show(struct request_queue *q, char *page)
{
if (!q->rq_wb)
return -EINVAL;
return sprintf(page, "%llu\n", div_u64(q->rq_wb->min_lat_nsec, 1000));
}
static ssize_t queue_wb_lat_store(struct request_queue *q, const char *page,
size_t count)
{
struct rq_wb *rwb;
ssize_t ret;
s64 val;
ret = queue_var_store64(&val, page);
if (ret < 0)
return ret;
if (val < -1)
return -EINVAL;
rwb = q->rq_wb;
if (!rwb) {
ret = wbt_init(q);
if (ret)
return ret;
rwb = q->rq_wb;
if (!rwb)
return -EINVAL;
}
if (val == -1)
rwb->min_lat_nsec = wbt_default_latency_nsec(q);
else if (val >= 0)
rwb->min_lat_nsec = val * 1000ULL;
if (rwb->enable_state == WBT_STATE_ON_DEFAULT)
rwb->enable_state = WBT_STATE_ON_MANUAL;
wbt_update_limits(rwb);
return count;
}
static ssize_t queue_wc_show(struct request_queue *q, char *page)
{
if (test_bit(QUEUE_FLAG_WC, &q->queue_flags))
......@@ -385,6 +497,26 @@ static ssize_t queue_dax_show(struct request_queue *q, char *page)
return queue_var_show(blk_queue_dax(q), page);
}
static ssize_t print_stat(char *page, struct blk_rq_stat *stat, const char *pre)
{
return sprintf(page, "%s samples=%llu, mean=%lld, min=%lld, max=%lld\n",
pre, (long long) stat->nr_samples,
(long long) stat->mean, (long long) stat->min,
(long long) stat->max);
}
static ssize_t queue_stats_show(struct request_queue *q, char *page)
{
struct blk_rq_stat stat[2];
ssize_t ret;
blk_queue_stat_get(q, stat);
ret = print_stat(page, &stat[BLK_STAT_READ], "read :");
ret += print_stat(page + ret, &stat[BLK_STAT_WRITE], "write:");
return ret;
}
static struct queue_sysfs_entry queue_requests_entry = {
.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
.show = queue_requests_show,
......@@ -444,6 +576,11 @@ static struct queue_sysfs_entry queue_physical_block_size_entry = {
.show = queue_physical_block_size_show,
};
static struct queue_sysfs_entry queue_chunk_sectors_entry = {
.attr = {.name = "chunk_sectors", .mode = S_IRUGO },
.show = queue_chunk_sectors_show,
};
static struct queue_sysfs_entry queue_io_min_entry = {
.attr = {.name = "minimum_io_size", .mode = S_IRUGO },
.show = queue_io_min_show,
......@@ -480,12 +617,22 @@ static struct queue_sysfs_entry queue_write_same_max_entry = {
.show = queue_write_same_max_show,
};
static struct queue_sysfs_entry queue_write_zeroes_max_entry = {
.attr = {.name = "write_zeroes_max_bytes", .mode = S_IRUGO },
.show = queue_write_zeroes_max_show,
};
static struct queue_sysfs_entry queue_nonrot_entry = {
.attr = {.name = "rotational", .mode = S_IRUGO | S_IWUSR },
.show = queue_show_nonrot,
.store = queue_store_nonrot,
};
static struct queue_sysfs_entry queue_zoned_entry = {
.attr = {.name = "zoned", .mode = S_IRUGO },
.show = queue_zoned_show,
};
static struct queue_sysfs_entry queue_nomerges_entry = {
.attr = {.name = "nomerges", .mode = S_IRUGO | S_IWUSR },
.show = queue_nomerges_show,
......@@ -516,6 +663,12 @@ static struct queue_sysfs_entry queue_poll_entry = {
.store = queue_poll_store,
};
static struct queue_sysfs_entry queue_poll_delay_entry = {
.attr = {.name = "io_poll_delay", .mode = S_IRUGO | S_IWUSR },
.show = queue_poll_delay_show,
.store = queue_poll_delay_store,
};
static struct queue_sysfs_entry queue_wc_entry = {
.attr = {.name = "write_cache", .mode = S_IRUGO | S_IWUSR },
.show = queue_wc_show,
......@@ -527,6 +680,17 @@ static struct queue_sysfs_entry queue_dax_entry = {
.show = queue_dax_show,
};
static struct queue_sysfs_entry queue_stats_entry = {
.attr = {.name = "stats", .mode = S_IRUGO },
.show = queue_stats_show,
};
static struct queue_sysfs_entry queue_wb_lat_entry = {
.attr = {.name = "wbt_lat_usec", .mode = S_IRUGO | S_IWUSR },
.show = queue_wb_lat_show,
.store = queue_wb_lat_store,
};
static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
&queue_ra_entry.attr,
......@@ -539,6 +703,7 @@ static struct attribute *default_attrs[] = {
&queue_hw_sector_size_entry.attr,
&queue_logical_block_size_entry.attr,
&queue_physical_block_size_entry.attr,
&queue_chunk_sectors_entry.attr,
&queue_io_min_entry.attr,
&queue_io_opt_entry.attr,
&queue_discard_granularity_entry.attr,
......@@ -546,7 +711,9 @@ static struct attribute *default_attrs[] = {
&queue_discard_max_hw_entry.attr,
&queue_discard_zeroes_data_entry.attr,
&queue_write_same_max_entry.attr,
&queue_write_zeroes_max_entry.attr,
&queue_nonrot_entry.attr,
&queue_zoned_entry.attr,
&queue_nomerges_entry.attr,
&queue_rq_affinity_entry.attr,
&queue_iostats_entry.attr,
......@@ -554,6 +721,9 @@ static struct attribute *default_attrs[] = {
&queue_poll_entry.attr,
&queue_wc_entry.attr,
&queue_dax_entry.attr,
&queue_stats_entry.attr,
&queue_wb_lat_entry.attr,
&queue_poll_delay_entry.attr,
NULL,
};
......@@ -628,6 +798,7 @@ static void blk_release_queue(struct kobject *kobj)
struct request_queue *q =
container_of(kobj, struct request_queue, kobj);
wbt_exit(q);
bdi_exit(&q->backing_dev_info);
blkcg_exit_queue(q);
......@@ -668,6 +839,23 @@ struct kobj_type blk_queue_ktype = {
.release = blk_release_queue,
};
static void blk_wb_init(struct request_queue *q)
{
#ifndef CONFIG_BLK_WBT_MQ
if (q->mq_ops)
return;
#endif
#ifndef CONFIG_BLK_WBT_SQ
if (q->request_fn)
return;
#endif
/*
* If this fails, we don't get throttling
*/
wbt_init(q);
}
int blk_register_queue(struct gendisk *disk)
{
int ret;
......@@ -707,6 +895,8 @@ int blk_register_queue(struct gendisk *disk)
if (q->mq_ops)
blk_mq_register_dev(dev, q);
blk_wb_init(q);
if (!q->request_fn)
return 0;
......
......@@ -270,7 +270,7 @@ void blk_queue_end_tag(struct request_queue *q, struct request *rq)
BUG_ON(tag >= bqt->real_max_depth);
list_del_init(&rq->queuelist);
rq->cmd_flags &= ~REQ_QUEUED;
rq->rq_flags &= ~RQF_QUEUED;
rq->tag = -1;
if (unlikely(bqt->tag_index[tag] == NULL))
......@@ -316,7 +316,7 @@ int blk_queue_start_tag(struct request_queue *q, struct request *rq)
unsigned max_depth;
int tag;
if (unlikely((rq->cmd_flags & REQ_QUEUED))) {
if (unlikely((rq->rq_flags & RQF_QUEUED))) {
printk(KERN_ERR
"%s: request %p for device [%s] already tagged %d",
__func__, rq,
......@@ -371,7 +371,7 @@ int blk_queue_start_tag(struct request_queue *q, struct request *rq)
*/
bqt->next_tag = (tag + 1) % bqt->max_depth;
rq->cmd_flags |= REQ_QUEUED;
rq->rq_flags |= RQF_QUEUED;
rq->tag = tag;
bqt->tag_index[tag] = rq;
blk_start_request(rq);
......
......@@ -818,13 +818,13 @@ static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
tg->io_disp[rw]++;
/*
* REQ_THROTTLED is used to prevent the same bio to be throttled
* BIO_THROTTLED is used to prevent the same bio to be throttled
* more than once as a throttled bio will go through blk-throtl the
* second time when it eventually gets issued. Set it when a bio
* is being charged to a tg.
*/
if (!(bio->bi_opf & REQ_THROTTLED))
bio->bi_opf |= REQ_THROTTLED;
if (!bio_flagged(bio, BIO_THROTTLED))
bio_set_flag(bio, BIO_THROTTLED);
}
/**
......@@ -1401,7 +1401,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
WARN_ON_ONCE(!rcu_read_lock_held());
/* see throtl_charge_bio() */
if ((bio->bi_opf & REQ_THROTTLED) || !tg->has_rules[rw])
if (bio_flagged(bio, BIO_THROTTLED) || !tg->has_rules[rw])
goto out;
spin_lock_irq(q->queue_lock);
......@@ -1480,7 +1480,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
* being issued.
*/
if (!throttled)
bio->bi_opf &= ~REQ_THROTTLED;
bio_clear_flag(bio, BIO_THROTTLED);
return throttled;
}
......
/*
* buffered writeback throttling. loosely based on CoDel. We can't drop
* packets for IO scheduling, so the logic is something like this:
*
* - Monitor latencies in a defined window of time.
* - If the minimum latency in the above window exceeds some target, increment
* scaling step and scale down queue depth by a factor of 2x. The monitoring
* window is then shrunk to 100 / sqrt(scaling step + 1).
* - For any window where we don't have solid data on what the latencies
* look like, retain status quo.
* - If latencies look good, decrement scaling step.
* - If we're only doing writes, allow the scaling step to go negative. This
* will temporarily boost write performance, snapping back to a stable
* scaling step of 0 if reads show up or the heavy writers finish. Unlike
* positive scaling steps where we shrink the monitoring window, a negative
* scaling step retains the default step==0 window size.
*
* Copyright (C) 2016 Jens Axboe
*
*/
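The comment above says the monitoring window shrinks to 100 / sqrt(scaling step + 1) milliseconds for positive steps and stays at the default otherwise. A minimal integer-only sketch of that formula follows. The fixed-point scaling (numerator shifted by 4, value under the root shifted by 8) and the `sketch_` helpers are assumptions made for illustration; this part of the diff does not show how the code actually computes the window.

```c
#include <stdint.h>

/* Naive integer square root; adequate for the small inputs used here. */
static uint64_t sketch_isqrt(uint64_t x)
{
	uint64_t r = 0;

	while ((r + 1) * (r + 1) <= x)
		r++;
	return r;
}

/* Window length in nanoseconds for a given scaling step. */
static uint64_t sketch_window_nsec(int scale_step)
{
	const uint64_t base = 100 * 1000 * 1000ULL; /* 100 msec */

	/* Zero and negative steps keep the default window. */
	if (scale_step <= 0)
		return base;
	/* base / sqrt(step + 1), using 16 / sqrt(256 * (step + 1)). */
	return (base << 4) / sketch_isqrt((uint64_t)(scale_step + 1) << 8);
}
```

At step 3 the divisor is sqrt(4) = 2, so the window halves to 50 msec. Steps whose root is not an integer are rounded by the integer square root.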
#include <linux/kernel.h>
#include <linux/blk_types.h>
#include <linux/slab.h>
#include <linux/backing-dev.h>
#include <linux/swap.h>
#include "blk-wbt.h"
#define CREATE_TRACE_POINTS
#include <trace/events/wbt.h>
enum {
/*
* Default setting, we'll scale up (to 75% of QD max) or down (min 1)
* from here depending on device stats
*/
RWB_DEF_DEPTH = 16,
/*
* 100msec window
*/
RWB_WINDOW_NSEC = 100 * 1000 * 1000ULL,
/*
* Disregard stats, if we don't meet this minimum
*/
RWB_MIN_WRITE_SAMPLES = 3,
/*
* If we have this number of consecutive windows with not enough
* information to scale up or down, scale up.
*/
RWB_UNKNOWN_BUMP = 5,
};
static inline bool rwb_enabled(struct rq_wb *rwb)
{
return rwb && rwb->wb_normal != 0;
}
/*
* Increment 'v', if 'v' is below 'below'. Returns true if we succeeded,
* false if 'v' + 1 would be bigger than 'below'.
*/
static bool atomic_inc_below(atomic_t *v, int below)
{
int cur = atomic_read(v);
for (;;) {
int old;
if (cur >= below)
return false;
old = atomic_cmpxchg(v, cur, cur + 1);
if (old == cur)
break;
cur = old;
}
return true;
}
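The same bounded-increment loop can be written with C11 atomics for experimenting outside the kernel. This is a minimal sketch, assuming `<stdatomic.h>` in place of the kernel's `atomic_t` and `atomic_cmpxchg()`; `sketch_inc_below` is a hypothetical name.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Increment *v only while it is below 'below'; true if we incremented. */
static bool sketch_inc_below(atomic_int *v, int below)
{
	int cur = atomic_load(v);

	for (;;) {
		if (cur >= below)
			return false;
		/*
		 * On failure, compare_exchange reloads 'cur' with the value
		 * another thread stored, and the bound is checked again.
		 */
		if (atomic_compare_exchange_weak(v, &cur, cur + 1))
			return true;
	}
}
```

The compare-and-swap makes the check and the increment one atomic step, so concurrent callers can never push the counter past the bound the way a separate read-then-increment could.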
static void wb_timestamp(struct rq_wb *rwb, unsigned long *var)
{
if (rwb_enabled(rwb)) {
const unsigned long cur = jiffies;
if (cur != *var)
*var = cur;
}
}
/*
* If a task was rate throttled in balance_dirty_pages() within the last
* second or so, use that to indicate a higher cleaning rate.
*/
static bool wb_recent_wait(struct rq_wb *rwb)
{
struct bdi_writeback *wb = &rwb->queue->backing_dev_info.wb;
return time_before(jiffies, wb->dirty_sleep + HZ);
}
static inline struct rq_wait *get_rq_wait(struct rq_wb *rwb, bool is_kswapd)
{
return &rwb->rq_wait[is_kswapd];
}
static void rwb_wake_all(struct rq_wb *rwb)
{
int i;
for (i = 0; i < WBT_NUM_RWQ; i++) {
struct rq_wait *rqw = &rwb->rq_wait[i];
if (waitqueue_active(&rqw->wait))
wake_up_all(&rqw->wait);
}
}
void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
{
struct rq_wait *rqw;
int inflight, limit;
if (!(wb_acct & WBT_TRACKED))
return;
rqw = get_rq_wait(rwb, wb_acct & WBT_KSWAPD);
inflight = atomic_dec_return(&rqw->inflight);
/*
* wbt got disabled with IO in flight. Wake up any potential
* waiters, we don't have to do more than that.
*/
if (unlikely(!rwb_enabled(rwb))) {
rwb_wake_all(rwb);
return;
}
/*
* If the device does write back caching, drop further down
* before we wake people up.
*/
if (rwb->wc && !wb_recent_wait(rwb))
limit = 0;
else
limit = rwb->wb_normal;
/*
* Don't wake anyone up if we are above the normal limit.
*/
if (inflight && inflight >= limit)
return;
if (waitqueue_active(&rqw->wait)) {
int diff = limit - inflight;
if (!inflight || diff >= rwb->wb_background / 2)
wake_up_all(&rqw->wait);
}
}
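The wake-up decision at the end of `__wbt_done()` can be isolated as a pure function. The sketch below is a hypothetical restatement (`sketch_should_wake` is not a kernel function). It assumes the caller has already chosen `limit` as the code above does: 0 when the device has a write-back cache and no task was recently throttled, otherwise `wb_normal`. It also assumes waiters are present.

```c
#include <stdbool.h>

/* Decide whether a completion should wake throttled writers. */
static bool sketch_should_wake(int inflight, int limit, int wb_background)
{
	/* Still at or above the limit with IO in flight: stay quiet. */
	if (inflight && inflight >= limit)
		return false;
	/* Wake when idle, or when enough headroom has opened up. */
	return !inflight || (limit - inflight) >= wb_background / 2;
}
```

Waiting for headroom of at least half the background depth batches wake-ups instead of waking every waiter on each single completion.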
/*
* Called on completion of a request. Note that it's also called when
* a request is merged, when the request gets freed.
*/
void wbt_done(struct rq_wb *rwb, struct blk_issue_stat *stat)
{
if (!rwb)
return;
if (!wbt_is_tracked(stat)) {
if (rwb->sync_cookie == stat) {
rwb->sync_issue = 0;
rwb->sync_cookie = NULL;
}
if (wbt_is_read(stat))
wb_timestamp(rwb, &rwb->last_comp);
wbt_clear_state(stat);
} else {
WARN_ON_ONCE(stat == rwb->sync_cookie);
__wbt_done(rwb, wbt_stat_to_mask(stat));
wbt_clear_state(stat);
}
}
/*
* Return true, if we can't increase the depth further by scaling
*/
static bool calc_wb_limits(struct rq_wb *rwb)
{
unsigned int depth;
bool ret = false;
if (!rwb->min_lat_nsec) {
rwb->wb_max = rwb->wb_normal = rwb->wb_background = 0;
return false;
}
/*
* For QD=1 devices, this is a special case. It's important for those
* to have one request ready when one completes, so force a depth of
* 2 for those devices. On the backend, it'll be a depth of 1 anyway,
* since the device can't have more than that in flight. If we're
* scaling down, then keep a setting of 1/1/1.
*/
if (rwb->queue_depth == 1) {
if (rwb->scale_step > 0)
rwb->wb_max = rwb->wb_normal = 1;
else {
rwb->wb_max = rwb->wb_normal = 2;
ret = true;
}
rwb->wb_background = 1;
} else {
/*
* scale_step == 0 is our default state. If we have suffered
* latency spikes, step will be > 0, and we shrink the
* allowed write depths. If step is < 0, we're only doing
* writes, and we allow a temporarily higher depth to
* increase performance.
*/
depth = min_t(unsigned int, RWB_DEF_DEPTH, rwb->queue_depth);
if (rwb->scale_step > 0)
depth = 1 + ((depth - 1) >> min(31, rwb->scale_step));
else if (rwb->scale_step < 0) {
unsigned int maxd = 3 * rwb->queue_depth / 4;
depth = 1 + ((depth - 1) << -rwb->scale_step);
if (depth > maxd) {
depth = maxd;
ret = true;
}
}
/*
* Set our max/normal/bg queue depths based on how far
* we have scaled down (->scale_step).
*/
rwb->wb_max = depth;
rwb->wb_normal = (rwb->wb_max + 1) / 2;
rwb->wb_background = (rwb->wb_max + 3) / 4;
}
return ret;
}
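The depth arithmetic of calc_wb_limits() for devices with a queue depth above 1 can be tried out in userspace. This is only a sketch: SKETCH_DEF_DEPTH stands in for RWB_DEF_DEPTH, which is defined outside this excerpt, and the value 16 is an assumption. The struct and function names are hypothetical.

```c
#include <assert.h>

/* Assumed stand-in for RWB_DEF_DEPTH (defined outside this excerpt). */
#define SKETCH_DEF_DEPTH 16u

struct sketch_wb_limits {
	unsigned int max, normal, background;
};

/* Mirrors the non-QD=1 branch of calc_wb_limits() for illustration. */
static struct sketch_wb_limits sketch_calc_limits(unsigned int queue_depth,
						  int scale_step)
{
	struct sketch_wb_limits l;
	unsigned int depth;

	/* Keep the left shift below well inside 32 bits. */
	assert(queue_depth > 1 && scale_step >= -16);

	depth = queue_depth < SKETCH_DEF_DEPTH ? queue_depth : SKETCH_DEF_DEPTH;
	if (scale_step > 0) {
		/* Latency was exceeded: halve the depth per step. */
		int shift = scale_step < 31 ? scale_step : 31;

		depth = 1 + ((depth - 1) >> shift);
	} else if (scale_step < 0) {
		/* Write-only load: grow the depth, capped at 3/4 of the queue. */
		unsigned int maxd = 3 * queue_depth / 4;

		depth = 1 + ((depth - 1) << -scale_step);
		if (depth > maxd)
			depth = maxd;
	}

	l.max = depth;
	l.normal = (l.max + 1) / 2;
	l.background = (l.max + 3) / 4;
	return l;
}
```

Under the assumed default of 16, a queue depth of 32 at step 0 gives 16/8/4 for max/normal/background, and each positive step roughly halves all three.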
static inline bool stat_sample_valid(struct blk_rq_stat *stat)
{
/*
* We need at least one read sample, and a minimum of
* RWB_MIN_WRITE_SAMPLES. We require some write samples to know
* that it's writes impacting us, and not just some sole read on
* a device that is in a lower power state.
*/
return stat[BLK_STAT_READ].nr_samples >= 1 &&
stat[BLK_STAT_WRITE].nr_samples >= RWB_MIN_WRITE_SAMPLES;
}
static u64 rwb_sync_issue_lat(struct rq_wb *rwb)
{
u64 now, issue = ACCESS_ONCE(rwb->sync_issue);
if (!issue || !rwb->sync_cookie)
return 0;
now = ktime_to_ns(ktime_get());
return now - issue;
}
enum {
LAT_OK = 1,
LAT_UNKNOWN,
LAT_UNKNOWN_WRITES,
LAT_EXCEEDED,
};
static int __latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)
{
struct backing_dev_info *bdi = &rwb->queue->backing_dev_info;
u64 thislat;
/*
* If our stored sync issue exceeds the window size, or it
* exceeds our min target AND we haven't logged any entries,
* flag the latency as exceeded. wbt works off completion latencies,
* but for a flooded device, a single sync IO can take a long time
* to complete after being issued. If this time exceeds our
* monitoring window AND we didn't see any other completions in that
* window, then count that sync IO as a violation of the latency.
*/
thislat = rwb_sync_issue_lat(rwb);
if (thislat > rwb->cur_win_nsec ||
(thislat > rwb->min_lat_nsec && !stat[BLK_STAT_READ].nr_samples)) {
trace_wbt_lat(bdi, thislat);
return LAT_EXCEEDED;
}
/*
* No read/write mix, if stat isn't valid
*/
if (!stat_sample_valid(stat)) {
/*
* If we had writes in this stat window and the window is
* current, we're only doing writes. If a task recently
* waited or still has writes in flight, consider us doing
* just writes as well.
*/
if ((stat[BLK_STAT_WRITE].nr_samples && blk_stat_is_current(stat)) ||
wb_recent_wait(rwb) || wbt_inflight(rwb))
return LAT_UNKNOWN_WRITES;
return LAT_UNKNOWN;
}
/*
* If the 'min' latency exceeds our target, step down.
*/
if (stat[BLK_STAT_READ].min > rwb->min_lat_nsec) {
trace_wbt_lat(bdi, stat[BLK_STAT_READ].min);
trace_wbt_stat(bdi, stat);
return LAT_EXCEEDED;
}
if (rwb->scale_step)
trace_wbt_stat(bdi, stat);
return LAT_OK;
}
static int latency_exceeded(struct rq_wb *rwb)
{
struct blk_rq_stat stat[2];
blk_queue_stat_get(rwb->queue, stat);
return __latency_exceeded(rwb, stat);
}
static void rwb_trace_step(struct rq_wb *rwb, const char *msg)
{
struct backing_dev_info *bdi = &rwb->queue->backing_dev_info;
trace_wbt_step(bdi, msg, rwb->scale_step, rwb->cur_win_nsec,
rwb->wb_background, rwb->wb_normal, rwb->wb_max);
}
static void scale_up(struct rq_wb *rwb)
{
/*
* Hit max in previous round, stop here
*/
if (rwb->scaled_max)
return;
rwb->scale_step--;
rwb->unknown_cnt = 0;
blk_stat_clear(rwb->queue);
rwb->scaled_max = calc_wb_limits(rwb);
rwb_wake_all(rwb);
rwb_trace_step(rwb, "step up");
}
/*
* Scale rwb down. If 'hard_throttle' is set, do it quicker, since we
* had a latency violation.
*/
static void scale_down(struct rq_wb *rwb, bool hard_throttle)
{
/*
* Stop scaling down when we've hit the limit. This also prevents
* ->scale_step from going to crazy values, if the device can't
* keep up.
*/
if (rwb->wb_max == 1)
return;
if (rwb->scale_step < 0 && hard_throttle)
rwb->scale_step = 0;
else
rwb->scale_step++;
rwb->scaled_max = false;
rwb->unknown_cnt = 0;
blk_stat_clear(rwb->queue);
calc_wb_limits(rwb);
rwb_trace_step(rwb, "step down");
}
static void rwb_arm_timer(struct rq_wb *rwb)
{
unsigned long expires;
if (rwb->scale_step > 0) {
/*
* We should speed this up, using some variant of a fast
* integer inverse square root calculation. Since we only do
* this for every window expiration, it's not a huge deal,
* though.
*/
rwb->cur_win_nsec = div_u64(rwb->win_nsec << 4,
int_sqrt((rwb->scale_step + 1) << 8));
} else {
/*
* For step < 0, we don't want to increase/decrease the
* window size.
*/
rwb->cur_win_nsec = rwb->win_nsec;
}
expires = jiffies + nsecs_to_jiffies(rwb->cur_win_nsec);
mod_timer(&rwb->window_timer, expires);
}
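The window shrink in rwb_arm_timer() works out to win_nsec / sqrt(scale_step + 1), computed in fixed point. The userspace sketch below reproduces that arithmetic; sketch_int_sqrt() is a simple stand-in for the kernel's int_sqrt(), and the 100 ms base window used in the note after it is an assumption, since RWB_WINDOW_NSEC is defined outside this excerpt.

```c
#include <assert.h>
#include <stdint.h>

/* Floor integer square root; a slow stand-in for the kernel's int_sqrt(). */
static uint64_t sketch_int_sqrt(uint64_t x)
{
	uint64_t r = 0;

	while ((r + 1) * (r + 1) <= x)
		r++;
	return r;
}

/* Hypothetical helper mirroring the cur_win_nsec computation. */
static uint64_t sketch_cur_win_nsec(uint64_t win_nsec, int scale_step)
{
	assert(win_nsec > 0);

	/* At or below the center step the window is left unchanged. */
	if (scale_step <= 0)
		return win_nsec;

	/* (win << 4) / sqrt((step + 1) << 8) == win / sqrt(step + 1) */
	return (win_nsec << 4) /
	       sketch_int_sqrt((uint64_t)(scale_step + 1) << 8);
}
```

With an assumed base window of 100000000 ns, step 3 halves the window to 50000000 ns and step 15 quarters it to 25000000 ns. Steps that are not one less than a perfect square lose a little precision to the integer square root.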
static void wb_timer_fn(unsigned long data)
{
struct rq_wb *rwb = (struct rq_wb *) data;
unsigned int inflight = wbt_inflight(rwb);
int status;
status = latency_exceeded(rwb);
trace_wbt_timer(&rwb->queue->backing_dev_info, status, rwb->scale_step,
inflight);
/*
* If we exceeded the latency target, step down. If we did not,
* step one level up. If we don't know enough to say either exceeded
* or ok, then don't do anything.
*/
switch (status) {
case LAT_EXCEEDED:
scale_down(rwb, true);
break;
case LAT_OK:
scale_up(rwb);
break;
case LAT_UNKNOWN_WRITES:
/*
* We started at the center step and don't have a valid
* read/write sample, but we do have writes going on.
* Allow step to go negative, to increase write perf.
*/
scale_up(rwb);
break;
case LAT_UNKNOWN:
if (++rwb->unknown_cnt < RWB_UNKNOWN_BUMP)
break;
/*
* We get here when we previously scaled the depth, and we
* currently don't have a valid read/write sample. For that
* case, slowly return to center state (step == 0).
*/
if (rwb->scale_step > 0)
scale_up(rwb);
else if (rwb->scale_step < 0)
scale_down(rwb, false);
break;
default:
break;
}
/*
* Re-arm timer, if we have IO in flight or are away from the center step
*/
if (rwb->scale_step || inflight)
rwb_arm_timer(rwb);
}
void wbt_update_limits(struct rq_wb *rwb)
{
rwb->scale_step = 0;
rwb->scaled_max = false;
calc_wb_limits(rwb);
rwb_wake_all(rwb);
}
static bool close_io(struct rq_wb *rwb)
{
const unsigned long now = jiffies;
return time_before(now, rwb->last_issue + HZ / 10) ||
time_before(now, rwb->last_comp + HZ / 10);
}
#define REQ_HIPRIO (REQ_SYNC | REQ_META | REQ_PRIO)
static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
{
unsigned int limit;
/*
* At this point we know it's a buffered write. If this is
* kswapd trying to free memory, or REQ_SYNC is set, then
* it's WB_SYNC_ALL writeback, and we'll use the max limit for
* that. If the write is marked as a background write, then use
* the idle limit, or go to normal if we haven't had competing
* IO for a bit.
*/
if ((rw & REQ_HIPRIO) || wb_recent_wait(rwb) || current_is_kswapd())
limit = rwb->wb_max;
else if ((rw & REQ_BACKGROUND) || close_io(rwb)) {
/*
* If less than 100ms since we completed unrelated IO,
* limit us to half the depth for background writeback.
*/
limit = rwb->wb_background;
} else
limit = rwb->wb_normal;
return limit;
}
static inline bool may_queue(struct rq_wb *rwb, struct rq_wait *rqw,
wait_queue_t *wait, unsigned long rw)
{
/*
* inc it here even if disabled, since we'll dec it at completion.
* this only happens if the task was sleeping in __wbt_wait(),
* and someone turned it off at the same time.
*/
if (!rwb_enabled(rwb)) {
atomic_inc(&rqw->inflight);
return true;
}
/*
* If the waitqueue is already active and we are not the next
* in line to be woken up, wait for our turn.
*/
if (waitqueue_active(&rqw->wait) &&
rqw->wait.task_list.next != &wait->task_list)
return false;
return atomic_inc_below(&rqw->inflight, get_limit(rwb, rw));
}
/*
* Block if we will exceed our limit, or if we are currently waiting for
* the timer to kick off queuing again.
*/
static void __wbt_wait(struct rq_wb *rwb, unsigned long rw, spinlock_t *lock)
{
struct rq_wait *rqw = get_rq_wait(rwb, current_is_kswapd());
DEFINE_WAIT(wait);
if (may_queue(rwb, rqw, &wait, rw))
return;
do {
prepare_to_wait_exclusive(&rqw->wait, &wait,
TASK_UNINTERRUPTIBLE);
if (may_queue(rwb, rqw, &wait, rw))
break;
if (lock)
spin_unlock_irq(lock);
io_schedule();
if (lock)
spin_lock_irq(lock);
} while (1);
finish_wait(&rqw->wait, &wait);
}
static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
{
const int op = bio_op(bio);
/*
* If not a WRITE, do nothing
*/
if (op != REQ_OP_WRITE)
return false;
/*
* Don't throttle WRITE_ODIRECT
*/
if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) == (REQ_SYNC | REQ_IDLE))
return false;
return true;
}
/*
* Returns the wbt_flags to record for the IO request: WBT_TRACKED if it
* is accounted as a throttled write, WBT_READ for a read, 0 otherwise.
* May sleep, if we have exceeded the writeback limits. Caller can pass
* in an irq held spinlock, if it holds one when calling this function.
* If we do sleep, we'll release and re-grab it.
*/
unsigned int wbt_wait(struct rq_wb *rwb, struct bio *bio, spinlock_t *lock)
{
unsigned int ret = 0;
if (!rwb_enabled(rwb))
return 0;
if (bio_op(bio) == REQ_OP_READ)
ret = WBT_READ;
if (!wbt_should_throttle(rwb, bio)) {
if (ret & WBT_READ)
wb_timestamp(rwb, &rwb->last_issue);
return ret;
}
__wbt_wait(rwb, bio->bi_opf, lock);
if (!timer_pending(&rwb->window_timer))
rwb_arm_timer(rwb);
if (current_is_kswapd())
ret |= WBT_KSWAPD;
return ret | WBT_TRACKED;
}
void wbt_issue(struct rq_wb *rwb, struct blk_issue_stat *stat)
{
if (!rwb_enabled(rwb))
return;
/*
* Track sync issue, in case it takes a long time to complete. Allows
* us to react quicker, if a sync IO takes a long time to complete.
* Note that this is just a hint. 'stat' can go away when the
* request completes, so it's important we never dereference it. We
* only use the address to compare with, which is why we store the
* sync_issue time locally.
*/
if (wbt_is_read(stat) && !rwb->sync_issue) {
rwb->sync_cookie = stat;
rwb->sync_issue = blk_stat_time(stat);
}
}
void wbt_requeue(struct rq_wb *rwb, struct blk_issue_stat *stat)
{
if (!rwb_enabled(rwb))
return;
if (stat == rwb->sync_cookie) {
rwb->sync_issue = 0;
rwb->sync_cookie = NULL;
}
}
void wbt_set_queue_depth(struct rq_wb *rwb, unsigned int depth)
{
if (rwb) {
rwb->queue_depth = depth;
wbt_update_limits(rwb);
}
}
void wbt_set_write_cache(struct rq_wb *rwb, bool write_cache_on)
{
if (rwb)
rwb->wc = write_cache_on;
}
/*
* Disable wbt, if enabled by default. Only called from CFQ, if we have
* cgroups enabled
*/
void wbt_disable_default(struct request_queue *q)
{
struct rq_wb *rwb = q->rq_wb;
if (rwb && rwb->enable_state == WBT_STATE_ON_DEFAULT) {
del_timer_sync(&rwb->window_timer);
rwb->win_nsec = rwb->min_lat_nsec = 0;
wbt_update_limits(rwb);
}
}
EXPORT_SYMBOL_GPL(wbt_disable_default);
u64 wbt_default_latency_nsec(struct request_queue *q)
{
/*
* We default to 2msec for non-rotational storage, and 75msec
* for rotational storage.
*/
if (blk_queue_nonrot(q))
return 2000000ULL;
else
return 75000000ULL;
}
int wbt_init(struct request_queue *q)
{
struct rq_wb *rwb;
int i;
/*
* For now, we depend on the stats window being larger than
* our monitoring window. Ensure that this isn't inadvertently
* violated.
*/
BUILD_BUG_ON(RWB_WINDOW_NSEC > BLK_STAT_NSEC);
BUILD_BUG_ON(WBT_NR_BITS > BLK_STAT_RES_BITS);
rwb = kzalloc(sizeof(*rwb), GFP_KERNEL);
if (!rwb)
return -ENOMEM;
for (i = 0; i < WBT_NUM_RWQ; i++) {
atomic_set(&rwb->rq_wait[i].inflight, 0);
init_waitqueue_head(&rwb->rq_wait[i].wait);
}
setup_timer(&rwb->window_timer, wb_timer_fn, (unsigned long) rwb);
rwb->wc = 1;
rwb->queue_depth = RWB_DEF_DEPTH;
rwb->last_comp = rwb->last_issue = jiffies;
rwb->queue = q;
rwb->win_nsec = RWB_WINDOW_NSEC;
rwb->enable_state = WBT_STATE_ON_DEFAULT;
wbt_update_limits(rwb);
/*
* Assign rwb, and turn on stats tracking for this queue
*/
q->rq_wb = rwb;
blk_stat_enable(q);
rwb->min_lat_nsec = wbt_default_latency_nsec(q);
wbt_set_queue_depth(rwb, blk_queue_depth(q));
wbt_set_write_cache(rwb, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
return 0;
}
void wbt_exit(struct request_queue *q)
{
struct rq_wb *rwb = q->rq_wb;
if (rwb) {
del_timer_sync(&rwb->window_timer);
q->rq_wb = NULL;
kfree(rwb);
}
}
#ifndef WB_THROTTLE_H
#define WB_THROTTLE_H
#include <linux/kernel.h>
#include <linux/atomic.h>
#include <linux/wait.h>
#include <linux/timer.h>
#include <linux/ktime.h>
#include "blk-stat.h"
enum wbt_flags {
WBT_TRACKED = 1, /* write, tracked for throttling */
WBT_READ = 2, /* read */
WBT_KSWAPD = 4, /* write, from kswapd */
WBT_NR_BITS = 3, /* number of bits */
};
enum {
WBT_NUM_RWQ = 2,
};
/*
* Enable states. Either off, or on by default (done at init time),
* or on through manual setup in sysfs.
*/
enum {
WBT_STATE_ON_DEFAULT = 1,
WBT_STATE_ON_MANUAL = 2,
};
static inline void wbt_clear_state(struct blk_issue_stat *stat)
{
stat->time &= BLK_STAT_TIME_MASK;
}
static inline enum wbt_flags wbt_stat_to_mask(struct blk_issue_stat *stat)
{
return (stat->time & BLK_STAT_MASK) >> BLK_STAT_SHIFT;
}
static inline void wbt_track(struct blk_issue_stat *stat, enum wbt_flags wb_acct)
{
stat->time |= ((u64) wb_acct) << BLK_STAT_SHIFT;
}
static inline bool wbt_is_tracked(struct blk_issue_stat *stat)
{
return (stat->time >> BLK_STAT_SHIFT) & WBT_TRACKED;
}
static inline bool wbt_is_read(struct blk_issue_stat *stat)
{
return (stat->time >> BLK_STAT_SHIFT) & WBT_READ;
}
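The helpers above stash the wbt flags in the top bits of the issue timestamp. A minimal userspace sketch of that packing follows. The BLK_STAT_* constants live in blk-stat.h and are not shown here, so the SK_* values (3 reserved bits at the top of a 64-bit word) are assumptions chosen to be consistent with the BUILD_BUG_ON(WBT_NR_BITS > BLK_STAT_RES_BITS) check; all sk_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: the top SK_RES_BITS bits hold flags, the rest the time. */
#define SK_RES_BITS	3
#define SK_SHIFT	(64 - SK_RES_BITS)
#define SK_TIME_MASK	((UINT64_C(1) << SK_SHIFT) - 1)
#define SK_MASK		(~SK_TIME_MASK)

enum { SK_TRACKED = 1, SK_READ = 2, SK_KSWAPD = 4 };

/* Like wbt_track(): OR the flags into the reserved top bits. */
static uint64_t sk_track(uint64_t time, unsigned int flags)
{
	assert(flags < (1u << SK_RES_BITS));
	return time | ((uint64_t)flags << SK_SHIFT);
}

/* Like wbt_stat_to_mask(): extract the flags again. */
static unsigned int sk_flags(uint64_t time)
{
	return (unsigned int)((time & SK_MASK) >> SK_SHIFT);
}

/* Like wbt_clear_state(): drop the flags, keep the timestamp. */
static uint64_t sk_clear(uint64_t time)
{
	return time & SK_TIME_MASK;
}
```

The design choice this illustrates is that no extra per-request field is needed: the timestamp gives up a few high bits, and clearing the state restores a plain time value.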
struct rq_wait {
wait_queue_head_t wait;
atomic_t inflight;
};
struct rq_wb {
/*
* Settings that govern how we throttle
*/
unsigned int wb_background; /* background writeback */
unsigned int wb_normal; /* normal writeback */
unsigned int wb_max; /* max throughput writeback */
int scale_step;
bool scaled_max;
short enable_state; /* WBT_STATE_* */
/*
* Number of consecutive periods where we don't have enough
* information to make a firm scale up/down decision.
*/
unsigned int unknown_cnt;
u64 win_nsec; /* default window size */
u64 cur_win_nsec; /* current window size */
struct timer_list window_timer;
s64 sync_issue;
void *sync_cookie;
unsigned int wc;
unsigned int queue_depth;
unsigned long last_issue; /* last non-throttled issue */
unsigned long last_comp; /* last non-throttled comp */
unsigned long min_lat_nsec;
struct request_queue *queue;
struct rq_wait rq_wait[WBT_NUM_RWQ];
};
static inline unsigned int wbt_inflight(struct rq_wb *rwb)
{
unsigned int i, ret = 0;
for (i = 0; i < WBT_NUM_RWQ; i++)
ret += atomic_read(&rwb->rq_wait[i].inflight);
return ret;
}
#ifdef CONFIG_BLK_WBT
void __wbt_done(struct rq_wb *, enum wbt_flags);
void wbt_done(struct rq_wb *, struct blk_issue_stat *);
enum wbt_flags wbt_wait(struct rq_wb *, struct bio *, spinlock_t *);
int wbt_init(struct request_queue *);
void wbt_exit(struct request_queue *);
void wbt_update_limits(struct rq_wb *);
void wbt_requeue(struct rq_wb *, struct blk_issue_stat *);
void wbt_issue(struct rq_wb *, struct blk_issue_stat *);
void wbt_disable_default(struct request_queue *);
void wbt_set_queue_depth(struct rq_wb *, unsigned int);
void wbt_set_write_cache(struct rq_wb *, bool);
u64 wbt_default_latency_nsec(struct request_queue *);
#else
static inline void __wbt_done(struct rq_wb *rwb, enum wbt_flags flags)
{
}
static inline void wbt_done(struct rq_wb *rwb, struct blk_issue_stat *stat)
{
}
static inline enum wbt_flags wbt_wait(struct rq_wb *rwb, struct bio *bio,
spinlock_t *lock)
{
return 0;
}
static inline int wbt_init(struct request_queue *q)
{
return -EINVAL;
}
static inline void wbt_exit(struct request_queue *q)
{
}
static inline void wbt_update_limits(struct rq_wb *rwb)
{
}
static inline void wbt_requeue(struct rq_wb *rwb, struct blk_issue_stat *stat)
{
}
static inline void wbt_issue(struct rq_wb *rwb, struct blk_issue_stat *stat)
{
}
static inline void wbt_disable_default(struct request_queue *q)
{
}
static inline void wbt_set_queue_depth(struct rq_wb *rwb, unsigned int depth)
{
}
static inline void wbt_set_write_cache(struct rq_wb *rwb, bool wc)
{
}
static inline u64 wbt_default_latency_nsec(struct request_queue *q)
{
return 0;
}
#endif /* CONFIG_BLK_WBT */
#endif
/*
* Zoned block device handling
*
* Copyright (c) 2015, Hannes Reinecke
* Copyright (c) 2015, SUSE Linux GmbH
*
* Copyright (c) 2016, Damien Le Moal
* Copyright (c) 2016, Western Digital
*/
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/rbtree.h>
#include <linux/blkdev.h>
static inline sector_t blk_zone_start(struct request_queue *q,
sector_t sector)
{
sector_t zone_mask = blk_queue_zone_size(q) - 1;
return sector & ~zone_mask;
}
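blk_zone_start() relies on the zone size being a power of two, so that rounding down to the zone start is a single mask operation. A small userspace sketch of the same idea, with a hypothetical sk_zone_start() and the power-of-two requirement made explicit as an assertion (the kernel function itself does not check it):

```c
#include <assert.h>
#include <stdint.h>

/* Round 'sector' down to the first sector of its zone. */
static uint64_t sk_zone_start(uint64_t zone_sectors, uint64_t sector)
{
	/* The mask trick is only valid for power-of-two zone sizes. */
	assert(zone_sectors && (zone_sectors & (zone_sectors - 1)) == 0);

	return sector & ~(zone_sectors - 1);
}
```

For example, with 524288-sector zones (256 MiB at 512-byte sectors), any sector in the second zone maps to 524288.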
/*
* Check that a zone report belongs to the partition.
* If yes, fix its start sector and write pointer, copy it in the
* zone information array and return true. Return false otherwise.
*/
static bool blkdev_report_zone(struct block_device *bdev,
struct blk_zone *rep,
struct blk_zone *zone)
{
sector_t offset = get_start_sect(bdev);
if (rep->start < offset)
return false;
rep->start -= offset;
if (rep->start + rep->len > bdev->bd_part->nr_sects)
return false;
if (rep->type == BLK_ZONE_TYPE_CONVENTIONAL)
rep->wp = rep->start + rep->len;
else
rep->wp -= offset;
memcpy(zone, rep, sizeof(struct blk_zone));
return true;
}
/**
* blkdev_report_zones - Get zones information
* @bdev: Target block device
* @sector: Sector from which to report zones
* @zones: Array of zone structures in which to return the zone information
* @nr_zones: Number of zone structures in the zone array
* @gfp_mask: Memory allocation flags (for bio_alloc)
*
* Description:
* Get zone information starting from the zone containing @sector.
* The number of zones reported may be less than the number
* requested by @nr_zones. The number of zones actually reported is
* returned in @nr_zones.
*/
int blkdev_report_zones(struct block_device *bdev,
sector_t sector,
struct blk_zone *zones,
unsigned int *nr_zones,
gfp_t gfp_mask)
{
struct request_queue *q = bdev_get_queue(bdev);
struct blk_zone_report_hdr *hdr;
unsigned int nrz = *nr_zones;
struct page *page;
unsigned int nr_rep;
size_t rep_bytes;
unsigned int nr_pages;
struct bio *bio;
struct bio_vec *bv;
unsigned int i, n, nz;
unsigned int ofst;
void *addr;
int ret;
if (!q)
return -ENXIO;
if (!blk_queue_is_zoned(q))
return -EOPNOTSUPP;
if (!nrz)
return 0;
if (sector > bdev->bd_part->nr_sects) {
*nr_zones = 0;
return 0;
}
/*
* The zone report has a header. So make room for it in the
* payload. Also make sure that the report fits in a single BIO
* that will not be split down the stack.
*/
rep_bytes = sizeof(struct blk_zone_report_hdr) +
sizeof(struct blk_zone) * nrz;
rep_bytes = (rep_bytes + PAGE_SIZE - 1) & PAGE_MASK;
if (rep_bytes > (queue_max_sectors(q) << 9))
rep_bytes = queue_max_sectors(q) << 9;
nr_pages = min_t(unsigned int, BIO_MAX_PAGES,
rep_bytes >> PAGE_SHIFT);
nr_pages = min_t(unsigned int, nr_pages,
queue_max_segments(q));
bio = bio_alloc(gfp_mask, nr_pages);
if (!bio)
return -ENOMEM;
bio->bi_bdev = bdev;
bio->bi_iter.bi_sector = blk_zone_start(q, sector);
bio_set_op_attrs(bio, REQ_OP_ZONE_REPORT, 0);
for (i = 0; i < nr_pages; i++) {
page = alloc_page(gfp_mask);
if (!page) {
ret = -ENOMEM;
goto out;
}
if (!bio_add_page(bio, page, PAGE_SIZE, 0)) {
__free_page(page);
break;
}
}
if (i == 0)
ret = -ENOMEM;
else
ret = submit_bio_wait(bio);
if (ret)
goto out;
/*
* Process the report result: skip the header and go through the
* reported zones to fix up the zone information for partitions.
* At the same time, return the zone information into the zone
* array.
*/
n = 0;
nz = 0;
nr_rep = 0;
bio_for_each_segment_all(bv, bio, i) {
if (!bv->bv_page)
break;
addr = kmap_atomic(bv->bv_page);
/* Get header in the first page */
ofst = 0;
if (!nr_rep) {
hdr = (struct blk_zone_report_hdr *) addr;
nr_rep = hdr->nr_zones;
ofst = sizeof(struct blk_zone_report_hdr);
}
/* Fixup and report zones */
while (ofst < bv->bv_len &&
n < nr_rep && nz < nrz) {
if (blkdev_report_zone(bdev, addr + ofst, &zones[nz]))
nz++;
ofst += sizeof(struct blk_zone);
n++;
}
kunmap_atomic(addr);
if (n >= nr_rep || nz >= nrz)
break;
}
*nr_zones = nz;
out:
bio_for_each_segment_all(bv, bio, i)
__free_page(bv->bv_page);
bio_put(bio);
return ret;
}
EXPORT_SYMBOL_GPL(blkdev_report_zones);
/**
* blkdev_reset_zones - Reset zones write pointer
* @bdev: Target block device
* @sector: Start sector of the first zone to reset
* @nr_sectors: Number of sectors, at least the length of one zone
* @gfp_mask: Memory allocation flags (for bio_alloc)
*
* Description:
* Reset the write pointer of the zones contained in the range
* @sector..@sector+@nr_sectors. Specifying the entire disk sector range
* is valid, but the specified range should not contain conventional zones.
*/
int blkdev_reset_zones(struct block_device *bdev,
sector_t sector, sector_t nr_sectors,
gfp_t gfp_mask)
{
struct request_queue *q = bdev_get_queue(bdev);
sector_t zone_sectors;
sector_t end_sector = sector + nr_sectors;
struct bio *bio;
int ret;
if (!q)
return -ENXIO;
if (!blk_queue_is_zoned(q))
return -EOPNOTSUPP;
if (end_sector > bdev->bd_part->nr_sects)
/* Out of range */
return -EINVAL;
/* Check alignment (handle a possibly smaller last zone) */
zone_sectors = blk_queue_zone_size(q);
if (sector & (zone_sectors - 1))
return -EINVAL;
if ((nr_sectors & (zone_sectors - 1)) &&
end_sector != bdev->bd_part->nr_sects)
return -EINVAL;
while (sector < end_sector) {
bio = bio_alloc(gfp_mask, 0);
bio->bi_iter.bi_sector = sector;
bio->bi_bdev = bdev;
bio_set_op_attrs(bio, REQ_OP_ZONE_RESET, 0);
ret = submit_bio_wait(bio);
bio_put(bio);
if (ret)
return ret;
sector += zone_sectors;
/* This may take a while, so be nice to others */
cond_resched();
}
return 0;
}
EXPORT_SYMBOL_GPL(blkdev_reset_zones);
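The range validation at the top of blkdev_reset_zones() can be exercised on its own. The sketch below is a userspace approximation under stated assumptions: sk_reset_range_ok() is a hypothetical name, part_sectors stands in for bdev->bd_part->nr_sects, and the zone size is assumed to be a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if [sector, sector + nr_sectors) is a valid reset range. */
static bool sk_reset_range_ok(uint64_t zone_sectors, uint64_t part_sectors,
			      uint64_t sector, uint64_t nr_sectors)
{
	uint64_t end_sector = sector + nr_sectors;

	assert(zone_sectors && (zone_sectors & (zone_sectors - 1)) == 0);

	/* Out of range */
	if (end_sector > part_sectors)
		return false;

	/* The start must sit on a zone boundary. */
	if (sector & (zone_sectors - 1))
		return false;

	/* A partial length is only accepted for the smaller last zone. */
	if ((nr_sectors & (zone_sectors - 1)) && end_sector != part_sectors)
		return false;

	return true;
}
```

With 1024-sector zones on a 2560-sector partition, resetting sectors 2048..2559 passes even though the length is only half a zone, because the range ends exactly at the end of the partition.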
/**
* BLKREPORTZONE ioctl processing.
* Called from blkdev_ioctl.
*/
int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
unsigned int cmd, unsigned long arg)
{
void __user *argp = (void __user *)arg;
struct request_queue *q;
struct blk_zone_report rep;
struct blk_zone *zones;
int ret;
if (!argp)
return -EINVAL;
q = bdev_get_queue(bdev);
if (!q)
return -ENXIO;
if (!blk_queue_is_zoned(q))
return -ENOTTY;
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
if (copy_from_user(&rep, argp, sizeof(struct blk_zone_report)))
return -EFAULT;
if (!rep.nr_zones)
return -EINVAL;
zones = kcalloc(rep.nr_zones, sizeof(struct blk_zone), GFP_KERNEL);
if (!zones)
return -ENOMEM;
ret = blkdev_report_zones(bdev, rep.sector,
zones, &rep.nr_zones,
GFP_KERNEL);
if (ret)
goto out;
if (copy_to_user(argp, &rep, sizeof(struct blk_zone_report))) {
ret = -EFAULT;
goto out;
}
if (rep.nr_zones) {
if (copy_to_user(argp + sizeof(struct blk_zone_report), zones,
sizeof(struct blk_zone) * rep.nr_zones))
ret = -EFAULT;
}
out:
kfree(zones);
return ret;
}
/**
* BLKRESETZONE ioctl processing.
* Called from blkdev_ioctl.
*/
int blkdev_reset_zones_ioctl(struct block_device *bdev, fmode_t mode,
unsigned int cmd, unsigned long arg)
{
void __user *argp = (void __user *)arg;
struct request_queue *q;
struct blk_zone_range zrange;
if (!argp)
return -EINVAL;
q = bdev_get_queue(bdev);
if (!q)
return -ENXIO;
if (!blk_queue_is_zoned(q))
return -ENOTTY;
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
if (!(mode & FMODE_WRITE))
return -EBADF;
if (copy_from_user(&zrange, argp, sizeof(struct blk_zone_range)))
return -EFAULT;
return blkdev_reset_zones(bdev, zrange.sector, zrange.nr_sectors,
GFP_KERNEL);
}
......@@ -111,6 +111,7 @@ void blk_account_io_done(struct request *req);
enum rq_atomic_flags {
REQ_ATOM_COMPLETE = 0,
REQ_ATOM_STARTED,
REQ_ATOM_POLL_SLEPT,
};
/*
......@@ -130,7 +131,7 @@ static inline void blk_clear_rq_complete(struct request *rq)
/*
* Internal elevator interface
*/
#define ELV_ON_HASH(rq) ((rq)->cmd_flags & REQ_HASHED)
#define ELV_ON_HASH(rq) ((rq)->rq_flags & RQF_HASHED)
void blk_insert_flush(struct request *rq);
......@@ -247,7 +248,7 @@ extern int blk_update_nr_requests(struct request_queue *, unsigned int);
static inline int blk_do_io_stat(struct request *rq)
{
return rq->rq_disk &&
(rq->cmd_flags & REQ_IO_STAT) &&
(rq->rq_flags & RQF_IO_STAT) &&
(rq->cmd_type == REQ_TYPE_FS);
}
......
......@@ -161,6 +161,8 @@ static int bsg_create_job(struct device *dev, struct request *req)
* Drivers/subsys should pass this to the queue init function.
*/
void bsg_request_fn(struct request_queue *q)
__releases(q->queue_lock)
__acquires(q->queue_lock)
{
struct device *dev = q->queuedata;
struct request *req;
......
......@@ -176,7 +176,7 @@ static int blk_fill_sgv4_hdr_rq(struct request_queue *q, struct request *rq,
* Check if sg_io_v4 from user is allowed and valid
*/
static int
bsg_validate_sgv4_hdr(struct request_queue *q, struct sg_io_v4 *hdr, int *rw)
bsg_validate_sgv4_hdr(struct sg_io_v4 *hdr, int *rw)
{
int ret = 0;
......@@ -226,7 +226,7 @@ bsg_map_hdr(struct bsg_device *bd, struct sg_io_v4 *hdr, fmode_t has_write_perm,
hdr->dout_xfer_len, (unsigned long long) hdr->din_xferp,
hdr->din_xfer_len);
ret = bsg_validate_sgv4_hdr(q, hdr, &rw);
ret = bsg_validate_sgv4_hdr(hdr, &rw);
if (ret)
return ERR_PTR(ret);
......
......@@ -16,6 +16,7 @@
#include <linux/blktrace_api.h>
#include <linux/blk-cgroup.h>
#include "blk.h"
#include "blk-wbt.h"
/*
* tunables
......@@ -667,10 +668,10 @@ static inline void cfqg_put(struct cfq_group *cfqg)
} while (0)
static inline void cfqg_stats_update_io_add(struct cfq_group *cfqg,
struct cfq_group *curr_cfqg, int op,
int op_flags)
struct cfq_group *curr_cfqg,
unsigned int op)
{
blkg_rwstat_add(&cfqg->stats.queued, op, op_flags, 1);
blkg_rwstat_add(&cfqg->stats.queued, op, 1);
cfqg_stats_end_empty_time(&cfqg->stats);
cfqg_stats_set_start_group_wait_time(cfqg, curr_cfqg);
}
......@@ -684,30 +685,29 @@ static inline void cfqg_stats_update_timeslice_used(struct cfq_group *cfqg,
#endif
}
static inline void cfqg_stats_update_io_remove(struct cfq_group *cfqg, int op,
int op_flags)
static inline void cfqg_stats_update_io_remove(struct cfq_group *cfqg,
unsigned int op)
{
blkg_rwstat_add(&cfqg->stats.queued, op, op_flags, -1);
blkg_rwstat_add(&cfqg->stats.queued, op, -1);
}
static inline void cfqg_stats_update_io_merged(struct cfq_group *cfqg, int op,
int op_flags)
static inline void cfqg_stats_update_io_merged(struct cfq_group *cfqg,
unsigned int op)
{
blkg_rwstat_add(&cfqg->stats.merged, op, op_flags, 1);
blkg_rwstat_add(&cfqg->stats.merged, op, 1);
}
static inline void cfqg_stats_update_completion(struct cfq_group *cfqg,
uint64_t start_time, uint64_t io_start_time, int op,
int op_flags)
uint64_t start_time, uint64_t io_start_time,
unsigned int op)
{
struct cfqg_stats *stats = &cfqg->stats;
unsigned long long now = sched_clock();
if (time_after64(now, io_start_time))
blkg_rwstat_add(&stats->service_time, op, op_flags,
now - io_start_time);
blkg_rwstat_add(&stats->service_time, op, now - io_start_time);
if (time_after64(io_start_time, start_time))
blkg_rwstat_add(&stats->wait_time, op, op_flags,
blkg_rwstat_add(&stats->wait_time, op,
io_start_time - start_time);
}
......@@ -786,16 +786,16 @@ static inline void cfqg_put(struct cfq_group *cfqg) { }
#define cfq_log_cfqg(cfqd, cfqg, fmt, args...) do {} while (0)
static inline void cfqg_stats_update_io_add(struct cfq_group *cfqg,
struct cfq_group *curr_cfqg, int op, int op_flags) { }
struct cfq_group *curr_cfqg, unsigned int op) { }
static inline void cfqg_stats_update_timeslice_used(struct cfq_group *cfqg,
uint64_t time, unsigned long unaccounted_time) { }
static inline void cfqg_stats_update_io_remove(struct cfq_group *cfqg, int op,
int op_flags) { }
static inline void cfqg_stats_update_io_merged(struct cfq_group *cfqg, int op,
int op_flags) { }
static inline void cfqg_stats_update_io_remove(struct cfq_group *cfqg,
unsigned int op) { }
static inline void cfqg_stats_update_io_merged(struct cfq_group *cfqg,
unsigned int op) { }
static inline void cfqg_stats_update_completion(struct cfq_group *cfqg,
uint64_t start_time, uint64_t io_start_time, int op,
int op_flags) { }
uint64_t start_time, uint64_t io_start_time,
unsigned int op) { }
#endif /* CONFIG_CFQ_GROUP_IOSCHED */
......@@ -912,15 +912,6 @@ static inline struct cfq_data *cic_to_cfqd(struct cfq_io_cq *cic)
return cic->icq.q->elevator->elevator_data;
}
/*
* We regard a request as SYNC, if it's either a read or has the SYNC bit
* set (in which case it could also be direct WRITE).
*/
static inline bool cfq_bio_sync(struct bio *bio)
{
return bio_data_dir(bio) == READ || (bio->bi_opf & REQ_SYNC);
}
/*
* scheduler run of queue, if there are requests pending and no one in the
* driver that will restart queueing
......@@ -1596,7 +1587,7 @@ static struct blkcg_policy_data *cfq_cpd_alloc(gfp_t gfp)
{
struct cfq_group_data *cgd;
cgd = kzalloc(sizeof(*cgd), GFP_KERNEL);
cgd = kzalloc(sizeof(*cgd), gfp);
if (!cgd)
return NULL;
return &cgd->cpd;
......@@ -2474,10 +2465,10 @@ static void cfq_reposition_rq_rb(struct cfq_queue *cfqq, struct request *rq)
{
elv_rb_del(&cfqq->sort_list, rq);
cfqq->queued[rq_is_sync(rq)]--;
cfqg_stats_update_io_remove(RQ_CFQG(rq), req_op(rq), rq->cmd_flags);
cfqg_stats_update_io_remove(RQ_CFQG(rq), rq->cmd_flags);
cfq_add_rq_rb(rq);
cfqg_stats_update_io_add(RQ_CFQG(rq), cfqq->cfqd->serving_group,
req_op(rq), rq->cmd_flags);
rq->cmd_flags);
}
static struct request *
......@@ -2491,7 +2482,7 @@ cfq_find_rq_fmerge(struct cfq_data *cfqd, struct bio *bio)
if (!cic)
return NULL;
cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
cfqq = cic_to_cfqq(cic, op_is_sync(bio->bi_opf));
if (cfqq)
return elv_rb_find(&cfqq->sort_list, bio_end_sector(bio));
......@@ -2530,7 +2521,7 @@ static void cfq_remove_request(struct request *rq)
cfq_del_rq_rb(rq);
cfqq->cfqd->rq_queued--;
cfqg_stats_update_io_remove(RQ_CFQG(rq), req_op(rq), rq->cmd_flags);
cfqg_stats_update_io_remove(RQ_CFQG(rq), rq->cmd_flags);
if (rq->cmd_flags & REQ_PRIO) {
WARN_ON(!cfqq->prio_pending);
cfqq->prio_pending--;
......@@ -2565,7 +2556,7 @@ static void cfq_merged_request(struct request_queue *q, struct request *req,
static void cfq_bio_merged(struct request_queue *q, struct request *req,
struct bio *bio)
{
cfqg_stats_update_io_merged(RQ_CFQG(req), bio_op(bio), bio->bi_opf);
cfqg_stats_update_io_merged(RQ_CFQG(req), bio->bi_opf);
}
static void
......@@ -2588,7 +2579,7 @@ cfq_merged_requests(struct request_queue *q, struct request *rq,
if (cfqq->next_rq == next)
cfqq->next_rq = rq;
cfq_remove_request(next);
cfqg_stats_update_io_merged(RQ_CFQG(rq), req_op(next), next->cmd_flags);
cfqg_stats_update_io_merged(RQ_CFQG(rq), next->cmd_flags);
cfqq = RQ_CFQQ(next);
/*
......@@ -2605,13 +2596,14 @@ static int cfq_allow_bio_merge(struct request_queue *q, struct request *rq,
struct bio *bio)
{
struct cfq_data *cfqd = q->elevator->elevator_data;
bool is_sync = op_is_sync(bio->bi_opf);
struct cfq_io_cq *cic;
struct cfq_queue *cfqq;
/*
* Disallow merge of a sync bio into an async request.
*/
if (cfq_bio_sync(bio) && !rq_is_sync(rq))
if (is_sync && !rq_is_sync(rq))
return false;
/*
......@@ -2622,7 +2614,7 @@ static int cfq_allow_bio_merge(struct request_queue *q, struct request *rq,
if (!cic)
return false;
cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
cfqq = cic_to_cfqq(cic, is_sync);
return cfqq == RQ_CFQQ(rq);
}
......@@ -3771,9 +3763,11 @@ static void check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio)
struct cfq_data *cfqd = cic_to_cfqd(cic);
struct cfq_queue *cfqq;
uint64_t serial_nr;
bool nonroot_cg;
rcu_read_lock();
serial_nr = bio_blkcg(bio)->css.serial_nr;
nonroot_cg = bio_blkcg(bio) != &blkcg_root;
rcu_read_unlock();
/*
......@@ -3783,6 +3777,14 @@ static void check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio)
if (unlikely(!cfqd) || likely(cic->blkcg_serial_nr == serial_nr))
return;
/*
* If we have a non-root cgroup, we can depend on that to
* do proper throttling of writes. Turn off wbt for that
* case, if it was enabled by default.
*/
if (nonroot_cg)
wbt_disable_default(cfqd->queue);
/*
* Drop reference to queues. New queues will be assigned in new
* group upon arrival of fresh requests.
@@ -3854,7 +3856,8 @@ cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct cfq_io_cq *cic,
 goto out;
 }
-cfqq = kmem_cache_alloc_node(cfq_pool, GFP_NOWAIT | __GFP_ZERO,
+cfqq = kmem_cache_alloc_node(cfq_pool,
+GFP_NOWAIT | __GFP_ZERO | __GFP_NOWARN,
 cfqd->queue->node);
 if (!cfqq) {
 cfqq = &cfqd->oom_cfqq;
@@ -3923,6 +3926,12 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 cfqq->seek_history |= (sdist > CFQQ_SEEK_THR);
 }
+static inline bool req_noidle(struct request *req)
+{
+return req_op(req) == REQ_OP_WRITE &&
+(req->cmd_flags & (REQ_SYNC | REQ_IDLE)) == REQ_SYNC;
+}
 /*
 * Disable idle window if the process thinks too long or seeks so much that
 * it doesn't matter
@@ -3944,7 +3953,7 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 if (cfqq->queued[0] + cfqq->queued[1] >= 4)
 cfq_mark_cfqq_deep(cfqq);
-if (cfqq->next_rq && (cfqq->next_rq->cmd_flags & REQ_NOIDLE))
+if (cfqq->next_rq && req_noidle(cfqq->next_rq))
 enable_idle = 0;
 else if (!atomic_read(&cic->icq.ioc->active_ref) ||
 !cfqd->cfq_slice_idle ||
@@ -4142,7 +4151,7 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq)
 rq->fifo_time = ktime_get_ns() + cfqd->cfq_fifo_expire[rq_is_sync(rq)];
 list_add_tail(&rq->queuelist, &cfqq->fifo);
 cfq_add_rq_rb(rq);
-cfqg_stats_update_io_add(RQ_CFQG(rq), cfqd->serving_group, req_op(rq),
+cfqg_stats_update_io_add(RQ_CFQG(rq), cfqd->serving_group,
 rq->cmd_flags);
 cfq_rq_enqueued(cfqd, cfqq, rq);
 }
@@ -4229,8 +4238,7 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 const int sync = rq_is_sync(rq);
 u64 now = ktime_get_ns();
-cfq_log_cfqq(cfqd, cfqq, "complete rqnoidle %d",
-!!(rq->cmd_flags & REQ_NOIDLE));
+cfq_log_cfqq(cfqd, cfqq, "complete rqnoidle %d", req_noidle(rq));
 cfq_update_hw_tag(cfqd);
@@ -4240,8 +4248,7 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 cfqq->dispatched--;
 (RQ_CFQG(rq))->dispatched--;
 cfqg_stats_update_completion(cfqq->cfqg, rq_start_time_ns(rq),
-rq_io_start_time_ns(rq), req_op(rq),
-rq->cmd_flags);
+rq_io_start_time_ns(rq), rq->cmd_flags);
 cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]--;
@@ -4319,14 +4326,14 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 cfq_schedule_dispatch(cfqd);
 }
-static void cfqq_boost_on_prio(struct cfq_queue *cfqq, int op_flags)
+static void cfqq_boost_on_prio(struct cfq_queue *cfqq, unsigned int op)
 {
 /*
 * If REQ_PRIO is set, boost class and prio level, if it's below
 * BE/NORM. If prio is not set, restore the potentially boosted
 * class/prio level.
 */
-if (!(op_flags & REQ_PRIO)) {
+if (!(op & REQ_PRIO)) {
 cfqq->ioprio_class = cfqq->org_ioprio_class;
 cfqq->ioprio = cfqq->org_ioprio;
 } else {
@@ -4347,7 +4354,7 @@ static inline int __cfq_may_queue(struct cfq_queue *cfqq)
 return ELV_MQUEUE_MAY;
 }
-static int cfq_may_queue(struct request_queue *q, int op, int op_flags)
+static int cfq_may_queue(struct request_queue *q, unsigned int op)
 {
 struct cfq_data *cfqd = q->elevator->elevator_data;
 struct task_struct *tsk = current;
@@ -4364,10 +4371,10 @@ static int cfq_may_queue(struct request_queue *q, int op, int op_flags)
 if (!cic)
 return ELV_MQUEUE_MAY;
-cfqq = cic_to_cfqq(cic, rw_is_sync(op, op_flags));
+cfqq = cic_to_cfqq(cic, op_is_sync(op));
 if (cfqq) {
 cfq_init_prio_data(cfqq, cic);
-cfqq_boost_on_prio(cfqq, op_flags);
+cfqq_boost_on_prio(cfqq, op);
 return __cfq_may_queue(cfqq);
 }
......
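The CFQ hunks above fold the operation and its modifier flags into a single `unsigned int`, and the new `req_noidle()` helper tests `REQ_SYNC` and `REQ_IDLE` where the old code tested `REQ_NOIDLE`. As a rough illustration of that check outside the kernel, here is a minimal standalone sketch. The bit positions (`REQ_OP_BITS`, `REQ_SYNC`, `REQ_IDLE`) are stand-in values chosen for this example and are not copied from the kernel headers; `op_of()` is a hypothetical helper, and the body of `op_is_sync()` is an assumption about how the kernel helper behaves (reads treated as synchronous).

```c
#include <stdbool.h>

/* Stand-in values for illustration only; the real definitions live in
 * the kernel's block headers and may differ. */
#define REQ_OP_BITS	8
#define REQ_OP_MASK	((1u << REQ_OP_BITS) - 1)
#define REQ_OP_READ	0u
#define REQ_OP_WRITE	1u
#define REQ_SYNC	(1u << 11)
#define REQ_IDLE	(1u << 12)

/* Hypothetical helper: extract the operation from the combined word. */
static inline unsigned int op_of(unsigned int cmd_flags)
{
	return cmd_flags & REQ_OP_MASK;
}

/* Assumed behaviour: a read counts as sync, as does anything that
 * carries REQ_SYNC. */
static inline bool op_is_sync(unsigned int op)
{
	return op_of(op) == REQ_OP_READ || (op & REQ_SYNC) != 0;
}

/* Same test as the req_noidle() added in the diff, applied to a bare
 * flags word: a write marked REQ_SYNC but not REQ_IDLE. */
static inline bool req_noidle(unsigned int cmd_flags)
{
	return op_of(cmd_flags) == REQ_OP_WRITE &&
	       (cmd_flags & (REQ_SYNC | REQ_IDLE)) == REQ_SYNC;
}
```

With these stand-ins, `req_noidle(REQ_OP_WRITE | REQ_SYNC)` is true, while adding `REQ_IDLE` or dropping `REQ_SYNC` makes it false.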
This diff is collapsed.
@@ -519,6 +519,10 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
 BLKDEV_DISCARD_SECURE);
 case BLKZEROOUT:
 return blk_ioctl_zeroout(bdev, mode, arg);
+case BLKREPORTZONE:
+return blkdev_report_zones_ioctl(bdev, mode, cmd, arg);
+case BLKRESETZONE:
+return blkdev_reset_zones_ioctl(bdev, mode, cmd, arg);
 case HDIO_GETGEO:
 return blkdev_getgeo(bdev, argp);
 case BLKRAGET:
......
This diff is collapsed.
@@ -384,9 +384,12 @@ config BLK_DEV_RAM_DAX
 allocated from highmem (only a problem for highmem systems).
 config CDROM_PKTCDVD
-tristate "Packet writing on CD/DVD media"
+tristate "Packet writing on CD/DVD media (DEPRECATED)"
 depends on !UML
 help
+Note: This driver is deprecated and will be removed from the
+kernel in the near future!
 If you have a CDROM/DVD drive that supports packet writing, say
 Y to include support. It should work with any MMC/Mt Fuji
 compliant ATAPI or SCSI drive, which is just about any newer
This diff is collapsed.
@@ -148,7 +148,7 @@ static int _drbd_md_sync_page_io(struct drbd_device *device,
 if ((op == REQ_OP_WRITE) && !test_bit(MD_NO_FUA, &device->flags))
 op_flags |= REQ_FUA | REQ_PREFLUSH;
-op_flags |= REQ_SYNC | REQ_NOIDLE;
+op_flags |= REQ_SYNC;
 bio = bio_alloc_drbd(GFP_NOIO);
 bio->bi_bdev = bdev->md_bdev;
......
This diff is collapsed.
@@ -3806,14 +3806,10 @@ static int __floppy_read_block_0(struct block_device *bdev, int drive)
 cbdata.drive = drive;
-bio_init(&bio);
-bio.bi_io_vec = &bio_vec;
-bio_vec.bv_page = page;
-bio_vec.bv_len = size;
-bio_vec.bv_offset = 0;
-bio.bi_vcnt = 1;
-bio.bi_iter.bi_size = size;
+bio_init(&bio, &bio_vec, 1);
 bio.bi_bdev = bdev;
+bio_add_page(&bio, page, size, 0);
 bio.bi_iter.bi_sector = 0;
 bio.bi_flags |= (1 << BIO_QUIET);
 bio.bi_private = &cbdata;
......
This diff is collapsed.