提交 · 86ff7c2a80cd357f6156a53b354f6a0b357dc0c9 · openanolis / cloud-kernel

31 1月, 2018 1 次提交

blk-mq: introduce BLK_STS_DEV_RESOURCE · 86ff7c2a

由 Ming Lei 提交于 1月 30, 2018

This status is returned from driver to block layer if device related
resource is unavailable, but driver can guarantee that IO dispatch
will be triggered in future when the resource is available.

Convert some drivers to return BLK_STS_DEV_RESOURCE.  Also, if driver
returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun queue after
a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls.  BLK_MQ_DELAY_QUEUE is
3 ms because both scsi-mq and nvmefc are using that magic value.

If a driver can make sure there is in-flight IO, it is safe to return
BLK_STS_DEV_RESOURCE because:

1) If all in-flight IOs complete before examining SCHED_RESTART in
blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue
is run immediately in this case by blk_mq_dispatch_rq_list();

2) if there is any in-flight IO after/when examining SCHED_RESTART
in blk_mq_dispatch_rq_list():
- if SCHED_RESTART isn't set, queue is run immediately as handled in 1)
- otherwise, this request will be dispatched after any in-flight IO is
  completed via blk_mq_sched_restart()

3) if SCHED_RESTART is set concurently in context because of
BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two
cases and make sure IO hang can be avoided.

One invariant is that queue will be rerun if SCHED_RESTART is set.
Suggested-by: NJens Axboe <axboe@kernel.dk>
Tested-by: NLaurence Oberman <loberman@redhat.com>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

86ff7c2a

25 1月, 2018 2 次提交

bsg: use pr_debug instead of hand crafted macros · 3124b65d

由 Johannes Thumshirn 提交于 1月 24, 2018

Use pr_debug instead of hand crafted macros. This way it is not needed to
re-compile the kernel to enable bsg debug outputs and it's possible to
selectively enable specific prints.

Cc: Joe Perches <joe@perches.com>
Reviewed-by: NBart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: NJohannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

3124b65d

blk-mq-debugfs: don't allow write on attributes with seq_operations set · 6b136a24

由 Eryu Guan 提交于 1月 24, 2018

Attributes that only implement .seq_ops are read-only, any write to
them should be rejected. But currently kernel would crash when
writing to such debugfs entries, e.g.

chmod +w /sys/kernel/debug/block/<dev>/requeue_list
echo 0 > /sys/kernel/debug/block/<dev>/requeue_list
chmod -w /sys/kernel/debug/block/<dev>/requeue_list

Fix it by returning -EPERM in blk_mq_debugfs_write() when writing to
such attributes.

Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NEryu Guan <eguan@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

6b136a24

24 1月, 2018 1 次提交

block: Set BIO_TRACE_COMPLETION on new bio during split · 20d59023

由 Goldwyn Rodrigues 提交于 1月 23, 2018

We inadvertently set it again on the source bio, but we need
to set it on the new split bio instead.

Fixes: fbbaf700 ("block: trace completion of all bios.")
Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

20d59023

20 1月, 2018 4 次提交

blk-throttle: use queue_is_rq_based · 475a055e

由 weiping zhang 提交于 1月 20, 2018

use queue_is_rq_based instead of open code.
Signed-off-by: Nweiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

475a055e

block: Remove kblockd_schedule_delayed_work{,_on}() · f5ced52a

由 Bart Van Assche 提交于 1月 19, 2018

The previous patch removed all users of these two functions. Hence
also remove the functions themselves.
Reviewed-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

f5ced52a

blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays · ae943d20

由 Bart Van Assche 提交于 1月 19, 2018

Make sure that calling blk_mq_run_hw_queue() or
blk_mq_kick_requeue_list() triggers a queue run without delay even
if blk_mq_delay_run_hw_queue() has been called recently and if its
delay has not yet expired.
Reviewed-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

ae943d20

blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly() · c77ff7fd

由 Bart Van Assche 提交于 1月 19, 2018

Most blk-mq functions have a name that follows the pattern blk_mq_${action}.
However, the function name blk_mq_request_direct_issue is an exception.
Hence rename this function. This patch does not change any functionality.
Reviewed-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

c77ff7fd

19 1月, 2018 7 次提交

blk-throttle: track read and write request individually · b889bf66

由 Joseph Qi 提交于 11月 21, 2017

In mixed read/write workload on SSD, write latency is much lower than
read. But now we only track and record read latency and then use it as
threshold base for both read and write io latency accounting. As a
result, write io latency will always be considered as good and
bad_bio_cnt is much smaller than 20% of bio_cnt. That is to mean, the
tg to be checked will be treated as idle most of the time and still let
others dispatch more ios, even it is truly running under low limit and
wants its low limit to be guaranteed, which is not we expected in fact.
So track read and write request individually, which can bring more
precise latency control for low limit idle detection.
Signed-off-by: NJoseph Qi <qijiang.qj@alibaba-inc.com>
Reviewed-by: NShaohua Li <shli@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b889bf66

block: add bdev_read_only() checks to common helpers · a13553c7

由 Ilya Dryomov 提交于 1月 11, 2018

Similar to blkdev_write_iter(), return -EPERM if the partition is
read-only.  This covers ioctl(), fallocate() and most in-kernel users
but isn't meant to be exhaustive -- everything else will be caught in
generic_make_request_checks(), fail with -EIO and can be fixed later.
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a13553c7

block: fail op_is_write() requests to read-only partitions · 721c7fc7

由 Ilya Dryomov 提交于 1月 11, 2018

Regular block device writes go through blkdev_write_iter(), which does
bdev_read_only(), while zeroout/discard/etc requests are never checked,
both userspace- and kernel-triggered.  Add a generic catch-all check to
generic_make_request_checks() to actually enforce ioctl(BLKROSET) and
set_disk_ro(), which is used by quite a few drivers for things like
snapshots, read-only backing files/images, etc.
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

721c7fc7

blk-throttle: export io_serviced_recursive, io_service_bytes_recursive · 17534c6f

由 weiping zhang 提交于 12月 11, 2017

export these two interface for cgroup-v1.
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: Nweiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

17534c6f

block: Protect less code with sysfs_lock in blk_{un,}register_queue() · 2c2086af

由 Bart Van Assche 提交于 1月 17, 2018

The __blk_mq_register_dev(), blk_mq_unregister_dev(),
elv_register_queue() and elv_unregister_queue() calls need to be
protected with sysfs_lock but other code in these functions not.
Hence protect only this code with sysfs_lock. This patch fixes a
locking inversion issue in blk_unregister_queue() and also in an
error path of blk_register_queue(): it is not allowed to hold
sysfs_lock around the kobject_del(&q->kobj) call.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

2c2086af

block: Document scheduler modification locking requirements · 14a23498

由 Bart Van Assche 提交于 1月 17, 2018

This patch does not change any functionality.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

14a23498

block: Unexport elv_register_queue() and elv_unregister_queue() · 83d016ac

由 Bart Van Assche 提交于 1月 17, 2018

These two functions are only called from inside the block layer so
unexport them.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

83d016ac

18 1月, 2018 9 次提交

block, bfq: limit sectors served with interactive weight raising · 8a8747dc

由 Paolo Valente 提交于 1月 13, 2018

To maximise responsiveness, BFQ raises the weight, and performs device
idling, for bfq_queues associated with processes deemed as
interactive. In particular, weight raising has a maximum duration,
equal to the time needed to start a large application. If a
weight-raised process goes on doing I/O beyond this maximum duration,
it loses weight-raising.

This mechanism is evidently vulnerable to the following false
positives: I/O-bound applications that will go on doing I/O for much
longer than the duration of weight-raising. These applications have
basically no benefit from being weight-raised at the beginning of
their I/O. On the opposite end, while being weight-raised, these
applications
a) unjustly steal throughput to applications that may truly need
low latency;
b) make BFQ uselessly perform device idling; device idling results
in loss of device throughput with most flash-based storage, and may
increase latencies when used purposelessly.

This commit adds a countermeasure to reduce both the above
problems. To introduce this countermeasure, we provide the following
extra piece of information (full details in the comments added by this
commit). During the start-up of the large application used as a
reference to set the duration of weight-raising, involved processes
transfer at most ~110K sectors each. Accordingly, a process initially
deemed as interactive has no right to be weight-raised any longer,
once transferred 110K sectors or more.

Basing on this consideration, this commit early-ends weight-raising
for a bfq_queue if the latter happens to have received an amount of
service at least equal to 110K sectors (actually, a little bit more,
to keep a safety margin). I/O-bound applications that reach a high
throughput, such as file copy, get to this threshold much before the
allowed weight-raising period finishes. Thus this early ending of
weight-raising reduces the amount of time during which these
applications cause the problems described above.
Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

8a8747dc

block, bfq: limit tags for writes and async I/O · a52a69ea

由 Paolo Valente 提交于 1月 13, 2018

Asynchronous I/O can easily starve synchronous I/O (both sync reads
and sync writes), by consuming all request tags. Similarly, storms of
synchronous writes, such as those that sync(2) may trigger, can starve
synchronous reads. In their turn, these two problems may also cause
BFQ to loose control on latency for interactive and soft real-time
applications. For example, on a PLEXTOR PX-256M5S SSD, LibreOffice
Writer takes 0.6 seconds to start if the device is idle, but it takes
more than 45 seconds (!) if there are sequential writes in the
background.

This commit addresses this issue by limiting the maximum percentage of
tags that asynchronous I/O requests and synchronous write requests can
consume. In particular, this commit grants a higher threshold to
synchronous writes, to prevent the latter from being starved by
asynchronous I/O.

According to the above test, LibreOffice Writer now starts in about
1.2 seconds on average, regardless of the background workload, and
apart from some rare outlier. To check this improvement, run, e.g.,
sudo ./comm_startup_lat.sh bfq 5 5 seq 10 "lowriter --terminate_after_init"
for the comm_startup_lat benchmark in the S suite [1].

[1] https://github.com/Algodev-github/STested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a52a69ea

blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busy · 23d4ee19

由 Ming Lei 提交于 1月 18, 2018

If we run into blk_mq_request_direct_issue(), when queue is busy, we
don't want to dispatch this request into hctx->dispatch_list, and
what we need to do is to return the queue busy info to caller, so
that caller can deal with it well.

Fixes: 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
Reported-by: NLaurence Oberman <loberman@redhat.com>
Reviewed-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

23d4ee19

block: Fix __bio_integrity_endio() documentation · de99a346

由 Bart Van Assche 提交于 1月 16, 2018

Fixes: 4246a0b6 ("block: add a bi_error field to struct bio")
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

de99a346

blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request · 9e97d295

由 Mike Snitzer 提交于 1月 17, 2018

After commit:

923218f6 ("blk-mq: don't allocate driver tag upfront for flush rq")

we no longer use the 'can_block' argument in
blk_mq_sched_insert_request(). Kill it.
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

Added actual commit message as to why it's being removed.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

9e97d295

blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback · 396eaf21

由 Ming Lei 提交于 1月 17, 2018

blk_insert_cloned_request() is called in the fast path of a dm-rq driver
(e.g. blk-mq request-based DM mpath).  blk_insert_cloned_request() uses
blk_mq_request_bypass_insert() to directly append the request to the
blk-mq hctx->dispatch_list of the underlying queue.

1) This way isn't efficient enough because the hctx spinlock is always
used.

2) With blk_insert_cloned_request(), we completely bypass underlying
queue's elevator and depend on the upper-level dm-rq driver's elevator
to schedule IO.  But dm-rq currently can't get the underlying queue's
dispatch feedback at all.  Without knowing whether a request was issued
or not (e.g. due to underlying queue being busy) the dm-rq elevator will
not be able to provide effective IO merging (as a side-effect of dm-rq
currently blindly destaging a request from its elevator only to requeue
it after a delay, which kills any opportunity for merging).  This
obviously causes very bad sequential IO performance.

Fix this by updating blk_insert_cloned_request() to use
blk_mq_request_direct_issue().  blk_mq_request_direct_issue() allows a
request to be issued directly to the underlying queue and returns the
dispatch feedback (blk_status_t).  If blk_mq_request_direct_issue()
returns BLK_SYS_RESOURCE the dm-rq driver will now use DM_MAPIO_REQUEUE
to _not_ destage the request.  Whereby preserving the opportunity to
merge IO.

With this, request-based DM's blk-mq sequential IO performance is vastly
improved (as much as 3X in mpath/virtio-scsi testing).
Signed-off-by: NMing Lei <ming.lei@redhat.com>
[blk-mq.c changes heavily influenced by Ming Lei's initial solution, but
they were refactored to make them less fragile and easier to read/review]
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

396eaf21

blk-mq: factor out a few helpers from __blk_mq_try_issue_directly · 0f95549c

由 Mike Snitzer 提交于 1月 17, 2018

No functional change.  Just makes code flow more logically.

In following commit, __blk_mq_try_issue_directly() will be used to
return the dispatch result (blk_status_t) to DM.  DM needs this
information to improve IO merging.
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

0f95549c

blk-mq: turn WARN_ON in __blk_mq_run_hw_queue into printk · 7df938fb

由 Ming Lei 提交于 1月 18, 2018

We know this WARN_ON is harmless and in reality it may be trigged,
so convert it to printk() and dump_stack() to avoid to confusing
people.

Also add comment about two releated races here.

Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

7df938fb

blk-mq: make sure hctx->next_cpu is set correctly · 7bed4595

由 Ming Lei 提交于 1月 18, 2018

When hctx->next_cpu is set from possible online CPUs, there is one
race in which hctx->next_cpu may be set as >= nr_cpu_ids, and finally
break workqueue.

The race can be triggered in the following two sitations:

1) when one CPU is becoming DEAD, blk_mq_hctx_notify_dead() is called
to dispatch requests from the DEAD cpu context, but at that
time, this DEAD CPU has been cleared from 'cpu_online_mask', so all
CPUs in hctx->cpumask may become offline, and cause hctx->next_cpu set
a bad value.

2) blk_mq_delay_run_hw_queue() is called from CPU B, and found the queue
should be run on the other CPU A, then CPU A may become offline at the
same time and all CPUs in hctx->cpumask become offline.

This patch deals with this issue by re-selecting next CPU, and making
sure it is set correctly.

Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: N"jianchao.wang" <jianchao.w.wang@oracle.com>
Tested-by: N"jianchao.wang" <jianchao.w.wang@oracle.com>
Fixes: 20e4d813 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

7bed4595

15 1月, 2018 5 次提交

blk_rq_map_user_iov: fix error override · 69e0927b

由 Douglas Gilbert 提交于 1月 14, 2018

During stress tests by syzkaller on the sg driver the block layer
infrequently returns EINVAL. Closer inspection shows the block
layer was trying to return ENOMEM (which is much more
understandable) but for some reason overroad that useful error.

Patch below does not show this (unchanged) line:
   ret =__blk_rq_map_user_iov(rq, map_data, &i, gfp_mask, copy);
That 'ret' was being overridden when that function failed.
Signed-off-by: NDouglas Gilbert <dgilbert@interlog.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

69e0927b

block: allow gendisk's request_queue registration to be deferred · fa70d2e2

由 Mike Snitzer 提交于 1月 08, 2018

Since I can remember DM has forced the block layer to allow the
allocation and initialization of the request_queue to be distinct
operations.  Reason for this is block/genhd.c:add_disk() has requires
that the request_queue (and associated bdi) be tied to the gendisk
before add_disk() is called -- because add_disk() also deals with
exposing the request_queue via blk_register_queue().

DM's dynamic creation of arbitrary device types (and associated
request_queue types) requires the DM device's gendisk be available so
that DM table loads can establish a master/slave relationship with
subordinate devices that are referenced by loaded DM tables -- using
bd_link_disk_holder().  But until these DM tables, and their associated
subordinate devices, are known DM cannot know what type of request_queue
it needs -- nor what its queue_limits should be.

This chicken and egg scenario has created all manner of problems for DM
and, at times, the block layer.

Summary of changes:

- Add device_add_disk_no_queue_reg() and add_disk_no_queue_reg() variant
  that drivers may use to add a disk without also calling
  blk_register_queue().  Driver must call blk_register_queue() once its
  request_queue is fully initialized.

- Return early from blk_unregister_queue() if QUEUE_FLAG_REGISTERED
  is not set.  It won't be set if driver used add_disk_no_queue_reg()
  but driver encounters an error and must del_gendisk() before calling
  blk_register_queue().

- Export blk_register_queue().

These changes allow DM to use add_disk_no_queue_reg() to anchor its
gendisk as the "master" for master/slave relationships DM must establish
with subordinate devices referenced in DM tables that get loaded.  Once
all "slave" devices for a DM device are known its request_queue can be
properly initialized and then advertised via sysfs -- important
improvement being that no request_queue resource initialization
performed by blk_register_queue() is missed for DM devices anymore.
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Reviewed-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

fa70d2e2

block: properly protect the 'queue' kobj in blk_unregister_queue · 667257e8

由 Mike Snitzer 提交于 1月 11, 2018

The original commit e9a823fb (block: fix warning when I/O elevator
is changed as request_queue is being removed) is pretty conflated.
"conflated" because the resource being protected by q->sysfs_lock isn't
the queue_flags (it is the 'queue' kobj).

q->sysfs_lock serializes __elevator_change() (via elv_iosched_store)
from racing with blk_unregister_queue():
1) By holding q->sysfs_lock first, __elevator_change() can complete
before a racing blk_unregister_queue().
2) Conversely, __elevator_change() is testing for QUEUE_FLAG_REGISTERED
in case elv_iosched_store() loses the race with blk_unregister_queue(),
it needs a way to know the 'queue' kobj isn't there.

Expand the scope of blk_unregister_queue()'s q->sysfs_lock use so it is
held until after the 'queue' kobj is removed.

To do so blk_mq_unregister_dev() must not also take q->sysfs_lock.  So
rename __blk_mq_unregister_dev() to blk_mq_unregister_dev().

Also, blk_unregister_queue() should use q->queue_lock to protect against
any concurrent writes to q->queue_flags -- even though chances are the
queue is being cleaned up so no concurrent writes are likely.

Fixes: e9a823fb ("block: fix warning when I/O elevator is changed as request_queue is being removed")
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Reviewed-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

667257e8

block: only bdi_unregister() in del_gendisk() if !GENHD_FL_HIDDEN · bc8d062c

由 Mike Snitzer 提交于 1月 09, 2018

device_add_disk() will only call bdi_register_owner() if
!GENHD_FL_HIDDEN, so it follows that del_gendisk() should only call
bdi_unregister() if !GENHD_FL_HIDDEN.

Found with code inspection.  bdi_unregister() won't do any harm if
bdi_register_owner() wasn't used but best to avoid the unnecessary
call to bdi_unregister().

Fixes: 8ddcd653 ("block: introduce GENHD_FL_HIDDEN")
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Reviewed-by: NMing Lei <ming.lei@redhat.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

bc8d062c

blk-mq: fix bad clear of RQF_MQ_INFLIGHT in blk_mq_ct_ctx_init() · bf9ae8c5

由 Jens Axboe 提交于 1月 14, 2018

A previous commit moved the clearing of rq->rq_flags later,
but we may have already set RQF_MQ_INFLIGHT when that happens.
Ensure that we correctly initialize rq->rq_flags to the
right value.

This is based on an original fix by Ming, just rewritten to not
require a conditional.

Fixes: 7c3fb70f ("block: rearrange a few request fields for better cache layout")
Reviewed-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

bf9ae8c5

13 1月, 2018 2 次提交

blk-mq: add missing RQF_STARTED to debugfs · 85ba3eff

由 Jens Axboe 提交于 1月 12, 2018

Looking at debug output, we see:

./000000009ddfa913/requeue_list:000000009646711c {.op=READ, .state=idle, gen=0x1
18, abort_gen=0x0, .cmd_flags=, .rq_flags=SORTED|1|SOFTBARRIER|IO_STAT, complete
=0, .tag=-1, .internal_tag=217}

Note the '1' between SORTED and SOFTBARRIER - that's because no name
as defined for RQF_STARTED. Fixed that.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

85ba3eff

blk-mq: simplify queue mapping & schedule with each possisble CPU · 20e4d813

由 Christoph Hellwig 提交于 1月 12, 2018

The previous patch assigns interrupt vectors to all possible CPUs, so
now hctx can be mapped to possible CPUs, this patch applies this fact
to simplify queue mapping & schedule so that we don't need to handle
CPU hotplug for dealing with physical CPU plug & unplug. With this
simplication, we can work well on physical CPU plug & unplug, which
is a normal use case for VM at least.

Make sure we allocate blk_mq_ctx structures for all possible CPUs, and
set hctx->numa_node for possible CPUs which are mapped to this hctx. And
only choose the online CPUs for schedule.
Reported-by: NChristian Borntraeger <borntraeger@de.ibm.com>
Tested-by: NChristian Borntraeger <borntraeger@de.ibm.com>
Tested-by: NStefan Haberland <sth@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Fixes: 4b855ad3 ("blk-mq: Create hctx for each present CPU")
(merged the three into one because any single one may not work, and fix
selecting online CPUs for scheduler)
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

20e4d813

12 1月, 2018 1 次提交

blk-mq: Reduce the number of if-statements in blk_mq_mark_tag_wait() · c27d53fb

由 Bart Van Assche 提交于 1月 10, 2018

This patch does not change any functionality but makes the
blk_mq_mark_tag_wait() code slightly easier to read.
Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

c27d53fb

11 1月, 2018 8 次提交

blk-mq: Add locking annotations to hctx_lock() and hctx_unlock() · b7435db8

由 Bart Van Assche 提交于 1月 10, 2018

This patch avoids that sparse reports the following:

block/blk-mq.c:637:33: warning: context imbalance in 'hctx_unlock' - unexpected unlock
block/blk-mq.c:642:9: warning: context imbalance in 'hctx_lock' - wrong count at exit
Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b7435db8

block: silently forbid sending any ioctl to a partition · 0478fe68

由 Paolo Bonzini 提交于 1月 10, 2018

After the first few months, the message has not led to many bug reports.
It's been almost five years now, and in practice the main source of
it seems to be MTIOCGET that someone is using to detect tape devices.
While we could whitelist it just like CDROM_GET_CAPABILITY, this patch
just removes the message altogether.

The patch also removes the "safe but not very useful" ioctl whitelist,
as suggested by Christoph.  I doubt anything is using most of those
ioctls _in general_, let alone on a partition.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

0478fe68

block: rearrange a few request fields for better cache layout · 7c3fb70f

由 Jens Axboe 提交于 1月 10, 2018

Move completion related items (like the call single data) near the
end of the struct, instead of mixing them in with the initial
queueing related fields.

Move queuelist below the bio structures. Then we have all
queueing related bits in the first cache line.

This yields a 1.5-2% increase in IOPS for a null_blk test, both for
sync and for high thread count access. Sync test goes form 975K to
992K, 32-thread case from 20.8M to 21.2M IOPS.
Reviewed-by: NBart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

7c3fb70f

block: convert REQ_ATOM_COMPLETE to stealing rq->__deadline bit · e14575b3

由 Jens Axboe 提交于 1月 10, 2018

We only have one atomic flag left. Instead of using an entire
unsigned long for that, steal the bottom bit of the deadline
field that we already reserved.

Remove ->atomic_flags, since it's now unused.
Reviewed-by: NBart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e14575b3

block: add accessors for setting/querying request deadline · 0a72e7f4

由 Jens Axboe 提交于 1月 09, 2018

We reduce the resolution of request expiry, but since we're already
using jiffies for this where resolution depends on the kernel
configuration and since the timeout resolution is coarse anyway,
that should be fine.
Reviewed-by: NBart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

0a72e7f4

block: remove REQ_ATOM_POLL_SLEPT · 76a86f9d

由 Jens Axboe 提交于 1月 10, 2018

We don't need this to be an atomic flag, it can be a regular
flag. We either end up on the same CPU for the polling, in which
case the state is sane, or we did the sleep which would imply
the needed barrier to ensure we see the right state.
Reviewed-by: NBart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

76a86f9d

blk-mq: add a few missing debugfs RQF_ flags · 5d75d3f2

由 Jens Axboe 提交于 1月 10, 2018

We are missing ZONE_WRITE_LOCKED and MQ_TIMEOUT_EXPIRED, add them
so the debugfs bits can decode them.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

5d75d3f2

blk-mq: Explain when 'active_queues' is decremented · fcd36c36

由 Bart Van Assche 提交于 1月 10, 2018

It is nontrivial to derive from the blk-mq source code when
blk_mq_tags.active_queues is decremented. Hence add a comment that
explains this.
Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

fcd36c36

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功