提交 · 066a4a73cee9a44a906b98825e70c47de5bd8b5c · openanolis / cloud-kernel

12 11月, 2016 1 次提交

blk-mq: blk_mq_try_issue_directly() should lookup hardware queue · 066a4a73

由 Jens Axboe 提交于 11月 11, 2016

A previous commit changed this to pass in the hardware queue, but
it was using the wrong hardware queue. Hence a request that was
allocated on one hardware queue ended up being issued on another
one, and that caused IO timeouts and oopses on some drivers. Since
the request holds hardware queue private resources, like a tag,
we can't just issue it on a different hardware queue.

Fixes: 2253efc8 ("blk-mq: Move more code into blk_mq_direct_issue_request()")
Signed-off-by: NJens Axboe <axboe@fb.com>

066a4a73

11 11月, 2016 2 次提交

block: hook up writeback throttling · 87760e5e

由 Jens Axboe 提交于 11月 09, 2016

Enable throttling of buffered writeback to make it a lot
more smooth, and has way less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at the time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.

The algorithm for when to throttle takes its inspiration in the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.

Unlike CoDel, blk-wb allows the scale count to to negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-bw quickly snaps back to it's
stable state of a zero scale count.

The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
target to me met. It defaults to 2 msec for non-rotational storage, and
75 msec for rotational storage. Setting this value to '0' disables
blk-wb. Generally, a user would not have to touch this setting.

We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.
Signed-off-by: NJens Axboe <axboe@fb.com>

87760e5e

block: add scalable completion tracking of requests · cf43e6be

由 Jens Axboe 提交于 11月 07, 2016

For legacy block, we simply track them in the request queue. For
blk-mq, we track them on a per-sw queue basis, which we can then
sum up through the hardware queues and finally to a per device
state.

The stats are tracked in, roughly, 0.1s interval windows.

Add sysfs files to display the stats.

The feature is off by default, to avoid any extra overhead. In-kernel
users of it can turn it on by setting QUEUE_FLAG_STATS in the queue
flags. We currently don't turn it on if someone just reads any of
the stats files, that is something we could add as well.
Signed-off-by: NJens Axboe <axboe@fb.com>

cf43e6be

07 11月, 2016 1 次提交

blk-mq: Always schedule hctx->next_cpu · c02ebfdd

由 Gabriel Krisman Bertazi 提交于 9月 28, 2016

Commit 0e87e58b ("blk-mq: improve warning for running a queue on the
wrong CPU") attempts to avoid triggering the WARN_ON in
__blk_mq_run_hw_queue when the expected CPU is dead.  Problem is, in the
last batch execution before round robin, blk_mq_hctx_next_cpu can
schedule a dead CPU and also update next_cpu to the next alive CPU in
the mask, which will trigger the WARN_ON despite the previous
workaround.

The following patch fixes this scenario by always scheduling the value
in hctx->next_cpu.  This changes the moment when we round-robin the CPU
running the hctx, but it really doesn't matter, since it still executes
BLK_MQ_CPU_WORK_BATCH times in a row before switching to another CPU.

Fixes: 0e87e58b ("blk-mq: improve warning for running a queue on the wrong CPU")
Signed-off-by: NGabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

c02ebfdd

04 11月, 2016 1 次提交

blk-mq: immediately dispatch big size request · 600271d9

由 Shaohua Li 提交于 11月 03, 2016

This is corresponding part for blk-mq. Disk with multiple hardware
queues doesn't need this as we only hold 1 request at most.
Signed-off-by: NShaohua Li <shli@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

600271d9

03 11月, 2016 8 次提交

blk-mq: Add a kick_requeue_list argument to blk_mq_requeue_request() · 2b053aca

由 Bart Van Assche 提交于 10月 28, 2016

Most blk_mq_requeue_request() and blk_mq_add_to_requeue_list() calls
are followed by kicking the requeue list. Hence add an argument to
these two functions that allows to kick the requeue list. This was
proposed by Christoph Hellwig.
Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NJens Axboe <axboe@fb.com>

2b053aca

blk-mq: Introduce blk_mq_quiesce_queue() · 6a83e74d

由 Bart Van Assche 提交于 11月 02, 2016

blk_mq_quiesce_queue() waits until ongoing .queue_rq() invocations
have finished. This function does *not* wait until all outstanding
requests have finished (this means invocation of request.end_io()).
The algorithm used by blk_mq_quiesce_queue() is as follows:
* Hold either an RCU read lock or an SRCU read lock around
  .queue_rq() calls. The former is used if .queue_rq() does not
  block and the latter if .queue_rq() may block.
* blk_mq_quiesce_queue() first calls blk_mq_stop_hw_queues()
  followed by synchronize_srcu() or synchronize_rcu(). The latter
  call waits for .queue_rq() invocations that started before
  blk_mq_quiesce_queue() was called.
* The blk_mq_hctx_stopped() calls that control whether or not
  .queue_rq() will be called are called with the (S)RCU read lock
  held. This is necessary to avoid race conditions against
  blk_mq_quiesce_queue().
Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NMing Lei <tom.leiming@gmail.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

6a83e74d

blk-mq: Remove blk_mq_cancel_requeue_work() · 9b7dd572

由 Bart Van Assche 提交于 10月 28, 2016

Since blk_mq_requeue_work() no longer restarts stopped queues
canceling requeue work is no longer needed to prevent that a
stopped queue would be restarted. Hence remove this function.
Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

9b7dd572

blk-mq: Avoid that requeueing starts stopped queues · 52d7f1b5

由 Bart Van Assche 提交于 10月 28, 2016

Since blk_mq_requeue_work() starts stopped queues and since
execution of this function can be scheduled after a queue has
been stopped it is not possible to stop queues without using
an additional state variable to track whether or not the queue
has been stopped. Hence modify blk_mq_requeue_work() such that it
does not start stopped queues. My conclusion after a review of
the blk_mq_stop_hw_queues() and blk_mq_{delay_,}kick_requeue_list()
callers is as follows:
* In the dm driver starting and stopping queues should only happen
  if __dm_suspend() or __dm_resume() is called and not if the
  requeue list is processed.
* In the SCSI core queue stopping and starting should only be
  performed by the scsi_internal_device_block() and
  scsi_internal_device_unblock() functions but not by any other
  function. Although the blk_mq_stop_hw_queue() call in
  scsi_queue_rq() may help to reduce CPU load if a LLD queue is
  full, figuring out whether or not a queue should be restarted
  when requeueing a command would require to introduce additional
  locking in scsi_mq_requeue_cmd() to avoid a race with
  scsi_internal_device_block(). Avoid this complexity by removing
  the blk_mq_stop_hw_queue() call from scsi_queue_rq().
* In the NVMe core only the functions that call
  blk_mq_start_stopped_hw_queues() explicitly should start stopped
  queues.
* A blk_mq_start_stopped_hwqueues() call must be added in the
  xen-blkfront driver in its blkif_recover() function.
Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Roger Pau Monné <roger.pau@citrix.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: James Bottomley <jejb@linux.vnet.ibm.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

52d7f1b5

blk-mq: Move more code into blk_mq_direct_issue_request() · 2253efc8

由 Bart Van Assche 提交于 10月 28, 2016

Move the "hctx stopped" test and the insert request calls into
blk_mq_direct_issue_request(). Rename that function into
blk_mq_try_issue_directly() to reflect its new semantics. Pass
the hctx pointer to that function instead of looking it up a
second time. These changes avoid that code has to be duplicated
in the next patch.
Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

2253efc8

blk-mq: Introduce blk_mq_queue_stopped() · fd001443

由 Bart Van Assche 提交于 10月 28, 2016

The function blk_queue_stopped() allows to test whether or not a
traditional request queue has been stopped. Introduce a helper
function that allows block drivers to query easily whether or not
one or more hardware contexts of a blk-mq queue have been stopped.
Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

fd001443

blk-mq: Introduce blk_mq_hctx_stopped() · 5d1b25c1

由 Bart Van Assche 提交于 10月 28, 2016

Multiple functions test the BLK_MQ_S_STOPPED bit so introduce
a helper function that performs this test.
Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: NMing Lei <tom.leiming@gmail.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

5d1b25c1

blk-mq: Do not invoke .queue_rq() for a stopped queue · bc27c01b

由 Bart Van Assche 提交于 10月 28, 2016

The meaning of the BLK_MQ_S_STOPPED flag is "do not call
.queue_rq()". Hence modify blk_mq_make_request() such that requests
are queued instead of issued if a queue has been stopped.
Reported-by: NMing Lei <tom.leiming@gmail.com>
Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NMing Lei <tom.leiming@gmail.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Cc: <stable@vger.kernel.org>
Signed-off-by: NJens Axboe <axboe@fb.com>

bc27c01b

28 10月, 2016 4 次提交

block: better op and flags encoding · ef295ecf

由 Christoph Hellwig 提交于 10月 28, 2016

Now that we don't need the common flags to overflow outside the range
of a 32-bit type we can encode them the same way for both the bio and
request fields.  This in addition allows us to place the operation
first (and make some room for more ops while we're at it) and to
stop having to shift around the operation values.

In addition this allows passing around only one value in the block layer
instead of two (and eventuall also in the file systems, but we can do
that later) and thus clean up a lot of code.

Last but not least this allows decreasing the size of the cmd_flags
field in struct request to 32-bits.  Various functions passing this
value could also be updated, but I'd like to avoid the churn for now.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

ef295ecf

block: split out request-only flags into a new namespace · e8064021

由 Christoph Hellwig 提交于 10月 20, 2016

A lot of the REQ_* flags are only used on struct requests, and only of
use to the block layer and a few drivers that dig into struct request
internals.

This patch adds a new req_flags_t rq_flags field to struct request for
them, and thus dramatically shrinks the number of common requests.  It
also removes the unfortunate situation where we have to fit the fields
from the same enum into 32 bits for struct bio and 64 bits for
struct request.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NShaun Tancheff <shaun.tancheff@seagate.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

e8064021

blk-mq: get rid of confusing blk_map_ctx structure · 2552e3f8

由 Jens Axboe 提交于 10月 27, 2016

We can just use struct blk_mq_alloc_data - it has a few more
members, but we allocate it further down the stack anyway. So
this cleans up the code, and reduces the stack overhead a bit.
Signed-off-by: NJens Axboe <axboe@fb.com>

2552e3f8

blk-mq: update hardware and software queues for sleeping alloc · 7dd2fb68

由 Jens Axboe 提交于 10月 27, 2016

If we end up sleeping due to running out of requests, we should
update the hardware and software queues in the map ctx structure.
Otherwise we could end up having rq->mq_ctx point to the pre-sleep
context, and risk corrupting ctx->rq_list since we'll be
grabbing the wrong lock when inserting the request.
Reported-by: NDave Jones <davej@codemonkey.org.uk>
Reported-by: NChris Mason <clm@fb.com>
Tested-by: NChris Mason <clm@fb.com>
Fixes: 63581af3 ("blk-mq: remove non-blocking pass in blk_mq_map_request")
Signed-off-by: NJens Axboe <axboe@fb.com>

7dd2fb68

24 9月, 2016 1 次提交

blk-mq: skip unmapped queues in blk_mq_alloc_request_hctx · c8712c6a

由 Christoph Hellwig 提交于 9月 23, 2016

This provides the caller a feedback that a given hctx is not mapped and thus
no command can be sent on it.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Tested-by: NSteve Wise <swise@opengridcomputing.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

c8712c6a

23 9月, 2016 3 次提交

blk-mq: fixup "Convert to new hotplug state machine" · 97a32864

由 Sebastian Andrzej Siewior 提交于 9月 23, 2016

The "blk_mq_queue_reinit_dead()" just cleared the cpumask instead doing
a copy. Since we might never had an online callback we could end up with
a ZERO mask which in turn leads to crash as test robot demonstarted.

Fixes: 65d5291e ("blk-mq: Convert to new hotplug state machine")
Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

97a32864

blk-mq: add flag for drivers wanting blocking ->queue_rq() · 1b792f2f

由 Jens Axboe 提交于 9月 21, 2016

If a driver sets BLK_MQ_F_BLOCKING, it is allowed to block in its
->queue_rq() handler. For that case, blk-mq ensures that we always
calls it from a safe context.
Signed-off-by: NJens Axboe <axboe@fb.com>
Tested-by: NJosef Bacik <jbacik@fb.com>

1b792f2f

blk-mq: remove non-blocking pass in blk_mq_map_request · 63581af3

由 Christoph Hellwig 提交于 9月 22, 2016

bt_get already does a non-blocking pass as well as running the queue
when scheduling internally, no need to duplicate it.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

63581af3

22 9月, 2016 3 次提交

blk-mq: get rid of manual run of queue with __blk_mq_run_hw_queue() · 841bac2c

由 Jens Axboe 提交于 9月 21, 2016

Two cases:

1) blk_mq_alloc_request() needlessly re-runs the queue, after
   calling into the tag allocation without NOWAIT set. We don't
   need to do that.

2) blk_mq_map_request() should just use blk_mq_run_hw_queue() with
   the async flag set to false.
Signed-off-by: NJens Axboe <axboe@fb.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

841bac2c

blk-mq: Convert to new hotplug state machine · 65d5291e

由 Sebastian Andrzej Siewior 提交于 9月 22, 2016

Install the callbacks via the state machine so we can phase out the cpu
hotplug notifiers mess.
Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-block@vger.kernel.org
Cc: rt@linutronix.de
Cc: Christoph Hellwing <hch@lst.de>
Link: http://lkml.kernel.org/r/20160919212601.180033814@linutronix.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

65d5291e

blk-mq/cpu-notif: Convert to new hotplug state machine · 9467f859

由 Thomas Gleixner 提交于 9月 22, 2016

Replace the block-mq notifier list management with the multi instance
facility in the cpu hotplug state machine.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-block@vger.kernel.org
Cc: rt@linutronix.de
Cc: Christoph Hellwing <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

9467f859

17 9月, 2016 3 次提交

sbitmap: push per-cpu last_tag into sbitmap_queue · 40aabb67

由 Omar Sandoval 提交于 9月 17, 2016

Allocating your own per-cpu allocation hint separately makes for an
awkward API. Instead, allocate the per-cpu hint as part of the struct
sbitmap_queue. There's no point for a struct sbitmap_queue without the
cache, but you can still use a bare struct sbitmap.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

40aabb67

blk-mq: abstract tag allocation out into sbitmap library · 88459642

由 Omar Sandoval 提交于 9月 17, 2016

This is a generally useful data structure, so make it available to
anyone else who might want to use it. It's also a nice cleanup
separating the allocation logic from the rest of the tag handling logic.

The code is behind a new Kconfig option, CONFIG_SBITMAP, which is only
selected by CONFIG_BLOCK for now.

This should be a complete noop functionality-wise.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

88459642

blk-mq: account higher order dispatch · 703fd1c0

由 Jens Axboe 提交于 9月 16, 2016

We currently account a '0' dispatch, and anything above that still falls
below the range set by BLK_MQ_MAX_DISPATCH_ORDER. If we dispatch more,
we don't account it.

Change the last bucket to be inclusive of anything above the range we
track, and have the sysfs file reflect that by including a '+' in the
output:

$ cat /sys/block/nvme0n1/mq/0/dispatched
        0	1006
        1	20229
        2	1
        4	0
        8	0
       16	0
       32+	0
Signed-off-by: NJens Axboe <axboe@fb.com>
Reviewed-by: NOmar Sandoval <osandov@fb.com>

703fd1c0

15 9月, 2016 7 次提交

blk-mq: kill unused blk_mq_create_mq_map() · 9151bcb4

由 Jens Axboe 提交于 9月 15, 2016

Fixes 1b157939 ("blk-mq: get rid of the cpumask in struct blk_mq_tags")
Signed-off-by: NJens Axboe <axboe@fb.com>

9151bcb4

blk-mq: get rid of the cpumask in struct blk_mq_tags · 1b157939

由 Christoph Hellwig 提交于 9月 14, 2016

Unused now that NVMe sets up irq affinity before calling into blk-mq.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

1b157939

blk-mq: allow the driver to pass in a queue mapping · da695ba2

由 Christoph Hellwig 提交于 9月 14, 2016

This allows drivers specify their own queue mapping by overriding the
setup-time function that builds the mq_map.  This can be used for
example to build the map based on the MSI-X vector mapping provided
by the core interrupt layer for PCI devices.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

da695ba2

blk-mq: remove ->map_queue · 7d7e0f90

由 Christoph Hellwig 提交于 9月 14, 2016

All drivers use the default, so provide an inline version of it.  If we
ever need other queue mapping we can add an optional method back,
although supporting will also require major changes to the queue setup
code.

This provides better code generation, and better debugability as well.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

7d7e0f90

blk-mq: only allocate a single mq_map per tag_set · bdd17e75

由 Christoph Hellwig 提交于 9月 14, 2016

The mapping is identical for all queues in a tag_set, so stop wasting
memory for building multiple.  Note that for now I've kept the mq_map
pointer in the request_queue, but we'll need to investigate if we can
remove it without suffering too much from the additional pointer chasing.
The same would apply to the mq_ops pointer as well.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

bdd17e75

blk-mq: don't redistribute hardware queues on a CPU hotplug event · 4e68a011

由 Christoph Hellwig 提交于 9月 14, 2016

Currently blk-mq will totally remap hardware context when a CPU hotplug
even happened, which causes major havoc for drivers, as they are never
told about this remapping.  E.g. any carefully sorted out CPU affinity
will just be completely messed up.

The rebuild also doesn't really help for the common case of cpu
hotplug, which is soft onlining / offlining of cpus - in this case we
should just leave the queue and irq mapping as is.  If it actually
worked it would have helped in the case of physical cpu hotplug,
although for that we'd need a way to actually notify the driver.
Note that drivers may already be able to accommodate such a topology
change on their own, e.g. using the reset_controller sysfs file in NVMe
will cause the driver to get things right for this case.

With the rebuild removed we will simplify retain the queue mapping for
a soft offlined CPU that will work when it comes back online, and will
map any newly onlined CPU to queue 0 until the driver initiates
a rebuild of the queue map.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

4e68a011

blk-mq: introduce blk_mq_delay_kick_requeue_list() · 2849450a

由 Mike Snitzer 提交于 9月 14, 2016

blk_mq_delay_kick_requeue_list() provides the ability to kick the
q->requeue_list after a specified time.  To do this the request_queue's
'requeue_work' member was changed to a delayed_work.

blk_mq_delay_kick_requeue_list() allows DM to defer processing requeued
requests while it doesn't make sense to immediately requeue them
(e.g. when all paths in a DM multipath have failed).
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

2849450a

29 8月, 2016 2 次提交

blk-mq: prefetch request in blk_mq_tag_to_rq() · 88c7b2b7

由 Jens Axboe 提交于 8月 25, 2016

When drivers or the core calls this function, they usually
dereference the request shortly there after. Prefetch the first
cache line.

Profiling IO workloads shows that this is the most common cache
miss on the block side of things.
Signed-off-by: NJens Axboe <axboe@fb.com>

88c7b2b7

blk-mq: turn hctx->run_work into a regular work struct · 27489a3c

由 Jens Axboe 提交于 8月 24, 2016

We don't need the larger delayed work struct, since we always run it
immediately.
Signed-off-by: NJens Axboe <axboe@fb.com>

27489a3c

25 8月, 2016 2 次提交

blk-mq: improve warning for running a queue on the wrong CPU · 0e87e58b

由 Jens Axboe 提交于 8月 24, 2016

__blk_mq_run_hw_queue() currently warns if we are running the queue on a
CPU that isn't set in its mask. However, this can happen if a CPU is
being offlined, and the workqueue handling will place the work on CPU0
instead. Improve the warning so that it only triggers if the batch cpu
in the hardware queue is currently online.  If it triggers for that
case, then it's indicative of a flow problem in blk-mq, so we want to
retain it for that case.
Signed-off-by: NJens Axboe <axboe@fb.com>

0e87e58b

blk-mq: don't overwrite rq->mq_ctx · e57690fe

由 Jens Axboe 提交于 8月 24, 2016

We do this in a few places, if the CPU is offline. This isn't allowed,
though, since on multi queue hardware, we can't just move a request
from one software queue to another, if they map to different hardware
queues. The request and tag isn't valid on another hardware queue.

This can happen if plugging races with CPU offlining. But it does
no harm, since it can only happen in the window where we are
currently busy freezing the queue and flushing IO, in preparation
for redoing the software <-> hardware queue mappings.
Signed-off-by: NJens Axboe <axboe@fb.com>

e57690fe

08 8月, 2016 1 次提交

block: rename bio bi_rw to bi_opf · 1eff9d32

由 Jens Axboe 提交于 8月 05, 2016

Since commit 63a4cc24, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.

No intended functional changes in this commit.
Signed-off-by: NJens Axboe <axboe@fb.com>

1eff9d32

05 8月, 2016 1 次提交

blk-mq: Allow timeouts to run while queue is freezing · 71f79fb3

由 Gabriel Krisman Bertazi 提交于 8月 01, 2016

In case a submitted request gets stuck for some reason, the block layer
can prevent the request starvation by starting the scheduled timeout work.
If this stuck request occurs at the same time another thread has started
a queue freeze, the blk_mq_timeout_work will not be able to acquire the
queue reference and will return silently, thus not issuing the timeout.
But since the request is already holding a q_usage_counter reference and
is unable to complete, it will never release its reference, preventing
the queue from completing the freeze started by first thread.  This puts
the request_queue in a hung state, forever waiting for the freeze
completion.

This was observed while running IO to a NVMe device at the same time we
toggled the CPU hotplug code. Eventually, once a request got stuck
requiring a timeout during a queue freeze, we saw the CPU Hotplug
notification code get stuck inside blk_mq_freeze_queue_wait, as shown in
the trace below.

[c000000deaf13690] [c000000deaf13738] 0xc000000deaf13738 (unreliable)
[c000000deaf13860] [c000000000015ce8] __switch_to+0x1f8/0x350
[c000000deaf138b0] [c000000000ade0e4] __schedule+0x314/0x990
[c000000deaf13940] [c000000000ade7a8] schedule+0x48/0xc0
[c000000deaf13970] [c0000000005492a4] blk_mq_freeze_queue_wait+0x74/0x110
[c000000deaf139e0] [c00000000054b6a8] blk_mq_queue_reinit_notify+0x1a8/0x2e0
[c000000deaf13a40] [c0000000000e7878] notifier_call_chain+0x98/0x100
[c000000deaf13a90] [c0000000000b8e08] cpu_notify_nofail+0x48/0xa0
[c000000deaf13ac0] [c0000000000b92f0] _cpu_down+0x2a0/0x400
[c000000deaf13b90] [c0000000000b94a8] cpu_down+0x58/0xa0
[c000000deaf13bc0] [c0000000006d5dcc] cpu_subsys_offline+0x2c/0x50
[c000000deaf13bf0] [c0000000006cd244] device_offline+0x104/0x140
[c000000deaf13c30] [c0000000006cd40c] online_store+0x6c/0xc0
[c000000deaf13c80] [c0000000006c8c78] dev_attr_store+0x68/0xa0
[c000000deaf13cc0] [c0000000003974d0] sysfs_kf_write+0x80/0xb0
[c000000deaf13d00] [c0000000003963e8] kernfs_fop_write+0x188/0x200
[c000000deaf13d50] [c0000000002e0f6c] __vfs_write+0x6c/0xe0
[c000000deaf13d90] [c0000000002e1ca0] vfs_write+0xc0/0x230
[c000000deaf13de0] [c0000000002e2cdc] SyS_write+0x6c/0x110
[c000000deaf13e30] [c000000000009204] system_call+0x38/0xb4

The fix is to allow the timeout work to execute in the window between
dropping the initial refcount reference and the release of the last
reference, which actually marks the freeze completion.  This can be
achieved with percpu_refcount_tryget, which does not require the counter
to be alive.  This way the timeout work can do it's job and terminate a
stuck request even during a freeze, returning its reference and avoiding
the deadlock.

Allowing the timeout to run is just a part of the fix, since for some
devices, we might get stuck again inside the device driver's timeout
handler, should it attempt to allocate a new request in that path -
which is a quite common action for Abort commands, which need to be sent
after a timeout.  In NVMe, for instance, we call blk_mq_alloc_request
from inside the timeout handler, which will fail during a freeze, since
it also tries to acquire a queue reference.

I considered a similar change to blk_mq_alloc_request as a generic
solution for further device driver hangs, but we can't do that, since it
would allow new requests to disturb the freeze process.  I thought about
creating a new function in the block layer to support unfreezable
requests for these occasions, but after working on it for a while, I
feel like this should be handled in a per-driver basis.  I'm now
experimenting with changes to the NVMe timeout path, but I'm open to
suggestions of ways to make this generic.
Signed-off-by: NGabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: linux-nvme@lists.infradead.org
Cc: linux-block@vger.kernel.org
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

71f79fb3

openanolis / cloud-kernel 大约 1 年 前同步成功

openanolis / cloud-kernel
大约 1 年前同步成功