1. 28 Aug 2019, 1 commit
  2. 05 Aug 2019, 4 commits
    • blk-mq: add callback of .cleanup_rq · 226b4fc7
      Committed by Ming Lei
      SCSI maintains its own driver private data hooked off of each SCSI
      request, and the private data won't be freed after scsi_queue_rq()
      returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE. An upper layer driver
      (e.g. dm-rq) may need to retry these SCSI requests, before SCSI has
      fully dispatched them, due to a lower level SCSI driver's resource
      limitation identified in scsi_queue_rq(). Currently SCSI's per-request
      private data is leaked when the upper layer driver (dm-rq) frees and
      then retries these requests in response to BLK_STS_RESOURCE or
      BLK_STS_DEV_RESOURCE returns from scsi_queue_rq().
      
      This usecase is so specialized that it doesn't warrant training an
      existing blk-mq interface (e.g. blk_mq_free_request) to allow SCSI to
      account for freeing its driver private data -- doing so would add an
      extra branch for handling a special case that all other consumers of
      SCSI (and blk-mq) won't ever need to worry about.
      
      So the most pragmatic way forward is to delegate freeing SCSI driver
      private data to the upper layer driver (dm-rq).  Do so by adding a
      new .cleanup_rq callback and calling a new blk_mq_cleanup_rq() method
      from dm-rq.  A following commit will implement the .cleanup_rq() hook
      in scsi_mq_ops.
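      
      As a rough sketch of the shape of this interface: the inline helper below
      mirrors the description above, while the dm-rq-style caller is purely
      hypothetical and only illustrates where an upper layer driver would invoke it.
      
          /* blk-mq side: run the driver's optional cleanup hook, if it has one */
          static inline void blk_mq_cleanup_rq(struct request *rq)
          {
              if (rq->q->mq_ops->cleanup_rq)
                  rq->q->mq_ops->cleanup_rq(rq);
          }
          
          /* hypothetical dm-rq-style caller: let the lower driver (SCSI) free its
           * per-request private data before the clone is released for a retry */
          static void dm_release_clone_for_retry(struct request *clone)
          {
              blk_mq_cleanup_rq(clone);
              blk_put_request(clone);
          }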
      
      Cc: Ewan D. Milne <emilne@redhat.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: <stable@vger.kernel.org>
      Fixes: 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: remove blk_mq_complete_request_sync · a87ccce0
      Committed by Ming Lei
      blk_mq_tagset_wait_completed_request() is now used to wait for a
      completed request's ->complete() fn to finish, so it is no longer
      necessary to use blk_mq_complete_request_sync().
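      
      A sketch of the resulting teardown pattern; the helper name below and the
      exact call sites are assumptions, not part of this patch:
      
          /* hypothetical teardown helper: abort in-flight requests, then wait
           * until every aborted request's ->complete() handler has finished */
          static void cancel_and_wait(struct blk_mq_tag_set *set,
                                      busy_tag_iter_fn *cancel_fn, void *priv)
          {
              blk_mq_tagset_busy_iter(set, cancel_fn, priv);
              blk_mq_tagset_wait_completed_request(set);
          }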
      
      Cc: Max Gurtovoy <maxg@mellanox.com>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: introduce blk_mq_tagset_wait_completed_request() · f9934a80
      Committed by Ming Lei
      blk-mq may schedule to call a queue's complete function on a remote CPU
      via IPI, but it doesn't provide any way to synchronize against a request's
      complete fn. The current queue freeze interface can't provide that
      synchronization because aborted requests stay in the blk-mq queues during EH.
      
      In some drivers' EH (such as NVMe's), the hardware queue's resources may be
      freed and re-allocated. If a completed request's complete fn ends up running
      after the hardware queue's resources have been released, a kernel crash is
      triggered.
      
      Prepare for fixing this kind of issue by introducing
      blk_mq_tagset_wait_completed_request().
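      
      A rough sketch of how such a wait can be built on top of the existing tagset
      iterator; this is close to, but not necessarily identical to, the actual
      implementation:
      
          static bool blk_mq_count_completed_rq(struct request *rq, void *data,
                                                bool reserved)
          {
              unsigned int *count = data;
          
              if (blk_mq_request_completed(rq))
                  (*count)++;
              return true;
          }
          
          void blk_mq_tagset_wait_completed_request(struct blk_mq_tag_set *tagset)
          {
              /* poll until no busy request is still sitting in the completed
               * state, i.e. every ->complete() fn has finished running */
              while (true) {
                  unsigned int count = 0;
          
                  blk_mq_tagset_busy_iter(tagset, blk_mq_count_completed_rq, &count);
                  if (!count)
                      break;
                  msleep(5);
              }
          }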
      
      Cc: Max Gurtovoy <maxg@mellanox.com>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: introduce blk_mq_request_completed() · aa306ab7
      Committed by Ming Lei
      NVMe needs this function to decide whether a request that is about to be
      aborted has already been completed in the normal IO path.
      
      So introduce it.
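      
      The helper is presumably little more than a check of the request's blk-mq
      state; a minimal sketch:
      
          /*
           * True if the request already went through the normal completion path
           * (its blk-mq state is MQ_RQ_COMPLETE), so an abort/cancel path can
           * skip it instead of completing it a second time.
           */
          bool blk_mq_request_completed(struct request *rq)
          {
              return blk_mq_rq_state(rq) == MQ_RQ_COMPLETE;
          }
          EXPORT_SYMBOL_GPL(blk_mq_request_completed);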
      
      Cc: Max Gurtovoy <maxg@mellanox.com>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 21 Jun 2019, 1 commit
    • block: remove the bi_phys_segments field in struct bio · 14ccb66b
      Committed by Christoph Hellwig
      We only need the number of segments in the blk-mq submission path.
      Remove the field from struct bio, and return it from a variant of
      blk_queue_split instead, so that it can be passed as an argument to
      those functions that need the value.
      
      This also means we stop recounting segments except for cloning
      and partial segments.
      
      To keep the number of arguments in this hot path down, remove the
      pointless struct request_queue argument from any of the functions
      that had it and grew a nr_segs argument.
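      
      The submission path then looks roughly as follows. This is a sketch of the
      resulting interface, not the literal diff, and the function name is
      illustrative only:
      
          static blk_qc_t submit_bio_sketch(struct request_queue *q, struct bio *bio)
          {
              unsigned int nr_segs;
          
              /* split if needed and get the segment count back, instead of
               * stashing it in the now-removed bio->bi_phys_segments */
              __blk_queue_split(q, &bio, &nr_segs);
          
              /* nr_segs is handed down to whoever needs it, so segments are
               * not recounted along the way */
              if (blk_mq_sched_bio_merge(q, bio, nr_segs))
                  return BLK_QC_T_NONE;
          
              /* ... allocate a request and map the bio using nr_segs ... */
              return BLK_QC_T_NONE;
          }
      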
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 04 May 2019, 1 commit
    • blk-mq: always free hctx after request queue is freed · 2f8f1336
      Committed by Ming Lei
      In the normal queue cleanup path, the hctx is released after the request
      queue is freed, see blk_mq_release().
      
      However, in __blk_mq_update_nr_hw_queues(), an hctx may be freed because
      the number of hw queues is shrinking. This easily causes a use-after-free:
      one implicit rule is that it is safe to call almost all block layer APIs
      as long as the request queue is alive; an hctx may be retrieved by one such
      API and then freed by blk_mq_update_nr_hw_queues(), at which point the
      use-after-free is triggered.
      
      Fix this issue by always freeing the hctx after the request queue is
      released. If some hctxs are removed in blk_mq_update_nr_hw_queues(),
      put them on a new per-queue list and try to reuse them when the numa
      node is matched.
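      
      The reuse might look roughly like this; the list and lock names here follow
      the description above and should be treated as assumptions:
      
          /* prefer an hctx that was shrunk away earlier and sits on the same
           * NUMA node, instead of allocating a brand-new one */
          static struct blk_mq_hw_ctx *blk_mq_get_unused_hctx(struct request_queue *q,
                                                              int node)
          {
              struct blk_mq_hw_ctx *hctx = NULL, *tmp;
          
              spin_lock(&q->unused_hctx_lock);
              list_for_each_entry(tmp, &q->unused_hctx_list, hctx_list) {
                  if (tmp->numa_node == node) {
                      hctx = tmp;
                      break;
                  }
              }
              if (hctx)
                  list_del_init(&hctx->hctx_list);
              spin_unlock(&q->unused_hctx_lock);
          
              return hctx;    /* NULL means: allocate a fresh hctx */
          }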
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: linux-scsi@vger.kernel.org
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Tested-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 10 Apr 2019, 1 commit
    • blk-mq: introduce blk_mq_complete_request_sync() · 1b8f21b7
      Committed by Ming Lei
      NVMe's error handler follows the typical steps of tearing down the
      hardware to recover the controller:
      
      1) stop blk_mq hw queues
      2) stop the real hw queues
      3) cancel in-flight requests via
      	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
      cancel_request():
      	mark the request as aborted
      	blk_mq_complete_request(req);
      4) destroy real hw queues
      
      However, there may be a race between #3 and #4, because blk_mq_complete_request()
      may run q->mq_ops->complete(rq) remotely and asynchronously, and
      ->complete(rq) may run after #4.
      
      This patch introduces blk_mq_complete_request_sync() for fixing the
      above race.
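      
      A sketch of the cancel_request() side with the new helper; the callback
      below is illustrative, and how the abort is recorded is driver specific:
      
          /* invoked for each in-flight request by blk_mq_tagset_busy_iter() */
          static bool cancel_request(struct request *req, void *data, bool reserved)
          {
              /* driver-specific: record that this request was aborted */
          
              /*
               * Complete synchronously: ->complete(rq) runs on this CPU before
               * we return, so it cannot run after the real hw queues have been
               * destroyed in step #4.
               */
              blk_mq_complete_request_sync(req);
              return true;
          }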
      
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: linux-nvme@lists.infradead.org
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. 21 Mar 2019, 1 commit
  7. 19 Mar 2019, 1 commit
  8. 15 Feb 2019, 1 commit
  9. 19 Dec 2018, 1 commit
  10. 18 Dec 2018, 1 commit
  11. 05 Dec 2018, 1 commit
  12. 30 Nov 2018, 1 commit
    • blk-mq: add mq_ops->commit_rqs() · d666ba98
      Committed by Jens Axboe
      blk-mq passes information to the hardware about any given request being
      the last that we will issue in this sequence. The point is that hardware
      can defer costly doorbell type writes to the last request. But if we run
      into errors issuing a sequence of requests, we may never send the request
      with bd->last == true set. For that case, we need a hook that tells the
      hardware that nothing else is coming right now.
      
      For failures returned by the driver's ->queue_rq() hook, the driver is
      responsible for flushing pending requests, if it uses bd->last to
      optimize that part. This works like before, no changes there.
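      
      A sketch of the dispatch-side usage; the wrapper below is illustrative, the
      real call sits inside the dispatch path itself:
      
          /* after ->queue_rq() failed mid-sequence, the bd->last request may never
           * be issued, so ask the driver to commit what it already queued */
          static void blk_mq_commit_rqs_sketch(struct blk_mq_hw_ctx *hctx)
          {
              struct request_queue *q = hctx->queue;
          
              if (q->mq_ops->commit_rqs)
                  q->mq_ops->commit_rqs(hctx);    /* e.g. ring the doorbell once */
          }
      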
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  13. 27 Nov 2018, 2 commits
  14. 26 Nov 2018, 1 commit
  15. 09 Nov 2018, 2 commits
  16. 08 Nov 2018, 8 commits
  17. 16 Oct 2018, 1 commit
  18. 25 Jul 2018, 1 commit
  19. 09 Jul 2018, 2 commits
  20. 14 Jun 2018, 1 commit
  21. 31 May 2018, 1 commit
  22. 25 Apr 2018, 1 commit
  23. 10 Apr 2018, 1 commit
  24. 10 Jan 2018, 2 commits
    • blk-mq: rename blk_mq_hw_ctx->queue_rq_srcu to ->srcu · 05707b64
      Committed by Tejun Heo
      The RCU protection has been expanded to cover both queueing and
      completion paths making ->queue_rq_srcu a misnomer.  Rename it to
      ->srcu as suggested by Bart.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: replace timeout synchronization with a RCU and generation based scheme · 1d9bd516
      Committed by Tejun Heo
      Currently, blk-mq timeout path synchronizes against the usual
      issue/completion path using a complex scheme involving atomic
      bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
      rules.  Unfortunately, it contains quite a few holes.
      
      There's a complex dancing around REQ_ATOM_STARTED and
      REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
      they don't have a synchronization point across request recycle
      instances and it isn't clear what the barriers add.
      blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
      deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
      
      In fact, it's pretty easy to make blk_mq_check_expired() terminate a
      later instance of a request.  If we induce 5 sec delay before
      time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
      2s, and issue back-to-back large IOs, blk-mq starts timing out
      requests spuriously pretty quickly.  Nothing actually timed out.  It
      just made the call on a recycle instance of a request and then
      terminated a later instance long after the original instance finished.
      The scenario isn't theoretical either.
      
      This patch replaces the broken synchronization mechanism with a RCU
      and generation number based one.
      
      1. Each request has a u64 generation + state value, which can be
         updated only by the request owner.  Whenever a request becomes
         in-flight, the generation number gets bumped up too.  This provides
         the basis for the timeout path to distinguish different recycle
         instances of the request.
      
         Also, marking a request in-flight and setting its deadline are
         protected with a seqcount so that the timeout path can fetch both
         values coherently.
      
      2. The timeout path fetches the generation, state and deadline.  If
         the verdict is timeout, it records the generation into a dedicated
         request abortion field and does RCU wait.
      
      3. The completion path is also protected by RCU (from the previous
         patch) and checks whether the current generation number and state
         match the abortion field.  If so, it skips completion.
      
      4. The timeout path, after RCU wait, scans requests again and
         terminates the ones whose generation and state still match the ones
         requested for abortion.
      
         By now, the timeout path knows that either the generation number
         and state changed if it lost the race or the completion will yield
         to it and can safely timeout the request.
      
      While it's more lines of code, it's conceptually simpler, doesn't
      depend on direct use of subtle memory ordering or coherence, and
      hopefully doesn't terminate the wrong instance.
      
      While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
      between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
      removed yet as it's still used in other places.  Future patches will
      move all state tracking to the new mechanism and remove all bitops in
      the hot paths.
      
      Note that this patch adds a comment explaining a race condition in
      BLK_EH_RESET_TIMER path.  The race has always been there and this
      patch doesn't change it.  It's just documenting the existing race.
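      
      A sketch of the generation + state packing from point 1 above; the field and
      constant names follow this description and the details are approximate:
      
          enum mq_rq_state {
              MQ_RQ_IDLE      = 0,
              MQ_RQ_IN_FLIGHT = 1,
              MQ_RQ_COMPLETE  = 2,
          };
          
          /* the low bits of rq->gstate hold the state, the remaining bits a
           * generation counter bumped on every transition to MQ_RQ_IN_FLIGHT */
          #define MQ_RQ_STATE_BITS  2
          #define MQ_RQ_STATE_MASK  ((1 << MQ_RQ_STATE_BITS) - 1)
          #define MQ_RQ_GEN_INC     (1 << MQ_RQ_STATE_BITS)
          
          static inline enum mq_rq_state blk_mq_rq_state(struct request *rq)
          {
              return READ_ONCE(rq->gstate) & MQ_RQ_STATE_MASK;
          }
          
          static inline void blk_mq_rq_update_state(struct request *rq,
                                                    enum mq_rq_state state)
          {
              u64 old_val = READ_ONCE(rq->gstate);
              u64 new_val = (old_val & ~(u64)MQ_RQ_STATE_MASK) | state;
          
              if (state == MQ_RQ_IN_FLIGHT)
                  new_val += MQ_RQ_GEN_INC;    /* new recycle instance */
          
              /* only the request owner updates gstate, so a plain store suffices */
              WRITE_ONCE(rq->gstate, new_val);
          }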
      
      v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
          - s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
          - READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
      
      v3: - Fixed possible extended seqcount / u64_stats_sync read looping
            spotted by Peter.
          - MQ_RQ_IDLE was incorrectly being set in complete_request instead
            of free_request.  Fixed.
      
      v4: - Rebased on top of hctx_lock() refactoring patch.
          - Added comment explaining the use of hctx_lock() in completion path.
      
      v5: - Added comments requested by Bart.
          - Note the addition of BLK_EH_RESET_TIMER race condition in the
            commit message.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  25. 11 Nov 2017, 2 commits
    • blk-mq: only run the hardware queue if IO is pending · 79f720a7
      Committed by Jens Axboe
      Currently we are inconsistent in when we decide to run the queue. Using
      blk_mq_run_hw_queues() we check if the hctx has pending IO before
      running it, but we don't do that from the individual queue run function,
      blk_mq_run_hw_queue(). This results in a lot of extra and pointless
      queue runs, potentially, on flush requests and (much worse) on tag
      starvation situations. This is observable just looking at top output,
      with lots of kworkers active. For the !async runs, it just adds to the
      CPU overhead of blk-mq.
      
      Move the has-pending check into the run function instead of having
      callers do it.
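      
      After the change, the single-queue run looks roughly like this (a simplified
      sketch; quiesce handling and the delayed-run plumbing are omitted):
      
          bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
          {
              /* the same has-pending check blk_mq_run_hw_queues() already did */
              if (!blk_mq_hctx_has_pending(hctx))
                  return false;
          
              __blk_mq_delay_run_hw_queue(hctx, async, 0);
              return true;
          }
      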
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, nvme: Introduce blk_mq_req_flags_t · 9a95e4ef
      Committed by Bart Van Assche
      Several block layer and NVMe core functions accept a combination
      of BLK_MQ_REQ_* flags through the 'flags' argument but there is
      no verification at compile time whether the right type of block
      layer flags is passed. Make it possible for sparse to verify this.
      This patch does not change any functionality.
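      
      Concretely, the flag type becomes a __bitwise typedef that sparse can check,
      roughly along these lines (a sketch, not the full diff):
      
          typedef __u32 __bitwise blk_mq_req_flags_t;
          
          /* the existing BLK_MQ_REQ_* flags become values of that type, e.g.: */
          #define BLK_MQ_REQ_NOWAIT   ((__force blk_mq_req_flags_t)(1 << 0))
          #define BLK_MQ_REQ_RESERVED ((__force blk_mq_req_flags_t)(1 << 1))
          
          /* allocators take the typed argument, so passing a plain unsigned int
           * flag word now triggers a sparse warning */
          struct request *blk_mq_alloc_request(struct request_queue *q,
                                               unsigned int op,
                                               blk_mq_req_flags_t flags);
      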
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Cc: linux-nvme@lists.infradead.org
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>