提交 · d78f5292ff88609cdf82e147b33c772f1ebf96e2 · openanolis / cloud-kernel

02 9月, 2020 2 次提交

blk-mq: allow software queue to map to multiple hardware queues · d78f5292

由 Jens Axboe 提交于 10月 29, 2018

to #28991349

commit f31967f0e455d08d3ea1d2f849bf62dafc92dbf4 upstream

The mapping used to be dependent on just the CPU location, but
now it's a tuple of (type, cpu) instead. This is a prep patch
for allowing a single software queue to map to multiple hardware
queues. No functional changes in this patch.

This changes the software queue count to an unsigned short
to save a bit of space. We can still support 64K-1 CPUs,
which should be enough. Add a check to catch a wrap.
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

d78f5292

blk-mq: pass in request/bio flags to queue mapping · b8617081

由 Jens Axboe 提交于 10月 29, 2018

to #28991349

commit f9afca4d367b8c915f28d29fcaba7460640403ff upstream

Prep patch for being able to place request based not just on
CPU location, but also on the type of request.
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

b8617081

18 3月, 2020 1 次提交

blk-mq: grab .q_usage_counter when queuing request from plug code path · 4a55f77f

由 Ming Lei 提交于 4月 30, 2019

commit e87eb301bee183d82bb3d04bd71b6660889a2588 upstream.

Just like aio/io_uring, we need to grab 2 refcount for queuing one
request, one is for submission, another is for completion.

If the request isn't queued from plug code path, the refcount grabbed
in generic_make_request() serves for submission. In theroy, this
refcount should have been released after the sumission(async run queue)
is done. blk_freeze_queue() works with blk_sync_queue() together
for avoiding race between cleanup queue and IO submission, given async
run queue activities are canceled because hctx->run_work is scheduled with
the refcount held, so it is fine to not hold the refcount when
running the run queue work function for dispatch IO.

However, if request is staggered into plug list, and finally queued
from plug code path, the refcount in submission side is actually missed.
And we may start to run queue after queue is removed because the queue's
kobject refcount isn't guaranteed to be grabbed in flushing plug list
context, then kernel oops is triggered, see the following race:

blk_mq_flush_plug_list():
        blk_mq_sched_insert_requests()
                insert requests to sw queue or scheduler queue
                blk_mq_run_hw_queue

Because of concurrent run queue, all requests inserted above may be
completed before calling the above blk_mq_run_hw_queue. Then queue can
be freed during the above blk_mq_run_hw_queue().

Fixes the issue by grab .q_usage_counter before calling
blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This way is
safe because the queue is absolutely alive before inserting request.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: NBart Van Assche <bvanassche@acm.org>
Tested-by: NJames Smart <james.smart@broadcom.com>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
[Joseph: use the passing 'q' directly]
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

4a55f77f

13 1月, 2019 1 次提交

block: mq-deadline: Fix write completion handling · 6353c0a0

由 Damien Le Moal 提交于 12月 17, 2018

commit 7211aef86f79583e59b88a0aba0bc830566f7e8e upstream.

For a zoned block device using mq-deadline, if a write request for a
zone is received while another write was already dispatched for the same
zone, dd_dispatch_request() will return NULL and the newly inserted
write request is kept in the scheduler queue waiting for the ongoing
zone write to complete. With this behavior, when no other request has
been dispatched, rq_list in blk_mq_sched_dispatch_requests() is empty
and blk_mq_sched_mark_restart_hctx() not called. This in turn leads to
__blk_mq_free_request() call of blk_mq_sched_restart() to not run the
queue when the already dispatched write request completes. The newly
dispatched request stays stuck in the scheduler queue until eventually
another request is submitted.

This problem does not affect SCSI disk as the SCSI stack handles queue
restart on request completion. However, this problem is can be triggered
the nullblk driver with zoned mode enabled.

Fix this by always requesting a queue restart in dd_dispatch_request()
if no request was dispatched while WRITE requests are queued.

Fixes: 5700f691 ("mq-deadline: Introduce zone locking support")
Cc: <stable@vger.kernel.org>
Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

Add missing export of blk_mq_sched_restart()
Signed-off-by: NJens Axboe <axboe@kernel.dk>

6353c0a0

21 8月, 2018 1 次提交

blk-mq: init hctx sched after update ctx and hctx mapping · d48ece20

由 Jianchao Wang 提交于 8月 21, 2018

Currently, when update nr_hw_queues, IO scheduler's init_hctx will
be invoked before the mapping between ctx and hctx is adapted
correctly by blk_mq_map_swqueue. The IO scheduler init_hctx (kyber)
may depend on this mapping and get wrong result and panic finally.
A simply way to fix this is that switch the IO scheduler to 'none'
before update the nr_hw_queues, and then switch it back after
update nr_hw_queues. blk_mq_sched_init_/exit_hctx are removed due
to nobody use them any more.
Signed-off-by: NJianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

d48ece20

18 7月, 2018 1 次提交

blk-mq: issue directly if hw queue isn't busy in case of 'none' · 6ce3dd6e

由 Ming Lei 提交于 7月 10, 2018

In case of 'none' io scheduler, when hw queue isn't busy, it isn't
necessary to enqueue request to sw queue and dequeue it from
sw queue because request may be submitted to hw queue asap without
extra cost, meantime there shouldn't be much request in sw queue,
and we don't need to worry about effect on IO merge.

There are still some single hw queue SCSI HBAs(HPSA, megaraid_sas, ...)
which may connect high performance devices, so 'none' is often required
for obtaining good performance.

This patch improves IOPS and decreases CPU unilization on megaraid_sas,
per Kashyap's test.

Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Reported-by: NKashyap Desai <kashyap.desai@broadcom.com>
Tested-by: NKashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

6ce3dd6e

09 7月, 2018 3 次提交

blk-mq: dequeue request one by one from sw queue if hctx is busy · 6e768717

由 Ming Lei 提交于 7月 03, 2018

It won't be efficient to dequeue request one by one from sw queue,
but we have to do that when queue is busy for better merge performance.

This patch takes the Exponential Weighted Moving Average(EWMA) to figure
out if queue is busy, then only dequeue request one by one from sw queue
when queue is busy.

Fixes: b347689f ("blk-mq-sched: improve dispatching from sw queue")
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Reported-by: NKashyap Desai <kashyap.desai@broadcom.com>
Tested-by: NKashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

6e768717

blk-mq: only attempt to merge bio if there is rq in sw queue · b04f50ab

由 Ming Lei 提交于 7月 02, 2018

Only attempt to merge bio iff the ctx->rq_list isn't empty, because:

1) for high-performance SSD, most of times dispatch may succeed, then
there may be nothing left in ctx->rq_list, so don't try to merge over
sw queue if it is empty, then we can save one acquiring of ctx->lock

2) we can't expect good merge performance on per-cpu sw queue, and missing
one merge on sw queue won't be a big deal since tasks can be scheduled from
one CPU to another.

Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: NKashyap Desai <kashyap.desai@broadcom.com>
Reported-by: NKashyap Desai <kashyap.desai@broadcom.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b04f50ab

blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set() · 97889f9a

由 Ming Lei 提交于 6月 25, 2018

We have to remove synchronize_rcu() from blk_queue_cleanup(),
otherwise long delay can be caused during lun probe. For removing
it, we have to avoid to iterate the set->tag_list in IO path, eg,
blk_mq_sched_restart().

This patch reverts 5b79413946d (Revert "blk-mq: don't handle
TAG_SHARED in restart"). Given we have fixed enough IO hang issue,
and there isn't any reason to restart all queues in one tags any more,
see the following reasons:

1) blk-mq core can deal with shared-tags case well via blk_mq_get_driver_tag(),
which can wake up queues waiting for driver tag.

2) SCSI is a bit special because it may return BLK_STS_RESOURCE if queue,
target or host is ready, but SCSI built-in restart can cover all these well,
see scsi_end_request(), queue will be rerun after any request initiated from
this host/target is completed.

In my test on scsi_debug(8 luns), this patch may improve IOPS by 20% ~ 30%
when running I/O on these 8 luns concurrently.

Fixes: 705cda97 ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
Cc: Omar Sandoval <osandov@fb.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Reported-by: NAndrew Jones <drjones@redhat.com>
Tested-by: NAndrew Jones <drjones@redhat.com>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

97889f9a

03 6月, 2018 1 次提交

blk-mq: update nr_requests when switching to 'none' scheduler · 32a50fab

由 Ming Lei 提交于 6月 02, 2018

Now we setup q->nr_requests when switching to one new scheduler,
but not do it for 'none', then q->nr_requests may not be correct
for 'none'.

This patch fixes this issue by always updating 'nr_requests' when
switching to 'none'.

Cc: Marco Patalano <mpatalan@redhat.com>
Cc: "Ewan D. Milne" <emilne@redhat.com>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

32a50fab

01 6月, 2018 2 次提交

block: move sysfs_lock into elevator_init · acddf3b3

由 Christoph Hellwig 提交于 5月 31, 2018

Both callers take just around so function call, so move it in.
Also remove the now pointless blk_mq_sched_init wrapper.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDamien Le Moal <damien.lemoal@wdc.com>
Tested-by: NDamien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

acddf3b3

block: remove the always unused name argument to elevator_init · ddb72532

由 Christoph Hellwig 提交于 5月 31, 2018

Reported-by: NDamien Le Moal <Damien.LeMoal@wdc.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDamien Le Moal <damien.lemoal@wdc.com>
Tested-by: NDamien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

ddb72532

31 5月, 2018 1 次提交

blk-mq: abstract out blk-mq-sched rq list iteration bio merge helper · 9c558734

由 Jens Axboe 提交于 5月 30, 2018

No functional changes in this patch, just a prep patch for utilizing
this in an IO scheduler.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Reviewed-by: NOmar Sandoval <osandov@fb.com>

9c558734

02 2月, 2018 1 次提交
- K
  blk-mq-sched: Enable merging discard bio into request · bea99a50
  由 Keith Busch 提交于 2月 01, 2018
```
Signed-off-by: NKeith Busch <keith.busch@intel.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
```
  bea99a50
18 1月, 2018 1 次提交

blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request · 9e97d295

由 Mike Snitzer 提交于 1月 17, 2018

After commit:

923218f6 ("blk-mq: don't allocate driver tag upfront for flush rq")

we no longer use the 'can_block' argument in
blk_mq_sched_insert_request(). Kill it.
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

Added actual commit message as to why it's being removed.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

9e97d295

05 1月, 2018 1 次提交

blk-mq: remove confusing comment of blk_mq_sched_dispatch_requests · 913a9500

由 Liu Bo 提交于 1月 05, 2018

Commit de148297
("blk-mq: introduce .get_budget and .put_budget in blk_mq_ops")
changes the function to return bool type, and then commit 1f460b63
("blk-mq: don't restart queue when .get_budget returns BLK_STS_RESOURCE")
changes it back to void, but the comment remains.
Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

913a9500

11 11月, 2017 2 次提交

blk-mq: only run the hardware queue if IO is pending · 79f720a7

由 Jens Axboe 提交于 11月 10, 2017

Currently we are inconsistent in when we decide to run the queue. Using
blk_mq_run_hw_queues() we check if the hctx has pending IO before
running it, but we don't do that from the individual queue run function,
blk_mq_run_hw_queue(). This results in a lot of extra and pointless
queue runs, potentially, on flush requests and (much worse) on tag
starvation situations. This is observable just looking at top output,
with lots of kworkers active. For the !async runs, it just adds to the
CPU overhead of blk-mq.

Move the has-pending check into the run function instead of having
callers do it.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

79f720a7

Revert "blk-mq: don't handle TAG_SHARED in restart" · 05b79413

由 Jens Axboe 提交于 11月 08, 2017

This reverts commit 358a3a6b.

We have cases that aren't covered 100% in the drivers, so for now
we have to retain the shared tag restart loops.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

05b79413

05 11月, 2017 3 次提交

blk-mq: don't allocate driver tag upfront for flush rq · 923218f6

由 Ming Lei 提交于 11月 02, 2017

The idea behind it is simple:

1) for none scheduler, driver tag has to be borrowed for flush rq,
   otherwise we may run out of tag, and that causes an IO hang. And
   get/put driver tag is actually noop for none, so reordering tags
   isn't necessary at all.

2) for a real I/O scheduler, we need not allocate a driver tag upfront
   for flush rq. It works just fine to follow the same approach as
   normal requests: allocate driver tag for each rq just before calling
   ->queue_rq().

One driver visible change is that the driver tag isn't shared in the
flush request sequence. That won't be a problem, since we always do that
in legacy path.

Then flush rq need not be treated specially wrt. get/put driver tag.
This cleans up the code - for instance, reorder_tags_to_front() can be
removed, and we needn't worry about request ordering in dispatch list
for avoiding I/O deadlock.

Also we have to put the driver tag before requeueing.
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

923218f6

blk-mq-sched: decide how to handle flush rq via RQF_FLUSH_SEQ · a6a252e6

由 Ming Lei 提交于 11月 02, 2017

In case of IO scheduler we always pre-allocate one driver tag before
calling blk_insert_flush(), and flush request will be marked as
RQF_FLUSH_SEQ once it is in flush machinery.

So if RQF_FLUSH_SEQ isn't set, we call blk_insert_flush() to handle
the request, otherwise the flush request is dispatched to ->dispatch
list directly.

This is a preparation patch for not preallocating a driver tag for flush
requests, and for not treating flush requests as a special case. This is
similar to what the legacy path does.
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a6a252e6

blk-mq: don't handle failure in .get_budget · 88022d72

由 Ming Lei 提交于 11月 05, 2017

It is enough to just check if we can get the budget via .get_budget().
And we don't need to deal with device state change in .get_budget().

For SCSI, one issue to be fixed is that we have to call
scsi_mq_uninit_cmd() to free allocated ressources if SCSI device fails
to handle the request. And it isn't enough to simply call
blk_mq_end_request() to do that if this request is marked as
RQF_DONTPREP.

Fixes: 0df21c86(scsi: implement .get_budget and .put_budget for blk-mq)
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

88022d72

01 11月, 2017 6 次提交

blk-mq: don't restart queue when .get_budget returns BLK_STS_RESOURCE · 1f460b63

由 Ming Lei 提交于 10月 27, 2017

SCSI restarts its queue in scsi_end_request() automatically, so we don't
need to handle this case in blk-mq.

Especailly any request won't be dequeued in this case, we needn't to
worry about IO hang caused by restart vs. dispatch.
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

1f460b63

blk-mq: don't handle TAG_SHARED in restart · 358a3a6b

由 Ming Lei 提交于 10月 27, 2017

Now restart is used in the following cases, and TAG_SHARED is for
SCSI only.

1) .get_budget() returns BLK_STS_RESOURCE
- if resource in target/host level isn't satisfied, this SCSI device
will be added in shost->starved_list, and the whole queue will be rerun
(via SCSI's built-in RESTART) in scsi_end_request() after any request
initiated from this host/targe is completed. Forget to mention, host level
resource can't be an issue for blk-mq at all.

- the same is true if resource in the queue level isn't satisfied.

- if there isn't outstanding request on this queue, then SCSI's RESTART
can't work(blk-mq's can't work too), and the queue will be run after
SCSI_QUEUE_DELAY, and finally all starved sdevs will be handled by SCSI's
RESTART when this request is finished

2) scsi_dispatch_cmd() returns BLK_STS_RESOURCE
- if there isn't onprogressing request on this queue, the queue
will be run after SCSI_QUEUE_DELAY

- otherwise, SCSI's RESTART covers the rerun.

3) blk_mq_get_driver_tag() failed
- BLK_MQ_S_TAG_WAITING covers the cross-queue RESTART for driver
allocation.

In one word, SCSI's built-in RESTART is enough to cover the queue
rerun, and we don't need to pay special attention to TAG_SHARED wrt. restart.

In my test on scsi_debug(8 luns), this patch improves IOPS by 20% ~ 30% when
running I/O on these 8 luns concurrently.

Aslo Roman Pen reported the current RESTART is very expensive especialy
when there are lots of LUNs attached in one host, such as in his
test, RESTART causes half of IOPS be cut.

Fixes: https://marc.info/?l=linux-kernel&m=150832216727524&w=2
Fixes: 6d8c6c0f ("blk-mq: Restart a single queue if tag sets are shared")
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

358a3a6b

blk-mq-sched: improve dispatching from sw queue · b347689f

由 Ming Lei 提交于 10月 14, 2017

SCSI devices use host-wide tagset, and the shared driver tag space is
often quite big. However, there is also a queue depth for each lun(
.cmd_per_lun), which is often small, for example, on both lpfc and
qla2xxx, .cmd_per_lun is just 3.

So lots of requests may stay in sw queue, and we always flush all
belonging to same hw queue and dispatch them all to driver.
Unfortunately it is easy to cause queue busy because of the small
.cmd_per_lun.  Once these requests are flushed out, they have to stay in
hctx->dispatch, and no bio merge can happen on these requests, and
sequential IO performance is harmed.

This patch introduces blk_mq_dequeue_from_ctx for dequeuing a request
from a sw queue, so that we can dispatch them in scheduler's way. We can
then avoid dequeueing too many requests from sw queue, since we don't
flush ->dispatch completely.

This patch improves dispatching from sw queue by using the .get_budget
and .put_budget callbacks.
Reviewed-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b347689f

blk-mq: introduce .get_budget and .put_budget in blk_mq_ops · de148297

由 Ming Lei 提交于 10月 14, 2017

For SCSI devices, there is often a per-request-queue depth, which needs
to be respected before queuing one request.

Currently blk-mq always dequeues the request first, then calls
.queue_rq() to dispatch the request to lld. One obvious issue with this
approach is that I/O merging may not be successful, because when the
per-request-queue depth can't be respected, .queue_rq() has to return
BLK_STS_RESOURCE, and then this request has to stay in hctx->dispatch
list. This means it never gets a chance to be merged with other IO.

This patch introduces .get_budget and .put_budget callback in blk_mq_ops,
then we can try to get reserved budget first before dequeuing request.
If the budget for queueing I/O can't be satisfied, we don't need to
dequeue request at all. Hence the request can be left in the IO
scheduler queue, for more merging opportunities.
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

de148297

blk-mq-sched: move actual dispatching into one helper · caf8eb0d

由 Ming Lei 提交于 10月 14, 2017

So that it becomes easy to support to dispatch from sw queue in the
following patch.

No functional change.
Reviewed-by: NBart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: NOmar Sandoval <osandov@fb.com>
Suggested-by: Christoph Hellwig <hch@lst.de> # for simplifying dispatch logic
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

caf8eb0d

blk-mq-sched: dispatch from scheduler IFF progress is made in ->dispatch · 5e3d02bb

由 Ming Lei 提交于 10月 14, 2017

When the hw queue is busy, we shouldn't take requests from the scheduler
queue any more, otherwise it is difficult to do IO merge.

This patch fixes the awful IO performance on some SCSI devices(lpfc,
qla2xxx, ...) when mq-deadline/kyber is used by not taking requests if
hw queue is busy.
Reviewed-by: NOmar Sandoval <osandov@fb.com>
Reviewed-by: NBart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

5e3d02bb

04 7月, 2017 1 次提交

blk-mq-sched: fix performance regression of mq-deadline · 32825c45

由 Ming Lei 提交于 7月 03, 2017

When mq-deadline is taken, IOPS of sequential read and
seqential write is observed more than 20% drop on sata(scsi-mq)
devices, compared with using 'none' scheduler.

The reason is that the default nr_requests for scheduler is
too big for small queuedepth devices, and latency is increased
much.

Since the principle of taking 256 requests for mq scheduler
is based on 128 queue depth, this patch changes into
double size of min(hw queue_depth, 128).
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

32825c45

22 6月, 2017 1 次提交

blk-mq: fix performance regression with shared tags · 8e8320c9

由 Jens Axboe 提交于 6月 20, 2017

If we have shared tags enabled, then every IO completion will trigger
a full loop of every queue belonging to a tag set, and every hardware
queue for each of those queues, even if nothing needs to be done.
This causes a massive performance regression if you have a lot of
shared devices.

Instead of doing this huge full scan on every IO, add an atomic
counter to the main queue that tracks how many hardware queues have
been marked as needing a restart. With that, we can avoid looking for
restartable queues, if we don't have to.

Max reports that this restores performance. Before this patch, 4K
IOPS was limited to 22-23K IOPS. With the patch, we are running at
950-970K IOPS.

Fixes: 6d8c6c0f ("blk-mq: Restart a single queue if tag sets are shared")
Reported-by: NMax Gurtovoy <maxg@mellanox.com>
Tested-by: NMax Gurtovoy <maxg@mellanox.com>
Reviewed-by: NBart Van Assche <bart.vanassche@sandisk.com>
Tested-by: NBart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

8e8320c9

21 6月, 2017 1 次提交

blk-mq: Document locking assumptions · 7b607814

由 Bart Van Assche 提交于 6月 20, 2017

Document the locking assumptions in functions that modify
blk_mq_ctx.rq_list to make it easier for humans to verify
this code.
Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

7b607814

19 6月, 2017 4 次提交

blk-mq: use QUEUE_FLAG_QUIESCED to quiesce queue · f4560ffe

由 Ming Lei 提交于 6月 18, 2017

It is required that no dispatch can happen any more once
blk_mq_quiesce_queue() returns, and we don't have such requirement
on APIs of stopping queue.

But blk_mq_quiesce_queue() still may not block/drain dispatch in the
the case of BLK_MQ_S_START_ON_RUN, so use the new introduced flag of
QUEUE_FLAG_QUIESCED and evaluate it inside RCU read-side critical
sections for fixing this issue.

Also blk_mq_quiesce_queue() is implemented via stopping queue, which
limits its uses, and easy to cause race, because any queue restart in
other paths may break blk_mq_quiesce_queue(). With the introduced
flag of QUEUE_FLAG_QUIESCED, we don't need to depend on stopping queue
for quiescing any more.
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Reviewed-by: NBart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

f4560ffe

blk-mq: refactor blk_mq_sched_assign_ioc · 44e8c2bf

由 Christoph Hellwig 提交于 6月 16, 2017

blk_mq_sched_assign_ioc now only handles the assigned of the ioc if
the schedule needs it (bfq only at the moment).  The caller to the
per-request initializer is moved out so that it can be merged with
a similar call for the kyber I/O scheduler.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

44e8c2bf

blk-mq: remove blk_mq_sched_{get,put}_rq_priv · ea511e3c

由 Christoph Hellwig 提交于 6月 16, 2017

Having these as separate helpers in a header really does not help
readability, or my chances to refactor this code sanely.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

ea511e3c

blk-mq: move blk_mq_sched_{get,put}_request to blk-mq.c · d2c0d383

由 Christoph Hellwig 提交于 6月 16, 2017

Having them out of line in blk-mq-sched.c just makes the code flow
unnecessarily complicated.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

d2c0d383

27 5月, 2017 1 次提交

blk-mq: make per-sw-queue bio merge as default .bio_merge · 9bddeb2a

由 Ming Lei 提交于 5月 26, 2017

Because what the per-sw-queue bio merge does is basically same with
scheduler's .bio_merge(), this patch makes per-sw-queue bio merge
as the default .bio_merge if no scheduler is used or io scheduler
doesn't provide .bio_merge().
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

9bddeb2a

04 5月, 2017 1 次提交

blk-mq-debugfs: allow schedulers to register debugfs attributes · d332ce09

由 Omar Sandoval 提交于 5月 04, 2017

This provides the infrastructure for schedulers to expose their internal
state through debugfs. We add a list of queue attributes and a list of
hctx attributes to struct elevator_type and wire them up when switching
schedulers.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>

Add missing seq_file.h header in blk-mq-debugfs.h
Signed-off-by: NJens Axboe <axboe@fb.com>

d332ce09

02 5月, 2017 1 次提交

blk-mq-sched: remove hack that bypasses scheduler for reserved requests · 9f2779bf

由 Jens Axboe 提交于 4月 28, 2017

We have update the troublesome driver (mtip32xx) to deal with this
appropriately. So kill the hack that bypassed scheduler allocation
and insertion for reserved requests.
Reviewed-by: NMing Lei <ming.lei@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Tested-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

9f2779bf

27 4月, 2017 1 次提交

blk-mq-sched: alloate reserved tags out of normal pool · 33931808

由 Jens Axboe 提交于 4月 27, 2017

At least one driver, mtip32xx, has a hard coded dependency on
the value of the reserved tag used for internal commands. While
that should really be fixed up, for now let's ensure that we just
bypass the scheduler tags an allocation marked as reserved. They
are used for house keeping or error handling, so we can safely
ignore them in the scheduler.
Tested-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

33931808

21 4月, 2017 1 次提交

blk-mq: Remove blk_mq_sched_move_to_dispatch() · 246665db

由 Bart Van Assche 提交于 4月 20, 2017

commit c13660a0 ("blk-mq-sched: change ->dispatch_requests()
to ->dispatch_request()") removed the last user of this function.
Hence also remove the function itself.
Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

246665db

08 4月, 2017 1 次提交

blk-mq-sched: provide hooks for initializing hardware queue data · ee056f98

由 Omar Sandoval 提交于 4月 05, 2017

Schedulers need to be informed when a hardware queue is added or removed
at runtime so they can allocate/free per-hardware queue data. So,
replace the blk_mq_sched_init_hctx_data() helper, which only makes sense
at init time, with .init_hctx() and .exit_hctx() hooks.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

ee056f98

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功