Commits · 6d6f167ce74158903e7fc20dfbecf89c71aa1c00 · openanolis / cloud-kernel

05 Nov, 2017 · 3 commits

blk-mq: put the driver tag of nxt rq before first one is requeued · 6d6f167c

Committed by Jianchao Wang on Nov 02, 2017

When freeing the driver tag of the next rq with an I/O scheduler
configured, we get the first entry of the list. However, this can
race with requeue of a request, and we end up getting the wrong request
from the head of the list. Free the driver tag of next rq before the
failed one is requeued in the failure branch of queue_rq callback.
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blkcg: add sanity check for blkcg policy operations · e8401073

Committed by weiping zhang on Oct 17, 2017

A blkcg policy should keep the cpd/pd alloc_fn and free_fn in pairs;
otherwise the policy will fail to register.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-mq: don't handle failure in .get_budget · 88022d72

Committed by Ming Lei on Nov 05, 2017

It is enough to just check if we can get the budget via .get_budget().
And we don't need to deal with device state change in .get_budget().

For SCSI, one issue to be fixed is that we have to call
scsi_mq_uninit_cmd() to free allocated resources if SCSI device fails
to handle the request. And it isn't enough to simply call
blk_mq_end_request() to do that if this request is marked as
RQF_DONTPREP.

Fixes: 0df21c86 ("scsi: implement .get_budget and .put_budget for blk-mq")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

04 Nov, 2017 · 8 commits

block: fix peeking requests during PM · e4f36b24

Committed by Christoph Hellwig on Oct 20, 2017

We need to look for an active PM request until the next softbarrier
instead of looking for the first non-PM request.  Otherwise any cause
of request reordering might starve the PM request(s).
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-mq: Make blk_mq_get_request() error path less confusing · 21e768b4

Committed by Bart Van Assche on Oct 16, 2017

blk_mq_get_tag() can modify data->ctx. This means that in the
error path of blk_mq_get_request() data->ctx should be passed to
blk_mq_put_ctx() instead of local_ctx. Note: since blk_mq_put_ctx()
ignores its argument, this patch does not change any functionality.

References: commit 1ad43c00 ("blk-mq: don't leak preempt counter/q_usage_counter when allocating rq failed")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-mq: fix nr_requests wrong value when modify it from sysfs · c2e82a23

Committed by weiping zhang on Sep 22, 2017

If blk-mq uses the "none" I/O scheduler, nr_requests gets a wrong value
when a number > tag_set->queue_depth is written. blk_mq_tag_update_depth()
takes the smaller one, min(nr, set->queue_depth), but q->nr_requests is
still set to the number that was written.

Reproduce:

echo none > /sys/block/nvme0n1/queue/scheduler
echo 1000000 > /sys/block/nvme0n1/queue/nr_requests
cat /sys/block/nvme0n1/queue/nr_requests
1000000
Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
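
The mismatch in the reproduction above can be modelled in a few lines. This is a toy sketch, not the kernel implementation: both function names are hypothetical, and the "consistent" variant is just one possible way to keep the two values in step, which may differ from what the actual patch does.

```python
def update_depth_buggy(requested, set_queue_depth):
    """Tags are clamped, but the reported nr_requests keeps the raw input."""
    tag_depth = min(requested, set_queue_depth)
    nr_requests = requested  # stale: no longer matches the real tag depth
    return tag_depth, nr_requests


def update_depth_consistent(requested, set_queue_depth):
    """Report the same value that was actually applied to the tags."""
    tag_depth = min(requested, set_queue_depth)
    return tag_depth, tag_depth


# 1023 is an arbitrary example queue depth, not a value from the commit.
buggy = update_depth_buggy(1000000, 1023)
consistent = update_depth_consistent(1000000, 1023)
```

With these inputs the buggy variant reports 1000000 through sysfs while only 1023 tags exist; the consistent variant reports 1023 for both.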

block: add a poll_fn callback to struct request_queue · ea435e1b

Committed by Christoph Hellwig on Nov 02, 2017

That way we can also poll non blk-mq queues.  Mostly needed for
the NVMe multipath code, but could also be useful elsewhere.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: introduce GENHD_FL_HIDDEN · 8ddcd653

Committed by Christoph Hellwig on Nov 02, 2017

With this flag a driver can create a gendisk that can be used for I/O
submission inside the kernel, but which is not registered as user
facing block device.  This will be useful for the NVMe multipath
implementation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: don't look at the struct device dev_t in disk_devt · 517bf3c3

Committed by Christoph Hellwig on Nov 02, 2017

The hidden gendisks introduced in the next patch need to keep the dev
field in their struct device empty so that udev won't try to create
block device nodes for them.  To support that rewrite disk_devt to
look at the major and first_minor fields in the gendisk itself instead
of looking into the struct device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
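
As a rough illustration of deriving the device number from the gendisk's own fields, the sketch below mirrors the kernel's internal MKDEV encoding (20 minor bits). The dict standing in for a gendisk is purely hypothetical, and this is a user-space model rather than the patched function.

```python
MINORBITS = 20  # width of the minor field in the kernel's internal dev_t


def mkdev(major, minor):
    """Pack a major/minor pair the way the kernel's MKDEV() macro does."""
    return (major << MINORBITS) | minor


def disk_devt(disk):
    """Build the dev_t from the disk itself, not from its struct device."""
    return mkdev(disk["major"], disk["first_minor"])


# Hypothetical disk: major 8, first minor 16 (the usual numbers for sdb).
devt = disk_devt({"major": 8, "first_minor": 16})
```

Because the value no longer comes from the struct device, that structure's dev field can stay empty for hidden disks.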

block: add a blk_steal_bios helper · ef71de8b

Committed by Christoph Hellwig on Nov 02, 2017

This helper allows stealing the uncompleted bios from a request so
that they can be reissued on another path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: provide a direct_make_request helper · f421e1d9

Committed by Christoph Hellwig on Nov 02, 2017

This helper allows reinserting a bio into a new queue without much
overhead, but requires all queue limits to be the same for the upper
and lower queues, and it does not provide any recursion prevention.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Javier González <javier@cnexlabs.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

01 Nov, 2017 · 7 commits

blk-mq: don't restart queue when .get_budget returns BLK_STS_RESOURCE · 1f460b63

Committed by Ming Lei on Oct 27, 2017

SCSI restarts its queue in scsi_end_request() automatically, so we don't
need to handle this case in blk-mq.

In particular, no request is dequeued in this case, so we needn't
worry about an IO hang caused by restart vs. dispatch.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-mq: don't handle TAG_SHARED in restart · 358a3a6b

Committed by Ming Lei on Oct 27, 2017

Now restart is used in the following cases, and TAG_SHARED is for
SCSI only.

1) .get_budget() returns BLK_STS_RESOURCE
- if resource in target/host level isn't satisfied, this SCSI device
will be added in shost->starved_list, and the whole queue will be rerun
(via SCSI's built-in RESTART) in scsi_end_request() after any request
initiated from this host/target is completed. Note that a host level
resource can't be an issue for blk-mq at all.

- the same is true if resource in the queue level isn't satisfied.

- if there isn't any outstanding request on this queue, then SCSI's RESTART
can't work (blk-mq's can't work either), and the queue will be run after
SCSI_QUEUE_DELAY, and finally all starved sdevs will be handled by SCSI's
RESTART when this request is finished

2) scsi_dispatch_cmd() returns BLK_STS_RESOURCE
- if there isn't any in-progress request on this queue, the queue
will be run after SCSI_QUEUE_DELAY

- otherwise, SCSI's RESTART covers the rerun.

3) blk_mq_get_driver_tag() failed
- BLK_MQ_S_TAG_WAITING covers the cross-queue RESTART for driver
allocation.

In short, SCSI's built-in RESTART is enough to cover the queue
rerun, and we don't need to pay special attention to TAG_SHARED wrt. restart.

In my test on scsi_debug(8 luns), this patch improves IOPS by 20% ~ 30% when
running I/O on these 8 luns concurrently.

Also, Roman Pen reported that the current RESTART is very expensive,
especially when there are lots of LUNs attached to one host; in his
test, RESTART cut IOPS in half.

Fixes: https://marc.info/?l=linux-kernel&m=150832216727524&w=2
Fixes: 6d8c6c0f ("blk-mq: Restart a single queue if tag sets are shared")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-mq-sched: improve dispatching from sw queue · b347689f

Committed by Ming Lei on Oct 14, 2017

SCSI devices use host-wide tagset, and the shared driver tag space is
often quite big. However, there is also a queue depth for each lun
(.cmd_per_lun), which is often small, for example, on both lpfc and
qla2xxx, .cmd_per_lun is just 3.

So lots of requests may stay in sw queue, and we always flush all
belonging to same hw queue and dispatch them all to driver.
Unfortunately it is easy to cause queue busy because of the small
.cmd_per_lun.  Once these requests are flushed out, they have to stay in
hctx->dispatch, and no bio merge can happen on these requests, and
sequential IO performance is harmed.

This patch introduces blk_mq_dequeue_from_ctx for dequeuing a request
from a sw queue, so that we can dispatch them in scheduler's way. We can
then avoid dequeueing too many requests from sw queue, since we don't
flush ->dispatch completely.

This patch improves dispatching from sw queue by using the .get_budget
and .put_budget callbacks.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-mq: introduce .get_budget and .put_budget in blk_mq_ops · de148297

Committed by Ming Lei on Oct 14, 2017

For SCSI devices, there is often a per-request-queue depth, which needs
to be respected before queuing one request.

Currently blk-mq always dequeues the request first, then calls
.queue_rq() to dispatch the request to lld. One obvious issue with this
approach is that I/O merging may not be successful, because when the
per-request-queue depth can't be respected, .queue_rq() has to return
BLK_STS_RESOURCE, and then this request has to stay in hctx->dispatch
list. This means it never gets a chance to be merged with other IO.

This patch introduces .get_budget and .put_budget callback in blk_mq_ops,
then we can try to get reserved budget first before dequeuing request.
If the budget for queueing I/O can't be satisfied, we don't need to
dequeue request at all. Hence the request can be left in the IO
scheduler queue, for more merging opportunities.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
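
The "reserve budget first, dequeue second" ordering described above might look roughly like the following toy model. Everything here is an assumption made for illustration: the ToyDevice class, its depth field, and the dispatch() helper are hypothetical and only imitate the shape of the .get_budget/.put_budget callbacks.

```python
from collections import deque


class ToyDevice:
    """Hypothetical device with a small per-queue depth."""

    def __init__(self, depth):
        self.depth = depth
        self.in_flight = 0

    def get_budget(self):
        # Reserve one slot; refuse if the device queue is already full.
        if self.in_flight >= self.depth:
            return False
        self.in_flight += 1
        return True

    def put_budget(self):
        # Give a reserved slot back (here: when a request completes).
        self.in_flight -= 1


def dispatch(sched_queue, dev):
    """Dequeue requests only while budget is available."""
    dispatched = []
    while sched_queue:
        if not dev.get_budget():
            break  # leave the rest queued so they can still be merged
        dispatched.append(sched_queue.popleft())
    return dispatched


pending = deque(["rq0", "rq1", "rq2", "rq3", "rq4"])
dev = ToyDevice(depth=3)
sent = dispatch(pending, dev)
```

With a depth of 3, only the first three requests leave the scheduler queue; the remaining two stay where an I/O scheduler could still merge them, and one more is released each time put_budget() frees a slot.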

block: kyber: check if there are requests in ctx in kyber_has_work() · 63ba8e31

Committed by Ming Lei on Oct 14, 2017

There may be requests in the sw queue that have not been fetched into
the domain queue yet, so check for them in kyber_has_work().
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-mq-sched: move actual dispatching into one helper · caf8eb0d

Committed by Ming Lei on Oct 14, 2017

So that it becomes easy to support dispatching from the sw queue in the
following patch.

No functional change.
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Suggested-by: Christoph Hellwig <hch@lst.de> # for simplifying dispatch logic
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-mq-sched: dispatch from scheduler IFF progress is made in ->dispatch · 5e3d02bb

Committed by Ming Lei on Oct 14, 2017

When the hw queue is busy, we shouldn't take requests from the scheduler
queue any more, otherwise it is difficult to do IO merge.

This patch fixes the awful IO performance on some SCSI devices (lpfc,
qla2xxx, ...) when mq-deadline/kyber is used by not taking requests if
hw queue is busy.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

31 Oct, 2017 · 1 commit

block: Fix a race between blk_cleanup_queue() and timeout handling · 4e9b6f20

Committed by Bart Van Assche on Oct 19, 2017

Make sure that if the timeout timer fires after a queue has been
marked "dying" that the affected requests are finished.
Reported-by: chenxiang (M) <chenxiang66@hisilicon.com>
Fixes: commit 287922eb ("block: defer timeouts to a workqueue")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: chenxiang (M) <chenxiang66@hisilicon.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

26 Oct, 2017 · 6 commits

mq-deadline: add 'deadline' as a name alias · 4d740bc9

Committed by Jens Axboe on Oct 25, 2017

The scheduler framework now supports looking up the appropriate
scheduler with the {name,mq} tuple. We can register mq-deadline
with the alias of 'deadline', so that switching to 'deadline'
will do the right thing based on the type of driver attached to
it.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

elevator: allow name aliases · 8ac0d9a8

Committed by Jens Axboe on Oct 25, 2017

Since we now look up elevator types with the appropriate multiqueue
capability, allow schedulers to register with an alias alongside
the real name. This is in preparation for allowing 'mq-deadline'
to register an alias of 'deadline' as well.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

elevator: lookup mq vs non-mq elevators · 2527d997

Committed by Jens Axboe on Oct 25, 2017

If an IO scheduler is selected via elevator= and it doesn't match
the driver in question wrt blk-mq support, then we fail to boot.

The elevator= parameter is deprecated and only supported for
non-mq devices. Augment the elevator lookup API so that we
pass in if we're looking for an mq capable scheduler or not,
so that we only ever return a valid type for the queue in
question.

Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=196695
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: cope with WRITE ZEROES failing in blkdev_issue_zeroout() · d5ce4c31

Committed by Ilya Dryomov on Oct 16, 2017

sd_config_write_same() ignores ->max_ws_blocks == 0 and resets it to
permit trying WRITE SAME on older SCSI devices, unless ->no_write_same
is set.  Because REQ_OP_WRITE_ZEROES is implemented in terms of WRITE
SAME, blkdev_issue_zeroout() may fail with -EREMOTEIO:

  $ fallocate -zn -l 1k /dev/sdg
  fallocate: fallocate failed: Remote I/O error
  $ fallocate -zn -l 1k /dev/sdg  # OK
  $ fallocate -zn -l 1k /dev/sdg  # OK

The following calls succeed because sd_done() sets ->no_write_same in
response to a sense that would become BLK_STS_TARGET/-EREMOTEIO, causing
__blkdev_issue_zeroout() to fall back to generating ZERO_PAGE bios.

This means blkdev_issue_zeroout() must cope with WRITE ZEROES failing
and fall back to manually zeroing, unless BLKDEV_ZERO_NOFALLBACK is
specified.  For BLKDEV_ZERO_NOFALLBACK case, return -EOPNOTSUPP if
sd_done() has just set ->no_write_same thus indicating lack of offload
support.

Fixes: c20cfc27 ("block: stop using blkdev_issue_write_same for zeroing")
Cc: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: factor out __blkdev_issue_zero_pages() · 425a4dba

Committed by Ilya Dryomov on Oct 16, 2017

blkdev_issue_zeroout() will use this in !BLKDEV_ZERO_NOFALLBACK case.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: move CAP_SYS_ADMIN check in blkdev_roset() · bb749b31

Committed by Ilya Dryomov on Oct 18, 2017

Check for CAP_SYS_ADMIN before calling into the driver, similar to
blkdev_flushbuf().  This is safer and can spare a check in the driver.

(Currently BLKROSET is overridden by md and rbd, rbd is missing the
check.  md has the check, but it covers a lot more than BLKROSET.)
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

25 Oct, 2017 · 1 commit

block: Invalidate cache on discard v2 · 351499a1

Committed by Dmitry Monakhov on Oct 24, 2017

It is reasonable to drop the page cache on discard; otherwise those pages
may be written by writeback seconds later, and thin provisioned devices
will not be happy. This also seems to be a security leak in the secure
discard case.

Also add a check for the queue_discard flag at an early stage.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

19 Oct, 2017 · 2 commits

block: remove blk_mq_reinit_tagset · dab7487b

Committed by Sagi Grimberg on Oct 11, 2017

No callers left.
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>

block: introduce blk_mq_tagset_iter · 149e10f8

Committed by Sagi Grimberg on Oct 11, 2017

Iterator helper to apply a function on all the
tags in a given tagset. Export it as it will be used
outside the block layer later on.
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>

18 Oct, 2017 · 1 commit

kyber: fix hang on domain token wait queue · 8cf46660

Committed by Omar Sandoval on Oct 11, 2017

When we're getting a domain token, if we fail to get a token on our
first attempt, we put the current hardware queue on a wait queue and
then try again just in case a token was freed after our initial attempt
but before we got on the wait queue. If this second attempt succeeds, we
currently leave the hardware queue on the wait queue. Usually this is
okay; we'll just run the hardware queue one extra time when another
token is freed. However, if the hardware queue doesn't have any other
requests waiting, then when it gets the extra wakeup, it won't have
anything to free and therefore won't wake up any other hardware queues.
If tokens are limited, then we won't make forward progress and the
device will hang.
Reported-by: Bin Zha <zhabin.zb@alibaba-inc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
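
The two-attempt pattern described above can be modelled as follows. This is a simplified sketch and not the kyber code: TokenPool, the freed_between_attempts parameter (which simulates a token being released between the two attempts), and the choice to drop the waiter on success are all assumptions; the actual patch may resolve the stale wait-queue entry differently.

```python
class TokenPool:
    """Hypothetical pool of domain tokens with a list of waiting queues."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.waiters = []

    def try_get(self):
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False


def get_domain_token(pool, hctx, freed_between_attempts=0):
    """First attempt, then queue up and retry once."""
    if pool.try_get():
        return True
    pool.waiters.append(hctx)
    # Simulate tokens being freed after the first attempt failed.
    pool.tokens += freed_between_attempts
    if pool.try_get():
        # Do not stay on the wait queue once we hold a token, so a later
        # wakeup is not wasted on a queue with nothing left to do.
        pool.waiters.remove(hctx)
        return True
    return False


raced = TokenPool(0)
got_token = get_domain_token(raced, "hctx0", freed_between_attempts=1)
```

When the second attempt succeeds, the hardware queue is no longer listed as a waiter, so the next freed token wakes a queue that actually needs it.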

17 Oct, 2017 · 1 commit

block: fix Sphinx kernel-doc warning · 519c8e9f

Committed by Randy Dunlap on Oct 16, 2017

Sphinx treats symbols that end with '_' as a kind of special
documentation indicator, so fix that by adding an ending '*'
to it.

../block/bio.c:404: ERROR: Unknown target name: "gfp".
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

11 Oct, 2017 · 3 commits

block: set request_list for request · 85acb3ba

Committed by Shaohua Li on Oct 06, 2017

Legacy queue sets request's request_list, mq doesn't. This makes mq do
the same thing, so we can find cgroup of a request. Note, we really
only use blkg field of request_list, it's pointless to allocate mempool
for request_list in mq case.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-stat: delete useless code · eca8b53a

Committed by Shaohua Li on Oct 06, 2017

Fix two issues:
- the per-cpu stat flush is unnecessary, nobody uses per-cpu stat except
  sum it to global stat. We can do the calculation there. The flush just
  wastes cpu time.
- some fields are signed int/s64. I don't see the point.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-throttle: fix null pointer dereference while throttling writeback IOs · 53cfdc10

Committed by Jiufei Xue on Oct 10, 2017

A null pointer dereference can occur when blkcg is removed manually
with writeback IOs inflight. This is caused by the following case:

The writeback kworker submits the bio and sets bio->bi_cg_private to tg
in blk_throtl_assoc_bio().
Then we remove the block cgroup manually; the blkg and tg are
freed if there is no request inflight.
When the submitted bio comes back, blk_throtl_bio_endio() fetches the tg,
which was already freed.

Fix this by increasing the refcount of blkg in function
blk_throtl_assoc_bio() so that the blkg will not be freed until
bio_endio is called.
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jiufei Xue <jiufei.xjf@alibaba-inc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
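
The lifetime rule behind this fix can be shown with a toy reference counter. This is a user-space sketch under stated assumptions: the Blkg class, the dict used as a bio, and both helper functions are hypothetical stand-ins, not the blk-throttle implementation.

```python
class Blkg:
    """Hypothetical refcounted object; freed when the count reaches zero."""

    def __init__(self):
        self.refcount = 1  # reference held by the cgroup itself
        self.freed = False

    def get(self):
        self.refcount += 1

    def put(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True


def assoc_bio(bio, blkg):
    """Take a reference for as long as the bio points at the blkg."""
    blkg.get()
    bio["blkg"] = blkg


def bio_endio(bio):
    """Completion path: the blkg must still be alive here."""
    blkg = bio.pop("blkg")
    still_alive = not blkg.freed
    blkg.put()  # drop the reference taken in assoc_bio()
    return still_alive


blkg = Blkg()
bio = {}
assoc_bio(bio, blkg)
blkg.put()  # the cgroup is removed while the bio is still in flight
alive_at_endio = bio_endio(bio)
```

Removing the cgroup only drops its own reference, so the object survives until the in-flight bio completes and releases the last one.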

10 Oct, 2017 · 1 commit

blkcg: check pol->cpd_free_fn before free cpd · 58a9edce

Committed by weiping zhang on Oct 10, 2017

Check pol->cpd_free_fn() instead of pol->cpd_alloc_fn() when freeing cpd.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

09 Oct, 2017 · 2 commits

block, bfq: fix unbalanced decrements of burst size · 99fead8d

Committed by Paolo Valente on Oct 09, 2017

The commit "block, bfq: decrease burst size when queues in burst
exit" introduced the decrement of burst_size on the removal of a
bfq_queue from the burst list. Unfortunately, this decrement can
happen to be performed even when burst size is already equal to 0,
because of unbalanced decrements. A description follows of the cause
of these unbalanced decrements, namely a wrong assumption, and of
how this wrong assumption leads to unbalanced decrements.

The wrong assumption is that a bfq_queue can exit only if the process
associated with the bfq_queue has exited. This is false, because a
bfq_queue, say Q, may exit also as a consequence of a merge with
another bfq_queue. In this case, Q exits because the I/O of its
associated process has been redirected to another bfq_queue.

The decrement unbalance occurs because Q may then be re-created after
a split, and added back to the current burst list, *without*
incrementing burst_size. burst_size is not incremented because Q is
not a new bfq_queue added to the burst list, but a bfq_queue only
temporarily removed from the list, and, before the commit "bfq-sq,
bfq-mq: decrease burst size when queues in burst exit", burst_size was
not decremented when Q was removed.

This commit addresses this issue by just checking whether the exiting
bfq_queue is a merged bfq_queue, and, in that case, not decrementing
burst_size. Unfortunately, this still leaves room for unbalanced
decrements, in the following rarer case: on a split, the bfq_queue
happens to be inserted into a different burst list than that it was
removed from when merged. If this happens, the number of elements in
the new burst list becomes higher than burst_size (by one). When the
bfq_queue then exits, it is of course not in a merged state any
longer, thus burst_size is decremented, which results in an unbalanced
decrement. To handle this sporadic, unlucky case in a simple way,
this commit also checks that burst_size is larger than 0 before
decrementing it.

Finally, this commit removes a useless, extra check: the check that
the bfq_queue is sync, performed before checking whether the bfq_queue
is in the burst list. This extra check is redundant, because only sync
bfq_queues can be inserted into the burst list.

Fixes: 7cb04004 ("block, bfq: decrease burst size when queues in burst exit")
Reported-by: Philip Müller <philm@manjaro.org>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Tested-by: Philip Müller <philm@manjaro.org>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
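
The two guards described in this commit message can be condensed into a small sketch. It is a simplified model, not the BFQ code: the function name and its boolean parameter are hypothetical, and the real check also depends on whether the queue is in the burst list at all.

```python
def burst_size_after_exit(burst_size, queue_is_merged):
    """Return the burst size after a queue in the burst list exits."""
    if queue_is_merged:
        # A merged queue may come back after a split, so leave the count.
        return burst_size
    if burst_size > 0:
        return burst_size - 1
    # Already zero: skip the decrement instead of going negative.
    return burst_size


after_merge_exit = burst_size_after_exit(3, queue_is_merged=True)
after_normal_exit = burst_size_after_exit(3, queue_is_merged=False)
after_unlucky_exit = burst_size_after_exit(0, queue_is_merged=False)
```

The third call corresponds to the rarer case from the message, where a queue lands in a different burst list after a split and the counter is already 0 when it exits.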

block,bfq: Disable writeback throttling · b5dc5d4d

Committed by Luca Miccio on Oct 09, 2017

Similarly to CFQ, BFQ has its write-throttling heuristics, and it
is better not to combine them with further write-throttling
heuristics of a different nature.
So this commit disables write-back throttling for a device if BFQ
is used as I/O scheduler for that device.
Signed-off-by: Luca Miccio <lucmiccio@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

07 Oct, 2017 · 2 commits

block/bio: Remove null checks before mempool_destroy in bioset_free · 4078def8

Committed by Tim Hansen on Oct 06, 2017

This patch removes redundant checks for null values on bio_pool and
bvec_pool.

Found using make coccicheck M=block/ on linux-net tree on the
next-20170929 tag.
Signed-off-by: Tim Hansen <devtimhansen@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: remove unnecessary NULL checks in bioset_integrity_free() · 4b14a5c5

Committed by Tim Hansen on Oct 05, 2017

mempool_destroy() already checks for a NULL value being passed in, this
eliminates duplicate checks.

This was caught by running make coccicheck M=block/ on linus' tree on
commit 77ede3a0 (current head as of this
patch).
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Tim Hansen <devtimhansen@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

06 Oct, 2017 · 1 commit

block: remove QUEUE_FLAG_STACKABLE · 5fdee212

Committed by Christoph Hellwig on Oct 05, 2017

We already have a queue_is_rq_based helper to check if a request_queue
is request based, so we can remove the flag for it.
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

05 Oct, 2017 · 1 commit

blk-mq: document the need to have STARTED and COMPLETED share a byte · fc13457f

Committed by Jens Axboe on Oct 04, 2017

For memory ordering guarantees on stores, we need to ensure that
these two bits share the same byte of storage in the unsigned
long. Add a comment as to why, and a BUILD_BUG_ON() to ensure that
we don't violate this requirement.
Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
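
The "same byte" requirement amounts to a simple arithmetic check on the two bit numbers. The sketch below is a user-space analogue of such a build-time assertion; the bit positions used are hypothetical examples and are not taken from the kernel headers.

```python
BITS_PER_BYTE = 8


def share_a_byte(bit_a, bit_b):
    """True when two bit numbers fall within the same byte of a word."""
    return bit_a // BITS_PER_BYTE == bit_b // BITS_PER_BYTE


# Hypothetical bit numbers standing in for the two flags.
EXAMPLE_COMPLETE_BIT = 0
EXAMPLE_STARTED_BIT = 1

flags_ok = share_a_byte(EXAMPLE_COMPLETE_BIT, EXAMPLE_STARTED_BIT)
```

Bits 0 and 1 pass the check, whereas a pair such as 7 and 8 would straddle a byte boundary and fail it.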

openanolis / cloud-kernel · synced successfully more than 1 year ago