Commits · e5edd5f298fafda28284bafb8371e6f0b7681035 · openeuler / Kernel

18 Dec, 2018 4 commits

blk-mq: skip zero-queue maps in blk_mq_map_swqueue · e5edd5f2

Committed by Ming Lei on Dec 18, 2018

Since 7e849dd9 ("nvme-pci: don't share queue maps"), the mapping
table is not actually initialized if map->nr_queues is zero, so
we can't use blk_mq_map_queue_type() to retrieve the hctx any more.

Doing so may still produce a broken mapping; fix it by skipping
zero-queue maps in blk_mq_map_swqueue().

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e5edd5f2
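
The skip described in e5edd5f2 can be pictured with a small userspace sketch. This is not the kernel code: `struct queue_map`, the `TYPE_*` constants and `map_sw_queues()` are hypothetical stand-ins, and the only point illustrated is that a map with `nr_queues == 0` is never dereferenced.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
enum { TYPE_DEFAULT, TYPE_READ, TYPE_POLL, MAX_TYPES };

struct queue_map {
    unsigned int nr_queues;      /* 0 means the driver left this map unused */
    const unsigned int *mq_map;  /* cpu -> hw queue index; NULL when unused */
};

/*
 * Count how many (cpu, type) pairs resolve to a hardware queue, skipping
 * any map whose nr_queues is zero so its uninitialized table is never read.
 */
static unsigned int map_sw_queues(const struct queue_map maps[MAX_TYPES],
                                  unsigned int nr_cpus)
{
    unsigned int mapped = 0;

    for (unsigned int cpu = 0; cpu < nr_cpus; cpu++) {
        for (int type = 0; type < MAX_TYPES; type++) {
            if (maps[type].nr_queues == 0)
                continue; /* zero-queue map: table was never set up */
            assert(maps[type].mq_map != NULL);
            assert(maps[type].mq_map[cpu] < maps[type].nr_queues);
            mapped++;
        }
    }
    return mapped;
}
```

With only the default map populated, every CPU resolves exactly one hardware queue and the read and poll maps are ignored.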

block: fix blk-iolatency accounting underflow · 13369816

Committed by Dennis Zhou on Dec 17, 2018

The blk-iolatency controller measures the time from rq_qos_throttle() to
rq_qos_done_bio() and attributes this time to the first bio that needs
to create the request. This means if a bio is plug-mergeable or
bio-mergeable, it gets to bypass the blk-iolatency controller.

The recent series [1] to tag all bios w/ blkgs undermined how iolatency
was determining which bios it was charging and should process in
rq_qos_done_bio(). Because all bios are being tagged, this caused the
atomic_t for the struct rq_wait inflight count to underflow and result
in a stall.

This patch adds a new flag BIO_TRACKED to let controllers know that a
bio is going through the rq_qos path. blk-iolatency now checks if this
flag is set to see if it should process the bio in rq_qos_done_bio().

Overloading BLK_QUEUE_ENTERED works, but makes the flag rules confusing.
BIO_THROTTLED was another candidate, but the flag is set for all bios
that have gone through blk-throttle code. Overloading a flag comes with
the burden of making sure that when either implementation changes, a
change in setting rules for one doesn't cause a bug in the other. So
here, we unfortunately opt for adding a new flag.

[1] https://lore.kernel.org/lkml/20181205171039.73066-1-dennis@kernel.org/

Fixes: 5cdf2e3f ("blkcg: associate blkg when associating a device")
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

13369816
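
The accounting problem fixed by 13369816 can be shown in miniature. The sketch below is a userspace illustration only: the struct names, the flag's bit value and the two helper functions are assumptions, not the kernel's definitions. It shows why gating the completion path on a "tracked" flag keeps the inflight counter from underflowing when merged bios bypass the throttle path.

```c
#include <assert.h>

/* Illustrative bit value; the kernel's BIO_TRACKED is defined differently. */
#define BIO_TRACKED_FLAG (1u << 0)

struct bio_sketch     { unsigned int flags; };
struct rq_wait_sketch { int inflight; };

/* Throttle path: only bios that enter here are counted and tagged. */
static void throttle(struct rq_wait_sketch *rqw, struct bio_sketch *bio)
{
    rqw->inflight++;
    bio->flags |= BIO_TRACKED_FLAG;
}

/* Completion path: ignore bios that never went through throttle(). */
static void done_bio(struct rq_wait_sketch *rqw, struct bio_sketch *bio)
{
    if (!(bio->flags & BIO_TRACKED_FLAG))
        return;
    assert(rqw->inflight > 0);
    rqw->inflight--;
}
```

Without the flag check, completing a merged bio would decrement a counter it never incremented.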

blk-mq: fix dispatch from sw queue · c16d6b5a

Committed by Ming Lei on Dec 17, 2018

When a request is added to the rq list of a sw queue (ctx), the rq may be
from a different type of hctx, especially after multi queue mapping is
introduced.

So when dispatching requests from a sw queue via blk_mq_flush_busy_ctxs()
or blk_mq_dequeue_from_ctx(), a request belonging to another queue type of
hctx can be dispatched to the current hctx if the read queue or poll
queue is enabled.

This patch fixes this issue by introducing a per-queue-type list.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>

Changed by me to not use separately cacheline aligned lists, just
place them all in the same cacheline where we had just the one list
and lock before.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c16d6b5a
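
A rough picture of the per-queue-type list idea from c16d6b5a, as a userspace sketch. The fixed-size arrays, the struct names and the `ctx_add()`/`ctx_flush()` helpers are hypothetical; the kernel uses linked lists and locking that are omitted here. What the sketch shows is that flushing for one hctx type can never return a request queued for another type.

```c
#include <assert.h>

enum { TYPE_DEFAULT, TYPE_READ, TYPE_POLL, MAX_TYPES };
#define LIST_CAP 8

/* Hypothetical fixed-size request list; requests are plain integer ids. */
struct rq_list_sketch { int ids[LIST_CAP]; int len; };

/* One list per queue type instead of a single shared list. */
struct sw_ctx_sketch { struct rq_list_sketch lists[MAX_TYPES]; };

static void ctx_add(struct sw_ctx_sketch *ctx, int type, int rq_id)
{
    assert(type >= 0 && type < MAX_TYPES);
    struct rq_list_sketch *l = &ctx->lists[type];
    assert(l->len < LIST_CAP);
    l->ids[l->len++] = rq_id;
}

/* Move every request queued for hctx_type into out[]; return the count. */
static int ctx_flush(struct sw_ctx_sketch *ctx, int hctx_type, int *out)
{
    assert(hctx_type >= 0 && hctx_type < MAX_TYPES);
    struct rq_list_sketch *l = &ctx->lists[hctx_type];
    int n = l->len;

    for (int i = 0; i < n; i++)
        out[i] = l->ids[i];
    l->len = 0;
    return n;
}
```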

block: mq-deadline: Fix write completion handling · 7211aef8

Committed by Damien Le Moal on Dec 17, 2018

For a zoned block device using mq-deadline, if a write request for a
zone is received while another write was already dispatched for the same
zone, dd_dispatch_request() will return NULL and the newly inserted
write request is kept in the scheduler queue waiting for the ongoing
zone write to complete. With this behavior, when no other request has
been dispatched, rq_list in blk_mq_sched_dispatch_requests() is empty
and blk_mq_sched_mark_restart_hctx() is not called. This in turn causes
the blk_mq_sched_restart() call in __blk_mq_free_request() to not run the
queue when the already dispatched write request completes. The newly
inserted request stays stuck in the scheduler queue until eventually
another request is submitted.

This problem does not affect SCSI disks as the SCSI stack handles queue
restart on request completion. However, this problem can be triggered
with the nullblk driver with zoned mode enabled.

Fix this by always requesting a queue restart in dd_dispatch_request()
if no request was dispatched while WRITE requests are queued.

Fixes: 5700f691 ("mq-deadline: Introduce zone locking support")
Cc: <stable@vger.kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>

Add missing export of blk_mq_sched_restart()
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7211aef8

17 Dec, 2018 10 commits

nvme-pci: don't share queue maps · 7e849dd9

Committed by Christoph Hellwig on Dec 17, 2018

Now that the block layer checks if a queue map has any queues inside
it there is no more reason to duplicate the maps for the non-default
types.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7e849dd9

blk-mq: only dispatch to non-defauly queue maps if they have queues · 5aceaeb2

Committed by Christoph Hellwig on Dec 17, 2018

We should check if a given queue map actually has queues enabled before
dispatching to it.  This allows drivers to not initialize optional but
unused map types, which subsequently will allow fixing problems with
queue map rebuilds for that case.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5aceaeb2

blk-mq: export hctx->type in debugfs instead of sysfs · 346fc108

Committed by Ming Lei on Dec 17, 2018

Currently we only export hctx->type via sysfs, and there is no such info
in the hctx entry under debugfs. We often use only debugfs to diagnose
queue mapping issues, so add the support in debugfs.

Queue mapping becomes a bit more complicated now that multiple queue
mapping is supported; we may write a blktest to verify that the queue
mapping is valid based on blk-mq-debugfs.

Since it is not necessary to export hctx->type twice, remove the export
from sysfs.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

346fc108

blk-mq: fix allocation for queue mapping table · 07b35eb5

Committed by Ming Lei on Dec 17, 2018

The type of each element in the queue mapping table is 'unsigned int',
instead of 'struct blk_mq_queue_map', so fix it.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

07b35eb5
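
The class of mistake fixed by 07b35eb5 is sizing an allocation by the wrong type. A minimal userspace sketch, assuming a hypothetical `struct queue_map_sketch` and helper `alloc_mq_map()`, shows the usual way to avoid it: take `sizeof` from the pointer being assigned, so the element size follows the element type.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the structure that owns the table. */
struct queue_map_sketch {
    unsigned int *mq_map;     /* one unsigned int per CPU */
    unsigned int nr_queues;
    unsigned int queue_offset;
};

/*
 * Allocate a zeroed cpu -> queue table. sizeof(*map) is the size of one
 * table element (unsigned int), not the size of the owning structure.
 */
static unsigned int *alloc_mq_map(size_t nr_cpus)
{
    unsigned int *map = calloc(nr_cpus, sizeof(*map));

    return map; /* NULL on allocation failure; caller must check */
}
```

Sizing each element as the whole structure would still work but would allocate more memory than the table needs.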

blk-wbt: export internal state via debugfs · d19afebc

Committed by Ming Lei on Dec 17, 2018

This information is helpful to either investigate issues, or understand
wbt's internal behaviour.

Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d19afebc

blk-mq-debugfs: support rq_qos · cc56694f

Committed by Ming Lei on Dec 17, 2018

blk-mq-debugfs has proved very helpful for debugging some
tough issues, such as IO hangs.

We have seen blk-wbt related IO hangs several times; there is even
such a report in Red Hat BZ that is not solved yet. So this patch
adds debugfs support for rq_qos.

Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cc56694f

block: update sysfs documentation · f9824952

Committed by Damien Le Moal on Nov 30, 2018

Add the description of the zoned, nr_zones and chunk_sectors sysfs queue
attributes to Documentation/block/queue-sysfs.txt. The descriptions of
the zoned and chunk_sectors attributes are mostly copied from
ABI/testing/sysfs-block (with a typo fix added). While at it, also fix a
typo in the description of the io_poll_delay attribute.

The nr_zones description is also added to ABI/testing/sysfs-block and
the contact email address is updated for the zoned attribute.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f9824952

block: loop: check error using IS_ERR instead of IS_ERR_OR_NULL in loop_add() · 38a3499f

Committed by Chengguang Xu on Dec 16, 2018

blk_mq_init_queue() will not return a NULL pointer to its caller,
so it's better to replace IS_ERR_OR_NULL with IS_ERR in loop_add().

If in the future things change to check for a NULL pointer inside
loop_add(), we should return -ENOMEM as the return code instead of
PTR_ERR(NULL).
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

38a3499f
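
The remark about PTR_ERR(NULL) in 38a3499f is easy to check with a userspace imitation of the kernel's error-pointer helpers. The functions below are lowercase re-implementations written for this sketch (the names and `MAX_ERRNO_SKETCH` are assumptions, and the pointer/integer casts rely on common-platform behaviour); they are not the kernel's `err.h`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO_SKETCH 4095

/* Encode a negative errno value in a pointer. */
static inline void *err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* Decode a pointer back into an errno value. */
static inline long ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True only for pointers in the top MAX_ERRNO_SKETCH addresses. */
static inline bool is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO_SKETCH;
}

static inline bool is_err_or_null(const void *ptr)
{
    return ptr == NULL || is_err(ptr);
}
```

Because `ptr_err(NULL)` evaluates to 0, a caller that treats NULL as an error and returns the decoded value would report success, which is why the commit message suggests -ENOMEM for any future NULL case.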

aoe: add __exit annotation · e7cc005f

Committed by Chengguang Xu on Dec 16, 2018

Add __exit annotation to cleanup helper which
is only called once in the module.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e7cc005f

block: clear REQ_HIPRI if polling is not supported · d04c406f

Committed by Christoph Hellwig on Dec 14, 2018

This prevents a HIPRI bio from being submitted through a stacking
driver that does not support polling and thus won't poll for I/O
completion.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d04c406f

16 Dec, 2018 5 commits

blk-mq: replace and kill blk_mq_request_issue_directly · d6a51a97

Committed by Jianchao Wang on Dec 14, 2018

Replace blk_mq_request_issue_directly with blk_mq_try_issue_directly
in blk_insert_cloned_request and kill it as nobody uses it any more.
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d6a51a97

blk-mq: issue directly with bypass 'false' in blk_mq_sched_insert_requests · 5b7a6f12

Committed by Jianchao Wang on Dec 14, 2018

It is not necessary to issue requests directly with bypass 'true'
in blk_mq_sched_insert_requests and handle the non-issued requests
itself. Just set bypass to 'false' and let blk_mq_try_issue_directly
handle them entirely. Remove the blk_rq_can_direct_dispatch check,
because blk_mq_try_issue_directly can handle it well. If a request
cannot be issued directly, insert the rest.
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5b7a6f12

blk-mq: refactor the code of issue request directly · 7f556a44

Committed by Jianchao Wang on Dec 14, 2018

Merge blk_mq_try_issue_directly and __blk_mq_try_issue_directly
into one interface to unify the interfaces to issue requests
directly. The merged interface takes over the requests entirely:
it can insert, end or do nothing based on the return value of
.queue_rq and the 'bypass' parameter. The caller then needs no other
handling and the code can be cleaned up.

Also, commit c616cbee ("blk-mq: punt failed direct issue
to dispatch list") always inserts requests into the hctx dispatch list
whenever it gets a BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE; this is
overkill and will harm merging. We only need to do that for
the requests that have been through .queue_rq. This patch also
fixes this.
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7f556a44

block: remove the bio_integrity_advance export · 4c9770c9

Committed by Christoph Hellwig on Dec 13, 2018

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4c9770c9

block: remove the bioset_integrity_free export · 74030653

Committed by Christoph Hellwig on Dec 13, 2018

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

74030653

14 Dec, 2018 6 commits

block: remove the unused bio_set_pages_dirty and bio_check_pages_dirty exports · a45eb575

Committed by Christoph Hellwig on Dec 13, 2018

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a45eb575

block: remove the unused bio_iov_iter_get_pages export · 0374e113

Committed by Christoph Hellwig on Dec 13, 2018

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0374e113

block: remove the blk_recount_segments export · 637b60ad

Committed by Christoph Hellwig on Dec 13, 2018

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

637b60ad

block: remove the bio_phys_segments export · 6c210aa5

Committed by Christoph Hellwig on Dec 13, 2018

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6c210aa5

nvme: fix kernel paging oops · 092ff052

Committed by Sagi Grimberg on Dec 13, 2018

Free the controller discard_page correctly.

Fixes: cb5b7262 ("nvme: provide fallback for discard alloc failure")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

092ff052

Merge branch 'nvme-4.21' of git://git.infradead.org/nvme into for-4.21/block · 2d9a058e

Committed by Jens Axboe on Dec 13, 2018

Pull NVMe updates from Christoph:

"Here is the second large chunk of nvme updates for 4.21:

 - host and target support for NVMe over TCP (Sagi Grimberg,
	Roy Shterman, Solganik Alexander)
 - error log page support in target (Chaitanya Kulkarni)

plus small fixes and improvements from Jens Axboe and Chengguang Xu."

* 'nvme-4.21' of git://git.infradead.org/nvme: (33 commits)
  nvme-rdma: support separate queue maps for read and write
  nvme-tcp: support separate queue maps for read and write
  nvme-fabrics: allow user to set nr_write_queues for separate queue maps
  nvme-fabrics: add missing nvmf_ctrl_options documentation
  blk-mq-rdma: pass in queue map to blk_mq_rdma_map_queues
  nvmet: update smart log with num err log entries
  nvmet: add error log page cmd handler
  nvmet: add error log support for file backend
  nvmet: add error log support for bdev backend
  nvmet: add error log support for admin-cmd
  nvmet: add error log support for rdma backend
  nvmet: add error log support for fabrics-cmd
  nvmet: add error log support in the core
  nvmet: add interface to update error-log page
  nvmet: add error-log definitions
  nvme: add error log page slot definition
  nvme: remove nvme_common command cdw10 array
  nvmet: remove unused variable
  nvme: provide fallback for discard alloc failure
  nvme: add __exit annotation
  ...

2d9a058e

13 Dec, 2018 15 commits

bcache: print number of keys in trace_bcache_journal_write · e78bd0d2

Committed by Guoju Fang on Dec 13, 2018

Sometimes journal flushes may be very frequent, so it's useful to dump
the number of keys on every journal write.
Signed-off-by: Guoju Fang <fangguoju@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e78bd0d2

bcache: set writeback_percent in a flexible range · cc38ca7e

Committed by Coly Li on Dec 13, 2018

Because CUTOFF_WRITEBACK is defined as 40, before the change to
dynamic cutoff writeback values writeback_percent was limited to [0,
CUTOFF_WRITEBACK]. Any value larger than CUTOFF_WRITEBACK was fixed
up to 40.

Now the cutoff writeback limit is a dynamic value, bch_cutoff_writeback,
so the range of writeback_percent can be the more flexible [0,
bch_cutoff_writeback]. The flexibility is that it can be expanded to a
larger or smaller range than [0, 40], depending on how the value of
bch_cutoff_writeback is specified.

The default value is still strongly recommended for most users and
most workloads. But people who want to do research on bcache
writeback performance tuning now have the chance to specify a more
flexible writeback_percent in the range [0, 70].
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cc38ca7e

bcache: make cutoff_writeback and cutoff_writeback_sync tunable · 9aaf5165

Committed by Coly Li on Dec 13, 2018

Currently the cutoff writeback and cutoff writeback sync thresholds are
defined by CUTOFF_WRITEBACK (40) and CUTOFF_WRITEBACK_SYNC (70) as
static values. Most of the time they work fine, but when people want
to do research on bcache writeback mode performance tuning, there is no
way to modify the soft and hard cutoff writeback values.

This patch introduces two module parameters, bch_cutoff_writeback_sync
and bch_cutoff_writeback, which permit people to tune the values when
loading bcache.ko. If they are not specified at module load time, the
current values CUTOFF_WRITEBACK_SYNC and CUTOFF_WRITEBACK are used as
defaults and nothing changes.

When people want to tune these two values,
- cutoff_writeback can be set in range [1, 70]
- cutoff_writeback_sync can be set in range [1, 90]
- cutoff_writeback always <= cutoff_writeback_sync

The default values are strongly recommended for most users and most
workloads. However, if people want to take their own risk and do research
on new writeback cutoff tuning for their own workload, now they can do
so.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9aaf5165
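
The three range rules listed in 9aaf5165 can be expressed as a small normalization routine. This is only a sketch under stated assumptions: the `*_MAX` macro names, the `struct cutoffs` type and `normalize_cutoffs()` are invented here, treating 0 as "parameter not specified" is a choice made for the sketch, and clamping out-of-range values is one plausible policy that the commit message itself does not spell out.

```c
#include <assert.h>

#define CUTOFF_WRITEBACK          40  /* default from the commit message */
#define CUTOFF_WRITEBACK_SYNC     70  /* default from the commit message */
#define CUTOFF_WRITEBACK_MAX      70  /* assumed name for the upper bound */
#define CUTOFF_WRITEBACK_SYNC_MAX 90  /* assumed name for the upper bound */

struct cutoffs {
    unsigned int writeback;
    unsigned int writeback_sync;
};

/* 0 means "not specified" in this sketch and selects the default. */
static struct cutoffs normalize_cutoffs(unsigned int wb, unsigned int wb_sync)
{
    struct cutoffs c;

    c.writeback_sync = wb_sync ? wb_sync : CUTOFF_WRITEBACK_SYNC;
    if (c.writeback_sync > CUTOFF_WRITEBACK_SYNC_MAX)
        c.writeback_sync = CUTOFF_WRITEBACK_SYNC_MAX;

    c.writeback = wb ? wb : CUTOFF_WRITEBACK;
    if (c.writeback > CUTOFF_WRITEBACK_MAX)
        c.writeback = CUTOFF_WRITEBACK_MAX;

    /* Enforce cutoff_writeback <= cutoff_writeback_sync. */
    if (c.writeback > c.writeback_sync)
        c.writeback = c.writeback_sync;

    assert(c.writeback >= 1 && c.writeback <= c.writeback_sync);
    return c;
}
```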

bcache: add MODULE_DESCRIPTION information · 009673d0

Committed by Coly Li on Dec 13, 2018

This patch moves MODULE_AUTHOR and MODULE_LICENSE to the end of super.c,
and adds MODULE_DESCRIPTION("Bcache: a Linux block layer cache").

This is preparation for adding module parameters.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

009673d0

bcache: option to automatically run gc thread after writeback · 7a671d8e

Committed by Coly Li on Dec 13, 2018

The option gc_after_writeback is disabled by default, because garbage
collection will discard SSD data, which drops cached data.

Echoing 1 into /sys/fs/bcache/<UUID>/internal/gc_after_writeback
enables this option, which wakes up the gc thread when writeback has
finished and all cached data is clean.

This option is helpful for people who care more about write performance.
Under a heavy write workload, all cached data can only become clean when
the writeback thread cleans it during I/O idle time. In such a
situation a following gc run may help to shrink the bcache B+ tree and
discard more clean data, which may be helpful for future write
requests.

If you are not sure whether this is helpful for your own workload,
please leave it disabled, which is the default.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7a671d8e

bcache: introduce force_wake_up_gc() · cb07ad63

Committed by Coly Li on Dec 13, 2018

The garbage collection thread starts to work when c->sectors_to_gc is
a negative value; otherwise nothing will happen even if the gc thread is
woken up by wake_up_gc().

force_wake_up_gc() sets c->sectors_to_gc to -1 before calling
wake_up_gc(), so the gc thread has a chance to run if no one else sets
c->sectors_to_gc to a positive value before gc_should_run().

This routine can be called where the gc thread is woken up and required
to run.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cb07ad63
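
The behaviour described in cb07ad63 can be modelled in a few lines. The sketch below is a single-threaded userspace model: `struct cache_set_sketch` is hypothetical, a plain `int` replaces whatever counter type the kernel uses, the wakeup is reduced to setting a flag, and the run check in `gc_should_run()` is reduced to the sign test the commit message mentions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, single-threaded model of the relevant cache set state. */
struct cache_set_sketch {
    int  sectors_to_gc; /* gc only runs when this is negative */
    bool gc_woken;      /* stands in for waking the gc thread */
};

static bool gc_should_run(const struct cache_set_sketch *c)
{
    return c->sectors_to_gc < 0;
}

/* A plain wakeup does nothing useful unless gc_should_run() is true. */
static void wake_up_gc(struct cache_set_sketch *c)
{
    c->gc_woken = true;
}

/* Make the run condition true first, then wake the thread. */
static void force_wake_up_gc(struct cache_set_sketch *c)
{
    assert(c != NULL);
    c->sectors_to_gc = -1;
    wake_up_gc(c);
}
```

In the real code another context may reset the counter to a positive value before the check runs, which is why the commit message says the gc thread "has a chance" to run rather than guaranteeing it.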

bcache: cannot set writeback_running via sysfs if no writeback kthread created · f383ae30