Commits · 0be600a5add76e8e8b9e1119f2a7426ff849aca8 · openeuler / Kernel

30 Jan, 2018 6 commits

dm mpath selector: more evenly distribute ties · f2042605

Committed by Khazhismel Kumykov on Jan 19, 2018

Move the last used path to the end of the list (least preferred) so that
ties are more evenly distributed.

For example, in a case with three paths where one is slower than the
others, the remaining two would be used unevenly if they tie. This is
because the rotation is not a truly fair distribution.

Illustrated: paths a, b, c, 'c' has 1 outstanding IO, a and b are 'tied'
Three possible rotations:
(a, b, c) -> best path 'a'
(b, c, a) -> best path 'b'
(c, a, b) -> best path 'a'
(a, b, c) -> best path 'a'
(b, c, a) -> best path 'b'
(c, a, b) -> best path 'a'
...

So 'a' is used 2x more than 'b', although they should be used evenly.

With this change, the most recently used path is always the least
preferred, which removes this bias and results in an even distribution.
(a, b, c) -> best path 'a'
(b, c, a) -> best path 'b'
(c, a, b) -> best path 'a'
(c, b, a) -> best path 'b'
...
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
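
The bias illustrated in this commit message can be reproduced with a small simulation. This is only a sketch, not the kernel selector: the function names are hypothetical, the loads are held constant (a and b at 0 outstanding IOs, c at 1), and ties are assumed to go to the path that sits earliest in the list.

```python
from collections import Counter


def pick_best(paths, load):
    # min() returns the first minimal element, so a tie goes to the
    # path that currently sits earliest in the list.
    return min(paths, key=lambda p: load[p])


def simulate_rotation(paths, load, rounds):
    """Old behaviour as modelled here: rotate the head to the tail each round."""
    paths = list(paths)
    used = Counter()
    for _ in range(rounds):
        used[pick_best(paths, load)] += 1
        paths.append(paths.pop(0))  # rotate regardless of which path was used
    return used


def simulate_last_used_to_tail(paths, load, rounds):
    """New behaviour: the path that was just used becomes least preferred."""
    paths = list(paths)
    used = Counter()
    for _ in range(rounds):
        best = pick_best(paths, load)
        used[best] += 1
        paths.remove(best)
        paths.append(best)
    return used


load = {"a": 0, "b": 0, "c": 1}
print(simulate_rotation("abc", load, 6))
print(simulate_last_used_to_tail("abc", load, 6))
```

Over six selections the rotation model picks 'a' four times and 'b' twice, while moving the last used path to the tail alternates between them.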

dm unstripe: fix target length versus number of stripes size check · cc656619

Committed by Scott Bauer on Jan 23, 2018

Since the unstripe target takes a target length which is the
size of *one* striped member we're trying to expose, not the
total size of *all* the striped members, the check does not
make sense and fails for some striped setups.

For example, say we have a 4TB striped device with two stripes,
i.e. 3907018496 sectors per underlying device:

if (sector_div(width, uc->stripes)) :
   3907018496 / 2(num stripes)  == 1953509248

tmp_len = width;
if (sector_div(tmp_len, uc->chunk_size)) :
   1953509248 / 256(chunk size) == 7630895.5
   (fails)

Fix this by removing the first check which isn't valid for unstriping.
Signed-off-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
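
The arithmetic in this message can be checked directly. The sketch below models sector_div() as a division that also yields the remainder; old_check() and new_check() are hypothetical names, and new_check() assumes that only the chunk-size check remains once the stripe division is removed.

```python
def sector_div(n, d):
    """Model of sector_div(): returns (quotient, remainder)."""
    return divmod(n, d)


def old_check(target_len, stripes, chunk_size):
    # First divide by the number of stripes, then require the result
    # to be a whole number of chunks.
    width, rem = sector_div(target_len, stripes)
    if rem:
        return False
    _, rem = sector_div(width, chunk_size)
    return rem == 0


def new_check(target_len, chunk_size):
    # Assumed remaining check: the target length itself must be a
    # whole number of chunks.
    _, rem = sector_div(target_len, chunk_size)
    return rem == 0


TARGET_LEN = 3907018496  # sectors of one striped member, from the example
print(old_check(TARGET_LEN, 2, 256))
print(new_check(TARGET_LEN, 256))
```

With the numbers from the message, the old check leaves a remainder of 128 sectors and rejects the table, while the length of one member is itself a multiple of the 256-sector chunk.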

dm thin: fix trailing semicolon in __remap_and_issue_shared_cell · bd6d1e0a

Committed by Luis de Bethencourt on Jan 17, 2018

The trailing semicolon is an empty statement that performs no operation.
Remove it, since it does nothing.
Signed-off-by: Luis de Bethencourt <luisbg@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm table: fix NVMe bio-based dm_table_determine_type() validation · eaa160ed

Committed by Mike Snitzer on Jan 13, 2018

The 'verify_rq_based:' code in dm_table_determine_type() was checking
all devices in the DM table rather than only checking the data devices.
Fix this by using the immutable target's iterate_devices method.

Also, tweak the block of dm_table_determine_type() code that decides
whether to upgrade from DM_TYPE_BIO_BASED to DM_TYPE_NVME_BIO_BASED so
that it makes sure the immutable_target doesn't require splitting
IOs.

These changes have been verified to allow a "thin-pool" target whose
data device is an NVMe device to be upgraded to DM_TYPE_NVME_BIO_BASED.
Using the thin-pool in NVMe bio-based mode was verified to pass all the
device-mapper-test-suite's "thin-provisioning" tests.

Also verified that request-based DM multipath (with queue_mode "rq" and
"mq") works as expected using the 'mptest' harness.

Fixes: 22c11858 ("dm: introduce DM_TYPE_NVME_BIO_BASED")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm: various cleanups to md->queue initialization code · c12c9a3c

Committed by Mike Snitzer on Jan 12, 2018

Also, add dm_sysfs_init() error handling to dm_create().
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm mpath: delay the retry of a request if the target responded as busy · ac514ffc

Committed by Mike Snitzer on Jan 12, 2018

Add DM_ENDIO_DELAY_REQUEUE to allow request-based multipath's
multipath_end_io() to instruct dm-rq.c:dm_done() to delay a requeue.
This is beneficial to do if BLK_STS_RESOURCE is returned from the target
(because target is busy).

For blk-mq: kick the hw queues via blk_mq_requeue_work(),
indirectly from dm-rq.c:__dm_mq_kick_requeue_list(), after a delay.

For old .request_fn: use blk_delay_queue().

bio-based multipath doesn't have feature parity with request-based for
retryable error requeues; that is something that'll need fixing in the
future.
Suggested-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Bart Van Assche <bart.vanassche@wdc.com>
[as interpreted from Bart's "... patch looks fine to me."]

18 Jan, 2018 1 commit

blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback · 396eaf21

Committed by Ming Lei on Jan 17, 2018

blk_insert_cloned_request() is called in the fast path of a dm-rq driver
(e.g. blk-mq request-based DM mpath).  blk_insert_cloned_request() uses
blk_mq_request_bypass_insert() to directly append the request to the
blk-mq hctx->dispatch_list of the underlying queue.

1) This way isn't efficient enough because the hctx spinlock is always
used.

2) With blk_insert_cloned_request(), we completely bypass underlying
queue's elevator and depend on the upper-level dm-rq driver's elevator
to schedule IO.  But dm-rq currently can't get the underlying queue's
dispatch feedback at all.  Without knowing whether a request was issued
or not (e.g. due to underlying queue being busy) the dm-rq elevator will
not be able to provide effective IO merging (as a side-effect of dm-rq
currently blindly destaging a request from its elevator only to requeue
it after a delay, which kills any opportunity for merging).  This
obviously causes very bad sequential IO performance.

Fix this by updating blk_insert_cloned_request() to use
blk_mq_request_direct_issue().  blk_mq_request_direct_issue() allows a
request to be issued directly to the underlying queue and returns the
dispatch feedback (blk_status_t).  If blk_mq_request_direct_issue()
returns BLK_STS_RESOURCE the dm-rq driver will now use DM_MAPIO_REQUEUE
to _not_ destage the request, thereby preserving the opportunity to
merge IO.

With this, request-based DM's blk-mq sequential IO performance is vastly
improved (as much as 3X in mpath/virtio-scsi testing).
Signed-off-by: Ming Lei <ming.lei@redhat.com>
[blk-mq.c changes heavily influenced by Ming Lei's initial solution, but
they were refactored to make them less fragile and easier to read/review]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

17 Jan, 2018 19 commits

dm mpath: return DM_MAPIO_DELAY_REQUEUE if QUEUE_IO or PG_INIT_REQUIRED · 459b5401

Committed by Ming Lei on Jan 11, 2018

Avoid using DM_MAPIO_REQUEUE unless absolutely necessary because it
results in dm-rq.c:dm_mq_queue_rq() returning BLK_STS_RESOURCE to
blk-mq -- doing so should only ever be done if the underlying queue is
out of resources.  So switch to returning DM_MAPIO_DELAY_REQUEUE from
multipath_clone_and_map() if either MPATHF_QUEUE_IO or
MPATHF_PG_INIT_REQUIRED are set.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm mpath: return DM_MAPIO_REQUEUE on blk-mq rq allocation failure · 050af08f

Committed by Ming Lei on Jan 11, 2018

blk-mq will rerun the queue via RESTART or a dispatch wake after a
request completes, so it is not necessary to wait an arbitrary time
before requeuing; we should trust blk-mq to do it.

More importantly, we need to return BLK_STS_RESOURCE to blk-mq so that
dequeuing from the I/O scheduler can be stopped; this results in
improved I/O merging.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm log writes: fix max length used for kstrndup · 4b259fc4

Committed by Ma Shimiao on Dec 12, 2017

If the source string is longer than max, kstrndup will allocate max+1
bytes.  So make sure the result will not exceed max.
Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
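
The off-by-one described here comes from the terminating NUL: a copy of up to max characters needs max+1 bytes. The sketch below only models the allocation size; BUF_SIZE is a hypothetical destination size chosen for illustration, and passing BUF_SIZE - 1 as the limit is one way to keep the result within the buffer, not necessarily the exact change made by the patch.

```python
def kstrndup_alloc_size(src_len, max_len):
    """Bytes needed for a copy of at most max_len characters plus a NUL."""
    return min(src_len, max_len) + 1


BUF_SIZE = 32  # hypothetical destination size, for illustration only

# A 100-character source with max == BUF_SIZE needs one byte too many.
print(kstrndup_alloc_size(100, BUF_SIZE))
# Limiting the copy to BUF_SIZE - 1 characters keeps it within BUF_SIZE.
print(kstrndup_alloc_size(100, BUF_SIZE - 1))
```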

dm: backfill missing calls to mutex_destroy() · d5ffebdd

Committed by Mike Snitzer on Jan 05, 2018

Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm snapshot: use mutex instead of rw_semaphore · ae1093be

Committed by Mikulas Patocka on Nov 23, 2017

The rw_semaphore is acquired for read only in two places, neither is
performance-critical.  So replace it with a mutex -- which is more
efficient.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm flakey: check for null arg_name in parse_features() · 7690e253

Committed by Goldwyn Rodrigues on Dec 03, 2017

One can crash dm-flakey by specifying more feature arguments than the
number of features supplied.  Checking for null in arg_name avoids
this.

dmsetup create flakey-test --table "0 66076080 flakey /dev/sdb9 0 0 180 2 drop_writes"
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
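
In the dmsetup line above, the table declares 2 feature arguments but supplies only one ("drop_writes"). A minimal sketch of the guard, assuming a parser that walks the declared count; parse_features() here is a hypothetical Python stand-in, and the error text is illustrative rather than quoted from the driver.

```python
def parse_features(args):
    """args[0] is the declared feature-argument count; the rest are the arguments."""
    declared = int(args[0])
    supplied = args[1:]
    features = []
    for i in range(declared):
        # Running past the supplied arguments yields None, the analogue
        # of a null arg_name.
        arg_name = supplied[i] if i < len(supplied) else None
        if arg_name is None:
            raise ValueError("insufficient feature arguments")
        features.append(arg_name)
    return features


print(parse_features(["1", "drop_writes"]))
try:
    parse_features(["2", "drop_writes"])
except ValueError as err:
    print("rejected:", err)
```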

dm: move dm_table_destroy() to same header as dm_table_create() · f6e7baad

Committed by Brian Norris on Mar 28, 2017

If anyone is going to use dm_table_create(), they probably should be
able to use dm_table_destroy() too. Move the dm_table_destroy()
definition outside the private header, near dm_table_create().
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm raid: make raid_sets symbol static · 67ac901c

Committed by Wei Yongjun on Jan 02, 2018

Fixes the following sparse warning:

drivers/md/dm-raid.c:33:1: warning:
 symbol 'raid_sets' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm bufio: eliminate unnecessary labels in dm_bufio_client_create() · 0e696d38

Committed by Mike Snitzer on Jan 04, 2018

Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm bufio: check result of register_shrinker() · 46898e9a

Committed by Aliaksei Karaliou on Dec 23, 2017

dm_bufio_client_create() does not check the result of register_shrinker(),
which was recently tagged as __must_check; this was reported by sparse.
Signed-off-by: Aliaksei Karaliou <akaraliou.dev@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm bufio: add missed destroys of client mutex · bde14184

Committed by Aliaksei Karaliou on Dec 23, 2017

The client's mutex needs to be destroyed in dm_bufio_client_destroy() as
well as the dm_bufio_client_create() error path.
Signed-off-by: Aliaksei Karaliou <akaraliou.dev@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm bufio: use REQ_OP_READ and REQ_OP_WRITE · 905be0a1

Committed by Mikulas Patocka on Dec 02, 2017

Use REQ_OP_READ and REQ_OP_WRITE macros instead of READ and WRITE.  They
have the same value, but the block layer uses REQ_OP so bufio should
too.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm: add unstriped target · 18a5bf27

Committed by Scott Bauer on Dec 18, 2017

This device mapper "unstriped" target remaps and unstripes I/O so it
is issued solely on a single drive in a HW RAID0 or dm-striped target.

In a 4 drive HW RAID0 the unstriped target exposes 1/4th of the LBA range
as a virtual drive.  Each I/O to that virtual drive will only be issued
to the 1 drive that was selected of the 4 drives in the HW RAID0.

This unstriped target is most useful for Intel NVMe drives that have
multiple cores but that do not have firmware control to pin separate LBA
ranges to each discrete cpu core.
Signed-off-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm crypt: fix error return code in crypt_ctr() · 3cc2e57c

Committed by Wei Yongjun on Jan 17, 2018

Fix to return error code -ENOMEM from the mempool_create_kmalloc_pool()
error handling case instead of 0, as done elsewhere in this function.

Fixes: ef43aa38 ("dm crypt: add cryptographic data integrity protection (authenticated encryption)")
Cc: stable@vger.kernel.org # 4.12+
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm crypt: wipe kernel key copy after IV initialization · dc94902b

Committed by Ondrej Kozina on Jan 12, 2018

Loading a key via the kernel keyring service erases the internal
key copy immediately after we pass it to the crypto layer. This is
wrong because the IV is initialized later, so the wrong key is used
for the initialization (instead of the real key there's just a zeroed
block).

The bug may cause data corruption if the key is loaded via the kernel
keyring service first and the same crypt device is later reactivated
using exactly the same key in hexbyte representation, or vice versa.
The bug (and fix) affects only ciphers using the following IVs: essiv,
lmk and tcw.

Fixes: c538f6ec ("dm crypt: add ability to use keys from the kernel key retention service")
Cc: stable@vger.kernel.org # 4.10+
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm integrity: don't store cipher request on the stack · 717f4b1c

Committed by Mikulas Patocka on Jan 10, 2018

Some asynchronous cipher implementations may use DMA.  The stack may
be mapped in the vmalloc area that doesn't support DMA.  Therefore,
the cipher request and initialization vector shouldn't be on the
stack.

Fix this by allocating the request and iv with kmalloc.

Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm crypt: fix crash by adding missing check for auth key size · 27c70036

Committed by Milan Broz on Jan 03, 2018

If dm-crypt uses authenticated mode with a separate MAC, there are two
concatenated parts of the key structure - key(s) for encryption and
the authentication key.

Add a missing check for the authentication key length.  If this key
length is smaller than the actually provided key, dm-crypt now properly
fails instead of crashing.

Fixes: ef43aa38 ("dm crypt: add cryptographic data integrity protection (authenticated encryption)")
Cc: stable@vger.kernel.org # 4.12+
Reported-by: Salah Coronya <salahx@yahoo.com>
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm btree: fix serious bug in btree_split_beneath() · bc68d0a4

Committed by Joe Thornber on Dec 20, 2017

When inserting a new key/value pair into a btree we walk down the spine of
btree nodes performing the following 2 operations:

  i) ensuring there is space for a new entry
  ii) adjusting the first key entry if the new key is lower than any in the node.

If the _root_ node is full, the function btree_split_beneath() allocates 2 new
nodes, and redistributes the root node's entries between them.  The root node is
left with 2 entries corresponding to the 2 new nodes.

btree_split_beneath() then adjusts the spine to point to one of the two new
children.  This means the first key is never adjusted if the new key was lower,
i.e. operation (ii) gets missed out.  This can result in the new key being
'lost' for a period, until another low valued key is inserted that will uncover
it.

This is a serious bug, and quite hard to trigger in normal use.  A
reproducing test case ("thin create devices-in-reverse-order") is
available as part of the thin-provisioning-tools project:
  https://github.com/jthornber/thin-provisioning-tools/blob/master/functional-tests/device-mapper/dm-tests.scm#L593

Fix the issue by changing btree_split_beneath() so it no longer adjusts
the spine.  Instead it unlocks both the new nodes, and lets the main
loop in btree_insert_raw() relock the appropriate one and make any
necessary adjustments.

Cc: stable@vger.kernel.org
Reported-by: Monty Pavel <monty_pavel@sina.com>
Signed-off-by: Joe Thornber <thornber@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm thin metadata: THIN_MAX_CONCURRENT_LOCKS should be 6 · 490ae017

Committed by Dennis Yang on Dec 12, 2017

For btree removal, there is a corner case in which a single thread
could take 6 locks, which is more than THIN_MAX_CONCURRENT_LOCKS (5)
and leads to deadlock.

A btree removal might eventually call
rebalance_children()->rebalance3() to rebalance entries of three
neighbor child nodes when shadow_spine has already acquired two
write locks. In rebalance3(), it tries to shadow and acquire the
write locks of all three child nodes. However, shadowing a child
node requires acquiring a read lock of the original child node and
a write lock of the new block. Although the read lock will be
released after block shadowing, shadowing the third child node
in rebalance3() could still take the sixth lock.
(2 write locks for shadow_spine +
 2 write locks for the first two child nodes' shadows +
 1 write lock for the last child node's shadow +
 1 read lock for the last child node)

Cc: stable@vger.kernel.org
Signed-off-by: Dennis Yang <dennisyang@qnap.com>
Acked-by: Joe Thornber <thornber@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
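
The lock count in this message can be tallied explicitly. This is just the arithmetic from the breakdown above written out; the dictionary keys are descriptive labels, not kernel identifiers.

```python
OLD_LIMIT = 5  # previous THIN_MAX_CONCURRENT_LOCKS, per the message
NEW_LIMIT = 6  # value named in the commit title

# Locks held at the worst point of rebalance3(), per the breakdown above.
locks_held = {
    "write locks held by shadow_spine": 2,
    "write locks on the first two children's shadows": 2,
    "write lock on the last child's shadow": 1,
    "read lock on the last child's original block": 1,
}

peak = sum(locks_held.values())
print(peak, "locks at peak; old limit", OLD_LIMIT, "new limit", NEW_LIMIT)
```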

16 Jan, 2018 1 commit

raid5-ppl: PPL support for disks with write-back cache enabled · 1532d9e8

Committed by Tomasz Majchrzak on Dec 27, 2017

In order to provide data consistency with PPL for disks with write-back
cache enabled, all data has to be flushed to disks before the next PPL
entry. The disks to be flushed are marked in a bitmap. It's modified
under a mutex and it's only read after the PPL io unit is submitted.

A limitation of 64 disks in the array has been introduced to keep data
structures and implementation simple. RAID5 arrays with so many disks are
unlikely due to the high risk of multiple disk failures. Such a restriction
should not be a real-life limitation.

With write-back cache disabled, the next PPL entry is submitted when the data
write for the current one completes. Data flush defers the next log submission,
so trigger it when no stripes are found for handling.

As PPL assures all data is flushed to disk at request completion, just
acknowledge the flush request when PPL is enabled.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
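
A bitmap capped at 64 disks fits in a single 64-bit word, which is one plausible reading of why the limit keeps the data structures simple. The sketch below is an assumption-laden model: FlushBitmap and its methods are hypothetical names, and the mutex mentioned in the message is left out.

```python
MAX_DISKS = 64  # limit stated in the commit message


class FlushBitmap:
    """One bit per disk that still needs a cache flush."""

    def __init__(self):
        self.bits = 0

    def mark(self, disk):
        if not 0 <= disk < MAX_DISKS:
            raise ValueError("disk index out of range")
        self.bits |= 1 << disk

    def drain(self):
        # Return the marked disks in index order and clear the bitmap.
        marked = [d for d in range(MAX_DISKS) if self.bits & (1 << d)]
        self.bits = 0
        return marked


bitmap = FlushBitmap()
bitmap.mark(0)
bitmap.mark(63)
print(bitmap.drain())
```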

15 Jan, 2018 1 commit

dm: fix incomplete request_queue initialization · c100ec49

Committed by Mike Snitzer on Jan 08, 2018

DM is no longer prone to having its request_queue be improperly
initialized.

Summary of changes:

- defer DM's blk_register_queue() from add_disk()-time until
  dm_setup_md_queue() by using add_disk_no_queue_reg() in alloc_dev().

- dm_setup_md_queue() is updated to fully initialize DM's request_queue
  (_after_ all table loads have occurred and the request_queue's type,
  features and limits are known).

A very welcome side-effect of these changes is DM no longer needs to:
1) backfill the "mq" sysfs entry (because historically DM didn't
initialize the request_queue to use blk-mq until _after_
blk_register_queue() was called via add_disk()).
2) call elv_register_queue() to get .request_fn request-based DM
device's "iosched" exposed in sysfs.

In addition, blk-mq debugfs support is now made available because
request-based DM's blk-mq request_queue is now properly initialized
before dm_setup_md_queue() calls blk_register_queue().

These changes also stave off the need to introduce new DM-specific
workarounds in block core, e.g. this proposal:
https://patchwork.kernel.org/patch/10067961/

In the end DM devices should be less unicorn in nature (relative to
initialization and availability of block core infrastructure provided by
the request_queue).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

11 Jan, 2018 1 commit

dm mpath: Use blk_path_error · a1275677

Committed by Keith Busch on Jan 09, 2018

Use common code to determine whether an error should be retried on an
alternate path.
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

10 Jan, 2018 1 commit

bcache: closures: move control bits one bit right · 3609c471

Committed by Michael Lyle on Jan 09, 2018

Otherwise, architectures that do negated adds of atomics (e.g. s390)
to do atomic_sub fail in closure_set_stopped.
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

09 Jan, 2018 10 commits

bcache: fix writeback target calc on large devices · 616486ab

Committed by Michael Lyle on Jan 08, 2018

Bcache needs to scale the dirty data in the cache over the multiple
backing disks in order to calculate writeback rates for each.
The previous code did this by multiplying the target number of dirty
sectors by the backing device size, and expected it to fit into a
uint64_t; this blows up on relatively small backing devices.

The new approach figures out the bdev's share in 16384ths of the overall
cached data.  This is chosen to cope well when bdevs drastically vary in
size and to ensure that bcache can cross the petabyte boundary for each
backing device.

This has been improved based on Tang Junhui's feedback to ensure that
every device gets a share of dirty data, no matter how small it is
compared to the total backing pool.

The existing mechanism is very limited; this is purely a bug fix to
remove limits on volume size.  However, a further change is still needed
to make this "fair" over many volumes where some are idle.
Reported-by: Jack Douglas <jack@douglastechnology.co.uk>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
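
The share-based calculation can be sketched with plain integers. This is a model under stated assumptions rather than the bcache code: the function and constant names are hypothetical, 64-bit overflow is emulated with a mask, and the minimum share of 1 reflects the statement that every device gets a share of dirty data.

```python
WRITEBACK_SHARE_SHIFT = 14  # 2**14 == 16384 shares, per the message
U64_MASK = (1 << 64) - 1


def old_product_overflows(dirty_target, bdev_sectors):
    """True if dirty_target * bdev_sectors no longer fits in 64 bits."""
    product = dirty_target * bdev_sectors
    return product != (product & U64_MASK)


def bdev_dirty_target(dirty_target, bdev_sectors, total_sectors):
    # The device's share of the backing pool, in 16384ths, never below 1.
    share = (bdev_sectors << WRITEBACK_SHARE_SHIFT) // total_sectors
    share = max(share, 1)
    return (dirty_target * share) >> WRITEBACK_SHARE_SHIFT


# Example values are made up: a 2**31-sector dirty target and a
# 2**34-sector backing device out of 2**35 sectors in total.
print(old_product_overflows(2**31, 2**34))
print(bdev_dirty_target(2**31, 2**34, 2**35))
print(bdev_dirty_target(2**31, 1, 2**35))
```

In this example the direct product needs 65 bits, while the share-based form gives the half-sized device half of the target and still gives a one-sector device a non-zero target.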

bcache: fix misleading error message in bch_count_io_errors() · 5138ac67

Committed by Coly Li on Jan 08, 2018

Bcache only does recoverable I/O for read operations by calling
cached_dev_read_error(). For write operations there is no I/O recovery for
failed requests.

But in bch_count_io_errors(), for both read and write I/Os, until the error
counter reaches the io error limit, pr_err() always prints "IO error on %,
recovering". For write requests this information is misleading, because
there is no I/O recovery at all.

This patch adds a parameter 'is_read' to bch_count_io_errors(), and only
prints "recovering" via pr_err() when the bio direction is READ.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bcache: reduce cache_set devices iteration by devices_max_used · 2831231d

Committed by Coly Li on Jan 08, 2018

The devices member of struct cache_set is used to reference all bcache
devices attached to this cache set. Treated as an array of pointers, the
size of devices[] is given by the nr_uuids member of struct cache_set.

nr_uuids is calculated in drivers/md/bcache/super.c:bch_cache_set_alloc() as
	bucket_bytes(c) / sizeof(struct uuid_entry)
Bucket size is determined by the user space tool "make-bcache"; by default it
is 1024 sectors (defined in bcache-tools/make-bcache.c:main()). So the default
nr_uuids value is 4096 from the above calculation.

Every time bcache code iterates over the bcache devices of a cache set, all
4096 pointers are checked even if only 1 bcache device is attached to the
cache set; that's a waste of time and unnecessary.

This patch adds a member devices_max_used to struct cache_set. Its value
is 1 + the maximum used index of devices[] in a cache set. When iterating
over all valid bcache devices of a cache set, using c->devices_max_used in
the for-loop may avoid a lot of useless checking.

Personally, my motivation for this patch is not performance; I use it
in bcache debugging, where it helps me narrow down the scope when checking
the valid bcache devices of a cache set.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
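
The numbers in this message work out as follows. The sketch assumes 512-byte sectors and a 128-byte struct uuid_entry, the size implied by 1024 sectors yielding 4096 entries; the Python helpers are hypothetical stand-ins for the kernel fields of the same name.

```python
SECTOR_BYTES = 512
UUID_ENTRY_BYTES = 128  # implied by 1024 sectors -> 4096 entries


def nr_uuids(bucket_sectors):
    """bucket_bytes / sizeof(struct uuid_entry), as described above."""
    return bucket_sectors * SECTOR_BYTES // UUID_ENTRY_BYTES


def devices_max_used(devices):
    """1 + the highest used index of devices[], or 0 if none is used."""
    used = [i for i, dev in enumerate(devices) if dev is not None]
    return used[-1] + 1 if used else 0


devices = [None] * nr_uuids(1024)
devices[0] = "bcache0"

# Bounding the loop by devices_max_used visits 1 slot instead of 4096.
print(len(devices), devices_max_used(devices))
```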

bcache: fix unmatched generic_end_io_acct() & generic_start_io_acct() · b40503ea

Committed by Zhai Zhaoxuan on Jan 08, 2018

The functions cached_dev_make_request() and flash_dev_make_request() call
generic_start_io_acct() with (struct bcache_device)->disk when they start a
closure. Then the function bio_complete() calls generic_end_io_acct() with
(struct search)->orig_bio->bi_disk when the closure is done.
Since `bi_disk` is not the bcache device, generic_end_io_acct() is
called with the wrong device queue.

This causes the "inflight" counter (in struct hd_struct) to keep increasing
without decreasing.

This patch fixes the problem by calling generic_end_io_acct() with
(struct bcache_device)->disk.
Signed-off-by: Zhai Zhaoxuan <kxuanobj@gmail.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bcache: mark closure_sync() __sched · ce439bf7

Committed by Kent Overstreet on Jan 08, 2018

[edit by mlyle: include sched/debug.h to get __sched]
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bcache: Fix, improve efficiency of closure_sync() · e4bf7919

Committed by Kent Overstreet on Jan 08, 2018

Eliminates cases where sync can race and fail to complete / get stuck.
Removes many status flags and simplifies entering-and-exiting closure
sleeping behaviors.

[mlyle: fixed conflicts due to changed return behavior in mainline.
extended commit comment, and squashed down two commits that were mostly
contradictory to get to this state.  Changed __set_current_state to
set_current_state per Jens review comment]
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bcache: allow quick writeback when backing idle · b1092c9a

Committed by Michael Lyle on Jan 08, 2018

If the control system would wait for at least half a second, and there
have been no reqs hitting the backing disk for a while: use an alternate
mode where we have at most one contiguous set of writebacks in flight at
a time. (But don't otherwise delay).  If front-end IO appears, it will
still be quick, as it will only have to contend with one real operation
in flight.  But otherwise, we'll be sending data to the backing disk as
quickly as it can accept it (with one op at a time).
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bcache: writeback: properly order backing device IO · 6e6ccc67

Committed by Michael Lyle on Jan 08, 2018

Writeback keys are presently iterated and dispatched for writeback in
order of the logical block address on the backing device.  Multiple keys
may be read from the cache device in parallel and then written back
(especially when there is contiguous I/O).

However, there was no guarantee with the existing code that the writes
would be issued in LBA order, as the reads from the cache device are
often re-ordered.  In turn, when writing back quickly, the backing disk
often has to seek backwards; this slows writeback and increases
utilization.

This patch introduces an ordering mechanism that guarantees that the
original order of issue is maintained for the write portion of the I/O.
Performance for writeback is significantly improved when there are
multiple contiguous keys or high writeback rates.
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
Tested-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
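
One way to picture such an ordering mechanism is with sequence numbers: each key gets a number when its read is dispatched, and a write is only issued once every earlier-numbered write has been issued. The sketch below is an assumption about the general technique, not the kernel implementation; OrderedWriter and its fields are hypothetical.

```python
import heapq


class OrderedWriter:
    """Issue writes in dispatch order even if reads complete out of order."""

    def __init__(self):
        self.next_seq = 0   # sequence number of the next write to issue
        self.pending = []   # min-heap of (seq, key) whose reads have finished
        self.issued = []    # keys in the order their writes were issued

    def read_done(self, seq, key):
        heapq.heappush(self.pending, (seq, key))
        # Drain every write that is now next in line.
        while self.pending and self.pending[0][0] == self.next_seq:
            _, ready = heapq.heappop(self.pending)
            self.issued.append(ready)
            self.next_seq += 1


writer = OrderedWriter()
# Reads complete in the order 2, 0, 1; writes must still go out as 0, 1, 2.
for seq, key in [(2, "lba300"), (0, "lba100"), (1, "lba200")]:
    writer.read_done(seq, key)
print(writer.issued)
```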

bcache: fix wrong return value in bch_debug_init() · 539d39eb

Committed by Tang Junhui on Jan 08, 2018

In bch_debug_init(), ret is always 0, so the return value is useless.
Change it to return 0 on success after calling debugfs_create_dir(),
and a non-zero value otherwise.
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bcache: segregate flash only volume write streams · 4eca1cb2

Committed by Tang Junhui on Jan 08, 2018

In a scenario where there are some flash only volumes
and some cached devices, when many tasks issue requests to these devices in
writeback mode, the write IOs may fall into the same bucket as below:
| cached data | flash data | cached data | cached data| flash data|
Then, after writeback of these cached devices, the bucket would
look like this:
| free | flash data | free | free | flash data |

So there is a lot of free space in this bucket, but since data of flash
only volumes still exists, this bucket cannot be reclaimed,
which wastes bucket space.

In this patch, we segregate flash only volume write streams from
cached devices, so data from flash only volumes and cached devices
can be stored in different buckets.

Compared to the v1 patch, this patch does not add an additional open bucket
list, and it makes a best effort to segregate flash only volume write streams
from cached devices; sectors of flash only volumes may still be mixed
with dirty sectors of cached devices, but the number is very small.

[mlyle: fixed commit log formatting, permissions, line endings]
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
