Commits · d1d9220cbaeecce910f3ecfeb71cc897a678eb68 · openeuler / raspberrypi-kernel

13 Nov, 2014 1 commit

dm cache: emit a warning message if there are a lot of cache blocks · d1d9220c

Committed by Joe Thornber on Nov 11, 2014

Loading and saving millions of block mappings takes time.  We may as
well explain what's going on, and encourage people to use a larger
cache block size.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

11 Nov, 2014 28 commits

dm cache: improve discard support · 7ae34e77

Committed by Joe Thornber on Nov 06, 2014

Safely allow the discard blocksize to be larger than the cache blocksize
by using the bio prison's range locking support.  This also improves
discard performance considerably because larger discards are issued to the
dm-cache device.  The discard blocksize was always intended to be
greater than the cache blocksize.  But until now it wasn't implemented
safely.

Also, by safely restoring the ability to have discard blocksize larger
than cache blocksize we're able to significantly reduce the memory used
for the cache's discard bitset.  Before, with a small discard blocksize,
the discard bitset could get quite large because its size is a function
of the discard blocksize and the origin device's size.  For example,
previously, using a 32KB cache blocksize with a 40TB origin resulted in
1280MB of incore memory use for the discard bitset!  Now, the discard
blocksize is scaled up accordingly to ensure the discard bitset is
capped at 2**14 bits, or 16KB.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
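
The bitset arithmetic in the message above can be checked with a short sketch. It assumes one bit per discard block and a scaling rule that doubles the discard block size until the block count fits under the 2**14 cap; the function names are hypothetical and the doubling rule is an assumption, not taken from the patch.

```python
KIB, MIB, TIB = 1 << 10, 1 << 20, 1 << 40
MAX_DISCARD_BLOCKS = 1 << 14  # the 2**14 cap quoted in the commit message


def discard_bits(origin_bytes, discard_block_bytes):
    """One bit per discard block, rounding up for a partial last block."""
    return -(-origin_bytes // discard_block_bytes)


def scaled_discard_block_size(cache_block_bytes, origin_bytes):
    """Assumed scaling rule: double the block size until the bitset fits."""
    size = cache_block_bytes
    while discard_bits(origin_bytes, size) > MAX_DISCARD_BLOCKS:
        size *= 2
    return size


origin = 40 * TIB
before_bits = discard_bits(origin, 32 * KIB)        # 32KB discard blocks
after_size = scaled_discard_block_size(32 * KIB, origin)
after_bits = discard_bits(origin, after_size)
```

With these inputs `before_bits` is 1,342,177,280 (1280 Mibit, or 160 MiB at one bit per block), and the scaled discard block size comes out at 4 GiB, giving 10,240 bits. The message's "1280MB" and "16KB" figures therefore seem to line up with bit counts rather than byte counts; that reading is an interpretation, not something the commit states.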

dm cache: revert "prevent corruption caused by discard_block_size > cache_block_size" · 08b18451

Committed by Joe Thornber on Nov 06, 2014

This reverts commit d132cc6d because we
actually do want to allow the discard blocksize to be larger than the
cache blocksize.  Further dm-cache discard changes will make this
possible.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm cache: revert "remove remainder of distinct discard block size" · 1bad9bc4

Committed by Joe Thornber on Nov 07, 2014

This reverts commit 64ab346a because we
actually do want to allow the discard blocksize to be larger than the
cache blocksize.  Further dm-cache discard changes will make this
possible.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm bio prison: introduce support for locking ranges of blocks · 5f274d88

Committed by Joe Thornber on Sep 17, 2014

Ranges will be placed in the same cell if they overlap.

Range locking is a prerequisite for more efficient multi-block discard
support in both the cache and thin-provisioning targets.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
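
The "same cell if they overlap" rule can be illustrated with a small sketch. It treats each key as a half-open block range and uses a linear list in place of the prison's real lookup structure; `CellKey`, `cmp_keys`, and `Prison` are illustrative names, and any other key fields the kernel compares are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellKey:
    block_begin: int  # first block in the range (inclusive)
    block_end: int    # one past the last block (exclusive)


def cmp_keys(lhs, rhs):
    """Return 0 when the two ranges overlap, else order them by position."""
    if lhs.block_end <= rhs.block_begin:
        return -1
    if lhs.block_begin >= rhs.block_end:
        return 1
    return 0


class Prison:
    def __init__(self):
        self.cells = []  # list of (key, holders) pairs

    def detain(self, key, holder):
        """Join an overlapping cell if one exists; otherwise create a cell.

        Returns True when a new cell was created for this key.
        """
        for cell_key, holders in self.cells:
            if cmp_keys(key, cell_key) == 0:
                holders.append(holder)
                return False
        self.cells.append((key, [holder]))
        return True


prison = Prison()
created_discard = prison.detain(CellKey(0, 8), "discard of blocks 0-7")
created_write = prison.detain(CellKey(4, 5), "write to block 4")
created_read = prison.detain(CellKey(8, 9), "read of block 8")
```

Here the write to block 4 lands in the discard's cell because the ranges overlap, while the read of block 8 gets its own cell.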

dm cache policy mq: simplify ability to promote sequential IO to the cache · f1afb36a

Committed by Mike Snitzer on Oct 30, 2014

Before, if the user wanted sequential IO to be promoted to the cache
they'd have to set sequential_threshold to some nebulous large value.

Now, the user may easily disable sequential IO detection (and sequential
IO's implicit bypass of the cache) by setting sequential_threshold to 0.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm cache policy mq: tweak algorithm that decides when to promote a block · b155aa0e

Committed by Joe Thornber on Oct 22, 2014

Rather than maintaining a separate promote_threshold variable that we
periodically update we now use the hit count of the oldest clean
block.  Also add a fudge factor to discourage demoting dirty blocks.

With some tests this has a sizeable difference, because the old code
was too eager to demote blocks.  For example, device-mapper-test-suite's
git_extract_cache_quick test goes from taking 190 seconds, to 142
(linear on spindle takes 250).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm: do not call dm_sync_table() when creating new devices · 41abc4e1

Committed by Hannes Reinecke on Nov 05, 2014

When creating new devices dm_sync_table() calls
synchronize_rcu_expedited(), causing _all_ pending RCU pointers to be
flushed. This causes a latency overhead that is especially noticeable
when creating lots of devices.

And all of this is pointless as there are no old maps to be
disconnected, and hence no stale pointers which would need to be
cleared up.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm: sparse: Annotate field with __rcu for checking · 6fa99520

Committed by Pranith Kumar on Oct 28, 2014

Annotate the map field with __rcu since this is an RCU pointer which is checked
by sparse.
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm: Use rcu_dereference() for accessing rcu pointer · 33423974

Committed by Pranith Kumar on Oct 28, 2014

The map field in 'struct mapped_device' is an RCU pointer. Use rcu_dereference()
while accessing it.
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm thin: refactor requeue_io to eliminate spinlock bouncing · 42d6a8ce

Committed by Mike Snitzer on Oct 19, 2014

Also refactor some other bio_list erroring helpers.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm thin: optimize retry_bios_on_resume · 9d094eeb

Committed by Mike Snitzer on Oct 19, 2014

Eliminate redundant should_error_unserviceable_bio check and error
loop.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm thin: sort the deferred cells · ac4c3f34

Committed by Joe Thornber on Oct 10, 2014

Sort the cells in logical block order before processing each cell in
process_thin_deferred_cells().  This significantly improves the ondisk
layout on rotational storage, thereby improving read performance.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm thin: direct dispatch when breaking sharing · 23ca2bb6

Committed by Joe Thornber on Oct 15, 2014

This use of direct submission in process_shared_bio() reduces latency
for submitting bios in the shared cell by avoiding adding those bios to
the deferred list and waiting for the next iteration of the worker.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm thin: remap the bios in a cell immediately · 2d759a46

Committed by Joe Thornber on Oct 10, 2014

This use of direct submission in process_prepared_mapping() reduces
latency for submitting bios in a cell by avoiding adding those bios to
the deferred list and waiting for the next iteration of the worker.

But this direct submission exposes the potential for a race between
releasing a cell and incrementing deferred set.  Fix this by introducing
dm_cell_visit_release() and refactoring inc_remap_and_issue_cell()
accordingly.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm thin: defer whole cells rather than individual bios · a374bb21

Committed by Joe Thornber on Oct 10, 2014

This avoids dropping the cell, so increases the probability that other
bios will collect within the cell, rather than being passed individually
to the worker.

Also add required process_cell and process_discard_cell error handling
wrappers and set associated pool-mode function pointers accordingly.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm thin: factor out remap_and_issue_overwrite · 452d7a62

Committed by Mike Snitzer on Oct 09, 2014

Purely cleanup of duplicated code, no functional change.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm thin: performance improvement to discard processing · 7a7e97ca

Committed by Joe Thornber on Sep 12, 2014

When processing a discard bio, if the block is already quiesced do the
discard immediately rather than adding the mapping to a list for the
next iteration of the worker thread.

Discarding a fully provisioned 100G thin volume with 64k block size goes
from 860s to 95s with this change.

Clearly there's something wrong with the worker architecture, more
investigation needed.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm thin: implement thin_merge · 36f12aeb

Committed by Mike Snitzer on Oct 09, 2014

Introduce thin_merge so that any additional constraints from the data
volume may be taken into account when determining the maximum number of
sectors that can be issued relative to the specified logical offset.

This is particularly important if/when the data volume is layered on top
of a more sophisticated device (e.g. dm-raid or some other DM target).
Reviewed-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm: improve documentation and code clarity in dm_merge_bvec · 148e51ba

Committed by Mike Snitzer on Oct 09, 2014

These code changes do not introduce a functional change.

But bio_add_page() will never attempt to build up a bio larger than
queue_max_sectors(). Similarly, bio_get_nr_vecs() is also bound by
queue_max_sectors(). Therefore, there is no point in allowing
dm_merge_bvec() to answer "how many sectors can a bio have at this
offset?" with anything larger than queue_max_sectors(). Using
queue_max_sectors() rather than BIO_MAX_SECTORS serves to more
accurately convey the limits that are being imposed.

Also, use unlikely() to clarify the fact that the defensive code in
dm_merge_bvec() relative to max_size going negative shouldn't ever
happen -- if it does happen there is a bug in the block layer for
requesting larger than dm_merge_bvec()'s initial response for a given
offset. Also, update a comment in dm_merge_bvec() relative to
max_hw_sectors_kb. And fix empty newline whitespace.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm thin: adjust max_sectors_kb based on thinp blocksize · 604ea906

Committed by Mike Snitzer on Oct 09, 2014

Allows for filesystems to submit bios that are a factor of the thinp
blocksize, improving dm-thinp efficiency (particularly when the data
volume is RAID).

Also set io_min to max_sectors_kb if it is a factor of the thinp
blocksize.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm thin: throttle incoming IO · 7d327fe0

Committed by Joe Thornber on Oct 06, 2014

Throttle IO based on the time it's taking the worker to do one loop.
There were reports of hung task timeouts occurring and it was observed
that the excessively long avgqu-sz (as reported by iostat) was
contributing to these hung tasks.

Throttling definitely helps dm-thinp perform better under heavy IO load
(without being detrimental by being overzealous).  It reduces avgqu-sz
drastically, e.g.: from 60K to ~6K, and even as low as 150 once metadata
is cached by bufio, when dirty_ratio=5, dirty_background_ratio=2.  And
avgqu-sz stays at or below 30K even with dirty_ratio=20,
dirty_background_ratio=10.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm thin: prefetch missing metadata pages · 8a01a6af

Committed by Joe Thornber on Oct 06, 2014

Prefetch metadata at the start of the worker thread and then again every
128th bio processed from the deferred list.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
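
The prefetch cadence described above can be sketched as a simple loop. The callbacks and the `process_deferred` name are hypothetical stand-ins for the worker's real helpers; only the "once at the start, then every 128th bio" rule comes from the message.

```python
PREFETCH_INTERVAL = 128  # "every 128th bio", per the commit message


def process_deferred(bios, issue_prefetches, process_bio):
    """Prefetch once up front, then again after every 128th bio."""
    issue_prefetches()
    for count, bio in enumerate(bios, start=1):
        process_bio(bio)
        if count % PREFETCH_INTERVAL == 0:
            issue_prefetches()


calls = {"prefetch": 0, "bios": 0}


def _count_prefetch():
    calls["prefetch"] += 1


def _count_bio(bio):
    calls["bios"] += 1


# 300 deferred bios: one prefetch at the start, then after bios 128 and 256.
process_deferred(range(300), _count_prefetch, _count_bio)
```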

dm transaction manager: add support for prefetching blocks of metadata · 4646015d

Committed by Joe Thornber on Oct 06, 2014

Introduce the dm_tm_issue_prefetches interface.  If you're using a
non-blocking clone the tm will build up a list of requested blocks that
weren't in core.  dm_tm_issue_prefetches will request those blocks to be
prefetched.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm thin metadata: change dm_thin_find_block to allow blocking, but not issuing, IO · e5cfc69a

Committed by Joe Thornber on Oct 06, 2014

This change is a prerequisite for allowing metadata to be prefetched.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm bio prison: switch to using a red black tree · a195db2d

Committed by Joe Thornber on Oct 06, 2014

Previously it was using a fixed sized hash table.  There are times
when very many concurrent cells are held (such as when processing a very
large discard).  When this happens the hash table performance becomes
very poor.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm bufio: evict buffers that are past the max age but retain some buffers · 33096a78

Committed by Joe Thornber on Oct 09, 2014

These changes help keep metadata backed by dm-bufio in-core longer which
fixes reports of metadata churn in the face of heavy random IO workloads.

Before, bufio evicted all buffers older than DM_BUFIO_DEFAULT_AGE_SECS.
Having a device (e.g. dm-thinp or dm-cache) lose all metadata just
because associated buffers had been idle for some time is unfriendly.

Now, the user may configure the number of bytes that bufio retains
using the 'retain_bytes' module parameter. The default is 256K.

Also, the DM_BUFIO_WORK_TIMER_SECS and DM_BUFIO_DEFAULT_AGE_SECS
defaults were quite low so increase them (to 30 and 300 respectively).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
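
A rough sketch of the "evict what is too old, but keep a floor" policy follows. It assumes the retained amount is converted to a buffer count by dividing by the block size, and that the LRU is ordered most-recent first; the 4096-byte block size, the `Buffer` tuple, and `evict_old_buffers` are illustrative assumptions rather than bufio's actual structures.

```python
from collections import namedtuple

Buffer = namedtuple("Buffer", "block last_accessed")


def evict_old_buffers(lru, now, max_age, block_size, retain_bytes):
    """Evict from the cold end of the LRU, keeping a retained floor.

    lru is ordered most-recently-used first. Returns (kept, evicted).
    """
    retain_target = retain_bytes // block_size  # buffers to always keep
    kept = list(lru)
    evicted = []
    while len(kept) > retain_target and now - kept[-1].last_accessed > max_age:
        evicted.append(kept.pop())
    return kept, evicted


# 100 idle buffers of 4096 bytes, max age 300s, retain_bytes 256K.
idle = [Buffer(block=i, last_accessed=0) for i in range(100)]
kept, evicted = evict_old_buffers(idle, now=1000, max_age=300,
                                  block_size=4096, retain_bytes=256 * 1024)

# Recently used buffers are never evicted, whatever the retain setting.
fresh = [Buffer(block=i, last_accessed=990) for i in range(100)]
kept_fresh, evicted_fresh = evict_old_buffers(fresh, now=1000, max_age=300,
                                              block_size=4096, retain_bytes=0)
```

Under these assumptions the default of 256K keeps 64 buffers in core even when every buffer has aged out.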

dm bufio: switch from a huge hash table to an rbtree · 4e420c45

Committed by Joe Thornber on Oct 06, 2014

Converting over to using an rbtree eliminates a fixed 8MB allocation
from vmalloc space for the hash table.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm btree: fix a recursion depth bug in btree walking code · 9b460d36

Committed by Joe Thornber on Nov 10, 2014

The walk code was using a 'ro_spine' to hold its locked btree nodes.
But this data structure is designed for the rolling lock scheme, and
as such automatically unlocks blocks that are two steps up the call
chain.  This is not suitable for the simple recursive walk algorithm,
which retraces its steps.

This code is only used by the persistent array code, which in turn is
only used by dm-cache.  In order to trigger it you need to have a
mapping tree that is more than 2 levels deep; which equates to 8-16
million cache blocks.  For instance a 4T ssd with a very small block
size of 32k only just triggers this bug.

The fix just places the locked blocks on the stack, and stops using
the ro_spine altogether.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org

05 Nov, 2014 1 commit

dm thin: grab a virtual cell before looking up the mapping · c822ed96

Committed by Joe Thornber on Oct 10, 2014

Avoids normal IO racing with discard.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org

30 Oct, 2014 1 commit

dm raid: fix inaccessible superblocks causing oops in configure_discard_support · d20c4b08

Committed by Heinz Mauelshagen on Oct 29, 2014

Commit 48cf06bc ("dm raid: add discard support for RAID levels 4, 5
and 6") did not properly handle missing metadata device(s). A failing
read of the superblock causes the metadata and data devices to be
removed from the dev array in struct raid_set, setting references to
both devices to NULL. configure_discard_support() nonetheless tries to
access the data dev unconditionally causing an oops.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

21 Oct, 2014 1 commit

dm raid: ensure superblock's size matches device's logical block size · 40d43c4b

Committed by Heinz Mauelshagen on Oct 17, 2014

The dm-raid superblock (struct dm_raid_superblock) is padded to 512
bytes and that size is being used to read it in from the metadata
device into one preallocated page.

Reading or writing this on a 512-byte sector device works fine but on
a 4096-byte sector device this fails.

Set the dm-raid superblock's size to the logical block size of the
metadata device, because IO at that size is guaranteed to work.  Also
add a size check to avoid silent partial metadata loss in case the
superblock should ever grow past the logical block size or PAGE_SIZE.

[includes pointer math fix from Dan Carpenter]
Reported-by: "Liuhua Wang" <lwang@suse.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
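
The size rule in this message can be sketched as a single check. The 512-byte structure size comes from the message; the 4096-byte `PAGE_SIZE` and the `superblock_io_size` helper are assumptions made for illustration.

```python
PAGE_SIZE = 4096              # assumed page size for this sketch
SUPERBLOCK_STRUCT_SIZE = 512  # the padded structure size from the message


def superblock_io_size(logical_block_size,
                       struct_size=SUPERBLOCK_STRUCT_SIZE,
                       page_size=PAGE_SIZE):
    """Use the logical block size for superblock IO, with a sanity check.

    Rejects the case where the structure no longer fits in one logical
    block, or where a logical block no longer fits in the single page.
    """
    if logical_block_size < struct_size or logical_block_size > page_size:
        raise ValueError("superblock size of a logical block is not valid")
    return logical_block_size


size_512 = superblock_io_size(512)    # classic 512-byte sector device
size_4k = superblock_io_size(4096)    # 4096-byte sector device

try:
    superblock_io_size(8192)          # larger than the preallocated page
    rejected_oversize = False
except ValueError:
    rejected_oversize = True
```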

17 Oct, 2014 1 commit

dm bufio: change __GFP_IO to __GFP_FS in shrinker callbacks · 9d28eb12

Committed by Mikulas Patocka on Oct 16, 2014

The shrinker uses gfp flags to indicate what kind of operation the
driver can wait for. If the __GFP_IO flag is present, the driver can wait for
block I/O operations; if the __GFP_FS flag is present, the driver can wait on
operations involving the filesystem.

dm-bufio tested for __GFP_IO. However, dm-bufio can run on a loop block
device that makes calls into the filesystem. If __GFP_IO is present and
__GFP_FS isn't, dm-bufio could still block on filesystem operations if it
runs on a loop block device.

The change from __GFP_IO to __GFP_FS supposedly fixes one observed (though
unreproducible) deadlock involving dm-bufio and loop device.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org

11 Oct, 2014 1 commit

dm stripe: fix potential for leak in stripe_ctr error path · a3f2af25

Committed by Pavitra Kumar on Oct 10, 2014

Fix a potential struct stripe_c leak that would occur if the
chunk_size exceeded the maximum allowed by dm_set_target_max_io_len
(UINT_MAX).  However, in practice there is no possibility of this
occurring given that chunk_size is of type uint32_t.  But it is good to
fix this to future-proof in case dm_set_target_max_io_len's
implementation were to change.
Signed-off-by: Pavitra Kumar <pavitrak@nvidia.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

06 Oct, 2014 6 commits

dm log userspace: fix memory leak in dm_ulog_tfr_init failure path · 56ec16cb

Committed by Alexey Khoroshilov on Oct 01, 2014

If cn_add_callback() fails in dm_ulog_tfr_init(), it does not
deallocate prealloced memory but calls cn_del_callback().

Found by Linux Driver Verification project (linuxtesting.org).
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org

dm bufio: when done scanning return from __scan immediately · 0e825862

Committed by Mikulas Patocka on Oct 01, 2014

When __scan frees the required number of buffer entries that the
shrinker requested (nr_to_scan becomes zero) it must return.  Before
this fix the __scan code exited only the inner loop and continued in the
outer loop -- which could result in reduced performance due to extra
buffers being freed (e.g. unnecessarily evicted thinp metadata needing
to be synchronously re-read into bufio's cache).

Also, move dm_bufio_cond_resched to __scan's inner loop, so that
iterating the bufio client's lru lists doesn't result in scheduling
latency.
Reported-by: Joe Thornber <thornber@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.2+
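
The control-flow fix can be shown with a nested loop that returns as soon as the scan budget is used up, instead of only breaking out of the inner loop. This is a simplified sketch: the lists stand in for the client's LRU lists, and `try_evict` is a hypothetical callback.

```python
def scan(lru_lists, nr_to_scan, try_evict):
    """Walk the LRU lists, stopping as soon as nr_to_scan reaches zero."""
    freed = 0
    for lru in lru_lists:
        for buf in list(lru):       # iterate over a copy while removing
            if try_evict(buf):
                lru.remove(buf)
                freed += 1
            nr_to_scan -= 1
            if nr_to_scan == 0:
                return freed        # leave both loops, not just the inner one
    return freed


lists = [[1, 2, 3], [4, 5, 6]]
freed = scan(lists, nr_to_scan=2, try_evict=lambda buf: True)
```

With a budget of two, only the first list loses entries; the second list is left untouched, which is the behaviour the fix is after.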

dm bufio: update last_accessed when relinking a buffer · eb76faf5

Committed by Joe Thornber on Sep 30, 2014

The 'last_accessed' member of the dm_buffer structure was only set when
the buffer was created.  This led to each buffer being discarded
after dm_bufio_max_age time even if it was used recently.  In practice
this resulted in all thinp metadata being evicted soon after being read
-- this is particularly problematic for metadata intensive workloads
like multithreaded small random IO.

'last_accessed' is now updated each time the buffer is moved to the head
of the LRU list, so the buffer is now properly discarded if it was not
used in dm_bufio_max_age time.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # v3.2+
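
A minimal sketch of the behaviour after the fix: the timestamp is refreshed whenever a buffer moves to the head of the LRU, not only when it is first linked. The `LruList` class and the injected clock are illustrative, not bufio's real types.

```python
import itertools


class Buffer:
    def __init__(self, block):
        self.block = block
        self.last_accessed = None


class LruList:
    def __init__(self, clock):
        self._clock = clock
        self.order = []  # most recently used first

    def link(self, buf):
        """Insert a new buffer at the head and stamp it."""
        buf.last_accessed = self._clock()
        self.order.insert(0, buf)

    def relink(self, buf):
        """Move a buffer to the head and refresh its timestamp."""
        self.order.remove(buf)
        buf.last_accessed = self._clock()
        self.order.insert(0, buf)


ticks = itertools.count(100, 50)        # fake clock: 100, 150, 200, ...
lru = LruList(clock=lambda: next(ticks))
a, b = Buffer("a"), Buffer("b")
lru.link(a)      # stamped 100
lru.link(b)      # stamped 150
lru.relink(a)    # stamped 200, so it no longer looks idle since creation
```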

dm raid: add discard support for RAID levels 4, 5 and 6 · 48cf06bc

Committed by Heinz Mauelshagen on Sep 24, 2014

In case of RAID levels 4, 5 and 6 we have to verify each RAID member's
ability to zero data on discards to avoid stripe data corruption -- if
discard_zeroes_data is not set for each RAID member discard support must
be disabled. But given the uncertainty of whether or not a RAID member
properly supports zeroing data on discard we require the user to
explicitly allow discard support on RAID levels 4, 5, and 6 by setting
a dm-raid module parameter, e.g.: dm-raid.devices_handle_discard_safely=Y
Otherwise, discards could cause data corruption on RAID4/5/6.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
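
The gating rule can be summarised as a small decision function. For RAID 4/5/6 it follows the message: every member must zero data on discard and the user must opt in. For other levels it assumes each member only needs to support discard at all; that part, along with the function and parameter names, is an assumption for illustration.

```python
RAID456 = (4, 5, 6)


def discard_enabled(raid_level, members, devices_handle_discard_safely=False):
    """Decide whether discards may be passed down to the RAID set.

    members is an iterable of (supports_discard, discard_zeroes_data) pairs.
    """
    for supports_discard, zeroes_data in members:
        if not supports_discard:
            return False
        if raid_level in RAID456:
            if not zeroes_data:
                return False    # one bad member disables discard for all
            if not devices_handle_discard_safely:
                return False    # user has not explicitly opted in
    return True


good = [(True, True), (True, True), (True, True)]
one_bad = [(True, True), (True, False), (True, True)]

raid5_default = discard_enabled(5, good)
raid5_opt_in = discard_enabled(5, good, devices_handle_discard_safely=True)
raid5_bad_member = discard_enabled(5, one_bad, devices_handle_discard_safely=True)
raid1_no_zeroing = discard_enabled(1, one_bad)
```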

dm raid: add discard support for RAID levels 1 and 10 · 75b8e04b

Committed by Heinz Mauelshagen on Sep 24, 2014

Discard support is not enabled for RAID levels 4, 5, and 6 at this time
due to concerns about unreliable discard_zeroes_data support on some
hardware.  Otherwise, discards could cause stripe data corruption
(classic example of bad apples spoiling the bunch).
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm: allow active and inactive tables to share dm_devs · 86f1152b

Committed by Benjamin Marzinski on Aug 13, 2014

Until this change, when loading a new DM table, DM core would re-open
all of the devices in the DM table. Now, DM core will avoid redundant
device opens (and closes when destroying the old table) if the old
table already has a device open using the same mode. This is achieved
by managing reference counts on the table_devices that DM core now
stores in the mapped_device structure (rather than in the dm_table
structure). So a mapped_device's active and inactive dm_tables' dm_dev
lists now just point to the dm_devs stored in the mapped_device's
table_devices list.

This improvement in DM core's device reference counting has the
side-effect of fixing a long-standing limitation of the multipath
target: a DM multipath table couldn't include any paths that were unusable
(failed). For example: if all paths have failed and you add a new,
working, path to the table; you can't use it since the table load would
fail due to it still containing failed paths. Now a re-load of a
multipath table can include failed devices and when those devices become
active again they can be used instantly.

The device list code in dm.c isn't a straight copy/paste from the code in
dm-table.c, but it's very close (aside from some variable renames). One
subtle difference is that find_table_device for the table_devices list
will only match devices with the same name and mode. This is because we
don't want to upgrade a device's mode in the active table when an
inactive table is loaded.

Access to the mapped_device structure's table_devices list requires a
mutex (table_devices_lock), so that tables cannot be created and
destroyed concurrently.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
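
The reference-counting idea can be sketched with a dictionary keyed by device name and mode, guarded by a lock. This is a toy model: `MappedDevice` here is a plain Python class, `open_device` is a hypothetical callback standing in for the real device open, and the method names are illustrative.

```python
import threading


class MappedDevice:
    def __init__(self, open_device):
        self._open_device = open_device
        self._lock = threading.Lock()   # plays the role of the mutex
        self.table_devices = {}         # (name, mode) -> handle and count

    def get_table_device(self, name, mode):
        """Reuse an open device with the same name and mode, else open it."""
        with self._lock:
            entry = self.table_devices.get((name, mode))
            if entry is None:
                entry = {"handle": self._open_device(name, mode), "count": 0}
                self.table_devices[(name, mode)] = entry
            entry["count"] += 1
            return entry["handle"]

    def put_table_device(self, name, mode):
        """Drop one reference; forget the device when the count hits zero."""
        with self._lock:
            entry = self.table_devices[(name, mode)]
            entry["count"] -= 1
            if entry["count"] == 0:
                del self.table_devices[(name, mode)]


opens = []
md = MappedDevice(lambda name, mode: opens.append((name, mode)) or len(opens))
md.get_table_device("8:16", "rw")   # active table opens the device
md.get_table_device("8:16", "rw")   # inactive table reuses the same open
md.get_table_device("8:16", "ro")   # different mode, so a separate entry
rw_count = md.table_devices[("8:16", "rw")]["count"]
```

Matching on both name and mode mirrors the message's point that loading an inactive table should not upgrade the mode of a device held by the active table.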