1. 06 Mar 2014 (4 commits)
    • dm thin: fix noflush suspend IO queueing · 738211f7
      Joe Thornber authored
      i) By the time DM core calls the postsuspend hook the dm_noflush flag
      has already been cleared, so the old thin_postsuspend did nothing.  We
      need to use the presuspend hook instead.
      
      ii) There was a race between bios leaving DM core and arriving in the
      deferred queue.
      
      thin_presuspend now sets a 'requeue' flag causing all bios destined for
      that thin to be requeued back to DM core.  Then it requeues all held IO,
      and all IO on the deferred queue (destined for that thin).  Finally
      postsuspend clears the 'requeue' flag.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
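
      The flag-based scheme is easy to picture.  Below is a minimal sketch
      with simplified names; requeue_held_io() and requeue_deferred_io() are
      hypothetical stand-ins for the real dm-thin.c helpers, and the
      workqueue handshake the real code uses to set the flag safely is
      ignored:

          static void thin_presuspend(struct dm_target *ti)
          {
              struct thin_c *tc = ti->private;

              /* From here on, bios for this thin are bounced back to DM
               * core with DM_ENDIO_REQUEUE instead of being deferred. */
              tc->requeue_mode = true;

              requeue_held_io(tc);        /* hypothetical helper */
              requeue_deferred_io(tc);    /* hypothetical helper */
          }

          static void thin_postsuspend(struct dm_target *ti)
          {
              struct thin_c *tc = ti->private;

              /* The suspend has completed; resume normal queueing. */
              tc->requeue_mode = false;
          }
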
    • dm thin: fix deadlock in __requeue_bio_list · 18adc577
      Joe Thornber authored
      The spin lock in requeue_io() was held for too long, allowing deadlock.
      Don't worry, due to other issues addressed in the following "dm thin:
      fix noflush suspend IO queueing" commit, this code was never called.
      
      Fix this by taking the spin lock for a much shorter period of time.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
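
      The fix follows the standard idiom for this kind of deadlock: steal the
      whole list while holding the lock, then complete the bios with the lock
      dropped.  A sketch of that idiom, close to but not verbatim from
      dm-thin.c:

          static void requeue_bio_list(struct thin_c *tc, struct bio_list *master)
          {
              struct bio *bio;
              struct bio_list bios;
              unsigned long flags;

              bio_list_init(&bios);

              /* Hold the lock only long enough to steal the queued bios. */
              spin_lock_irqsave(&tc->pool->lock, flags);
              bio_list_merge(&bios, master);
              bio_list_init(master);
              spin_unlock_irqrestore(&tc->pool->lock, flags);

              /* Complete them with no lock held. */
              while ((bio = bio_list_pop(&bios)))
                  bio_endio(bio, DM_ENDIO_REQUEUE);
          }
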
    • dm thin: fix out of data space handling · 3e1a0699
      Joe Thornber authored
      Ideally a thin pool would never run out of data space; the low water
      mark would trigger userland to extend the pool before we completely run
      out of space.  However, many small random IOs to unprovisioned space can
      consume data space at an alarming rate.  Adjust your low water mark if
      you're frequently seeing "out-of-data-space" mode.
      
      Before this fix, if data space ran out the pool would be put in
      PM_READ_ONLY mode which also aborted the pool's current metadata
      transaction (data loss for any changes in the transaction).  This had a
      side-effect of needlessly compromising data consistency.  And retry of
      queued unserviceable bios, once the data pool was resized, could
      initiate changes to potentially inconsistent pool metadata.
      
      Now, when the pool's data space is exhausted, the pool transitions to a
      new mode (PM_OUT_OF_DATA_SPACE) that allows metadata to be changed but
      no data to be allocated.  This allows users to remove thin volumes or
      discard data to recover data space.
      
      The pool is no longer put in PM_READ_ONLY mode in response to the pool
      running out of data space.  And PM_READ_ONLY mode no longer aborts the
      pool's current metadata transaction.  Also, set_pool_mode() will now
      notify userspace when the pool mode is changed.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
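
      The resulting mode ladder, with the new state slotted between writable
      and read-only, looks roughly like this (comments paraphrase the commit;
      treat the details as a sketch of dm-thin.c rather than the literal
      source):

          enum pool_mode {
              PM_WRITE,               /* metadata + data may be changed */
              PM_OUT_OF_DATA_SPACE,   /* metadata may be changed, but no
                                         data may be allocated */
              PM_READ_ONLY,           /* metadata may NOT be changed */
              PM_FAIL,                /* all I/O fails */
          };

          static void set_pool_mode(struct pool *pool, enum pool_mode new_mode)
          {
              /* ... per-mode setup elided ... */
              pool->pf.mode = new_mode;

              /* new with this fix: tell userspace the mode changed */
              dm_table_event(pool->ti->table);
          }
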
    • dm thin: ensure user takes action to validate data and metadata consistency · 07f2b6e0
      Mike Snitzer authored
      If a thin metadata operation fails, the current transaction is aborted,
      creating the potential for data loss in IO layers up the stack
      (e.g. filesystems).  As such, set THIN_METADATA_NEEDS_CHECK_FLAG in the
      thin metadata's superblock, which:
      1) requires the user verify the thin metadata is consistent (e.g. use
         thin_check, etc)
      2) suggests the user verify the thin data is consistent (e.g. use fsck)
      
      The only way to clear the superblock's THIN_METADATA_NEEDS_CHECK_FLAG is
      to run thin_repair.
      
      On metadata operation failure: abort current metadata transaction, set
      pool in read-only mode, and now set the needs_check flag.
      
      As part of this change, constraints are introduced or relaxed:
      * don't allow a pool to transition to write mode if needs_check is set
      * don't allow data or metadata space to be resized if needs_check is set
      * if a thin pool's metadata space is exhausted, the kernel will now
        force the user to take the pool offline for repair before allowing
        the metadata space to be extended
      
      Also, update Documentation to include information about when the thin
      provisioning target commits metadata, how it handles metadata failures
      and running out of space.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Joe Thornber <ejt@redhat.com>
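
      The failure path reads as one helper that degrades the pool while
      persisting the flag.  A sketch of its likely shape (helper names follow
      dm-thin.c, bodies simplified):

          static void abort_transaction(struct pool *pool)
          {
              if (dm_pool_abort_metadata(pool->pmd))
                  set_pool_mode(pool, PM_FAIL);

              /* Persist needs_check; only thin_repair clears it. */
              if (dm_pool_metadata_set_needs_check(pool->pmd))
                  set_pool_mode(pool, PM_FAIL);
          }

          static void metadata_operation_failed(struct pool *pool,
                                                const char *op, int r)
          {
              DMERR("metadata operation '%s' failed: error = %d", op, r);
              abort_transaction(pool);
              set_pool_mode(pool, PM_READ_ONLY);
          }
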
  2. 05 Mar 2014 (1 commit)
    • dm thin: synchronize the pool mode during suspend · cdc2b415
      Mike Snitzer authored
      Commit b5330655 ("dm thin: handle metadata failures more consistently")
      increased potential for the pool's mode to be changed in response to
      metadata operation failures.
      
      When the pool mode is changed it isn't synchronized with the mode in
      pool_features stored in the target's context (ti->private) that is used
      as the basis for (re)establishing the pool mode during resume via
      bind_control_target.
      
      It is important that we synchronize the pool mode when it is changed,
      otherwise the pool may experience an unexpected mode transition on the
      next resume (especially if there was no new table load).
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
  3. 04 Mar 2014 (2 commits)
    • dm snapshot: fix metadata corruption · 2c945820
      Mikulas Patocka authored
      Commit 55494bf2 ("dm snapshot: use dm-bufio") broke snapshots.
      Before that 3.14-rc1 commit, loading a snapshot's list of exceptions
      involved reading exception areas one by one into ps->area and inserting
      those exceptions into the hash table.  Commit 55494bf2 changed
      it so that dm-bufio with prefetch is used to load exceptions in batches.
      Exceptions are loaded correctly, but ps->area is left uninitialized.
      When a new exception is allocated, it is stored in this uninitialized
      ps->area which will be written to the disk.  This causes metadata
      corruption.
      
      Fix this corruption by copying the last area that was read via dm-bufio
      into ps->area.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
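
      The fix itself is a single copy at the end of the load loop.  A sketch
      of the surrounding code, simplified from dm-snap-persistent.c:

          /* read_exceptions(): load each on-disk exception area via dm-bufio */
          r = insert_exceptions(ps, area, callback, callback_context, &full);

          /* The last (not completely full) area is the one that new
           * exceptions are appended to before being written back, so keep
           * an in-core copy of it in ps->area. */
          if (!full)
              memcpy(ps->area, area, ps->store->chunk_size << SECTOR_SHIFT);

          dm_bufio_release(bp);
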
    • dm: fix Kconfig indentation · c64d240d
      Mike Snitzer authored
      Since DM_DEBUG_BLOCK_STACK_TRACING is a DM_PERSISTENT_DATA config option
      move it from drivers/md/Kconfig to drivers/md/persistent-data/Kconfig.
      
      Doing so fixes indentation for other DM config options.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  4. 01 Mar 2014 (1 commit)
    • dm cache mq: fix memory allocation failure for large cache devices · 14f398ca
      Heinz Mauelshagen authored
      The memory allocated for the multiqueue policy's hash table doesn't need
      to be physically contiguous.  Use vzalloc() instead of kzalloc().
      Fedora has been carrying this fix since 10/10/2013.
      
      Failure seen during creation of a 10TB cached device with a 2048 sector
      block size and 411GB cache size:
      
       dmsetup: page allocation failure: order:9, mode:0x10c0d0
       CPU: 11 PID: 29235 Comm: dmsetup Not tainted 3.10.4 #3
       Hardware name: Supermicro X8DTL/X8DTL, BIOS 2.1a       12/30/2011
        000000000010c0d0 ffff880090941898 ffffffff81387ab4 ffff880090941928
        ffffffff810bb26f 0000000000000009 000000000010c0d0 ffff880090941928
        ffffffff81385dbc ffffffff815f3840 ffffffff00000000 000002000010c0d0
       Call Trace:
        [<ffffffff81387ab4>] dump_stack+0x19/0x1b
        [<ffffffff810bb26f>] warn_alloc_failed+0x110/0x124
        [<ffffffff81385dbc>] ? __alloc_pages_direct_compact+0x17c/0x18e
        [<ffffffff810bda2e>] __alloc_pages_nodemask+0x6c7/0x75e
        [<ffffffff810bdad7>] __get_free_pages+0x12/0x3f
        [<ffffffff810ea148>] kmalloc_order_trace+0x29/0x88
        [<ffffffff810ec1fd>] __kmalloc+0x36/0x11b
        [<ffffffffa031eeed>] ? mq_create+0x1dc/0x2cf [dm_cache_mq]
        [<ffffffffa031efc0>] mq_create+0x2af/0x2cf [dm_cache_mq]
        [<ffffffffa0314605>] dm_cache_policy_create+0xa7/0xd2 [dm_cache]
        [<ffffffffa0312530>] ? cache_ctr+0x245/0xa13 [dm_cache]
        [<ffffffffa031263e>] cache_ctr+0x353/0xa13 [dm_cache]
        [<ffffffffa012b916>] dm_table_add_target+0x227/0x2ce [dm_mod]
        [<ffffffffa012e8e4>] table_load+0x286/0x2ac [dm_mod]
        [<ffffffffa012e65e>] ? dev_wait+0x8a/0x8a [dm_mod]
        [<ffffffffa012e324>] ctl_ioctl+0x39a/0x3c2 [dm_mod]
        [<ffffffffa012e35a>] dm_ctl_ioctl+0xe/0x12 [dm_mod]
        [<ffffffff81101181>] vfs_ioctl+0x21/0x34
        [<ffffffff811019d3>] do_vfs_ioctl+0x3b1/0x3f4
        [<ffffffff810f4d2e>] ? ____fput+0x9/0xb
        [<ffffffff81050b6c>] ? task_work_run+0x7e/0x92
        [<ffffffff81101a68>] SyS_ioctl+0x52/0x82
        [<ffffffff81391d92>] system_call_fastpath+0x16/0x1b
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
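
      The general rule: a table that only the CPU dereferences does not need
      physically contiguous pages, so allocate it from vmalloc space.  A
      sketch of the mq policy's hash setup with simplified names:

          #include <linux/vmalloc.h>
          #include <linux/log2.h>

          struct hash {
              struct hlist_head *buckets;
              unsigned nr_buckets;
              unsigned hash_bits;
          };

          static int alloc_hash(struct hash *hash, unsigned elts)
          {
              hash->nr_buckets = roundup_pow_of_two(max(elts >> 4, 16u));
              hash->hash_bits = ffs(hash->nr_buckets) - 1;

              /* vzalloc: virtually contiguous is enough for a hash table
               * and, unlike an order-9 kzalloc, doesn't fail on
               * fragmented memory */
              hash->buckets = vzalloc(sizeof(*hash->buckets) * hash->nr_buckets);

              return hash->buckets ? 0 : -ENOMEM;
          }

          static void free_hash(struct hash *hash)
          {
              vfree(hash->buckets);   /* pairs with vzalloc(), not kfree() */
          }
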
  5. 28 Feb 2014 (2 commits)
    • dm cache: fix truncation bug when mapping I/O to >2TB fast device · e0d849fa
      Heinz Mauelshagen authored
      When remapping a block to a cache fast device larger than 2TB we must
      not truncate the destination sector to 32 bits.  The 32-bit temporary
      result of from_cblock() was being overflowed in remap_to_cache() due to
      the logical left shift.
      
      Use an intermediate 64-bit type to store the 32-bit from_cblock()
      result and fix the overflow.
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
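
      The underlying C pitfall: shifting a 32-bit value left yields a 32-bit
      result, so the high bits are already gone before the assignment to the
      64-bit bi_sector widens it.  A sketch, reduced to the power-of-two
      block-size case:

          static void remap_to_cache(struct cache *cache, struct bio *bio,
                                     dm_cblock_t cblock)
          {
              sector_t bi_sector = bio->bi_iter.bi_sector;
              sector_t block = from_cblock(cblock);   /* widen FIRST */

              bio->bi_bdev = cache->cache_dev->bdev;

              /* Before the fix the 32-bit from_cblock() result was shifted
               * directly, overflowing for fast devices larger than 2TB. */
              bio->bi_iter.bi_sector =
                  (block << cache->sectors_per_block_shift) |
                  (bi_sector & (cache->sectors_per_block - 1));
          }
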
    • dm thin: allow metadata space larger than supported to go unused · 7d48935e
      Mike Snitzer authored
      It was always intended that a user could provide a thin metadata device
      that is larger than the max supported by the on-disk format.  The extra
      space would just go unused.
      
      Unfortunately that never worked.  If the user attempted to use a larger
      metadata device on creation they would get an error like the following:
      
       device-mapper: space map common: space map too large
       device-mapper: transaction manager: couldn't create metadata space map
       device-mapper: thin metadata: tm_create_with_sm failed
       device-mapper: table: 252:17: thin-pool: Error creating metadata object
       device-mapper: ioctl: error adding target to table
      
      Fix this by allowing the initial metadata space map creation to cap its
      size at the max number of blocks supported (DM_SM_METADATA_MAX_BLOCKS).
      get_metadata_dev_size() must also impose DM_SM_METADATA_MAX_BLOCKS (via
      THIN_METADATA_MAX_SECTORS), otherwise extending metadata would cap at
      THIN_METADATA_MAX_SECTORS_WARNING (which is larger than supported).
      
      Also, the calculation for THIN_METADATA_MAX_SECTORS didn't account for
      the size of the disk_bitmap_header.  So the supported maximum metadata
      size is a bit smaller (reduced from 33423360 to 33292800 sectors).
      
      Lastly, remove the "excess space will not be used" warning message from
      get_metadata_dev_size(); it resulted in printing the warning multiple
      times.  Factor out warn_if_metadata_device_too_big(), call it from
      pool_ctr() and maybe_resize_metadata_dev().
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
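
      Both the create and resize paths clamp against the same limit.  A
      sketch of the resize-side check (the constants are the real ones, the
      body is simplified):

          static sector_t get_metadata_dev_size(struct block_device *bdev)
          {
              sector_t metadata_dev_size =
                  i_size_read(bdev->bd_inode) >> SECTOR_SHIFT;

              /* Cap at what the on-disk format can address (derived from
               * DM_SM_METADATA_MAX_BLOCKS); extra space just goes unused. */
              if (metadata_dev_size > THIN_METADATA_MAX_SECTORS)
                  metadata_dev_size = THIN_METADATA_MAX_SECTORS;

              return metadata_dev_size;
          }
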
  6. 26 Feb 2014 (1 commit)
  7. 25 Feb 2014 (1 commit)
    • dm thin: fix the error path for the thin device constructor · 1acacc07
      Mike Snitzer authored
      dm_pool_close_thin_device() must be called if dm_set_target_max_io_len()
      fails in thin_ctr().  Otherwise __pool_destroy() will fail because the
      pool will still have an open thin device:
      
       device-mapper: thin metadata: attempt to close pmd when 1 device(s) are still open
       device-mapper: thin: __pool_destroy: dm_pool_metadata_close() failed.
      
      Also, an error code must be established when failing thin_ctr() because
      the pool is in fail_io mode.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Cc: stable@vger.kernel.org
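
      The shape of the fix is routine constructor unwinding: each failure
      exit must release everything acquired before it.  A sketch of
      thin_ctr()'s tail with simplified labels:

          r = dm_pool_open_thin_device(tc->pool->pmd, tc->dev_id, &tc->td);
          if (r) {
              ti->error = "Couldn't open thin internal device";
              goto bad_pool_lookup;
          }

          r = dm_set_target_max_io_len(ti, tc->pool->sectors_per_block);
          if (r)
              goto bad_target_max_io_len;   /* previously leaked tc->td */

          /* ... */
          return 0;

      bad_target_max_io_len:
          dm_pool_close_thin_device(tc->td);    /* the missing cleanup */
      bad_pool_lookup:
          /* ... release the pool reference, mutex, etc. ... */
          return r;
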
  8. 18 Feb 2014 (5 commits)
    • dm raid1: fix immutable biovec related BUG when retrying read bio · f3a44fe0
      Mikulas Patocka authored
      When restoring bi_end_io, increase bi_remaining before retrying the bio
      to avoid BUG_ON(atomic_read(&bio->bi_remaining) <= 0) in bio_endio().
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
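
      With 3.14's immutable-biovec changes, bio_endio() pairs with an atomic
      bi_remaining count, so a bio that is resurrected for a retry must
      re-arm that count.  A sketch of the spot in mirror_end_io(), with the
      retry condition and per-bio record lookup elided:

          /* ... a read failed and another mirror leg can service it ... */
          dm_bio_restore(bd, bio);    /* restore bi_end_io, bi_bdev, bi_iter */

          /* bio_endio() already dropped bi_remaining once on the way here;
           * re-arm it or the retry's completion trips
           * BUG_ON(atomic_read(&bio->bi_remaining) <= 0). */
          atomic_inc(&bio->bi_remaining);

          queue_bio(ms, bio, bio_data_dir(bio));  /* retry on another leg */
          return DM_ENDIO_INCOMPLETE;
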
    • dm io: fix I/O to multiple destinations · d73f9907
      Mikulas Patocka authored
      Commit 003b5c57 ("block: Convert drivers
      to immutable biovecs") broke dm-mirror due to dm-io breakage.
      
      dm-io had three possible iterators (DM_IO_PAGE_LIST, DM_IO_BVEC,
      DM_IO_VMA) that iterate over pages where the I/O should be performed.
      
      The switch to immutable biovecs changed the DM_IO_BVEC iterator to
      DM_IO_BIO.  Before this change the iterator stored the pointer to a bio
      vector in the dpages structure.  The iterator incremented the pointer in
      the dpages structure as it advanced over the pages.  After the immutable
      biovecs change, the DM_IO_BIO iterator stores a pointer to the bio in
      the dpages structure and uses bio_advance to change the bio as it
      advances.
      
      The problem is that the function dispatch_io stores the content of the
      dpages structure into the variable old_pages and restores it before
      issuing I/O to each of the devices.  Before the change, the statement
      "*dp = old_pages;" restored the iterator to its starting position.
      After the change, struct dpages holds a pointer to the bio, thus the
      statement "*dp = old_pages;" doesn't restore the iterator.
      
      Consequently, in the context of dm-mirror: only the first mirror leg is
      written correctly; the kernel locks up when trying to write the other
      mirror legs because the number of sectors to write in the where->count
      variable doesn't match the number of sectors returned by the iterator.
      
      This patch fixes the bug by partially reverting the original patch - it
      changes the code so that struct dpages holds a pointer to the bio vector,
      so that the statement "*dp = old_pages;" restores the iterator correctly.
      
      The field "context_u" holds the offset from the beginning of the current
      bio vector entry, just like the "bio->bi_iter.bi_bvec_done" field.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
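
      The save/rewind pattern at the heart of the bug is worth seeing.  A
      sketch of dispatch_io(), simplified (flush handling elided):

          static void dispatch_io(int rw, unsigned int num_regions,
                                  struct dm_io_region *where, struct dpages *dp,
                                  struct io *io, int sync)
          {
              int i;
              struct dpages old_pages = *dp;  /* snapshot the iterator */

              for (i = 0; i < num_regions; i++) {
                  /* Rewind before each destination.  This only works if
                   * struct dpages itself records the position (bvec
                   * pointer + offset within it) rather than delegating
                   * the position to the bio. */
                  *dp = old_pages;
                  if (where[i].count)
                      do_region(rw, i, where + i, dp, io);
              }
              /* ... */
          }
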
    • dm thin: avoid metadata commit if a pool's thin devices haven't changed · 4d1662a3
      Mike Snitzer authored
      Commit 905e51b3 ("dm thin: commit outstanding data every second")
      introduced a periodic commit.  This commit occurs regardless of whether
      any thin devices have made changes.
      
      Fix the periodic commit to check if any of a pool's thin devices have
      changed using dm_pool_changed_this_transaction().
      Reported-by: Alexander Larsson <alexl@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Cc: stable@vger.kernel.org
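
      The check is a one-liner in the periodic path;
      dm_pool_changed_this_transaction() is the helper this commit adds,
      while the surrounding condition is a simplified sketch of
      process_deferred_bios():

          /* Periodic commit: skip it entirely when no thin device has
           * changed anything in the current transaction. */
          if (bio_list_empty(&bios) &&
              !(dm_pool_changed_this_transaction(pool->pmd) &&
                need_commit_due_to_time(pool)))
              return;
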
    • dm cache: do not add migration to completed list before unhooking bio · 80ae49aa
      Mike Snitzer authored
      When completing an overwrite bio, in overwrite_endio(), the associated
      migration should not be added to the 'completed_migrations' until the
      bio's fields are restored with dm_unhook_bio().
      
      Otherwise, do_worker() can race to process 'completed_migrations' before
      dm_unhook_bio() -- so the bio's bi_end_io is incorrect.  This is
      unlikely to cause any problems given the current code but should be
      fixed on the basis of correctness.
      
      Also, the cache's spinlock only needs to be held when manipulating the
      'completed_migrations' list -- other changes don't need protection.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
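
      After the fix the completion handler runs in the safe order.  A sketch
      of overwrite_endio(), close to dm-cache-target.c but simplified:

          static void overwrite_endio(struct bio *bio, int err)
          {
              struct dm_cache_migration *mg = bio->bi_private;
              struct cache *cache = mg->cache;
              struct per_bio_data *pb =
                  get_per_bio_data(bio, get_per_bio_data_size(cache));
              unsigned long flags;

              /* Restore bi_end_io/bi_private BEFORE publishing the
               * migration: once on the list, do_worker() may process it
               * immediately. */
              dm_unhook_bio(&pb->hook_info, bio);

              if (err)
                  mg->err = true;

              /* The lock only needs to cover the list manipulation. */
              spin_lock_irqsave(&cache->lock, flags);
              list_add_tail(&mg->list, &cache->completed_migrations);
              spin_unlock_irqrestore(&cache->lock, flags);

              wake_worker(cache);
          }
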
    • dm cache: move hook_info into common portion of per_bio_data structure · c6eda5e8
      Mike Snitzer authored
      Commit c9d28d5d ("dm cache: promotion optimisation for writes")
      incorrectly placed the 'hook_info' member in the writethrough-only
      portion of the per_bio_data structure.
      
      Given that the overwrite optimization may be used for writeback the
      'hook_info' member must be placed above the 'cache' member of the
      per_bio_data structure.  Any members above 'cache' are available from
      both writeback and writethrough modes' per_bio_data structure.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Cc: stable@vger.kernel.org # 3.13+
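
      The layout rule is easiest to see in the structure itself.  A sketch of
      per_bio_data after the fix, field list abridged from
      dm-cache-target.c:

          struct per_bio_data {
              bool tick:1;
              unsigned req_nr:2;
              struct dm_deferred_entry *all_io_entry;
              struct dm_hook_info hook_info;  /* moved above 'cache': the
                                                 overwrite optimisation uses
                                                 it in writeback mode too */

              /*
               * writethrough fields.  These must come last: writeback mode
               * only allocates the structure up to, not including, 'cache'.
               */
              struct cache *cache;
              dm_cblock_t cblock;
              struct dm_bio_details bio_details;
          };
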
  9. 13 Feb 2014 (1 commit)
    • md/raid5: Fix CPU hotplug callback registration · 789b5e03
      Oleg Nesterov authored
      Subsystems that want to register CPU hotplug callbacks, as well as perform
      initialization for the CPUs that are already online, often do it as shown
      below:
      
      	get_online_cpus();
      
      	for_each_online_cpu(cpu)
      		init_cpu(cpu);
      
      	register_cpu_notifier(&foobar_cpu_notifier);
      
      	put_online_cpus();
      
      This is wrong, since it is prone to ABBA deadlocks involving the
      cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
      with CPU hotplug operations).
      
      Interestingly, the raid5 code can actually prevent double initialization and
      hence can use the following simplified form of callback registration:
      
      	register_cpu_notifier(&foobar_cpu_notifier);
      
      	get_online_cpus();
      
      	for_each_online_cpu(cpu)
      		init_cpu(cpu);
      
      	put_online_cpus();
      
      A hotplug operation that occurs between registering the notifier and
      calling get_online_cpus() won't disrupt anything, because the code
      takes care to perform the memory allocations only once.
      
      So reorganize the code in raid5 this way to fix the deadlock with callback
      registration.
      
      Cc: linux-raid@vger.kernel.org
      Cc: stable@vger.kernel.org (v2.6.32+)
      Fixes: 36d1c647
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      [Srivatsa: Fixed the unregister_cpu_notifier() deadlock, added the
      free_scratch_buffer() helper to condense code further and wrote the changelog.]
      Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  10. 11 Feb 2014 (1 commit)
  11. 05 Feb 2014 (1 commit)
    • md/raid1: restore ability for check and repair to fix read errors. · 1877db75
      NeilBrown authored
      Commit 30bc9b53 ("md/raid1: fix bio handling problems in
      process_checks()") moved the bio_reset() to a point before where
      BIO_UPTODATE is checked, so that the check now always reports that the
      bio is uptodate, even if it is not.
      
      This causes process_checks() to sometimes treat read errors as
      successful matches, so the good data isn't written out.
      
      This patch preserves the flag until it is needed.
      
      The bug was introduced in 3.11 but backported to 3.10-stable (as it
      fixed an even worse bug), so this fix is suitable for any -stable
      kernel since 3.10.
      Reported-and-tested-by: Michael Tokarev <mjt@tls.msk.ru>
      Cc: stable@vger.kernel.org (3.10+)
      Fixes: 30bc9b53
      Signed-off-by: NeilBrown <neilb@suse.de>
  12. 30 Jan 2014 (3 commits)
  13. 22 Jan 2014 (3 commits)
    • dm log userspace: allow mark requests to piggyback on flush requests · 5066a4df
      Dongmao Zhang authored
      In a cluster environment, cluster writes have poor performance because
      userspace_flush() has to contact a userspace program (cmirrord) for
      clear/mark/flush requests.  But both mark and flush requests require
      cmirrord to communicate the message to all the cluster nodes for each
      flush call.  This behaviour is really slow.
      
      To address this we now merge mark and flush requests together to reduce
      the kernel-userspace-kernel time.  We allow a new directive,
      "integrated_flush" that can be used to instruct the kernel log code to
      combine flush and mark requests when directed by userspace.  If not
      directed by userspace (due to an older version of the userspace code
      perhaps), the kernel will function as it did previously - preserving
      backwards compatibility.  Additionally, flush requests are performed
      lazily when only clear requests exist.
      Signed-off-by: Dongmao Zhang <dmzhang@suse.com>
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • md/raid5: close recently introduced race in stripe_head management. · 7da9d450
      NeilBrown authored
      As release_stripe and __release_stripe decrement ->count and then
      manipulate ->lru both under ->device_lock, it is important that
      get_active_stripe() increments ->count and clears ->lru also under
      ->device_lock.
      
      However we currently list_del_init ->lru under the lock, but increment
      the ->count outside the lock.  This can lead to races and list
      corruption.
      
      So move the atomic_inc(&sh->count) up inside the ->device_lock
      protected region.
      
      Note that we still increment ->count without device lock in the case
      where get_free_stripe() was called, and in fact don't take
      ->device_lock at all in that path.
      This is safe because if the stripe_head can be found by
      get_free_stripe, then the hash lock assures us that no-one else could
      possibly be calling release_stripe() at the same time.
      
      Fixes: 566c09c5
      Cc: stable@vger.kernel.org (3.13)
      Reported-and-tested-by: Ian Kumlien <ian.kumlien@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
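
      The fixed locking in get_active_stripe() then looks roughly like this
      (a sketch; the surrounding retry loop and BUG_ON checks are elided):

          spin_lock(&conf->device_lock);
          if (!atomic_read(&sh->count)) {
              if (!test_bit(STRIPE_HANDLE, &sh->state))
                  atomic_inc(&conf->active_stripes);
              list_del_init(&sh->lru);
          }
          /* The increment must stay inside device_lock: release_stripe()
           * decrements ->count and touches ->lru under this same lock, so
           * incrementing outside it allowed the list corruption. */
          atomic_inc(&sh->count);
          spin_unlock(&conf->device_lock);
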
    • dm space map metadata: fix bug in resizing of thin metadata · fca02843
      Joe Thornber authored
      This bug was introduced in commit 7e664b3d ("dm space map metadata:
      fix extending the space map").
      
      When extending a dm-thin metadata volume we:
      
      - Switch the space map into a simple bootstrap mode, which allocates
        all space linearly from the newly added space.
      - Add new bitmap entries for the new space
      - Increment the reference counts for those newly allocated bitmap
        entries
      - Commit changes to disk
      - Switch back out of bootstrap mode.
      
      But the disk commit may allocate space itself; if so, this fact will be
      lost when switching out of bootstrap mode.
      
      The bug exhibited itself as an error when the bitmap_root, with an
      erroneous ref count of 0, was subsequently decremented as part of a
      later disk commit.  This would cause the disk commit to fail, and thinp
      to enter read_only mode.  The metadata was not damaged (thin_check
      passed).
      
      The fix is to put the increments + commit into a loop that runs until
      the commit no longer allocates extra space.  In practice this loop only
      runs twice.
      
      With this fix the following device mapper testsuite test passes:
       dmtest run --suite thin-provisioning -n thin_remove_works_after_resize
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # depends on commit 7e664b3d
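
      The loop described above, sketched with simplified helper names from
      dm-space-map-metadata.c (declarations elided):

          /* sm_metadata_extend(), simplified: while in bootstrap mode the
           * commit itself may allocate blocks from the new area, so repeat
           * until it no longer does. */
          do {
              for (i = old_len; !r && i < smm->begin; i++)
                  r = sm_ll_inc(&smm->ll, i, &ev);  /* ref new bitmap blocks */
              if (r)
                  goto out;

              old_len = smm->begin;

              r = sm_metadata_commit(sm);   /* may advance smm->begin */
              if (r)
                  goto out;
          } while (old_len != smm->begin);  /* in practice: two passes */
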
  14. 17 Jan 2014 (1 commit)
    • dm cache: add policy name to status output · 2e68c4e6
      Mike Snitzer authored
      The cache's policy may have been established using the "default" alias,
      which is currently the "mq" policy but the default policy may change in
      the future.  It is useful to know exactly which policy is being used.
      
      Add a 'real' member to the dm_cache_policy_type structure and have the
      "default" dm_cache_policy_type point to the real "mq"
      dm_cache_policy_type.  Update dm_cache_policy_get_name() to check
      whether real is set and, if so, report the name of the real policy (not
      the alias).
      Requested-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
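
      A sketch of the mechanism, abridged from the dm-cache policy interface:

          struct dm_cache_policy_type {
              struct list_head list;
              char name[CACHE_POLICY_NAME_SIZE];
              unsigned version[CACHE_POLICY_VERSION_SIZE];

              /* NULL for concrete policies; for an alias such as
               * "default", points at the policy type it resolves to. */
              struct dm_cache_policy_type *real;

              struct dm_cache_policy *(*create)(dm_cblock_t cache_size,
                                                sector_t origin_size,
                                                sector_t block_size);
          };

          const char *dm_cache_policy_get_name(struct dm_cache_policy *p)
          {
              struct dm_cache_policy_type *t = p->private;

              /* Report the policy actually in use, not the alias. */
              if (t->real)
                  return t->real->name;

              return t->name;
          }
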
  15. 16 Jan 2014 (3 commits)
    • dm thin: fix pool feature parsing · 74aa45c3
      Mike Snitzer authored
      Commit 787a996c ("dm thin: add error_if_no_space feature")
      mistakenly forgot to increase the number of feature args supported.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • md/raid5: fix long-standing problem with bitmap handling on write failure. · 9f97e4b1
      NeilBrown authored
      Before a write starts we set a bit in the write-intent bitmap.
      When the write completes we clear that bit if the write was successful
      to all devices.  However if the write wasn't fully successful we
      should not clear the bit.  If the faulty drive is subsequently
      re-added, the fact that the bit is still set ensures that we will
      re-write the data that is missing.
      
      This logic is mediated by the STRIPE_DEGRADED flag - we only clear the
      bitmap bit when this flag is not set.
      Currently we correctly set the flag if a write starts when some
      devices are failed or missing.  But we do *not* set the flag if some
      device failed during the write attempt.
      This is wrong and can result in clearing the bit inappropriately.
      
      So: set the flag when a write fails.
      
      This bug has been present since bitmaps were introduced, so the fix is
      suitable for any -stable kernel.
      Reported-by: Ethan Wilson <ethan.wilson@shiftmail.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
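
      Both halves of the logic are small.  A sketch abridged from raid5.c:

          /* raid5_end_write_request(): the write to this device failed */
          if (!uptodate) {
              set_bit(STRIPE_DEGRADED, &sh->state);   /* the fix */
              set_bit(WriteErrorSeen, &rdev->flags);
              /* ... */
          }

          /* later, when the stripe's writes finish: only report success
           * to the bitmap (allowing the bit to clear) if nothing
           * degraded the stripe */
          bitmap_endwrite(conf->mddev->bitmap, sh->sector, STRIPE_SECTORS,
                          !test_bit(STRIPE_DEGRADED, &sh->state), 0);
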
    • md: check command validity early in md_ioctl(). · cb335f88
      Nicolas Schichan authored
      Verify that the cmd parameter passed to md_ioctl() is valid before
      doing anything.
      
      This fixes mddev->hold_active being set to 0 when an invalid ioctl
      command is passed to md_ioctl() before the array has been configured.
      
      Clearing mddev->hold_active in that case can lead to a livelock
      situation when an invalid ioctl number is given to md_ioctl() by a
      process when the mddev is currently being opened by another process:
      
      Process 1				Process 2
      ---------				---------
      
      md_alloc()
        mddev_find()
        -> returns a new mddev with
           hold_active == UNTIL_IOCTL
        add_disk()
        -> sends KOBJ_ADD uevent
      
      					(sees KOBJ_ADD uevent for device)
                          			md_open()
                          			md_ioctl(INVALID_IOCTL)
                          			-> returns ENODEV and clears
                             			   mddev->hold_active
                          			md_release()
                            			md_put()
                            			-> deletes the mddev as
                               		   hold_active is 0
      
      md_open()
        mddev_find()
        -> returns a newly
          allocated mddev with
          mddev->gendisk == NULL
      -> returns with ERESTARTSYS
         (kernel restarts the open syscall)
      Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
      Signed-off-by: NeilBrown <neilb@suse.de>
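
      A sketch of the early check; the command list here is abridged, the
      real md_ioctl_valid() enumerates every supported ioctl:

          static int md_ioctl_valid(unsigned int cmd)
          {
              switch (cmd) {
              case ADD_NEW_DISK:
              case GET_ARRAY_INFO:
              case GET_DISK_INFO:
              case RUN_ARRAY:
              case SET_ARRAY_INFO:
              case STOP_ARRAY:
              /* ... every other valid md ioctl ... */
                  return 0;
              default:
                  return -ENOTTY;
              }
          }

          static int md_ioctl(struct block_device *bdev, fmode_t mode,
                              unsigned int cmd, unsigned long arg)
          {
              int err;

              /* Reject garbage before touching mddev state such as
               * hold_active. */
              err = md_ioctl_valid(cmd);
              if (err)
                  return err;
              /* ... */
          }
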
  16. 15 Jan 2014 (5 commits)
    • dm sysfs: fix a module unload race · 2995fa78
      Mikulas Patocka authored
      This reverts commit be35f486 ("dm: wait until embedded kobject is
      released before destroying a device") and provides an improved fix.
      
      The kobject release code that calls the completion must be placed in a
      non-module file, otherwise there is a module unload race (if the process
      calling dm_kobject_release is preempted and the DM module unloaded after
      the completion is triggered, but before dm_kobject_release returns).
      
      To fix this race, this patch moves the completion code to dm-builtin.c
      which is always compiled directly into the kernel if BLK_DEV_DM is
      selected.
      
      The patch introduces a new dm_kobject_holder structure, its purpose is
      to keep the completion and kobject in one place, so that it can be
      accessed from non-module code without the need to export the layout of
      struct mapped_device to that code.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
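
      A sketch of the arrangement, abridged from dm.h and dm-builtin.c:

          /* dm.h: visible to both the module and built-in code, without
           * exposing the layout of struct mapped_device */
          struct dm_kobject_holder {
              struct kobject kobj;
              struct completion completion;
          };

          static inline struct completion *
          dm_get_completion_from_kobject(struct kobject *kobj)
          {
              return &container_of(kobj, struct dm_kobject_holder,
                                   kobj)->completion;
          }

          /* dm-builtin.c: compiled into the kernel whenever BLK_DEV_DM is
           * set, so this function cannot be unloaded while a release is
           * still in flight */
          void dm_kobject_release(struct kobject *kobj)
          {
              complete(dm_get_completion_from_kobject(kobj));
          }
          EXPORT_SYMBOL(dm_kobject_release);
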
    • dm snapshot: use dm-bufio prefetch · 55b082e6
      Mikulas Patocka authored
      This patch modifies dm-snapshot so that it prefetches the buffers when
      loading the exceptions.
      
      The number of buffers read ahead is specified in the DM_PREFETCH_CHUNKS
      macro.  The current value for DM_PREFETCH_CHUNKS (12) was found to
      provide the best performance on a single 15k SCSI spindle.  In the
      future we may modify this default or make it configurable.
      
      Also, introduce the function dm_bufio_set_minimum_buffers to setup
      bufio's number of internal buffers before freeing happens.  dm-bufio may
      hold more buffers if enough memory is available.  There is no guarantee
      that the specified number of buffers will be available - if you need a
      guarantee, use the argument reserved_buffers for
      dm_bufio_client_create.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm snapshot: use dm-bufio · 55494bf2
      Mikulas Patocka authored
      Use dm-bufio for initial loading of the exceptions.
      Introduce a new function dm_bufio_forget that frees the given buffer.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm snapshot: prepare for switch to using dm-bufio · 2cadabd5
      Mikulas Patocka authored
      Change the functions get_exception, read_exception and insert_exceptions
      so that ps->area is passed as an argument.
      
      This patch doesn't change any functionality, but it refactors the code
      to allow for a cleaner switch over to using dm-bufio.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm snapshot: use GFP_KERNEL when initializing exceptions · 119bc547
      Mikulas Patocka authored
      The list of initial exceptions is loaded in the target constructor.  We
      are allowed to allocate memory with GFP_KERNEL at this point.  So,
      change alloc_completed_exception to use GFP_KERNEL when being called
      from the constructor.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
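
      A sketch of the allocation helper, close to dm-snap.c, plus the flag
      each caller passes:

          static struct dm_exception *alloc_completed_exception(gfp_t gfp)
          {
              struct dm_exception *e;

              e = kmem_cache_alloc(exception_cache, gfp);
              if (!e && gfp == GFP_NOIO)
                  /* I/O path under memory pressure: last-ditch attempt */
                  e = kmem_cache_alloc(exception_cache, GFP_ATOMIC);

              return e;
          }

          /* constructor path: no I/O in flight yet, may sleep and reclaim */
          e = alloc_completed_exception(GFP_KERNEL);

          /* exception path: called while servicing I/O, must not recurse
           * into the block layer */
          e = alloc_completed_exception(GFP_NOIO);
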
  17. 14 Jan 2014 (5 commits)
    • md: ensure metadata is written after raid level change. · 830778a1
      NeilBrown authored
      level_store() currently does not make sure the metadata is updated to
      reflect the new raid level.  It simply sets MD_CHANGE_DEVS.
      
      Any level with a ->thread will quickly notice this and update the
      metadata.  However RAID0 and Linear do not have a thread so no
      metadata update happens until the array is stopped.  At that point the
      metadata is written.
      
      This is later than we would like.  While the delay doesn't risk any
      data it can cause confusion.  So if there is no md thread, immediately
      update the metadata after a level change.
      Reported-by: Richard Michael <rmichael@edgeofthenet.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
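
      The fix is two lines in level_store().  A sketch:

          /* ... raid level switched successfully ... */
          set_bit(MD_CHANGE_DEVS, &mddev->flags);

          /* RAID0 and Linear have no md thread to notice MD_CHANGE_DEVS,
           * so write the superblocks out now rather than at array stop. */
          if (!mddev->thread)
              md_update_sb(mddev, 1);
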
    • md/raid10: avoid fullsync when not necessary. · 0b59bb64
      NeilBrown authored
      This is the raid10 equivalent of commit 4f0a5e01 ("MD RAID1: Further
      conditionalize 'fullsync'").
      
      If a device in a newly assembled array is not fully recovered we
      currently do a full resync but don't need to.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: allow a partially recovered device to be hot-added to an array. · 7eb41885
      NeilBrown authored
      When adding a new device into an array it is normally important to
      clear any stale data from ->recovery_offset else the new device may
      not be recovered properly.
      
      However when re-adding a device which is known to be nearly in-sync,
      this is not needed and can be detrimental.  The (bitmap-based)
      resync will still happen, and further recovery is only needed from
      where-ever it was already up to.
      
      So if save_raid_disk is set, signifying a re-add, don't clear
      ->recovery_offset.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Change handling of save_raid_disk and metadata update during recovery. · f466722c
      NeilBrown authored
      Since commit d70ed2e4 ("MD: Allow restarting an interrupted incremental
      recovery") we don't write out the metadata to devices while they are
      recovering.
      This had a good reason, but has unfortunate consequences.  This patch
      changes things to make them work better.
      
      At issue is what happens if the array is shut down while a recovery is
      happening, particularly a bitmap-guided recovery.
      Ideally the recovery should pick up where it left off.
      However the metadata cannot represent the state "A recovery is in
      process which is guided by the bitmap".
      
      Before the above mentioned commit, we wrote metadata to the device
      which said "this is being recovered and it is up to <here>".  So after
      a restart, a full recovery (not bitmap-guided) would happen from
      where-ever it was up to.
      
      After the commit the metadata wasn't updated so it still said "This
      device is fully in sync with <this> event count".  That leads to a
      bitmap-based recovery following the whole bitmap, which should be a
      lot less work than a full recovery from some starting point.  So this
      was an improvement.
      
      However, updating some metadata but not all leads to other problems.
      In particular, the metadata written to the fully-up-to-date devices
      records that the array has all devices present (even though some are
      recovering).  So on restart, mdadm wants to find all devices and
      expects them to have current event counts.
      Obviously it doesn't (some have old event counts) so (when assembling
      with --incremental) it waits indefinitely for the rest of the expected
      devices.
      
      It really is wrong not to update all the metadata together; doing
      otherwise is bound to cause confusion.
      Instead, we should make it possible to record the truth in the
      metadata.  i.e. we need to be able to record that a device is being
      recovered based on the bitmap.
      We already have a Feature flag to say that recovery is happening.  We
      now add another one to say that it is a bitmap-based recovery.
      
      With this we can remove the code that disables the write-out of
      metadata on some devices.
      
      So this patch:
       - moves the setting of 'saved_raid_disk' from add_new_disk to
         the validate_super methods.  This makes sure it is always set
         properly, both when adding a new device to an array, and when
         assembling an array from a collection of devices.
       - Adds a metadata flag MD_FEATURE_RECOVERY_BITMAP which is only
         used if MD_FEATURE_RECOVERY_OFFSET is set, and record that a
         bitmap-based recovery is allowed.
         This is only present in v1.x metadata. v0.90 doesn't support
         devices which are in the middle of recovery at all.
       - Only skips writing metadata to Faulty devices.
      
       - Also allows rdev state to be set to "-insync" via sysfs.
         This can be used for external-metadata arrays.  When the
         'role' is set the device is assumed to be in-sync.  If, after
         setting the role, we set the state to "-insync", the role is
         moved to saved_raid_disk which effectively says the device is
         partly in-sync with that slot and needs a bitmap recovery.
      
      Cc: Andrei Warkentin <andreiw@vmware.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: fix problem when adding device to read-only array with bitmap. · 8313b8e5
      NeilBrown authored
      If an array is started degraded, and then the missing device
      is found it can be re-added and a minimal bitmap-based recovery
      will bring it fully up-to-date.
      
      If the array is read-only a recovery would not be allowed.
      But also if the array is read-only and the missing device was
      present very recently, then there could be no need for any
      recovery at all, so we simply include the device in the read-only
      array without any recovery.
      
      However... if the missing device was removed a little longer ago
      it could be missing some updates, but if a bitmap is present it will
      be conditionally accepted pending a bitmap-based update.  We don't
      currently detect this case properly and will include that old
      device into the read-only array with no recovery even though it really
      needs a recovery.
      
      This patch keeps track of whether a bitmap-based-recovery is really
      needed or not in the new Bitmap_sync rdev flag.  If that is set,
      then the device will not be added to a read-only array.
      
      Cc: Andrei Warkentin <andreiw@vmware.com>
      Fixes: d70ed2e4
      Cc: stable@vger.kernel.org (3.2+)
      Signed-off-by: NeilBrown <neilb@suse.de>