1. 04 Feb 2015, 8 commits
    • md: make merge_bvec_fn more robust in face of personality changes. · 64590f45
      Committed by NeilBrown
      There is no locking around calls to merge_bvec_fn(), so
      it is possible that calls which coincide with a level (or personality)
      change could go wrong.
      
      So create a central dispatch point for these functions and use
      rcu_read_lock().
      If the array is suspended, reject any merge that can be rejected.
      If not, we know it is safe to call the function.
      Signed-off-by: NeilBrown <neilb@suse.de>
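The central-dispatch idea can be sketched in userspace C. C11 atomics stand in for the kernel's rcu_read_lock()/rcu_dereference(), and all names and signatures here are illustrative, not the real md code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for the md personality ops and mddev state. */
struct personality {
    int (*mergeable_bvec)(int sectors);   /* returns max bytes to merge */
};

static int raid0_merge(int sectors) { return sectors * 512; }

struct mddev {
    _Atomic(struct personality *) pers;   /* kernel protects this with RCU */
    atomic_bool suspended;
};

/* Central dispatch point: if the array is suspended (a personality change
 * may be in flight), reject the merge outright; otherwise it is safe to
 * call through the current personality's hook. */
int mddev_mergeable_bvec(struct mddev *m, int sectors)
{
    if (atomic_load(&m->suspended))
        return 0;                          /* safest answer: merge nothing */
    struct personality *p = atomic_load(&m->pers);
    if (!p || !p->mergeable_bvec)
        return sectors * 512;              /* no hook: no restriction */
    return p->mergeable_bvec(sectors);
}
```

The key property is that the suspended check happens before the indirect call, so a personality swap done while suspended can never race with a merge decision.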
    • md: make ->congested robust against personality changes. · 5c675f83
      Committed by NeilBrown
      There is currently no locking around calls to the 'congested'
      bdi function.  If called at an awkward time while an array is
      being converted from one level (or personality) to another, there
      is a tiny chance of running code in an unreferenced module etc.
      
      So add a 'congested' function to the md_personality operations
      structure, and call it with appropriate locking from a central
      'mddev_congested'.
      
      When the array personality is changing the array will be 'suspended'
      so no IO is processed.
      If mddev_congested detects this, it simply reports that the
      array is congested, which is a safe guess.
      As mddev_suspend calls synchronize_rcu(), mddev_congested can
      avoid races by including the whole call inside an rcu_read_lock()
      region.
      This requires that the congested functions for all subordinate devices
      can be run under rcu_read_lock().  Fortunately this is the case.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: rename mddev->write_lock to mddev->lock · 85572d7c
      Committed by NeilBrown
      This lock is used for (slightly) more than helping with writing
      superblocks, and it will soon be extended further.  So the
      name is inappropriate.
      
      Also, the _irq variant hasn't been needed since 2.6.37 as it is
      never taken from interrupt or bh context.
      
      So:
        -rename write_lock to lock
        -document what it protects
        -remove _irq ... except in md_flush_request() as there
           is no wait_event_lock() (with no _irq).  This can be
           cleaned up after appropriate changes to wait.h.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: need_this_block: tidy/fix last condition. · ea664c82
      Committed by NeilBrown
      That last condition is unclear and over cautious.
      
      There are two related issues here.
      
      If a partial write is destined for a missing device, then
      either RMW or RCW can work.  We must read all the available
      blocks.  Only then can the missing blocks be calculated, and
      then the parity update performed.
      
      If RMW is not an option, then there is a complication even
      without partial writes.  If we would need to read a missing
      device to perform the reconstruction, then we must first read every
      block so the missing device data can be computed.
      This is the case for RAID6 (which currently does not support
      RMW) and for times when we don't trust the parity (after a crash)
      and so are in the process of resyncing it.
      
      So make these two cases more clear and separate, and perform
      the relevant tests more thoroughly.
      Signed-off-by: NeilBrown <neilb@suse.de>
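A hypothetical, heavily simplified version of the two tests described above. The real need_this_block() works on struct stripe_head state; these flags are invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>

/* Invented, flattened view of the stripe state relevant to the decision. */
struct stripe_state {
    bool block_uptodate;      /* this block is already in the stripe cache */
    bool dev_failed;          /* the device holding this block has failed */
    bool writing_partial;     /* a partial (sub-stripe) write is pending */
    bool rmw_possible;        /* read-modify-write is an option (parity trusted) */
    bool need_read_of_failed; /* reconstruction would need a failed device's data */
};

static bool need_this_block(const struct stripe_state *s)
{
    if (s->block_uptodate || s->dev_failed)
        return false;             /* nothing to read, or nowhere to read from */
    /* Case 1: a partial write aimed at a missing device.  Either RMW or RCW
     * works, but all surviving blocks must be read before the missing data
     * can be computed and the parity updated. */
    if (s->writing_partial && s->need_read_of_failed)
        return true;
    /* Case 2: RMW is not an option (RAID6, or parity untrusted after a
     * crash) and reconstruction needs a missing device, so read every
     * surviving block to compute the missing data. */
    if (!s->rmw_possible && s->need_read_of_failed)
        return true;
    return false;
}
```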
    • md/raid5: need_this_block: start simplifying the last two conditions. · a9d56950
      Committed by NeilBrown
      Both the last two cases are only relevant if something has failed and
      something needs to be written (but not over-written), and if it is OK
      to pre-read blocks at this point.  So factor out those tests and
      explain them.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: separate out the easy conditions in need_this_block. · a79cfe12
      Committed by NeilBrown
      Some of the conditions in need_this_block have very straight
      forward motivation.  Separate those out and document them.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: separate large if clause out of fetch_block(). · 2c58f06e
      Committed by NeilBrown
      fetch_block() has a very large and hard to read 'if' condition.
      
      Separate it into its own function so that it can be
      made more readable.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: do_release_stripe(): No need to call md_wakeup_thread() twice · ad3ab8b6
      Committed by Jes Sorensen
      Commit 67f45548 introduced a call to
      md_wakeup_thread() when adding to the delayed_list.  However, the md
      thread is woken up unconditionally just below.
      
      Remove the unnecessary wakeup call.
      Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  2. 02 Feb 2015, 2 commits
  3. 25 Jan 2015, 1 commit
    • dm: fix handling of multiple internal suspends · 96b26c8c
      Committed by Mikulas Patocka
      Commit ffcc3936 ("dm: enhance internal suspend and resume interface")
      attempted to handle multiple internal suspends on the same device, but
      it did that incorrectly.  When these functions are called in this order
      on the same device the device is no longer suspended, but it should be:
      	dm_internal_suspend_noflush
      	dm_internal_suspend_noflush
      	dm_internal_resume
      
      Fix this bug by maintaining an 'internal_suspend_count' and resuming
      the device when this count drops to zero.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
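The counting fix can be sketched as follows; the struct and function names are stand-ins, not the actual dm code:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device with a nested internal-suspend count. */
struct dm_dev {
    int internal_suspend_count;
    bool suspended;
};

static void internal_suspend_noflush(struct dm_dev *d)
{
    if (d->internal_suspend_count++ == 0)
        d->suspended = true;       /* only the first suspend actually suspends */
}

static void internal_resume(struct dm_dev *d)
{
    if (d->internal_suspend_count > 0 && --d->internal_suspend_count == 0)
        d->suspended = false;      /* only the last resume really resumes */
}
```

With a plain flag instead of a count, the suspend/suspend/resume sequence from the message above would incorrectly leave the device running.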
  4. 24 Jan 2015, 1 commit
    • dm cache: fix problematic dual use of a single migration count variable · a59db676
      Committed by Joe Thornber
      Introduce a new variable to count the number of allocated migration
      structures.  The existing variable cache->nr_migrations became
      overloaded.  It was used to:
      
       i) track the number of migrations in flight for the purposes of
          quiescing during suspend.
      
       ii) estimate the amount of background IO occurring.
      
      Recent discard changes meant that REQ_DISCARD bios are processed with
      a migration.  Discards are not background IO so nr_migrations was not
      incremented.  However this could cause quiescing to complete early.
      
      (i) is now handled with a new variable cache->nr_allocated_migrations.
      cache->nr_migrations has been renamed cache->nr_io_migrations.
      cleanup_migration() is now called free_io_migration(), since it
      decrements that variable.
      
      Also, remove the unused cache->next_migration variable that got replaced
      with prealloc_structs a while ago.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
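A minimal sketch of the split counters, with invented names (the real code uses atomic_t and waits on a waitqueue):

```c
#include <assert.h>
#include <stdbool.h>

struct cache_stats {
    int nr_allocated_migrations;  /* everything allocated: quiescing waits on this */
    int nr_io_migrations;         /* background IO estimate only */
};

static void migration_alloc(struct cache_stats *c, bool is_discard)
{
    c->nr_allocated_migrations++;
    if (!is_discard)
        c->nr_io_migrations++;    /* discards are not background IO */
}

static void migration_free(struct cache_stats *c, bool is_discard)
{
    c->nr_allocated_migrations--;
    if (!is_discard)
        c->nr_io_migrations--;
}

static bool quiesced(const struct cache_stats *c)
{
    return c->nr_allocated_migrations == 0;  /* suspend waits on the full count */
}
```

With a single overloaded counter, an in-flight discard migration would be invisible to quiescing, which is exactly the early-completion bug described above.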
  5. 23 Jan 2015, 1 commit
    • dm cache: share cache-metadata object across inactive and active DM tables · 9b1cc9f2
      Committed by Joe Thornber
      If a DM table is reloaded with an inactive table when the device is not
      suspended (normal procedure for LVM2), then there will be two dm-bufio
      objects that can diverge.  This can lead to a situation where the
      inactive table uses bufio to read metadata at the same time the active
      table writes metadata -- resulting in the inactive table having stale
      metadata buffers once it is promoted to the active table slot.
      
      Fix this by using reference counting and a global list of cache metadata
      objects to ensure there is only one metadata object per metadata device.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
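The reference-counted lookup can be sketched in plain C. The key for the global list and the locking are simplified away; the kernel code keys on the metadata block device and holds a mutex around the list:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical refcounted metadata object, one per metadata device. */
struct cache_metadata {
    int dev_id;                   /* stand-in for the bdev key */
    int ref_count;
    struct cache_metadata *next;
};

static struct cache_metadata *metadata_list;  /* kernel guards this with a mutex */

/* Open-or-share: a second table for the same device gets the same object,
 * so inactive and active tables can never hold divergent buffers. */
static struct cache_metadata *metadata_open(int dev_id)
{
    for (struct cache_metadata *m = metadata_list; m; m = m->next)
        if (m->dev_id == dev_id) {
            m->ref_count++;
            return m;
        }
    struct cache_metadata *m = calloc(1, sizeof(*m));
    m->dev_id = dev_id;
    m->ref_count = 1;
    m->next = metadata_list;
    metadata_list = m;
    return m;
}

static void metadata_close(struct cache_metadata *m)
{
    if (--m->ref_count)
        return;
    for (struct cache_metadata **p = &metadata_list; *p; p = &(*p)->next)
        if (*p == m) { *p = m->next; break; }  /* unlink on last reference */
    free(m);
}
```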
  6. 18 Dec 2014, 4 commits
    • dm: fix missed error code if .end_io isn't implemented by target_type · 5164bece
      Committed by zhendong chen
      In bio-based DM's clone_endio(), when target_type doesn't implement
      .end_io (e.g. linear), r will always be initialized to 0.  So if a
      WRITE SAME bio fails, WRITE SAME will not be disabled as intended.
      
      Fix this by initializing r to error, rather than 0, in clone_endio().
      Signed-off-by: Alex Chen <alex.chen@huawei.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Fixes: 7eee4ae2 ("dm: disable WRITE SAME if it fails")
      Cc: stable@vger.kernel.org
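The one-line nature of the fix can be illustrated with a toy version of clone_endio(); the signatures are invented, since real DM passes a struct bio and per-target hooks:

```c
#include <assert.h>
#include <stddef.h>

typedef int (*end_io_fn)(int error);

/* An illustrative target .end_io that consumes the error itself. */
static int swallow_error(int error) { (void)error; return 0; }

/* Default the result to the bio's error so a target without an .end_io
 * hook still propagates failure to the caller. */
static int clone_endio(int error, end_io_fn target_end_io)
{
    int r = error;               /* the fix: not "int r = 0;" */
    if (target_end_io)
        r = target_end_io(error);
    return r;                    /* callers disable WRITE SAME on error */
}
```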
    • dm thin: fix crash by initializing thin device's refcount and completion earlier · 2b94e896
      Committed by Marc Dionne
      Commit 80e96c54 ("dm thin: do not allow thin device activation
      while pool is suspended") delayed the initialization of a new thin
      device's refcount and completion until after this new thin was added
      to the pool's active_thins list and the pool lock is released.  This
      opens a race with a worker thread that walks the list and calls
      thin_get/put, noticing that the refcount goes to 0 and calling
      complete, freezing up the system and giving the oops below:
      
       kernel: BUG: unable to handle kernel NULL pointer dereference at           (null)
       kernel: IP: [<ffffffff810d360b>] __wake_up_common+0x2b/0x90
      
       kernel: Call Trace:
       kernel: [<ffffffff810d3683>] __wake_up_locked+0x13/0x20
       kernel: [<ffffffff810d3dc7>] complete+0x37/0x50
       kernel: [<ffffffffa0595c50>] thin_put+0x20/0x30 [dm_thin_pool]
       kernel: [<ffffffffa059aab7>] do_worker+0x667/0x870 [dm_thin_pool]
       kernel: [<ffffffff816a8a4c>] ? __schedule+0x3ac/0x9a0
       kernel: [<ffffffff810b1aef>] process_one_work+0x14f/0x400
       kernel: [<ffffffff810b206b>] worker_thread+0x6b/0x490
       kernel: [<ffffffff810b2000>] ? rescuer_thread+0x260/0x260
       kernel: [<ffffffff810b6a7b>] kthread+0xdb/0x100
       kernel: [<ffffffff810b69a0>] ? kthread_create_on_node+0x170/0x170
       kernel: [<ffffffff816ad7ec>] ret_from_fork+0x7c/0xb0
       kernel: [<ffffffff810b69a0>] ? kthread_create_on_node+0x170/0x170
      
      Set the thin device's initial refcount and initialize the completion
      before adding it to the pool's active_thins list in thin_ctr().
      Signed-off-by: Marc Dionne <marc.dionne@your-file-system.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
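The ordering rule, fully initialize before publishing to a shared list, can be sketched as below. Names are illustrative, and the pool lock that the kernel takes around the list update is elided:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct thin_c {
    atomic_int ref_count;
    int completion_ready;        /* stand-in for a struct completion */
    struct thin_c *next;
};

static struct thin_c *active_thins;  /* walked concurrently by a worker thread */

/* Correct constructor ordering: make the object fully live before any
 * other thread can find it on the shared list.  Publishing first, as the
 * buggy code did, lets the worker see ref_count == 0 and complete() an
 * uninitialized completion. */
static void thin_ctr(struct thin_c *tc)
{
    atomic_init(&tc->ref_count, 1);  /* initialize the refcount... */
    tc->completion_ready = 1;        /* ...and the completion... */
    tc->next = active_thins;         /* ...before linking into the list */
    active_thins = tc;               /* (done under the pool lock in the kernel) */
}
```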
    • dm thin: fix missing out-of-data-space to write mode transition if blocks are released · 2c43fd26
      Committed by Joe Thornber
      Discard bios and thin device deletion have the potential to release data
      blocks.  If the thin-pool is in out-of-data-space mode, and blocks were
      released, transition the thin-pool back to full write mode.
      
      The correct time to do this is just after the thin-pool metadata commit.
      It cannot be done before the commit because the space maps will not
      allow immediate reuse of the data blocks in case there's a rollback
      following power failure.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
    • dm thin: fix inability to discard blocks when in out-of-data-space mode · 45ec9bd0
      Committed by Joe Thornber
      When the pool was in PM_OUT_OF_SPACE mode its process_prepared_discard
      function pointer was incorrectly being set to
      process_prepared_discard_passdown rather than process_prepared_discard.
      
      This incorrect function pointer meant the discard was being passed down,
      but not affecting the mapping.  As such any discard that was issued, in
      an attempt to reclaim blocks, would not successfully free data space.
      Reported-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
  7. 11 Dec 2014, 1 commit
    • md: Check MD_RECOVERY_RUNNING as well as ->sync_thread. · f851b60d
      Committed by NeilBrown
      A recent change to md started the ->sync_thread asynchronously
      from a work_queue rather than synchronously.  This means that there
      can be a small window between the time when MD_RECOVERY_RUNNING is set
      and when ->sync_thread is set.
      
      So code that checks ->sync_thread might now conclude that the thread
      has not been started and (because a lock is held) will not be started.
      That is no longer the case.
      
      Most of those places are best fixed by testing MD_RECOVERY_RUNNING
      as well.  To make this completely reliable, we wake_up(&resync_wait)
      after clearing that flag as well as after clearing ->sync_thread.
      
      Other places are better served by flushing the relevant workqueue
      to ensure that if the sync thread was starting, it has now
      started.  This is particularly important if we are about to stop the
      sync thread.
      
      Fixes: ac05f256
      Signed-off-by: NeilBrown <neilb@suse.de>
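The "check both" rule can be sketched as follows; the flag value and struct are invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MD_RECOVERY_RUNNING (1ul << 0)

struct mddev_s {
    unsigned long flags;
    void *sync_thread;   /* set a little after the flag, by a workqueue */
};

/* Checking only ->sync_thread misses the window in which the flag is
 * already set but the thread pointer is not yet; test both. */
static bool resync_active(const struct mddev_s *m)
{
    return (m->flags & MD_RECOVERY_RUNNING) || m->sync_thread != NULL;
}
```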
  8. 03 Dec 2014, 2 commits
    • md: fix semicolon.cocci warnings · 7d7e64f2
      Committed by kbuild test robot
      drivers/md/md.c:7175:43-44: Unneeded semicolon
      
       Removes unneeded semicolon.
      
      Generated by: scripts/coccinelle/misc/semicolon.cocci
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: fetch_block must fetch all the blocks handle_stripe_dirtying wants. · 108cef3a
      Committed by NeilBrown
      It is critical that fetch_block() and handle_stripe_dirtying()
      are consistent in their analysis of what needs to be loaded.
      Otherwise raid5 can wait forever for a block that won't be loaded.
      
      Currently when writing to a RAID5 that is resyncing, to a location
      beyond the resync offset, handle_stripe_dirtying chooses a
      reconstruct-write cycle, but fetch_block() assumes a
      read-modify-write, and a lockup can happen.
      
      So treat that case just like RAID6, just as we do in
      handle_stripe_dirtying.  RAID6 always does reconstruct-write.
      
      This bug was introduced when the behaviour of handle_stripe_dirtying
      was changed in 3.7, so the patch is suitable for any kernel since,
      though it will need careful merging for some versions.
      
      Cc: stable@vger.kernel.org (v3.7+)
      Fixes: a7854487
      Reported-by: Henry Cai <henryplusplus@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  9. 02 Dec 2014, 12 commits
  10. 24 Nov 2014, 3 commits
  11. 22 Nov 2014, 1 commit
    • dm thin: fix pool_io_hints to avoid looking at max_hw_sectors · d200c30e
      Committed by Mike Snitzer
      Simplify the pool_io_hints code that works to establish a max_sectors
      value that is a power-of-2 factor of the thin-pool's blocksize.  The
      biggest associated improvement is that the DM thin-pool is no longer
      concerning itself with the data device's max_hw_sectors when adjusting
      max_sectors.
      
      This fixes the relative fragility of the original "dm thin: adjust
      max_sectors_kb based on thinp blocksize" commit that only became
      apparent when testing was performed using a DM thin-pool on top of a
      virtio_blk device.  One proposed upstream patch detailed the problems
      inherent in virtio_blk: https://lkml.org/lkml/2014/11/20/611
      
      So even though virtio_blk incorrectly set its max_hw_sectors it actually
      helped make it clear that we need DM thinp to be tolerant of any future
      Linux driver that incorrectly sets max_hw_sectors.
      
      We only need to be concerned with modifying the thin-pool device's
      max_sectors limit if it is smaller than the thin-pool's blocksize.  In
      this case the value of max_sectors does become a limiting factor when
      upper layers (e.g. filesystems) construct their bios.  But if the
      hardware can support IOs larger than the thin-pool's blocksize the user
      is encouraged to adjust the thin-pool's data device's max_sectors
      accordingly -- doing so will enable the thin-pool to inherit the
      established user-defined max_sectors.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
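A hypothetical helper showing the kind of adjustment being described: clamp max_sectors to the largest power of two that divides the pool blocksize, shrinking but never growing the limit (the real pool_io_hints() logic differs in detail):

```c
#include <assert.h>
#include <stdint.h>

/* Largest power of two dividing n: isolate the lowest set bit. */
static uint32_t largest_pow2_factor(uint32_t n)
{
    return n & -n;
}

/* If max_sectors already covers a whole pool block, the blocksize itself
 * is the ideal IO size.  Otherwise pick the biggest power-of-2 factor of
 * the blocksize that still fits under max_sectors, so bios built by upper
 * layers always tile pool blocks evenly. */
static uint32_t adjust_max_sectors(uint32_t max_sectors, uint32_t pool_blocksize)
{
    if (max_sectors >= pool_blocksize)
        return pool_blocksize;
    uint32_t p = largest_pow2_factor(pool_blocksize);
    while (p > max_sectors)
        p >>= 1;          /* every smaller power of two also divides blocksize */
    return p;
}
```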
  12. 20 Nov 2014, 4 commits
    • dm thin: suspend/resume active thin devices when reloading thin-pool · 583024d2
      Committed by Mike Snitzer
      Before this change it was expected that userspace would first suspend
      all active thin devices, reload/resize the thin-pool target, then resume
      all active thin devices.  Now the thin-pool suspend/resume will trigger
      the suspend/resume of all active thins via appropriate calls to
      dm_internal_suspend and dm_internal_resume.
      
      Store the mapped_device for each thin device in struct thin_c to make
      these calls possible.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
    • dm: enhance internal suspend and resume interface · ffcc3936
      Committed by Mike Snitzer
      Rename dm_internal_{suspend,resume} to dm_internal_{suspend,resume}_fast
      -- dm-stats will continue using these methods to avoid all the extra
      suspend/resume logic that is not needed in order to quickly flush IO.
      
      Introduce dm_internal_suspend_noflush() variant that actually calls the
      mapped_device's target callbacks -- otherwise target-specific hooks are
      avoided (e.g. dm-thin's thin_presuspend and thin_postsuspend).  Common
      code between dm_internal_{suspend_noflush,resume} and
      dm_{suspend,resume} was factored out as __dm_{suspend,resume}.
      
      Update dm_internal_{suspend_noflush,resume} to always take and release
      the mapped_device's suspend_lock.  Also update dm_{suspend,resume} to be
      aware of potential for DM_INTERNAL_SUSPEND_FLAG to be set and respond
      accordingly by interruptibly waiting for the DM_INTERNAL_SUSPEND_FLAG to
      be cleared.  Add lockdep annotation to dm_suspend() and dm_resume().
      
      The existing DM_SUSPEND_FLAG remains unchanged.
      DM_INTERNAL_SUSPEND_FLAG is set by dm_internal_suspend_noflush() and
      cleared by dm_internal_resume().
      
      Both DM_SUSPEND_FLAG and DM_INTERNAL_SUSPEND_FLAG may be set if a device
      was already suspended when dm_internal_suspend_noflush() was called --
      this can be thought of as a "nested suspend".  A "nested suspend" can
      occur with legacy userspace dm-thin code that might suspend all active
      thin volumes before suspending the pool for resize.
      
      But otherwise, in the normal dm-thin-pool suspend case moving forward:
      the thin-pool will have DM_SUSPEND_FLAG set and all active thins from
      that thin-pool will have DM_INTERNAL_SUSPEND_FLAG set.
      
      Also add DM_INTERNAL_SUSPEND_FLAG to status report.  This new
      DM_INTERNAL_SUSPEND_FLAG state is being reported to assist with
      debugging (e.g. 'dmsetup info' will report an internally suspended
      device accordingly).
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
    • dm thin: do not allow thin device activation while pool is suspended · 80e96c54
      Committed by Mike Snitzer
      Otherwise IO could be issued to the pool while it is suspended.
      
      Care was taken to properly interlock between the thin and thin-pool
      targets when accessing the pool's 'suspended' flag.  The thin_ctr will
      not add a new thin device to the pool's active_thins list if the pool is
      suspended.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
    • dm: add presuspend_undo hook to target_type · d67ee213
      Committed by Mike Snitzer
      The DM thin-pool target now must undo the changes performed during
      pool_presuspend() so introduce presuspend_undo hook in target_type.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>