1. 11 May 2011 (16 commits)
    • md: allow resync_start to be set while an array is active. · b098636c
      Committed by NeilBrown
      The sysfs attribute 'resync_start' (known internally as recovery_cp),
      records where a resync is up to.  A value of 0 means the array is
      not known to be in-sync at all.  A value of MaxSector means the array
      is believed to be fully in-sync.
      
      When the size of the member devices of an array (RAID1, RAID4/5/6) is
      increased, the array can be grown to match.  This process sets
      resync_start to the old end-of-device offset so that the new part of
      the array gets resynced.
      
      However with RAID1 (and RAID6) a resync is not technically necessary
      and may be undesirable.  So it would be good if the implied resync
      after the array is resized could be avoided.
      
      So: change 'resync_start' so the value can be changed while the array
      is active, and as a precaution only allow it to be changed while
      resync/recovery is 'frozen'.  Changing it once resync has started is
      not going to be useful anyway.
      
      This allows the array to be resized without a resync by the following
      sequence of sysfs writes (a C sketch follows the list):
        write 'frozen' to 'sync_action'
        write new size to 'component_size' (this will set resync_start)
        write 'none' to 'resync_start'
        write 'idle' to 'sync_action'.
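      
      A minimal C sketch of that sequence (the md0 device name and the
      size value are illustrative assumptions, not part of the patch):
      
        #include <stdio.h>
        #include <stdlib.h>
        
        /* Write one value to an md sysfs attribute of /dev/md0. */
        static void md_attr_write(const char *attr, const char *val)
        {
            char path[128];
            FILE *f;
        
            snprintf(path, sizeof(path), "/sys/block/md0/md/%s", attr);
            f = fopen(path, "w");
            if (!f || fprintf(f, "%s\n", val) < 0) {
                perror(path);
                exit(EXIT_FAILURE);
            }
            fclose(f);
        }
        
        int main(void)
        {
            md_attr_write("sync_action", "frozen");     /* freeze resync/recovery */
            md_attr_write("component_size", "1048576"); /* illustrative new size; sets resync_start */
            md_attr_write("resync_start", "none");      /* declare the array fully in-sync */
            md_attr_write("sync_action", "idle");       /* unfreeze */
            return 0;
        }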
      
      Also slightly improve some tests on recovery_cp when resizing
      raid1/raid5.  Now that an arbitrary value can be set, we should be
      more careful in our tests.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid10: reformat some loops with less indenting. · ab9d47e9
      Committed by NeilBrown
      When a loop ends with an 'if' with a large body, it is neater
      to make the if 'continue' on the inverse condition, and then
      the body is indented less.
      
      Apply this pattern 3 times, and wrap some other long lines.
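      
      Schematically (illustrative code, not the raid10 source; needs_work
      is a stand-in predicate):
      
        #include <stdio.h>
        
        static int needs_work(int i) { return i % 2; }  /* stand-in predicate */
        
        int main(void)
        {
            int i;
        
            /* Before: the long body sits one level inside the 'if'. */
            for (i = 0; i < 8; i++) {
                if (needs_work(i)) {
                    printf("work on %d\n", i);  /* ...imagine many more lines... */
                }
            }
        
            /* After: 'continue' on the inverse condition, so the same
             * body loses one level of indentation. */
            for (i = 0; i < 8; i++) {
                if (!needs_work(i))
                    continue;
                printf("work on %d\n", i);      /* ...the same lines, less indented... */
            }
            return 0;
        }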
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid10: remove unused variable. · f17ed07c
      Committed by NeilBrown
      This variable 'disk' is never used - how odd.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid10: make more use of 'slot' in raid10d. · a8830bca
      Committed by NeilBrown
      Now that we have a 'slot' variable, make better use of it to simplify
      some code a little.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid10: some tidying up in fix_read_error · 7c4e06ff
      Committed by NeilBrown
      Currently the rdev on which a read error happened could be removed
      before we perform the fix_read_error handling.  This requires extra
      tests for NULL.
      
      So delay the rdev_dec_pending call until after the call to
      fix_read_error so that we can be sure that the rdev still exists.
      
      This allows an 'if' clause to be removed so the body gets re-indented
      back one level.
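      
      The shape of that ordering change, with stand-in types (a sketch
      only, not the raid10 code):
      
        #include <stdio.h>
        
        struct rdev { int nr_pending; };   /* stand-in for the md rdev */
        
        static void fix_read_error(struct rdev *rdev)   /* stand-in fixup */
        {
            printf("rewriting bad sectors via rdev\n");
        }
        
        static void rdev_dec_pending(struct rdev *rdev)
        {
            rdev->nr_pending--;   /* dropping the last reference may allow removal */
        }
        
        static void handle_read_error(struct rdev *rdev)
        {
            /* Old order: the reference was dropped first, so the fixup had
             * to keep re-checking whether the rdev still existed.
             * New order: fix while the reference is held, drop it last. */
            fix_read_error(rdev);
            rdev_dec_pending(rdev);
        }
        
        int main(void)
        {
            struct rdev r = { .nr_pending = 1 };
        
            handle_read_error(&r);
            return 0;
        }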
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid1: improve handling of pages allocated for write-behind. · af6d7b76
      Committed by NeilBrown
      The current handling and freeing of these pages is a bit fragile.
      We only keep the list of allocated pages in each bio, so we still
      need a valid bio when freeing the pages, which is clumsy.
      
      So simply store the allocated page list in the r1_bio so it can easily
      be found and freed when we are finished with the r1_bio.
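      
      In outline, with stand-in types (a sketch of the ownership change,
      not the raid1 code):
      
        #include <stdlib.h>
        
        struct page { char data[4096]; };   /* stand-in */
        
        /* An r1_bio-like container that owns its page list, so freeing
         * the pages no longer requires a still-valid bio. */
        struct r1bio_sketch {
            int npages;
            struct page **behind_pages;
        };
        
        static int alloc_behind_pages(struct r1bio_sketch *r1, int npages)
        {
            int i;
        
            r1->behind_pages = calloc(npages, sizeof(*r1->behind_pages));
            if (!r1->behind_pages)
                return -1;
            r1->npages = npages;            /* record the count up front */
            for (i = 0; i < npages; i++)
                if (!(r1->behind_pages[i] = malloc(sizeof(struct page))))
                    return -1;              /* a partial list is still freeable */
            return 0;
        }
        
        static void free_behind_pages(struct r1bio_sketch *r1)
        {
            int i;
        
            for (i = 0; i < r1->npages; i++)
                free(r1->behind_pages[i]);  /* free(NULL) is harmless */
            free(r1->behind_pages);
            r1->behind_pages = NULL;
            r1->npages = 0;
        }
        
        int main(void)
        {
            struct r1bio_sketch r1 = { 0, NULL };
        
            if (alloc_behind_pages(&r1, 16) == 0) {
                /* ... submit write-behind IO ... */
            }
            free_behind_pages(&r1);         /* no bio needed at free time */
            return 0;
        }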
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid1: try fix_sync_read_error before process_checks. · 7ca78d57
      Committed by NeilBrown
      If we get a read error during resync/recovery we currently retry
      with single-page reads to find out just where the error is, and
      possibly read each page from a different device.
      
      With check/repair we don't currently do that; we just fail.
      However it is possible that while all devices fail on the large 64K
      read, we might be able to satisfy each 4K from one device or another.
      
      So call fix_sync_read_error before process_checks to maximise the
      chance of finding good data and writing it out to the devices with
      read errors.
      
      For this to work, we need to set the 'uptodate' flags properly after
      fix_sync_read_error has succeeded.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid1: tidy up new functions: process_checks and fix_sync_read_error. · 78d7f5f7
      Committed by NeilBrown
      These changes are mostly cosmetic:
      
      1/ change mddev->raid_disks to conf->raid_disks because the latter
         is technically safer, though in current practice it doesn't matter
         in this particular context.
      2/ Rearrange two for / if loops to have an early 'continue' so the
         body of the 'if' doesn't need to be indented so much.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid1: split out two sub-functions from sync_request_write · a68e5870
      Committed by NeilBrown
      sync_request_write is too big and too deep.
      So split out two self-contained bits of functionality into separate
      functions.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: make error_handler functions more uniform and correct. · 6f8d0c77
      Committed by NeilBrown
      - there is no need to test_bit Faulty, as that was already done in
        md_error which is the only caller of these functions.
      - MD_CHANGE_DEVS should be set *after* faulty is set to ensure
        metadata is updated correctly.
      - spinlock should be held while updating ->degraded.
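      
      A schematic of the corrected ordering, with stand-in types and a
      pthread mutex in place of the spinlock (not the md source):
      
        #include <pthread.h>
        #include <stdio.h>
        
        enum { FAULTY = 1, MD_CHANGE_DEVS = 2 };   /* stand-in flag bits */
        
        struct conf_sketch {
            pthread_mutex_t lock;   /* stands in for the device spinlock */
            int degraded;
            int dev_flags;
            int mddev_flags;
        };
        
        static void error_handler(struct conf_sketch *conf)
        {
            /* No Faulty re-test here: the md_error()-style caller
             * already performed it. */
        
            pthread_mutex_lock(&conf->lock);
            conf->degraded++;                      /* ->degraded under the lock */
            pthread_mutex_unlock(&conf->lock);
        
            conf->dev_flags |= FAULTY;             /* set Faulty first... */
            conf->mddev_flags |= MD_CHANGE_DEVS;   /* ...then request a metadata
                                                    * update, so it records the
                                                    * faulty state */
        }
        
        int main(void)
        {
            struct conf_sketch conf = { PTHREAD_MUTEX_INITIALIZER, 0, 0, 0 };
        
            error_handler(&conf);
            printf("degraded=%d\n", conf.degraded);
            return 0;
        }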
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/multipath: discard ->working_disks in favour of ->degraded · 92f861a7
      Committed by NeilBrown
      conf->working_disks duplicates information already available
      in mddev->degraded.
      So remove working_disks.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid1: clean up read_balance. · 76073054
      Committed by NeilBrown
      read_balance has two loops which both look for a 'best' device
      based on slightly different criteria.  This is clumsy and makes it
      hard to add extra criteria.
      
      So replace it all with a single loop that combines everything.
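      
      The structural idea on stand-in data (not the raid1 code): one pass
      that tracks the best candidate under the combined criteria, instead
      of two passes with slightly different ones.
      
        #include <stdio.h>
        
        struct mirror { int in_sync; long dist; };   /* stand-in per-device state */
        
        /* One combined loop: prefer in-sync devices, then the smallest
         * seek distance, rather than two separate 'best' searches. */
        static int read_balance(const struct mirror *m, int n)
        {
            int best = -1;
            int i;
        
            for (i = 0; i < n; i++) {
                if (!m[i].in_sync)
                    continue;                        /* criterion 1 */
                if (best < 0 || m[i].dist < m[best].dist)
                    best = i;                        /* criterion 2 */
            }
            return best;
        }
        
        int main(void)
        {
            struct mirror m[] = { { 1, 40 }, { 0, 1 }, { 1, 7 } };
        
            printf("chose mirror %d\n", read_balance(m, 3));   /* prints 2 */
            return 0;
        }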
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: simplify raid10 read_balance · 56d99121
      Committed by NeilBrown
      raid10 read_balance has two different loops for looking through
      possible devices to choose the best.
      Collapse those into one loop and generally make the code more
      readable.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/bitmap: fix saving of events_cleared and other state. · 8258c532
      Committed by NeilBrown
      If a bitmap is found to be 'stale' the events_cleared value
      is set to match 'events'.
      However if the array is degraded this does not get stored on disk.
      This can subsequently lead to incorrect behaviour.
      
      So change bitmap_update_sb to always update events_cleared in the
      superblock from the known events_cleared.
      For neatness also set ->state from ->flags.
      This requires updating ->state whenever we update ->flags, which makes
      sense anyway.
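      
      In outline, with stand-in types (the field names follow the
      description above, not the exact bitmap code):
      
        #include <stdint.h>
        #include <stdio.h>
        
        struct bitmap_sketch {
            uint64_t events_cleared;   /* known in-memory value */
            int flags;                 /* in-memory state bits */
        };
        
        struct bitmap_sb_sketch {
            uint64_t events_cleared;   /* on-disk copy */
            int state;                 /* on-disk copy */
        };
        
        /* Always propagate the known in-memory values into the
         * superblock, so a degraded array cannot leave a stale
         * events_cleared on disk. */
        static void bitmap_update_sb(const struct bitmap_sketch *b,
                                     struct bitmap_sb_sketch *sb)
        {
            sb->events_cleared = b->events_cleared;
            sb->state = b->flags;      /* keep ->state in step with ->flags */
            /* ...then write the superblock out... */
        }
        
        int main(void)
        {
            struct bitmap_sketch b = { 42, 1 };
            struct bitmap_sb_sketch sb = { 0, 0 };
        
            bitmap_update_sb(&b, &sb);
            printf("on-disk events_cleared=%llu\n",
                   (unsigned long long)sb.events_cleared);
            return 0;
        }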
      
      This is suitable for any active -stable release.
      
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: reject a re-add request that cannot be honoured. · bedd86b7
      Committed by NeilBrown
      The 'add_new_disk' ioctl can be used to add a device either as a
      spare, or as an active disk that just needs to be resynced based on
      write-intent-bitmap information (re-add).
      
      Currently, if a re-add is requested but fails, we add the device as
      a spare instead.  This makes it impossible for user-space to check
      for failure.
      
      So change to require that a re-add attempt will either succeed or
      completely fail.  User-space can then decide what to do next.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Fix race when creating a new md device. · b0140891
      Committed by NeilBrown
      There is a race when creating an md device by opening /dev/mdXX.
      
      If two processes do this at much the same time they will follow the
      call path
        __blkdev_get -> get_gendisk -> kobj_lookup
      
      The first will call
        -> md_probe -> md_alloc -> add_disk -> blk_register_region
      
      and the race happens when the second gets to kobj_lookup after
      add_disk has called blk_register_region but before it returns to
      md_alloc.
      
      In that case the second will not call md_probe (as the probe is
      already done) but will get a handle on the gendisk and return to
      __blkdev_get, which will then call md_open (via the ->open pointer).
      
      As mddev->gendisk hasn't been set yet, md_open will think something
      is wrong and return with -ERESTARTSYS.
      
      This can loop endlessly while the first thread makes no progress
      through add_disk.  Nothing is blocking it, but due to scheduler
      behaviour it doesn't get a turn.
      So this is essentially a live-lock.
      
      We fix this by simply moving the assignment to mddev->gendisk before
      the call to add_disk() so md_open doesn't get confused.
      Also move blk_queue_flush earlier because add_disk should be as late
      as possible.
      
      To make sure that md_open doesn't complete until md_alloc has done all
      that is needed, we take mddev->open_mutex during the last part of
      md_alloc.  md_open will wait for this.
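      
      The ordering, sketched with a pthread mutex standing in for
      open_mutex (stand-in names throughout, not the md source):
      
        #include <pthread.h>
        #include <stdio.h>
        
        struct mddev_sketch {
            pthread_mutex_t open_mutex;
            void *gendisk;
        };
        
        static struct mddev_sketch the_mddev = { PTHREAD_MUTEX_INITIALIZER, NULL };
        
        static void md_alloc_sketch(void *disk)
        {
            pthread_mutex_lock(&the_mddev.open_mutex);    /* openers must wait */
            the_mddev.gendisk = disk;       /* set BEFORE the disk is visible */
            /* add_disk() would go here, as late as possible */
            pthread_mutex_unlock(&the_mddev.open_mutex);  /* setup complete */
        }
        
        static int md_open_sketch(void)
        {
            int ok;
        
            pthread_mutex_lock(&the_mddev.open_mutex);    /* waits for md_alloc */
            ok = (the_mddev.gendisk != NULL);   /* no spurious ERESTARTSYS loop */
            pthread_mutex_unlock(&the_mddev.open_mutex);
            return ok ? 0 : -1;
        }
        
        int main(void)
        {
            int disk = 1;
        
            md_alloc_sketch(&disk);
            printf("open -> %d\n", md_open_sketch());
            return 0;
        }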
      
      This can cause a lock-up on boot so Cc:ing for stable.
      For 2.6.36 and earlier a different patch will be needed as the
      'blk_queue_flush' call isn't there.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reported-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
      Tested-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
      Cc: stable@kernel.org
  2. 22 April 2011 (1 commit)
  3. 20 April 2011 (3 commits)
  4. 18 April 2011 (6 commits)
    • md: fix up raid1/raid10 unplugging. · c3b328ac
      Committed by NeilBrown
      We just need to make sure that an unplug event wakes up the md
      thread, which is exactly what mddev_check_plugged does.
      
      Also remove some plug-related code that is no longer needed.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: incorporate new plugging into raid5. · 7c13edc8
      Committed by NeilBrown
      In raid5, plugging is used for two things:
       1/ collecting writes that require a bitmap update
       2/ collecting writes in the hope that we can create full
          stripes - or at least more nearly full ones.
      
      We now release these different sets of stripes when plug_cnt
      is zero.
      
      Also, in make_request, we call mddev_check_plugged to hopefully
      increase plug_cnt, and wake up the thread at the end if plugging
      wasn't achieved for some reason.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: provide generic support for handling unplug callbacks. · 97658cdd
      Committed by NeilBrown
      When an md device adds a request to a queue, it can call
      mddev_check_plugged.
      If this succeeds then we know that the md thread will be woken up
      shortly, and ->plug_cnt will be non-zero until then, so some
      processing can be delayed.
      
      If it fails, then no unplug callback is expected and the make_request
      function needs to do whatever is required to make the request happen.
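      
      The resulting make_request pattern, schematically (stand-in
      function bodies around the call described above):
      
        #include <stdio.h>
        
        static int plug_cnt;   /* stands in for mddev->plug_cnt */
        
        /* Stand-in: in md this succeeds only when an unplug callback is
         * armed, guaranteeing a later wakeup of the md thread. */
        static int mddev_check_plugged_sketch(void)
        {
            plug_cnt++;        /* pretend a blk plug is active on this task */
            return 1;
        }
        
        static void make_request_sketch(void)
        {
            if (mddev_check_plugged_sketch()) {
                /* An unplug callback is pending: the md thread will be
                 * woken shortly, so this work can simply be queued. */
                printf("queued; plug_cnt=%d until the unplug fires\n", plug_cnt);
            } else {
                /* No unplug is coming: do whatever is required to make
                 * the request happen now. */
                printf("no plug; handling the request immediately\n");
            }
        }
        
        int main(void)
        {
            make_request_sketch();
            return 0;
        }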
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md - remove old plugging code. · 482c0834
      Committed by NeilBrown
      md has some plugging infrastructure for RAID5 to use because the
      normal plugging infrastructure required a 'request_queue', and when
      called from dm, RAID5 doesn't have one of those available.
      
      This relied on the ->unplug_fn callback which doesn't exist any more.
      
      So remove all of that code, both in md and raid5.  Subsequent
      patches will restore the plugging functionality.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/dm - remove remains of plug_fn callback. · af1db72d
      Committed by NeilBrown
      Now that unplugging is done differently, the unplug_fn callback is
      never called, so it can be completely discarded.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: use new plugging interface for RAID IO. · e1dfa0a2
      Committed by NeilBrown
      md/raid submits a lot of IO from the various raid threads.
      So add start/finish plug calls to those threads so that some
      plugging happens.
      Signed-off-by: NeilBrown <neilb@suse.de>
  5. 06 April 2011 (1 commit)
    • dm: improve block integrity support · a63a5cf8
      Committed by Mike Snitzer
      The current block integrity (DIF/DIX) support in DM verifies that
      all devices' integrity profiles match during DM device resume (which
      is past the point of no return).  To some degree that is unavoidable
      (stacked DM devices force this late checking).  But for most DM
      devices (which aren't stacking on other DM devices) the ideal time to
      verify all integrity profiles match is during table load.
      
      Introduce the notion of an "initialized" integrity profile: a profile
      that was blk_integrity_register()'d with a non-NULL 'blk_integrity'
      template.  Add blk_integrity_is_initialized() to allow checking if a
      profile was initialized.
      
      Update DM integrity support to (a sketch follows the list):
      - check all devices with _initialized_ integrity profiles match
        during table load; uninitialized profiles (e.g. for underlying DM
        device(s) of a stacked DM device) are ignored.
      - disallow a table load that would result in an integrity profile that
        conflicts with a DM device's existing (in-use) integrity profile
      - avoid clearing an existing integrity profile
      - validate all integrity profiles match during resume; but if they
        don't all we can do is report the mismatch (during resume we're past
        the point of no return)
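      
      The table-load check in outline (stand-in types; only the skipping
      of uninitialized profiles mirrors the patch directly):
      
        #include <stdio.h>
        #include <string.h>
        
        struct integrity_profile {
            int initialized;   /* registered with a non-NULL template */
            char name[24];     /* e.g. "T10-DIF-TYPE1-CRC" */
        };
        
        /* All devices with an initialized profile must agree; devices
         * with uninitialized profiles (e.g. underlying DM devices of a
         * stack) are skipped rather than failing the load. */
        static int table_integrity_ok(const struct integrity_profile *devs, int n)
        {
            const char *ref = NULL;
            int i;
        
            for (i = 0; i < n; i++) {
                if (!devs[i].initialized)
                    continue;                    /* ignore, per the patch */
                if (!ref)
                    ref = devs[i].name;
                else if (strcmp(ref, devs[i].name) != 0)
                    return 0;                    /* conflicting profiles */
            }
            return 1;
        }
        
        int main(void)
        {
            struct integrity_profile devs[] = {
                { 1, "T10-DIF-TYPE1-CRC" },
                { 0, "" },                       /* uninitialized: skipped */
                { 1, "T10-DIF-TYPE1-CRC" },
            };
        
            printf("table load %s\n", table_integrity_ok(devs, 3) ? "ok" : "rejected");
            return 0;
        }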
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  6. 31 March 2011 (1 commit)
  7. 29 March 2011 (1 commit)
  8. 24 March 2011 (10 commits)
  9. 22 March 2011 (1 commit)