Commits · b8321b68d1445f308324517e45fb0a5c2b48e271 · openanolis / cloud-kernel

23 Dec, 2011 · 10 commits

md: change hot_remove_disk to take an rdev rather than a number. · b8321b68

Committed by NeilBrown on Dec 23, 2011

Soon an array will be able to have multiple devices with the
same raid_disk number (an original and a replacement), so removing
a device based on the number won't work.  Pass the actual device
handle instead.
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
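A minimal sketch of why a slot number stops being a unique key once an original and a replacement can share it. The types and helpers (`struct fake_rdev`, `remove_by_number()`, `remove_by_rdev()`) are hypothetical simplifications, not md's real structures or the personality `hot_remove_disk` signature.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified device record (not md's real rdev). */
struct fake_rdev {
    int raid_disk;  /* slot number: may be shared by original + replacement */
    int attached;   /* 1 while the device belongs to the array */
};

struct fake_array {
    struct fake_rdev *devs[8];
    int ndevs;
};

/* Old style: look the device up by slot number. With duplicates this
 * always finds the first match, which may not be the intended device. */
static struct fake_rdev *remove_by_number(struct fake_array *a, int number)
{
    for (int i = 0; i < a->ndevs; i++) {
        struct fake_rdev *r = a->devs[i];
        if (r->attached && r->raid_disk == number) {
            r->attached = 0;
            return r;
        }
    }
    return NULL;
}

/* New style: the caller passes the exact device, so there is no ambiguity. */
static int remove_by_rdev(struct fake_array *a, struct fake_rdev *target)
{
    for (int i = 0; i < a->ndevs; i++) {
        if (a->devs[i] == target && target->attached) {
            target->attached = 0;
            return 0;
        }
    }
    return -1;  /* not in this array, or already removed */
}
```

With an original and a replacement both in slot 3, `remove_by_number(a, 3)` can only ever reach the original, while `remove_by_rdev()` can detach either one.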

md: remove test for duplicate device when setting slot number. · 476a7abb

Committed by NeilBrown on Dec 23, 2011

When setting the slot number on a device in an active array we
currently check that the number is not already in use.
We then call into the personality's hot_add_disk function
which performs the same test and returns the same error.

Thus the common test is not needed.

As we will shortly be changing some personalities to allow duplicates
in some cases (to support hot-replace), the common test will become
inconvenient.

So remove the common test.
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md/bitmap: be more consistent when setting new bits in memory bitmap. · 915c420d

Committed by NeilBrown on Dec 23, 2011

For each active region corresponding to a bit in the bitmap we have
a 14-bit counter (and some flags).
This counts
   number of active writes + bit in the on-disk bitmap + delay-needed.

The "delay-needed" is because we always want a delay before clearing a
bit.  So the number here is normally number of active writes plus 2.
If there have been no writes for a while, we drop to 1.
If still no writes we clear the bit and drop to 0.

So for consistency, when setting a bit from the on-disk bitmap or by
request from user-space it is best to set the counter to '2' to start
with.

In particular we might also set the NEEDED_MASK flag at this time, and
in all other cases NEEDED_MASK is only set when the counter is 2 or
more.
Signed-off-by: NeilBrown <neilb@suse.de>
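A minimal model of one region's counter, assuming exactly the semantics described above: active writes, plus one for the on-disk bit, plus one for the pending delay. The struct and function names are hypothetical; the NEEDED_MASK flag and the 14-bit limit are not modelled.

```c
#include <assert.h>

/* Hypothetical per-region state; the real counter packs flags alongside. */
struct region {
    unsigned counter;  /* active writes + on-disk bit + delay-needed */
    int ondisk_bit;    /* 1 while the bit is set in the on-disk bitmap */
};

/* A write begins: make sure the bit and delay are accounted, then count it. */
static void start_write(struct region *r)
{
    if (r->counter < 2) {
        r->ondisk_bit = 1;
        r->counter = 2;
    }
    r->counter++;
}

/* A write completes: drop only its own reference. */
static void end_write(struct region *r)
{
    assert(r->counter > 2);
    r->counter--;
}

/* Bit set from the on-disk bitmap or by user request: start at 2, the
 * same value an idle-but-dirty region holds after its writes finish. */
static void set_memory_bit(struct region *r)
{
    if (r->counter == 0) {
        r->ondisk_bit = 1;
        r->counter = 2;
    }
}

/* One idle daemon pass: 2 -> 1 (delay spent), then 1 -> 0 (clear bit). */
static void daemon_pass(struct region *r)
{
    if (r->counter == 2) {
        r->counter = 1;
    } else if (r->counter == 1) {
        r->counter = 0;
        r->ondisk_bit = 0;
    }
}
```

Starting at 2 means a bit loaded from disk takes the same two idle passes to clear as one left behind by a finished write.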

md/raid5: be more thorough in calculating 'degraded' value. · 908f4fbd

Committed by NeilBrown on Dec 23, 2011

When an array is being reshaped to change the number of devices,
the two halves can be differently degraded.  e.g. one could be
missing a device and the other not.

So we need to be more careful about calculating the 'degraded'
attribute.

Instead of just inc/dec at appropriate times, perform a full
re-calculation examining both possible cases.  This doesn't happen
often so it is not a big cost, and we already have most of the code to
do it.
Signed-off-by: NeilBrown <neilb@suse.de>
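A sketch of the "examine both cases" idea under a simplifying assumption: each slot is either usable and in sync or not, and the reshaped and not-yet-reshaped halves differ only in how many slots they span. `degraded_both_halves()` is a hypothetical helper, not the raid5 function.

```c
#include <assert.h>
#include <stdbool.h>

/* working[i] is true when slot i holds a usable, in-sync device
 * (hypothetical simplification of the per-slot rdev checks). */
static int count_missing(const bool *working, int ndisks)
{
    int missing = 0;
    for (int i = 0; i < ndisks; i++)
        if (!working[i])
            missing++;
    return missing;
}

/* While a reshape changes the device count, the part already reshaped
 * uses new_disks devices and the rest still uses old_disks. Recompute
 * both and report the worse of the two. */
static int degraded_both_halves(const bool *working, int old_disks, int new_disks)
{
    int d_old = count_missing(working, old_disks);
    int d_new = count_missing(working, new_disks);
    return d_old > d_new ? d_old : d_new;
}
```

Growing from 4 to 5 devices with slot 4 still empty leaves the old half intact but the new half one device short; the maximum captures that.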

md/bitmap: daemon_work cleanup. · 2e61ebbc

Committed by NeilBrown on Dec 23, 2011

We have a variable 'mddev' in this function, but repeatedly get the
same value by dereferencing bitmap->mddev.
There is room for simplification here...
Signed-off-by: NeilBrown <neilb@suse.de>

md: allow non-privileged users to GET_*_INFO about raid arrays. · 506c9e44

Committed by NeilBrown on Dec 23, 2011

The info is already available in /proc/mdstat and /sys/block in
an accessible form so there is no point in putting a road-block in
the ioctl for information gathering.
Signed-off-by: NeilBrown <neilb@suse.de>

md/bitmap: It is OK to clear bits during recovery. · 961902c0

Committed by NeilBrown on Dec 23, 2011

commit d0a4bb49 introduced a
regression which is annoying but fairly harmless.

When writing to an array that is undergoing recovery (a spare
is being integrated into the array), writes to the array will
set bits in the bitmap, but they will not be cleared when the
write completes.

For bits covering areas that have not been recovered yet this is not a
problem as the recovery will clear the bits.  However bits set in an
already-recovered region will stay set and never be cleared.
This doesn't risk data integrity.  The only negatives are:
 - next time there is a crash, more resyncing than necessary will
   be done.
 - the bitmap doesn't look clean, which is confusing.

While an array is recovering we don't want to update the
'events_cleared' setting in the bitmap but we do still want to clear
bits that have very recently been set - providing they were written to
the recovering device.

So split those two needs - which previously both depended on 'success' -
and always clear the bit if the write went to all devices.
Signed-off-by: NeilBrown <neilb@suse.de>

md: don't give up looking for spares on first failure-to-add · 60fc1370

Committed by NeilBrown on Dec 23, 2011

Before performing a recovery we try to remove any spares that
might not be working, then add any that might have become relevant.

Currently we abort on the first spare that cannot be added.
This is a false optimisation.
It is conceivable that - depending on rules in the personality - a
subsequent spare might be accepted.
Also the loop does other things like count the available spares and
reset the 'recovery_offset' value.

If we abort early these might not happen properly.

So remove the early abort.

In particular if you have an array that is undergoing recovery and
which has extra spares, then the recovery may not restart after a
reboot as the count of 'spares' might end up as zero.
Reported-by: Anssi Hannula <anssi.hannula@iki.fi>
Signed-off-by: NeilBrown <neilb@suse.de>
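A minimal sketch of the loop shape after the change, using a hypothetical `struct spare` in place of the real rdev checks and the personality's hot_add_disk call: a rejected spare is skipped instead of ending the loop, so later spares are still tried and the count is complete.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical candidate spare. */
struct spare {
    bool acceptable;       /* would the personality accept it? */
    bool added;
    long recovery_offset;
};

/* Try every candidate; never stop at the first rejection. */
static int add_spares(struct spare *s, int n)
{
    int spares = 0;
    for (int i = 0; i < n; i++) {
        if (!s[i].acceptable)
            continue;              /* previously: break (abort early) */
        s[i].recovery_offset = 0;  /* recovery for this device starts afresh */
        s[i].added = true;
        spares++;
    }
    return spares;
}
```

With an early `break`, a rejected first candidate would make this return 0, which is the "count of spares ends up as zero" failure described above.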

md/raid5: ensure correct assessment of drives during degraded reshape. · 30d7a483

Committed by NeilBrown on Dec 23, 2011

While reshaping a degraded array (as when reshaping a RAID0 by first
converting it to a degraded RAID4) we currently get confused about
which devices are in_sync. In most cases we get it right, but in the
region that is being reshaped we need to treat non-failed devices as
in-sync when we have the data but haven't actually written it out yet.
Reported-by: Adam Kwolek <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md/linear: fix hot-add of devices to linear arrays. · 09cd9270

Committed by NeilBrown on Dec 23, 2011

commit d70ed2e4
broke hot-add to a linear array.
After that commit, metadata is not written to devices until they
have been fully integrated into the array as determined by
saved_raid_disk.  That patch arranged to clear that field after
a recovery completed.

However for linear arrays, there is no recovery - the integration is
instantaneous.  So we need to explicitly clear the saved_raid_disk
field.
Signed-off-by: NeilBrown <neilb@suse.de>

09 Dec, 2011 · 1 commit

md: raid5 crash during degradation · 5d8c71f9

Committed by Adam Kwolek on Dec 09, 2011

A NULL pointer access causes a crash in the raid5 module.
Signed-off-by: Adam Kwolek <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>

08 Dec, 2011 · 5 commits

md/raid5: never wait for bad-block acks on failed device. · 9283d8c5

Committed by NeilBrown on Dec 08, 2011

Once a device is failed we really want to completely ignore it.
It should go away soon anyway.

In particular the presence of bad blocks on it should not cause us to
block as we won't be trying to write there anyway.

So as soon as we can check if a device is Faulty, do so and pretend
that it is already gone if it is Faulty.
Signed-off-by: NeilBrown <neilb@suse.de>

md: ensure new badblocks are handled promptly. · 8bd2f0a0

Committed by NeilBrown on Dec 08, 2011

When we mark blocks as bad we need them to be acknowledged by the
metadata handler promptly.

For an in-kernel metadata handler that was already being done.  But
for an external metadata handler we need to alert it of the change by
sending a notification through the sysfs file.  This adds that
notification.
Signed-off-by: NeilBrown <neilb@suse.de>

md: bad blocks shouldn't cause a Blocked status on a Faulty device. · 52c64152

Committed by NeilBrown on Dec 08, 2011

Once a device is marked Faulty the badblocks - whether acknowledged or
not - become irrelevant.  So they shouldn't cause the device to be
marked as Blocked.

Without this patch, a process might write "-blocked" to clear the
Blocked status, but while that will correctly fail the device, it
won't remove the apparent 'blocked' status.
Signed-off-by: NeilBrown <neilb@suse.de>

md: take a reference to mddev during sysfs access. · af8a2434

Committed by NeilBrown on Dec 08, 2011

When we are accessing an mddev via sysfs we know that the
mddev cannot disappear because it has an embedded kobj which
is refcounted by sysfs.
And we also take the mddev_lock.
However this is not enough.

The final mddev_put could have been called and the
mddev_delayed_delete is waiting for sysfs to let go so it can destroy
the kobj and mddev.
In this state there are a lot of changes that should not be attempted.

To guard against this we:
 - initialise mddev->all_mddevs on last put so the state can be
   easily detected.
 - in md_attr_show and md_attr_store, check ->all_mddevs under
   all_mddevs_lock and mddev_get the mddev if it still appears to
   be active.

This means that if we get to sysfs as the mddev is being deleted we
will get -EBUSY.

rdev_attr_store and rdev_attr_show are similar but already have
sufficient protection.  They check that rdev->mddev still points to
mddev after taking mddev_lock.  As this is cleared before delayed
removal, which can only be requested under the mddev_lock, this
ensures the rdev and mddev are still alive.
Signed-off-by: NeilBrown <neilb@suse.de>
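A userspace sketch of the "check under the list lock, then take a reference" pattern described above. The `atomic_flag` spinlock stands in for all_mddevs_lock, and `struct fake_mddev` with its `listed` flag is a hypothetical stand-in for the real mddev and its `->all_mddevs` list linkage.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for all_mddevs_lock: a tiny spinlock built on atomic_flag. */
static atomic_flag all_mddevs_lock = ATOMIC_FLAG_INIT;

static void lock_all(void)
{
    while (atomic_flag_test_and_set_explicit(&all_mddevs_lock, memory_order_acquire))
        ;  /* spin */
}

static void unlock_all(void)
{
    atomic_flag_clear_explicit(&all_mddevs_lock, memory_order_release);
}

/* Hypothetical stand-in for struct mddev. */
struct fake_mddev {
    int active;   /* reference count */
    bool listed;  /* cleared by the final put: teardown pending */
};

static void fake_mddev_put(struct fake_mddev *m)
{
    lock_all();
    if (--m->active == 0)
        m->listed = false;  /* like re-initialising ->all_mddevs */
    unlock_all();
}

/* Shape of a sysfs show/store handler after the change. */
static int fake_attr_access(struct fake_mddev *m)
{
    lock_all();
    if (!m->listed) {
        unlock_all();
        return -EBUSY;      /* already on its way out */
    }
    m->active++;            /* like mddev_get() */
    unlock_all();

    /* ... take mddev_lock and perform the show/store here ... */

    fake_mddev_put(m);
    return 0;
}
```

The reference taken under the lock keeps the final put from racing with the handler; once the final put has run, later accesses fail with -EBUSY instead of touching a dying object.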

md: refine interpretation of "hold_active == UNTIL_IOCTL". · 1d23f178

Committed by NeilBrown on Dec 08, 2011

We like md devices to disappear when they really are not needed.
However it is not possible to tell from the current state whether it
is needed or not.  We can only tell from recent history of changes.

In particular immediately after we create an md device it looks very
similar to immediately after we have finished with it.

So we always preserve a newly created md device until something
significant happens.  This state is stored in 'hold_active'.

The normal case is to keep it until an ioctl happens, as that will
normally either activate it, or explicitly de-activate it.  If it
doesn't then it was probably created by mistake and it is now time to
get rid of it.

We can also modify an array via sysfs (instead of via ioctl) and we
currently treat any change via sysfs like an ioctl: as a sign that if
the array still isn't active, it should be destroyed.
However this is not appropriate as changes made via sysfs are more
gradual, so we should look for a more definitive change.

So this patch only clears 'hold_active' from UNTIL_IOCTL when the
array_state is changed via sysfs.  Other changes via sysfs
are ignored.
Signed-off-by: NeilBrown <neilb@suse.de>
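A small state sketch of the refined rule. `UNTIL_IOCTL` comes from the text; `HOLD_NONE`, the `change_src` classification and both helpers are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hold states: UNTIL_IOCTL is from the text, the rest is assumed. */
enum hold { HOLD_NONE = 0, UNTIL_IOCTL };

/* Where a change came from (hypothetical classification). */
enum change_src { CHANGE_IOCTL, CHANGE_SYSFS_ARRAY_STATE, CHANGE_SYSFS_OTHER };

struct fake_md {
    enum hold hold_active;
    bool active;  /* is the array actually running? */
};

/* Only an ioctl or an array_state write ends the grace period. */
static void note_change(struct fake_md *m, enum change_src src)
{
    if (m->hold_active == UNTIL_IOCTL && src != CHANGE_SYSFS_OTHER)
        m->hold_active = HOLD_NONE;
}

static bool may_destroy(const struct fake_md *m)
{
    return m->hold_active == HOLD_NONE && !m->active;
}
```

A newly created, still-inactive device survives incremental sysfs tweaks and only becomes eligible for destruction after an ioctl or an array_state write.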

23 Nov, 2011 · 1 commit

md/lock: ensure updates to page_attrs are properly locked. · 7c8f4247

Committed by NeilBrown on Nov 23, 2011

Page attributes are set using __set_bit rather than set_bit as
it is normally called under a spinlock, so the extra atomicity is not
needed.

However there are two places where we might set or clear page
attributes without holding the spinlock.
So add the spinlock in those cases.

This might be the cause of occasional reports that bits aren't
getting cleared properly - the theory is that BITMAP_PAGE_PENDING gets
lost when BITMAP_PAGE_NEEDWRITE is set or cleared.  This is an
inconvenience, not a threat to data safety.
Signed-off-by: NeilBrown <neilb@suse.de>

08 Nov, 2011 · 5 commits

md/raid5: STRIPE_ACTIVE has lock semantics, add barriers · 257a4b42

Committed by Dan Williams on Nov 08, 2011

All updates that occur under STRIPE_ACTIVE should be globally visible
when STRIPE_ACTIVE clears.  test_and_set_bit() implies a barrier, but
clear_bit() does not.

This is suitable for 3.1-stable.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
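A C11-atomics sketch of treating a state bit as a lock: acquire ordering when it is taken, release ordering when it is cleared, so work done while the bit was held is visible to the next holder. In kernel code the release side would typically be `clear_bit_unlock()` or an explicit barrier before `clear_bit()`; which one the patch uses is not shown here, and `struct fake_stripe` is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define STRIPE_ACTIVE_BIT (1ul << 0)  /* hypothetical bit position */

struct fake_stripe {
    atomic_ulong state;
    int data;  /* protected by STRIPE_ACTIVE_BIT */
};

/* Take the "lock": like test_and_set_bit(), ordered (acquire here). */
static bool stripe_trylock(struct fake_stripe *sh)
{
    unsigned long old = atomic_fetch_or_explicit(&sh->state, STRIPE_ACTIVE_BIT,
                                                 memory_order_acquire);
    return !(old & STRIPE_ACTIVE_BIT);
}

/* Drop it: release ordering makes every earlier write to sh->data visible
 * before the bit clears; a relaxed clear (like bare clear_bit()) would not. */
static void stripe_unlock(struct fake_stripe *sh)
{
    atomic_fetch_and_explicit(&sh->state, ~STRIPE_ACTIVE_BIT, memory_order_release);
}
```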

md/raid5: abort any pending parity operations when array fails. · 9a3f530f

Committed by NeilBrown on Nov 08, 2011

When the number of failed devices exceeds the allowed number
we must abort any active parity operations (checks or updates) as they
are no longer meaningful, and can lead to a BUG_ON in
handle_parity_checks6.

This bug was introduced by commit 6c0069c0
in 2.6.29.
Reported-by: Manish Katiyar <mkatiyar@gmail.com>
Tested-by: Manish Katiyar <mkatiyar@gmail.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org

device-mapper: using EXPORT_SYMBOL in dm-space-map-checker.c needs export.h · a8445060

Committed by Stephen Rothwell on Nov 01, 2011

Reported-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

device-mapper: dm-bufio.c needs to include module.h · 6f66263f

Committed by Stephen Rothwell on Nov 01, 2011

since it uses the module facilities.
Reported-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

drivers/md: change module.h -> export.h in persistent-data/dm-* · 1944ce60

Committed by Paul Gortmaker on Sep 28, 2011

For the files which are not themselves modular, we can change
them to include only the smaller export.h since all they are
doing is looking for EXPORT_SYMBOL.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

01 Nov, 2011 · 16 commits

md: Add in export.h for files using EXPORT_SYMBOL · daaa5f7c

Committed by Paul Gortmaker on May 27, 2011

These files were getting the defines for EXPORT_SYMBOL because
device.h was including module.h.  But we are going to put an
end to that.  So add the proper export.h include now.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>

md: Add module.h to all files using it implicitly · 056075c7

Committed by Paul Gortmaker on Jul 03, 2011

A pending cleanup will mean that module.h won't be implicitly
included everywhere anymore. Make sure the modular drivers in md dir
are actually calling out for <module.h> explicitly in advance.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>

dm: raid fix device status indicator when array initializing · 2e727c3c

Committed by Jonathan E Brassow on Oct 31, 2011

When devices in a RAID array are not in-sync, they are supposed to be
reported as such in the status output as an 'a' character, which means
"alive, but not in-sync". But when the entire array is being rebuilt,
'A' is used, which is incorrect. This patch corrects this to 'a'.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
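A minimal sketch of the status-character rule. The 'a' and 'A' meanings come from the text; using 'D' for a failed device is an assumption, and `dev_status_char()` is a hypothetical helper rather than the dm-raid status code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: 'D' for a failed device is assumed here;
 * 'a' (alive, not in-sync) and 'A' (alive, in-sync) follow the text. */
static char dev_status_char(bool failed, bool dev_in_sync, bool array_rebuilding)
{
    if (failed)
        return 'D';
    /* While the whole array is being built, no device is in sync yet. */
    if (!dev_in_sync || array_rebuilding)
        return 'a';
    return 'A';
}
```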

dm log userspace: add log device dependency · 5a25f0eb

Committed by Jonathan E Brassow on Oct 31, 2011

Allow userspace dm log implementations to register their log device so it
is no longer missing from the list of device dependencies.

When device mapper targets use a device they normally call dm_get_device
which includes it in the device list returned to userspace applications
such as LVM through the DM_TABLE_DEPS ioctl. Userspace log devices
don't use dm_get_device as userspace opens them so they are missing from
the list of dependencies.

This patch extends the DM_ULOG_CTR operation to allow userspace to
respond with the name of the log device (if appropriate) to be
registered via 'dm_get_device'. DM_ULOG_REQUEST_VERSION is incremented.

This is backwards compatible. If the kernel and userspace log server
have both been updated, the new information will be passed down to the
kernel and the device will be registered. If the kernel is new, but
the log server is old, the log server will not pass down any device
information and the kernel will simply bypass the device registration
as before. If the kernel is old but the log server is new, the log
server will see the old version number and not pass the device info.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
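A sketch of the three compatibility cases. The text says only that DM_ULOG_REQUEST_VERSION was incremented, so `OLD_VER`/`NEW_VER`, both helper functions and the example device string are hypothetical; in the real kernel the registration step would be the `dm_get_device` call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define OLD_VER 2u  /* hypothetical: version before the change */
#define NEW_VER 3u  /* hypothetical: version that carries the device name */

/* Log server: append a device name to the ctr reply only if both the
 * kernel's request and this server understand the new version. Returns
 * the number of bytes written (0 = no device info). */
static size_t server_ctr_reply(unsigned request_ver, unsigned server_ver,
                               const char *log_dev, char *out, size_t outlen)
{
    if (request_ver < NEW_VER || server_ver < NEW_VER || log_dev == NULL)
        return 0;
    size_t n = strlen(log_dev) + 1;  /* include the terminating NUL */
    if (n > outlen)
        return 0;
    memcpy(out, log_dev, n);
    return n;
}

/* Kernel: register the device only when a name came back; otherwise
 * skip registration exactly as before. */
static int kernel_should_register(const char *reply, size_t len)
{
    return len > 0 && reply[0] != '\0';
}
```

An old kernel sends the old version and gets no name; an old server never sends one; only when both sides are new is the device registered.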

dm log userspace: fix comment hyphens · b8954457

Committed by Jonathan Brassow on Oct 31, 2011

Fix comments: clustered-disk needs a hyphen not an underscore.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm: add thin provisioning target · 991d9fa0

Committed by Joe Thornber on Oct 31, 2011

Initial EXPERIMENTAL implementation of device-mapper thin provisioning
with snapshot support. The 'thin' target is used to create instances of
the virtual devices that are hosted in the 'thin-pool' target. The
thin-pool target provides data sharing among devices. This sharing is
made possible using the persistent-data library in the previous patch.

The main highlight of this implementation, compared to the previous
implementation of snapshots, is that it allows many virtual devices to
be stored on the same data volume, simplifying administration and
allowing sharing of data between volumes (thus reducing disk usage).

Another big feature is support for arbitrary depth of recursive
snapshots (snapshots of snapshots of snapshots ...). The previous
implementation of snapshots did this by chaining together lookup tables,
and so performance was O(depth). This new implementation uses a single
data structure so we don't get this degradation with depth.

For further information and examples of how to use this, please read
Documentation/device-mapper/thin-provisioning.txt
Signed-off-by: Joe Thornber <thornber@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm: add persistent data library · 3241b1d3

Committed by Joe Thornber on Oct 31, 2011

The persistent-data library offers a re-usable framework for the storage
and management of on-disk metadata in device-mapper targets.

It's used by the thin-provisioning target in the next patch and in an
upcoming hierarchical storage target.

For further information, please read
Documentation/device-mapper/persistent-data.txt
Signed-off-by: Joe Thornber <thornber@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm: add bufio · 95d402f0

Committed by Mikulas Patocka on Oct 31, 2011

The dm-bufio interface allows you to do cached I/O on devices,
holding recently-read blocks in memory and performing delayed writes.

We don't use the buffer cache or page cache already present in the kernel because:
* we need to handle block sizes larger than a page
* we can't allocate memory to perform reads or we'd have deadlocks

Currently, when a cache is required, we limit its size to a fraction of
available memory.  Usage can be viewed and changed in
/sys/module/dm_bufio/parameters/ .

The first user is thin provisioning, but more dm users are planned.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm: export dm get md · 3cf2e4ba

Committed by Alasdair G Kergon on Oct 31, 2011

Export dm_get_md() for the new thin provisioning target to use.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm table: add immutable feature · 36a0456f

Committed by Alasdair G Kergon on Oct 31, 2011

Introduce DM_TARGET_IMMUTABLE to indicate that the target type cannot be mixed
with any other target type, and once loaded into a device, it cannot be
replaced with a table containing a different type.

The thin provisioning pool device will use this.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm table: add always writeable feature · cc6cbe14

Committed by Alasdair G Kergon on Oct 31, 2011

Add a target feature flag DM_TARGET_ALWAYS_WRITEABLE to indicate that a target
does not support read-only mode.

The initial implementation of the thin provisioning target uses this.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm table: add singleton feature · 3791e2fc

Committed by Alasdair G Kergon on Oct 31, 2011

Introduce the concept of a singleton table which contains exactly one target.

If a target type sets the DM_TARGET_SINGLETON feature bit, device-mapper
will ensure that any table that includes that target contains no others.

The thin provisioning pool target uses this.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
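A sketch of how the three table feature flags from these commits (singleton, always writeable, immutable) could be checked when a new table is loaded. The flag names come from the text; the bit values, `struct fake_target_type` and `table_ok()` are illustrative assumptions, not device-mapper's actual validation code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag names follow the text; the bit values are illustrative. */
#define DM_TARGET_SINGLETON        0x00000001u
#define DM_TARGET_ALWAYS_WRITEABLE 0x00000002u
#define DM_TARGET_IMMUTABLE        0x00000004u

struct fake_target_type {
    const char *name;
    unsigned features;
};

/* Validate a proposed table. live_immutable is the immutable type already
 * loaded into the device, or NULL if there is none. */
static bool table_ok(const struct fake_target_type *const tt[], int n,
                     bool read_only, const struct fake_target_type *live_immutable)
{
    const struct fake_target_type *immutable = live_immutable;

    for (int i = 0; i < n; i++) {
        unsigned f = tt[i]->features;
        if ((f & DM_TARGET_SINGLETON) && n != 1)
            return false;  /* must be the only target in the table */
        if ((f & DM_TARGET_ALWAYS_WRITEABLE) && read_only)
            return false;  /* no read-only mode */
        if (f & DM_TARGET_IMMUTABLE) {
            if (immutable != NULL && immutable != tt[i])
                return false;  /* cannot swap to another immutable type */
            immutable = tt[i];
        }
    }
    /* An immutable type cannot be mixed with, or replaced by, other types. */
    if (immutable != NULL)
        for (int i = 0; i < n; i++)
            if (tt[i] != immutable)
                return false;
    return true;
}
```

A pool-like target carrying all three flags can only be loaded alone, read-write, and once loaded cannot be replaced by a table of another type.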

dm kcopyd: add dm_kcopyd_zero to zero an area · 7f069653

Committed by Mikulas Patocka on Oct 31, 2011

This patch introduces dm_kcopyd_zero() to make it easy to use
kcopyd to write zeros into the requested areas instead of
copying.  It is implemented by passing a NULL
copying source to dm_kcopyd_copy().

The forthcoming thin provisioning target uses this.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
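An in-memory analogue of the "NULL source means zero" convention. `copy_region()` and `zero_region()` are hypothetical stand-ins operating on byte buffers, not the kcopyd API, which works on block-device regions asynchronously.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical analogue of a copy routine in which a NULL source
 * means "write zeros" instead of copying. */
static void copy_region(unsigned char *dst, const unsigned char *src, size_t len)
{
    if (src != NULL)
        memcpy(dst, src, len);
    else
        memset(dst, 0, len);
}

/* The zeroing helper is then a thin wrapper passing a NULL source. */
static void zero_region(unsigned char *dst, size_t len)
{
    copy_region(dst, NULL, len);
}
```

Reusing the copy path this way means zeroing gets the same queuing and completion handling as a copy without a separate engine.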

dm: remove superfluous smp_mb · fbdc86f3

Committed by Namhyung Kim on Oct 31, 2011

Since set_current_state() contains a memory barrier in it,
an additional barrier isn't needed.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm: use local printk ratelimit · 71a16736

Committed by Namhyung Kim on Oct 31, 2011

printk_ratelimit() shares global ratelimiting state with all
other subsystems, so its usage is discouraged. Instead,
define and use dm's local state.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
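A userspace sketch of a private interval/burst limiter, i.e. state owned by one subsystem instead of the global printk_ratelimit() state. In the kernel the building blocks are `DEFINE_RATELIMIT_STATE()` and `__ratelimit()`; the struct and function below are hypothetical, and time is passed in explicitly so the behaviour is deterministic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical private limiter: at most `burst` messages per `interval`
 * time units, with state owned by one subsystem rather than shared. */
struct ratelimit {
    long interval;
    long burst;
    long begin;    /* start of the current window */
    long printed;  /* messages allowed so far in this window */
};

static bool ratelimit_ok(struct ratelimit *rs, long now)
{
    if (now - rs->begin >= rs->interval) {  /* start a new window */
        rs->begin = now;
        rs->printed = 0;
    }
    if (rs->printed < rs->burst) {
        rs->printed++;
        return true;
    }
    return false;  /* suppressed */
}
```

Because the state is private, a burst of dm messages can no longer consume the budget of unrelated subsystems, or vice versa.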

dm table: propagate non rotational flag · 4693c966

Committed by Mandeep Singh Baines on Oct 31, 2011

Allow QUEUE_FLAG_NONROT to propagate up the device stack if all
underlying devices are non-rotational.  Tools like ureadahead will
schedule IOs differently based on the rotational flag.

With this patch, I see boot time go from 7.75 s to 7.46 s on my device.
Suggested-by: J. Richard Barnette <jrbarnette@chromium.org>
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: dm-devel@redhat.com
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
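A minimal sketch of the "all underlying devices must agree" rule for propagating the non-rotational flag. `struct fake_queue` and `stack_nonrot()` are hypothetical; the real code works on request-queue flags while stacking limits.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-device queue property. */
struct fake_queue { bool nonrot; };

/* The stacked device may advertise non-rotational only when every
 * underlying device does; one spinning disk clears it. */
static bool stack_nonrot(const struct fake_queue *const devs[], int n)
{
    for (int i = 0; i < n; i++)
        if (!devs[i]->nonrot)
            return false;
    return true;
}
```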

31 Oct, 2011 · 1 commit

md/raid10: Fix bug when activating a hot-spare. · 7fcc7c8a

Committed by NeilBrown on Oct 31, 2011

This is a fairly serious bug in RAID10.

When a RAID10 array is degraded and a hot-spare is activated, the
spare does not take up the empty slot, but rather replaces the first
working device.
This is likely to make the array non-functional.   It would normally
be possible to recover the data, but that would need care and is not
guaranteed.

This bug was introduced in commit
   2bb77736
which first appeared in 3.1.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>

26 Oct, 2011 · 1 commit

md: Fix some bugs in recovery_disabled handling. · d890fa2b

Committed by NeilBrown on Oct 26, 2011

In 3.0 we changed the way recovery_disabled was handled so that instead
of testing against zero, we test an mddev-> value against a conf->
value.
Two problems:
  1/ one place in raid1 was missed and still sets it to '1'.
  2/ We didn't explicitly set the conf-> value at array creation
     time.
     It defaulted to '0' just like the mddev value does so they
     could appear equal and thus disable recovery.
     This did not affect normal 'md' as it calls bind_rdev_to_array
     which changes the mddev value.  However the dmraid interface
     doesn't call this and so doesn't change ->recovery_disabled; so at
     array start all recovery is incorrectly disabled.

So initialise the 'conf' value to one less than the mddev value, so
they will only be the same when explicitly set that way.
Reported-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
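A sketch of the "equal counters mean disabled" scheme and why initialising the conf value to one less than the mddev value avoids a false match when both would otherwise default to 0. The struct names and helpers are hypothetical stand-ins for the mddev and personality conf fields.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two recovery_disabled fields. */
struct fake_mddev { int recovery_disabled; };
struct fake_conf  { int recovery_disabled; };

/* Recovery is blocked exactly when the two values match. */
static bool recovery_blocked(const struct fake_mddev *m, const struct fake_conf *c)
{
    return m->recovery_disabled == c->recovery_disabled;
}

/* At array creation: start one below, so the values differ by default. */
static void conf_init(struct fake_conf *c, const struct fake_mddev *m)
{
    c->recovery_disabled = m->recovery_disabled - 1;
}

/* Deliberately disabling recovery copies the current mddev value. */
static void disable_recovery(struct fake_conf *c, const struct fake_mddev *m)
{
    c->recovery_disabled = m->recovery_disabled;
}
```

Without `conf_init()`, two zero-initialised fields compare equal and recovery is blocked from the start, which matches the dmraid symptom described above.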

openanolis / cloud-kernel · synced successfully more than 1 year ago