Commits · 3fa841d7e7266f6fcc1b3885b905f5153ba897d8 · openanolis / cloud-kernel

23 Sep, 2009 3 commits

md: report device as congested when suspended · 3fa841d7

Committed by NeilBrown on Sep 23, 2009

This should stop writeback from coming when the device is temporarily
suspended.
Signed-off-by: NeilBrown <neilb@suse.de>

3fa841d7

md: Improve name of threads created by md_register_thread · 0da3c619

Committed by NeilBrown on Sep 23, 2009

The management threads for raid4,5,6 arrays are all called
mdX_raid5, independent of the actual raid level, which is wrong and
can be confusing.

So change md_register_thread to use the name from the personality
unless an alternate name (like 'resync' or 'reshape') is given.

This is simpler and more correct.

Cc: Jinzc <zhenchengjin@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>

0da3c619

md: remove sparse warning "symbol xxx shadows an earlier one" · a9f326eb

Committed by NeilBrown on Sep 23, 2009

Rename some variables and remove some duplicate definitions
to avoid these warnings.  None of them are actual errors.
Signed-off-by: NeilBrown <neilb@suse.de>

a9f326eb

22 Sep, 2009 1 commit

const: make block_device_operations const · 83d5cde4

Committed by Alexey Dobriyan on Sep 21, 2009

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

83d5cde4

18 Aug, 2009 1 commit

Fix new incorrect error return from do_md_stop. · 80ffb3cc

Committed by NeilBrown on Aug 18, 2009

Recent commit c8c00a69
changed the exit paths in do_md_stop and was not quite
careful enough.  There is one path where 'err' now needs
to be cleared but it isn't.
So setting an array to readonly (with mdadm --readonly) will
work, but will incorrectly report an error: ENXIO.
Signed-off-by: NeilBrown <neilb@suse.de>

80ffb3cc

13 Aug, 2009 2 commits

md: allow upper limit for resync/reshape to be set when array is read-only · 4d484a4a

Committed by NeilBrown on Aug 13, 2009

Normally we only allow the upper limit for a reshape to be decreased
when the array is not performing a sync/recovery/reshape, otherwise there
could be races.  But if an array is part-way through a reshape when it
is assembled, the reshape is started immediately, leaving no window
to set an upper bound.

If the array is started read-only, the reshape will be suspended until
the array becomes writable, so that provides a window during which it
is perfectly safe to reduce the upper limit of a reshape.

So: allow the upper limit (sync_max) to be reduced even if the reshape
thread is running, as long as the array is still read-only.
Signed-off-by: NeilBrown <neilb@suse.de>

4d484a4a

md: never advance 'events' counter by more than 1. · 51d5668c

Committed by NeilBrown on Aug 13, 2009

When assembling arrays, md allows two devices to have different event
counts as long as the difference is only '1'.  This is to cope with
a system failure between updating the metadata on two different
devices.

However there are currently times when we update the event count by
2.  This was done to keep the event count even when the array is clean
and odd when it is dirty, which allows us to avoid writing common
updates to spare devices and so allow those spares to go to sleep.

This is bad for the above reason.  So change it to never increase by
two.  This means that the alignment between 'odd/even' and
'clean/dirty' might take a little longer to attain, but that is only a
small cost.  The spares will get a few more updates but they will
still be spared (;-) most updates and can still go to sleep.

Prior to this patch there was a small chance that after a crash an
array would fail to assemble due to the overly large event count
mismatch.
Signed-off-by: NeilBrown <neilb@suse.de>
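
The hazard described in this message can be illustrated with a toy model. This is only a sketch of the reasoning, not md's code: `can_assemble`, `crash_mid_update`, and the list-of-counters representation are hypothetical names introduced for illustration.

```python
def can_assemble(event_counts):
    """Toy rule: accept the devices if their event counts differ by at most 1."""
    return max(event_counts) - min(event_counts) <= 1


def crash_mid_update(event_counts, step):
    """Model a crash after only the first device's metadata was rewritten."""
    updated = list(event_counts)
    updated[0] += step
    return updated


# Both devices start with the same event count.
before = [10, 10]

print(can_assemble(crash_mid_update(before, 1)))  # bump by 1, then crash
print(can_assemble(crash_mid_update(before, 2)))  # bump by 2, then crash
```

With a step of 1 the interrupted update still leaves the counts within the tolerated difference; with a step of 2 it does not, which is the assembly failure the message refers to.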

51d5668c

10 Aug, 2009 1 commit

Remove deadlock potential in md_open · c8c00a69

Committed by NeilBrown on Aug 10, 2009

A recent commit:
  commit 449aad3e

introduced the possibility of an A-B/B-A deadlock between
bd_mutex and reconfig_mutex.

__blkdev_get holds bd_mutex while calling md_open which takes
   reconfig_mutex,
do_md_run is always called with reconfig_mutex held, and it now
   takes bd_mutex in the call to revalidate_disk.

This potential deadlock was not caught by lockdep due to the
use of mutex_lock_interruptible_nested which was introduced
by
   commit d63a5a74
to avoid a warning of an impossible deadlock.

It is quite possible to split reconfig_mutex into two locks.
One protects the array data structures while it is being
reconfigured, the other ensures that an array is never even partially
open while it is being deactivated.
In particular, the second lock prevents an open from completing
between the time when do_md_stop checks if there are any active opens,
and the time when the array is either set read-only, or when ->pers is
set to NULL.  So we can be certain that no IO is in flight as the
array is being destroyed.

So create a new lock, open_mutex, just to ensure exclusion between
'open' and 'stop'.

This avoids the deadlock and also avoids the lockdep warning mentioned
in commit d63a5a74

Reported-by: "Mike Snitzer" <snitzer@gmail.com>
Reported-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: NeilBrown <neilb@suse.de>
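
The A-B/B-A ordering problem in this message can be replayed with a very small lock-order tracker. This is a sketch in the spirit of a lock-order checker, not lockdep itself and not kernel code; `LockOrderTracker` and `replay` are hypothetical, and only the lock names come from the message.

```python
class LockOrderTracker:
    """Records 'held -> acquired' pairs and flags A-B/B-A inversions."""

    def __init__(self):
        self.held = []        # locks currently held, in acquisition order
        self.edges = set()    # (earlier, later) pairs seen so far
        self.inversions = []  # pairs that have now been seen in both orders

    def acquire(self, name):
        for earlier in self.held:
            if (name, earlier) in self.edges:
                self.inversions.append((earlier, name))
            self.edges.add((earlier, name))
        self.held.append(name)

    def release(self, name):
        self.held.remove(name)


def replay(open_lock):
    """Replay both code paths, with md_open taking the lock named open_lock."""
    tracker = LockOrderTracker()

    # __blkdev_get: bd_mutex is held while md_open takes its lock.
    tracker.acquire("bd_mutex")
    tracker.acquire(open_lock)
    tracker.release(open_lock)
    tracker.release("bd_mutex")

    # do_md_run: reconfig_mutex is held while revalidate_disk takes bd_mutex.
    tracker.acquire("reconfig_mutex")
    tracker.acquire("bd_mutex")
    tracker.release("bd_mutex")
    tracker.release("reconfig_mutex")

    return tracker.inversions


print(replay("reconfig_mutex"))  # md_open using reconfig_mutex
print(replay("open_mutex"))      # md_open using the new open_mutex
```

When md_open takes reconfig_mutex, the two paths acquire the same pair of locks in opposite orders; when it takes a separate open_mutex, no pair is ever seen in both orders.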

c8c00a69

03 Aug, 2009 5 commits

md: Use revalidate_disk to effect changes in size of device. · 449aad3e

Committed by NeilBrown on Aug 03, 2009

As revalidate_disk calls check_disk_size_change, it will cause
any capacity change of a gendisk to be propagated to the blockdev
inode.  So use that instead of mucking about with locks and
i_size_write.

Also add a call to revalidate_disk in do_md_run and a few other places
where the gendisk capacity is changed.
Signed-off-by: NeilBrown <neilb@suse.de>

449aad3e

md: Handle growth of v1.x metadata correctly. · 70471daf

Committed by NeilBrown on Aug 03, 2009

The v1.x metadata does not have a fixed size and can grow
when devices are added.
If it grows enough to require an extra sector of storage,
we need to update the 'sb_size' to match.

Without this, md can write out an incomplete superblock with a
bad checksum, which will be rejected when trying to re-assemble
the array.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>

70471daf

md: avoid array overflow with bad v1.x metadata · 3673f305

Committed by NeilBrown on Aug 03, 2009

We trust the 'desc_nr' field in v1.x metadata enough to use it
as an index in an array.  This isn't really safe.
So range-check the value first.
Signed-off-by: NeilBrown <neilb@suse.de>
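
The idea of range-checking an untrusted index before using it can be sketched as follows. The function name `role_for`, the `dev_roles` list, and the choice to return `None` for a bad value are assumptions for illustration; the message does not say how the kernel reacts to an out-of-range 'desc_nr'.

```python
def role_for(desc_nr, dev_roles):
    """Return the entry for desc_nr, or None if desc_nr is out of range."""
    if not 0 <= desc_nr < len(dev_roles):
        return None  # reject the untrusted index instead of using it
    return dev_roles[desc_nr]


print(role_for(1, [0, 1, 2]))   # valid index
print(role_for(7, [0, 1, 2]))   # index beyond the table
print(role_for(-1, [0, 1, 2]))  # negative index, which Python would otherwise wrap
```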

3673f305

md: when a level change reduces the number of devices, remove the excess. · 3a981b03

Committed by NeilBrown on Aug 03, 2009

When an array is changed from RAID6 to RAID5, fewer drives are
needed.  So any device that is made superfluous by the level
conversion must be marked as not-active.
For the RAID6->RAID5 conversion, this will be a drive which only
has 'Q' blocks on it.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>

3a981b03

md: Push down data integrity code to personalities. · ac5e7113

Committed by Andre Noll on Aug 03, 2009

This patch replaces md_integrity_check() by two new public functions:
md_integrity_register() and md_integrity_add_rdev() which are both
personality-independent.

md_integrity_register() is called from the ->run and ->hot_remove
methods of all personalities that support data integrity.  The
function iterates over the component devices of the array and
determines if all active devices are integrity capable and if their
profiles match. If this is the case, the common profile is registered
for the mddev via blk_integrity_register().

The second new function, md_integrity_add_rdev() is called from the
->hot_add_disk methods, i.e. whenever a new device is being added
to a raid array. If the new device does not support data integrity,
or has a profile different from the one already registered, data
integrity for the mddev is disabled.

For raid0 and linear, only the call to md_integrity_register() from
the ->run method is necessary.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
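
The decision that md_integrity_register() is described as making (all active devices integrity capable, and all profiles matching) can be modelled in a few lines. This is a sketch of that rule only; `common_profile`, the dict keys 'active' and 'profile', and the placeholder profile names are hypothetical and do not reflect the kernel's data structures.

```python
def common_profile(devices):
    """Return the profile shared by all active devices, or None.

    Each device is a dict with hypothetical keys 'active' and 'profile';
    a profile of None means the device is not integrity capable.
    """
    profiles = [d["profile"] for d in devices if d["active"]]
    if not profiles or any(p is None for p in profiles):
        return None
    first = profiles[0]
    return first if all(p == first for p in profiles) else None


array = [
    {"active": True, "profile": "profile-A"},
    {"active": True, "profile": "profile-A"},
    {"active": False, "profile": None},  # inactive devices are ignored
]
print(common_profile(array))
```

Under this model a profile would be registered only when the function returns something other than `None`.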

ac5e7113

09 Jul, 2009 1 commit

Remove multiple KERN_ prefixes from printk formats · ad361c98

Committed by Joe Perches on Jul 06, 2009

Commit 5fd29d6c ("printk: clean up
handling of log-levels and newlines") changed printk semantics.  printk
lines with multiple KERN_<level> prefixes are no longer emitted as
before the patch.

<level> is now included in the output on each additional use.

Remove all uses of multiple KERN_<level>s in formats.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ad361c98

01 Jul, 2009 4 commits

md: use interruptible wait when duration is controlled by userspace. · e62e58a5

Committed by NeilBrown on Jul 01, 2009

User space can set various limits on an md array so that resync waits
when it gets to a certain point, or so that I/O is blocked for a short
while.
When md is waiting against one of these limits, it should use an
interruptible wait so as not to add to the load average, and so as
not to trigger a warning if the wait goes on for too long.
Signed-off-by: NeilBrown <neilb@suse.de>

e62e58a5

md: tidy up error paths in md_alloc · 0909dc44

Committed by NeilBrown on Jul 01, 2009

As the recent bug in md_alloc showed, having a single exit path for
unlocking and putting is a good idea.  So restructure md_alloc to have
a single mutex_unlock and mddev_put, and use gotos where necessary.
Found-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>

0909dc44

md: fix error path when duplicate name is found on md device creation. · 1ec22eb2

Committed by NeilBrown on Jul 01, 2009

When an md device is created by name (rather than number) we need to
check that the name is not already in use.  If this check finds a
duplicate, we return an error without dropping the lock or freeing
the newly created mddev.
This patch fixes that.

Cc: stable@kernel.org
Found-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>

1ec22eb2

md: avoid dereferencing NULL pointer when accessing suspend_* sysfs attributes. · b8d966ef

Committed by NeilBrown on Jul 01, 2009

If we try to modify one of the md/ sysfs files
  suspend_lo or suspend_hi
when the array is not active, we dereference a NULL.
Protect against that.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>

b8d966ef

18 Jun, 2009 8 commits

md: Move check for bitmap presence to personality code. · 0894cc30

Committed by Andre Noll on Jun 18, 2009

If the superblock of a component device indicates the presence of a
bitmap but the corresponding raid personality does not support bitmaps
(raid0, linear, multipath, faulty), then something is seriously wrong
and we'd better refuse to run such an array.

Currently, this check is performed while the superblocks are examined,
i.e. before entering personality code. Therefore the generic md layer
must know which raid levels support bitmaps and which do not.

This patch avoids this layer violation without adding identical code
to various personalities. This is accomplished by introducing a new
public function to md.c, md_check_no_bitmap(), which replaces the
hard-coded checks in the superblock loading functions.

A call to md_check_no_bitmap() is added to the ->run method of each
personality which does not support bitmaps and assembly is aborted
if at least one component device contains a bitmap.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>

0894cc30

md: remove chunksize rounding from common code. · 8190e754

Committed by NeilBrown on Jun 18, 2009

It is easiest to round sizes to multiples of chunk size in
the personality code for those personalities which care.
Those personalities now do the rounding, so we can
remove that function from common code.

Also remove the upper bound on the size of a chunk, and the lower
bound on the size of a device (1 chunk), neither of which really buys
us anything.
Signed-off-by: NeilBrown <neilb@suse.de>

8190e754

md: move assignment of ->utime so that it never gets skipped. · 1b57f132

Committed by NeilBrown on Jun 18, 2009

Currently the assignment to utime gets skipped for 'external'
metadata.  So move it to the top of the function so that it
always takes effect.
This is of largely cosmetic interest.  Nothing actually depends
on ->utime being right for external arrays.
"mdadm --monitor" does use it for 0.90 and 1.x arrays, but with
mdadm-3.0, this is not important for external metadata.
Signed-off-by: NeilBrown <neilb@suse.de>

1b57f132

md: Push down reconstruction log message to personality code. · 8c6ac868

Committed by Andre Noll on Jun 18, 2009

Currently, the md layer checks in analyze_sbs() if the raid level
supports reconstruction (mddev->level >= 1) and if reconstruction is
in progress (mddev->recovery_cp != MaxSector).

Move that printk into the personality code of those raid levels that
care (levels 1, 4, 5, 6, 10).
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>

8c6ac868

md: merge reconfig and check_reshape methods. · 50ac168a

Committed by NeilBrown on Jun 18, 2009

The difference between these two methods is artificial.
Both check that a pending reshape is valid, and perform any
aspect of it that can be done immediately.
'reconfig' handles chunk size and layout.
'check_reshape' handles raid_disks.

So make them just one method.
Signed-off-by: NeilBrown <neilb@suse.de>

50ac168a

md: remove unnecessary arguments from ->reconfig method. · 597a711b

Committed by NeilBrown on Jun 18, 2009

Passing the new layout and chunksize as args is not necessary as
the mddev has fields for new_chunk and new_layout.

This is preparation for combining the check_reshape and reconfig
methods.
Signed-off-by: NeilBrown <neilb@suse.de>

597a711b

md: Convert mddev->new_chunk to sectors. · 664e7c41

Committed by Andre Noll on Jun 18, 2009

A straight-forward conversion which gets rid of some
multiplications/divisions/shifts. The patch also introduces a couple
of new ones, most of which are due to conf->chunk_size still being
represented in bytes. This will be cleaned up in subsequent patches.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>

664e7c41

md: Make mddev->chunk_size sector-based. · 9d8f0363

Committed by Andre Noll on Jun 18, 2009

This patch renames the chunk_size field to chunk_sectors with the
implied change of semantics.  Since

	is_power_of_2(chunk_size) = is_power_of_2(chunk_sectors << 9)
				  = is_power_of_2(chunk_sectors)

these bits don't need an adjustment for the shift.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
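
The identity quoted in this message can be checked numerically. The sketch below uses a Python stand-in for is_power_of_2 built on the usual bit trick; it relies on Python's unbounded integers, so it says nothing about overflow of a fixed-width type when shifting very large values. `shift_preserves_power_of_2` is a hypothetical helper name.

```python
def is_power_of_2(n):
    """True when n is a non-zero power of two."""
    return n != 0 and (n & (n - 1)) == 0


def shift_preserves_power_of_2(chunk_sectors):
    """Compare the test on sectors with the test on bytes (sectors << 9)."""
    return is_power_of_2(chunk_sectors) == is_power_of_2(chunk_sectors << 9)


print(all(shift_preserves_power_of_2(s) for s in range(4096)))
```

Shifting left by 9 only appends zero bits, so the number of set bits is unchanged, which is why the check gives the same answer for sectors and bytes.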

9d8f0363

16 Jun, 2009 1 commit

md: prepare for non-power-of-two chunk sizes · 2ac06c33

Committed by raz ben yehuda on Jun 16, 2009

Remove chunk size check from md as this is now performed in the run
function in each personality.

Replace the power-of-2 chunk size calculations with regular division.

Signed-off-by: raziebe@gmail.com
Signed-off-by: NeilBrown <neilb@suse.de>
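
Why shift-and-mask arithmetic has to give way to division can be seen in a small comparison. This is an illustrative sketch, not the code the patch touches; both function names are hypothetical.

```python
def chunk_and_offset_pow2(sector, chunk_sectors):
    """Shift/mask form; only correct when chunk_sectors is a power of two."""
    shift = chunk_sectors.bit_length() - 1
    return sector >> shift, sector & (chunk_sectors - 1)


def chunk_and_offset(sector, chunk_sectors):
    """Division form; correct for any non-zero chunk size."""
    return divmod(sector, chunk_sectors)


print(chunk_and_offset_pow2(1000, 128), chunk_and_offset(1000, 128))
print(chunk_and_offset_pow2(1000, 96), chunk_and_offset(1000, 96))
```

For a chunk of 128 sectors both forms agree; for 96 sectors only the division form returns the right chunk number and offset.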

2ac06c33

26 May, 2009 5 commits

md: don't use locked_ioctl. · b492b852

Committed by NeilBrown on May 26, 2009

md has no need for the BKL - it does its own locking.
So md_ioctl doesn't need to be a locked_ioctl.
Signed-off-by: NeilBrown <neilb@suse.de>

b492b852

md: don't update curr_resync_completed without also updating reshape_position. · 7a91ee1f

Committed by NeilBrown on May 26, 2009

In order for the metadata to always be consistent, we mustn't update
curr_resync_completed without also updating reshape_position.

The reshape code updates both at the same time.  However since
commit 97e4f42d
the common md_do_sync will sometimes update curr_resync_completed
but is not in a position to update reshape_position.
So if MD_RECOVERY_RESHAPE is set (indicating that a reshape is
happening, so reshape_position might change), don't update
curr_resync_completed in md_do_sync, leave it to the per-personality
reshape code.
Signed-off-by: NeilBrown <neilb@suse.de>

7a91ee1f

md: export 'frozen' resync state through sysfs · b6a9ce68

Committed by NeilBrown on May 26, 2009

The md resync engine has a 'frozen' state which ensures that
no resync/recovery takes place.  This is used to avoid races.

Export this state through the 'sync_action' sysfs attribute
so that user-space can benefit and also avoid some races.
Signed-off-by: NeilBrown <neilb@suse.de>

b6a9ce68

md: improve errno return when setting array_size · 2b69c839

Committed by NeilBrown on May 26, 2009

Instead of always returning EINVAL if anything goes wrong
when setting the array size, add the option of
  E2BIG
if the size requested is too large.  This makes it easier
for user-space to be sure what went wrong.
Signed-off-by: NeilBrown <neilb@suse.de>
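
The benefit of a distinct error code can be sketched with Python's `errno` constants. The function `check_array_size`, its parameters, and the specific validity test are assumptions made for illustration; the message only states that E2BIG is returned when the requested size is too large.

```python
import errno


def check_array_size(requested_sectors, max_sectors):
    """Return 0 on success or a negative errno value, kernel style."""
    if not isinstance(requested_sectors, int) or requested_sectors < 0:
        return -errno.EINVAL  # malformed request
    if requested_sectors > max_sectors:
        return -errno.E2BIG   # well-formed, but larger than allowed
    return 0


for request in ("abc", 10, 100):
    result = check_array_size(request, 50)
    print(request, errno.errorcode.get(-result, "OK"))
```

A caller can then tell a request that was merely too large from one that was not understood at all.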

2b69c839

md: always update level / chunk_size / layout when writing v1.x metadata. · 62e1e389

Committed by NeilBrown on May 26, 2009

We previously didn't update these fields when writing the metadata
because they could never change.  They can now, so we had better write
them.
v0.90 metadata always updated these fields.
Signed-off-by: NeilBrown <neilb@suse.de>

62e1e389

23 May, 2009 1 commit

block: Do away with the notion of hardsect_size · e1defc4f

Committed by Martin K. Petersen on May 22, 2009

Until now we have had a 1:1 mapping between storage device physical
block size and the logical block size used when addressing the device.
With SATA 4KB drives coming out that will no longer be the case.  The
sector size will be 4KB but the logical block size will remain
512-bytes.  Hence we need to distinguish between the physical block size
and the logical ditto.

This patch renames hardsect_size to logical_block_size.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

e1defc4f

07 May, 2009 4 commits

md: remove rd%d links immediately after stopping an array. · c4647292

Committed by NeilBrown on May 07, 2009

md maintains links in sys/mdXX/md/ to identify which device has
which role in the array. e.g.
   rd2 -> dev-sda

indicates that the device with role '2' in the array is sda.

These links are only present when the array is active.  They are
created immediately after ->run is called, and so should be removed
immediately after ->stop is called.
However they are currently removed a little bit later, and it is
possible for ->run to be called again, thus adding these links, before
they are removed.

So move the removal earlier so they are consistently only present when
the array is active.
Signed-off-by: NeilBrown <neilb@suse.de>

c4647292

md: remove ability to explicitly set an inactive array to 'clean'. · 5bf29597

Committed by NeilBrown on May 07, 2009

Being able to write 'clean' to an 'array_state' of an inactive array
to activate it in 'clean' mode is both unnecessary and inconvenient.

It is unnecessary because the same can be achieved by writing
'active'.  This activates an array, but it still remains 'clean'
until the first write.

It is inconvenient because writing 'clean' is more often used to
cause an 'active' array to revert to 'clean' mode (thus blocking
any writes until a 'write-pending' is promoted to 'active').

Allowing 'clean' to both activate an array and mark an active array as
clean can lead to races:  One program writes 'clean' to mark the
active array as clean at the same time as another program writes
'inactive' to deactivate (stop) an active array.  Depending on which
writes first, the array could be deactivated and immediately
reactivated which isn't what was desired.

So just disable the use of 'clean' to activate an array.

This avoids a race that can be triggered with mdadm-3.0 and external
metadata, so it is suitable for -stable.
Reported-by: Rafal Marszewski <rafal.marszewski@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>

5bf29597

md: constify VFTs · 110518bc

Committed by Jan Engelhardt on May 07, 2009

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: NeilBrown <neilb@suse.de>

110518bc

md: tidy up status_resync to handle large arrays. · dd71cf6b

Committed by NeilBrown on May 07, 2009

Two problems in status_resync.
1/ It still uses kilobytes as the basic block unit, while most code
   now uses sectors uniformly.
2/ It doesn't allow for the possibility that max_sectors exceeds
   the range of "unsigned long".

So
 - change "max_blocks" to "max_sectors", and store sector numbers
   in there and in 'resync'
 - Make 'rt' a 'sector_t' so it can temporarily hold the number of
   remaining sectors.
 - use sector_div rather than normal division.
 - change the magic '100' used to preserve precision to '32'.
   + making it a power of 2 makes division easier
   + it doesn't need to be as large as it was chosen when we averaged
     speed over the entire run.  Now we average speed over the last 30
     seconds or so.
Reported-by: "Mario 'BitKoenig' Holbe" <Mario.Holbe@TU-Ilmenau.DE>
Signed-off-by: NeilBrown <neilb@suse.de>
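
The role of the precision factor of 32 can be illustrated with an integer-only estimate of the remaining time. This is a sketch loosely modelled on the description in the message, not the body of status_resync; the parameter names `db` (sectors completed in the sampling window) and `dt` (window length in seconds) are assumptions, while `rt`, `max_sectors` and `resync` are the names the message uses.

```python
def remaining_seconds(max_sectors, resync, db, dt):
    """Estimate the seconds left in a resync using integer arithmetic only."""
    dt = max(dt, 1)            # avoid a zero-length window
    rt = max_sectors - resync  # remaining sectors
    rt //= db // 32 + 1        # divide by the window's progress, scaled down by 32
    rt *= dt
    return rt >> 5             # undo the scaling (2**5 == 32)


estimate = remaining_seconds(2**33, 0, 3_200_000, 30)
exact = 2**33 / (3_200_000 / 30)
print(estimate, round(exact))
```

Dividing by `db // 32 + 1` first keeps the intermediate value small, and because 32 is a power of two the final correction is a plain shift; the result stays close to the floating-point estimate.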

dd71cf6b

17 Apr, 2009 1 commit

md: update sync_completed and reshape_position even more often. · c03f6a19

Committed by NeilBrown on Apr 17, 2009

There are circumstances when a user-space process might need to
"oversee" a resync/reshape process.  For example when doing an
in-place reshape of a raid5, it is prudent to take a backup of each
section before reshaping it as this is the only way to provide
safety against an unplanned shutdown (i.e. crash/power failure).

The sync_max sysfs value can be used to stop the resync from
advancing beyond a particular point.
So user-space can:
  suspend IO to the first section and back it up
  set 'sync_max' to the end of the section
  wait for 'sync_completed' to reach that point
  resume IO on the first section and move on to the next section.

However this process requires the kernel and user-space to run in
lock-step which could introduce unnecessary delays.

It would be better if a 'double buffered' approach could be used with
userspace and kernel space working on different sections with the
'next' section always ready when the 'current' section is finished.

One problem with implementing this is that sync_completed is only
guaranteed to be updated when the sync process reaches sync_max.
(it is updated on a time basis at other times, but it is hard to rely
on that).  This defeats some of the double buffering.

With this patch, sync_completed (and reshape_position) get updated as
the current position approaches sync_max, so there is room for
userspace to advance sync_max early without losing updates.

To be precise, sync_completed is updated when the current sync
position reaches half way between the current value of sync_completed
and the value of sync_max.  This will usually be a good time for user
space to update sync_max.

If sync_max does not get updated, the updates to sync_completed
(together with associated metadata updates) will occur at an
exponentially increasing frequency which will get unreasonably fast
(one update every page) immediately before the process hits sync_max
and stops.  So the update rate will be unreasonably fast only for an
insignificant period of time.
Signed-off-by: NeilBrown <neilb@suse.de>
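
The half-way rule in this message, and the accelerating updates it produces when sync_max is left alone, can be simulated directly. This is a sketch of the stated rule only; `update_points` and its `granularity` parameter are hypothetical, and positions are plain integers rather than real sector numbers.

```python
def update_points(sync_completed, sync_max, granularity=1):
    """Positions at which sync_completed is refreshed if sync_max never moves."""
    points = []
    while sync_max - sync_completed > granularity:
        # Refresh when the sync position is half way to sync_max.
        sync_completed += (sync_max - sync_completed) // 2
        points.append(sync_completed)
    return points


points = update_points(0, 1024)
gaps = [b - a for a, b in zip([0] + points, points)]
print(points)
print(gaps)
```

Each gap between refreshes is half the previous one, so the updates bunch up only in the short final stretch before sync_max, matching the message's remark that the fast update rate lasts for an insignificant period.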

c03f6a19

14 Apr, 2009 2 commits

md: improve usefulness and accuracy of sysfs file md/sync_completed. · acb180b0

Committed by NeilBrown on Apr 14, 2009

The sync_completed file reports how much of a resync (or recovery or
reshape) has been completed.
However due to the possibility of out-of-order completion of writes,
it is not certain to be accurate.

We have an internal value - mddev->curr_resync_completed - which is an
accurate value (though it might not always be quite so uptodate).

So:
 - make curr_resync_completed be uptodate a little more often,
   particularly when raid5 reshape updates status in the metadata
 - report curr_resync_completed in the sysfs file
 - allow poll/select to report all updates to md/sync_completed.

This makes sync_completed completely usable by any external metadata
handler that wants to record this status information in its metadata.
Signed-off-by: NeilBrown <neilb@suse.de>

acb180b0

md: allow setting newly added device to 'in_sync' via sysfs. · 6d56e278

Committed by NeilBrown on Apr 14, 2009

When adding devices to an active array via sysfs, there is currently
no way to mark a device as 'in-sync' which is useful when
incrementally assembling an array.

So add that option.
Signed-off-by: NeilBrown <neilb@suse.de>

6d56e278
