1. 18 Oct 2011 — 1 commit
    • md: clear In_sync bit on devices added to an active array. · d30519fc
      Committed by NeilBrown
      When we add a device to an active array it can be meaningful to set
      the 'In_sync' flag.  This indicates that the device is in-sync with the
      array except for locations recorded in the bitmap.
      A bitmap-based recovery can then bring it completely in-sync.
      
      Internally we move that flag to 'saved_raid_disk' but forgot to clear
      In_sync like we do in add_new_disk.
      
      So clear In_sync after moving its value to saved_raid_disk.
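      A minimal, self-contained model of the fix (the struct and helper below
      are illustrative stand-ins, not the kernel's types):

        #include <stdbool.h>

        /* Model of the corrected hot-add path: remember the in-sync hint in
         * saved_raid_disk, then clear the live flag so a bitmap-based
         * recovery actually runs.  Names mirror md's, but this is a sketch. */
        struct rdev_model {
            int  raid_disk;        /* slot in the array, -1 if none */
            int  saved_raid_disk;  /* hint for bitmap-based recovery */
            bool in_sync;          /* models the In_sync flag */
        };

        static void hot_add_device(struct rdev_model *rdev, int slot)
        {
            rdev->raid_disk = slot;
            /* preserve the in-sync hint for the recovery code ... */
            rdev->saved_raid_disk = rdev->in_sync ? slot : -1;
            /* ... and clear the live flag: the step the old code forgot */
            rdev->in_sync = false;
        }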
      Reported-by: Andrei Warkentin <andreiw@vmware.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  2. 11 Oct 2011 — 4 commits
  3. 07 Oct 2011 — 1 commit
  4. 23 Sep 2011 — 1 commit
    • md: don't delay reboot by 1 second if no MD devices exist · 2dba6a91
      Committed by Daniel P. Berrange
      The md_notify_reboot() method includes a call to mdelay(1000),
      to deal with "exotic SCSI devices" which are too volatile on
      reboot. The delay is unconditional. Even if the machine does
      not have any block devices, let alone MD devices, the kernel
      shutdown sequence is slowed down.
      
      1 second does not matter much with physical hardware, but with
      certain virtualization use cases any wasted time in the bootup
      & shutdown sequence counts for a lot.
      
      * drivers/md/md.c: md_notify_reboot() - only impose a delay if
        there was at least one MD device to be stopped during reboot
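      A sketch of the resulting logic (a self-contained model, not the kernel
      source; the list walk stands in for md's device iteration):

        #include <stdbool.h>
        #include <stddef.h>

        struct md_dev_model { bool active; struct md_dev_model *next; };

        /* Stop every active array and report whether anything was stopped. */
        static bool stop_all_arrays(struct md_dev_model *devs)
        {
            bool stopped_any = false;

            for (struct md_dev_model *d = devs; d != NULL; d = d->next)
                if (d->active) {
                    d->active = false;      /* "stop" the array */
                    stopped_any = true;
                }
            return stopped_any;
        }

        /* In md_notify_reboot() the grace period for "exotic SCSI devices"
         * then becomes conditional: if (stop_all_arrays(...)) mdelay(1000); */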
      Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  5. 21 Sep 2011 — 1 commit
    • md: Avoid waking up a thread after it has been freed. · 01f96c0a
      Committed by NeilBrown
      Two related problems:
      
      1/ some error paths call "md_unregister_thread(mddev->thread)"
         without subsequently clearing ->thread.  A subsequent call
         to mddev_unlock will try to wake the thread, and crash.
      
      2/ Most calls to md_wakeup_thread are protected against the thread
         disappearing either by:
            - holding the ->mutex
            - having an active request, so something else must be keeping
              the array active.
         However mddev_unlock calls md_wakeup_thread after dropping the
         mutex and without any certainty of an active request, so the
         ->thread could theoretically disappear.
         So we need a spinlock to provide some protection.
      
      So change md_unregister_thread to take a pointer to the thread
      pointer, and ensure that it always does the required locking and
      clears the pointer properly.
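      A user-space model of the new convention (a sketch, not md's code; a
      pthread mutex stands in for the kernel spinlock):

        #include <pthread.h>
        #include <stdlib.h>

        struct md_thread_model { int wakeups; }; /* stand-in for md_thread */

        static pthread_mutex_t thread_lock = PTHREAD_MUTEX_INITIALIZER;

        /* Takes the *address* of the caller's thread pointer so it can be
         * cleared under the lock before the thread is freed. */
        static void unregister_thread(struct md_thread_model **threadp)
        {
            struct md_thread_model *t;

            pthread_mutex_lock(&thread_lock);
            t = *threadp;
            *threadp = NULL;                /* wakers now see NULL ... */
            pthread_mutex_unlock(&thread_lock);
            free(t);                        /* ... so freeing is safe */
        }

        static void wakeup_thread(struct md_thread_model **threadp)
        {
            pthread_mutex_lock(&thread_lock);
            if (*threadp)
                (*threadp)->wakeups++;  /* can't vanish while lock is held */
            pthread_mutex_unlock(&thread_lock);
        }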
      Reported-by: "Moshe Melnikov" <moshe@zadarastorage.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      cc: stable@kernel.org
  6. 10 Sep 2011 — 1 commit
    • md: Fix handling for devices from 2TB to 4TB in 0.90 metadata. · 27a7b260
      Committed by NeilBrown
      0.90 metadata uses an unsigned 32bit number to count the number of
      kilobytes used from each device.
      This should allow up to 4TB per device.
      However we multiply this by 2 (to get sectors) before casting to a
      larger type, so sizes above 2TB get truncated.
      
      Also we allow rdev->sectors to be larger than 4TB, so it is possible
      for the array to be resized larger than the metadata can handle.
      So make sure rdev->sectors never exceeds 4TB when 0.90 metadata is in
      use.
      
      Also the sanity check at the end of super_90_load should include level
      1, as it uses ->size too.  (RAID0 and Linear don't use ->size at all.)
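      The truncation is easy to reproduce in isolation (illustrative only;
      'kb' models the 0.90 per-device kilobyte count):

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
            uint32_t kb = UINT32_C(3) << 30;      /* 3TB counted in 1K blocks */

            uint64_t wrong = (uint64_t)(kb * 2);  /* *2 wraps in 32 bits: 1TB */
            uint64_t right = (uint64_t)kb * 2;    /* widen first: 3TB */

            printf("wrong: %llu sectors\n", (unsigned long long)wrong);
            printf("right: %llu sectors\n", (unsigned long long)right);
            return 0;
        }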
      Reported-by: Pim Zandbergen <P.Zandbergen@macroscoop.nl>
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
  7. 30 Aug 2011 — 1 commit
  8. 25 Aug 2011 — 3 commits
  9. 28 Jul 2011 — 9 commits
    • md/raid10: record bad blocks as needed during recovery. · e875ecea
      Committed by NeilBrown
      When recovering one or more devices, if all the good devices have
      bad blocks we should record a bad block on the device being rebuilt.
      
      If this fails, we need to abort the recovery.
      
      To ensure we don't think that we aborted later than we actually did,
      we need to move the check for MD_RECOVERY_INTR earlier in md_do_sync,
      in particular before mddev->curr_resync is updated.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: make it easier to wait for bad blocks to be acknowledged. · de393cde
      Committed by NeilBrown
      It is only safe to choose not to write to a bad block if that bad
      block is safely recorded in metadata - i.e. if it has been
      'acknowledged'.
      
      If it hasn't we need to wait for the acknowledgement.
      
      We support that using the rdev->blocked wait machinery and
      md_wait_for_blocked_rdev, by introducing a new device flag
      'BlockedBadBlocks'.
      
      This flag is only advisory.
      It is cleared whenever we acknowledge a bad block, so that a waiter
      can re-check the particular bad blocks that it is interested in.
      
      It should be set by a caller when they find they need to wait.
      This (set after test) is inherently racy, but as
      md_wait_for_blocked_rdev already has a timeout, losing the race will
      have minimal impact.
      
      When we clear "Blocked" we also clear "BlockedBadBlocks", in case it
      was set incorrectly (see the race above).
      
      We also modify the way we manage 'Blocked' to fit better with the new
      handling of 'BlockedBadBlocks' and to make it consistent between
      externally managed and internally managed metadata.  This requires
      that each raidXd loop checks whether the metadata needs to be written
      and triggers a write (md_check_recovery) if needed.  Otherwise a
      queued write request might cause raidXd to wait for the metadata to
      be written, and only that thread can write it.
      
      Before writing metadata, we set FaultRecorded for all devices that
      are Faulty, then after writing the metadata we clear Blocked for any
      device for which the Fault was certainly Recorded.
      
      The 'faulty' device flag now appears in sysfs if the device is faulty
      *or* it has unacknowledged bad blocks.  So user-space which does not
      understand bad blocks can continue to function correctly.
      User-space which does should not assume a device is faulty until it
      sees the 'faulty' flag and then sees that the list of unacknowledged
      bad blocks is empty.
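      A user-space model of the advisory-flag protocol described above (a
      sketch; a pthread condition variable stands in for md's wait queue,
      and the 5-second timeout is arbitrary):

        #include <pthread.h>
        #include <stdbool.h>
        #include <time.h>

        static pthread_mutex_t bb_lock = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t  bb_cond = PTHREAD_COND_INITIALIZER;
        static bool blocked_badblocks;         /* models the advisory flag */

        /* Waiter: decides it must wait, then sets the flag (set after
         * test).  Losing the race against an ack costs one timeout. */
        static void wait_for_acknowledgement(bool (*still_unacked)(void))
        {
            struct timespec to;

            pthread_mutex_lock(&bb_lock);
            while (still_unacked()) {
                blocked_badblocks = true;
                clock_gettime(CLOCK_REALTIME, &to);
                to.tv_sec += 5;                /* timeout bounds the race */
                pthread_cond_timedwait(&bb_cond, &bb_lock, &to);
            }
            pthread_mutex_unlock(&bb_lock);
        }

        /* Acknowledger: clears the flag on every ack so waiters re-check. */
        static void acknowledge_bad_block(void)
        {
            pthread_mutex_lock(&bb_lock);
            blocked_badblocks = false;
            pthread_cond_broadcast(&bb_cond);
            pthread_mutex_unlock(&bb_lock);
        }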
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: add 'write_error' flag to component devices. · d7a9d443
      Committed by NeilBrown
      If a device has ever seen a write error, we will want to handle
      known-bad-blocks differently.
      So create an appropriate state flag and export it via sysfs.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Namhyung Kim <namhyung@gmail.com>
    • md/raid1: avoid reading from known bad blocks. · d2eb35ac
      Committed by NeilBrown
      Now that we have a bad block list, we should not read from those
      blocks.
      There are several main parts to this:
        1/ read_balance needs to check for bad blocks, and return not only
           the chosen device, but also how many good blocks are available
           there.
        2/ fix_read_error needs to avoid trying to read from bad blocks.
        3/ read submission must be ready to issue multiple reads to
           different devices as different bad blocks on different devices
           could mean that a single large read cannot be served by any one
           device, but can still be served by the array.
           This requires keeping count of the number of outstanding requests
           per bio.  This count is stored in 'bi_phys_segments'.
        4/ retrying a read needs to also be ready to submit a smaller read
           and queue another request for the rest.
      
      This does not yet handle bad blocks when reading to perform resync,
      recovery, or check.
      
      'md_trim_bio' will also be used for RAID10, so put it in md.c and
      export it.
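      The splitting decision in part 3 reduces to computing the usable
      prefix on each device (an illustrative helper, not raid1's actual
      read_balance code):

        #include <stdint.h>

        typedef uint64_t sector_t;

        /* Given a read of 'len' sectors at 'start' and the first bad sector
         * at or after 'start' on this device, return how many sectors this
         * device can serve; the caller queues the remainder elsewhere. */
        static sector_t good_sectors(sector_t start, sector_t len,
                                     sector_t first_bad)
        {
            if (first_bad <= start)
                return 0;              /* begins in a bad range: skip dev */
            if (first_bad >= start + len)
                return len;            /* whole read is clean */
            return first_bad - start;  /* serve prefix, retry the rest */
        }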
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Disable bad blocks for v0.90 metadata. · 9f2f3830
      Committed by NeilBrown
      v0.90 metadata cannot record bad blocks, so when loading metadata
      for such a device, set shift to -1.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: load/store badblock list from v1.x metadata · 2699b672
      Committed by NeilBrown
      Space must have been allocated when the array was created.
      A feature flag is set when the badblock list is non-empty, to
      ensure old kernels don't load and trust the whole device.
      
      We only update the on-disk badblocklist when it has changed.
      If the badblocklist (or other metadata) is stored on a bad block, we
      don't cope very well.
      
      If the metadata has no room for a bad block list, flag bad blocks as
      disabled, and do the same for 0.90 metadata.
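      As a sketch of the on-disk layout (my reading of the v1.x bad-block
      log; treat the exact 54/10 bit split as an assumption): each entry is
      one little-endian 64-bit word, sector in the high bits, length in the
      low 10 bits.

        #include <stdint.h>

        static uint64_t bb_encode(uint64_t sector, unsigned len)
        {
            return (sector << 10) | (len & 0x3ffu); /* len in low 10 bits */
        }

        static void bb_decode(uint64_t bb, uint64_t *sector, unsigned *len)
        {
            *sector = bb >> 10;
            *len    = (unsigned)(bb & 0x3ffu);
        }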
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/bad-block-log: add sysfs interface for accessing bad-block-log. · 16c791a5
      Committed by NeilBrown
      This can show the log (provided it fits in one page) and allows bad
      blocks to be 'acknowledged', meaning that they have safely been
      recorded in metadata.
      
      Clearing bad blocks is not allowed via sysfs (except for
      code testing).  A bad block can only be cleared when
      a write to the block succeeds.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Namhyung Kim <namhyung@gmail.com>
    • md: beginnings of bad block management. · 2230dfe4
      Committed by NeilBrown
      This is the first step in allowing md to track bad blocks per-device
      so that we can fail individual blocks rather than the whole device.
      
      This patch just adds a data structure for recording bad blocks, with
      routines to add, remove, and search the list.
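      A compact model of such a structure (a sketch: sorted 64-bit entries
      packing sector, length, and an 'acknowledged' bit, found by binary
      search; the exact bit layout here is illustrative):

        #include <stdint.h>

        #define BB_MAKE(sec, len, ack) \
                (((uint64_t)(sec) << 9) | ((uint64_t)((len) - 1)) | \
                 ((uint64_t)!!(ack) << 63))
        #define BB_OFFSET(x)  (((x) >> 9) & ((UINT64_C(1) << 54) - 1))
        #define BB_LEN(x)     ((unsigned)((x) & 0x1ff) + 1)
        #define BB_ACK(x)     (!!((x) >> 63))

        /* Index of the last entry starting at or before 'sector' (the only
         * candidate that can overlap it), or -1 if none does. */
        static int bb_search(const uint64_t *p, int count, uint64_t sector)
        {
            int lo = 0, hi = count;

            while (lo < hi) {
                int mid = (lo + hi) / 2;
                if (BB_OFFSET(p[mid]) <= sector)
                    lo = mid + 1;
                else
                    hi = mid;
            }
            return lo - 1;
        }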
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Namhyung Kim <namhyung@gmail.com>
    • md: remove suspicious size_of() · a519b26d
      Committed by NeilBrown
      When calling bioset_create we pass the size of the front_pad as
         sizeof(mddev)
      which looks suspicious as mddev is a pointer and so it looks like a
      common mistake where
         sizeof(*mddev)
      was intended.
      The size is actually correct as we want to store a pointer in the
      front padding of the bios created by the bioset, so make the intent
      more explicit by using
         sizeof(mddev_t *)
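      The point is easy to demonstrate (illustrative struct; the printed
      sizes assume a 64-bit build):

        #include <stdio.h>

        struct mddev_model { char payload[4096]; }; /* stand-in for mddev_t */

        int main(void)
        {
            struct mddev_model *mddev = NULL;

            printf("sizeof(mddev)  = %zu\n", sizeof(mddev));  /* 8: a pointer */
            printf("sizeof(*mddev) = %zu\n", sizeof(*mddev)); /* 4096: too big */
            printf("sizeof(struct mddev_model *) = %zu\n",
                   sizeof(struct mddev_model *));             /* 8, and clear */
            return 0;
        }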
      Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  10. 27 Jul 2011 — 5 commits
  11. 21 Jul 2011 — 1 commit
  12. 28 Jun 2011 — 1 commit
    • md: avoid endless recovery loop when waiting for a failed device to complete. · 4274215d
      Committed by NeilBrown
      If a device fails in a way that causes pending requests to take a
      while to complete, md will not be able to immediately remove it from
      the array in remove_and_add_spares.
      It will then incorrectly look like a spare device and md will try to
      recover it even though it has failed.
      This leads to a recovery process starting and instantly aborting over
      and over again.
      
      We should check if the device is faulty before considering it to be a
      spare.  This will avoid trying to start a recovery that cannot
      proceed.
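      A sketch of the corrected test (a model of the spare-selection
      condition, not the kernel source):

        #include <stdbool.h>

        struct rdev_model { int raid_disk; bool faulty; bool in_sync; };

        /* A device with no slot is only a recovery candidate if it is not
         * marked Faulty: that extra test is what stops the abort loop. */
        static bool is_spare_candidate(const struct rdev_model *rdev)
        {
            return rdev->raid_disk < 0 && !rdev->faulty;
        }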
      
      This bug was introduced in 2.6.26, so this patch is suitable for any
      kernel since then.
      
      Cc: stable@kernel.org
      Reported-by: Jim Paradis <james.paradis@stratus.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  13. 09 Jun 2011 — 2 commits
    • md: check ->hot_remove_disk when removing disk · 01393f3d
      Committed by Namhyung Kim
      Check pers->hot_remove_disk instead of pers->hot_add_disk in
      slot_store() during disk removal.  The linear personality only has
      ->hot_add_disk and no ->hot_remove_disk, so removing a disk from the
      array resulted in the following kernel bug:
      
      $ sudo mdadm --create /dev/md0 --level=linear --raid-devices=4 /dev/loop[0-3]
      $ echo none | sudo tee /sys/block/md0/md/dev-loop2/slot
       BUG: unable to handle kernel NULL pointer dereference at           (null)
       IP: [<          (null)>]           (null)
       PGD c9f5d067 PUD 8575a067 PMD 0
       Oops: 0010 [#1] SMP
       CPU 2
       Modules linked in: linear loop bridge stp llc kvm_intel kvm asus_atk0110 sr_mod cdrom sg
      
       Pid: 10450, comm: tee Not tainted 3.0.0-rc1-leonard+ #173 System manufacturer System Product Name/P5G41TD-M PRO
       RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
       RSP: 0018:ffff880085757df0  EFLAGS: 00010282
       RAX: ffffffffa00168e0 RBX: ffff8800d1431800 RCX: 000000000000006e
       RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff88008543c000
       RBP: ffff880085757e48 R08: 0000000000000002 R09: 000000000000000a
       R10: 0000000000000000 R11: ffff88008543c2e0 R12: 00000000ffffffff
       R13: ffff8800b4641000 R14: 0000000000000005 R15: 0000000000000000
       FS:  00007fe8c9e05700(0000) GS:ffff88011fa00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
       CR2: 0000000000000000 CR3: 00000000b4502000 CR4: 00000000000406e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
       Process tee (pid: 10450, threadinfo ffff880085756000, task ffff8800c9f08000)
       Stack:
        ffffffff8138496a ffff8800b4641000 ffff88008543c268 0000000000000000
        ffff8800b4641000 ffff88008543c000 ffff8800d1431868 ffffffff81a78a90
        ffff8800b4641000 ffff88008543c000 ffff8800d1431800 ffff880085757e98
       Call Trace:
        [<ffffffff8138496a>] ? slot_store+0xaa/0x265
        [<ffffffff81384bae>] rdev_attr_store+0x89/0xa8
        [<ffffffff8115a96a>] sysfs_write_file+0x108/0x144
        [<ffffffff81106b87>] vfs_write+0xb1/0x10d
        [<ffffffff8106e6c0>] ? trace_hardirqs_on_caller+0x111/0x135
        [<ffffffff81106cac>] sys_write+0x4d/0x77
        [<ffffffff814fe702>] system_call_fastpath+0x16/0x1b
       Code:  Bad RIP value.
       RIP  [<          (null)>]           (null)
        RSP <ffff880085757df0>
       CR2: 0000000000000000
       ---[ end trace ba5fc64319a826fb ]---
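      A sketch of the fix (a minimal model of the personality-dispatch bug,
      not the kernel code):

        #include <errno.h>
        #include <stddef.h>

        struct pers_model {
            int (*hot_add_disk)(void);
            int (*hot_remove_disk)(void);   /* NULL for linear */
        };

        static int slot_store_remove(const struct pers_model *pers)
        {
            /* old code tested ->hot_add_disk here, then called NULL */
            if (pers->hot_remove_disk == NULL)
                return -EINVAL;
            return pers->hot_remove_disk();
        }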
      Signed-off-by: Namhyung Kim <namhyung@gmail.com>
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
  14. 08 Jun 2011 — 5 commits
  15. 11 May 2011 — 3 commits
    • md: allow resync_start to be set while an array is active. · b098636c
      Committed by NeilBrown
      The sysfs attribute 'resync_start' (known internally as recovery_cp),
      records where a resync is up to.  A value of 0 means the array is
      not known to be in-sync at all.  A value of MaxSector means the array
      is believed to be fully in-sync.
      
      When the size of the member devices of an array (RAID1, RAID4/5/6)
      is increased, the array can be increased to match.
      resync_start to the old end-of-device offset so that the new part of
      the array gets resynced.
      
      However with RAID1 (and RAID6) a resync is not technically necessary
      and may be undesirable.  So it would be good if the implied resync
      after the array is resized could be avoided.
      
      So: change 'resync_start' so the value can be changed while the array
      is active, and as a precaution only allow it to be changed while
      resync/recovery is 'frozen'.  Changing it once resync has started is
      not going to be useful anyway.
      
      This allows the array to be resized without a resync by:
        write 'frozen' to 'sync_action'
        write new size to 'component_size' (this will set resync_start)
        write 'none' to 'resync_start'
        write 'idle' to 'sync_action'.
      
      Also slightly improve some tests on recovery_cp when resizing
      raid1/raid5.  Now that an arbitrary value can be set, we should be
      more careful in our tests.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: reject a re-add request that cannot be honoured. · bedd86b7
      Committed by NeilBrown
      The 'add_new_disk' ioctl can be used to add a device either as a
      spare, or as an active disk that just needs to be resynced based on
      write-intent-bitmap information (re-add).
      
      Currently if a re-add is requested but fails, we add the device as a
      spare instead.  This makes it impossible for user-space to check for
      failure.
      
      So change to require that a re-add attempt will either succeed or
      completely fail.  User-space can then decide what to do next.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Fix race when creating a new md device. · b0140891
      Committed by NeilBrown
      There is a race when creating an md device by opening /dev/mdXX.
      
      If two processes do this at much the same time they will follow the
      call path
        __blkdev_get -> get_gendisk -> kobj_lookup
      
      The first will call
        -> md_probe -> md_alloc -> add_disk -> blk_register_region
      
      and the race happens when the second gets to kobj_lookup after
      add_disk has called blk_register_region but before it returns to
      md_alloc.
      
      In that case the second will not call md_probe (as the probe is
      already done) but will get a handle on the gendisk and return to
      __blkdev_get, which will then call md_open (via the ->open pointer).
      
      As mddev->gendisk hasn't been set yet, md_open will think something
      is wrong and return -ERESTARTSYS.
      
      This can loop endlessly while the first thread makes no progress
      through add_disk.  Nothing is blocking it, but due to scheduler
      behaviour it doesn't get a turn.
      So this is essentially a live-lock.
      
      We fix this by simply moving the assignment to mddev->gendisk before
      the call to add_disk(), so md_open doesn't get confused.
      Also move blk_queue_flush earlier because add_disk should be as late
      as possible.
      
      To make sure that md_open doesn't complete until md_alloc has done all
      that is needed, we take mddev->open_mutex during the last part of
      md_alloc.  md_open will wait for this.
      
      This can cause a lock-up on boot so Cc:ing for stable.
      For 2.6.36 and earlier a different patch will be needed as the
      'blk_queue_flush' call isn't there.
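      A user-space model of the ordering (a sketch, not the kernel code; a
      pthread mutex stands in for open_mutex):

        #include <pthread.h>
        #include <stddef.h>

        struct mddev_model {
            pthread_mutex_t open_mutex;
            void *gendisk;                 /* NULL until fully set up */
        };

        static void md_alloc_model(struct mddev_model *m, void *disk)
        {
            pthread_mutex_lock(&m->open_mutex);
            m->gendisk = disk;         /* set BEFORE the disk is published */
            /* add_disk() would go here: only now can an open find us,
             * and it will block on open_mutex until setup finishes */
            pthread_mutex_unlock(&m->open_mutex);
        }

        static int md_open_model(struct mddev_model *m)
        {
            int ret;

            pthread_mutex_lock(&m->open_mutex);  /* waits out md_alloc */
            ret = (m->gendisk != NULL) ? 0 : -1; /* no endless retry loop */
            pthread_mutex_unlock(&m->open_mutex);
            return ret;
        }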
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reported-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
      Tested-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
      Cc: stable@kernel.org
  16. 20 Apr 2011 — 1 commit