Commits · 50c37b136a3807eda44afe16529b5af701ec49f5 · openeuler / raspberrypi-kernel

22 Apr, 2015 6 commits

md: don't require sync_min to be a multiple of chunk_size. · 50c37b13

Authored by NeilBrown on Mar 23, 2015

There is really no need for sync_min to be a multiple of
chunk_size, and values read from here often aren't.
That means you cannot read a value and expect to be able
to write it back later.

So remove the chunk_size check, and round down to a multiple
of 4K, to be sure everything works with 4K-sector devices.
Signed-off-by: NeilBrown <neilb@suse.de>

50c37b13
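
The rounding described in this commit can be sketched in plain C. This is a minimal userspace illustration, not code from the patch: it assumes sync_min is counted in 512-byte sectors (so 4K is 8 sectors), and the helper name `round_down_to_4k` is hypothetical.

```c
#include <stdint.h>

/* Assumption: values are in 512-byte sectors, so 4K corresponds to 8 sectors. */
#define SECTORS_PER_4K 8u

/* Round a sector count down to the nearest multiple of 4K (hypothetical helper). */
uint64_t round_down_to_4k(uint64_t sectors)
{
    /* SECTORS_PER_4K is a power of two, so masking the low bits rounds down. */
    return sectors & ~(uint64_t)(SECTORS_PER_4K - 1);
}
```

Because the value is rounded rather than rejected, a value that is not chunk-aligned can still be written back.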

md-cluster: re-add capabilities · 97f6cd39

Authored by Goldwyn Rodrigues on Apr 14, 2015

When "re-add" is written to /sys/block/mdXX/md/dev-YYY/state,
the clustered md:

1. Sends RE_ADD message with the desc_nr. Nodes receiving the message
   clear the Faulty bit in their respective rdev->flags.
2. The node initiating re-add, gathers the bitmaps of all nodes
   and copies them into the local bitmap. It does not clear the bitmap
   from which it is copying.
3. Initiating node schedules a md recovery to sync the devices.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>

97f6cd39
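
Step 2 of this commit (gathering the other nodes' bitmaps into the local one without clearing the source) amounts to a bitwise OR. The sketch below assumes each bitmap is a flat array of 64-bit words, which is a simplification chosen for illustration; the function name `merge_bitmap` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * OR every set bit of a remote node's bitmap into the local bitmap.
 * The source is const: it is only read, never cleared.
 */
void merge_bitmap(uint64_t *local, const uint64_t *remote, size_t words)
{
    for (size_t i = 0; i < words; i++)
        local[i] |= remote[i];
}
```

Calling it once per remote node leaves the local bitmap covering every block that any node had marked dirty, which is what the subsequent recovery needs to sync.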

md: re-add a failed disk · a6da4ef8

Authored by Goldwyn Rodrigues on Apr 14, 2015

This adds the capability of re-adding a failed disk by
writing "re-add" to /sys/block/mdXX/md/dev-YYY/state.

This facilitates adding disks which have encountered a temporary
error such as a network disconnection/hiccup in an iSCSI device,
or a SAN cable disconnection which has been restored. In such
a situation, you do not need to remove and re-add the device.
Writing re-add to the failed device's state would add it again
to the array and perform the recovery of only the blocks which
were written after the device failed.

This works for generic md, and is not related to clustering. However,
this patch is to ease re-add operations listed above in clustering
environments.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>

a6da4ef8
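
From userspace, the re-add request in this commit is a plain write to the sysfs state file. The following is a hedged sketch: the two helper names are hypothetical, and the array and device names passed in are placeholders for the mdXX and dev-YYY parts of the path.

```c
#include <stdio.h>

/* Build /sys/block/<array>/md/dev-<disk>/state; 0 on success, -1 if buf is too small. */
int md_state_path(char *buf, size_t len, const char *array, const char *disk)
{
    int n = snprintf(buf, len, "/sys/block/%s/md/dev-%s/state", array, disk);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write "re-add" to the given state file; 0 on success, -1 on any I/O error. */
int md_request_readd(const char *state_path)
{
    FILE *f = fopen(state_path, "w");
    if (!f)
        return -1;
    int ok = fputs("re-add", f) >= 0;
    if (fclose(f) != 0)   /* the kernel's verdict may only surface at flush time */
        ok = 0;
    return ok ? 0 : -1;
}
```

A caller would build the path with `md_state_path()` and pass it to `md_request_readd()`; the write needs root privileges and a device that is currently marked failed.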

md-cluster: remove capabilities · 88bcfef7

Authored by Goldwyn Rodrigues on Apr 14, 2015

This adds "remove" capabilities for the clustered environment.
When a user initiates removal of a device from the array, a
REMOVE message with disk number in the array is sent to all
the nodes which kick the respective device in their own array.

This facilitates the removal of failed devices.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>

88bcfef7

md: Export and rename find_rdev_nr_rcu · 57d051dc

Authored by Goldwyn Rodrigues on Apr 14, 2015

This is required by the clustering module (patches to follow) to
find the device to remove or re-add.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>

57d051dc

md: Export and rename kick_rdev_from_array · fb56dfef

Authored by Goldwyn Rodrigues on Apr 14, 2015

This export is required by the clustering module in order to
coordinate removing or re-adding an rdev on all nodes.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>

fb56dfef

08 Apr, 2015 1 commit

md: fix md io stats accounting broken · 74672d06

Authored by Gu Zheng on Apr 03, 2015

Simon reported the md io stats accounting issue:
"
I'm seeing "iostat -x -k 1" print this after a RAID1 rebuild on 4.0-rc5.
It's not abnormal other than it's 3-disk, with one being SSD (sdc) and
the other two being write-mostly:

Device:  rrqm/s  wrqm/s     r/s     w/s   rkB/s   wkB/s avgrq-sz avgqu-sz   await r_await w_await   svctm   %util
sda        0.00    0.00    0.00    0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00    0.00    0.00
sdb        0.00    0.00    0.00    0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00    0.00    0.00
sdc        0.00    0.00    0.00    0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00    0.00    0.00
md0        0.00    0.00    0.00    0.00    0.00    0.00     0.00   345.00    0.00    0.00    0.00    0.00  100.00
md2        0.00    0.00    0.00    0.00    0.00    0.00     0.00 58779.00    0.00    0.00    0.00    0.00  100.00
md1        0.00    0.00    0.00    0.00    0.00    0.00     0.00    12.00    0.00    0.00    0.00    0.00  100.00
"
The cause is that commit "18c0b223" uses generic_start_io_acct
to account the disk stats instead of the open-coded accounting,
and in doing so it also increments .in_flight[rw], which md does not
need. So we go back to the open-coded accounting here to fix it.
Reported-by: Simon Kirby <sim@hostway.ca>
Cc: <stable@vger.kernel.org> 3.19
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: NeilBrown <neilb@suse.de>

74672d06

21 3月, 2015 2 次提交

md: Fix stray --cluster-confirm crash · fa8259da

Authored by Goldwyn Rodrigues on Mar 02, 2015

A --cluster-confirm without an --add (by another node) can
crash the kernel.

Fix it by guarding it using a state.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>

fa8259da

md: fix problems with freeing private data after ->run failure. · 0c35bd47

Authored by NeilBrown on Mar 13, 2015

If ->run() fails, it can either free the data structures it
allocated, or leave that task to ->free() which will be called
on failures.

However:
  md.c calls ->free() even if ->private_data is NULL, which
     causes problems in some personalities.
  raid0.c frees the data, but doesn't clear ->private_data,
     which will become a problem when we fix md.c

So better fix both these issues at once.
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Fixes: 5aa61f42
URL: https://bugzilla.kernel.org/show_bug.cgi?id=94381
Signed-off-by: NeilBrown <neilb@suse.de>

0c35bd47
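
The two fixes in this commit pair up naturally: a failing ->run() that frees its data must also clear the pointer, and the core must not call ->free() when there is nothing to free. The sketch below illustrates that pairing in userspace C. The struct and function names are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stdlib.h>

/* Hypothetical, heavily simplified stand-in for the md device structure. */
struct run_mddev {
    void *private_data;
};

/* A ->run() that fails part-way: it frees what it allocated AND clears the pointer. */
int run_failing(struct run_mddev *m)
{
    m->private_data = malloc(64);
    if (!m->private_data)
        return -1;
    /* ... pretend a later setup step failed here ... */
    free(m->private_data);
    m->private_data = NULL;   /* otherwise the core would free it a second time */
    return -1;
}

/* Core-side cleanup: only free when there is something to free. */
int cleanup_after_run(struct run_mddev *m)
{
    if (!m->private_data)
        return 0;             /* nothing allocated; the ->free() step is skipped */
    free(m->private_data);
    m->private_data = NULL;
    return 1;                 /* the ->free() step ran */
}
```

With both halves in place, either side may own the cleanup after a failure without risking a double free or a call on a NULL pointer.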

25 Feb, 2015 2 commits

md: fix error paths from bitmap_create. · ba599aca

Authored by NeilBrown on Feb 25, 2015

Recent change to bitmap_create mishandles errors.
In particular a failure doesn't always cause 'err' to be set.
Signed-off-by: NeilBrown <neilb@suse.de>

ba599aca

md: mark some attributes as pre-alloc · 750f199e

Authored by NeilBrown on Sep 30, 2014

Since __ATTR_PREALLOC was introduced in v3.19-rc1~78^2~18
it can now be used by md.

This ensures that writing to these sysfs attributes will never
block due to a memory allocation.
Such blocking could become a deadlock if mdmon is trying to
reconfigure an array after a failure prior to re-enabling writes.
Signed-off-by: NeilBrown <neilb@suse.de>

750f199e

23 Feb, 2015 11 commits

Add new disk to clustered array · 1aee41f6

Authored by Goldwyn Rodrigues on Oct 29, 2014

Algorithm:
1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
   ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD)
2. Node 1 sends NEWDISK with uuid and slot number
3. Other nodes issue kobject_uevent_env with uuid and slot number
(Steps 4,5 could be a udev rule)
4. In userspace, the node searches for the disk, perhaps
   using blkid -t SUB_UUID=""
5. Other nodes issue either of the following depending on whether the disk
   was found:
   ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
	 disc.number set to slot number)
   ioctl(CLUSTERED_DISK_NACK)
6. Other nodes drop lock on no-new-devs (CR) if device is found
7. Node 1 attempts EX lock on no-new-devs
8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk
   as SpareLocal
9. If node 1 does not get the no-new-devs lock, it fails the operation and
   sends METADATA_UPDATED
10. Other nodes learn whether the device was added by re-reading the
    superblock after receiving the METADATA_UPDATED message.
Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

1aee41f6

Suspend writes in RAID1 if within range · 589a1c49

Authored by Goldwyn Rodrigues on Jun 07, 2014

If there is a resync going on, all nodes must suspend writes to the
range. This is recorded in the suspend_info/suspend_list.

If there is an I/O within the ranges of any of the suspend_info,
should_suspend will return 1.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

589a1c49
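
The range test behind should_suspend can be sketched as an interval-overlap check. This is an illustration under stated assumptions: the suspend list is modelled as a plain array rather than a kernel list, ranges are treated as half-open sector intervals [lo, hi), and the struct name and function signature are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one suspend_info entry: sectors [lo, hi). */
struct suspend_range {
    uint64_t lo;
    uint64_t hi;
};

/* Return 1 if the I/O covering sectors [start, end) overlaps any suspended range. */
int should_suspend(const struct suspend_range *list, size_t n,
                   uint64_t start, uint64_t end)
{
    for (size_t i = 0; i < n; i++)
        if (end > list[i].lo && start < list[i].hi)
            return 1;
    return 0;
}
```

A write that merely touches the boundary of a range (for example one starting exactly at `hi`) is not suspended under the half-open convention assumed here.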

Send RESYNCING while performing resync start/stop · 965400eb

Authored by Goldwyn Rodrigues on Jun 07, 2014

When a resync is initiated, RESYNCING message is sent to all active
nodes with the range (lo,hi). When the resync is over, a RESYNCING
message is sent with (0,0). A high sector value of zero indicates
that the resync is over.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

965400eb
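
The start/stop convention in this commit (a range for "resync started", (0,0) for "resync over") can be sketched as follows. The message layout and helper names are hypothetical; the commit text does not show the actual on-wire structure.

```c
#include <stdint.h>

/* Hypothetical payload of a RESYNCING message: the sector range being resynced. */
struct resync_msg {
    uint64_t lo;
    uint64_t hi;
};

/* Message announcing that a resync of [lo, hi] has started. */
struct resync_msg resync_start_msg(uint64_t lo, uint64_t hi)
{
    struct resync_msg m = { lo, hi };
    return m;
}

/* Message announcing that the resync has finished: (0,0). */
struct resync_msg resync_done_msg(void)
{
    struct resync_msg m = { 0, 0 };
    return m;
}

/* A receiver only needs to look at hi: zero means the resync is over. */
int resync_is_over(const struct resync_msg *m)
{
    return m->hi == 0;
}
```

Reusing one message type for both events keeps the protocol small, at the cost of reserving a high sector of zero as a sentinel.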

Reload superblock if METADATA_UPDATED is received · 1d7e3e96

Authored by Goldwyn Rodrigues on Jun 07, 2014

Re-reads the devices by invalidating the cache.
Since we don't write to faulty devices, a faulty device is detected using
the events count recorded on each device. If a device's count is older
than the mddev's, mark it as faulty.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

1d7e3e96

metadata_update sends message to other nodes · 293467aa

Authored by Goldwyn Rodrigues on Jun 07, 2014

   - request to send a message
   - make changes to superblock
   - send messages telling everyone that the superblock has changed
   - other nodes all read the superblock
   - other nodes all ack the messages
   - updating node releases the "I'm sending a message" resource.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

293467aa

bitmap_create returns bitmap pointer · f9209a32

Authored by Goldwyn Rodrigues on Jun 06, 2014

This is done to have multiple bitmaps open at the same time.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

f9209a32

Gather on-going resync information of other nodes · 96ae923a

Authored by Goldwyn Rodrigues on Jun 06, 2014

When a node joins, it does not know of other nodes performing resync.
So, each node keeps the resync information in its LVB. When a new
node joins, it reads the LVB of each "online" bitmap.

[TODO] The new node attempts to get the PW lock on the other bitmap; if
it is successful, it reads the bitmap and performs the resync (if
required) on its behalf.

If the node does not get the PW, it requests CR and reads the LVB
for the resync information.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

96ae923a

Add node recovery callbacks · cf921cc1

Authored by Goldwyn Rodrigues on Mar 30, 2014

DLM offers callbacks when a node fails and the lock remastery
is performed:

1. recover_prep: called when DLM discovers a node is down
2. recover_slot: called when DLM identifies the node and recovery
		can start
3. recover_done: called when all nodes have completed recover_slot

recover_slot() and recover_done() are also called when the node joins
initially in order to inform the node with its slot number. These slot
numbers start from one, so we deduct one to make it start with zero
which the cluster-md code uses.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

cf921cc1

Return MD_SB_CLUSTERED if mddev is clustered · ca8895d9

Authored by Goldwyn Rodrigues on Nov 26, 2014

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

ca8895d9

Introduce md_cluster_info · c4ce867f

Authored by Goldwyn Rodrigues on Mar 29, 2014

md_cluster_info stores the cluster information in the MD device.

The join() is called when mddev detects it is a clustered device.
The main responsibilities are:
	1. Set up a DLM lockspace
	2. Set up all initial locks such as super block locks and bitmap lock (will come later)

The leave() clears up the lockspace and all the locks held.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

c4ce867f

Introduce md_cluster_operations to handle cluster functions · edb39c9d

Authored by Goldwyn Rodrigues on Mar 29, 2014

This allows dynamic registering of cluster hooks.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

edb39c9d

06 Feb, 2015 10 commits

md: make reconfig_mutex optional for writes to md sysfs files. · 6791875e

Authored by NeilBrown on Dec 15, 2014

Rather than using mddev_lock() to take the reconfig_mutex
when writing to any md sysfs file, we only take mddev_lock()
in the particular _store() functions that require it.
Admittedly this is most, but it isn't all.

This also allows us to remove special-case handling for new_dev_store
(in md_attr_store).
Signed-off-by: NeilBrown <neilb@suse.de>

6791875e

md: move mddev_lock and related to md.h · 5c47daf6

Authored by NeilBrown on Dec 15, 2014

The one which is not inline (mddev_unlock) gets EXPORTed.

This makes the locking available to personality modules so that it
doesn't have to be imposed upon them.
Signed-off-by: NeilBrown <neilb@suse.de>

5c47daf6

md: use mddev->lock to protect updates to resync_{min,max}. · 23da422b

Authored by NeilBrown on Dec 15, 2014

There are interdependencies between these two sysfs attributes
and whether a resync is currently running.

Rather than depending on reconfig_mutex to ensure no races when
testing these interdependencies are met, use the spinlock.
This will allow the mutex to be removed from protecting this
code in a subsequent patch.
Signed-off-by: NeilBrown <neilb@suse.de>

23da422b

md: minor cleanup in safe_delay_store. · 1b30e66f

Authored by NeilBrown on Dec 15, 2014

There isn't really much room for races with ->safemode_delay.
But as I am trying to clean up any racy code and will soon
be removing reconfig_mutex protection from most _store()
functions:
 - only set mddev->safemode_delay once, to ensure no code
   can see an intermediate value
 - use safemode_timer to call md_safemode_timeout() rather than
   calling it directly, to ensure it never races with itself.
Signed-off-by: NeilBrown <neilb@suse.de>

1b30e66f

md: move GET_BITMAP_FILE ioctl out from mddev_lock. · 4af1a041

Authored by NeilBrown on Dec 15, 2014

It makes more sense to report bitmap_info->file, rather than
bitmap->file (the latter is only available once the array is
active).

With that change, use mddev->lock to protect bitmap_info being
set to NULL, and we can call get_bitmap_file() without taking
the mutex.
Signed-off-by: NeilBrown <neilb@suse.de>

4af1a041

md: tidy up set_bitmap_file · 1e594bb2

Authored by NeilBrown on Dec 15, 2014

1/ delay setting mddev->bitmap_info.file until 'f' looks
   usable, so we don't have to unset it.
2/ Don't allow bitmap file to be set if bitmap_info.file
   is already set.
Signed-off-by: NeilBrown <neilb@suse.de>

1e594bb2

md: remove unnecessary 'buf' from get_bitmap_file. · f4ad3d38

Authored by NeilBrown on Dec 15, 2014

'buf' is only used because d_path fills from the end of the
buffer instead of from the start.
We don't need a separate buf to handle that, we just need to use
memmove() to move the string to the start.
Signed-off-by: NeilBrown <neilb@suse.de>

f4ad3d38
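
The memmove() idea from this commit can be shown in userspace. In the sketch below, `fill_from_end` is a hypothetical stand-in that mimics the fill-from-the-end behaviour described for d_path; it is not the kernel function, and both names are invented for illustration.

```c
#include <stddef.h>
#include <string.h>

/*
 * Stand-in for a d_path-style helper: writes the NUL-terminated string at the
 * END of buf and returns a pointer to its first character (NULL if it does not fit).
 */
char *fill_from_end(char *buf, size_t len, const char *s)
{
    size_t n = strlen(s) + 1;          /* include the terminating NUL */
    if (n > len)
        return NULL;
    char *p = buf + len - n;
    memcpy(p, s, n);
    return p;
}

/* Leave the string at the START of buf without using a second buffer. */
int path_at_start(char *buf, size_t len, const char *s)
{
    char *p = fill_from_end(buf, len, s);
    if (!p)
        return -1;
    memmove(buf, p, strlen(p) + 1);    /* source and destination may overlap */
    return 0;
}
```

memmove() rather than memcpy() matters here: when the string takes up more than half the buffer, the source and destination regions overlap.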

md: remove mddev_lock from rdev_attr_show() · 758bfc8a

Authored by NeilBrown on Dec 15, 2014

No rdev attributes need locking for 'show', though
state_show() might benefit from ensuring it sees a
consistent set of flags.

None even use rdev->mddev, so testing for it isn't really
needed and it certainly doesn't need to be held constant.

So improve state_show() and remove the locking.
Signed-off-by: NeilBrown <neilb@suse.de>

758bfc8a

md: remove mddev_lock() from md_attr_show() · b7b17c9b

Authored by NeilBrown on Dec 15, 2014

Most attributes can be read safely without any locking.
A race might lead to a slightly out-dated value, but nothing wrong.

We already have locking in some places where needed.
All that remains is can_clear_show(), behind_writes_used_show()
and action_show() which are easily fixed.
Signed-off-by: NeilBrown <neilb@suse.de>

b7b17c9b

md: remove need for mddev_lock() in md_seq_show() · f97fcad3

Authored by NeilBrown on Dec 15, 2014

The only access in md_seq_show that could suffer from races
not protected by ->lock is walking the rdev list.
This can receive sufficient protection from 'rcu'.

So use rdev_for_each_rcu() and get rid of mddev_lock().

Now reading /proc/mdstat will never block in md_seq_show.
Signed-off-by: NeilBrown <neilb@suse.de>

f97fcad3

04 Feb, 2015 7 commits

md: protect ->pers changes with mddev->lock · 36d091f4

Authored by NeilBrown on Dec 15, 2014

->pers is already protected by ->reconfig_mutex, and
cannot possibly change when there are threads running or
outstanding IO.

However there are some places where we access ->pers
not in a thread or IO context, and where ->reconfig_mutex
is unnecessarily heavy-weight:  level_show and md_seq_show().

So protect all changes, and those accesses, with ->lock.
This is a step toward taking those accesses out from under
reconfig_mutex.

[Fixed missing "mddev->pers" -> "pers" conversion, thanks to
 Dan Carpenter <dan.carpenter@oracle.com>]
Signed-off-by: NeilBrown <neilb@suse.de>

36d091f4

md: level_store: group all important changes into one place. · db721d32

Authored by NeilBrown on Dec 15, 2014

Gather all the changes that can happen atomically and might
be relevant to other code into one place.  This will
make it easier to refine the locking.

Note that this puts quite a few things between mddev_detach()
and ->free().  Enabling this was the point of some recent patches.
Signed-off-by: NeilBrown <neilb@suse.de>

db721d32

md: rename ->stop to ->free · afa0f557

Authored by NeilBrown on Dec 15, 2014

Now that the ->stop function only frees the private data,
rename it accordingly.

Also pass in the private pointer as an arg rather than using
mddev->private.  This flexibility will be useful in level_store().

Finally, don't clear ->private.  It doesn't make sense to clear
it seeing that isn't what we free, and it is no longer necessary
to clear ->private (it was some time ago before  ->to_remove was
introduced).

Setting ->to_remove in ->free() is a bit of a wart, but not a
big problem at the moment.
Signed-off-by: NeilBrown <neilb@suse.de>

afa0f557

md: split detach operation out from ->stop. · 5aa61f42

Authored by NeilBrown on Dec 15, 2014

Each md personality has a 'stop' operation which does two
things:
 1/ it finalizes some aspects of the array to ensure nothing
    is accessing the ->private data
 2/ it frees the ->private data.

All the steps in '1' can apply to all arrays and so can be
performed in common code.

This is useful as in the case where we change the personality which
manages an array (in level_store()), it would be helpful to do
step 1 early, and step 2 later.

So split the 'step 1' functionality out into a new mddev_detach().
Signed-off-by: NeilBrown <neilb@suse.de>

5aa61f42

md: make merge_bvec_fn more robust in face of personality changes. · 64590f45

Authored by NeilBrown on Dec 15, 2014

There is no locking around calls to merge_bvec_fn(), so
it is possible that calls which coincide with a level (or personality)
change could go wrong.

So create a central dispatch point for these functions and use
rcu_read_lock().
If the array is suspended, reject any merge that can be rejected.
If not, we know it is safe to call the function.
Signed-off-by: NeilBrown <neilb@suse.de>

64590f45

md: make ->congested robust against personality changes. · 5c675f83

Authored by NeilBrown on Dec 15, 2014

There is currently no locking around calls to the 'congested'
bdi function.  If called at an awkward time while an array is
being converted from one level (or personality) to another, there
is a tiny chance of running code in an unreferenced module etc.

So add a 'congested' function to the md_personality operations
structure, and call it with appropriate locking from a central
'mddev_congested'.

When the array personality is changing the array will be 'suspended'
so no IO is processed.
If mddev_congested detects this, it simply reports that the
array is congested, which is a safe guess.
As mddev_suspend calls synchronize_rcu(), mddev_congested can
avoid races by including the whole call inside an rcu_read_lock()
region.
This requires that the congested functions for all subordinate devices
can be run under rcu_lock.  Fortunately this is the case.
Signed-off-by: NeilBrown <neilb@suse.de>

5c675f83
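
The dispatch logic of this commit (report "congested" while suspended, otherwise ask the personality) can be sketched without the locking. All type and function names below are hypothetical stand-ins. The rcu_read_lock() region that the commit relies on is deliberately left out, because this is single-threaded userspace code.

```c
/* Forward declaration so the operations table can refer to the device. */
struct cong_mddev;

/* Hypothetical, trimmed-down personality operations table. */
struct cong_personality {
    int (*congested)(struct cong_mddev *mddev, int bits);
};

struct cong_mddev {
    int suspended;                       /* non-zero while the personality is changing */
    const struct cong_personality *pers;
};

/* Example personality callback that never reports congestion. */
int never_congested(struct cong_mddev *mddev, int bits)
{
    (void)mddev;
    (void)bits;
    return 0;
}

/*
 * Central dispatch point. In the kernel this body would sit inside an
 * rcu_read_lock()/rcu_read_unlock() pair; that part is omitted here.
 */
int mddev_congested_sketch(struct cong_mddev *mddev, int bits)
{
    if (mddev->suspended)
        return 1;                        /* safe guess: no I/O is processed anyway */
    return mddev->pers->congested(mddev, bits);
}
```

Answering "congested" while suspended never touches the personality's code, which is the point: the module behind `pers` may be on its way out at that moment.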

md: rename mddev->write_lock to mddev->lock · 85572d7c

Authored by NeilBrown on Dec 15, 2014

This lock is used for (slightly) more than helping with writing
superblocks, and it will soon be extended further.  So the
name is inappropriate.

Also, the _irq variant hasn't been needed since 2.6.37 as it is
never taken from interrupt or bh context.

So:
  -rename write_lock to lock
  -document what it protects
  -remove _irq ... except in md_flush_request() as there
     is no wait_event_lock() (with no _irq).  This can be
     cleaned up after appropriate changes to wait.h.
Signed-off-by: NeilBrown <neilb@suse.de>

85572d7c

11 Dec, 2014 1 commit

md: Check MD_RECOVERY_RUNNING as well as ->sync_thread. · f851b60d

Authored by NeilBrown on Dec 11, 2014

A recent change to md started the ->sync_thread asynchronously
from a work_queue rather than synchronously.  This means that there
can be a small window between the time when MD_RECOVERY_RUNNING is set
and when ->sync_thread is set.

So code that checks ->sync_thread might now conclude that the thread
has not been started and (because a lock is held) will not be started.
That is no longer the case.

Most of those places are best fixed by testing MD_RECOVERY_RUNNING
as well.  To make this completely reliable, we wake_up(&resync_wait)
after clearing that flag as well as after clearing ->sync_thread.

Other places are better served by flushing the relevant workqueue
to ensure that if the sync thread was starting, it has now
started.  This is particularly best if we are about to stop the
sync thread.

Fixes: ac05f256
Signed-off-by: NeilBrown <neilb@suse.de>

f851b60d
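
The window described in this commit is easy to see in a small model. The sketch below uses a hypothetical flag bit and struct (the kernel keeps MD_RECOVERY_RUNNING as a bit in mddev->recovery; the layout here is simplified) and contrasts the old thread-only test with one that also checks the flag.

```c
#include <stddef.h>

/* Hypothetical flag mask standing in for MD_RECOVERY_RUNNING. */
#define SKETCH_RECOVERY_RUNNING (1ul << 0)

struct sync_mddev {
    unsigned long recovery;   /* flag word */
    void *sync_thread;        /* NULL until the work item has created the thread */
};

/* Old-style test: misses the window where the flag is set but no thread exists yet. */
int sync_active_thread_only(const struct sync_mddev *m)
{
    return m->sync_thread != NULL;
}

/* Test in the spirit of the commit: check the flag as well as the thread pointer. */
int sync_active(const struct sync_mddev *m)
{
    return m->sync_thread != NULL ||
           (m->recovery & SKETCH_RECOVERY_RUNNING) != 0;
}
```

In the state "flag set, thread not yet created", only the second helper reports the resync as active.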