Commits · b2b9bfff0aa721a04a3924ed451c417d2bd9ed15 · openanolis / cloud-kernel

01 Sep, 2015 8 commits

md-cluster: remove the unused sb_lock · b2b9bfff

Authored by Guoqing Jiang on Jul 10, 2015

The sb_lock is not used anywhere, so let's remove it.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>

b2b9bfff

md-cluster: init suspend_list and suspend_lock early in join · 9e3072e3

Authored by Guoqing Jiang on Jul 10, 2015

If a node has just joined the cluster and receives a message from other
nodes before suspend_list is initialized, the kernel crashes with a NULL
pointer dereference, so move the initializations earlier to fix the bug.

md-cluster: Joined cluster 3578507b-e0cb-6d4f-6322-696cd7b1b10c slot 3
BUG: unable to handle kernel NULL pointer dereference at           (null)
... ... ...
Call Trace:
[<ffffffffa0444924>] process_recvd_msg+0x2e4/0x330 [md_cluster]
[<ffffffffa0444a06>] recv_daemon+0x96/0x170 [md_cluster]
[<ffffffffa045189d>] md_thread+0x11d/0x170 [md_mod]
[<ffffffff810768c4>] kthread+0xb4/0xc0
[<ffffffff8151927c>] ret_from_fork+0x7c/0xb0
... ... ...
RIP  [<ffffffffa0443581>] __remove_suspend_info+0x11/0xa0 [md_cluster]
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>

9e3072e3

md-cluster: add the error check if failed to get dlm lock · b5ef5678

Authored by Guoqing Jiang on Jul 10, 2015

In a complicated cluster environment, it is possible that the
dlm lock cannot be acquired or converted as intended, so the related
error information is added to make debugging potential issues easier.

For lockres_free, if the lock is blocked by a lock request or
conversion request, then dlm_unlock just puts it back on the grant
queue, so we need to ensure the lock is finally freed.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>

b5ef5678

md-cluster: init completion within lockres_init · b83d51c0

Authored by Guoqing Jiang on Jul 10, 2015

We should init the completion within lockres_init, otherwise
the completion could be initialized more than once during
its life cycle.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>

b83d51c0

md-cluster: fix deadlock issue on message lock · 66099bb0

Authored by Guoqing Jiang on Jul 10, 2015

There is a problem with the previous communication mechanism: we hit the
deadlock scenario below in a cluster that has 3 nodes.

    Sender                      Receiver                  Receiver

    token(EX)
    message(EX)
    writes message
    downconverts message(CR)
    requests ack(EX)
                                gets message(CR)          gets message(CR)
                                reads message             reads message
                                requests EX on message    requests EX on message

To fix this problem, we make the following changes:

1. The sender downconverts MESSAGE to CW rather than CR.
2. The receiver requests a PR lock, not an EX lock, on MESSAGE.

If we fail to down-convert EX to CW on MESSAGE, it is better to
unlock MESSAGE rather than keep holding the lock.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Lidong Zhong <ldzhong@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>

66099bb0
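
The deadlock fixed in 66099bb0 follows from the standard DLM lock-mode compatibility rules. The sketch below is an illustration only, not kernel code: the table is the usual six-mode DLM compatibility matrix, and `can_grant` is a hypothetical helper. It checks that two receivers holding CR block each other's EX request, while PR requests do not block each other.

```python
# COMPATIBLE[held] = modes that another node may be granted at the same time.
COMPATIBLE = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},
    "CR": {"NL", "CR", "CW", "PR", "PW"},
    "CW": {"NL", "CR", "CW"},
    "PR": {"NL", "CR", "PR"},
    "PW": {"NL", "CR"},
    "EX": {"NL"},
}


def can_grant(requested, held_by_others):
    """True if `requested` is compatible with every mode held by other nodes."""
    return all(requested in COMPATIBLE[held] for held in held_by_others)


# Old scheme: each receiver holds CR on MESSAGE and asks for EX.
# The other receiver's CR blocks the request, so neither can proceed.
old_receiver_blocked = not can_grant("EX", ["CR"])

# New scheme: each receiver holds CR and asks for PR.
# PR is compatible with the other receiver's CR and with its PR.
new_receivers_ok = can_grant("PR", ["CR"]) and can_grant("PR", ["PR"])

# While the sender still holds CW, a receiver can read under CR,
# but its PR request waits until the sender lets go of MESSAGE.
read_while_sender_cw = can_grant("CR", ["CW"])
pr_waits_for_sender = not can_grant("PR", ["CW"])
```

In this model the receivers' PR requests only have to wait for the sender, never for each other, which is what removes the cycle.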

md-cluster: transfer the resync ownership to another node · dc737d7c

Authored by Guoqing Jiang on Jul 10, 2015

When node A stops an array while the array is doing a resync, we need
to let another node B take over the resync task.

To achieve this, node A sends an explicit BITMAP_NEEDS_SYNC message
to the cluster. Node B, on receiving that message, invokes
__recover_slot to do the resync.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>

dc737d7c

md-cluster: split recover_slot for future code reuse · 05cd0e51

Authored by Guoqing Jiang on Jul 10, 2015

Make recover_slot a wrapper around __recover_slot, since the
logic of __recover_slot can be reused for the case
when other nodes need to take over the resync job.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>

05cd0e51

md-cluster: use %pU to print UUIDs · b89f704a

Authored by Guoqing Jiang on Jul 10, 2015

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>

b89f704a

24 Jul, 2015 1 commit

Fix read-balancing during node failure · 90382ed9

Authored by Goldwyn Rodrigues on Jun 24, 2015

During a node failure, we need to suspend read balancing so that
reads are directed to the first device and stale data is not read.
Suspending writes is not required because these would be recorded and
synced eventually.

A new flag MD_CLUSTER_SUSPEND_READ_BALANCING is set in recover_prep().
area_resyncing() will respond true for the entire device if this
flag is set and the request type is READ. The flag is cleared
in recover_done().
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reported-By: David Teigland <teigland@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>

90382ed9
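
The flag behaviour described in 90382ed9 can be modelled in a few lines. This is a rough sketch under stated assumptions, not the kernel implementation: `ClusterInfo`, the boolean attribute standing in for MD_CLUSTER_SUSPEND_READ_BALANCING, and the half-open range overlap test are all hypothetical.

```python
READ, WRITE = 0, 1


class ClusterInfo:
    def __init__(self):
        # Stands in for the MD_CLUSTER_SUSPEND_READ_BALANCING flag.
        self.suspend_read_balancing = False
        # (lo, hi) sector ranges currently being resynced by other nodes.
        self.suspend_list = []

    def recover_prep(self):
        # A node failure was detected: stop balancing reads.
        self.suspend_read_balancing = True

    def recover_done(self):
        # Recovery finished: reads may be balanced again.
        self.suspend_read_balancing = False

    def area_resyncing(self, direction, lo, hi):
        # While the flag is set, every READ is treated as "resyncing",
        # whatever range it touches.
        if direction == READ and self.suspend_read_balancing:
            return True
        # Otherwise fall back to an overlap check (assumed semantics).
        return any(hi > s_lo and lo < s_hi for s_lo, s_hi in self.suspend_list)
```

Writes are deliberately unaffected by the flag in this model, matching the statement that suspending writes is not required.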

22 Apr, 2015 3 commits

md-cluster: re-add capabilities · 97f6cd39

Authored by Goldwyn Rodrigues on Apr 14, 2015

When "re-add" is written to /sys/block/mdXX/md/dev-YYY/state,
the clustered md:

1. Sends RE_ADD message with the desc_nr. Nodes receiving the message
   clear the Faulty bit in their respective rdev->flags.
2. The node initiating re-add gathers the bitmaps of all nodes
   and copies them into the local bitmap. It does not clear the bitmap
   from which it is copying.
3. Initiating node schedules an md recovery to sync the devices.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>

97f6cd39

md-cluster: remove capabilities · 88bcfef7

Authored by Goldwyn Rodrigues on Apr 14, 2015

This adds "remove" capabilities for the clustered environment.
When a user initiates removal of a device from the array, a
REMOVE message with the disk number in the array is sent to all
the nodes, which kick the respective device in their own array.

This facilitates the removal of failed devices.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>

88bcfef7

md-cluster: correct the num for comparison · 8c58f02e

Authored by Guoqing Jiang on Apr 21, 2015

Since md-cluster node numbers start from zero, while
cinfo->slot_number holds the DLM slot number,
there is no need to check for equality.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>

8c58f02e

21 Mar, 2015 3 commits

md/cluster: Communication Framework: fix semicolon.cocci warnings · 09dd1af2

Authored by kbuild test robot on Feb 28, 2015

drivers/md/md-cluster.c:328:2-3: Unneeded semicolon

Removes unneeded semicolon.

Generated by: scripts/coccinelle/misc/semicolon.cocci
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>

09dd1af2

md: recover_bitmaps() can be static · 6dc69c9c

Authored by kbuild test robot on Feb 28, 2015

drivers/md/md-cluster.c:190:6: sparse: symbol 'recover_bitmaps' was not declared. Should it be static?
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>

6dc69c9c

md: Fix stray --cluster-confirm crash · fa8259da

Authored by Goldwyn Rodrigues on Mar 02, 2015

A --cluster-confirm without an --add (by another node) can
crash the kernel.

Fix it by guarding it using a state.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>

fa8259da

23 Feb, 2015 18 commits

Add new disk to clustered array · 1aee41f6

Authored by Goldwyn Rodrigues on Oct 29, 2014

Algorithm:
1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
   ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD)
2. Node 1 sends NEWDISK with uuid and slot number
3. Other nodes issue kobject_uevent_env with uuid and slot number
(Steps 4,5 could be a udev rule)
4. In userspace, the node searches for the disk, perhaps
   using blkid -t SUB_UUID=""
5. Other nodes issue either of the following depending on whether the disk
   was found:
   ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
	 disc.number set to slot number)
   ioctl(CLUSTERED_DISK_NACK)
6. Other nodes drop lock on no-new-devs (CR) if device is found
7. Node 1 attempts EX lock on no-new-devs
8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk
   as SpareLocal
9. If it does not get the no-new-devs lock, it fails the operation and
   sends METADATA_UPDATED
10. Other nodes understand if the device is added or not by reading the
    superblock again after receiving the METADATA_UPDATED message.
Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

1aee41f6

Suspend writes in RAID1 if within range · 589a1c49

Authored by Goldwyn Rodrigues on Jun 07, 2014

If there is a resync going on, all nodes must suspend writes to the
range. This is recorded in the suspend_info/suspend_list.

If an I/O falls within the range of any suspend_info entry,
should_suspend will return 1.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

589a1c49

Resync start/Finish actions · e59721cc

Authored by Goldwyn Rodrigues on Jun 07, 2014

When a RESYNC_START message arrives, the node removes the entry
with the current slot number and adds the range to the
suspend_list.

Similarly, when a RESYNC_FINISHED message is received, the node clears
the entry for the corresponding bitmap number.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

e59721cc
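
The suspend_list bookkeeping described in e59721cc amounts to keeping at most one range per slot. A minimal sketch, assuming a plain list of `(slot, lo, hi)` tuples; the class and method names are hypothetical, and locking is left out entirely:

```python
class SuspendList:
    def __init__(self):
        # One (slot, lo, hi) entry per node that is currently resyncing.
        self.entries = []

    def _remove(self, slot):
        self.entries = [e for e in self.entries if e[0] != slot]

    def on_resync_start(self, slot, lo, hi):
        # Drop any stale entry for this slot, then record the new range.
        self._remove(slot)
        self.entries.append((slot, lo, hi))

    def on_resync_finished(self, slot):
        # The sender finished its resync: clear its entry.
        self._remove(slot)
```

Replacing rather than appending on RESYNC_START keeps the list bounded by the number of nodes, even if a node reports a new range many times during one resync.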

Send RESYNCING while performing resync start/stop · 965400eb

Authored by Goldwyn Rodrigues on Jun 07, 2014

When a resync is initiated, a RESYNCING message is sent to all active
nodes with the range (lo,hi). When the resync is over, a RESYNCING
message is sent with (0,0). A high sector value of zero indicates
that the resync is over.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

965400eb

Reload superblock if METADATA_UPDATED is received · 1d7e3e96

Authored by Goldwyn Rodrigues on Jun 07, 2014

Re-reads the devices by invalidating the cache.
Since we don't write to faulty devices, this is detected using the
events recorded in the devices. If a device's event count is older
than the mddev's, it is marked as faulty.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

1d7e3e96

metadata_update sends message to other nodes · 293467aa

Authored by Goldwyn Rodrigues on Jun 07, 2014

   - request to send a message
   - make changes to superblock
   - send messages telling everyone that the superblock has changed
   - other nodes all read the superblock
   - other nodes all ack the messages
   - updating node releases the "I'm sending a message" resource.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

293467aa

Communication Framework: Sending functions · 601b515c

Authored by Goldwyn Rodrigues on Jun 07, 2014

The sending part is split into two functions to ensure the
atomicity of operations such as the MD superblock update.
Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

601b515c

Communication Framework: Receiving · 4664680c

Authored by Goldwyn Rodrigues on Jun 07, 2014

1. receive status

   sender                         receiver                   receiver
   ACK:CR                          ACK:CR                     ACK:CR

2. sender get EX of TOKEN
   sender get EX of MESSAGE
   sender                          receiver                   receiver
   TOKEN:EX                         ACK:CR                     ACK:CR
   MESSAGE:EX
   ACK:CR

3. sender write LVB.
   sender down-convert MESSAGE from EX to CR
   sender try to get EX of ACK
   [ wait until all receiver has *processed* the MESSAGE ]

                                     [ triggered by bast of ACK ]
                                     receiver get CR of MESSAGE
                                     receiver read LVB
                                     receiver processes the message
				     [ wait finish ]
                                     receiver release ACK

   sender                         receiver                   receiver
   TOKEN:EX                       MESSAGE:CR                 MESSAGE:CR
   MESSAGE:CR
   ACK:EX

4. sender down-convert ACK from EX to CR
   sender release MESSAGE
   sender release TOKEN
				  receiver upconvert to EX of MESSAGE
                                  receiver get CR of ACK
				  receiver release MESSAGE

   sender                        receiver                   receiver
   ACK:CR                         ACK:CR                     ACK:CR
Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

4664680c

Perform resync for cluster node failure · 4b26a08a

Authored by Goldwyn Rodrigues on Jun 07, 2014

If bitmap_copy_slot returns hi>0, we need to perform resync.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

4b26a08a

Initiate recovery on node failure · e94987db

Authored by Goldwyn Rodrigues on Jun 07, 2014

The DLM informs us in case of node failure with the DLM slot number.
The bit corresponding to the slot number is set in
cluster_info->recovery_map and the recovery thread is woken up.

The recovery thread:
1. Derives the slot number from the recovery_map
2. Locks the bitmap corresponding to the slot
3. Copies the set bits to the node-local bitmap
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

e94987db
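
The recovery-thread steps in e94987db can be sketched as a bitmask that is drained one slot at a time. This is a simplified model with assumed names (`RecoveryMap`, `run_recovery`); bitmaps are represented as Python sets of dirty chunk numbers, and step 2, taking the DLM lock on the failed node's bitmap, is omitted.

```python
class RecoveryMap:
    def __init__(self):
        self.bits = 0  # bit N set means slot N needs recovery

    def mark_failed(self, slot):
        self.bits |= 1 << slot

    def drain(self):
        """Return the slots whose bits are set, lowest first, and clear the map."""
        slots = []
        slot = 0
        bits = self.bits
        while bits:
            if bits & 1:
                slots.append(slot)
            bits >>= 1
            slot += 1
        self.bits = 0
        return slots


def run_recovery(recovery_map, node_bitmaps, local_bitmap):
    # Step 1: derive each failed slot from the map.
    for slot in recovery_map.drain():
        # Step 3: copy that slot's set bits into the node-local bitmap.
        local_bitmap |= node_bitmaps[slot]
    return local_bitmap
```

After `run_recovery` returns, the local bitmap covers everything the failed nodes had marked dirty, so a normal resync of the local bitmap also repairs their regions.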

Gather on-going resync information of other nodes · 96ae923a

Authored by Goldwyn Rodrigues on Jun 06, 2014

When a node joins, it does not know of other nodes performing resync.
So, each node keeps the resync information in its LVB. When a new
node joins, it reads the LVB of each "online" bitmap.

[TODO] The new node attempts to get the PW lock on the other node's
bitmap. If it is successful, it reads the bitmap and performs the
resync (if required) on that node's behalf.

If the node does not get the PW, it requests CR and reads the LVB
for the resync information.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

96ae923a

Lock bitmap while joining the cluster · 54519c5f

Authored by Goldwyn Rodrigues on Jun 06, 2014

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

54519c5f

Use separate bitmaps for each nodes in the cluster · b97e9257

Authored by Goldwyn Rodrigues on Jun 06, 2014

On-disk format:

0                    4k                     8k                    12k
-------------------------------------------------------------------
| idle                | md super            | bm super [0] + bits |
| bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
| bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
| bm bits [3, contd]  |                     |                     |

Bitmap super has a field nodes, which defines the maximum number
of nodes the device can use. While reading the bitmap super, if
the cluster finds out that the number of nodes is > 0:
1. Requests the md-cluster module.
2. Calls md_cluster_ops->join(), which sets up clustering such as
   joining DLM lockspace.

The first time through, the first bitmap is read. After the call
to cluster_setup, the bitmap offset is adjusted and the
superblock is re-read. This also ensures the bitmap is read under
the bitmap lock (once the bitmap lock is introduced in later patches).

Questions:
1. cluster name is repeated in all bitmap supers. Is that okay?
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

b97e9257
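
The on-disk layout drawn in b97e9257 implies a simple per-node offset calculation. The sketch below only reproduces the numbers visible in that diagram: the first bitmap super at 8k and 8k per node are read off the picture, and `bitmap_offset` is a hypothetical helper. In a real array the per-node span would presumably depend on the bitmap size.

```python
KIB = 1024
FIRST_BITMAP_OFFSET = 8 * KIB  # "bm super [0]" starts at 8k in the diagram
PER_NODE_SPAN = 8 * KIB        # each node's super + bits covers 8k in the diagram


def bitmap_offset(slot, first=FIRST_BITMAP_OFFSET, span=PER_NODE_SPAN):
    """Byte offset of the bitmap superblock for a zero-based slot number."""
    if slot < 0:
        raise ValueError("slot numbers start at zero")
    return first + slot * span
```

With these example values, slot 1 lands at 16k and slot 3 at 32k, matching the positions of "bm super[1]" and "bm super[3]" in the table.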

Add node recovery callbacks · cf921cc1

Authored by Goldwyn Rodrigues on Mar 30, 2014

DLM offers callbacks when a node fails and the lock remastery
is performed:

1. recover_prep: called when DLM discovers a node is down
2. recover_slot: called when DLM identifies the node and recovery
		can start
3. recover_done: called when all nodes have completed recover_slot

recover_slot() and recover_done() are also called when the node joins
initially in order to inform the node with its slot number. These slot
numbers start from one, so we deduct one to make it start with zero
which the cluster-md code uses.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

cf921cc1

Introduce md_cluster_info · c4ce867f

Authored by Goldwyn Rodrigues on Mar 29, 2014

md_cluster_info stores the cluster information in the MD device.

The join() is called when mddev detects it is a clustered device.
The main responsibilities are:
	1. Set up a DLM lockspace
	2. Set up all initial locks such as super block locks and bitmap lock (will come later)

The leave() clears up the lockspace and all the locks held.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

c4ce867f

Introduce md_cluster_operations to handle cluster functions · edb39c9d

Authored by Goldwyn Rodrigues on Mar 29, 2014

This allows dynamic registering of cluster hooks.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

edb39c9d

DLM lock and unlock functions · 47741b7c

Authored by Goldwyn Rodrigues on Mar 07, 2014

A dlm_lock_resource is a structure which contains all information
required for locking using DLM. The init function allocates the
lock and acquires the lock in NL mode. The unlock function
converts the lock resource to NL mode. This is done to preserve
LVB and for faster processing of locks. The lock resource is
DLM unlocked only in the lockres_free function, which is the end
of life of the lock resource.
Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

47741b7c
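
The lock-resource life cycle described in 47741b7c (acquire in NL at init, convert back to NL on unlock, really unlock only on free) can be pictured as a small state machine. The class below is a toy model with hypothetical names and makes no DLM calls; it only tracks which mode the resource would be in.

```python
class LockResource:
    def __init__(self, name):
        self.name = name
        self.mode = "NL"   # init acquires the lock in NL mode
        self.lvb = b""     # lock value block attached to the resource
        self.freed = False

    def _check_alive(self):
        if self.freed:
            raise RuntimeError("lock resource already freed")

    def lock(self, mode):
        # Converting from NL to the requested mode.
        self._check_alive()
        self.mode = mode

    def unlock(self):
        # Down-convert to NL instead of releasing, so the resource
        # (and its LVB) stays around for the next lock() call.
        self._check_alive()
        self.mode = "NL"

    def free(self):
        # End of life: only here is the lock actually given up.
        self._check_alive()
        self.mode = None
        self.freed = True
```

Keeping the resource in NL between uses means later requests are conversions of an existing lock rather than fresh lock requests, which is the "faster processing" the commit message refers to.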

Create a separate module for clustering support · 8e854e9c

Authored by Goldwyn Rodrigues on Mar 07, 2014

Tagged as EXPERIMENTAL for now.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

8e854e9c

openanolis / cloud-kernel synced successfully more than 1 year ago
