Commits · 86b572770e7964f006d438c4e05008914e9db79b · openanolis / cloud-kernel

13 Oct, 2015 · 6 commits

md-cluster: Add 'SUSE' as author for md-cluster.c · 86b57277

Committed by Guoqing Jiang on Oct 12, 2015

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>

md-cluster: zero cmsg before it was sent · aee177ac

Committed by Guoqing Jiang on Oct 12, 2015

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>

md-cluster: make sure the node do not receive it's own msg · 256f5b24

Committed by Guoqing Jiang on Oct 12, 2015

During past testing, a node occasionally received a message that it
had sent itself. This should not happen in theory, but it is better
to guard against it in case something goes wrong.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
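
The guard described in 256f5b24 can be pictured with a small user-space model. This is a Python sketch, not the kernel code: `ClusterMsg`, `handle_received` and `own_slot` are hypothetical names introduced for illustration, and the sketch assumes each message carries the sender's slot number.

```python
from dataclasses import dataclass


@dataclass
class ClusterMsg:
    slot: int      # slot number of the node that sent the message
    payload: str   # placeholder for the rest of the message


def handle_received(own_slot: int, msg: ClusterMsg, handled: list) -> bool:
    """Process msg unless this node sent it; return True if it was processed."""
    if msg.slot == own_slot:
        # Should not happen in theory; drop the message defensively.
        return False
    handled.append(msg.payload)
    return True
```

A message whose `slot` equals `own_slot` is dropped and leaves `handled` untouched; every other message is processed as usual.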

md-cluster: remove unnecessary setting for slot · 487cf914

Committed by Guoqing Jiang on Oct 12, 2015

Since slot will be set within _sendmsg, we can remove
the redundant code in resync_info_update.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>

md-cluster: make other members of cluster_msg is handled by little endian funcs · faeff83f

Committed by Guoqing Jiang on Oct 12, 2015

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>

md-cluster: Do not printk() every received message · d216711b

Committed by Goldwyn Rodrigues on Oct 12, 2015

The receive daemon prints kernel messages for every network message
received. This would fill the kernel message log with unnecessary messages.
Remove the pr_info() messages.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

12 Oct, 2015 · 10 commits

md-cluster: Fix adding of new disk with new reload code · dbb64f86

Committed by Goldwyn Rodrigues on Oct 01, 2015

Adding the disk worked incorrectly with the new reload code. Fix it:

 - No operation should be performed on rdev marked as Candidate
 - After a metadata update operation, kick disk if role is 0xfffe
   else clear Candidate bit and continue with the regular change check.
 - Saving the mode of the lock resource to check if token lock is already
   locked, because it can be called twice while adding a disk. However,
   unlock_comm() must be called only once.
 - add_new_disk() is called by the node initiating the --add operation.
   If it needs to be canceled, call add_new_disk_cancel(). The operation
   is completed by md_update_sb() which will write and unlock the
   communication.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

md-cluster: Perform resync/recovery under a DLM lock · c186b128

Committed by Goldwyn Rodrigues on Sep 30, 2015

Resync or recovery must be performed by only one node at a time.
A DLM lock resource, resync_lockres, provides the mutual exclusion
so that only one node performs the recovery/resync at a time.

If a node is unable to get the resync_lockres because recovery is
being performed by another node, it sets MD_RECOVERY_NEEDED so as
to schedule recovery in the future.

Remove the debug message in resync_info_update()
used during development.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
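
The mutual exclusion described in c186b128 can be modelled in user space. In this Python sketch a `threading.Lock` stands in for the DLM lock resource and a boolean stands in for the recovery-needed flag; `Node` and `try_resync` are hypothetical names, and real DLM locking across machines is considerably more involved than a local lock.

```python
import threading


class Node:
    def __init__(self, name, resync_lock):
        self.name = name
        self.resync_lock = resync_lock   # stand-in for resync_lockres
        self.recovery_needed = False     # stand-in for the recovery-needed flag

    def try_resync(self, work):
        """Run work() if the lock is free; otherwise note that a retry is needed."""
        # Non-blocking attempt: only one node may resync at a time.
        if not self.resync_lock.acquire(blocking=False):
            self.recovery_needed = True  # schedule recovery for later
            return False
        try:
            work()
            self.recovery_needed = False
            return True
        finally:
            self.resync_lock.release()
```

A node that loses the race does not block; it only records that recovery is still needed and tries again later.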

md-cluster: Perform a lazy update · 2aa82191

Committed by Goldwyn Rodrigues on Sep 28, 2015

In a clustered environment, a change such as marking a device faulty,
can be recorded by any of the nodes. This is communicated to all the
nodes and re-recording such a change is unnecessary, and quite often
pretty disruptive.

With this patch, just before the update, we check for changes,
and if the changes are already in the superblock, we abort the update
after clearing all the flags.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
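
A minimal Python sketch of the lazy-update idea in 2aa82191, under simplifying assumptions: device roles are plain dictionaries and the pending-change flags are a set. The function name `lazy_update` and both data shapes are hypothetical and do not mirror the md data structures.

```python
def lazy_update(in_memory_roles: dict, on_disk_roles: dict, flags: set) -> bool:
    """Return True if a superblock write is still needed."""
    if in_memory_roles == on_disk_roles:
        # Another node already recorded this change: clear the
        # pending-change flags and abort the update.
        flags.clear()
        return False
    return True
```

When the on-disk state already matches, the caller skips the write entirely, which avoids re-recording a change that a peer has already committed.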

md-cluster: Improve md_reload_sb to be less error prone · 70bcecdb

Committed by Goldwyn Rodrigues on Aug 21, 2015

md_reload_sb is too simplistic and it explicitly needs to determine
the changes made by the writing node. However, there are multiple areas
where a simple reload could fail.

Instead, read the superblock of one of the "good" rdevs and update
the necessary information:

- read the superblock into a newly allocated page, by temporarily
  swapping out rdev->sb_page and calling ->load_super.
- if that fails, return.
- if it succeeds, call check_sb_changes, which:
  1. iterates over the list of active devices and checks the matching
     dev_roles[] value.
     If that is 'faulty', the device must be marked as faulty:
      - call md_error to mark the device as faulty. Make sure
        not to set CHANGE_DEVS and wake up mddev->thread, or else
        it would initiate a resync process, which is the responsibility
        of the "primary" node.
      - clear the Blocked bit.
      - call remove_and_add_spares() to hot remove the device.
     If the device is 'spare':
      - call remove_and_add_spares() to get the number of spares
        added in this operation.
      - reduce mddev->degraded to mark the array as not degraded.
  2. resets recovery_cp.
- read the rest of the rdevs to update recovery_offset. If recovery_offset
  is equal to MaxSector, call spare_active() to set it In_sync.

This required that recovery_offset be initialized to MaxSector, as
opposed to zero so as to communicate the end of sync for a rdev.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

md: remove_and_add_spares() to activate specific rdev · 2910ff17

Committed by Goldwyn Rodrigues on Sep 28, 2015

remove_and_add_spares() checks for all devices to activate spare.
Change it to activate a specific device if a non-null rdev
argument is passed.

remove_and_add_spares() can be used to activate spares in
slot_store() as well.

For hot_remove_disk(), check if rdev->raid_disk == -1 before
calling remove_and_add_spares()
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

md-cluster: Wake up suspended process · b8ca846e

Committed by Goldwyn Rodrigues on Oct 09, 2015

When the suspended_area is deleted, the suspended processes
must be woken up in order to complete their I/O.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

md-cluster: send BITMAP_NEEDS_SYNC when node is leaving cluster · 09995411

Committed by Guoqing Jiang on Oct 01, 2015

Previously, the BITMAP_NEEDS_SYNC message was sent whenever the resync
aborted, but a resync can abort for different reasons, and not all
of them require another node to take over resync ownership.

It is better to send the BITMAP_NEEDS_SYNC message only when
the node is leaving the cluster with a dirty bitmap. We also need
to ensure that the dlm connection is ok.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>

md-cluster: Use a small window for resync · c40f341f

Committed by Goldwyn Rodrigues on Aug 19, 2015

Suspending the entire device for resync could take too long. Resync
in small chunks.

cluster's resync window (32M) is maintained in r1conf as
cluster_sync_low and cluster_sync_high and processed in
raid1's sync_request(). If the current resync is outside the cluster
resync window:

1. Set the cluster_sync_low to curr_resync_completed.
2. Check if the sync will fit in the new window, if not issue a
   wait_barrier() and set cluster_sync_low to sector_nr.
3. Set cluster_sync_high to cluster_sync_low + resync_window.
4. Send a message to all nodes so they may add it in their suspension
   list.

bitmap_cond_end_sync is modified to allow forcing a sync in order
to bring curr_resync_completed up to date with the sector passed.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
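
The four numbered steps in the c40f341f message can be sketched as a small Python model. It is an illustration under stated assumptions, not the raid1 code: sectors are taken to be 512 bytes (so the 32M window is 65536 sectors), the "fits in the window" test is a guess at what the message means, `barrier_waits` merely counts where wait_barrier() would be called, and `sent` stands in for the message broadcast to the other nodes. The class name `ResyncWindow` is hypothetical.

```python
SECTOR_SIZE = 512                                   # assumed sector size in bytes
WINDOW_SECTORS = 32 * 1024 * 1024 // SECTOR_SIZE    # 32M window = 65536 sectors


class ResyncWindow:
    def __init__(self):
        self.low = 0              # models cluster_sync_low
        self.high = 0             # models cluster_sync_high
        self.barrier_waits = 0    # how often wait_barrier() would be issued
        self.sent = []            # (low, high) messages "sent" to other nodes

    def advance(self, sector_nr, nr_sectors, curr_resync_completed):
        """Move the window if the request falls outside it; return True if moved."""
        end = sector_nr + nr_sectors
        if self.low <= sector_nr and end <= self.high:
            return False                           # still inside the window
        self.low = curr_resync_completed           # step 1
        if end > self.low + WINDOW_SECTORS:        # step 2: does not fit
            self.barrier_waits += 1
            self.low = sector_nr
        self.high = self.low + WINDOW_SECTORS      # step 3
        self.sent.append((self.low, self.high))    # step 4
        return True
```

Only the current window has to be suspended on the other nodes, so writes elsewhere on the device can proceed while the resync runs.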

md: Increment version for clustered bitmaps · 3c462c88

Committed by Goldwyn Rodrigues on Aug 19, 2015

Add BITMAP_MAJOR_CLUSTERED as 5, in order to prevent older kernels
from assembling a clustered device.

In order to maximize compatibility, the major version is set to
BITMAP_MAJOR_CLUSTERED *only* if the bitmap is clustered.

Added MD_FEATURE_CLUSTERED in order to return error for older
kernels which would assemble MD even if the bitmap is corrupted.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>

md-cluster: complete all write requests before adding suspend_info · 9ed38ff5

Committed by Goldwyn Rodrigues on Aug 14, 2015

process_suspend_info - which handles the RESYNCING request - must not
reply until all writes which were initiated before the request arrived,
have completed.

As a by-product, all process_* functions now take mddev as their
first argument, making them uniform.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>

02 Oct, 2015 · 8 commits

md/bitmap: don't pass -1 to bitmap_storage_alloc. · da6fb7a9

Committed by NeilBrown on Oct 01, 2015

Passing -1 to bitmap_storage_alloc() causes page->index to be set to
-1, which is quite problematic.

So only pass ->cluster_slot if mddev_is_clustered().

Fixes: b97e9257 ("Use separate bitmaps for each nodes in the cluster")
Cc: stable@vger.kernel.org (v4.1+)
Signed-off-by: NeilBrown <neilb@suse.com>

md/raid1: Avoid raid1 resync getting stuck · e8ff8bf0

Committed by Jes Sorensen on Sep 16, 2015

close_sync() needs to set conf->next_resync to a large, but safe value
below MaxSector and use it to determine whether or not to set
start_next_window in wait_barrier()

Solution suggested by Neil Brown.
Reported-by: Nate Dailey <nate.dailey@stratus.com>
Tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>

md: drop null test before destroy functions · 644df1a8

Committed by Julia Lawall on Sep 13, 2015

Remove unneeded NULL test.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@ expression x; @@
-if (x != NULL)
  \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: NeilBrown <neilb@suse.com>

md: clear CHANGE_PENDING in readonly array · d4929add

Committed by Shaohua Li on Sep 18, 2015

If an array has more faulty disks than its allowed degraded number, the
array enters error handling. It will be marked as read-only with
MD_CHANGE_PENDING/RECOVERY_NEEDED set. But currently recovery doesn't
clear the CHANGE_PENDING bit for a read-only array. If MD_CHANGE_PENDING is
set for a raid5 array, all returned IO will be held on a list until the
bit is cleared. Because recovery never clears this bit, the IO stays in
the pending state and never finishes. This has bad effects, such as the
upper layer not getting an IO error and the array not being stoppable.

Fixes: c3cce6cd ("md/raid5: ensure device failure recorded before write request returns.")
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

md/raid0: apply base queue limits *before* disk_stack_limits · 66eefe5d

Committed by NeilBrown on Sep 24, 2015

Calling e.g. blk_queue_max_hw_sectors() after calls to
disk_stack_limits() discards the settings determined by
disk_stack_limits().
So we need to make those calls first.

Fixes: 199dc6ed ("md/raid0: update queue parameter in a safer location.")
Cc: stable@vger.kernel.org (v2.6.35+ - please apply with 199dc6ed).
Reported-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
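
Why the call order matters in 66eefe5d can be shown with a toy Python model. It assumes, as the message describes, that setting the base limit overwrites the current value while stacking a device can only lower it. `QueueLimits`, `set_max_hw_sectors`, `stack_limits` and `configure` are hypothetical stand-ins, not the block-layer API.

```python
class QueueLimits:
    def __init__(self):
        self.max_hw_sectors = None   # None means no limit has been set yet


def set_max_hw_sectors(q, value):
    # Models setting the base limit: overwrites whatever was there.
    q.max_hw_sectors = value


def stack_limits(q, dev_max):
    # Models stacking one member device: the limit can only shrink.
    if q.max_hw_sectors is None:
        q.max_hw_sectors = dev_max
    else:
        q.max_hw_sectors = min(q.max_hw_sectors, dev_max)


def configure(base_max, device_maxes, base_first):
    """Return the final limit for the chosen ordering of the two steps."""
    q = QueueLimits()
    if base_first:
        set_max_hw_sectors(q, base_max)
    for dev_max in device_maxes:
        stack_limits(q, dev_max)
    if not base_first:
        set_max_hw_sectors(q, base_max)   # discards the stacked limits
    return q.max_hw_sectors
```

With a base limit of 1024 and member devices limited to 256 and 512, applying the base first yields 256, while applying it last yields 1024 and loses the device limits.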

md/raid5: don't index beyond end of array in need_this_block(). · 36707bb2

Committed by NeilBrown on Sep 24, 2015

While need_this_block probably shouldn't be called when there
are more than 2 failed devices, we really don't want it to try
indexing beyond the end of the failed_num[] or fdev[] arrays.

So limit the loops to at most 2 iterations.
Reported-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.de>
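
The bounded loop from 36707bb2 can be sketched as follows. This Python model assumes, based on the "at most 2 iterations" wording, that the tracking arrays hold two entries; `tracked_failures` and `MAX_TRACKED_FAILURES` are hypothetical names.

```python
MAX_TRACKED_FAILURES = 2   # assumed capacity of failed_num[] and fdev[]


def tracked_failures(failed_count, failed_num):
    """Return the failed-device indices that can safely be inspected."""
    # Clamp the loop bound so a failure count above the array size
    # never indexes past the end.
    limit = min(failed_count, MAX_TRACKED_FAILURES)
    return [failed_num[i] for i in range(limit)]
```

Even if three devices are reported as failed, only the two tracked entries are read.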

raid5: update analysis state for failed stripe · ebda780b

Committed by Shaohua Li on Sep 18, 2015

handle_failed_stripe() makes the stripe fail, e.g. all IO will return
with a failure, but it doesn't update stripe_head_state. Later,
handle_stripe() has special handling for raid6 for handle_stripe_fill().
That check before handle_stripe_fill() doesn't skip the failed stripe
and we get a kernel crash in need_this_block. This patch clears the
analysis state to make sure no functions are wrongly called after
handle_failed_stripe().
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

md: wait for pending superblock updates before switching to read-only · 88724bfa

Committed by NeilBrown on Sep 24, 2015

If a superblock update is pending, wait for it to complete before
letting md_set_readonly() switch to readonly.
Otherwise we might lose important information about a device having
failed.

For external arrays, waiting for superblock updates can wait on
user-space, so in that case, just return an error.
Reported-and-tested-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>

12 Sep, 2015 · 1 commit

scsi_dh: fix randconfig build error · 294ab783

Committed by Christoph Hellwig on Sep 09, 2015

It looks like the Kconfig check that was meant to fix this (commit
fe9233fb [SCSI] scsi_dh: fix kconfig related
build errors) was actually reversed, but no-one noticed until the new set of
patches which separated DM and SCSI_DH.

Fixes: fe9233fb
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: James Bottomley <JBottomley@Odin.com>

01 Sep, 2015 · 15 commits

dm cache: fix use after freeing migrations · cc7da0ba

Committed by Joe Thornber on Sep 01, 2015

Both free_io_migration() and issue_discard() dereference a migration
that was just freed. Fix those by saving off the migration's cache
object before freeing the migration. Also clean up needless mg->cache
dereferences now that the cache object is available directly.

Fixes: e44b6a5a ("dm cache: move wake_waker() from free_migrations() to where it is needed")
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm cache: small cleanups related to deferred prison cell cleanup · dc9cee5d

Committed by Mike Snitzer on Aug 31, 2015

Eliminate __cell_release() since it only had one caller that always
released the cell holder.

Switch cell_error_with_code() to using free_prison_cell() for the sake
of consistency.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

dm cache: fix leaking of deferred bio prison cells · 9153df74

Committed by Joe Thornber on Aug 31, 2015

There were two cases where dm_cell_visit_release() was being called,
which removes the cell from the prison's rbtree, but the callers didn't
also return the cell to the mempool.  Fix this by having them call
free_prison_cell().

This leak manifested as the 'kmalloc-96' slab growing until OOM.

Fixes: 651f5fa2 ("dm cache: defer whole cells")
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.1+

md/raid5: ensure device failure recorded before write request returns. · c3cce6cd

Committed by NeilBrown on Aug 14, 2015

When a write to one of the devices of a RAID5/6 fails, the failure is
recorded in the metadata of the other devices so that after a restart
the data on the failed drive won't be trusted even if that drive seems
to be working again (maybe a cable was unplugged).

Similarly when we record a bad-block in response to a write failure,
we must not let the write complete until the bad-block update is safe.

Currently there is no interlock between the write request completing
and the metadata update.  So it is possible that the write will
complete, the app will confirm success in some way, and then the
machine will crash before the metadata update completes.

This is an extremely small hole for a race to fit in, but it is
theoretically possible and so should be closed.

So:
 - set MD_CHANGE_PENDING when requesting a metadata update for a
   failed device, so we can know with certainty when it completes
 - queue requests that completed when MD_CHANGE_PENDING is set to
   only be processed after the metadata update completes
 - call raid_end_bio_io() on bios in that queue when the time comes.
Signed-off-by: NeilBrown <neilb@suse.com>
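
The interlock described in c3cce6cd (and in the matching raid10 and raid1 commits) can be sketched as a small Python model. It is a user-space illustration under simplifying assumptions: `ArrayModel` and its methods are hypothetical, a boolean stands in for MD_CHANGE_PENDING, and appending to `completed` stands in for returning the write to the caller.

```python
class ArrayModel:
    def __init__(self):
        self.change_pending = False   # stand-in for MD_CHANGE_PENDING
        self.deferred = []            # writes held until metadata is safe
        self.completed = []           # writes reported back to the caller

    def record_device_failure(self):
        # A metadata update has been requested but is not yet on disk.
        self.change_pending = True

    def end_write(self, bio):
        # Hold completions while the metadata update is outstanding.
        if self.change_pending:
            self.deferred.append(bio)
        else:
            self.completed.append(bio)

    def metadata_update_done(self):
        # The failure is now recorded, so the held writes may complete.
        self.change_pending = False
        self.completed.extend(self.deferred)
        self.deferred.clear()
```

No write is acknowledged between the device failure and the metadata update reaching disk, which closes the window in which a crash could leave a stale drive looking trustworthy.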

md/raid5: use bio_list for the list of bios to return. · 34a6f80e

Committed by NeilBrown on Aug 14, 2015

This will make it easier to splice two lists together, which will
be needed in a future patch.
Signed-off-by: NeilBrown <neilb@suse.com>

md/raid10: ensure device failure recorded before write request returns. · 95af587e

Committed by NeilBrown on Aug 14, 2015

When a write to one of the legs of a RAID10 fails, the failure is
recorded in the metadata of the other legs so that after a restart
the data on the failed drive won't be trusted even if that drive seems
to be working again (maybe a cable was unplugged).

Currently there is no interlock between the write request completing
and the metadata update.  So it is possible that the write will
complete, the app will confirm success in some way, and then the
machine will crash before the metadata update completes.

This is an extremely small hole for a race to fit in, but it is
theoretically possible and so should be closed.

So:
 - set MD_CHANGE_PENDING when requesting a metadata update for a
   failed device, so we can know with certainty when it completes
 - queue requests that experienced an error on a new queue which
   is only processed after the metadata update completes
 - call raid_end_bio_io() on bios in that queue when the time comes.
Signed-off-by: NeilBrown <neilb@suse.com>

md/raid1: ensure device failure recorded before write request returns. · 55ce74d4

Committed by NeilBrown on Aug 14, 2015

When a write to one of the legs of a RAID1 fails, the failure is
recorded in the metadata of the other leg(s) so that after a restart
the data on the failed drive won't be trusted even if that drive seems
to be working again (maybe a cable was unplugged).

Similarly when we record a bad-block in response to a write failure,
we must not let the write complete until the bad-block update is safe.

Currently there is no interlock between the write request completing
and the metadata update.  So it is possible that the write will
complete, the app will confirm success in some way, and then the
machine will crash before the metadata update completes.

This is an extremely small hole for a race to fit in, but it is
theoretically possible and so should be closed.

So:
 - set MD_CHANGE_PENDING when requesting a metadata update for a
   failed device, so we can know with certainty when it completes
 - queue requests that experienced an error on a new queue which
   is only processed after the metadata update completes
 - call raid_end_bio_io() on bios in that queue when the time comes.
Signed-off-by: NeilBrown <neilb@suse.com>

md-cluster: remove inappropriate try_module_get from join() · 18b9f679

Committed by NeilBrown on Aug 14, 2015

md_setup_cluster already calls try_module_get(), so this
try_module_get isn't needed.
Also, there is no matching module_put (except in the error path),
so this leaves an unbalanced module count.
Signed-off-by: NeilBrown <neilb@suse.com>

md: extend spinlock protection in register_md_cluster_operations · 6022e75b

Committed by NeilBrown on Aug 13, 2015

This code looks racy.

The only possible race is if two modules try to register at the same
time and that won't happen.  But make the code look safe anyway.
Signed-off-by: NeilBrown <neilb@suse.com>

md-cluster: Read the disk bitmap sb and check if it needs recovery · abb9b22a

Committed by Guoqing Jiang on Jul 10, 2015

In gather_all_resync_info, we need to read the disk bitmap sb and
check if it needs recovery.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>

md-cluster: only call complete(&cinfo->completion) when node join cluster · eece075c

Committed by Guoqing Jiang on Jul 10, 2015

Introduce the MD_CLUSTER_BEGIN_JOIN_CLUSTER flag to make sure
complete(&cinfo->completion) is only invoked when a node
joins the cluster. Otherwise a node failure could also call
complete(), which doesn't make sense.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>

md-cluster: add missed lockres_free · 6e6d9f2c

Committed by Guoqing Jiang on Jul 10, 2015

We also need to free the lock resource before goto out.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>

md-cluster: remove the unused sb_lock · b2b9bfff

Committed by Guoqing Jiang on Jul 10, 2015

The sb_lock is not used anywhere, so let's remove it.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>

md-cluster: init suspend_list and suspend_lock early in join · 9e3072e3

Committed by Guoqing Jiang on Jul 10, 2015

If a node has just joined the cluster and receives a message from other
nodes before suspend_list is initialized, the kernel crashes with a NULL
pointer dereference, so move the initializations earlier to fix the bug.

md-cluster: Joined cluster 3578507b-e0cb-6d4f-6322-696cd7b1b10c slot 3
BUG: unable to handle kernel NULL pointer dereference at           (null)
... ... ...
Call Trace:
[<ffffffffa0444924>] process_recvd_msg+0x2e4/0x330 [md_cluster]
[<ffffffffa0444a06>] recv_daemon+0x96/0x170 [md_cluster]
[<ffffffffa045189d>] md_thread+0x11d/0x170 [md_mod]
[<ffffffff810768c4>] kthread+0xb4/0xc0
[<ffffffff8151927c>] ret_from_fork+0x7c/0xb0
... ... ...
RIP  [<ffffffffa0443581>] __remove_suspend_info+0x11/0xa0 [md_cluster]
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>

md-cluster: add the error check if failed to get dlm lock · b5ef5678

Committed by Guoqing Jiang on Jul 10, 2015

In a complicated cluster environment, it is possible that the
dlm lock cannot be acquired or converted as intended, so the related
error messages are added to make potential issues easier to debug.

For lockres_free, if the lock is blocked by a lock request or
conversion request, dlm_unlock just puts it back on the grant
queue, so we need to ensure the lock is finally freed.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>

openanolis / cloud-kernel · synced successfully more than 1 year ago
