提交 · 414e6b9a7032a6c2f5ddf018fdb199190b075170 · openanolis / cloud-kernel

14 6月, 2016 4 次提交

md/raid1, raid10: don't recheck "Faulty" flag in read-balance. · 414e6b9a

由 NeilBrown 提交于 6月 02, 2016

Re-checking the faulty flag here brings no value.
The comment about "risk" refers to the risk that the device could
be in the process of being removed by ->hot_remove_disk().
However providing that the ->nr_pending count is incremented inside
an rcu_read_locked() region, there is no risk of that happening.

This is because the rdev pointer (in the personalities array) is set
to NULL before synchronize_rcu(), and ->nr_pending is tested
afterwards.  If the rcu_read_locked region happens before the
synchronize_rcu(), the test will see that nr_pending has been incremented.
If it happens afterwards, the rdev pointer will be NULL so there is nothing
to increment.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

414e6b9a

md: disconnect device from personality before trying to remove it. · 8430e7e0

由 NeilBrown 提交于 6月 02, 2016

When the HOT_REMOVE_DISK ioctl is used to remove a device, we
call remove_and_add_spares() which will remove it from the personality
if possible.  This improves the chances that the removal will succeed.

When writing "remove" to dev-XX/state, we don't.  So that can fail more easily.

So add the remove_and_add_spares() into "remove" handling.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

8430e7e0

raid1/raid10: slow down resync if there is non-resync activity pending · 7ac50447

由 Tomasz Majchrzak 提交于 6月 13, 2016

A performance drop of mkfs has been observed on RAID10 during resync
since commit 09314799 ("md: remove 'go_faster' option from
->sync_request()"). Resync sends so many IOs it slows down non-resync
IOs significantly (few times). Add a short delay to a resync. The
previous long sleep (1s) has proven unnecessary, even very short delay
brings performance right.

The change also applied to raid1. The problem has not been observed on
raid1, however it shares barriers code with raid10 so it might be an
issue for some setup too.
Suggested-by: NNeilBrown <neilb@suse.com>
Link: http://lkml.kernel.org/r/20160609134555.GA9104@proton.igk.intel.comSigned-off-by: NTomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: NShaohua Li <shli@fb.com>

7ac50447

MD:Update superblock when err == 0 in size_store · 4ba1e788

由 Xiao Ni 提交于 6月 12, 2016

This is a simple check before updating the superblock. It should update
the superblock when update_size return 0.
Signed-off-by: NXiao Ni <xni@redhat.com>
Signed-off-by: NShaohua Li <shli@fb.com>

4ba1e788

10 6月, 2016 1 次提交

md: use a mutex to protect a global list · 5b1f5bc3

由 Cong Wang 提交于 6月 08, 2016

We saw a list corruption in the list all_detected_devices:

 WARNING: CPU: 16 PID: 226 at lib/list_debug.c:29 __list_add+0x3c/0xa9()
 list_add corruption. next->prev should be prev (ffff880859d58320), but was ffff880859ce74c0. (next=ffffffff81abfdb0).
 Modules linked in: ahci libahci libata sd_mod scsi_mod
 CPU: 16 PID: 226 Comm: kworker/u241:4 Not tainted 4.1.20 #1
 Hardware name: Dell Inc. PowerEdge C6220/04GD66, BIOS 2.2.3 11/07/2013
 Workqueue: events_unbound async_run_entry_fn
  0000000000000000 ffff880859a5baf8 ffffffff81502872 ffff880859a5bb48
  0000000000000009 ffff880859a5bb38 ffffffff810692a5 ffff880859ee8828
  ffffffff812ad02c ffff880859d58320 ffffffff81abfdb0 ffff880859eb90c0
 Call Trace:
  [<ffffffff81502872>] dump_stack+0x4d/0x63
  [<ffffffff810692a5>] warn_slowpath_common+0xa1/0xbb
  [<ffffffff812ad02c>] ? __list_add+0x3c/0xa9
  [<ffffffff81069305>] warn_slowpath_fmt+0x46/0x48
  [<ffffffff812ad02c>] __list_add+0x3c/0xa9
  [<ffffffff81406f28>] md_autodetect_dev+0x41/0x62
  [<ffffffff81285862>] rescan_partitions+0x25f/0x29d
  [<ffffffff81506372>] ? mutex_lock+0x13/0x31
  [<ffffffff811a090f>] __blkdev_get+0x1aa/0x3cd
  [<ffffffff811a0b91>] blkdev_get+0x5f/0x294
  [<ffffffff81377ceb>] ? put_device+0x17/0x19
  [<ffffffff8128227c>] ? disk_put_part+0x12/0x14
  [<ffffffff812836f3>] add_disk+0x29d/0x407
  [<ffffffff81384345>] ? __pm_runtime_use_autosuspend+0x5c/0x64
  [<ffffffffa004a724>] sd_probe_async+0x115/0x1af [sd_mod]
  [<ffffffff81083177>] async_run_entry_fn+0x72/0x12c
  [<ffffffff8107c44c>] process_one_work+0x198/0x2ce
  [<ffffffff8107cac7>] worker_thread+0x1dd/0x2bb
  [<ffffffff8107c8ea>] ? cancel_delayed_work_sync+0x15/0x15
  [<ffffffff8107c8ea>] ? cancel_delayed_work_sync+0x15/0x15
  [<ffffffff81080d9c>] kthread+0xae/0xb6
  [<ffffffff81080000>] ? param_array_set+0x40/0xfa
  [<ffffffff81080cee>] ? __kthread_parkme+0x61/0x61
  [<ffffffff81508152>] ret_from_fork+0x42/0x70
  [<ffffffff81080cee>] ? __kthread_parkme+0x61/0x61

I suspect it is because there is no lock protecting this
global list, autostart_arrays() is called in ioctl() path
where there is no lock.

Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: NShaohua Li <shli@fb.com>

5b1f5bc3

04 6月, 2016 2 次提交

G
md: simplify the code with md_kick_rdev_from_array · db767672
由 Guoqing Jiang 提交于 6月 02, 2016
```
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>
```
db767672

md-cluster: fix deadlock issue when add disk to an recoverying array · bb8bf15b

由 Guoqing Jiang 提交于 6月 02, 2016

Add a disk to an array which is performing recovery
is a little complicated, we need to do both reap the
sync thread and perform add disk for the case, then
it caused deadlock as follows.

linux44:~ # ps aux|grep md|grep D
root      1822  0.0  0.0      0     0 ?        D    16:50   0:00 [md127_resync]
root      1848  0.0  0.0  19860   952 pts/0    D+   16:50   0:00 mdadm --manage /dev/md127 --re-add /dev/vdb
linux44:~ # cat /proc/1848/stack
[<ffffffff8107afde>] kthread_stop+0x6e/0x120
[<ffffffffa051ddb0>] md_unregister_thread+0x40/0x80 [md_mod]
[<ffffffffa0526e45>] md_reap_sync_thread+0x15/0x150 [md_mod]
[<ffffffffa05271e0>] action_store+0x260/0x270 [md_mod]
[<ffffffffa05206b4>] md_attr_store+0xb4/0x100 [md_mod]
[<ffffffff81214a7e>] sysfs_write_file+0xbe/0x140
[<ffffffff811a6b98>] vfs_write+0xb8/0x1e0
[<ffffffff811a75b8>] SyS_write+0x48/0xa0
[<ffffffff8152a5c9>] system_call_fastpath+0x16/0x1b
[<00007f068ea1ed30>] 0x7f068ea1ed30
linux44:~ # cat /proc/1822/stack
[<ffffffffa05251a6>] md_do_sync+0x846/0xf40 [md_mod]
[<ffffffffa052402d>] md_thread+0x16d/0x180 [md_mod]
[<ffffffff8107ad94>] kthread+0xb4/0xc0
[<ffffffff8152a518>] ret_from_fork+0x58/0x90

                        Task1848                                Task1822
md_attr_store (held reconfig_mutex by call mddev_lock())
                        action_store
			md_reap_sync_thread
			md_unregister_thread
			kthread_stop                    md_wakeup_thread(mddev->thread);
						wait_event(mddev->sb_wait, !test_bit(MD_CHANGE_PENDING))

md_check_recovery is triggered by wakeup mddev->thread,
but it can't clear MD_CHANGE_PENDING flag since it can't
get lock which was held by md_attr_store already.

To solve the deadlock problem, we move "->resync_finish()"
from md_do_sync to md_reap_sync_thread (after md_update_sb),
also MD_HELD_RESYNC_LOCK is introduced since it is possible
that node can't get resync lock in md_do_sync.

Then we do not need to wait for MD_CHANGE_PENDING is cleared
or not since metadata should be updated after md_update_sb,
so just call resync_finish if MD_HELD_RESYNC_LOCK is set.

We also unified the code after skip label, since set PENDING
for non-clustered case should be harmless.
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

bb8bf15b

26 5月, 2016 1 次提交

right meaning of PARITY_ENABLE_RMW and PARITY_PREFER_RMW · 41257580

由 Song Liu 提交于 5月 23, 2016

In current handle_stripe_dirtying, the code prefers rmw with
PARITY_ENABLE_RMW; while prefers rcw with PARITY_PREFER_RMW.

This patch reverses this behavior.
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NShaohua Li <shli@fb.com>

41257580

13 5月, 2016 4 次提交

dm thin: unroll issue_discard() to create longer discard bio chains · 202bae52

由 Joe Thornber 提交于 5月 04, 2016

There is little benefit to doing this but it does structure DM thinp's
code to more cleanly use the __blkdev_issue_discard() interface --
particularly in passdown_double_checking_shared_status().
Signed-off-by: NJoe Thornber <ejt@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

202bae52

dm thin: use __blkdev_issue_discard for async discard support · 3dba53a9

由 Mike Snitzer 提交于 5月 02, 2016

With commit 38f25255 ("block: add __blkdev_issue_discard") DM thinp
no longer needs to carry its own async discard method.
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Acked-by: NJoe Thornber <ejt@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

3dba53a9

dm thin: remove __bio_inc_remaining() and switch to using bio_inc_remaining() · 13e4f8a6

由 Mike Snitzer 提交于 5月 04, 2016

DM thinp's use of bio_inc_remaining() is critical to ensure the original
parent discard bio isn't completed before sub-discards have. DM thinp
needs this due to the extra quiescing that occurs, via multiple DM thinp
mappings, while processing large discards. As such DM thinp must build
the async discard bio chain after some delay -- so bio_inc_remaining()
is used to enable DM thinp to take a reference on the original parent
discard bio for each mapping. This allows the immediate use of
bio_endio() on that discard bio; but with the understanding that the
actual completion won't occur until each of the sub-discards'
per-mapping references are dropped.
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Acked-by: NJoe Thornber <ejt@redhat.com>

13e4f8a6

dm raid: make sure no feature flags are set in metadata · 4c9971ca

由 Heinz Mauelshagen 提交于 4月 29, 2016

Given we don't yet support any feature flags in the dm-raid ondisk
metadata (see: 'features' member of 'struct dm_raid_superblock'),
add a check to ensure no flags are actually set, if any features are
set reject the activation of the RAID mapping.

This is to prevent possible data corruption in case of a kernel
downgrade when there'll potentially be feature flags set by a future
dm-raid target.
Signed-off-by: NHeinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

4c9971ca

10 5月, 2016 6 次提交

md-cluster: check the return value of process_recvd_msg · 1fa9a1ad

由 Guoqing Jiang 提交于 5月 03, 2016

We don't need to run the full path of recv_daemon
if process_recvd_msg doesn't return 0.
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

1fa9a1ad

md-cluster: gather resync infos and enable recv_thread after bitmap is ready · 51e453ae

由 Guoqing Jiang 提交于 5月 04, 2016

The in-memory bitmap is not ready when node joins cluster,
so it doesn't make sense to make gather_all_resync_info()
called so earlier, we need to call it after the node's
bitmap is setup. Also, recv_thread could be wake up after
node joins cluster, but it could cause problem if node
receives RESYNCING message without persionality since
mddev->pers->quiesce is called in process_suspend_info.

This commit introduces a new cluster interface load_bitmaps
to fix above problems, load_bitmaps is called in bitmap_load
where bitmap and persionality are ready, and load_bitmaps
does the following tasks:

1. call gather_all_resync_info to load all the node's
   bitmap info.
2. set MD_CLUSTER_ALREADY_IN_CLUSTER bit to recv_thread
   could be wake up, and wake up recv_thread if there is
   pending recv event.

Then ack_bast only wakes up recv_thread after IN_CLUSTER
bit is ready otherwise MD_CLUSTER_PENDING_RESYNC_EVENT is
set.
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

51e453ae

md: set MD_CHANGE_PENDING in a atomic region · 85ad1d13

由 Guoqing Jiang 提交于 5月 03, 2016

Some code waits for a metadata update by:

1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
2. setting MD_CHANGE_PENDING and waking the management thread
3. waiting for MD_CHANGE_PENDING to be cleared

If the first two are done without locking, the code in md_update_sb()
which checks if it needs to repeat might test if an update is needed
before step 1, then clear MD_CHANGE_PENDING after step 2, resulting
in the wait returning early.

So make sure all places that set MD_CHANGE_PENDING are atomicial, and
bit_clear_unless (suggested by Neil) is introduced for the purpose.

Cc: Martin Kepplinger <martink@posteo.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: <linux-kernel@vger.kernel.org>
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

85ad1d13

md: raid5: add prerequisite to run underneath dm-raid · fe67d19a

由 Heinz Mauelshagen 提交于 5月 03, 2016

In case md runs underneath the dm-raid target, the mddev does not have
a request queue or gendisk, thus avoid accesses.

This patch adds a missing conditional to the raid5 personality.
Signed-of-by: NHeinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NShaohua Li <shli@fb.com>

fe67d19a

md: raid10: add prerequisite to run underneath dm-raid · 859644f0

由 Heinz Mauelshagen 提交于 5月 03, 2016

In case md runs underneath the dm-raid target, the mddev does not have
a request queue or gendisk, thus avoid accesses to it.

This patch adds two missing conditionals to the raid10 personality.
Signed-of-by: NHeinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NShaohua Li <shli@fb.com>

859644f0

md: md.c: fix oops in mddev_suspend for raid0 · 092398dc

由 Heinz Mauelshagen 提交于 5月 03, 2016

Introduced by upstream commit 70d9798b

The raid0 personality does not create mddev->thread as oposed to
other personalities leading to its unconditional access in
mddev_suspend() causing an oops.

Patch checks for mddev->thread in order to keep the
intention of aforementioned commit.

Fixes: 70d9798b ("MD: warn for potential deadlock")
Cc: stable@vger.kernel.org (4.5+)
Signed-off-by: NHeinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NShaohua Li <shli@fb.com>

092398dc

06 5月, 2016 7 次提交

dm ioctl: drop use of __GFP_REPEAT in copy_params()'s __vmalloc() call · 72f6d8d8

由 Michal Hocko 提交于 4月 28, 2016

copy_params()'s use of __GFP_REPEAT for the __vmalloc() call doesn't make much
sense because vmalloc doesn't rely on costly high order allocations.
Signed-off-by: NMichal Hocko <mhocko@suse.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

72f6d8d8

dm mpath: eliminate use of spinlock in IO fast-paths · 2da1610a

由 Mike Snitzer 提交于 3月 17, 2016

The primary motivation of this commit is to improve the scalability of
DM multipath on large NUMA systems where m->lock spinlock contention has
been proven to be a serious bottleneck on really fast storage.

The ability to atomically read a pointer, using lockless_dereference(),
is leveraged in this commit.  But all pointer writes are still protected
by the m->lock spinlock (which is fine since these all now occur in the
slow-path).

The following functions no longer require the m->lock spinlock in their
fast-path: multipath_busy(), __multipath_map(), and do_end_io()

And choose_pgpath() is modified to _not_ update m->current_pgpath unless
it also switches the path-group.  This is done to avoid needing to take
the m->lock everytime __multipath_map() calls choose_pgpath().
But m->current_pgpath will be reset if it is failed via fail_path().
Suggested-by: NJeff Moyer <jmoyer@redhat.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Tested-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

2da1610a

dm mpath: move trigger_event member to the end of 'struct multipath' · 20800cb3

由 Mike Snitzer 提交于 3月 17, 2016

Allows the 'work_mutex' member to no longer cross a cacheline.
Reviewed-by: NHannes Reinecke <hare@suse.com>
Tested-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

20800cb3

dm mpath: use atomic_t for counting members of 'struct multipath' · 91e968aa

由 Mike Snitzer 提交于 3月 17, 2016

The use of atomic_t for nr_valid_paths, pg_init_in_progress and
pg_init_count will allow relaxing the use of the m->lock spinlock.
Suggested-by: NHannes Reinecke <hare@suse.de>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Tested-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

91e968aa

dm mpath: switch to using bitops for state flags · 518257b1

由 Mike Snitzer 提交于 3月 17, 2016

Mechanical change that doesn't make any real effort to reduce the use of
m->lock; that will come later (once atomics are used for counters, etc).
Suggested-by: NHannes Reinecke <hare@suse.de>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Tested-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

518257b1

dm thin: Remove return statement from void function · 813923b1

由 Amitoj Kaur Chawla 提交于 4月 11, 2016

Return statement at the end of a void function is useless.

The Coccinelle semantic patch used to make this change is as follows:
//<smpl>
@@
identifier f;
expression e;
@@
void f(...) {
<...
- return
  e;
...>
}
//</smpl>
Signed-off-by: NAmitoj Kaur Chawla <amitoj1606@gmail.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

813923b1

M
dm: remove unused mapped_device argument from free_tio() · cfae7529
由 Mike Snitzer 提交于 4月 11, 2016
```
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
```
cfae7529

05 5月, 2016 13 次提交

md-cluster: fix ifnullfree.cocci warnings · bc47e842

由 kbuild test robot 提交于 5月 02, 2016

drivers/md/bitmap.c:2049:6-11: WARNING: NULL check before freeing functions like kfree, debugfs_remove, debugfs_remove_recursive or usb_free_urb is not needed. Maybe consider reorganizing relevant code to avoid passing NULL values.

NULL check before some freeing functions is not needed.

Based on checkpatch warning
"kfree(NULL) is safe this check is probably not required"
and kfreeaddr.cocci by Julia Lawall.

Generated by: scripts/coccinelle/free/ifnullfree.cocci
Acked-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>
Signed-off-by: NShaohua Li <shli@fb.com>

bc47e842

md-cluster/bitmap: unplug bitmap to sync dirty pages to disk · c84400c8

由 Guoqing Jiang 提交于 5月 02, 2016

This patch is doing two distinct but related things.

1. It adds bitmap_unplug() for the main bitmap (mddev->bitmap).  As bit
have been set, BITMAP_PAGE_DIRTY is set so bitmap_deamon_work() will
not write those pages out in its regular scans, only bitmap_unplug()
will.  If there are no writes to the array, bitmap_unplug() won't be
called, so we need to call it explicitly here.

2. bitmap_write_all() is a bit of a confusing interface as it doesn't
actually write anything.  The current code for writing "bitmap" works
but this change makes it a bit clearer.
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

c84400c8

md-cluster/bitmap: fix wrong page num in bitmap_file_clear_bit and bitmap_file_set_bit · 23cea66a

由 Guoqing Jiang 提交于 5月 02, 2016

The pnum passed to set_page_attr and test_page_attr should from
0 to storage.file_pages - 1, but bitmap_file_set_bit and
bitmap_file_clear_bit call set_page_attr and test_page_attr with
page->index parameter while page->index has already added node_offset
before.

So we need to minus node_offset in both bitmap_file_clear_bit
and bitmap_file_set_bit.
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

23cea66a

md-cluster/bitmap: fix wrong calcuation of offset · 7f86ffed

由 Guoqing Jiang 提交于 5月 02, 2016

The offset is wrong in bitmap_storage_alloc, we should
set it like below in bitmap_init_from_disk().

node_offset = bitmap->cluster_slot * (DIV_ROUND_UP(store->bytes, PAGE_SIZE));

Because 'offset' is only assigned to 'page->index' and
that is usually over-written by read_sb_page. So it does
not cause problem in general, but it still need to be fixed.
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

7f86ffed

md-cluster: sync bitmap when node received RESYNCING msg · 18c9ff7f

由 Guoqing Jiang 提交于 5月 02, 2016

If the node received RESYNCING message which means
another node will perform resync with the area, then
we don't want to do it again in another node.

Let's set RESYNC_MASK and clear NEEDED_MASK for the
region from old-low to new-low which has finished
syncing, and the region from old-hi to new-hi is about
to syncing, bitmap_sync_with_cluste is introduced for
the purpose.
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

18c9ff7f

md-cluster: always setup in-memory bitmap · c9d65032

由 Guoqing Jiang 提交于 5月 02, 2016

The in-memory bitmap for raid is allocated on demand,
then for cluster scenario, it is possible that slave
node which received RESYNCING message doesn't have the
in-memory bitmap when master node is perform resyncing,
so we can't make bitmap is match up well among each
nodes.

So for cluster scenario, we need always preserve the
bitmap, and ensure the page will not be freed. And a
no_hijack flag is introduced to both bitmap_checkpage
and bitmap_get_counter, which makes cluster raid returns
fail once allocate failed.

And the next patch is relied on this change since it
keeps sync bitmap among each nodes during resyncing
stage.
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

c9d65032

md-cluster: wakeup thread if activated a spare disk · a578183e

由 Guoqing Jiang 提交于 5月 02, 2016

When a device is re-added, it will ultimately need
to be activated and that happens in md_check_recovery,
so we need to set MD_RECOVERY_NEEDED right after
remove_and_add_spares.

A specifical issue without the change is that when
one node perform fail/remove/readd on a disk, but
slave nodes could not add the disk back to array as
expected (added as missed instead of in sync). So
give slave nodes a chance to do resync.
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

a578183e

md-cluster: change array_sectors and update size are not supported · ab5a98b1

由 Guoqing Jiang 提交于 5月 02, 2016

Currently, some features are not supported yet,
such as change array_sectors and update size, so
return EINVAL for them and listed it in document.
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

ab5a98b1

md-cluster: fix locking when node joins cluster during message broadcast · 1535212c

由 Guoqing Jiang 提交于 5月 02, 2016

If a node joins the cluster while a message broadcast
is under way, a lock issue could happen as follows.

For a cluster which included two nodes, if node A is
calling __sendmsg before up-convert CR to EX on ack,
and node B released CR on ack. But if a new node C
joins the cluster and it doesn't receive the message
which A sent before, so it could hold CR on ack before
A up-convert CR to EX on ack.

So a node joining the cluster should get an EX lock on
the "token" first to ensure no broadcast is ongoing,
then release it after held CR on ack.
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

1535212c

md-cluster: unregister thread if err happened · 5b0fb33e

由 Guoqing Jiang 提交于 5月 02, 2016

The two threads need to be unregistered if a node
can't join cluster successfully.
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

5b0fb33e

md-cluster: wake up thread to continue recovery · eb315cd0

由 Guoqing Jiang 提交于 5月 02, 2016

In recovery case, we need to set MD_RECOVERY_NEEDED
and wake up thread only if recover is not finished.
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

eb315cd0

md-cluser: make resync_finish only called after pers->sync_request · 2c97cf13

由 Guoqing Jiang 提交于 5月 02, 2016

It is not reasonable that cluster raid to release resync
lock before the last pers->sync_request has finished.

As the metadata will be changed when node performs resync,
we need to inform other nodes to update metadata, so the
MD_CHANGE_PENDING flag is set before finish resync.

Then metadata_update_finish is move ahead to ensure that
METADATA_UPDATED msg is sent before finish resync, and
metadata_update_start need to be run after "repeat:" label
accordingly.
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

2c97cf13

md-cluster: change resync lock from asynchronous to synchronous · 41a9a0dc

由 Guoqing Jiang 提交于 5月 02, 2016

If multiple nodes choose to attempt do resync at the same time
they need to be serialized so they don't duplicate effort. This
serialization is done by locking the 'resync' DLM lock.

Currently if a node cannot get the lock immediately it doesn't
request notification when the lock becomes available (i.e.
DLM_LKF_NOQUEUE is set), so it may not reliably find out when it
is safe to try again.

Rather than trying to arrange an async wake-up when the lock
becomes available, switch to using synchronous locking - this is
a lot easier to think about.  As it is not permitted to block in
the 'raid1d' thread, move the locking to the resync thread.  So
the rsync thread is forked immediately, but it blocks until the
resync lock is available. Once the lock is locked it checks again
if any resync action is needed.

A particular symptom of the current problem is that a node can
get stuck with "resync=pending" indefinitely.
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

41a9a0dc

30 4月, 2016 1 次提交

raid5: delete unnecessary warnning · b8a0b8e9

由 Shaohua Li 提交于 4月 29, 2016

If device has R5_LOCKED set, it's legit device has R5_SkipCopy set and page !=
orig_page. After R5_LOCKED is clear, handle_stripe_clean_event will clear the
SkipCopy flag and set page to orig_page. So the warning is unnecessary.
Reported-by: NJoey Liao <joeyliao@qnap.com>
Signed-off-by: NShaohua Li <shli@fb.com>

b8a0b8e9

26 4月, 2016 1 次提交

MD: make bio mergeable · 9c573de3

由 Shaohua Li 提交于 4月 25, 2016

blk_queue_split marks bio unmergeable, which makes sense for normal bio.
But if dispatching the bio to underlayer disk, the blk_queue_split
checks are invalid, hence it's possible the bio becomes mergeable.

In the reported bug, this bug causes trim against raid0 performance slash
https://bugzilla.kernel.org/show_bug.cgi?id=117051Reported-and-tested-by: NPark Ju Hyung <qkrwngud825@gmail.com>
Fixes: 6ac45aeb(block: avoid to merge splitted bio)
Cc: stable@vger.kernel.org (v4.3+)
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Neil Brown <neilb@suse.de>
Reviewed-by: NJens Axboe <axboe@fb.com>
Signed-off-by: NShaohua Li <shli@fb.com>

9c573de3

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功