1. 02 Oct, 2015: 5 commits
  2. 12 Sep, 2015: 1 commit
  3. 01 Sep, 2015: 33 commits
    • J
      dm cache: fix use after freeing migrations · cc7da0ba
      Committed by Joe Thornber
      Both free_io_migration() and issue_discard() dereference a migration
      that was just freed.  Fix those by saving off the migration's cache
      object before freeing the migration.  Also clean up needless mg->cache
      dereferences now that the cache object is available directly.
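
      A minimal sketch of the corrected pattern, with simplified helpers (the
      real code lives in drivers/md/dm-cache-target.c):

          static void free_io_migration(struct dm_cache_migration *mg)
          {
                  struct cache *cache = mg->cache; /* save before freeing */

                  free_migration(mg);   /* mg must not be touched after this */
                  wake_worker(cache);   /* use the saved pointer, not mg->cache */
          }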
      
      Fixes: e44b6a5a ("dm cache: move wake_waker() from free_migrations() to where it is needed")
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      cc7da0ba
    • M
      dm cache: small cleanups related to deferred prison cell cleanup · dc9cee5d
      Committed by Mike Snitzer
      Eliminate __cell_release() since it only had one caller that always
      released the cell holder.
      
      Switch cell_error_with_code() to using free_prison_cell() for the sake
      of consistency.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      dc9cee5d
    • J
      dm cache: fix leaking of deferred bio prison cells · 9153df74
      Committed by Joe Thornber
      There were two cases where dm_cell_visit_release() was being called,
      which removes the cell from the prison's rbtree, but the callers didn't
      also return the cell to the mempool.  Fix this by having them call
      free_prison_cell().
      
      This leak manifested as the 'kmalloc-96' slab growing until OOM.
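
      A hedged sketch of the corrected pattern (dm_cell_visit_release() is the
      bio-prison API; free_prison_cell() is the dm-cache helper that returns
      the cell to the mempool; fn and context are placeholders):

          dm_cell_visit_release(cache->prison, fn, context, cell);
          free_prison_cell(cache, cell);  /* previously missing: the cell leaked */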
      
      Fixes: 651f5fa2 ("dm cache: defer whole cells")
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 4.1+
      9153df74
    • N
      md/raid5: ensure device failure recorded before write request returns. · c3cce6cd
      Committed by NeilBrown
      When a write to one of the devices of a RAID5/6 array fails, the failure
      is recorded in the metadata of the other devices so that after a restart
      the data on the failed drive won't be trusted even if that drive seems
      to be working again (maybe a cable was unplugged).
      
      Similarly when we record a bad-block in response to a write failure,
      we must not let the write complete until the bad-block update is safe.
      
      Currently there is no interlock between the write request completing
      and the metadata update.  So it is possible that the write will
      complete, the app will confirm success in some way, and then the
      machine will crash before the metadata update completes.
      
      This is an extremely small hole for a race to fit in, but it is
      theoretically possible and so should be closed.
      
      So:
       - set MD_CHANGE_PENDING when requesting a metadata update for a
         failed device, so we can know with certainty when it completes
       - queue requests that complete while MD_CHANGE_PENDING is set, and
         only process them after the metadata update completes
       - call raid_end_bio_io() on the bios in that queue when the time comes.
      A rough sketch of the interlock follows.
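
      The sketch below uses simplified names; the real code keeps the held-back
      bios on a bio_list and drains it from the raid5 thread once the
      superblock write is known to be safe:

          /* on a write error: request a metadata update and hold the bio back */
          set_bit(MD_CHANGE_PENDING, &mddev->flags);
          md_error(mddev, rdev);
          bio_list_add(&conf->return_bi, bi);      /* deferred, not completed yet */

          /* later, once the metadata update has cleared MD_CHANGE_PENDING */
          if (!test_bit(MD_CHANGE_PENDING, &mddev->flags)) {
                  struct bio *bi;

                  while ((bi = bio_list_pop(&conf->return_bi)))
                          bio_endio(bi);           /* now safe to report completion */
          }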
      Signed-off-by: NeilBrown <neilb@suse.com>
      c3cce6cd
    • N
      md/raid5: use bio_list for the list of bios to return. · 34a6f80e
      Committed by NeilBrown
      This will make it easier to splice two lists together, which will
      be needed in a future patch.
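
      For reference, a hedged example of the bio_list API this switch enables
      (the return_bi field name is illustrative):

          struct bio_list tmp;

          bio_list_init(&tmp);
          bio_list_add(&tmp, bio);                  /* collect bios to return */
          bio_list_merge(&conf->return_bi, &tmp);   /* splice one list onto another */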
      Signed-off-by: NeilBrown <neilb@suse.com>
      34a6f80e
    • N
      md/raid10: ensure device failure recorded before write request returns. · 95af587e
      Committed by NeilBrown
      When a write to one of the legs of a RAID10 array fails, the failure is
      recorded in the metadata of the other legs so that after a restart
      the data on the failed drive won't be trusted even if that drive seems
      to be working again (maybe a cable was unplugged).
      
      Currently there is no interlock between the write request completing
      and the metadata update.  So it is possible that the write will
      complete, the app will confirm success in some way, and then the
      machine will crash before the metadata update completes.
      
      This is an extremely small hole for a race to fit in, but it is
      theoretically possible and so should be closed.
      
      So:
       - set MD_CHANGE_PENDING when requesting a metadata update for a
         failed device, so we can know with certainty when it completes
       - queue requests that experienced an error on a new queue which
         is only processed after the metadata update completes
       - call raid_end_bio_io() on bios in that queue when the time comes.
      Signed-off-by: NeilBrown <neilb@suse.com>
      95af587e
    • N
      md/raid1: ensure device failure recorded before write request returns. · 55ce74d4
      Committed by NeilBrown
      When a write to one of the legs of a RAID1 array fails, the failure is
      recorded in the metadata of the other leg(s) so that after a restart
      the data on the failed drive won't be trusted even if that drive seems
      to be working again (maybe a cable was unplugged).
      
      Similarly when we record a bad-block in response to a write failure,
      we must not let the write complete until the bad-block update is safe.
      
      Currently there is no interlock between the write request completing
      and the metadata update.  So it is possible that the write will
      complete, the app will confirm success in some way, and then the
      machine will crash before the metadata update completes.
      
      This is an extremely small hole for a race to fit in, but it is
      theoretically possible and so should be closed.
      
      So:
       - set MD_CHANGE_PENDING when requesting a metadata update for a
         failed device, so we can know with certainty when it completes
       - queue requests that experienced an error on a new queue which
         is only processed after the metadata update completes
       - call raid_end_bio_io() on bios in that queue when the time comes.
      Signed-off-by: NeilBrown <neilb@suse.com>
      55ce74d4
    • N
      md-cluster: remove inappropriate try_module_get from join() · 18b9f679
      Committed by NeilBrown
      md_setup_cluster already calls try_module_get(), so this
      try_module_get isn't needed.
      Also, there is no matching module_put (except in the error path),
      so this leaves an unbalanced module count.
      Signed-off-by: NeilBrown <neilb@suse.com>
      18b9f679
    • N
      md: extend spinlock protection in register_md_cluster_operations · 6022e75b
      Committed by NeilBrown
      This code looks racy.
      
      The only possible race is if two modules try to register at the same
      time and that won't happen.  But make the code look safe anyway.
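
      Roughly what the protected registration looks like (a sketch of the idea,
      not necessarily the exact hunk):

          int register_md_cluster_operations(struct md_cluster_operations *ops,
                                             struct module *module)
          {
                  int ret = 0;

                  spin_lock(&pers_lock);
                  if (md_cluster_ops != NULL)
                          ret = -EALREADY;
                  else {
                          md_cluster_ops = ops;
                          md_cluster_mod = module;
                  }
                  spin_unlock(&pers_lock);

                  return ret;
          }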
      Signed-off-by: NeilBrown <neilb@suse.com>
      6022e75b
    • G
      md-cluster: Read the disk bitmap sb and check if it needs recovery · abb9b22a
      Committed by Guoqing Jiang
      In gather_all_resync_info, we need to read the disk bitmap sb and
      check if it needs recovery.
      Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      abb9b22a
    • G
      md-cluster: only call complete(&cinfo->completion) when node join cluster · eece075c
      Committed by Guoqing Jiang
      Introduce the MD_CLUSTER_BEGIN_JOIN_CLUSTER flag to make sure
      complete(&cinfo->completion) is only invoked when a node joins the
      cluster. Otherwise a node failure could also trigger the complete(),
      and it doesn't make sense to do it there.
      Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      eece075c
    • G
      md-cluster: add missed lockres_free · 6e6d9f2c
      Committed by Guoqing Jiang
      We also need to free the lock resource before the goto out.
      Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      6e6d9f2c
    • G
      md-cluster: remove the unused sb_lock · b2b9bfff
      Committed by Guoqing Jiang
      The sb_lock is not used anywhere, so let's remove it.
      Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      b2b9bfff
    • G
      md-cluster: init suspend_list and suspend_lock early in join · 9e3072e3
      Committed by Guoqing Jiang
      If a node has just joined the cluster and receives a message from other
      nodes before suspend_list is initialized, the kernel crashes with a NULL
      pointer dereference, so move the initializations earlier to fix the bug.
      
      md-cluster: Joined cluster 3578507b-e0cb-6d4f-6322-696cd7b1b10c slot 3
      BUG: unable to handle kernel NULL pointer dereference at           (null)
      ... ... ...
      Call Trace:
      [<ffffffffa0444924>] process_recvd_msg+0x2e4/0x330 [md_cluster]
      [<ffffffffa0444a06>] recv_daemon+0x96/0x170 [md_cluster]
      [<ffffffffa045189d>] md_thread+0x11d/0x170 [md_mod]
      [<ffffffff810768c4>] kthread+0xb4/0xc0
      [<ffffffff8151927c>] ret_from_fork+0x7c/0xb0
      ... ... ...
      RIP  [<ffffffffa0443581>] __remove_suspend_info+0x11/0xa0 [md_cluster]
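
      A minimal sketch of the fix (suspend_list and suspend_lock are real
      md-cluster fields; the placement inside join() is simplified):

          static int join(struct mddev *mddev, int nodes)
          {
                  struct md_cluster_info *cinfo;

                  cinfo = kzalloc(sizeof(*cinfo), GFP_KERNEL);
                  if (!cinfo)
                          return -ENOMEM;

                  /* initialize before the lockspace is created, so a message
                   * arriving mid-join finds a valid (empty) list */
                  INIT_LIST_HEAD(&cinfo->suspend_list);
                  spin_lock_init(&cinfo->suspend_lock);

                  /* dlm_new_lockspace(), recv thread setup, etc. follow */
                  return 0;
          }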
      Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      9e3072e3
    • G
      md-cluster: add the error check if failed to get dlm lock · b5ef5678
      Committed by Guoqing Jiang
      In a complicated cluster environment it is possible that the dlm lock
      cannot be acquired or converted; add the related error reporting to make
      potential issues easier to debug.

      For lockres_free, if the lock is blocked by a pending lock or conversion
      request, dlm_unlock just puts it back on the grant queue, so we need to
      make sure the lock is actually freed in the end.
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      b5ef5678
    • G
      md-cluster: init completion within lockres_init · b83d51c0
      Committed by Guoqing Jiang
      We should initialize the completion within lockres_init; otherwise the
      completion could be initialized more than once during its life cycle.
      Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      b83d51c0
    • G
      md-cluster: fix deadlock issue on message lock · 66099bb0
      Committed by Guoqing Jiang
      There is a problem with the previous communication mechanism: we hit the
      deadlock scenario below on a cluster with 3 nodes.
      
          Sender                       Receiver                   Receiver

          token(EX)
          message(EX)
          writes message
          downconverts message(CR)
          requests ack(EX)
                                       get message(CR)            gets message(CR)
                                       reads message              reads message
                                       requests EX on message     requests EX on message
      
      To fix this problem, we make the following changes:

      1. the sender downconverts MESSAGE to CW rather than CR.
      2. the receivers request a PR lock, not an EX lock, on message.

      And in case we fail to down-convert EX to CW on message, it is better to
      unlock message rather than keep holding the lock.
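
      In terms of lock modes (a hedged sketch; dlm_lock_sync() is the small
      synchronous wrapper around dlm_lock() used throughout md-cluster.c, and
      the error path shown is simplified):

          /* sender: after writing, drop MESSAGE to CW instead of CR */
          error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_CW);
          if (error)
                  goto failed_message;   /* unlock MESSAGE rather than hold EX */

          /* receivers: take MESSAGE in PR (shared) rather than EX */
          error = dlm_lock_sync(message_lockres, DLM_LOCK_PR);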
      Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Lidong Zhong <ldzhong@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      66099bb0
    • G
      md-cluster: transfer the resync ownership to another node · dc737d7c
      Committed by Guoqing Jiang
      When node A stops an array while the array is doing a resync, we need
      to let another node B take over the resync task.

      To achieve this, node A sends an explicit BITMAP_NEEDS_SYNC message to
      the cluster. The node B that receives that message invokes
      __recover_slot to do the resync.
      Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      dc737d7c
    • G
      md-cluster: split recover_slot for future code reuse · 05cd0e51
      Committed by Guoqing Jiang
      Make recover_slot a wrapper around __recover_slot, since the
      logic of __recover_slot can be reused when another node needs
      to take over the resync job.
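
      A hedged sketch of the resulting split (the DLM recovery callback keeps
      its signature and simply delegates):

          static void recover_slot(void *arg, struct dlm_slot *slot)
          {
                  struct mddev *mddev = arg;

                  /* dlm slot numbers start at 1, md-cluster slots at 0 */
                  __recover_slot(mddev, slot->slot - 1);
          }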
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      05cd0e51
    • G
      md-cluster: use %pU to print UUIDs · b89f704a
      Committed by Guoqing Jiang
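      For reference, the %pU printk extension formats a 16-byte UUID directly;
      an illustrative use (the fields shown are for illustration only):

          pr_info("md-cluster: joined cluster %pU slot %d\n",
                  mddev->uuid, cinfo->slot_number);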
      Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      b89f704a
    • S
      md: setup safemode_timer before it's being used · 25b2edfa
      Committed by Sasha Levin
      We used to set up the safemode_timer in md_run. If md_run failed before
      the timer was set up, we'd end up trying to modify a timer that doesn't
      have a callback function when we access safe_delay_store, which would
      trigger a BUG.
      
      neilb: delete init_timer() call as setup_timer() does that.
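
      The setup itself, sketched (per this patch it happens when the mddev is
      allocated rather than in md_run(); placement here is illustrative):

          setup_timer(&mddev->safemode_timer, md_safemode_timeout,
                      (unsigned long)mddev);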
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      25b2edfa
    • N
      md/raid5: handle possible race as reshape completes. · 6cbd8148
      Committed by NeilBrown
      It is possible (though unlikely) for a reshape to be
      interrupted between the time that end_reshape is called
      and the time when raid5_finish_reshape is called.
      
      This can leave conf->reshape_progress set to MaxSector,
      but mddev->reshape_position not.
      
      This combination confuses reshape_request() when ->reshape_backwards is set.
      As conf->reshape_progress is so high, it seems the reshape hasn't
      really begun.  But assuming MaxSector is a valid address only
      leads to sorrow.
      
      So ensure reshape_position and reshape_progress both agree,
      and add an extra check in reshape_request() just in case they don't.
      Signed-off-by: NeilBrown <neilb@suse.com>
      6cbd8148
    • N
      md: sync sync_completed has correct value as recovery finishes. · 5ed1df2e
      Committed by NeilBrown
      There can be a small window between the moment that recovery
      actually writes the last block and the time when various sysfs
      and /proc/mdstat attributes report that it has finished.
      During this time, 'sync_completed' can have the wrong value.
      This can confuse monitoring software.
      
      So:
       - don't set curr_resync_completed beyond the end of the devices,
       - set it correctly when resync/recovery has completed.
      Signed-off-by: NeilBrown <neilb@suse.com>
      5ed1df2e
    • N
      md: be careful when testing resync_max against curr_resync_completed. · c5e19d90
      Committed by NeilBrown
      While it generally shouldn't happen, it is not impossible for
      curr_resync_completed to exceed resync_max.
      This can particularly happen when reshaping RAID5 - the current
      status isn't copied to curr_resync_completed promptly, so when it
      is, it can exceed resync_max.
      This happens when the reshape is 'frozen', resync_max is set low,
      and reshape is re-enabled.
      
      Taking a difference between two unsigned numbers is always dangerous
      anyway, so add a test to behave correctly if
         curr_resync_completed > resync_max
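
      The hazard, in general terms (a sketch rather than the exact hunk):

          sector_t gap;

          if (mddev->curr_resync_completed > mddev->resync_max)
                  gap = 0;   /* subtracting blindly would wrap the unsigned value */
          else
                  gap = mddev->resync_max - mddev->curr_resync_completed;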
      Signed-off-by: NeilBrown <neilb@suse.com>
      c5e19d90
    • N
      md: set MD_RECOVERY_RECOVER when starting a degraded array. · a4a3d26d
      Committed by NeilBrown
      This ensures that 'sync_action' will show 'recover' as soon as the
      array is started.  If there is no spare the status will change to
      'idle' once that is detected.
      
      Clear MD_RECOVERY_RECOVER for a read-only array to ensure this change
      happens.
      
      This allows scripts which monitor status not to get confused -
      particularly my test scripts.
      Signed-off-by: NeilBrown <neilb@suse.com>
      a4a3d26d
    • N
      md/raid5: remove incorrect "min_t()" when calculating writepos. · c74c0d76
      Committed by NeilBrown
      This code is calculating:
        writepos, which is the furthest along address (device-space) that we
           *will* be writing to
        readpos, which is the earliest address that we *could* possibly read
           from, and
        safepos, which is the earliest address in the 'old' section that we
           might read from after a crash when the reshape position is
           recovered from metadata.
      
        The first is a precise calculation, so clipping at zero doesn't
        make sense.  As the reshape position is now guaranteed to always be
        a multiple of reshape_sectors and as we already BUG_ON when
        reshape_progress is zero, there is no point in this min_t() call.
      
        The readpos and safepos are worst case - actual value depends on
        precise geometry.  That worst case could be negative, which is only
        a problem because we are storing the value in an unsigned.
        So leave the min_t() for those.
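
      The relevant calculation, roughly as it stands after this change (from
      reshape_request() in drivers/md/raid5.c, simplified):

          writepos = conf->reshape_progress;
          sector_div(writepos, new_data_disks);
          readpos = conf->reshape_progress;
          sector_div(readpos, data_disks);
          safepos = conf->reshape_safe;
          sector_div(safepos, data_disks);

          if (mddev->reshape_backwards) {
                  BUG_ON(writepos < reshape_sectors);
                  writepos -= reshape_sectors;   /* exact, cannot go negative */
                  readpos  += reshape_sectors;
                  safepos  += reshape_sectors;
          } else {
                  writepos += reshape_sectors;
                  /* worst-case estimates: clip at zero, they are unsigned */
                  readpos -= min_t(sector_t, reshape_sectors, readpos);
                  safepos -= min_t(sector_t, reshape_sectors, safepos);
          }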
      Signed-off-by: NeilBrown <neilb@suse.com>
      
      c74c0d76
    • N
      md/raid5: strengthen check on reshape_position at run. · 05256d98
      Committed by NeilBrown
      When reshaping, we work in units of the largest chunk size.
      If changing from a larger to a smaller chunk size, that means we
      reshape more than one stripe at a time.  So the required alignment
      of reshape_position needs to take into account both the old
      and new chunk size.
      
      This means that both 'here_new' and 'here_old' are calculated with
      respect to the same (maximum) chunk size, so testing if they are the
      same when delta_disks is zero becomes pointless.
      Signed-off-by: NeilBrown <neilb@suse.com>
      05256d98
    • N
      md/raid5: switch to use conf->chunk_sectors in place of mddev->chunk_sectors where possible · 3cb5edf4
      Committed by NeilBrown
      The chunk_sectors and new_chunk_sectors fields of mddev can be changed
      any time (via sysfs) that the reconfig mutex can be taken.  So raid5
      keeps internal copies in 'conf' which are stable except for a short
      locked moment when reshape stops/starts.
      
      So any access that does not hold reconfig_mutex should use the 'conf'
      values, not the 'mddev' values.
      Several don't.
      
      This could result in corruption if new values were written at awkward
      times.
      
      Also use min() or max() rather than open-coding.
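
      The access pattern this enforces, sketched (illustrative, not a literal
      hunk from the patch):

          /* in paths that may run without reconfig_mutex, use the stable copies */
          unsigned int chunk_sectors = max(conf->chunk_sectors,
                                           conf->prev_chunk_sectors);
          /* ...instead of mddev->chunk_sectors / mddev->new_chunk_sectors */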
      Signed-off-by: NeilBrown <neilb@suse.com>
      3cb5edf4
    • N
      md/raid5: always set conf->prev_chunk_sectors and ->prev_algo · 5cac6bcb
      Committed by NeilBrown
      These aren't really needed when no reshape is happening,
      but it is safer to have them always set to a meaningful value.
      The next patch will use ->prev_chunk_sectors without checking
      if a reshape is happening (because that makes the code simpler),
      and this patch makes that safe.
      Signed-off-by: NeilBrown <neilb@suse.com>
      5cac6bcb
    • N
      md/raid10: fix a few typos in comments · 02ec5026
      Committed by NeilBrown
      Signed-off-by: NeilBrown <neilb@suse.com>
      02ec5026
    • N
      md/raid5: consider updating reshape_position at start of reshape. · 92140480
      Committed by NeilBrown
      md/raid5 only updates ->reshape_position (which is stored in
      metadata and is authoritative) occasionally, but particularly
      when getting close to ->resync_max, as it must be correct
      when ->resync_max is reached.
      
      When mdadm tries to stop an array which is reshaping it will:
       - freeze the reshape,
       - set resync_max to where the reshape has reached.
       - unfreeze the reshape.
      When this happens, the reshape is aborted and then restarted.
      
      The restart doesn't check that resync_max is close, and so doesn't
      update ->reshape_position like it should.
      This results in the reshape stopping, but ->reshape_position being
      incorrect.
      
      So on that first call to reshape_request, make sure ->reshape_position
      is updated if needed.
      Signed-off-by: NeilBrown <neilb@suse.com>
      92140480
    • N
      md: close some races between setting and checking sync_action. · 985ca973
      Committed by NeilBrown
      When checking sync_action in a script, we want to be sure it is
      as accurate as possible.
      As resync/reshape etc. don't always start immediately (a separate
      thread is scheduled to do it), it is best if 'action_show'
      checks if MD_RECOVERY_NEEDED is set (which it does) and in that
      case reports what is likely to start soon (which it only sometimes
      does).
      
      So:
       - report 'reshape' if reshape_position suggests one might start.
       - set MD_RECOVERY_RECOVER in raid1_reshape(), because that is very
         likely to happen next.
      Signed-off-by: NeilBrown <neilb@suse.com>
      985ca973
    • N
      md: Keep /proc/mdstat reporting recovery until fully DONE. · f7851be7
      Committed by NeilBrown
      Currently when a recovery completes, mdstat shows that it has finished
      before the new device is marked as a full member.  Because of this it
      can appear to a script that the recovery finished but the array isn't
      in sync.
      
      So while MD_RECOVERY_DONE is still set, keep mdstat reporting "recovery".
      Once md_reap_sync_thread() completes, the spare will be active and then
      MD_RECOVERY_DONE will be cleared.
      
      To ensure this is race-free, set MD_RECOVERY_DONE before clearing
      curr_resync.
      Signed-off-by: NeilBrown <neilb@suse.com>
      f7851be7
  4. 29 Aug, 2015: 1 commit