- 02 October 2015, 5 commits
-
-
Committed by Shaohua Li
If the faulty disks of an array exceed the allowed degraded number, the array enters error handling. It will be marked as read-only with MD_CHANGE_PENDING/RECOVERY_NEEDED set. But currently recovery doesn't clear the CHANGE_PENDING bit for a read-only array. If MD_CHANGE_PENDING is set for a raid5 array, all returned IO is held on a list until the bit is cleared. But recovery never clears this bit, so the IO stays pending and never finishes. This has bad effects: the upper layer can't get an IO error and the array can't be stopped. Fixes: c3cce6cd ("md/raid5: ensure device failure recorded before write request returns.") Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
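A minimal sketch of the idea, not the actual patch: on a read-only array no superblock write will ever happen, so the recovery path has to drop the PENDING bit itself, otherwise raid5 keeps the completed bios parked forever. Field and helper names are real md names from that era; the function and its exact placement are assumptions.

    /* Hedged sketch (assumes drivers/md/md.h context): clear a superblock-update
     * flag that can never be serviced on a read-only array, and wake waiters. */
    static void readonly_clear_pending_sketch(struct mddev *mddev)
    {
        if (mddev->ro && test_bit(MD_CHANGE_PENDING, &mddev->flags)) {
            clear_bit(MD_CHANGE_PENDING, &mddev->flags);
            wake_up(&mddev->sb_wait);   /* lets raid5 release held bios */
        }
    }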
-
Committed by NeilBrown
Calling e.g. blk_queue_max_hw_sectors() after calls to disk_stack_limits() discards the settings determined by disk_stack_limits(). So we need to make those calls first. Fixes: 199dc6ed ("md/raid0: update queue parameter in a safer location.") Cc: stable@vger.kernel.org (v2.6.35+ - please apply with 199dc6ed). Reported-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com>
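The fix is purely about call ordering. A hedged sketch of raid0-style setup code follows; blk_queue_max_hw_sectors(), blk_queue_io_min() and disk_stack_limits() are real block-layer helpers, while the surrounding function and the chosen values are illustrative assumptions.

    static void raid0_limits_sketch(struct mddev *mddev, struct md_rdev *rdev)
    {
        /* set the array's own limits first ... */
        blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
        blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9);

        /* ... then stack in the member device; doing it the other way round
         * lets the blk_queue_* calls overwrite what disk_stack_limits()
         * worked out from the member device. */
        disk_stack_limits(mddev->gendisk, rdev->bdev, rdev->data_offset << 9);
    }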
-
Committed by NeilBrown
While need_this_block probably shouldn't be called when there are more than 2 failed devices, we really don't want it to try indexing beyond the end of the failed_num[] or fdev[] arrays. So limit the loops to at most 2 iterations. Reported-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.de>
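A sketch of the bounded loop, assuming the fix simply caps the iteration count; variable names are as in drivers/md/raid5.c and the loop body is elided.

    /* failed_num[] in struct stripe_head_state only has two slots, so never
     * index past them even if s->failed is larger. */
    for (i = 0; i < s->failed && i < 2; i++) {
        struct r5dev *fdev = &sh->dev[s->failed_num[i]];
        /* ... existing per-device checks ... */
    }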
-
Committed by Shaohua Li
handle_failed_stripe() makes the stripe fail, i.e. all IO will return with a failure, but it doesn't update stripe_head_state. Later, handle_stripe() has special raid6 handling for handle_stripe_fill(). That check before handle_stripe_fill() doesn't skip the failed stripe, and we get a kernel crash in need_this_block. This patch clears the analysis state to make sure no functions are wrongly called after handle_failed_stripe(). Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
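A loosely hedged sketch of the idea only, not the verbatim patch: after failing the stripe, wipe the counters in the local stripe_head_state so the conditions gating handle_stripe_fill() and later handlers all see "nothing to do". Which exact fields the real patch resets is an assumption here; the field names themselves are real raid5 ones.

    if (s.failed > conf->max_degraded) {
        handle_failed_stripe(conf, sh, &s, disks, &s.return_bi);
        /* assumption: clear the analysis results so no later handler, e.g.
         * handle_stripe_fill()/need_this_block(), acts on a failed stripe */
        s.to_read = s.to_write = s.written = 0;
        s.syncing = s.replacing = s.expanding = 0;
    }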
-
Committed by NeilBrown
If a superblock update is pending, wait for it to complete before letting md_set_readonly() switch to readonly. Otherwise we might lose important information about a device having failed. For external arrays, waiting for superblock updates can wait on user-space, so in that case, just return an error. Reported-and-tested-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
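A hedged sketch of the wait, approximate rather than the verbatim patch; MD_CHANGE_PENDING, mddev->external and mddev->sb_wait are real md fields.

    /* in md_set_readonly(): don't switch to read-only while a superblock
     * update is still pending */
    if (test_bit(MD_CHANGE_PENDING, &mddev->flags)) {
        if (mddev->external)
            return -EBUSY;  /* user space owns the metadata; don't block on it */
        wait_event(mddev->sb_wait,
                   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
    }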
-
- 12 September 2015, 1 commit
-
-
Committed by Christoph Hellwig
It looks like the Kconfig check that was meant to fix this (commit fe9233fb [SCSI] scsi_dh: fix kconfig related build errors) was actually reversed, but no one noticed until the new set of patches which separated DM and SCSI_DH. Fixes: fe9233fb Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: James Bottomley <JBottomley@Odin.com>
-
- 01 September 2015, 33 commits
-
-
Committed by Joe Thornber
Both free_io_migration() and issue_discard() dereference a migration that was just freed. Fix those by saving off the migration's cache object before freeing the migration. Also clean up needless mg->cache dereferences now that the cache object is available directly. Fixes: e44b6a5a ("dm cache: move wake_waker() from free_migrations() to where it is needed") Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
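A hedged sketch of the pattern; helper names are modeled on drivers/md/dm-cache-target.c and the details are approximate. The point is to save the cache pointer before the migration goes back to its mempool and to use only the saved copy afterwards.

    static void free_io_migration(struct dm_cache_migration *mg)
    {
        struct cache *cache = mg->cache;    /* save before freeing */

        dec_io_migrations(cache);
        free_migration(mg);                 /* mg must not be touched after this */
        wake_worker(cache);
    }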
-
Committed by Mike Snitzer
Eliminate __cell_release() since it only had one caller that always released the cell holder. Switch cell_error_with_code() to using free_prison_cell() for the sake of consistency. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
-
Committed by Joe Thornber
There were two cases where dm_cell_visit_release() was being called, which removes the cell from the prison's rbtree, but the callers didn't also return the cell to the mempool. Fix this by having them call free_prison_cell(). This leak manifested as the 'kmalloc-96' slab growing until OOM. Fixes: 651f5fa2 ("dm cache: defer whole cells") Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 4.1+
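A hedged sketch of the missing step: dm_cell_visit_release() is the real bio-prison API, free_prison_cell() is the local dm-cache helper, and visit_fn/context stand in for whatever the real caller passes.

    /* dm_cell_visit_release() only drops the cell from the prison's rbtree;
     * the caller still owns the memory and must return it to the mempool. */
    dm_cell_visit_release(cache->prison, visit_fn, context, cell);
    free_prison_cell(cache, cell);      /* this was missing, hence the leak */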
-
Committed by NeilBrown
When a write to one of the devices of a RAID5/6 fails, the failure is recorded in the metadata of the other devices so that after a restart the data on the failed drive won't be trusted even if that drive seems to be working again (maybe a cable was unplugged). Similarly when we record a bad-block in response to a write failure, we must not let the write complete until the bad-block update is safe. Currently there is no interlock between the write request completing and the metadata update. So it is possible that the write will complete, the app will confirm success in some way, and then the machine will crash before the metadata update completes. This is an extremely small hole for a race to fit in, but it is theoretically possible and so should be closed. So: - set MD_CHANGE_PENDING when requesting a metadata update for a failed device, so we can know with certainty when it completes - queue requests that completed while MD_CHANGE_PENDING is set, to only be processed after the metadata update completes - call raid_end_bio_io() on bios in that queue when the time comes. Signed-off-by: NeilBrown <neilb@suse.com>
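A hedged sketch of the interlock on the raid5 side; the field and helper names are approximate, based on the code this series touches. Completed bios are parked on a per-array list while MD_CHANGE_PENDING is set and are only ended once the superblock write has finished.

    if (test_bit(MD_CHANGE_PENDING, &mddev->flags)) {
        /* metadata update still in flight: park the completed bios */
        spin_lock_irq(&conf->device_lock);
        bio_list_merge(&conf->return_bi, &return_bi);
        spin_unlock_irq(&conf->device_lock);
        md_wakeup_thread(mddev->thread);    /* revisit once the update lands */
    } else
        return_io(&return_bi);              /* safe to complete now */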
-
Committed by NeilBrown
This will make it easier to splice two lists together, which will be needed in a future patch. Signed-off-by: NeilBrown <neilb@suse.com>
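For reference, a generic illustration of the bio_list helpers that make such splicing a one-liner; this is not the refactor itself, but bio_list_init() and bio_list_merge() are the real helpers from <linux/bio.h>.

    struct bio_list pending, deferred;

    bio_list_init(&pending);
    bio_list_init(&deferred);
    /* ... bios get added to both lists elsewhere ... */
    bio_list_merge(&pending, &deferred);    /* append deferred onto pending */
    bio_list_init(&deferred);               /* deferred is empty again */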
-
Committed by NeilBrown
When a write to one of the legs of a RAID10 fails, the failure is recorded in the metadata of the other legs so that after a restart the data on the failed drive won't be trusted even if that drive seems to be working again (maybe a cable was unplugged). Currently there is no interlock between the write request completing and the metadata update. So it is possible that the write will complete, the app will confirm success in some way, and then the machine will crash before the metadata update completes. This is an extremely small hole for a race to fit in, but it is theoretically possible and so should be closed. So: - set MD_CHANGE_PENDING when requesting a metadata update for a failed device, so we can know with certainty when it completes - queue requests that experienced an error on a new queue which is only processed after the metadata update completes - call raid_end_bio_io() on bios in that queue when the time comes. Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by NeilBrown
When a write to one of the legs of a RAID1 fails, the failure is recorded in the metadata of the other leg(s) so that after a restart the data on the failed drive won't be trusted even if that drive seems to be working again (maybe a cable was unplugged). Similarly when we record a bad-block in response to a write failure, we must not let the write complete until the bad-block update is safe. Currently there is no interlock between the write request completing and the metadata update. So it is possible that the write will complete, the app will confirm success in some way, and then the machine will crash before the metadata update completes. This is an extremely small hole for a race to fit in, but it is theoretically possible and so should be closed. So: - set MD_CHANGE_PENDING when requesting a metadata update for a failed device, so we can know with certainty when it completes - queue requests that experienced an error on a new queue which is only processed after the metadata update completes - call raid_end_bio_io() on bios in that queue when the time comes. Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by NeilBrown
md_setup_cluster already calls try_module_get(), so this try_module_get isn't needed. Also, there is no matching module_put (except in the error path), so this leaves an unbalanced module count. Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by NeilBrown
This code looks racy. The only possible race is if two modules try to register at the same time, and that won't happen. But make the code look safe anyway. Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Guoqing Jiang
In gather_all_resync_info, we need to read the disk bitmap sb and check if it needs recovery. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Guoqing Jiang
Introduce the MD_CLUSTER_BEGIN_JOIN_CLUSTER flag to make sure complete(&cinfo->completion) is only invoked when a node joins the cluster. Otherwise a node failure could also trigger the complete(), which doesn't make sense. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Guoqing Jiang
We also need to free the lock resource before the goto out. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Guoqing Jiang
The sb_lock is not used anywhere, so let's remove it. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Guoqing Jiang
If a node has just joined the cluster and receives a message from other nodes before suspend_list is initialized, the kernel crashes with a NULL pointer dereference, so move the initializations earlier to fix the bug. md-cluster: Joined cluster 3578507b-e0cb-6d4f-6322-696cd7b1b10c slot 3 BUG: unable to handle kernel NULL pointer dereference at (null) ... ... ... Call Trace: [<ffffffffa0444924>] process_recvd_msg+0x2e4/0x330 [md_cluster] [<ffffffffa0444a06>] recv_daemon+0x96/0x170 [md_cluster] [<ffffffffa045189d>] md_thread+0x11d/0x170 [md_mod] [<ffffffff810768c4>] kthread+0xb4/0xc0 [<ffffffff8151927c>] ret_from_fork+0x7c/0xb0 ... ... ... RIP [<ffffffffa0443581>] __remove_suspend_info+0x11/0xa0 [md_cluster] Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
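A hedged sketch of the reordering in the join path; the cinfo field names are taken from drivers/md/md-cluster.c, and the placeholder comment stands in for the lockspace join that the patch moves after these initializations.

    /* initialize everything the receive thread may touch *before* joining
     * the cluster, so a message arriving right after the join cannot hit
     * uninitialized state */
    spin_lock_init(&cinfo->suspend_lock);
    INIT_LIST_HEAD(&cinfo->suspend_list);
    init_completion(&cinfo->completion);

    /* ... only now call dlm_new_lockspace() and start the recv thread ... */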
-
Committed by Guoqing Jiang
In a complicated cluster environment, it is possible that the dlm lock cannot be acquired or converted as intended, so related error info is added to make debugging such issues easier. For lockres_free, if the lock is blocked by a lock request or conversion request, dlm_unlock just puts it back on the grant queue, so we need to ensure the lock is actually freed in the end. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Guoqing Jiang
We should initialize the completion within lockres_init, otherwise the completion could be initialized more than once during its life cycle. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Guoqing Jiang
There is a problem with the previous communication mechanism, and we hit the deadlock scenario below in a cluster with 3 nodes.

    Sender                      Receiver                   Receiver
    token(EX)
    message(EX)
    writes message
    downconverts message(CR)
    requests ack(EX)
                                get message(CR)            gets message(CR)
                                reads message              reads message
                                requests EX on message     requests EX on message

To fix this problem, we make the following changes: 1. the sender downconverts MESSAGE to CW rather than CR; 2. the receiver requests a PR lock, not an EX lock, on message. And in case we fail to down-convert EX to CW on message, it is better to unlock message rather than keep holding the lock. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Lidong Zhong <ldzhong@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Guoqing Jiang
When node A stops an array while the array is doing a resync, we need to let another node B take over the resync task. To achieve this, node A needs to send an explicit BITMAP_NEEDS_SYNC message to the cluster. Node B, which receives that message, will then invoke __recover_slot to do the resync. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Guoqing Jiang
Make recover_slot a wrapper around __recover_slot, since the logic of __recover_slot can be reused for the case where other nodes need to take over the resync job. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Guoqing Jiang
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by Sasha Levin
We used to set up the safemode_timer in md_run. If md_run failed before the timer was set up, we'd end up trying to modify a timer that doesn't have a callback function when we access safe_delay_store, which would trigger a BUG. neilb: delete the init_timer() call as setup_timer() does that. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: NeilBrown <neilb@suse.com>
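A hedged sketch of the approach: do the timer setup where the mddev is first initialized rather than in md_run(), so safe_delay_store() can always mod_timer() a fully set-up timer. setup_timer() and md_safemode_timeout() are the real names from that era; the surrounding function is illustrative.

    static void mddev_init_sketch(struct mddev *mddev)
    {
        /* ... other per-mddev initialization ... */
        setup_timer(&mddev->safemode_timer, md_safemode_timeout,
                    (unsigned long)mddev);
    }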
-
Committed by NeilBrown
It is possible (though unlikely) for a reshape to be interrupted between the time that end_reshape is called and the time when raid5_finish_reshape is called. This can leave conf->reshape_progress set to MaxSector, but mddev->reshape_position not. This combination confuses reshape_request() when ->reshape_backwards is set. As conf->reshape_progress is so high, it seems the reshape hasn't really begun. But assuming MaxSector is a valid address only leads to sorrow. So ensure reshape_position and reshape_progress both agree, and add an extra check in reshape_request() just in case they don't. Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by NeilBrown
There can be a small window between the moment that recovery actually writes the last block and the time when various sysfs and /proc/mdstat attributes report that it has finished. During this time, 'sync_completed' can have the wrong value. This can confuse monitoring software. So: - don't set curr_resync_completed beyond the end of the devices, - set it correctly when resync/recovery has completed. Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by NeilBrown
While it generally shouldn't happen, it is not impossible for curr_resync_completed to exceed resync_max. This can particularly happen when reshaping RAID5 - the current status isn't copied to curr_resync_completed promptly, so when it is, it can exceed resync_max. This happens when the reshape is 'frozen', resync_max is set low, and reshape is re-enabled. Taking a difference between two unsigned numbers is always dangerous anyway, so add a test to behave correctly if curr_resync_completed > resync_max. Signed-off-by: NeilBrown <neilb@suse.com>
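The danger is the usual unsigned-wraparound one. A hedged sketch of the guarded difference; mddev->curr_resync_completed and mddev->resync_max are real sector_t fields, while the local variable is illustrative.

    sector_t window;

    if (mddev->curr_resync_completed > mddev->resync_max)
        window = 0;     /* already past the limit; a plain subtraction
                         * would wrap to a huge positive value */
    else
        window = mddev->resync_max - mddev->curr_resync_completed;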
-
Committed by NeilBrown
This ensures that 'sync_action' will show 'recover' immediately the array is started. If there is no spare, the status will change to 'idle' once that is detected. Clear MD_RECOVERY_RECOVER for a read-only array to ensure this change happens. This allows scripts which monitor status not to get confused - particularly my test scripts. Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by NeilBrown
This code is calculating: writepos, which is the furthest-along address (in device-space) that we *will* be writing to; readpos, which is the earliest address that we *could* possibly read from; and safepos, which is the earliest address in the 'old' section that we might read from after a crash when the reshape position is recovered from metadata. The first is a precise calculation, so clipping at zero doesn't make sense. As the reshape position is now guaranteed to always be a multiple of reshape_sectors, and as we already BUG_ON when reshape_progress is zero, there is no point in this min_t() call. The readpos and safepos are worst case: the actual value depends on precise geometry. That worst case could be negative, which is only a problem because we are storing the value in an unsigned. So leave the min_t() for those. Signed-off-by: NeilBrown <neilb@suse.com>
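A hedged sketch of the position bookkeeping after the change, modeled on reshape_request() in drivers/md/raid5.c with approximate details: writepos is exact so its clip is dropped, while the worst-case readpos/safepos keep their min_t() clamp.

    if (mddev->reshape_backwards) {
        writepos -= reshape_sectors;    /* exact: min_t() clip removed */
        readpos  += reshape_sectors;
        safepos  += reshape_sectors;
    } else {
        writepos += reshape_sectors;
        readpos  -= min_t(sector_t, reshape_sectors, readpos);  /* worst case */
        safepos  -= min_t(sector_t, reshape_sectors, safepos);  /* worst case */
    }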
-
Committed by NeilBrown
When reshaping, we work in units of the largest chunk size. If changing from a larger to a smaller chunk size, that means we reshape more than one stripe at a time. So the required alignment of reshape_position needs to take into account both the old and new chunk size. This means that both 'here_new' and 'here_old' are calculated with respect to the same (maximum) chunk size, so testing if they are the same when delta_disks is zero becomes pointless. Signed-off-by: NeilBrown <neilb@suse.com>
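A hedged sketch of the alignment check; max(), sector_div() and the mddev fields are real kernel names, while new_data_disks and the surrounding setup code are assumptions for illustration.

    int chunk_sectors = max(mddev->chunk_sectors, mddev->new_chunk_sectors);
    sector_t here_new = mddev->reshape_position;

    /* reshape_position must cover a whole reshape unit for *both* chunk
     * sizes, hence the max() above */
    if (sector_div(here_new, chunk_sectors * new_data_disks))
        return -EINVAL;         /* not on a stripe boundary */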
-
Committed by NeilBrown
The chunk_sectors and new_chunk_sectors fields of mddev can be changed any time (via sysfs) that the reconfig mutex can be taken. So raid5 keeps internal copies in 'conf' which are stable except for a short locked moment when reshape stops/starts. So any access that does not hold reconfig_mutex should use the 'conf' values, not the 'mddev' values. Several don't. This could result in corruption if new values were written at awkward times. Also use min() or max() rather than open-coding. Signed-off-by: NeilBrown <neilb@suse.com>
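A small hedged illustration of both points; conf->chunk_sectors and conf->prev_chunk_sectors are real r5conf fields, and the context is illustrative.

    /* read the stable per-conf copies, not mddev, and let max() replace the
     * open-coded comparison */
    unsigned int chunk = max(conf->chunk_sectors, conf->prev_chunk_sectors);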
-
Committed by NeilBrown
These aren't really needed when no reshape is happening, but it is safer to have them always set to a meaningful value. The next patch will use ->prev_chunk_sectors without checking if a reshape is happening (because that makes the code simpler), and this patch makes that safe. Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by NeilBrown
Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by NeilBrown
md/raid5 only updates ->reshape_position (which is stored in metadata and is authoritative) occasionally, but particularly when getting close to ->resync_max, as it must be correct when ->resync_max is reached. When mdadm tries to stop an array which is reshaping, it will: freeze the reshape, set resync_max to where the reshape has reached, and unfreeze the reshape. When this happens, the reshape is aborted and then restarted. The restart doesn't check that resync_max is close, and so doesn't update ->reshape_position as it should. This results in the reshape stopping, but ->reshape_position being incorrect. So on that first call to reshape_request, make sure ->reshape_position is updated if needed. Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by NeilBrown
When checking sync_action in a script, we want to be sure it is as accurate as possible. As resync/reshape etc. doesn't always start immediately (a separate thread is scheduled to do it), it is best if 'action_show' checks whether MD_RECOVERY_NEEDED is set (which it does) and in that case reports what is likely to start soon (which it only sometimes does). So: - report 'reshape' if reshape_position suggests one might start, - set MD_RECOVERY_RECOVER in raid1_reshape(), because that is very likely to happen next. Signed-off-by: NeilBrown <neilb@suse.com>
-
Committed by NeilBrown
Currently when a recovery completes, mdstat shows that it has finished before the new device is marked as a full member. Because of this it can appear to a script that the recovery finished but the array isn't in sync. So while MD_RECOVERY_DONE is still set, keep mdstat reporting "recovery". Once md_reap_sync_thread() completes, the spare will be active and then MD_RECOVERY_DONE will be cleared. To ensure this is race-free, set MD_RECOVERY_DONE before clearing curr_resync. Signed-off-by: NeilBrown <neilb@suse.com>
-
- 29 August 2015, 1 commit
-
-
Committed by Christoph Hellwig
This way we can reuse the same code for any attachment method, not just those requested from dm-mpath. [jejb: fixup checkpatch error] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: James Bottomley <JBottomley@Odin.com>
-