- 22 Jun, 2017 1 commit
-
-
Committed by NeilBrown
md devices allocate a bio_set and use it for two distinct purposes. mddev->bio_set is used to clone bios as part of sending upper level requests down to lower level devices, and it is also used for synchronous IO such as superblock and bitmap updates, and for correcting read errors. This multiple usage can lead to deadlocks. Cloned bios may be queued for write and be waiting for a metadata update before the write can be permitted. If the cloning exhausted mddev->bio_set, the metadata update may not be able to proceed. This scenario has been seen during heavy testing, with lots of IO and lots of memory pressure. Address this by adding a new bio_set specifically for synchronous IO. All synchronous IO goes directly to the underlying device and is not queued at the md level, so requests using entries from the new mddev->sync_set will complete in a timely fashion. Requests that use mddev->bio_set will sometimes need to wait for synchronous IO, but will no longer risk deadlocking that IO. Also: small simplification in mddev_put(): there is no need to wait until the spinlock is released before calling bioset_free(). Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
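The effect of the separate pool can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct pool`, `pool_get()`, `pool_put()`, `struct model_mddev` and `metadata_update_can_start()` are hypothetical names, not kernel APIs, and a real bio_set is a mempool rather than a plain counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a bio_set: a fixed reserve of entries. */
struct pool {
    int free;
};

/* Take one entry; returns false when the reserve is exhausted. */
bool pool_get(struct pool *p)
{
    assert(p->free >= 0);
    if (p->free == 0)
        return false;
    p->free--;
    return true;
}

/* Return one entry to the reserve. */
void pool_put(struct pool *p)
{
    p->free++;
}

/* Models a device with separate reserves, mirroring bio_set and sync_set. */
struct model_mddev {
    struct pool bio_set;  /* clones of upper level requests */
    struct pool sync_set; /* superblock, bitmap, read-error correction */
};

/*
 * A metadata update can start only if it can obtain an entry.
 * With shared == true it draws from bio_set (the old behaviour);
 * otherwise it draws from the dedicated sync_set.
 */
bool metadata_update_can_start(struct model_mddev *m, bool shared)
{
    struct pool *p = shared ? &m->bio_set : &m->sync_set;

    if (!pool_get(p))
        return false;
    pool_put(p);
    return true;
}
```

Once cloned writes have drained `bio_set`, the shared variant cannot start the metadata update those writes are waiting for, while the variant with a dedicated reserve still can.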
-
- 17 Jun, 2017 1 commit
-
-
Committed by Lidong Zhong
The value for spare slots of sb->dev_roles is changed from MD_DISK_ROLE_FAULTY to MD_DISK_ROLE_SPARE, to stay aligned with the value used when the superblock is first created in userspace. Signed-off-by: Lidong Zhong <lzhong@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
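A minimal sketch of the idea, assuming a plain host-order array rather than the on-disk little-endian `sb->dev_roles`: the two role constants carry the values defined in the kernel's `md_p.h`, while `init_dev_roles()` is a hypothetical helper written for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Role values as defined in the kernel's md_p.h. */
#define MD_DISK_ROLE_SPARE  0xffff
#define MD_DISK_ROLE_FAULTY 0xfffe

/*
 * Hypothetical helper: mark every slot as spare before the roles of
 * real member devices are filled in. Byte-order conversion is omitted.
 */
void init_dev_roles(uint16_t *roles, size_t max_dev)
{
    size_t i;

    assert(roles != NULL || max_dev == 0);
    for (i = 0; i < max_dev; i++)
        roles[i] = MD_DISK_ROLE_SPARE;
}
```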
-
- 14 Jun, 2017 1 commit
-
-
Committed by NeilBrown
If mddev_suspend() races with md_write_start() we can deadlock, with mddev_suspend() waiting for the request that is currently in md_write_start() to complete the ->make_request() call, and md_write_start() waiting for the metadata to be updated to mark the array as 'dirty'. As metadata updates done by md_check_recovery() only happen when the mddev_lock() can be claimed, and as mddev_suspend() is often called with the lock held, these threads wait indefinitely for each other. We fix this by having md_write_start() abort if mddev_suspend() is happening, and ->make_request() abort if md_write_start() aborted. md_make_request() can detect this abort, decrease the ->active_io count, and wait for mddev_suspend(). Reported-by: Nix <nix@esperi.org.uk> Fix: 68866e42 (MD: no sync IO while suspended) Cc: stable@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
-
- 06 Jun, 2017 1 commit
-
-
Committed by NeilBrown
The new per-cpu counter for writes_pending is initialised in md_alloc(), which is not called by dm-raid. So dm-raid fails when md_write_start() is called. Move the initialisation to the personality modules that need it. This way it is always initialised when needed, but isn't unnecessarily initialised (requiring memory allocation) when the personality doesn't use writes_pending. Reported-by: Heinz Mauelshagen <heinzm@redhat.com> Fixes: 4ad23a97 ("MD: use per-cpu counter for writes_pending") Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
-
- 01 Jun, 2017 1 commit
-
-
Committed by Jan Kara
Commit b685d3d6 "block: treat REQ_FUA and REQ_PREFLUSH as synchronous" removed the REQ_SYNC flag from the WRITE_{FUA|PREFLUSH|...} definitions. generic_make_request_checks() however strips the REQ_FUA and REQ_PREFLUSH flags from a bio when the storage doesn't report a volatile write cache, and thus the write effectively becomes asynchronous, which can lead to performance regressions. Fix the problem by making sure all bios which are synchronous are properly marked with REQ_SYNC. CC: linux-raid@vger.kernel.org CC: Shaohua Li <shli@kernel.org> Fixes: b685d3d6 CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Shaohua Li <shli@fb.com>
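The flag interaction described above can be sketched in a few lines of userspace C. The bit positions and the `MODEL_`-prefixed names are assumptions made for illustration; the kernel assigns its own values, and `model_strip_flush_flags()` only imitates the stripping attributed to generic_make_request_checks().

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit positions only; the kernel assigns its own values. */
#define MODEL_REQ_SYNC     (1u << 0)
#define MODEL_REQ_FUA      (1u << 1)
#define MODEL_REQ_PREFLUSH (1u << 2)
#define MODEL_REQ_ALL \
    (MODEL_REQ_SYNC | MODEL_REQ_FUA | MODEL_REQ_PREFLUSH)

/* Drop cache-control flags when the device has no volatile write cache. */
unsigned int model_strip_flush_flags(unsigned int flags, bool volatile_cache)
{
    assert((flags & ~MODEL_REQ_ALL) == 0);
    if (!volatile_cache)
        flags &= ~(MODEL_REQ_FUA | MODEL_REQ_PREFLUSH);
    return flags;
}

/* A write counts as synchronous if any of the three flags is set. */
bool model_is_sync(unsigned int flags)
{
    return (flags & MODEL_REQ_ALL) != 0;
}
```

A write that relies on FUA or PREFLUSH alone stops being synchronous after stripping; one that also carries the explicit sync flag does not.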
-
- 09 May, 2017 1 commit
-
-
Committed by Artur Paszkiewicz
This essentially reverts commit b5470dc5 ("md: resolve external metadata handling deadlock in md_allow_write") with some adjustments. Since commit 6791875e ("md: make reconfig_mutex optional for writes to md sysfs files.") changing array_state to 'active' does not use mddev_lock() and will not cause a deadlock with md_allow_write(). This revert simplifies userspace tools that write to sysfs attributes like "stripe_cache_size" or "consistency_policy" because it removes the need for special handling for external metadata arrays, checking for EAGAIN and retrying the write. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
-
- 21 Apr, 2017 1 commit
-
-
Committed by NeilBrown
1/ If an array has any read-only devices when it is started, the array itself must be read-only.
2/ A read-only device cannot be added to an array after it is started.
3/ Setting an array to read-write should not succeed if any member devices are read-only.
Reported-and-Tested-by: Nanda Kishore Chinnaram <Nanda_Kishore_Chinna@dell.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
-
- 13 Apr, 2017 2 commits
-
-
Committed by NeilBrown
md allows a new array device to be created by simply opening a device file. This makes it difficult to remove the device, as udev is likely to open the device file as part of processing the REMOVE event. There is an alternate mechanism for creating arrays by writing to the new_array module parameter. When using tools that work with this parameter, it is best to disable the old semantics. This new module parameter allows that. Signed-off-by: NeilBrown <neilb@suse.com> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Shaohua Li <shli@fb.com>
-
Committed by NeilBrown
The intention when creating the "new_array" parameter and the possibility of having array names like "md_HOME" was to transition away from the old way of creating arrays and to eventually only use this new way. The "old" way of creating arrays is to create a device node in /dev and then open it. The act of opening creates the array. This is problematic because sometimes the device node can be opened when we don't want to create an array. This can easily happen when some rule triggered by udev looks at a device as it is being destroyed. The node in /dev continues to exist for a short period after an array is stopped, and opening it during this time recreates the array (as an inactive array). Unfortunately no clear plan for the transition was created. It is now time to fix that. This patch allows devices with numeric names, like "md999", to be created by writing to "new_array". This will only work if the minor number given is not already in use. This will allow mdadm to support the creation of arrays with numbers > 511 (currently not possible) by writing to new_array. mdadm can, at some point, use this approach to create *all* arrays, which will allow the transition to only using the new way. Signed-off-by: NeilBrown <neilb@suse.com> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Shaohua Li <shli@fb.com>
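The two name forms mentioned here, "md_HOME" and "md999", can be told apart with a short parser. This is a userspace sketch, not the kernel's implementation: `classify_md_name()` and the `md_name_kind` enum are hypothetical, and checks the kernel would also need (an upper bound on the minor number, whether the minor is already in use) are left out.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

enum md_name_kind { MD_NAME_INVALID, MD_NAME_NAMED, MD_NAME_NUMERIC };

/*
 * Classify a string written to "new_array".
 * "md_<name>" is a named array; "md<digits>" is a numeric array whose
 * minor number is stored in *minor.
 */
enum md_name_kind classify_md_name(const char *name, unsigned long *minor)
{
    char *end;
    unsigned long n;

    assert(name != NULL && minor != NULL);

    if (strncmp(name, "md_", 3) == 0 && name[3] != '\0')
        return MD_NAME_NAMED;

    if (strncmp(name, "md", 2) == 0 && isdigit((unsigned char)name[2])) {
        errno = 0;
        n = strtoul(name + 2, &end, 10);
        if (errno == 0 && *end == '\0') {
            *minor = n;
            return MD_NAME_NUMERIC;
        }
    }
    return MD_NAME_INVALID;
}
```

With this sketch, "md999" yields a numeric name with minor 999, "md_HOME" a named array, and anything with trailing non-digits after "md<digits>" is rejected.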
-
- 11 Apr, 2017 2 commits
-
-
Committed by Zhilong Liu
md.c: the mddev lock needs to be released before array_size_store() returns. Fixes: ab5a98b1 ("md-cluster: change array_sectors and update size are not supported") Signed-off-by: Zhilong Liu <zlliu@suse.com> Reviewed-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
-
Committed by NeilBrown
If md_set_readonly() is called and sets the MD_CLOSING bit, the mddev cannot be opened any more because the MD_CLOSING bit is never cleared. Thus it needs to be cleared in md_ioctl after any call to md_set_readonly() or do_md_stop(). Signed-off-by: NeilBrown <neilb@suse.com> Fixes: af8d8e6f ("md: changes for MD_STILL_CLOSED flag") Cc: stable@vger.kernel.org (v4.9+) Signed-off-by: Zhilong Liu <zlliu@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
-
- 23 Mar, 2017 4 commits
-
-
Committed by NeilBrown
The 'writes_pending' counter is used to determine when the array is stable so that it can be marked in the superblock as "Clean". Consequently it needs to be updated frequently but only checked for zero occasionally. Recent changes to raid5 cause the count to be updated even more often - once per 4K rather than once per bio. This provided justification for making the updates more efficient. So we replace the atomic counter with a percpu-refcount. This can be incremented and decremented cheaply most of the time, and can be switched to "atomic" mode when more precise counting is needed. As it is possible for multiple threads to want a precise count, we introduce a "sync_checker" counter to count the number of threads in "set_in_sync()", and only switch the refcount back to percpu mode when that is zero. We need to be careful about races between set_in_sync() setting ->in_sync to 1, and md_write_start() setting it to zero. md_write_start() holds the rcu_read_lock() while checking if the refcount is in percpu mode. If it is, then we know a switch to 'atomic' will not happen until after we call rcu_read_unlock(), in which case set_in_sync() will see the elevated count, and not set in_sync to 1. If it is not in percpu mode, we take the mddev->lock to ensure proper synchronization. It is no longer possible to quickly check if the count is zero, which we previously did to update a timer or to schedule the md_thread. So now we do these every time we decrement that counter, but make sure they are fast. mod_timer() already optimizes the case where the timeout value doesn't actually change. We leverage that further by always rounding off the jiffies to the timeout value. This may delay the marking of 'clean' slightly, but ensures we only perform atomic operations here when absolutely needed. md_wakeup_thread() currently always calls wake_up(), even if THREAD_WAKEUP is already set. That too can be optimised to avoid calls to wake_up().
Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
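The timer rounding described above can be shown with plain arithmetic. This is a sketch under assumptions: `round_up_to()` and `safemode_expiry()` are hypothetical helper names, `now` stands in for jiffies and `delay` for the timeout value, and the exact expression used in the kernel may differ.

```c
#include <assert.h>

/* Round now up to the next multiple of delay (delay must be non-zero). */
unsigned long round_up_to(unsigned long now, unsigned long delay)
{
    assert(delay > 0);
    return ((now + delay - 1) / delay) * delay;
}

/*
 * Expiry time for the "mark clean" timer. All calls made within the
 * same delay-sized window produce the same expiry, so a timer that is
 * re-armed on every decrement rarely changes value.
 */
unsigned long safemode_expiry(unsigned long now, unsigned long delay)
{
    return round_up_to(now, delay) + delay;
}
```

With a delay of 200, any `now` from 1001 to 1200 maps to the same expiry of 1400, which is what lets a mod_timer()-style fast path skip the update.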
-
Committed by NeilBrown
If ->in_sync is being set just as md_write_start() is being called, it is possible that set_in_sync() won't see the elevated ->writes_pending, and md_write_start() won't see the set ->in_sync. To close this race, re-test ->writes_pending after setting ->in_sync, and add memory barriers to ensure the increment of ->writes_pending will be seen by the time of this second test, or the new ->in_sync will be seen by md_write_start(). Add a spinlock to array_state_show() to ensure this temporary instability is never visible from userspace. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
-
Committed by NeilBrown
Three separate places in md.c check if the number of active writes is zero and, if so, set mddev->in_sync. There are a few differences, but there shouldn't be: - it is always appropriate to notify the change in sysfs_state, and there is no need to do this outside a spin-locked region. - we never need to check ->recovery_cp. The state of resync is not relevant for whether there are any pending writes or not (which is what ->in_sync reports). So create set_in_sync() which does the correct tests and makes the correct changes, and call this in all three places. Any behaviour changes here are minor and cosmetic. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
-
Committed by NeilBrown
We use md_write_start() to increase the count of pending writes, and md_write_end() to decrement the count. We currently count bios submitted to md/raid5. Change it to count stripe_heads that a WRITE bio has been attached to. So now, raid5_make_request() calls md_write_start() and then md_write_end() to keep the count elevated during the setup of the request. add_stripe_bio() calls md_write_start() for each stripe_head, and the completion routines always call md_write_end(), instead of only calling it when raid5_dec_bi_active_stripes() returns 0. make_discard_request also calls md_write_start/end(). The parallel between md_write_{start,end} and use of bi_phys_segments can be seen in that: Whenever we set bi_phys_segments to 1, we now call md_write_start. Whenever we increment it on non-read requests with raid5_inc_bi_active_stripes(), we now call md_write_start(). Whenever we decrement bi_phys_segments on non-read requests with raid5_dec_bi_active_stripes(), we now call md_write_end(). This reduces our dependence on keeping a per-bio count of active stripes in bi_phys_segments. md_write_inc() is added which parallels md_write_start(), but requires that a write has already been started, and is certain never to sleep. This can be used inside a spinlocked region when adding to a write request. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
-
- 17 Mar, 2017 6 commits
-
-
Committed by Guoqing Jiang
Since we have switched to a synchronous way of handling the METADATA_UPDATED msg for md-cluster, process_metadata_update depends on mddev->thread->wqueue. With the new change, clustered raid could possibly hang if the array received a METADATA_UPDATED msg after the array unregistered mddev->thread, so we need to stop clustered raid (bitmap_destroy -> bitmap_free -> md_cluster_stop) earlier than unregistering the thread (mddev_detach -> md_unregister_thread). This change should be safe for non-clustered raid since all writes are stopped before the destroy. Also in md_run, we activate the personality (pers->run()) before activating the bitmap (bitmap_create()). So it is pleasingly symmetric to stop the bitmap (bitmap_destroy()) before stopping the personality (__md_stop() calls pers->free()); we achieve this by moving bitmap_destroy to the beginning of __md_stop. But we don't want to break the code that waits for behind IO, as Shaohua mentioned, so introduce bitmap_wait_behind_writes to hold that code, and call the new function in both mddev_detach and bitmap_destroy, so that we do not break the original behind IO code and also fit the new condition well. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
-
Committed by Artur Paszkiewicz
Allow writing to the 'consistency_policy' attribute when the array is active. Add a new function 'change_consistency_policy' to the md_personality operations structure to handle the change in the personality code. Values "ppl" and "resync" are accepted and turn PPL on and off respectively. When enabling PPL its location and size should first be set using the 'ppl_sector' and 'ppl_size' attributes and a valid PPL header should be written at this location on each member device. Enabling or disabling PPL is performed under a suspended array. The raid5_reset_stripe_cache function frees the stripe cache and allocates it again in order to allocate or free the ppl_pages for the stripes in the stripe cache. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
-
Committed by Artur Paszkiewicz
Add a 'consistency_policy' attribute for the array. It indicates how the array maintains consistency in case of unexpected shutdown. Add 'ppl_sector' and 'ppl_size' for rdev, which describe the location and size of the PPL space on the device. They can't be changed for active members if the array is started and PPL is enabled, so in the setter functions only basic checks are performed. More checks are done in ppl_validate_rdev() when starting the log. These attributes are writable to allow enabling PPL for external metadata arrays and (later) to enable/disable PPL for a running array. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
-
Committed by Artur Paszkiewicz
Include information about PPL location and size in mdp_superblock_1 and copy it to/from rdev. Because PPL is mutually exclusive with bitmap, put it in place of 'bitmap_offset'. Add a new flag MD_FEATURE_PPL for 'feature_map', analogous to MD_FEATURE_BITMAP_OFFSET. Add MD_HAS_PPL to mddev->flags to indicate that PPL is enabled on an array. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
-
Committed by Guoqing Jiang
To update the size for cluster raid, we need to make sure all nodes can perform the change successfully. However, it is possible that some of them can't do it due to failure (bitmap_resize could fail). So we need to consider the issue before we set the capacity unconditionally, and we use the steps below to perform a sanity check.
1. A changes the size, then broadcasts a METADATA_UPDATED msg.
2. B and C receive METADATA_UPDATED and change the size, except that they do not call set_capacity; sync_size is not updated if the change failed. They also call bitmap_update_sb to sync the sb to disk.
3. A checks the other nodes' sync_size; if sync_size has been updated on all nodes, it sends a CHANGE_CAPACITY msg, otherwise it sends a msg to revert the previous change.
4. B and C call set_capacity if they receive the CHANGE_CAPACITY msg; otherwise pers->resize is called to restore the old value.
Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
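Step 3 of the sequence above, the decision node A makes after the other nodes have attempted the change, can be sketched as a simple check. The names `decide_resize()`, `SEND_CHANGE_CAPACITY` and `SEND_REVERT` are hypothetical, and the array of `sync_size` values stands in for whatever node A reads from the other nodes' bitmaps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum resize_decision { SEND_CHANGE_CAPACITY, SEND_REVERT };

/*
 * Node A's check: commit only if every other node recorded the new
 * size in its sync_size; a single mismatch means some node failed to
 * resize, so the change must be reverted everywhere.
 */
enum resize_decision decide_resize(const uint64_t *sync_size, size_t nodes,
                                   uint64_t new_size)
{
    size_t i;

    assert(sync_size != NULL || nodes == 0);
    for (i = 0; i < nodes; i++) {
        if (sync_size[i] != new_size)
            return SEND_REVERT;
    }
    return SEND_CHANGE_CAPACITY;
}
```

Deferring set_capacity until this check passes is what keeps a partially failed resize from becoming visible on any node.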
-
Committed by Guoqing Jiang
Previously, when a node received a METADATA_UPDATED msg, it just needed to wake up mddev->thread, and md_reload_sb would be called eventually. We took the asynchronous approach to avoid a deadlock, which could happen when one node is receiving the METADATA_UPDATED msg (wants reconfig_mutex) while trying to run the path: md_check_recovery -> mddev_trylock(hold reconfig_mutex) -> md_update_sb -> metadata_update_start (wants EX on token, however the token is held by the sending node). Since we will support resizing for clustered raid, we need the metadata update handling to be synchronous so that the initiating node can detect failure, so we need to change the way the METADATA_UPDATED msg is handled. But we obviously need to avoid the above deadlock with the synchronous way. To make this happen, we allow md_reload_sb to be called without holding reconfig_mutex in one case: if some other thread has already taken reconfig_mutex and is waiting for the 'token', then process_recvd_msg() can safely call md_reload_sb() without taking the mutex. This is because we can be certain that no other thread will take the mutex, and we are also certain that the actions performed by md_reload_sb() won't interfere with anything that the other thread is in the middle of. To make this more concrete, we added a new cinfo->state bit, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, which is set in lock_token() just before dlm_lock_sync() is called, and cleared just after. As lock_token() is always called with reconfig_mutex held (the specific case is the resync_info_update, which is distinguished well in the previous patch), if process_recvd_msg() finds that the new bit is set, then the mutex must be held by some other thread, and it will keep waiting. So process_metadata_update() can call md_reload_sb() if either mddev_trylock() succeeds, or if MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is set. The tricky bit is what to do if neither of these apply. We need to wait. Fortunately mddev_unlock() always calls wake_up() on mddev->thread->wqueue. So we can get lock_token() to call wake_up() on that when it sets the bit.
There are also some related changes inside this commit: 1. remove the RELOAD_SB related code since it is no longer valid. 2. mddev is added into md_cluster_info so that we can get mddev inside lock_token. 3. add a new parameter for lock_token to distinguish whether reconfig_mutex is held or not. And we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in the places below: 1. set it before unregistering the thread, otherwise a deadlock could appear when stopping a resyncing array. This is because md_unregister_thread(&cinfo->recv_thread) is blocked by recv_daemon -> process_recvd_msg -> process_metadata_update. To resolve the issue, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD also needs to be set before unregistering the thread. 2. set it in metadata_update_start to fix another deadlock. a. Node A sends a METADATA_UPDATED msg (holding the Token lock). b. Node B wants to do resync, and is blocked since it can't get the Token lock, but MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is not set since the callchain (md_do_sync -> sync_request -> resync_info_update -> sendmsg -> lock_comm -> lock_token) doesn't hold reconfig_mutex. c. Node B tries to update the sb (holding reconfig_mutex), but stops at wait_event() in metadata_update_start since we have set the MD_CLUSTER_SEND_LOCK flag in lock_comm (step 2). d. Then Node B receives the METADATA_UPDATED msg from A, and of course recv_daemon is blocked forever. Since metadata_update_start always calls lock_token with reconfig_mutex held, we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD here as well, and lock_token doesn't need to set it twice unless lock_token is invoked from lock_comm. Finally, thanks to Neil for his great idea and help! Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
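The three outcomes available to process_metadata_update() can be written as a small decision function. This is only a model of the rule stated in the message: `struct recv_state`, `enum reload_action` and `choose_reload_action()` are hypothetical, and the real code evaluates these conditions inside a wait loop rather than from a snapshot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Snapshot of the two conditions the receive path looks at. */
struct recv_state {
    bool trylock_succeeded;       /* mddev_trylock() got reconfig_mutex */
    bool holding_mutex_for_recvd; /* MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD */
};

enum reload_action {
    RELOAD_WITH_MUTEX,    /* we own the mutex: reload, then unlock */
    RELOAD_WITHOUT_MUTEX, /* the owner is parked waiting for the token */
    WAIT_FOR_WAKEUP       /* neither applies: sleep on the wait queue */
};

enum reload_action choose_reload_action(const struct recv_state *s)
{
    assert(s != NULL);
    if (s->trylock_succeeded)
        return RELOAD_WITH_MUTEX;
    if (s->holding_mutex_for_recvd)
        return RELOAD_WITHOUT_MUTEX;
    return WAIT_FOR_WAKEUP;
}
```

The third outcome is why the wake_up() on mddev->thread->wqueue matters: a waiter has to be notified both when the mutex is released and when the bit is set.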
-
- 11 Mar, 2017 2 commits
-
-
Committed by Jason Yan
The sb->layout is of type __le32, so we should use le32_to_cpu. Signed-off-by: Jason Yan <yanaijie@huawei.com> Signed-off-by: Shaohua Li <shli@fb.com>
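What le32_to_cpu does for an on-disk field can be shown with a portable userspace equivalent. `get_le32()` is a hypothetical helper for this sketch; it decodes from raw bytes so that the result is the same on little-endian and big-endian hosts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a 32-bit little-endian value from its on-disk bytes. */
uint32_t get_le32(const uint8_t *in)
{
    assert(in != NULL);
    return (uint32_t)in[0] |
           ((uint32_t)in[1] << 8) |
           ((uint32_t)in[2] << 16) |
           ((uint32_t)in[3] << 24);
}
```

Reading such a field directly, without conversion, happens to work on little-endian machines and silently yields a byte-swapped value on big-endian ones.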
-
Committed by Jason Yan
The sb->super_offset should be little-endian, but the rdev->sb_start is in host byte order, so fix this by adding cpu_to_le64. Signed-off-by: Jason Yan <yanaijie@huawei.com> Signed-off-by: Shaohua Li <shli@fb.com>
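The effect of cpu_to_le64 on a host-order value can be sketched the same way in userspace. `put_le64()` is a hypothetical helper, not a kernel function; it writes the bytes in little-endian order regardless of the host's byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode a host-order 64-bit value as 8 little-endian bytes. */
void put_le64(uint8_t *out, uint64_t value)
{
    int i;

    assert(out != NULL);
    for (i = 0; i < 8; i++)
        out[i] = (uint8_t)(value >> (8 * i));
}
```

Comparing a host-order value against an on-disk field is only meaningful after one of the two has been converted like this.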
-
- 10 Mar, 2017 3 commits
-
-
Committed by NeilBrown
These arrays, created with "mdadm --build", don't benefit from a limit. The default will be used, which is '0' and is interpreted as "don't impose a limit". Reported-by: ian_bruce@mail.ru Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
-
Committed by Guoqing Jiang
raid1_resize and raid5_resize should also check mddev->queue in case they run underneath dm-raid. Both set_capacity and revalidate_disk are used in the pers->resize implementations of raid1, raid10 and raid5, so move them from the personality files to common code. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
-
Committed by Shaohua Li
Nobody is using mddev_check_plugged(), so delete the dead code. Signed-off-by: Shaohua Li <shli@fb.com>
-
- 02 Mar, 2017 1 commit
-
-
Committed by Ingo Molnar
We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/signal.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
- 16 Feb, 2017 3 commits
-
-
Committed by Ming Lei
Firstly, bio_clone_mddev() is used in the raid normal I/O path and not in the resync I/O path. Secondly, all direct access to the bvec table in raid happens on resync I/O, except for write behind of raid1, in which we still use bio_clone() for allocating a new bvec table. So this patch replaces bio_clone() with bio_clone_fast() in bio_clone_mddev(). Also kill bio_clone_mddev() and call bio_clone_fast() directly, as suggested by Christoph Hellwig. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>
-
Committed by Ming Lei
mddev is never NULL and neither is ->bio_set, so remove the check. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>
-
Committed by Ming Lei
The current behaviour is to fall back to allocating bios from 'fs_bio_set', which is not correct because it might cause deadlock. So this patch simply returns failure if mddev->bio_set can't be created. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>
-
- 14 Feb, 2017 1 commit
-
-
Committed by NeilBrown
Commit cbd19983 ("md: Fix unfortunate interaction with evms") changed mddev_put() so that it would not destroy an md device while ->ctime was non-zero. Unfortunately, we didn't make sure to clear ->ctime when unloading the module, so it is possible for an md device to remain after module unload. An attempt to open such a device will trigger an invalid memory reference in: get_gendisk -> kobj_lookup -> exact_lock -> get_disk when trying to access disk->fops, which was in the module that has been removed. So ensure we clear ->ctime in md_exit(), and explain how that is useful, as it isn't immediately obvious when looking at the code. Fixes: cbd19983 ("md: Fix unfortunate interaction with evms") Tested-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
-
- 02 Feb, 2017 1 commit
-
-
Committed by Jan Kara
We will want to have struct backing_dev_info allocated separately from struct request_queue. As the first step, add a pointer to backing_dev_info to request_queue and convert all users touching it. No functional changes in this patch. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
-
- 25 Jan, 2017 1 commit
-
-
Committed by Song Liu
For safer operation, all arrays start in write-through mode, which has been better tested and is more mature. In fact the write-through/write-back mode is not persistent across array restarts, so we always start the array in write-through mode. However, if recovery finds data-only stripes from before the shutdown (from a previous write-back mode), it is not safe to start the array in write-through mode, as write-through mode cannot handle stripes with data in the write-back cache. To solve this problem, we flush all data-only stripes in r5l_recovery_log(). When r5l_recovery_log() returns, the array starts with an empty cache in write-through mode. This logic is implemented in r5c_recovery_flush_data_only_stripes():
1. enable write back cache
2. flush all stripes
3. wake up conf->mddev->thread
4. wait for all stripes to get flushed (reuse wait_for_quiescent)
5. disable write back cache
The wait in 4 will be woken up in release_inactive_stripe_list() when conf->active_stripes reaches 0. It is safe to wake up mddev->thread here because all the resources required for the thread have been initialized. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
-
- 09 Dec, 2016 2 commits
-
-
Committed by Shaohua Li
The mddev->flags are used for different purposes. There are a lot of places where we check or change the flags without masking unrelated flags, so we could check or change unrelated flags. These uses are mostly for superblock writes, so separate the superblock-related flags. This should make the code clearer and also fix real bugs. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
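The class of bug described above can be reproduced with two bits sharing one word. Everything in this sketch is hypothetical: the bit names, the two structs and the helper functions are made up for illustration and only loosely follow the split of superblock-related flags into their own word.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit names for this sketch. */
#define MODEL_SB_CHANGE_PENDING (1ul << 0) /* superblock state */
#define MODEL_HAS_JOURNAL       (1ul << 1) /* unrelated array property */

/* Before: one word carries both kinds of flag. */
struct mixed_mddev {
    unsigned long flags;
};

/* After: superblock-related bits live in their own word. */
struct split_mddev {
    unsigned long flags;
    unsigned long sb_flags;
};

/* Unmasked test: an unrelated bit looks like pending superblock work. */
bool mixed_sb_update_pending(const struct mixed_mddev *m)
{
    assert(m != NULL);
    return m->flags != 0;
}

/* With a dedicated word the same unmasked test is correct. */
bool split_sb_update_pending(const struct split_mddev *m)
{
    assert(m != NULL);
    return m->sb_flags != 0;
}
```

An array that merely has a journal would be reported as having a superblock write pending by the first helper, but not by the second.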
-
Committed by Shaohua Li
Fixes: 90f5f7ad ("md: Wait for md_check_recovery before attempting device removal.") Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
-
- 06 Dec, 2016 1 commit
-
-
Committed by NeilBrown
md_open() gets a counted reference on an mddev using mddev_find(). If it ends up returning an error, it must drop this reference. There are two error paths where the reference is not dropped. One only happens if the process is signalled at an awkward time, which is quite unlikely. The other was introduced recently in commit af8d8e6f. Change the code to ensure the reference is dropped when returning an error, and make it harder to re-introduce this sort of bug in the future. Reported-by: Marc Smith <marc.smith@mcc.edu> Fixes: af8d8e6f ("md: changes for MD_STILL_CLOSED flag") Signed-off-by: NeilBrown <neilb@suse.com> Acked-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
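A common way to make this kind of leak hard to re-introduce is to route every error through a single exit label that drops the reference. The sketch below is a userspace model of that pattern, not the md_open() code: `struct model_dev`, `model_open()` and the two failure conditions are assumptions chosen to mirror the two error paths mentioned.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an mddev with a reference count. */
struct model_dev {
    int refs;
    bool closing;
};

/*
 * Take a reference and try to open. Every failure jumps to "out",
 * the only place where the reference is dropped, so a newly added
 * error path cannot forget to drop it.
 */
int model_open(struct model_dev *dev, bool interrupted)
{
    int err = 0;

    assert(dev != NULL);
    dev->refs++;

    if (interrupted) {
        err = -EINTR;
        goto out;
    }
    if (dev->closing) {
        err = -ENODEV;
        goto out;
    }
out:
    if (err)
        dev->refs--;
    return err;
}
```

On success the caller keeps the reference; on either failure the count is back where it started.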
-
- 24 Nov, 2016 2 commits
-
-
Committed by Shaohua Li
__md_stop_writes currently doesn't stop the raid5-cache reclaim thread. It's possible the reclaim thread is still running and doing writes, which doesn't match what __md_stop_writes should do. The extra ->quiesce() call should not harm any raid types. For raid5-cache, this will guarantee we reclaim all caches before we update the superblock. Signed-off-by: Shaohua Li <shli@fb.com> Reviewed-by: NeilBrown <neilb@suse.de> Cc: Song Liu <songliubraving@fb.com>
-
Committed by Shaohua Li
There is a mechanism to suspend a kernel thread. Use it instead of playing the create/destroy game. Signed-off-by: Shaohua Li <shli@fb.com> Reviewed-by: NeilBrown <neilb@suse.de> Cc: Song Liu <songliubraving@fb.com>
-
- 23 Nov, 2016 2 commits
-
-
Committed by NeilBrown
This can only be supported on personalities which ensure that md_error() never causes an array to enter the 'failed' state, i.e. if marking a device Faulty would cause some data to be inaccessible, the device status is left as non-Faulty. This is true for RAID1 and RAID10. If we get a failure writing metadata but the device doesn't fail, it must be the last device, so we re-write without FAILFAST to improve the chance of success. We also flag the device as LastDev so that future metadata updates don't waste time on failfast writes. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
-
Committed by NeilBrown
This patch just adds a 'failfast' per-device flag which can be stored in v0.90 or v1.x metadata. The flag is not used yet but the intent is that it can be used for mirrored (raid1/raid10) arrays where low latency is more important than keeping all devices on-line. Setting the flag for a device effectively gives permission for that device to be marked as Faulty and excluded from the array on the first error. The underlying driver will be directed not to retry requests that result in failures. There is a proviso that the device must not be marked Faulty if that would cause the array as a whole to fail; it may only be marked Faulty if the array remains functional, but is degraded. Failures on read requests will cause the device to be marked as Faulty immediately so that further reads will avoid that device. No attempt will be made to correct read errors by over-writing with the correct data. It is expected that if transient errors, such as cable unplug, are possible, then something in user-space will revalidate failed devices and re-add them when they appear to be working again. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
-