- 18 5月, 2010 9 次提交
-
-
由 NeilBrown 提交于
Many 'printk' messages from the raid456 module mention 'raid5' even though it may be a 'raid6' or even 'raid4' array. This can cause confusion. Also the actual array name is not always reported and when it is it is not reported consistently. So change all the messages to start: md/raid:%s: where '%s' becomes e.g. md3 to identify the particular array. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 Dan Williams 提交于
For consistency allow raid4 to takeover raid0 in addition to raid5 (with a raid4 layout). Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 NeilBrown 提交于
We used to pass the personality make_request function direct to the block layer so the first argument had to be a queue. But now we have the intermediary md_make_request so it makes at lot more sense to pass a struct mddev_s. It makes it possible to have an mddev without its own queue too. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
We set ->changed to 1 and call check_disk_change at the end of md_open so that bd_invalidated would be set and thus partition rescan would happen appropriately. Now that we call revalidate_disk directly, which sets bd_invalidates, that indirection is no longer needed and can be removed. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
While I generally prefer letting personalities do as much as possible, given that we have a central md_make_request anyway we may as well use it to simplify code. Also this centralises knowledge of ->gendisk which will help later. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
Diving through ->queue to find mddev is unnecessarily complex - there is an easier path to finding mddev, so use that. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
This is unlikely to be wanted, but we may as well provide it for completeness. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 Trela Maciej 提交于
Signed-off-by: NMaciej Trela <maciej.trela@intel.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 H Hartley Sweeten 提交于
void pointers do not need to be cast to other pointer types. Signed-off-by: NH Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 17 5月, 2010 1 次提交
-
-
由 NeilBrown 提交于
Some levels expect the 'redundancy group' to be present, others don't. So when we change level of an array we might need to add or remove this group. This requires fixing up the current practice of overloading ->private to indicate (when ->pers == NULL) that something needs to be removed. So create a new ->to_remove to fill that role. When changing levels, we may need to add or remove attributes. When changing RAID5 -> RAID6, we both add and remove the same thing. It is important to catch this and optimise it out as the removal is delayed until a lock is released, so trying to add immediately would cause problems. Cc: stable@kernel.org Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 26 2月, 2010 1 次提交
-
-
由 Martin K. Petersen 提交于
Except for SCSI no device drivers distinguish between physical and hardware segment limits. Consolidate the two into a single segment limit. Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com> Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
- 17 2月, 2010 1 次提交
-
-
由 Tejun Heo 提交于
Add __percpu sparse annotations to places which didn't make it in one of the previous patches. All converions are trivial. These annotations are to make sparse consider percpu variables to be in a different address space and warn if accessed without going through percpu accessors. This patch doesn't affect normal builds. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NBorislav Petkov <borislav.petkov@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Len Brown <lenb@kernel.org> Cc: Neil Brown <neilb@suse.de>
-
- 10 2月, 2010 1 次提交
-
-
由 NeilBrown 提交于
====== This fix is related to http://bugzilla.kernel.org/show_bug.cgi?id=15142 but does not address that exact issue. ====== sysfs does like attributes being removed while they are being accessed (i.e. read or written) and waits for the access to complete. As accessing some md attributes takes the same lock that is held while removing those attributes a deadlock can occur. This patch addresses 3 issues in md that could lead to this deadlock. Two relate to calling flush_scheduled_work while the lock is held. This is probably a bad idea in general and as we use schedule_work to delete various sysfs objects it is particularly bad. In one case flush_scheduled_work is called from md_alloc (called by md_probe) called from do_md_run which holds the lock. This call is only present to ensure that ->gendisk is set. However we can be sure that gendisk is always set (though possibly we couldn't when that code was originally written. This is because do_md_run is called in three different contexts: 1/ from md_ioctl. This requires that md_open has succeeded, and it fails if ->gendisk is not set. 2/ from writing a sysfs attribute. This can only happen if the mddev has been registered in sysfs which happens in md_alloc after ->gendisk has been set. 3/ from autorun_array which is only called by autorun_devices, which checks for ->gendisk to be set before calling autorun_array. So the call to md_probe in do_md_run can be removed, and the check on ->gendisk can also go. In the other case flush_scheduled_work is being called in do_md_stop, purportedly to wait for all md_delayed_delete calls (which delete the component rdevs) to complete. However there really isn't any need to wait for them - they have already been disconnected in all important ways. The third issue is that raid5->stop() removes some attribute names while the lock is held. There is already some infrastructure in place to delay attribute removal until after the lock is released (using schedule_work). So extend that infrastructure to remove the raid5_attrs_group. This does not address all lockdep issues related to the sysfs "s_active" lock. The rest can be address by splitting that lockdep context between symlinks and non-symlinks which hopefully will happen. Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 09 2月, 2010 1 次提交
-
-
由 NeilBrown 提交于
This code was written long ago when it was not possible to reshape a degraded array. Now it is so the current level of degraded-ness needs to be taken in to account. Also newly addded devices should only reduce degradedness if they are deemed to be in-sync. In particular, if you convert a RAID5 to a RAID6, and increase the number of devices at the same time, then the 5->6 conversion will make the array degraded so the current code will produce a wrong value for 'degraded' - "-1" to be precise. If the reshape runs to completion end_reshape will calculate a correct new value for 'degraded', but if a device fails during the reshape an incorrect decision might be made based on the incorrect value of "degraded". This patch is suitable for 2.6.32-stable and if they are still open, 2.6.31-stable and 2.6.30-stable as well. Cc: stable@kernel.org Reported-by: NMichael Evans <mjevans1983@gmail.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 14 12月, 2009 4 次提交
-
-
由 NeilBrown 提交于
Suggested by Oren Held <orenhe@il.ibm.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
The post-barrier-flush is sent by md as soon as make_request on the barrier write completes. For raid5, the data might not be in the per-device queues yet. So for barrier requests, wait for any pre-reading to be done so that the request will be in the per-device queues. We use the 'preread_active' count to check that nothing is still in the preread phase, and delay the decrement of this count until after write requests have been submitted to the underlying devices. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
Previously barriers were only supported on RAID1. This is because other levels requires synchronisation across all devices and so needed a different approach. Here is that approach. When a barrier arrives, we send a zero-length barrier to every active device. When that completes - and if the original request was not empty - we submit the barrier request itself (with the barrier flag cleared) and then submit a fresh load of zero length barriers. The barrier request itself is asynchronous, but any subsequent request will block until the barrier completes. The reason for clearing the barrier flag is that a barrier request is allowed to fail. If we pass a non-empty barrier through a striping raid level it is conceivable that part of it could succeed and part could fail. That would be way too hard to deal with. So if the first run of zero length barriers succeed, we assume all is sufficiently well that we send the request and ignore errors in the second run of barriers. RAID5 needs extra care as write requests may not have been submitted to the underlying devices yet. So we flush the stripe cache before proceeding with the barrier. Note that the second set of zero-length barriers are submitted immediately after the original request is submitted. Thus when a personality finds mddev->barrier to be set during make_request, it should not return from make_request until the corresponding per-device request(s) have been queued. That will be done in later patches. Signed-off-by: NNeilBrown <neilb@suse.de> Reviewed-by: NAndre Noll <maan@systemlinux.org>
-
由 NeilBrown 提交于
qd_idx is previously declared and given exactly the same value! Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 13 11月, 2009 2 次提交
-
-
由 NeilBrown 提交于
Normally is it not safe to allow a raid5 that is both dirty and degraded to be assembled without explicit request from that admin, as it can cause hidden data corruption. This is because 'dirty' means that the parity cannot be trusted, and 'degraded' means that the parity needs to be used. However, if the device that is missing contains only parity, then there is no issue and assembly can continue. This particularly applies when a RAID5 is being converted to a RAID6 and there is an unclean shutdown while the conversion is happening. So check for whether the degraded space only contains parity, and in that case, allow the assembly. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
When a reshape finds that it can add spare devices into the array, those devices might already be 'in_sync' if they are beyond the old size of the array, or they might not if they are within the array. The first case happens when we change an N-drive RAID5 to an N+1-drive RAID5. The second happens when we convert an N-drive RAID5 to an N+1-drive RAID6. So set the flag more carefully. Also, ->recovery_offset is only meaningful when the flag is clear, so only set it in that case. This change needs the preceding two to ensure that the non-in_sync device doesn't get evicted from the array when it is stopped, in the case where v0.90 metadata is used. Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 06 11月, 2009 1 次提交
-
-
由 NeilBrown 提交于
This value is visible through sysfs and is used by mdadm when it manages a reshape (backing up data that is about to be rearranged). So it is important that it is always correct. Current it does not get updated properly when a reshape starts which can cause problems when assembling an array that is in the middle of being reshaped. This is suitable for 2.6.31.y stable kernels. Cc: stable@kernel.org Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 20 10月, 2009 1 次提交
-
-
由 Dan Williams 提交于
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
- 16 10月, 2009 6 次提交
-
-
由 NeilBrown 提交于
md/raid6 passes a list of 'struct page *' to the async_tx routines, which then either DMA map them for offload, or take the page_address for CPU based calculations. For RAID6 we sometime leave 'blanks' in the list of pages. For CPU based calcs, we want to treat theses as a page of zeros. For offloaded calculations, we simply don't pass a page to the hardware. Currently the 'blanks' are encoded as a pointer to raid6_empty_zero_page. This is a 4096 byte memory region, not a 'struct page'. This is mostly handled correctly but is rather ugly. So change the code to pass and expect a NULL pointer for the blanks. When taking page_address of a page, we need to check for a NULL and in that case use raid6_empty_zero_page. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
When a raid5 (or raid6) array is being reshaped to have fewer devices, conf->raid_disks is the latter and hence smaller number of devices. However sometimes we want to use a number which is the total number of currently required devices - the larger of the 'old' and 'new' sizes. Before we implemented reducing the number of devices, this was always 'new' i.e. ->raid_disks. Now we need max(raid_disks, previous_raid_disks) in those places. This particularly affects assembling an array that was shutdown while in the middle of a reshape to fewer devices. md.c needs a similar fix when interpreting the md metadata. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 Dan Williams 提交于
The percpu conversion allowed a straightforward handoff of stripe processing to the async subsytem that initially showed some modest gains (+4%). However, this model is too simplistic and leads to stripes bouncing between raid5d and the async thread pool for every invocation of handle_stripe(). As reported by Holger this can fall into a pathological situation severely impacting throughput (6x performance loss). By downleveling the parallelism to raid_run_ops the pathological stripe_head bouncing is eliminated. This version still exhibits an average 11% throughput loss for: mdadm --create /dev/md0 /dev/sd[b-q] -n 16 -l 6 echo 1024 > /sys/block/md0/md/stripe_cache_size dd if=/dev/zero of=/dev/md0 bs=1024k count=2048 ...but the results are at least stable and can be used as a base for further multicore experimentation. Reported-by: NHolger Kiehl <Holger.Kiehl@dwd.de> Signed-off-by: NDan Williams <dan.j.williams@intel.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 Dan Williams 提交于
Deallocating a raid5_conf_t structure requires taking 'device_lock'. Ensure it is initialized before it is used, i.e. initialize the lock before attempting any further initializations that might fail. Signed-off-by: NDan Williams <dan.j.williams@intel.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
This reverts commit df10cfbc. This patch was based on a misunderstanding and risks introducing a busy-wait loop. So revert it. Acked-by: NDan Williams <dan.j.williams@intel.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 23 9月, 2009 3 次提交
-
-
由 NeilBrown 提交于
This should writeback from coming when the device is temporarily suspended. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
The management thread for raid4,5,6 arrays are all called mdX_raid5, independent of the actual raid level, which is wrong and can be confusion. So change md_register_thread to use the name from the personality unless no alternate name (like 'resync' or 'reshape') is given. This is simpler and more correct. Cc: Jinzc <zhenchengjin@gmail.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
Rename some variable and remove some duplicate definitions to avoid there warnings. None of them are actual errors. Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 17 9月, 2009 2 次提交
-
-
由 Dan Williams 提交于
Neil says: "It is correct as it stands, but the fact that every branch in the 'if' part ends with a 'return' isn't immediately obvious, so it is clearer if we are explicit about the if / then / else structure." Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Dan Williams 提交于
As pointed out by Neil it should be possible to build a driver with all BUG_ON statements deleted. It's bad form to have a BUG_ON with a side effect. Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
- 11 9月, 2009 1 次提交
-
-
由 Jens Axboe 提交于
Get rid of any functions that test for these bits and make callers use bio_rw_flagged() directly. Then it is at least directly apparent what variable and flag they check. Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
- 09 9月, 2009 1 次提交
-
-
由 Dan Williams 提交于
Some engines optimize operation by reading ahead in the descriptor chain such that descriptor2 may start execution before descriptor1 completes. If descriptor2 depends on the result from descriptor1 then a fence is required (on descriptor2) to disable this optimization. The async_tx api could implicitly identify dependencies via the 'depend_tx' parameter, but that would constrain cases where the dependency chain only specifies a completion order rather than a data dependency. So, provide an ASYNC_TX_FENCE to explicitly identify data dependencies. Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
- 30 8月, 2009 5 次提交
-
-
由 Dan Williams 提交于
Now that the resources to handle stripe_head operations are allocated percpu it is possible for raid5d to distribute stripe handling over multiple cores. This conversion also adds a call to cond_resched() in the non-multicore case to prevent one core from getting monopolized for raid operations. Cc: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Yuri Tikhonov 提交于
These routines have been replaced by there asynchronous counterparts. Signed-off-by: NYuri Tikhonov <yur@emcraft.com> Signed-off-by: NIlya Yanok <yanok@emcraft.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Yuri Tikhonov 提交于
1/ Use STRIPE_OP_BIOFILL to offload completion of read requests to raid_run_ops 2/ Implement a handler for sh->reconstruct_state similar to the raid5 case (adds handling of Q parity) 3/ Prevent handle_parity_checks6 from running concurrently with 'compute' operations 4/ Hook up raid_run_ops Signed-off-by: NYuri Tikhonov <yur@emcraft.com> Signed-off-by: NIlya Yanok <yanok@emcraft.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Dan Williams 提交于
[ Based on an original patch by Yuri Tikhonov ] Implement the state machine for handling the RAID-6 parities check and repair functionality. Note that the raid6 case does not need to check for new failures, like raid5, as it will always writeback the correct disks. The raid5 case can be updated to check zero_sum_result to avoid getting confused by new failures rather than retrying the entire check operation. Signed-off-by: NYuri Tikhonov <yur@emcraft.com> Signed-off-by: NIlya Yanok <yanok@emcraft.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Yuri Tikhonov 提交于
In the synchronous implementation of stripe dirtying we processed a degraded stripe with one call to handle_stripe_dirtying6(). I.e. compute the missing blocks from the other drives, then copy in the new data and reconstruct the parities. In the asynchronous case we do not perform stripe operations directly. Instead, operations are scheduled with flags to be later serviced by raid_run_ops. So, for the degraded case the final reconstruction step can only be carried out after all blocks have been brought up to date by being read, or computed. Like the raid5 case schedule_reconstruction() sets STRIPE_OP_RECONSTRUCT to request a parity generation pass and through operation chaining can handle compute and reconstruct in a single raid_run_ops pass. [dan.j.williams@intel.com: fixup handle_stripe_dirtying6 gating] Signed-off-by: NYuri Tikhonov <yur@emcraft.com> Signed-off-by: NIlya Yanok <yanok@emcraft.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-