- 26 7月, 2010 4 次提交
-
-
由 NeilBrown 提交于
md/raid5 uses the plugging infrastructure provided by the block layer and 'struct request_queue'. However when we plug raid5 under dm there is no request queue so we cannot use that. So create a similar infrastructure that is much lighter weight and use it for raid5. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
the dm module will need this for dm-raid45. Also only access ->queue->backing_dev_info->congested_fn if ->queue actually exists. It won't in a dm target. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
dm-raid456 does not provide a 'queue' for raid5 to use, so we must make raid5 stop depending on the queue. First: read_ahead dm handles read-ahead adjustment fully in userspace, so simply don't do any readahead adjustments if there is no queue. Also re-arrange code slightly so all the accesses to ->queue are together. Finally, move the blk_queue_merge_bvec function into the 'if' as the ->split_io setting in dm-raid456 has the same effect. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
We will shortly allow md devices with no gendisk (they are attached to a dm-target instead). That will cause mdname() to return 'mdX'. There is one place where mdname really needs to be unique: when creating the name for a slab cache. So in that case, if there is no gendisk, you the address of the mddev formatted in HEX to provide a unique name. Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 21 7月, 2010 2 次提交
-
-
由 NeilBrown 提交于
Separate the actual 'change' code from the sysfs interface so that it can eventually be called internally. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
We will want md devices to live as dm targets where sysfs is not visible. So allow md to not connect to sysfs. Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 24 6月, 2010 6 次提交
-
-
由 NeilBrown 提交于
There are few situations where it would make any sense to add a spare when reducing the number of devices in an array, but it is conceivable: A 6 drive RAID6 with two missing devices could be reshaped to a 5 drive RAID6, and a spare could become available just in time for the reshape, but not early enough to have been recovered first. 'freezing' recovery can make this easy to do without any races. However doing such a thing is a bad idea. md will not record the partially-recovered state of the 'spare' and when the reshape finished it will think that the spare is still spare. Easiest way to avoid this confusion is to simply disallow it. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
As the comment says, the tail of this loop only applies to devices that are not fully in sync, so if In_sync was set, we should avoid the rest of the loop. This bug will hardly ever cause an actual problem. The worst it can do is allow an array to be assembled that is dirty and degraded, which is not generally a good idea (without warning the sysadmin first). This will only happen if the array is RAID4 or a RAID5/6 in an intermediate state during a reshape and so has one drive that is all 'parity' - no data - while some other device has failed. This is certainly possible, but not at all common. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
During a recovery of reshape the early part of some devices might be in-sync while the later parts are not. We we know we are looking at an early part it is good to treat that part as in-sync for stripe calculations. This is particularly important for a reshape which suffers device failure. Treating the data as in-sync can mean the difference between data-safety and data-loss. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
When we are reshaping an array, the device failure combinations that cause us to decide that the array as failed are more subtle. In particular, any 'spare' will be fully in-sync in the section of the array that has already been reshaped, thus failures that affect only that section are less critical. So encode this subtlety in a new function and call it as appropriate. The case that showed this problem was a 4 drive RAID5 to 8 drive RAID6 conversion where the last two devices failed. This resulted in: good good good good incomplete good good failed failed while converting a 5-drive RAID6 to 8 drive RAID5 The incomplete device causes the whole array to look bad, bad as it was actually good for the section that had been converted to 8-drives, all the data was actually safe. Reported-by: NTerry Morris <tbmorris@tbmorris.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
When an array is reshaped to have fewer devices, the reshape proceeds from the end of the devices to the beginning. If a device happens to be non-In_sync (which is possible but rare) we would normally update the ->recovery_offset as the reshape progresses. However that would be wrong as the recover_offset records that the early part of the device is in_sync, while in fact it would only be the later part that is in_sync, and in any case the offset number would be measured from the wrong end of the device. Relatedly, if after a reshape a spare is discovered to not be recoverred all the way to the end, not allow spare_active to incorporate it in the array. This becomes relevant in the following sample scenario: A 4 drive RAID5 is converted to a 6 drive RAID6 in a combined operation. The RAID5->RAID6 conversion will cause a 5 drive to be included as a spare, then the 5drive -> 6drive reshape will effectively rebuild that spare as it progresses. The 6th drive is treated as in_sync the whole time as there is never any case that we might consider reading from it, but must not because there is no valid data. If we interrupt this reshape part-way through and reverse it to return to a 5-drive RAID6 (or event a 4-drive RAID5), we don't want to update the recovery_offset - as that would be wrong - and we don't want to include that spare as active in the 5-drive RAID6 when the reversed reshape completed and it will be mostly out-of-sync still. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
The entries in the stripe_cache maintained by raid5 are enlarged when we increased the number of devices in the array, but not shrunk when we reduce the number of devices. So if entries are added after reducing the number of devices, we much ensure to initialise the whole entry, not just the part that is currently relevant. Otherwise if we enlarge the array again, we will reference uninitialised values. As grow_buffers/shrink_buffer now want to use a count that is stored explicity in the raid_conf, they should get it from there rather than being passed it as a parameter. Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 28 5月, 2010 1 次提交
-
-
由 Akinobu Mita 提交于
By the previous modification, the cpu notifier can return encapsulate errno value. This converts the cpu notifiers for raid5. Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 18 5月, 2010 10 次提交
-
-
由 Gabriele A. Trombetti 提交于
Fix: Raid-6 was not trying to correct a read-error when in singly-degraded state and was instead dropping one more device, going to doubly-degraded state. This patch fixes this behaviour. Tested-by: NJanos Haar <janos.haar@netcenter.hu> Signed-off-by: NGabriele A. Trombetti <g.trombetti.lkrnl1213@logicschema.com> Reported-by: NJanos Haar <janos.haar@netcenter.hu> Signed-off-by: NNeilBrown <neilb@suse.de> Cc: stable@kernel.org
-
由 NeilBrown 提交于
Many 'printk' messages from the raid456 module mention 'raid5' even though it may be a 'raid6' or even 'raid4' array. This can cause confusion. Also the actual array name is not always reported and when it is it is not reported consistently. So change all the messages to start: md/raid:%s: where '%s' becomes e.g. md3 to identify the particular array. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 Dan Williams 提交于
For consistency allow raid4 to takeover raid0 in addition to raid5 (with a raid4 layout). Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 NeilBrown 提交于
We used to pass the personality make_request function direct to the block layer so the first argument had to be a queue. But now we have the intermediary md_make_request so it makes at lot more sense to pass a struct mddev_s. It makes it possible to have an mddev without its own queue too. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
We set ->changed to 1 and call check_disk_change at the end of md_open so that bd_invalidated would be set and thus partition rescan would happen appropriately. Now that we call revalidate_disk directly, which sets bd_invalidates, that indirection is no longer needed and can be removed. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
While I generally prefer letting personalities do as much as possible, given that we have a central md_make_request anyway we may as well use it to simplify code. Also this centralises knowledge of ->gendisk which will help later. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
Diving through ->queue to find mddev is unnecessarily complex - there is an easier path to finding mddev, so use that. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
This is unlikely to be wanted, but we may as well provide it for completeness. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 Trela Maciej 提交于
Signed-off-by: NMaciej Trela <maciej.trela@intel.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 H Hartley Sweeten 提交于
void pointers do not need to be cast to other pointer types. Signed-off-by: NH Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 17 5月, 2010 1 次提交
-
-
由 NeilBrown 提交于
Some levels expect the 'redundancy group' to be present, others don't. So when we change level of an array we might need to add or remove this group. This requires fixing up the current practice of overloading ->private to indicate (when ->pers == NULL) that something needs to be removed. So create a new ->to_remove to fill that role. When changing levels, we may need to add or remove attributes. When changing RAID5 -> RAID6, we both add and remove the same thing. It is important to catch this and optimise it out as the removal is delayed until a lock is released, so trying to add immediately would cause problems. Cc: stable@kernel.org Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 07 5月, 2010 1 次提交
-
-
由 Gabriele A. Trombetti 提交于
Fix: Raid-6 was not trying to correct a read-error when in singly-degraded state and was instead dropping one more device, going to doubly-degraded state. This patch fixes this behaviour. Tested-by: NJanos Haar <janos.haar@netcenter.hu> Signed-off-by: NGabriele A. Trombetti <g.trombetti.lkrnl1213@logicschema.com> Reported-by: NJanos Haar <janos.haar@netcenter.hu> Signed-off-by: NNeilBrown <neilb@suse.de> Cc: stable@kernel.org
-
- 23 4月, 2010 1 次提交
-
-
由 NeilBrown 提交于
Previous patch changes stripe and chunk_number to sector_t but mistakenly did not update all of the divisions to use sector_dev(). This patch changes all the those divisions (actually the '%' operator) to sector_div. Signed-off-by: NNeilBrown <neilb@suse.de> Cc: stable@kernel.org Tested-by: NStefan Lippers-Hollmann <s.l-h@gmx.de>
-
- 20 4月, 2010 1 次提交
-
-
由 NeilBrown 提交于
With many large drives and small chunk sizes it is possible to create a RAID5 with more than 2^31 chunks. Make sure this works. Reported-by: NBrett King <king.br@gmail.com> Signed-off-by: NNeilBrown <neilb@suse.de> Cc: stable@kernel.org
-
- 30 3月, 2010 1 次提交
-
-
由 Tejun Heo 提交于
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: NTejun Heo <tj@kernel.org> Guess-its-ok-by: NChristoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
-
- 26 2月, 2010 1 次提交
-
-
由 Martin K. Petersen 提交于
Except for SCSI no device drivers distinguish between physical and hardware segment limits. Consolidate the two into a single segment limit. Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com> Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
- 17 2月, 2010 1 次提交
-
-
由 Tejun Heo 提交于
Add __percpu sparse annotations to places which didn't make it in one of the previous patches. All converions are trivial. These annotations are to make sparse consider percpu variables to be in a different address space and warn if accessed without going through percpu accessors. This patch doesn't affect normal builds. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NBorislav Petkov <borislav.petkov@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Len Brown <lenb@kernel.org> Cc: Neil Brown <neilb@suse.de>
-
- 10 2月, 2010 1 次提交
-
-
由 NeilBrown 提交于
====== This fix is related to http://bugzilla.kernel.org/show_bug.cgi?id=15142 but does not address that exact issue. ====== sysfs does like attributes being removed while they are being accessed (i.e. read or written) and waits for the access to complete. As accessing some md attributes takes the same lock that is held while removing those attributes a deadlock can occur. This patch addresses 3 issues in md that could lead to this deadlock. Two relate to calling flush_scheduled_work while the lock is held. This is probably a bad idea in general and as we use schedule_work to delete various sysfs objects it is particularly bad. In one case flush_scheduled_work is called from md_alloc (called by md_probe) called from do_md_run which holds the lock. This call is only present to ensure that ->gendisk is set. However we can be sure that gendisk is always set (though possibly we couldn't when that code was originally written. This is because do_md_run is called in three different contexts: 1/ from md_ioctl. This requires that md_open has succeeded, and it fails if ->gendisk is not set. 2/ from writing a sysfs attribute. This can only happen if the mddev has been registered in sysfs which happens in md_alloc after ->gendisk has been set. 3/ from autorun_array which is only called by autorun_devices, which checks for ->gendisk to be set before calling autorun_array. So the call to md_probe in do_md_run can be removed, and the check on ->gendisk can also go. In the other case flush_scheduled_work is being called in do_md_stop, purportedly to wait for all md_delayed_delete calls (which delete the component rdevs) to complete. However there really isn't any need to wait for them - they have already been disconnected in all important ways. The third issue is that raid5->stop() removes some attribute names while the lock is held. There is already some infrastructure in place to delay attribute removal until after the lock is released (using schedule_work). So extend that infrastructure to remove the raid5_attrs_group. This does not address all lockdep issues related to the sysfs "s_active" lock. The rest can be address by splitting that lockdep context between symlinks and non-symlinks which hopefully will happen. Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 09 2月, 2010 1 次提交
-
-
由 NeilBrown 提交于
This code was written long ago when it was not possible to reshape a degraded array. Now it is so the current level of degraded-ness needs to be taken in to account. Also newly addded devices should only reduce degradedness if they are deemed to be in-sync. In particular, if you convert a RAID5 to a RAID6, and increase the number of devices at the same time, then the 5->6 conversion will make the array degraded so the current code will produce a wrong value for 'degraded' - "-1" to be precise. If the reshape runs to completion end_reshape will calculate a correct new value for 'degraded', but if a device fails during the reshape an incorrect decision might be made based on the incorrect value of "degraded". This patch is suitable for 2.6.32-stable and if they are still open, 2.6.31-stable and 2.6.30-stable as well. Cc: stable@kernel.org Reported-by: NMichael Evans <mjevans1983@gmail.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 14 12月, 2009 4 次提交
-
-
由 NeilBrown 提交于
Suggested by Oren Held <orenhe@il.ibm.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
The post-barrier-flush is sent by md as soon as make_request on the barrier write completes. For raid5, the data might not be in the per-device queues yet. So for barrier requests, wait for any pre-reading to be done so that the request will be in the per-device queues. We use the 'preread_active' count to check that nothing is still in the preread phase, and delay the decrement of this count until after write requests have been submitted to the underlying devices. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
Previously barriers were only supported on RAID1. This is because other levels requires synchronisation across all devices and so needed a different approach. Here is that approach. When a barrier arrives, we send a zero-length barrier to every active device. When that completes - and if the original request was not empty - we submit the barrier request itself (with the barrier flag cleared) and then submit a fresh load of zero length barriers. The barrier request itself is asynchronous, but any subsequent request will block until the barrier completes. The reason for clearing the barrier flag is that a barrier request is allowed to fail. If we pass a non-empty barrier through a striping raid level it is conceivable that part of it could succeed and part could fail. That would be way too hard to deal with. So if the first run of zero length barriers succeed, we assume all is sufficiently well that we send the request and ignore errors in the second run of barriers. RAID5 needs extra care as write requests may not have been submitted to the underlying devices yet. So we flush the stripe cache before proceeding with the barrier. Note that the second set of zero-length barriers are submitted immediately after the original request is submitted. Thus when a personality finds mddev->barrier to be set during make_request, it should not return from make_request until the corresponding per-device request(s) have been queued. That will be done in later patches. Signed-off-by: NNeilBrown <neilb@suse.de> Reviewed-by: NAndre Noll <maan@systemlinux.org>
-
由 NeilBrown 提交于
qd_idx is previously declared and given exactly the same value! Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 13 11月, 2009 2 次提交
-
-
由 NeilBrown 提交于
Normally is it not safe to allow a raid5 that is both dirty and degraded to be assembled without explicit request from that admin, as it can cause hidden data corruption. This is because 'dirty' means that the parity cannot be trusted, and 'degraded' means that the parity needs to be used. However, if the device that is missing contains only parity, then there is no issue and assembly can continue. This particularly applies when a RAID5 is being converted to a RAID6 and there is an unclean shutdown while the conversion is happening. So check for whether the degraded space only contains parity, and in that case, allow the assembly. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
When a reshape finds that it can add spare devices into the array, those devices might already be 'in_sync' if they are beyond the old size of the array, or they might not if they are within the array. The first case happens when we change an N-drive RAID5 to an N+1-drive RAID5. The second happens when we convert an N-drive RAID5 to an N+1-drive RAID6. So set the flag more carefully. Also, ->recovery_offset is only meaningful when the flag is clear, so only set it in that case. This change needs the preceding two to ensure that the non-in_sync device doesn't get evicted from the array when it is stopped, in the case where v0.90 metadata is used. Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 06 11月, 2009 1 次提交
-
-
由 NeilBrown 提交于
This value is visible through sysfs and is used by mdadm when it manages a reshape (backing up data that is about to be rearranged). So it is important that it is always correct. Current it does not get updated properly when a reshape starts which can cause problems when assembling an array that is in the middle of being reshaped. This is suitable for 2.6.31.y stable kernels. Cc: stable@kernel.org Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 20 10月, 2009 1 次提交
-
-
由 Dan Williams 提交于
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-