- 02 8月, 2012 1 次提交
-
-
由 NeilBrown 提交于
queuing writes to the md thread means that all requests go through the one processor which may not be able to keep up with very high request rates. So use the plugging infrastructure to submit all requests on unplug. If a 'schedule' is needed, we fall back on the old approach of handing the requests to the thread for it to handle. Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 31 7月, 2012 8 次提交
-
-
由 NeilBrown 提交于
This seemed like a good idea at the time, but after further thought I cannot see it making a difference other than very occasionally and testing to try to exercise the case it is most likely to help did not show any performance difference by removing it. So remove the counting of active plugs and allow 'pending writes' to be activated at any time, not just when no plugs are active. This is only relevant when there is a write-intent bitmap, and the updating of the bitmap will likely introduce enough delay that the single-threading of bitmap updates will be enough to collect large numbers of updates together. Removing this will make it easier to centralise the unplug code, and will clear the other for other unplug enhancements which have a measurable effect. Signed-off-by: NNeilBrown <neilb@suse.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Alexander Lyakas 提交于
When doing resync or repair, attempt to correct bad blocks, according to WriteErrorSeen policy Signed-off-by: NAlex Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
If a resync of a RAID1 array with 2 devices finds a known bad block one device it will neither read from, or write to, that device for this block offset. So there will be one read_target (The other device) and zero write targets. This condition causes md/raid1 to abort the resync assuming that it has finished - without known bad blocks this would be true. When there are no write targets because of the presence of bad blocks we should only skip over the area covered by the bad block. RAID10 already gets this right, raid1 doesn't. Or didn't. As this can cause a 'sync' to abort early and appear to have succeeded it could lead to some data corruption, so it suitable for -stable. Cc: stable@vger.kernel.org Reported-by: NAlexander Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 Shaohua Li 提交于
For SSD, if request size exceeds specific value (optimal io size), request size isn't important for bandwidth. In such condition, if making request size bigger will cause some disks idle, the total throughput will actually drop. A good example is doing a readahead in a two-disk raid1 setup. So when should we split big requests? We absolutly don't want to split big request to very small requests. Even in SSD, big request transfer is more efficient. This patch only considers request with size above optimal io size. If all disks are busy, is it worth doing a split? Say optimal io size is 16k, two requests 32k and two disks. We can let each disk run one 32k request, or split the requests to 4 16k requests and each disk runs two. It's hard to say which case is better, depending on hardware. So only consider case where there are idle disks. For readahead, split is always better in this case. And in my test, below patch can improve > 30% thoughput. Hmm, not 100%, because disk isn't 100% busy. Such case can happen not just in readahead, for example, in directio. But I suppose directio usually will have bigger IO depth and make all disks busy, so I ignored it. Note: if the raid uses any hard disk, we don't prevent merging. That will make performace worse. Signed-off-by: NShaohua Li <shli@fusionio.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 Shaohua Li 提交于
SSD hasn't spindle, distance between requests means nothing. And the original distance based algorithm sometimes can cause severe performance issue for SSD raid. Considering two thread groups, one accesses file A, the other access file B. The first group will access one disk and the second will access the other disk, because requests are near from one group and far between groups. In this case, read balance might keep one disk very busy but the other relative idle. For SSD, we should try best to distribute requests to as many disks as possible. There isn't spindle move penality anyway. With below patch, I can see more than 50% throughput improvement sometimes depending on workloads. The only exception is small requests can be merged to a big request which typically can drive higher throughput for SSD too. Such small requests are sequential reads. Unlike hard disk, sequential read which can't be merged (for example direct IO, or read without readahead) can be ignored for SSD. Again there is no spindle move penality. readahead dispatches small requests and such requests can be merged. Last patch can help detect sequential read well, at least if concurrent read number isn't greater than raid disk number. In that case, distance based algorithm doesn't work well too. V2: For hard disk and SSD mixed raid, doesn't use distance based algorithm for random IO too. This makes the algorithm generic for raid with SSD. Signed-off-by: NShaohua Li <shli@fusionio.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 Shaohua Li 提交于
Currently the sequential read detection is global wide. It's natural to make it per disk based, which can improve the detection for concurrent multiple sequential reads. And next patch will make SSD read balance not use distance based algorithm, where this change help detect truly sequential read for SSD. Signed-off-by: NShaohua Li <shli@fusionio.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 Jonathan Brassow 提交于
MD RAID1/RAID10: Move some macros from .h file to .c file There are three macros (IO_BLOCKED,IO_MADE_GOOD,BIO_SPECIAL) which are defined in both raid1.h and raid10.h. They are only used in there respective .c files. However, if we wish to make RAID10 accessible to the device-mapper RAID target (dm-raid.c), then we need to move these macros into the .c files where they are used so that they do not conflict with each other. The macros from the two files are identical and could be moved into md.h, but I chose to leave the duplication and have them remain in the personality files. Signed-off-by: NJonathan Brassow <jbrassow@redhat.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 Jonathan Brassow 提交于
MD RAID1: Rename the structure 'mirror_info' to 'raid1_info' The same structure name ('mirror_info') is used by raid10. Each of these structures are defined in there respective header files. If dm-raid is to support both RAID1 and RAID10, the header files will be included and the structure names must not collide. While only one of these structure names needs to change, this patch adds consistency to the naming of the structure. Signed-off-by: NJonathan Brassow <jbrassow@redhat.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 19 7月, 2012 1 次提交
-
-
由 NeilBrown 提交于
commit 4367af55 md/raid1: clear bad-block record when write succeeds. Added a 'reschedule_retry' call possibility at the end of end_sync_write, but didn't add matching code at the end of sync_request_write. So if the writes complete very quickly, or scheduling makes it seem that way, then we can miss rescheduling the request and the resync could hang. Also commit 73d5c38a md: avoid races when stopping resync. Fix a race condition in this same code in end_sync_write but didn't make the change in sync_request_write. This patch updates sync_request_write to fix both of those. Patch is suitable for 3.1 and later kernels. Reported-by: NAlexander Lyakas <alex.bolshoy@gmail.com> Original-version-by: NAlexander Lyakas <alex.bolshoy@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 09 7月, 2012 1 次提交
-
-
由 NeilBrown 提交于
This bug has been present ever since data-check was introduce in 2.6.16. However it would only fire if a data-check were done on a degraded array, which was only possible if the array has 3 or more devices. This is certainly possible, but is quite uncommon. Since hot-replace was added in 3.3 it can happen more often as the same condition can arise if not all possible replacements are present. The problem is that as soon as we submit the last read request, the 'r1_bio' structure could be freed at any time, so we really should stop looking at it. If the last device is being read from we will stop looking at it. However if the last device is not due to be read from, we will still check the bio pointer in the r1_bio, but the r1_bio might already be free. So use the read_targets counter to make sure we stop looking for bios to submit as soon as we have submitted them all. This fix is suitable for any -stable kernel since 2.6.16. Cc: stable@vger.kernel.org Reported-by: NArnold Schulz <arnysch@gmx.net> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 03 7月, 2012 3 次提交
-
-
由 NeilBrown 提交于
The value returned by "mddev_check_plug" is only valid until the next 'schedule' as that will unplug things. This could happen at any call to mempool_alloc. So just calling mddev_check_plug at the start doesn't really make sense. So call it just before, or just after, queuing things for the thread. As the action that happens at unplug is to wake the thread, this makes lots of sense. If we cannot add a plug (which requires a small GFP_ATOMIC alloc) we wake thread immediately. RAID5 is a bit different. Requests are queued for the thread and the thread is woken by release_stripe. So we don't need to wake the thread on failure. However the thread doesn't perform certain actions when there is any active plug, so it is important to install a plug before waking the thread. So for RAID5 we install the plug *before* queuing the request and waking the thread. Without this patch it is possible for raid1 or raid10 to queue a request without then waking the thread, resulting in the array locking up. Also change raid10 to only flush_pending_write when there are not active plugs, just like raid1. This patch is suitable for 3.0 or later. I plan to submit it to -stable, but I'll like to let it spend a few weeks in mainline first to be sure it is completely safe. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
When we added hot_replace we doubled the number of devices that could be in a RAID1 array. So we doubled how far read_balance would search. Unfortunately we didn't double the point at which it looped back to the beginning - so it effectively loops over all non-replacement disks twice. This doesn't cause bad behaviour, but it pointless and means we never read from replacement devices. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
Having the 'name' arg optional and defaulting to the current personality name is no necessary and leads to errors, as when changing the level of an array we can end up using the name of the old level instead of the new one. So make it non-optional and always explicitly pass the name of the level that the array will be. Reported-by: Nmajianpeng <majianpeng@gmail.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 31 5月, 2012 1 次提交
-
-
由 NeilBrown 提交于
The new merge_bvec_fn which calls the corresponding function in subsidiary devices requires that mddev->merge_check_needed be set if any child has a merge_bvec_fn. However were were only setting that when a device was hot-added, not when a device was present from the start. This bug was introduced in 3.4 so patch is suitable for 3.4.y kernels. However that are conflicts in raid10.c so a separate patch will be needed for 3.4.y. Cc: stable@vger.kernel.org Reported-by: NSebastian Riemer <sebastian.riemer@profitbricks.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 22 5月, 2012 3 次提交
-
-
由 Jonathan Brassow 提交于
A RAID1 device does not necessarily need a fullsync if the bitmap can be used instead. Similar to commit d6b212f4 in raid5.c, if a raid1 device can be brought back (i.e. from a transient failure) it shouldn't need a complete resync. Provided the bitmap is not to old, it will have recorded the areas of the disk that need recovery. Signed-off-by: NJonathan Brassow <jbrassow@redhat.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
Now that bitmaps can be resized, we can allow an array to be resized while the bitmap is present. This only covers resizing that involves changing the effective size of member devices, not resizing that changes the number of devices. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 majianpeng 提交于
When attempting to fix a read error, it is acceptable to read from a device that is recovering, provided the recovery has got past the place we are reading from. This makes the test for "can we read from here" the same as the test in read_balance. Signed-off-by: Nmajianpeng <majianpeng@gmail.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 21 5月, 2012 1 次提交
-
-
由 NeilBrown 提交于
When reshaping we can avoid costly intermediate backup by changing the 'start' address of the array on the device (if there is enough room). So as a first step, allow such a change to be requested through sysfs, and recorded in v1.x metadata. (As we didn't previous check that all 'pad' fields were zero, we need a new FEATURE flag for this. A (belatedly) check that all remaining 'pad' fields are zero to avoid a repeat of this) The new data offset must be requested separately for each device. This allows each to have a different change in the data offset. This is not likely to be used often but as data_offset can be set per-device, new_data_offset should be too. This patch also removes the 'acknowledged' arg to rdev_set_badblocks as it is never used and never will be. At the same time we add a new arg ('in_new') which is currently always zero but will be used more soon. When a reshape finishes we will need to update the data_offset and rdev->sectors. So provide an exported function to do that. Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 12 4月, 2012 1 次提交
-
-
由 majianpeng 提交于
If r1bio->sectors % 8 != 0,then the memcmp and a later memcpy will omit the last bio_vec. This is suitable for any stable kernel since 3.1 when bad-block management was introduced. Cc: stable@vger.kernel.org Signed-off-by: Nmajianpeng <majianpeng@gmail.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 03 4月, 2012 2 次提交
-
-
由 NeilBrown 提交于
When comparing two pages read from different legs of a mirror, only compare the bytes that were read, not the whole page. In most cases we read a whole page, but in some cases with bad blocks or odd sizes devices we might read fewer than that. This bug has been present "forever" but at worst it might cause a report of two many mismatches and generate a little bit extra resync IO, so there is no need to back-port to -stable kernels. Reported-by: Nmajianpeng <majianpeng@gmail.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 majianpeng 提交于
Because rde->nr_pending > 0,so can not remove this disk. And in any case, we aren't holding rcu_read_lock() Signed-off-by: Nmajianpeng <majianpeng@gmail.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 02 4月, 2012 1 次提交
-
-
由 majianpeng 提交于
Signed-off-by: Nmajianpeng <majianpeng@gmail.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 19 3月, 2012 3 次提交
-
-
由 NeilBrown 提交于
Currently we don't honour merge_bvec_fn in member devices so if there is one, we force all requests to be single-page at most. This is not ideal. So create a raid1 merge_bvec_fn to check that function in children as well. This introduces a small problem. There is no locking around calls the ->merge_bvec_fn and subsequent calls to ->make_request. So a device added between these could end up getting a request which violates its merge_bvec_fn. Currently the best we can do is synchronize_sched(). This will work providing no preemption happens. If there is is preemption, we just have to hope that new devices are largely consistent with old devices. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
md.h has an 'rdev_for_each()' macro for iterating the rdevs in an mddev. However it uses the 'safe' version of list_for_each_entry, and so requires the extra variable, but doesn't include 'safe' in the name, which is useful documentation. Consequently some places use this safe version without needing it, and many use an explicity list_for_each entry. So: - rename rdev_for_each to rdev_for_each_safe - create a new rdev_for_each which uses the plain list_for_each_entry, - use the 'safe' version only where needed, and convert all other list_for_each_entry calls to use rdev_for_each. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
If RAID1 or RAID10 is used under LVM or some other stacking block device, it is possible to enter a deadlock during resync or recovery. This can happen if the upper level block device creates two requests to the RAID1 or RAID10. The first request gets processed, blocks recovery and queue requests for underlying requests in current->bio_list. A resync request then starts which will wait for those requests and block new IO. But then the second request to the RAID1/10 will be attempted and it cannot progress until the resync request completes, which cannot progress until the underlying device requests complete, which are on a queue behind that second request. So allow that second request to proceed even though there is a resync request about to start. This is suitable for any -stable kernel. Cc: stable@vger.kernel.org Reported-by: NRay Morris <support@bettercgi.com> Tested-by: NRay Morris <support@bettercgi.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 13 2月, 2012 1 次提交
-
-
由 NeilBrown 提交于
Since we added 'replacement' capability, RAID1 can have twice as many devices as ->raid_disks indicates. So md_raid1_congested needs to check that many possible devices, not just ->raid_disks many. Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 11 1月, 2012 1 次提交
-
-
由 NeilBrown 提交于
We normally try to avoid reading from write-mostly devices, but when we do we really have to check for bad blocks and be sure not to try reading them. With the current code, best_good_sectors might not get set and that causes zero-length read requests to be send down which is very confusing. This bug was introduced in commit d2eb35ac and so the patch is suitable for 3.1.x and 3.2.x Reported-and-tested-by: NMichał Mirosław <mirq-linux@rere.qmqm.pl> Reported-and-tested-by: NArt -kwaak- van Breemen <ard@telegraafnet.nl> Signed-off-by: NNeilBrown <neilb@suse.de> Cc: stable@vger.kernel.org
-
- 23 12月, 2011 8 次提交
-
-
由 NeilBrown 提交于
Now that WantReplacement drives are replaced cleanly, mark a drive as want_replacement when we see a write error. It might get failed soon so the WantReplacement flag is irrelevant, but if the write error is recorded in the bad block log, we still want to activate any spare that might be available. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
When attempting to add a spare to a RAID1 array, also consider adding it as a replacement for a want_replacement device. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
If a Replacement is seen, file it as such. If we see two replacements (or two normal devices) for the one slot, abort. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
When recovery completes ->spare_active is called. This checks if the replacement is ready and if so it fails the original. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
Replacement devices are stored at a different offset, so look there too. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
In RAID1, a replacement is much like a normal device, so we just double the size of the relevant arrays and look at all possible devices for reads and writes. This means that the array looks like it is now double the size in some way - we need to be careful about that. In particular, we checking if the array is still degraded while creating a recovery request we need to only consider the first 'half' - i.e. the real (non-replacement) devices. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
In general mddev->raid_disks can change unexpectedly while conf->raid_disks will only change in a very controlled way. So change some uses of one to the other. The use of mddev->raid_disks will not cause actually problems but this way is more consistent and safer in the long term. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
Soon an array will be able to have multiple devices with the same raid_disk number (an original and a replacement). So removing a device based on the number won't work. So pass the actual device handle instead. Reviewed-by: NDan Williams <dan.j.williams@intel.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 01 11月, 2011 1 次提交
-
-
由 Paul Gortmaker 提交于
A pending cleanup will mean that module.h won't be implicitly everywhere anymore. Make sure the modular drivers in md dir are actually calling out for <module.h> explicitly in advance. Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
-
- 26 10月, 2011 1 次提交
-
-
由 NeilBrown 提交于
In 3.0 we changed the way recovery_disabled was handle so that instead of testing against zero, we test an mddev-> value against a conf-> value. Two problems: 1/ one place in raid1 was missed and still sets to '1'. 2/ We didn't explicitly set the conf-> value at array creation time. It defaulted to '0' just like the mddev value does so they could appear equal and thus disable recovery. This did not affect normal 'md' as it calls bind_rdev_to_array which changes the mddev value. However the dmraid interface doesn't call this and so doesn't change ->recovery_disabled; so at array start all recovery is incorrectly disabled. So initialise the 'conf' value to one less that the mddev value, so the will only be the same when explicitly set that way. Reported-by: NJonathan Brassow <jbrassow@redhat.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 24 10月, 2011 1 次提交
-
-
由 Tao Ma 提交于
bio originally has the functionality to set the complete cpu, but it is broken. Chirstoph said that "This code is unused, and from the all the discussions lately pretty obviously broken. The only thing keeping it serves is creating more confusion and possibly more bugs." And Jens replied with "We can kill bio_set_completion_cpu(). I'm fine with leaving cpu control to the request based drivers, they are the only ones that can toggle the setting anyway". So this patch tries to remove all the work of controling complete cpu from a bio. Cc: Shaohua Li <shaohua.li@intel.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: NTao Ma <boyu.mt@taobao.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 11 10月, 2011 1 次提交
-
-
由 NeilBrown 提交于
RAID1 and RAID10 handle write requests by queuing them for handling by a separate thread. This is because when a write-intent-bitmap is active we might need to update the bitmap first, so it is good to queue a lot of writes, then do one big bitmap update for them all. However writeback request devices to appear to be congested after a while so it can make some guesstimate of throughput. The infinite queue defeats that (note that RAID5 has already has a finite queue so it doesn't suffer from this problem). So impose a limit on the number of pending write requests. By default it is 1024 which seems to be generally suitable. Make it configurable via module option just in case someone finds a regression. Signed-off-by: NNeilBrown <neilb@suse.de>
-