- 26 2月, 2013 9 次提交
-
-
由 NeilBrown 提交于
When raid1/raid10 needs to fix a read error, it first drains all pending requests by calling freeze_array(). This calls flush_pending_writes() if it needs to sleep, but some writes may be pending in a per-process plug rather than in the per-array request queue. When raid1{,0}_unplug() moves the request from the per-process plug to the per-array request queue (from which flush_pending_writes() can flush them), it needs to wake up freeze_array(), or freeze_array() will never flush them and so it will block forever. So add the requires wake_up() calls. This bug was introduced by commit f54a9d0e for raid1 and a similar commit for RAID10, and so has been present since linux-3.6. As the bug causes a deadlock I believe this fix is suitable for -stable. Cc: stable@vger.kernel.org (3.6.y 3.7.y 3.8.y) Reported-by: NTregaron Bayly <tbayly@bluehost.com> Tested-by: NTregaron Bayly <tbayly@bluehost.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
Mentioning "bad disk number -1" exposes irrelevant internal detail. Just say they are inactive and must be removed. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
Create_stripe_zones returns an error slightly differently to raid0_run and to raid0_takeover_*. The error returned used by the second was wrong and an error would result in mddev->private being set to NULL and sooner or later a crash. So never return NULL, return ERR_PTR(err), not NULL from create_stripe_zones. This bug has been present since 2.6.35 so the fix is suitable for any kernel since then. Cc: stable@vger.kernel.org Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
You cannot resize a RAID0 array (in terms of making the devices bigger), but the code doesn't entirely stop you. So: disable setting of the available size on each device for RAID0 and Linear devices. This must not change as doing so can change the effective layout of data. Make sure that the size that raid0_size() reports is accurate, but rounding devices sizes to chunk sizes. As the device sizes cannot change now, this isn't so important, but it is best to be safe. Without this change: mdadm --grow /dev/md0 -z max mdadm --grow /dev/md0 -Z max then read to the end of the array can cause a BUG in a RAID0 array. These bugs have been present ever since it became possible to resize any device, which is a long time. So the fix is suitable for any -stable kerenl. Cc: stable@vger.kernel.org Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 Jonathan Brassow 提交于
DM RAID: Add support for MD's RAID10 "far" and "offset" algorithms Until now, dm-raid.c only supported the "near" algorthm of MD's RAID10 implementation. This patch adds support for the "far" and "offset" algorithms, but only with the improved redundancy that is brought with the introduction of the 'use_far_sets' bit, which shifts copied stripes according to smaller sets vs the entire array. That is, the 17th bit of the 'layout' variable that defines the RAID10 implementation will always be set. (More information on how the 'layout' variable selects the RAID10 algorithm can be found in the opening comments of drivers/md/raid10.c.) Signed-off-by: NJonathan Brassow <jbrassow@redhat.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 Jonathan Brassow 提交于
MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 2) This patch addresses raid arrays that have a number of devices that cannot be evenly divided by 'far_copies'. (E.g. 5 devices, far_copies = 2) This case must be handled differently because it causes that last set to be of a different size than the rest of the sets. We must compute a new modulo for this last set so that copied chunks are properly wrapped around. Example use_far_sets=1, far_copies=2, near_copies=1, devices=5: "far" algorithm dev1 dev2 dev3 dev4 dev5 ==== ==== ==== ==== ==== [ A B ] [ C D E ] [ G H ] [ I J K ] ... [ B A ] [ E C D ] --> nominal set of 2 and last set of 3 [ H G ] [ K I J ] []'s show far/offset sets Signed-off-by: NJonathan Brassow <jbrassow@redhat.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 Jonathan Brassow 提交于
The MD RAID10 'far' and 'offset' algorithms make copies of entire stripe widths - copying them to a different location on the same devices after shifting the stripe. An example layout of each follows below: "far" algorithm dev1 dev2 dev3 dev4 dev5 dev6 ==== ==== ==== ==== ==== ==== A B C D E F G H I J K L ... F A B C D E --> Copy of stripe0, but shifted by 1 L G H I J K ... "offset" algorithm dev1 dev2 dev3 dev4 dev5 dev6 ==== ==== ==== ==== ==== ==== A B C D E F F A B C D E --> Copy of stripe0, but shifted by 1 G H I J K L L G H I J K ... Redundancy for these algorithms is gained by shifting the copied stripes one device to the right. This patch proposes that array be divided into sets of adjacent devices and when the stripe copies are shifted, they wrap on set boundaries rather than the array size boundary. That is, for the purposes of shifting, the copies are confined to their sets within the array. The sets are 'near_copies * far_copies' in size. The above "far" algorithm example would change to: "far" algorithm dev1 dev2 dev3 dev4 dev5 dev6 ==== ==== ==== ==== ==== ==== A B C D E F G H I J K L ... B A D C F E --> Copy of stripe0, shifted 1, 2-dev sets H G J I L K Dev sets are 1-2, 3-4, 5-6 ... This has the affect of improving the redundancy of the array. We can always sustain at least one failure, but sometimes more than one can be handled. In the first examples, the pairs of devices that CANNOT fail together are: (1,2) (2,3) (3,4) (4,5) (5,6) (1, 6) [40% of possible pairs] In the example where the copies are confined to sets, the pairs of devices that cannot fail together are: (1,2) (3,4) (5,6) [20% of possible pairs] We cannot simply replace the old algorithms, so the 17th bit of the 'layout' variable is used to indicate whether we use the old or new method of computing the shift. (This is similar to the way the 16th bit indicates whether the "far" algorithm or the "offset" algorithm is being used.) This patch only handles the cases where the number of total raid disks is a multiple of 'far_copies'. A follow-on patch addresses the condition where this is not true. Signed-off-by: NJonathan Brassow <jbrassow@redhat.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 Jonathan Brassow 提交于
Changes include assigning 'addr' from 's' instead of 'sector' to be consistent with the way the code does it just a few lines later and using '%=' vs a conditional and subtraction. Signed-off-by: NJonathan Brassow <jbrassow@redhat.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 Joe Lawrence 提交于
Set mddev queue's max_write_same_sectors to its chunk_sector value (before disk_stack_limits merges the underlying disk limits.) With that in place, be sure to handle writes coming down from the block layer that have the REQ_WRITE_SAME flag set. That flag needs to be copied into any newly cloned write bio. Signed-off-by: NJoe Lawrence <joe.lawrence@stratus.com> Acked-by: N"Martin K. Petersen" <martin.petersen@oracle.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 21 2月, 2013 1 次提交
-
-
由 Sebastian Riemer 提交于
If an fsync occurs on a read-only array, we need to send a completion for the IO and may not increment the active IO count. Otherwise, we hit a bug trace and can't stop the MD array anymore. By advice of Christoph Hellwig we return success upon a flush request but we return -EROFS for other writes. We detect flush requests by checking if the bio has zero sectors. This patch is suitable to any -stable kernel to which it applies. Cc: Christoph Hellwig <hch@infradead.org> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: NeilBrown <neilb@suse.de> Cc: stable@vger.kernel.org Signed-off-by: NSebastian Riemer <sebastian.riemer@profitbricks.com> Reported-by: NBen Hutchings <ben@decadent.org.uk> Acked-by: NPaul Menzel <paulepanter@users.sourceforge.net> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 31 1月, 2013 2 次提交
-
-
由 Alasdair G Kergon 提交于
When processing write same requests, fix dm to send the configured number of WRITE SAME requests to the target rather than the number of discards, which is not always the same. Device-mapper WRITE SAME support was introduced by commit 23508a96 ("dm: add WRITE SAME support"). Signed-off-by: NAlasdair G Kergon <agk@redhat.com> Acked-by: NMike Snitzer <snitzer@redhat.com>
-
由 Mike Snitzer 提交于
thin_io_hints() is blindly copying the queue limits from the thin-pool which can lead to incorrect limits being set. The fix here simply deletes the thin_io_hints() hook which leaves the existing stacking infrastructure to set the limits correctly. When a thin-pool uses an MD device for the data device a thin device from the thin-pool must respect MD's constraints about disallowing a bio from spanning multiple chunks. Otherwise we can see problems. If the raid0 chunksize is 1152K and thin-pool chunksize is 256K I see the following md/raid0 error (with extra debug tracing added to thin_endio) when mkfs.xfs is executed against the thin device: md/raid0:md99: make_request bug: can't convert block across chunks or bigger than 1152k 6688 127 device-mapper: thin: bio sector=2080 err=-5 bi_size=130560 bi_rw=17 bi_vcnt=32 bi_idx=0 This extra DM debugging shows that the failing bio is spanning across the first and second logical 1152K chunk (sector 2080 + 255 takes the bio beyond the first chunk's boundary of sector 2304). So the bio splitting that DM is doing clearly isn't respecting the MD limits. max_hw_sectors_kb is 127 for both the thin-pool and thin device (queue_max_hw_sectors returns 255 so we'll excuse sysfs's lack of precision). So this explains why bi_size is 130560. But the thin device's max_hw_sectors_kb should be 4 (PAGE_SIZE) given that it doesn't have a .merge function (for bio_add_page to consult indirectly via dm_merge_bvec) yet the thin-pool does sit above an MD device that has a compulsory merge_bvec_fn. This scenario is exactly why DM must resort to sending single PAGE_SIZE bios to the underlying layer. Some additional context for this is available in the header for commit 8cbeb67a ("dm: avoid unsupported spanning of md stripe boundaries"). Long story short, the reason a thin device doesn't properly get configured to have a max_hw_sectors_kb of 4 (PAGE_SIZE) is that thin_io_hints() is blindly copying the queue limits from the thin-pool device directly to the thin device's queue limits. Fix this by eliminating thin_io_hints. Doing so is safe because the block layer's queue limits stacking already enables the upper level thin device to inherit the thin-pool device's discard and minimum_io_size and optimal_io_size limits that get set in pool_io_hints. But avoiding the queue limits copy allows the thin and thin-pool limits to be different where it is important, namely max_hw_sectors_kb. Reported-by: NDaniel Browning <db@kavod.com> Signed-off-by: NMike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
- 24 1月, 2013 1 次提交
-
-
由 Jonathan Brassow 提交于
Before attempting to activate a RAID array, it is checked for sufficient redundancy. That is, we make sure that there are not too many failed devices - or devices specified for rebuild - to undermine our ability to activate the array. The current code performs this check twice - once to ensure there were not too many devices specified for rebuild by the user ('validate_rebuild_devices') and again after possibly experiencing a failure to read the superblock ('analyse_superblocks'). Neither of these checks are sufficient. The first check is done properly but with insufficient information about the possible failure state of the devices to make a good determination if the array can be activated. The second check is simply done wrong in the case of RAID10 because it doesn't account for the independence of the stripes (i.e. mirror sets). The solution is to use the properly written check ('validate_rebuild_devices'), but perform the check after the superblocks have been read and we know which devices have failed. This gives us one check instead of two and performs it in a location where it can be done right. Only RAID10 was affected and it was affected in the following ways: - the code did not properly catch the condition where a user specified a device for rebuild that already had a failed device in the same mirror set. (This condition would, however, be caught at a deeper level in MD.) - the code triggers a false positive and denies activation when devices in independent mirror sets have failed - counting the failures as though they were all in the same set. The most likely place this error was introduced (or this patch should have been included) is in commit 4ec1e369 - first introduced in v3.7-rc1. Consequently this fix should also go in v3.7.y, however there is a small conflict on the .version in raid_target, so I'll submit a separate patch to -stable. Cc: stable@vger.kernel.org Signed-off-by: NJonathan Brassow <jbrassow@redhat.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 22 12月, 2012 27 次提交
-
-
由 Mike Snitzer 提交于
Rename stripe_map_discard to stripe_map_range and reuse it for WRITE SAME bio processing. Signed-off-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mikulas Patocka 提交于
This patch removes map_info from bio-based device mapper targets. map_info is still used for request-based targets. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mikulas Patocka 提交于
Eliminate struct map_info from dm-snap. map_info->ptr was used in dm-snap to indicate if the bio was tracked. If map_info->ptr was non-NULL, the bio was linked in tracked_chunk_hash. This patch removes the use of map_info->ptr. We determine if the bio was tracked based on hlist_unhashed(&c->node). If hlist_unhashed is true, the bio is not tracked, if it is false, the bio is tracked. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mikulas Patocka 提交于
This patch removes endio_hook_pool from dm-thin and uses per-bio data instead. This patch removes any use of map_info in preparation for the next patch that removes map_info from bio-based device mapper. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mikulas Patocka 提交于
Don't use map_info any more in dm-raid1. map_info was used for writes to hold the region number. For this purpose we add a new field dm_bio_details to dm_raid1_bio_record. map_info was used for reads to hold a pointer to dm_raid1_bio_record (if the pointer was non-NULL, bio details were saved; if the pointer was NULL, bio details were not saved). We use dm_raid1_bio_record.details->bi_bdev for this purpose. If bi_bdev is NULL, details were not saved, if bi_bdev is non-NULL, details were saved. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mikulas Patocka 提交于
Replace map_info with a per-bio structure "struct per_bio_data" in dm-flakey. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mikulas Patocka 提交于
Rename struct read_record to bio_record in dm-raid1. In the following patch, the structure will be used for both read and write bios, so rename it. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mikulas Patocka 提交于
This patch moves target_request_nr from map_info to dm_target_io and makes it accessible with dm_bio_get_target_request_nr. This patch is a preparation for the next patch that removes map_info. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mikulas Patocka 提交于
Replace tracked_chunk_pool with per_bio_data in dm-snap. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mikulas Patocka 提交于
Replace io_mempool with per_bio_data in dm-verity. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mikulas Patocka 提交于
Replace read_record_pool with per_bio_data in dm-raid1. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mikulas Patocka 提交于
Introduce a field per_bio_data_size in struct dm_target. Targets can set this field in the constructor. If a target sets this field to a non-zero value, "per_bio_data_size" bytes of auxiliary data are allocated for each bio submitted to the target. These data can be used for any purpose by the target and help us improve performance by removing some per-target mempools. Per-bio data is accessed with dm_per_bio_data. The argument data_size must be the same as the value per_bio_data_size in dm_target. If the target has a pointer to per_bio_data, it can get a pointer to the bio with dm_bio_from_per_bio_data() function (data_size must be the same as the value passed to dm_per_bio_data). Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mike Snitzer 提交于
Add WRITE SAME support to dm-io and make it accessible to dm_kcopyd_zero(). dm_kcopyd_zero() provides an asynchronous interface whereas the blkdev_issue_write_same() interface is synchronous. WRITE SAME is a SCSI command that can be leveraged for more efficient zeroing of a specified logical extent of a device which supports it. Only a single zeroed logical block is transfered to the target for each WRITE SAME and the target then writes that same block across the specified extent. The dm thin target uses this. Signed-off-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mike Snitzer 提交于
The linear target can already support WRITE SAME requests so signal this by setting num_write_same_requests to 1. Signed-off-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mike Snitzer 提交于
WRITE SAME bios have a payload that contain a single page. When cloning WRITE SAME bios DM has no need to modify the bi_io_vec attributes (and doing so would be detrimental). DM need only alter the start and end of the WRITE SAME bio accordingly. Rather than duplicate __clone_and_map_discard, factor out a common function that is also used by __clone_and_map_write_same. Signed-off-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mike Snitzer 提交于
Allow targets to opt in to WRITE SAME support by setting 'num_write_same_requests' in the dm_target structure. A dm device will only advertise WRITE SAME support if all its targets and all its underlying devices support it. Signed-off-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mikulas Patocka 提交于
If the parameter buffer is small enough, try to allocate it with kmalloc() rather than vmalloc(). vmalloc is noticeably slower than kmalloc because it has to manipulate page tables. In my tests, on PA-RISC this patch speeds up activation 13 times. On Opteron this patch speeds up activation by 5%. This patch introduces a new function free_params() to free the parameters and this uses new flags that record whether or not vmalloc() was used and whether or not the input buffer must be wiped after use. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mikulas Patocka 提交于
When allocating memory for the userspace ioctl data, set some appropriate GPF flags directly instead of using PF_MEMALLOC. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Joe Thornber 提交于
Improve space map error message when unable to allocate a new metadata block. Signed-off-by: NJoe Thornber <ejt@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mike Snitzer 提交于
Throttle all errors logged from the IO path by dm thin. Signed-off-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mike Snitzer 提交于
Nearly all of persistent-data is in the IO path so throttle error messages with DMERR_LIMIT to limit the amount logged when something has gone wrong. Signed-off-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mike Snitzer 提交于
Reinstate a useful error message when the block manager buffer validator fails. This was mistakenly eliminated when the block manager was converted to use dm-bufio. Signed-off-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Jonathan Brassow 提交于
If the user does not supply a bitmap region_size to the dm raid target, a reasonable size is computed automatically. If this is not a power of 2, the md code will report an error later. This patch catches the problem early and rounds the region_size to the next power of two. Signed-off-by: NJonathan Brassow <jbrassow@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Joe Thornber 提交于
Remove unused @data_block parameter from cell_defer. Change thin_bio_map to use many returns rather than setting a variable. Signed-off-by: NJoe Thornber <ejt@redhat.com> Signed-off-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Joe Thornber 提交于
Rename cell_defer_except() to cell_defer_no_holder() which describes its function more clearly. Signed-off-by: NJoe Thornber <ejt@redhat.com> Signed-off-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mikulas Patocka 提交于
track_chunk is always called with interrupts enabled. Consequently, we do not need to save and restore interrupt state in "flags" variable. This patch changes spin_lock_irqsave to spin_lock_irq and spin_unlock_irqrestore to spin_unlock_irq. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-
由 Mikulas Patocka 提交于
Use a defined macro DM_ENDIO_INCOMPLETE instead of a numeric constant. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
-