- 25 7月, 2013 1 次提交
-
-
由 NeilBrown 提交于
If a device in a RAID4/5/6 is being replaced while another is being recovered, then the writes to the replacement device currently don't happen, resulting in corruption when the replacement completes and the new drive takes over. This is because the replacement writes are only triggered when 's.replacing' is set and not when the similar 's.sync' is set (which is the case during resync and recovery - it means all devices need to be read). So schedule those writes when s.replacing is set as well. In this case we cannot use "STRIPE_INSYNC" to record that the replacement has happened as that is needed for recording that any parity calculation is complete. So introduce STRIPE_REPLACED to record if the replacement has happened. For safety we should also check that STRIPE_COMPUTE_RUN is not set. This has a similar effect to the "s.locked == 0" test. The latter ensure that now IO has been flagged but not started. The former checks if any parity calculation has been flagged by not started. We must wait for both of these to complete before triggering the 'replace'. Add a similar test to the subsequent check for "are we finished yet". This possibly isn't needed (is subsumed in the STRIPE_INSYNC test), but it makes it more obvious that the REPLACE will happen before we think we are finished. Finally if a NeedReplace device is not UPTODATE then that is an error. We really must trigger a warning. This bug was introduced in commit 9a3e1101 (md/raid5: detect and handle replacements during recovery.) which introduced replacement for raid5. That was in 3.3-rc3, so any stable kernel since then would benefit from this fix. Cc: stable@vger.kernel.org (3.3+) Reported-by: Nqindehua <13691222965@163.com> Tested-by: Nqindehua <qindehua@163.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 20 3月, 2013 2 次提交
-
-
由 Paul Bolle 提交于
Once instance of this Kconfig macro remained after commit 51acbcec ("md: remove CONFIG_MULTICORE_RAID456"). Remove that one too. And, while we're at it, also remove it from the defconfig files that carry it. Signed-off-by: NPaul Bolle <pebolle@tiscali.nl> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
A number of problems can occur due to races between resync/recovery and discard. - if sync_request calls handle_stripe() while a discard is happening on the stripe, it might call handle_stripe_clean_event before all of the individual discard requests have completed (so some devices are still locked, but not all). Since commit ca64cae9 md/raid5: Make sure we clear R5_Discard when discard is finished. this will cause R5_Discard to be cleared for the parity device, so handle_stripe_clean_event() will not be called when the other devices do become unlocked, so their ->written will not be cleared. This ultimately leads to a WARN_ON in init_stripe and a lock-up. - If handle_stripe_clean_event() does clear R5_UPTODATE at an awkward time for resync, it can lead to s->uptodate being less than disks in handle_parity_checks5(), which triggers a BUG (because it is one). So: - keep R5_Discard on the parity device until all other devices have completed their discard request - make sure we don't try to have a 'discard' and a 'sync' action at the same time. This involves a new stripe flag to we know when a 'discard' is happening, and the use of R5_Overlap on the parity disk so when a discard is wanted while a sync is active, so we know to wake up the discard at the appropriate time. Discard support for RAID5 was added in 3.7, so this is suitable for any -stable kernel since 3.7. Cc: stable@vger.kernel.org (v3.7+) Reported-by: NJes Sorensen <Jes.Sorensen@redhat.com> Tested-by: NJes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 11 10月, 2012 1 次提交
-
-
由 Shaohua Li 提交于
Discard for raid4/5/6 has limitation. If discard request size is small, we do discard for one disk, but we need calculate parity and write parity disk. To correctly calculate parity, zero_after_discard must be guaranteed. Even it's true, we need do discard for one disk but write another disks, which makes the parity disks wear out fast. This doesn't make sense. So an efficient discard for raid4/5/6 should discard all data disks and parity disks, which requires the write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size is smaller than chunk_size, such pattern is almost impossible in practice. So in this patch, I only handle the case that A's size equals to chunk_size. That is discard request should be aligned to stripe size and its size is multiple of stripe size. Since we can only handle request with specific alignment and size (or part of the request fitting stripes), we can't guarantee zero_after_discard even zero_after_discard is true in low level drives. The block layer doesn't send down correctly aligned requests even correct discard alignment is set, so I must filter out. For raid4/5/6 parity calculation, if data is 0, parity is 0. So if zero_after_discard is true for all disks, data is consistent after discard. Otherwise, data might be lost. Let's consider a scenario: discard a stripe, write data to one disk and write parity disk. The stripe could be still inconsistent till then depending on using data from other data disks or parity disks to calculate new parity. If the disk is broken, we can't restore it. So in this patch, we only enable discard support if all disks have zero_after_discard. If discard fails in one disk, we face the similar inconsistent issue above. The patch will make discard follow the same path as normal write request. If discard fails, a resync will be scheduled to make the data consistent. This isn't good to have extra writes, but data consistency is important. If a subsequent read/write request hits raid5 cache of a discarded stripe, the discarded dev page should have zero filled, so the data is consistent. This patch will always zero dev page for discarded request stripe. This isn't optimal because discard request doesn't need such payload. Next patch will avoid it. Signed-off-by: NShaohua Li <shli@fusionio.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 02 8月, 2012 1 次提交
-
-
由 Shaohua Li 提交于
make_request() does stripe release for every stripe and the stripe usually has count 1, which makes previous release_stripe() optimization not work. In my test, this release_stripe() becomes the heaviest pleace to take conf->device_lock after previous patches applied. Below patch makes stripe release batch. All the stripes will be released in unplug. The STRIPE_ON_UNPLUG_LIST bit is to protect concurrent access stripe lru. Signed-off-by: NShaohua Li <shli@fusionio.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 31 7月, 2012 1 次提交
-
-
由 majianpeng 提交于
Because bios will merge at block-layer,so bios-error may caused by other bio which be merged into to the same request. Using this flag,it will find exactly error-sector and not do redundant operation like re-write and re-read. V0->V1:Using REQ_FLUSH instead REQ_NOMERGE avoid bio merging at block layer. Signed-off-by: NJianpeng Ma <majianpeng@gmail.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 19 7月, 2012 1 次提交
-
-
由 Shaohua Li 提交于
Add a per-stripe lock to protect stripe specific data. The purpose is to reduce lock contention of conf->device_lock. stripe ->toread, ->towrite are protected by per-stripe lock. Accessing bio list of the stripe is always serialized by this lock, so adding bio to the lists (add_stripe_bio()) and removing bio from the lists (like ops_run_biofill()) not race. If bio in ->read, ->written ... list are not shared by multiple stripes, we don't need any lock to protect ->read, ->written, because STRIPE_ACTIVE will protect them. If the bio are shared, there are two protections: 1. bi_phys_segments acts as a reference count 2. traverse the list uses r5_next_bio, which makes traverse never access bio not belonging to the stripe Let's have an example: | stripe1 | stripe2 | stripe3 | ...bio1......|bio2|bio3|....bio4..... stripe2 has 4 bios, when it's finished, it will decrement bi_phys_segments for all bios, but only end_bio for bio2 and bio3. bio1->bi_next still points to bio2, but this doesn't matter. When stripe1 is finished, it will not touch bio2 because of r5_next_bio check. Next time stripe1 will end_bio for bio1 and stripe3 will end_bio bio4. before add_stripe_bio() addes a bio to a stripe, we already increament the bio bi_phys_segments, so don't worry other stripes release the bio. Signed-off-by: NShaohua Li <shli@fusionio.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 22 5月, 2012 1 次提交
-
-
由 Shaohua Li 提交于
REQ_SYNC is ignored in current raid5 code. Block layer does use it to do policy, for example ioscheduler. This patch adds it. Signed-off-by: NShaohua Li <shli@fusionio.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 21 5月, 2012 1 次提交
-
-
由 NeilBrown 提交于
The important issue here is incorporating the different in data_offset into calculations concerning when we might need to over-write data that is still thought to be valid. To this end we find the minimum offset difference across all devices and add that where appropriate. Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 23 12月, 2011 4 次提交
-
-
由 NeilBrown 提交于
During recovery we want to write to the replacement but not the original. So we have two new flags - R5_NeedReplace if this stripe has a replacement that needs to be written at some stage - R5_WantReplace if NeedReplace, and the data is available, and a 'sync' has been requested on this stripe. We also distinguish between 'sync and replace' which need to read all other devices, and 'replace' which only needs to read the devices being replaced. Note that during resync we always write to any replacement device. It might not need to be written to, but as we don't read to compare, we have to write to be sure. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
When writing, we need to submit two writes, one to the original, and one to the replacement - if there is a replacement. If the write to the replacement results in a write error, we just fail the device. We only try to record write errors to the original. When writing for recovery, we shouldn't write to the original. This will be addressed in a subsequent patch that generally addresses recovery. Reviewed-by: NDan Williams <dan.j.williams@intel.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
Remove some #defines that are no longer used, and replace some others with an enum. And remove an unused field. Reviewed-by: NDan Williams <dan.j.williams@intel.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
Just enhance data structures to record a second device per slot to be used as a 'replacement' device, replacing the original. We also have a second bio in each slot in each stripe_head. This will only be used when writing to the array - we need to write to both the original and the replacement at the same time, so will need two bios. For now, only try using the replacement drive for aligned-reads. In this case, we prefer the replacement if it has been recovered far enough, otherwise use the original. This includes a small enhancement. Previously we would only do aligned reads if the target device was fully recovered. Now we also do them if it has recovered far enough. Reviewed-by: NDan Williams <dan.j.williams@intel.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 11 10月, 2011 4 次提交
-
-
由 NeilBrown 提交于
Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
Having mddev_t and 'struct mddev_s' is ugly and not preferred Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
The typedefs are just annoying. 'mdk' probably refers to 'md_k.h' which used to be an include file that defined this thing. Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 28 7月, 2011 3 次提交
-
-
由 NeilBrown 提交于
On a successful write to a known bad block, flag the sh so that raid5d can remove the known bad block from the list. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
When a write error is detected, don't mark the device as failed immediately but rather record the fact for handle_stripe to deal with. Handle_stripe then attempts to record a bad block. Only if that fails does the device get marked as faulty. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
If we get an uncorrectable read error - record a bad block rather than failing the device. And if these errors (which may be due to known bad blocks) cause recovery to be impossible, record a bad block on the recovering devices, or abort the recovery. As we might abort a recovery without failing a device we need to teach RAID5 about recovery_disabled handling. Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 26 7月, 2011 4 次提交
-
-
由 NeilBrown 提交于
Adding these three fields will allow more common code to be moved to handle_stripe() struct field rearrangement by Namhyung Kim. Signed-off-by: NNeilBrown <neilb@suse.de> Reviewed-by: NNamhyung Kim <namhyung@gmail.com>
-
由 NeilBrown 提交于
'struct stripe_head_state' stores state about the 'current' stripe that is passed around while handling the stripe. For RAID6 there is an extension structure: r6_state, which is also passed around. There is no value in keeping these separate, so move the fields from the latter into the former. This means that all code now needs to treat s->failed_num as an small array, but this is a small cost. Signed-off-by: NNeilBrown <neilb@suse.de> Reviewed-by: NNamhyung Kim <namhyung@gmail.com>
-
由 NeilBrown 提交于
sh->lock is now mainly used to ensure that two threads aren't running in the locked part of handle_stripe[56] at the same time. That can more neatly be achieved with an 'active' flag which we set while running handle_stripe. If we find the flag is set, we simply requeue the stripe for later by setting STRIPE_HANDLE. For safety we take ->device_lock while examining the state of the stripe and creating a summary in 'stripe_head_state / r6_state'. This possibly isn't needed but as shared fields like ->toread, ->towrite are checked it is safer for now at least. We leave the label after the old 'unlock' called "unlock" because it will disappear in a few patches, so renaming seems pointless. This leaves the stripe 'locked' for longer as we clear STRIPE_ACTIVE later, but that is not a problem. Signed-off-by: NNeilBrown <neilb@suse.de> Reviewed-by: NNamhyung Kim <namhyung@gmail.com>
-
由 NeilBrown 提交于
This is the start of a series of patches to remove sh->lock. sync_request takes sh->lock before setting STRIPE_SYNCING to ensure there is no race with testing it in handle_stripe[56]. Instead, use a new flag STRIPE_SYNC_REQUESTED and test it early in handle_stripe[56] (after getting the same lock) and perform the same set/clear operations if it was set. Signed-off-by: NNeilBrown <neilb@suse.de> Reviewed-by: NNamhyung Kim <namhyung@gmail.com>
-
- 18 4月, 2011 1 次提交
-
-
由 NeilBrown 提交于
md has some plugging infrastructure for RAID5 to use because the normal plugging infrastructure required a 'request_queue', and when called from dm, RAID5 doesn't have one of those available. This relied on the ->unplug_fn callback which doesn't exist any more. So remove all of that code, both in md and raid5. Subsequent patches with restore the plugging functionality. Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 10 3月, 2011 1 次提交
-
-
由 Jens Axboe 提交于
Code has been converted over to the new explicit on-stack plugging, and delay users have been converted to use the new API for that. So lets kill off the old plugging along with aops->sync_page(). Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
-
- 10 9月, 2010 1 次提交
-
-
由 Tejun Heo 提交于
This patch converts md to support REQ_FLUSH/FUA instead of now deprecated REQ_HARDBARRIER. In the core part (md.c), the following changes are notable. * Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with processing of other requests and thus there is no reason to mark the queue congested while FLUSH/FUA is in progress. * REQ_FLUSH/FUA failures are final and its users don't need retry logic. Retry logic is removed. * Preflush needs to be issued to all member devices but FUA writes can be handled the same way as other writes - their processing can be deferred to request_queue of member devices. md_barrier_request() is renamed to md_flush_request() and simplified accordingly. For linear, raid0 and multipath, the core changes are enough. raid1, 5 and 10 need the following conversions. * raid1: Handling of FLUSH/FUA bio's can simply be deferred to request_queues of member devices. Barrier related logic removed. * raid5: Queue draining logic dropped. FUA bit is propagated through biodrain and stripe resconstruction such that all the updated parts of the stripe are written out with FUA writes if any of the dirtying writes was FUA. preread_active_stripes handling in make_request() is updated as suggested by Neil Brown. * raid10: FUA bit needs to be propagated to write clones. linear, raid0, 1, 5 and 10 tested. Signed-off-by: NTejun Heo <tj@kernel.org> Reviewed-by: NNeil Brown <neilb@suse.de> Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
-
- 26 7月, 2010 4 次提交
-
-
由 NeilBrown 提交于
Also remove remaining accesses to ->queue and ->gendisk when ->queue is NULL (As it is in a DM target). Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
md/raid5 uses the plugging infrastructure provided by the block layer and 'struct request_queue'. However when we plug raid5 under dm there is no request queue so we cannot use that. So create a similar infrastructure that is much lighter weight and use it for raid5. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
the dm module will need this for dm-raid45. Also only access ->queue->backing_dev_info->congested_fn if ->queue actually exists. It won't in a dm target. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
We will shortly allow md devices with no gendisk (they are attached to a dm-target instead). That will cause mdname() to return 'mdX'. There is one place where mdname really needs to be unique: when creating the name for a slab cache. So in that case, if there is no gendisk, you the address of the mddev formatted in HEX to provide a unique name. Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 21 7月, 2010 1 次提交
-
-
由 NeilBrown 提交于
Separate the actual 'change' code from the sysfs interface so that it can eventually be called internally. Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 17 2月, 2010 1 次提交
-
-
由 Tejun Heo 提交于
Add __percpu sparse annotations to places which didn't make it in one of the previous patches. All converions are trivial. These annotations are to make sparse consider percpu variables to be in a different address space and warn if accessed without going through percpu accessors. This patch doesn't affect normal builds. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NBorislav Petkov <borislav.petkov@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Len Brown <lenb@kernel.org> Cc: Neil Brown <neilb@suse.de>
-
- 16 10月, 2009 2 次提交
-
-
由 NeilBrown 提交于
Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 Dan Williams 提交于
The percpu conversion allowed a straightforward handoff of stripe processing to the async subsytem that initially showed some modest gains (+4%). However, this model is too simplistic and leads to stripes bouncing between raid5d and the async thread pool for every invocation of handle_stripe(). As reported by Holger this can fall into a pathological situation severely impacting throughput (6x performance loss). By downleveling the parallelism to raid_run_ops the pathological stripe_head bouncing is eliminated. This version still exhibits an average 11% throughput loss for: mdadm --create /dev/md0 /dev/sd[b-q] -n 16 -l 6 echo 1024 > /sys/block/md0/md/stripe_cache_size dd if=/dev/zero of=/dev/md0 bs=1024k count=2048 ...but the results are at least stable and can be used as a base for further multicore experimentation. Reported-by: NHolger Kiehl <Holger.Kiehl@dwd.de> Signed-off-by: NDan Williams <dan.j.williams@intel.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 30 8月, 2009 4 次提交
-
-
由 Dan Williams 提交于
[ Based on an original patch by Yuri Tikhonov ] The raid_run_ops routine uses the asynchronous offload api and the stripe_operations member of a stripe_head to carry out xor+pq+copy operations asynchronously, outside the lock. The operations performed by RAID-6 are the same as in the RAID-5 case except for no support of STRIPE_OP_PREXOR operations. All the others are supported: STRIPE_OP_BIOFILL - copy data into request buffers to satisfy a read request STRIPE_OP_COMPUTE_BLK - generate missing blocks (1 or 2) in the cache from the other blocks STRIPE_OP_BIODRAIN - copy data out of request buffers to satisfy a write request STRIPE_OP_RECONSTRUCT - recalculate parity for new data that has entered the cache STRIPE_OP_CHECK - verify that the parity is correct The flow is the same as in the RAID-5 case, and reuses some routines, namely: 1/ ops_complete_postxor (renamed to ops_complete_reconstruct) 2/ ops_complete_compute (updated to set up to 2 targets uptodate) 3/ ops_run_check (renamed to ops_run_check_p for xor parity checks) [neilb@suse.de: fixes to get it to pass mdadm regression suite] Reviewed-by: NAndre Noll <maan@systemlinux.org> Signed-off-by: NYuri Tikhonov <yur@emcraft.com> Signed-off-by: NIlya Yanok <yanok@emcraft.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Dan Williams 提交于
Replace the flat zero_sum_result with a collection of flags to contain the P (xor) zero-sum result, and the soon to be utilized Q (raid6 reed solomon syndrome) zero-sum result. Use the SUM_CHECK_ namespace instead of DMA_ since these flags will be used on non-dma-zero-sum enabled platforms. Reviewed-by: NAndre Noll <maan@systemlinux.org> Acked-by: NMaciej Sosnowski <maciej.sosnowski@intel.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Dan Williams 提交于
Use percpu memory rather than stack for storing the buffer lists used in parity calculations. Include space for dma address conversions and pass that to async_tx via the async_submit_ctl.scribble pointer. [ Impact: move memory pressure from stack to heap ] Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Dan Williams 提交于
In preparation for asynchronous handling of raid6 operations move the spare page to a percpu allocation to allow multiple simultaneous synchronous raid6 recovery operations. Make this allocation cpu hotplug aware to maximize allocation efficiency. Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
- 18 6月, 2009 1 次提交
-
-
由 Andre Noll 提交于
This kills some more shifts. Signed-off-by: NAndre Noll <maan@systemlinux.org> Signed-off-by: NNeilBrown <neilb@suse.de>
-