- 09 9月, 2009 1 次提交
-
-
由 Dan Williams 提交于
Some engines optimize operation by reading ahead in the descriptor chain such that descriptor2 may start execution before descriptor1 completes. If descriptor2 depends on the result from descriptor1 then a fence is required (on descriptor2) to disable this optimization. The async_tx api could implicitly identify dependencies via the 'depend_tx' parameter, but that would constrain cases where the dependency chain only specifies a completion order rather than a data dependency. So, provide an ASYNC_TX_FENCE to explicitly identify data dependencies. Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
- 30 8月, 2009 13 次提交
-
-
由 Dan Williams 提交于
Now that the resources to handle stripe_head operations are allocated percpu it is possible for raid5d to distribute stripe handling over multiple cores. This conversion also adds a call to cond_resched() in the non-multicore case to prevent one core from getting monopolized for raid operations. Cc: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Yuri Tikhonov 提交于
These routines have been replaced by there asynchronous counterparts. Signed-off-by: NYuri Tikhonov <yur@emcraft.com> Signed-off-by: NIlya Yanok <yanok@emcraft.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Yuri Tikhonov 提交于
1/ Use STRIPE_OP_BIOFILL to offload completion of read requests to raid_run_ops 2/ Implement a handler for sh->reconstruct_state similar to the raid5 case (adds handling of Q parity) 3/ Prevent handle_parity_checks6 from running concurrently with 'compute' operations 4/ Hook up raid_run_ops Signed-off-by: NYuri Tikhonov <yur@emcraft.com> Signed-off-by: NIlya Yanok <yanok@emcraft.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Dan Williams 提交于
[ Based on an original patch by Yuri Tikhonov ] Implement the state machine for handling the RAID-6 parities check and repair functionality. Note that the raid6 case does not need to check for new failures, like raid5, as it will always writeback the correct disks. The raid5 case can be updated to check zero_sum_result to avoid getting confused by new failures rather than retrying the entire check operation. Signed-off-by: NYuri Tikhonov <yur@emcraft.com> Signed-off-by: NIlya Yanok <yanok@emcraft.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Yuri Tikhonov 提交于
In the synchronous implementation of stripe dirtying we processed a degraded stripe with one call to handle_stripe_dirtying6(). I.e. compute the missing blocks from the other drives, then copy in the new data and reconstruct the parities. In the asynchronous case we do not perform stripe operations directly. Instead, operations are scheduled with flags to be later serviced by raid_run_ops. So, for the degraded case the final reconstruction step can only be carried out after all blocks have been brought up to date by being read, or computed. Like the raid5 case schedule_reconstruction() sets STRIPE_OP_RECONSTRUCT to request a parity generation pass and through operation chaining can handle compute and reconstruct in a single raid_run_ops pass. [dan.j.williams@intel.com: fixup handle_stripe_dirtying6 gating] Signed-off-by: NYuri Tikhonov <yur@emcraft.com> Signed-off-by: NIlya Yanok <yanok@emcraft.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Yuri Tikhonov 提交于
Modify handle_stripe_fill6 to work asynchronously by introducing fetch_block6 as the raid6 analog of fetch_block5 (schedule compute operations for missing/out-of-sync disks). [dan.j.williams@intel.com: compute D+Q in one pass] Signed-off-by: NYuri Tikhonov <yur@emcraft.com> Signed-off-by: NIlya Yanok <yanok@emcraft.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Yuri Tikhonov 提交于
Extend schedule_reconstruction5 for reuse by the raid6 path. Add support for generating Q and BUG() if a request is made to perform 'prexor'. Signed-off-by: NYuri Tikhonov <yur@emcraft.com> Signed-off-by: NIlya Yanok <yanok@emcraft.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Dan Williams 提交于
[ Based on an original patch by Yuri Tikhonov ] The raid_run_ops routine uses the asynchronous offload api and the stripe_operations member of a stripe_head to carry out xor+pq+copy operations asynchronously, outside the lock. The operations performed by RAID-6 are the same as in the RAID-5 case except for no support of STRIPE_OP_PREXOR operations. All the others are supported: STRIPE_OP_BIOFILL - copy data into request buffers to satisfy a read request STRIPE_OP_COMPUTE_BLK - generate missing blocks (1 or 2) in the cache from the other blocks STRIPE_OP_BIODRAIN - copy data out of request buffers to satisfy a write request STRIPE_OP_RECONSTRUCT - recalculate parity for new data that has entered the cache STRIPE_OP_CHECK - verify that the parity is correct The flow is the same as in the RAID-5 case, and reuses some routines, namely: 1/ ops_complete_postxor (renamed to ops_complete_reconstruct) 2/ ops_complete_compute (updated to set up to 2 targets uptodate) 3/ ops_run_check (renamed to ops_run_check_p for xor parity checks) [neilb@suse.de: fixes to get it to pass mdadm regression suite] Reviewed-by: NAndre Noll <maan@systemlinux.org> Signed-off-by: NYuri Tikhonov <yur@emcraft.com> Signed-off-by: NIlya Yanok <yanok@emcraft.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Dan Williams 提交于
ops_complete_compute5 can be reused in the raid6 path if it is updated to generically handle a second target. Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Dan Williams 提交于
Port drivers/md/raid6test/test.c to use the async raid6 recovery routines. This is meant as a unit test for raid6 acceleration drivers. In addition to the 16-drive test case this implements tests for the 4-disk and 5-disk special cases (dma devices can not generically handle less than 2 sources), and adds a test for the D+Q case. Reviewed-by: NAndre Noll <maan@systemlinux.org> Acked-by: NMaciej Sosnowski <maciej.sosnowski@intel.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Dan Williams 提交于
Replace the flat zero_sum_result with a collection of flags to contain the P (xor) zero-sum result, and the soon to be utilized Q (raid6 reed solomon syndrome) zero-sum result. Use the SUM_CHECK_ namespace instead of DMA_ since these flags will be used on non-dma-zero-sum enabled platforms. Reviewed-by: NAndre Noll <maan@systemlinux.org> Acked-by: NMaciej Sosnowski <maciej.sosnowski@intel.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Dan Williams 提交于
Use percpu memory rather than stack for storing the buffer lists used in parity calculations. Include space for dma address conversions and pass that to async_tx via the async_submit_ctl.scribble pointer. [ Impact: move memory pressure from stack to heap ] Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Dan Williams 提交于
In preparation for asynchronous handling of raid6 operations move the spare page to a percpu allocation to allow multiple simultaneous synchronous raid6 recovery operations. Make this allocation cpu hotplug aware to maximize allocation efficiency. Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
- 15 7月, 2009 1 次提交
-
-
由 Dan Williams 提交于
Add missing call to safe_put_page from stop() by unifying open coded raid5_conf_t de-allocation under free_conf(). Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
- 09 6月, 2009 3 次提交
-
-
由 NeilBrown 提交于
Now that we support changing the chunksize, we calculate "reshape_sectors" to be the max of number of sectors in old and new chunk size. However there is one please where we still use 'chunksize' rather than 'reshape_sectors'. This causes a reshape that reduces the size of chunks to freeze. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
md has functionality to 'quiesce' and array so that all pending IO completed and no new IO starts. This is used to achieve a stable state before making internal changes. Currently this quiescing applies equally to normal IO, resync IO, and reshape IO. However there is a problem with applying it to reshape IO. Reshape can have multiple 'stripe_heads' that must be active together. If the quiesce come between allocating the first and the last of such a collection, then we deadlock, as the last will not be allocated until the quiesce is lifted, the quiesce will not be lifted until the first (which has been allocated) gets used, and that first cannot be used until the last is allocated. It is not necessary to inhibit reshape IO when a quiesce is requested. Those places in the code that require a full quiesce will ensure the reshape thread is not running at all. So allow reshape requests to get access to new stripe_heads without being blocked by a 'quiesce'. This only affects in-place reshapes (i.e. where the array does not grow or shrink) and these are only newly supported. So this patch is not needed in earlier kernels. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
mddev->raid_disks can be changed and any time by a request from user-space. It is a suggestion as to what number of raid_disks is desired. conf->raid_disks can only be changed by the raid5 module with suitable locks in place. It is a statement as to the current number of raid_disks. There are two places where the latter should be used, but the former is used. This can lead to a crash when reshaping an array. This patch changes to mddev-> to conf-> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 04 6月, 2009 2 次提交
-
-
由 Dan Williams 提交于
Prepare the api for the arrival of a new parameter, 'scribble'. This will allow callers to identify scratchpad memory for dma address or page address conversions. As this adds yet another parameter, take this opportunity to convert the common submission parameters (flags, dependency, callback, and callback argument) into an object that is passed by reference. Also, take this opportunity to fix up the kerneldoc and add notes about the relevant ASYNC_TX_* flags for each routine. [ Impact: moves api pass-by-value parameters to a pass-by-reference struct ] Signed-off-by: NAndre Noll <maan@systemlinux.org> Acked-by: NMaciej Sosnowski <maciej.sosnowski@intel.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Dan Williams 提交于
In support of inter-channel chaining async_tx utilizes an ack flag to gate whether a dependent operation can be chained to another. While the flag is not set the chain can be considered open for appending. Setting the ack flag closes the chain and flags the descriptor for garbage collection. The ASYNC_TX_DEP_ACK flag essentially means "close the chain after adding this dependency". Since each operation can only have one child the api now implicitly sets the ack flag at dependency submission time. This removes an unnecessary management burden from clients of the api. [ Impact: clean up and enforce one dependency per operation ] Reviewed-by: NAndre Noll <maan@systemlinux.org> Acked-by: NMaciej Sosnowski <maciej.sosnowski@intel.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
- 27 5月, 2009 1 次提交
-
-
由 NeilBrown 提交于
A recent patch to raid5.c use min on an int and a sector_t. This isn't allowed. So change it to min_t(sector_t,x,y). Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 26 5月, 2009 7 次提交
-
-
由 NeilBrown 提交于
md has no need for the BKL - it does its own locking. So md_ioctl doesn't need to be a locked_ioctl. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
In order for the metadata to always be consistent, we mustn't updated curr_resync_completed without also updating reshape_position. The reshape code updates both at the same time. However since commit 97e4f42d the common md_do_sync will sometimes update curr_resync_completed but is not in a position to update reshape_position. So if MD_RECOVERY_RESHAPE is set (indicating that a reshape is happening, so reshape_position might change), don't update curr_resync_completed in md_do_sync, leave it to the per-personality reshape code. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
As sector_t in unsigned, we cannot afford to let 'safepos' etc go negative. So replace a -= b; by a -= min(b,a); Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
The md resync engine has a 'frozen' state which ensures that no resync/recovery. This is used to avoid races. Export this state through the 'sync_action' sysfs attribute so that user-space can benefit and also avoid some races. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
The code for checking which bits in the bitmap can be cleared has 2 problems: 1/ it repeatedly takes and drops a spinlock, where it would make more sense to just hold on to it most of the time. 2/ it doesn't make use of some opportunities to skip large sections of the bitmap This patch fixes those. It will only affect CPU consumption, not correctness. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
Instead of always returns EINVAL if anything goes wrong when setting the array size, add the option of E2BIG if the size requested is too large. This makes it easier for user-space to be sure what went wrong. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
We previously didn't update these fields when writing the metadata because they could never change. They can now, so we better write them. v0.90 metadata always updated these fields. Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 07 5月, 2009 7 次提交
-
-
由 NeilBrown 提交于
md maintains link in sys/mdXX/md/ to identify which device has which role in the array. e.g. rd2 -> dev-sda indicates that the device with role '2' in the array is sda. These links are only present when the array is active. They are created immediately after ->run is called, and so should be removed immediately after ->stop is called. However they are currently removed a little bit later, and it is possible for ->run to be called again, thus adding these links, before they are removed. So move the removal earlier so they are consistently only present when the array is active. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
Being able to write 'clean' to an 'array_state' of an inactive array to activate it in 'clean' mode is both unnecessary and inconvenient. It is unnecessary because the same can be achieved by writing 'active'. This activates and array, but it still remains 'clean' until the first write. It is inconvenient because writing 'clean' is more often used to cause an 'active' array to revert to 'clean' mode (thus blocking any writes until a 'write-pending' is promoted to 'active'). Allowing 'clean' to both activate an array and mark an active array as clean can lead to races: One program writes 'clean' to mark the active array as clean at the same time as another program writes 'inactive' to deactivate (stop) and active array. Depending on which writes first, the array could be deactivated and immediately reactivated which isn't what was desired. So just disable the use of 'clean' to activate an array. This avoids a race that can be triggered with mdadm-3.0 and external metadata, so it suitable for -stable. Reported-by: NRafal Marszewski <rafal.marszewski@intel.com> Acked-by: NDan Williams <dan.j.williams@intel.com> Cc: <stable@kernel.org> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 Jan Engelhardt 提交于
Signed-off-by: NJan Engelhardt <jengelh@medozas.de> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
Two problems in status_resync. 1/ It still used Kilobytes as the basic block unit, while most code now uses sectors uniformly. 2/ It doesn't allow for the possibility that max_sectors exceeds the range of "unsigned long". So - change "max_blocks" to "max_sectors", and store sector numbers in there and in 'resync' - Make 'rt' a 'sector_t' so it can temporarily hold the number of remaining sectors. - use sector_div rather than normal division. - change the magic '100' used to preserve precision to '32'. + making it a power of 2 makes division easier + it doesn't need to be as large as it was chosen when we averaged speed over the entire run. Now we average speed over the last 30 seconds or so. Reported-by: N"Mario 'BitKoenig' Holbe" <Mario.Holbe@TU-Ilmenau.DE> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
If a write intent bitmap covers more than 2TB, we sometimes work with values beyond 32bit, so these need to be sector_t. This patches add the required casts to some unsigned longs that are being shifted up. This will affect any raid10 larger than 2TB, or any raid1/4/5/6 with member devices that are larger than 2TB. Signed-off-by: NNeilBrown <neilb@suse.de> Reported-by: N"Mario 'BitKoenig' Holbe" <Mario.Holbe@TU-Ilmenau.DE> Cc: stable@kernel.org
-
由 NeilBrown 提交于
If we have a raid10 with multiple missing devices, and we recover just one of these to a spare, then we risk (depending on the bitmap and array chunk size) clearing bits of the bitmap for which recovery isn't complete (because a device is still missing). This can lead to a subsequent "re-add" being recovered without any IO happening, which would result in loss of data. This patch takes the safe approach of not clearing bitmap bits if the array will still be degraded. This patch is suitable for all active -stable kernels. Cc: stable@kernel.org Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
When md is loading a bitmap which it knows is out of date, it fills each page with 1s and writes it back out again. However the write_page call makes used of bitmap->file_pages and bitmap->last_page_size which haven't been set correctly yet. So this can sometimes fail. Move the setting of file_pages and last_page_size to before the call to write_page. This bug can cause the assembly on an array to fail, thus making the data inaccessible. Hence I think it is a suitable candidate for -stable. Cc: stable@kernel.org Reported-by: NVojtech Pavlik <vojtech@suse.cz> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 20 4月, 2009 1 次提交
-
-
由 NeilBrown 提交于
.. and other arrays with components larger than 2 terabytes. We use a "long" rather than a "sector_t" in part of the bitmap size calculations, which is sad. Reported-by: N"Mario 'BitKoenig' Holbe" <Mario.Holbe@TU-Ilmenau.DE> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 17 4月, 2009 1 次提交
-
-
由 NeilBrown 提交于
There are circumstances when a user-space process might need to "oversee" a resync/reshape process. For example when doing an in-place reshape of a raid5, it is prudent to take a backup of each section before reshaping it as this is the only way to provide safety against an unplanned shutdown (i.e. crash/power failure). The sync_max sysfs value can be used to stop the resync from advancing beyond a particular point. So user-space can: suspend IO to the first section and back it up set 'sync_max' to the end of the section wait for 'sync_completed' to reach that point resume IO on the first section and move on to the next section. However this process requires the kernel and user-space to run in lock-step which could introduce unnecessary delays. It would be better if a 'double buffered' approach could be used with userspace and kernel space working on different sections with the 'next' section always ready when the 'current' section is finished. One problem with implementing this is that sync_completed is only guaranteed to be updated when the sync process reaches sync_max. (it is updated on a time basis at other times, but it is hard to rely on that). This defeats some of the double buffering. With this patch, sync_completed (and reshape_position) get updated as the current position approaches sync_max, so there is room for userspace to advance sync_max early without losing updates. To be precise, sync_completed is updated when the current sync position reaches half way between the current value of sync_completed and the value of sync_max. This will usually be a good time for user space to update sync_max. If sync_max does not get updated, the updates to sync_completed (together with associated metadata updates) will occur at an exponentially increasing frequency which will get unreasonably fast (one update every page) immediately before the process hits sync_max and stops. So the update rate will be unreasonably fast only for an insignificant period of time. Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 15 4月, 2009 1 次提交
-
-
由 Christoph Hellwig 提交于
It's used by DM and MD and generally useful, so move the bio list helpers into bio.h. Signed-off-by: NChristoph Hellwig <hch@lst.de> Acked-by: NAlasdair G Kergon <agk@redhat.com> Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
- 14 4月, 2009 2 次提交
-
-
由 NeilBrown 提交于
The sync_completed file reports how much of a resync (or recovery or reshape) has been completed. However due to the possibility of out-of-order completion of writes, it is not certain to be accurate. We have an internal value - mddev->curr_resync_completed - which is an accurate value (though it might not always be quite so uptodate). So: - make curr_resync_completed be uptodate a little more often, particularly when raid5 reshape updates status in the metadata - report curr_resync_completed in the sysfs file - allow poll/select to report all updates to md/sync_completed. This makes sync_completed completed usable by any external metadata handler that wants to record this status information in its metadata. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
When adding devices to an active array via sysfs, there is currently no way to mark a device as 'in-sync' which is useful when incrementally assembling an array. So add that option. Signed-off-by: NNeilBrown <neilb@suse.de>
-