- 28 August 2013, 3 commits
-
-
Committed by Shaohua Li

This is another attempt to create multiple threads to handle raid5 stripes. This time I use workqueue.

raid5 handles requests (especially writes) in stripe units. A stripe is page-size aligned and spans all disks. Writing to any disk sector, raid5 runs a state machine for the corresponding stripe, which includes reading some disks of the stripe, calculating parity, and writing some disks of the stripe. The state machine currently runs in the raid5d thread. Since there is only one thread, it doesn't scale well for high speed storage. An obvious solution is multi-threading. To get better performance, we have some requirements:

a. Locality. A stripe corresponding to a request submitted from one CPU is better handled by a thread on the local CPU or local node. The local CPU is preferred but can sometimes be a bottleneck, for example when parity calculation is too heavy; running on the local node has wider adaptability.

b. Configurability. Different raid5 array setups might need different configuration, especially the thread number. More threads don't always mean better performance because of lock contention.

My original implementation created some kernel threads. There were interfaces to control which CPUs' stripes each thread should handle, and userspace could set the affinity of the threads. This provided the biggest flexibility and configurability, but it was hard to use and apparently a new thread pool implementation is disfavored.

Recent workqueue improvements are quite promising. An unbound workqueue will be bound to NUMA nodes. If WQ_SYSFS is set on the workqueue, there are sysfs options to do affinity setting; for example, we can include only one HT sibling in the affinity. Work is non-reentrant by default, and we can control the running thread number by limiting the number of dispatched work_structs.

In this patch, I created several stripe worker groups. A group corresponds to a NUMA node: stripes from CPUs of one node will be added to that group's list, and workqueue threads of one node will only handle stripes of that node's worker group. In this way, stripe handling has NUMA node locality. And as I said, we can control thread number by limiting the dispatched work_struct number.

The work_struct callback function handles several stripes in one run. Typical workqueue usage is to run one unit per work_struct; in the raid5 case the unit is a stripe, but we can't do that:
a. Though handling a stripe doesn't need a lock because of reference accounting and the stripe isn't in any list, queuing a work_struct for each stripe would make the workqueue lock very heavily contended.
b. blk_start_plug()/blk_finish_plug() should surround stripe handling, as we might dispatch requests. If each work_struct only handled one stripe, such block plugging would be meaningless.

This implementation can't do very fine grained configuration, but NUMA binding is the most popular usage model and should be enough for most workloads.

Note: since we have only one stripe queue, switching to multi-thread might decrease the request size dispatched down to the low level layer. The impact depends on thread number, raid configuration and workload, so multi-thread raid5 might not be appropriate for all setups.

Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disable multi-threading by default
3. add more descriptions in the changelog

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
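The NUMA-grouped worker model described above can be sketched roughly as follows. This is a minimal illustration, not the actual raid5.c code: the group structure, the pop_stripe() helper and the function names are hypothetical, while handle_stripe() refers to raid5's existing per-stripe state machine.

```c
#include <linux/workqueue.h>
#include <linux/blkdev.h>

/* One worker group per NUMA node; CPUs queue stripes to their node's group. */
struct r5worker_group {
	spinlock_t lock;
	struct list_head stripe_list;	/* stripes queued from CPUs of this node */
	struct work_struct work;
};

static struct workqueue_struct *raid5_wq;

/* Each work invocation drains several stripes inside one block plug,
 * so plugging is meaningful and workqueue locking stays cheap. */
static void raid5_do_work(struct work_struct *ws)
{
	struct r5worker_group *group = container_of(ws, struct r5worker_group, work);
	struct stripe_head *sh;
	struct blk_plug plug;

	blk_start_plug(&plug);
	while ((sh = pop_stripe(group)) != NULL)	/* hypothetical helper */
		handle_stripe(sh);			/* run the stripe state machine */
	blk_finish_plug(&plug);
}

static int raid5_create_wq(void)
{
	/* WQ_UNBOUND gives per-NUMA-node worker pools; WQ_SYSFS exposes
	 * affinity/cpumask knobs under /sys/devices/virtual/workqueue/. */
	raid5_wq = alloc_workqueue("raid5wq",
				   WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_SYSFS, 0);
	return raid5_wq ? 0 : -ENOMEM;
}
```

Concurrency is then tuned by bounding how many work_structs are queued per group rather than by creating dedicated kernel threads.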
-
Committed by Shaohua Li

The patch "make release_stripe lockless" changes the order in which stripes are released. Originally I thought the block layer could take care of request merging, but it appears there are still some requests that are not merged. It's easy to fix the order.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
-
由 Shaohua Li 提交于
release_stripe still has big lock contention. We just add the stripe to a llist without taking device_lock. We let the raid5d thread to do the real stripe release, which must hold device_lock anyway. In this way, release_stripe doesn't hold any locks. The side effect is the released stripes order is changed. But sounds not a big deal, stripes are never handled in order. And I thought block layer can already do nice request merge, which means order isn't that important. I kept the unplug release batch, which is unnecessary with this patch from lock contention avoid point of view, and actually if we delete it, the stripe_head release_list and lru can share storage. But the unplug release batch is also helpful for request merge. We probably can delay wakeup raid5d till unplug, but I'm still afraid of the case which raid5d is running. Signed-off-by: NShaohua Li <shli@fusionio.com> Signed-off-by: NNeilBrown <neilb@suse.de>
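A rough sketch of the lockless hand-off, assuming simplified field and helper names (this is illustrative, not the exact raid5.c code):

```c
#include <linux/llist.h>

/* Fast path: no device_lock, just a lock-free push onto the pending list. */
static void release_stripe_sketch(struct r5conf *conf, struct stripe_head *sh)
{
	if (llist_add(&sh->release_list, &conf->released_stripes))
		wake_up_raid5d(conf);		/* hypothetical wakeup helper */
}

/* Slow path, run by raid5d which already holds device_lock: drain the
 * llist and do the real release.  Reversing restores submission order,
 * which helps the block layer merge the resulting requests. */
static void raid5d_release_stripes(struct r5conf *conf)
{
	struct llist_node *head = llist_del_all(&conf->released_stripes);
	struct stripe_head *sh, *tmp;

	head = llist_reverse_order(head);
	llist_for_each_entry_safe(sh, tmp, head, release_list)
		__release_stripe(conf, sh);	/* existing release, lock held */
}
```

The reordering step is what the follow-up "fix the order" patch above is concerned with.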
-
- 27 August 2013, 5 commits
-
-
Committed by NeilBrown

When the last process closes /dev/mdX, sync_blockdev will be called so that all buffers get flushed. So if it is then opened for the STOP_ARRAY ioctl to be sent, there will be nothing to flush.

However, if we open /dev/mdX in order to send the STOP_ARRAY ioctl just moments before some other process which was writing closes its file descriptor, then there won't be a 'last close' and the buffers might not get flushed. So do_md_stop() calls sync_blockdev(). However, at this point it is holding ->reconfig_mutex. So if the array is currently 'clean', the writes from sync_blockdev() will not complete until the array can be marked dirty, and that won't happen until some other thread can get ->reconfig_mutex. So we deadlock.

We need to move the sync_blockdev() call to before we take ->reconfig_mutex. However, then some other thread could open /dev/mdX and write to it after we call sync_blockdev() and before we actually stop the array. This can leave dirty data in the page cache, which is awkward.

So introduce a new flag, MD_STILL_CLOSED. Set it before calling sync_blockdev(), clear it if anyone does open the file, and abort the STOP_ARRAY attempt if it gets set before we lock against further opens.

It is still possible to get problems if you open /dev/mdX, write to it, then issue the STOP_ARRAY ioctl. Just don't do that.

Signed-off-by: NeilBrown <neilb@suse.de>
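The described ordering could look roughly like this. It is only an illustrative sketch with simplified function names; the real logic is spread across md's open, ioctl and do_md_stop() paths.

```c
/* On every open of /dev/mdX: note that the device is no longer "still closed". */
static int md_open_sketch(struct mddev *mddev)
{
	clear_bit(MD_STILL_CLOSED, &mddev->flags);
	return 0;
}

/* STOP_ARRAY path: flush before taking ->reconfig_mutex, then make sure
 * nobody opened the device (and possibly dirtied pages) in between. */
static int md_stop_sketch(struct mddev *mddev, struct block_device *bdev)
{
	set_bit(MD_STILL_CLOSED, &mddev->flags);
	sync_blockdev(bdev);		/* safe: the mutex is not held yet */

	mutex_lock(&mddev->reconfig_mutex);
	if (!test_bit(MD_STILL_CLOSED, &mddev->flags)) {
		mutex_unlock(&mddev->reconfig_mutex);
		return -EBUSY;		/* someone re-opened us: abort STOP_ARRAY */
	}
	/* ... actually stop the array ... */
	mutex_unlock(&mddev->reconfig_mutex);
	return 0;
}
```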
-
Committed by NeilBrown

mddev->flags is mostly used to record whether an update of the metadata is needed. Sometimes the whole field is tested instead of just the important bits, which makes it difficult to introduce more state bits. So replace all bare tests of mddev->flags with tests for the bits that actually need testing.

Signed-off-by: NeilBrown <neilb@suse.de>
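For illustration, the kind of before/after change this implies (the mask name is illustrative; the point is to test only the superblock-update bits rather than the whole word):

```c
/* Before: any future, unrelated flag bit would wrongly force a sb update. */
if (mddev->flags)
	md_update_sb(mddev, 0);

/* After: test only the bits that mean "the superblock needs writing". */
if (mddev->flags & MD_UPDATE_SB_FLAGS)	/* e.g. change-devs/clean/pending bits */
	md_update_sb(mddev, 0);
```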
-
Committed by Dave Jones

Setting a variable to itself probably wasn't the intention here.

Signed-off-by: Dave Jones <davej@fedoraproject.org>
Signed-off-by: NeilBrown <neilb@suse.de>
-
Committed by NeilBrown

When we set the safe_mode timeout to a smaller value we trigger a timeout immediately - otherwise the smaller value might not be honoured. However, if the previous timeout was 0, meaning "no timeout", we didn't. This would mean that no timeout happens until the next write completes, which could be a long time.

Signed-off-by: NeilBrown <neilb@suse.de>
-
Committed by NeilBrown

There is no real need, as GFP_NOIO is very likely sufficient and failure is not catastrophic. Calling md_allow_write here would convert a read-auto array to read/write, which could be confusing when you are just performing a read operation.

Signed-off-by: NeilBrown <neilb@suse.de>
-
- 17 August 2013, 1 commit
-
-
Committed by Geert Uytterhoeven

On sparc32, which includes <linux/swap.h> from <asm/pgtable_32.h>:

drivers/md/dm-cache-policy-mq.c:962:13: error: conflicting types for 'remove_mapping'
include/linux/swap.h:285:12: note: previous declaration of 'remove_mapping' was here

As mq_remove_mapping() already exists, and the local remove_mapping() is used only once, inline it manually to avoid the conflict.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair Kergon <agk@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
-
- 25 July 2013, 2 commits
-
-
Committed by NeilBrown

If a device in a RAID4/5/6 is being replaced while another is being recovered, then the writes to the replacement device currently don't happen, resulting in corruption when the replacement completes and the new drive takes over.

This is because the replacement writes are only triggered when 's.replacing' is set and not when the similar 's.sync' is set (which is the case during resync and recovery - it means all devices need to be read). So schedule those writes when s.sync is set as well.

In this case we cannot use "STRIPE_INSYNC" to record that the replacement has happened, as that is needed for recording that any parity calculation is complete. So introduce STRIPE_REPLACED to record whether the replacement has happened.

For safety we should also check that STRIPE_COMPUTE_RUN is not set. This has a similar effect to the "s.locked == 0" test: the latter ensures that no IO has been flagged but not started, while the former checks whether any parity calculation has been flagged but not yet completed. We must wait for both of these to complete before triggering the 'replace'.

Add a similar test to the subsequent check for "are we finished yet". This possibly isn't needed (it is subsumed in the STRIPE_INSYNC test), but it makes it more obvious that the REPLACE will happen before we think we are finished.

Finally, if a NeedReplace device is not UPTODATE then that is an error. We really must trigger a warning.

This bug was introduced in commit 9a3e1101 (md/raid5: detect and handle replacements during recovery.) which introduced replacement for raid5. That was in 3.3-rc3, so any stable kernel since then would benefit from this fix.

Cc: stable@vger.kernel.org (3.3+)
Reported-by: qindehua <13691222965@163.com>
Tested-by: qindehua <qindehua@163.com>
Signed-off-by: NeilBrown <neilb@suse.de>
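The resulting trigger condition in handle_stripe() looks roughly like this - a simplified sketch based on the description above, not the verbatim diff; the write-scheduling helper is hypothetical:

```c
/* Write to the replacement device(s) during both an explicit replace
 * (s.replacing) and a resync/recovery pass (s.sync), but only once no
 * IO is flagged-but-unstarted and no parity computation is in flight.
 * STRIPE_REPLACED records that this has already been done. */
if ((s.replacing || s.sync) && s.locked == 0 &&
    !test_bit(STRIPE_COMPUTE_RUN, &sh->state) &&
    !test_and_set_bit(STRIPE_REPLACED, &sh->state))
	schedule_replacement_writes(conf, sh);	/* hypothetical helper */
```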
-
Committed by NeilBrown

We always need to be careful when calling generic_make_request, as it can start a chain of events which might free something that we are using. Here is one place where I wasn't careful enough.

If the wbio2 is not in use, then it might get freed at the first generic_make_request call. So perform all necessary tests first.

This bug was introduced in 3.3-rc3 (24afd80d) and can cause an oops, so the fix is suitable for any -stable since then.

Cc: stable@vger.kernel.org (3.3+)
Signed-off-by: NeilBrown <neilb@suse.de>
-
- 18 July 2013, 3 commits
-
-
Committed by NeilBrown

The recent change to use bio_copy_data() in raid1 when repairing an array is faulty.

The underlying device may have changed the bio in various ways using bio_advance, and these changes need to be undone not just for the 'sbio' which is being copied to, but also for the 'pbio' (primary) which is being copied from. So perform the reset on all bios that were read from, and do it early. This also ensures that the sbio->bi_io_vec[j].bv_len passed to memcmp is correct.

This fixes a crash during a 'check' of a RAID1 array. The crash was introduced in 3.10, so this is suitable for 3.10-stable.

Cc: stable@vger.kernel.org (3.10)
Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.de>
-
Committed by NeilBrown

Commit 7ceb17e8 ("md: Allow devices to be re-added to a read-only array.") allowed a bit more than just that. It also allows devices to be added to a read-write array and to end up skipping recovery.

This patch removes the offending piece of code pending a rewrite for a subsequent release.

More specifically: if the array has a bitmap, then the device will still need a bitmap-based resync ('saved_raid_disk' is set under different conditions if a bitmap is present). If the array doesn't have a bitmap, then this is correct as long as nothing has been written to the array since the metadata was checked by ->validate_super. However, there is no locking to ensure that there was no write.

The bug was introduced in 3.10 and causes data corruption, so the patch is suitable for 3.10-stable.

Cc: stable@vger.kernel.org (3.10)
Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.de>
-
Committed by NeilBrown

1/ When a difference between blocks is found, data is copied from one bio to the other. However bv_len is used as the length to copy, and this could be zero. So use r10_bio->sectors to calculate the length instead. Using bv_len was probably always a bit dubious, but the introduction of bio_advance made it much more likely to be a problem.

2/ When preparing some blocks for sync, we don't set BIO_UPTODATE except on bios that we schedule for a read. This ensures that missing/failed devices don't confuse the loop at the top of sync_request's write pass. Commit 8be185f2 "raid10: Use bio_reset()" removed a loop which set BIO_UPTODATE on all appropriate bios, so we need to re-add that flag.

These bugs were introduced in 3.10, so this patch is suitable for 3.10-stable and removes a potential for data corruption.

Cc: stable@vger.kernel.org (3.10)
Reported-by: Brassow Jonathan <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
-
- 12 July 2013, 8 commits
-
-
Committed by Kent Overstreet

The alloc kthread should've been using try_to_freeze() - and there was also the potential for the alloc kthread to get woken up after it had shut down, which would have been bad.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
-
Committed by Kent Overstreet

Part of the job of garbage collection is to add up however many sectors of live data it finds in each bucket, but that doesn't work very well if it doesn't reset GC_SECTORS_USED() when it starts. Whoops.

This wouldn't have broken anything horribly, but allocation tries to preferentially reclaim buckets that are mostly empty, and that's not going to work with an incorrect GC_SECTORS_USED() value.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
-
Committed by Kent Overstreet

The journal replay code starts by finding something that looks like a valid journal entry, then it does a binary search over the unchecked region of the journal for the journal entries with the highest sequence numbers.

Trouble is, the logic was wrong: journal_read_bucket() returns true if it found journal entries we need, but if the range of journal entries we're looking for wraps around the end of the journal, journal_read_bucket() could return true when it hadn't found the highest sequence number we'd seen yet, and in that case the binary search did the wrong thing. Whoops.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
-
Committed by Kent Overstreet

Stopping a cache set is supposed to make it stop attached backing devices, but somewhere along the way that code got lost. Fixing this mainly has the effect of fixing our reboot notifier.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
-
Committed by Kent Overstreet

If we stopped a bcache device when we were already detaching (or something like that), bcache_device_unlink() would try to remove a symlink from sysfs that was already gone because the bcache dev kobject had already been removed from sysfs. So keep track of whether we've removed stuff from sysfs.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
-
Committed by Kent Overstreet

Whoops - bcache's flush/FUA handling was mostly correct, but flushes get filtered out unless we say we support them...

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
-
Committed by Dan Carpenter

There is a missing NULL check after the kzalloc().

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
-
Committed by Kent Overstreet

In the far-too-complicated closure code, closures can have destructors (for probably dubious reasons); they get run after the closure is no longer waiting on anything but before dropping the parent ref, intended just for freeing whatever memory the closure is embedded in.

Trouble is, when remaining goes to 0 and we've got nothing more to run, we also have to unlock the closure, setting remaining to -1. If there's a destructor, that unlock isn't doing anything - nobody could be trying to lock it if we're about to free it - but if the unlock _is_ needed... that check for a destructor was racy. Argh.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
-
- 11 July 2013, 12 commits
-
-
Committed by Jim Ramsay

dm-switch is a new target that maps IO to underlying block devices efficiently when there is a large number of fixed-sized address regions but no simple pattern that would allow for a compact mapping representation such as dm-stripe.

Though we developed this target for a specific storage device, Dell EqualLogic, we have made an effort to keep it as general purpose as possible in the hope that others may benefit.

Originally developed by Jim Ramsay. Simplified by Mikulas Patocka.

Signed-off-by: Jim Ramsay <jim_ramsay@dell.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
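Conceptually, the lookup amounts to indexing a per-region table with the region number derived from the sector. This is a much-simplified sketch; the structure layout and helper names below are assumptions, not dm-switch's actual (more compact, bit-packed) representation:

```c
/* Map a sector to one of the underlying paths via a fixed-size region table. */
struct switch_path {
	struct dm_dev *dev;
	sector_t start;
};

struct switch_ctx {
	unsigned int region_size_bits;	/* log2 of the fixed region size */
	unsigned int *region_table;	/* one small path index per region */
	struct switch_path *paths;
};

static struct dm_dev *switch_map_sector(struct switch_ctx *sctx, sector_t sector)
{
	unsigned long region = sector >> sctx->region_size_bits;
	unsigned int path_nr = sctx->region_table[region];

	return sctx->paths[path_nr].dev;
}
```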
-
Committed by Mikulas Patocka

This reorder actually improves performance by 20% (from 39.1s to 32.8s) on an x86-64 quad-core Opteron. I have no explanation for this; possibly it makes some other entries better cache-aligned.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Committed by Mikulas Patocka

This patch removes "io_lock" and "map_lock" in struct mapped_device and "holders" in struct dm_table, and replaces these mechanisms with sleepable RCU.

Previously, the code would call "dm_get_live_table" and "dm_table_put" to get and release a table. Now, the code is changed to call "dm_get_live_table" and "dm_put_live_table": dm_get_live_table locks sleepable RCU and dm_put_live_table unlocks it.

dm_get_live_table_fast/dm_put_live_table_fast can be used instead of dm_get_live_table/dm_put_live_table. These *_fast functions use non-sleepable RCU, so the caller must not block between them.

If the code changes the active or inactive dm table, it must call dm_sync_table before destroying the old table.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
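The access pattern described above maps onto the kernel's SRCU API roughly as follows. This is an illustrative sketch only; the srcu_struct field name and the exact dm.c internals are assumptions:

```c
#include <linux/srcu.h>

/* Readers: may sleep while holding the table reference. */
static struct dm_table *dm_get_live_table_sketch(struct mapped_device *md,
						 int *srcu_idx)
{
	*srcu_idx = srcu_read_lock(&md->io_barrier);
	return srcu_dereference(md->map, &md->io_barrier);
}

static void dm_put_live_table_sketch(struct mapped_device *md, int srcu_idx)
{
	srcu_read_unlock(&md->io_barrier, srcu_idx);
}

/* Writer: publish the new table, then wait for all readers of the old one
 * before it may be destroyed - this is what dm_sync_table() provides. */
static void dm_swap_table_sketch(struct mapped_device *md, struct dm_table *new_map)
{
	rcu_assign_pointer(md->map, new_map);
	synchronize_srcu(&md->io_barrier);
}
```

The *_fast variants described above would use plain rcu_read_lock()/rcu_read_unlock() instead, which is why their callers must not block.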
-
Committed by Mikulas Patocka

This patch changes dm-bufio so that it submits write I/Os outside of the lock. If the number of submitted buffers is greater than the number of requests on the target queue, submit_bio blocks. We want to block outside of the lock to improve the latency of other threads that may need the lock.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
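The general shape of that change is to gather the dirty buffers onto a private list under the lock and only issue the (possibly blocking) I/O after dropping it. A rough sketch with assumed list and helper names, not the actual dm-bufio.c code:

```c
static void write_dirty_buffers_sketch(struct dm_bufio_client *c)
{
	LIST_HEAD(write_list);
	struct dm_buffer *b, *tmp;

	dm_bufio_lock(c);
	/* Only collect work while holding the client lock... */
	list_for_each_entry_safe(b, tmp, &c->dirty_buffers, lru_list)
		list_move(&b->lru_list, &write_list);
	dm_bufio_unlock(c);

	/* ...and issue the writes afterwards, where submit_bio() may block
	 * without holding up other threads that need the lock. */
	list_for_each_entry_safe(b, tmp, &write_list, lru_list)
		write_one_buffer(b);	/* hypothetical helper around submit_bio() */
}
```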
-
Committed by Mikulas Patocka

Use __always_inline to avoid a link failure with gcc 4.6 on ARM; gcc 4.7 is OK. gcc 4.6 creates a function block_div.part.8 which references __udivdi3 and __umoddi3 and is never called. The references to __udivdi3 and __umoddi3 cause a link failure.

Reported-by: Rob Herring <robherring2@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Committed by Mikulas Patocka

This patch changes ffs() to __ffs() and fls() to __fls(), which don't add one to the result.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
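For reference, the difference between the two families (a minimal illustration of the conversion, not the actual diff):

```c
#include <linux/bitops.h>

/* ffs()/fls() return a 1-based position (and 0 when no bit is set);
 * __ffs()/__fls() return the 0-based bit index directly, so the usual
 * "- 1" correction disappears. */
static inline unsigned int block_size_to_shift(unsigned int block_size)
{
	/* old: return ffs(block_size) - 1;   e.g. ffs(4096) == 13 */
	return __ffs(block_size);		/* __ffs(4096) == 12 */
}
```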
-
Committed by Alasdair G Kergon

Remove the reference to the "linear" target from the error message issued when allocation fails in the flakey target.

Cc: Robin Dong <sanbai@taobao.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Committed by Mikulas Patocka

Remove the "num < 0" test in verity_ctr because num is unsigned. (Found by Coverity.)

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Committed by Mikulas Patocka

Use __GFP_HIGHMEM in __vmalloc. Pages allocated with __vmalloc can be allocated in high memory that is not directly mapped to kernel space, so use __GFP_HIGHMEM just like vmalloc does. This patch reduces memory pressure slightly because pages can be allocated in the high zone.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
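In terms of the three-argument __vmalloc() used at the time, the change amounts to adding the highmem flag that plain vmalloc() already passes (an illustrative before/after fragment, not the exact call site):

```c
/* before: pages restricted to the directly mapped (lowmem) zones */
buf = __vmalloc(size, gfp_mask, PAGE_KERNEL);

/* after: also allow highmem pages, as vmalloc() itself does */
buf = __vmalloc(size, gfp_mask | __GFP_HIGHMEM, PAGE_KERNEL);
```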
-
Committed by Mikulas Patocka

Fix a boundary condition that caused failure for certain device sizes. The problem is reported at http://code.google.com/p/cryptsetup/issues/detail?id=160

For certain device sizes the number of hashes at a specific level was calculated incorrectly. It happens, for example, for a device with data and metadata block size 4096 that has 16385 blocks and algorithm sha256.

The user can test whether they are affected by this bug by running the "veritysetup verify" command, and also by activating the dm-verity kernel driver and reading the whole block device. If it passes without an error, then the user is not affected.

The condition for the bug is: split the total number of data blocks (data_block_bits) into bit strings, each string having hash_per_block_bits bits, where hash_per_block_bits is rounddown(log2(metadata_block_size/hash_digest_size)). Equivalently, you can say that you convert data_block_bits to base 2^hash_per_block_bits. If there is some zero bit string below the most significant bit string, and at least one bit below this zero bit string is set, then the bug happens.

The same bug exists in the userspace veritysetup tool, so you must use a fixed veritysetup too if you want to use devices that are affected by this boundary condition.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org # 3.4+
Cc: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
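Working through the cited example: with 4096-byte hash blocks and 32-byte SHA-256 digests, hash_per_block_bits = log2(4096/32) = 7, so each hash block covers 128 hashes. 16385 data blocks is binary 100000000000001; split into 7-bit strings from the least significant end this gives 1, 0000000, 0000001 (i.e. 1*128^2 + 0*128 + 1 in base 128). The middle string is zero while a bit below it is set, which is exactly the trigger condition described above.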
-
Committed by Mikulas Patocka

Set the noio flag while calling __vmalloc(), because it doesn't fully respect gfp flags, to avoid a possible deadlock (see commit 502624bd).

This should be backported to stable kernels 3.8 and newer. Kernel 3.8 doesn't have memalloc_noio_save(), so there we should set and restore the process flag PF_MEMALLOC instead.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
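The pattern looks like this - a minimal sketch of the technique using the era's three-argument __vmalloc(), not the exact dm-ioctl code:

```c
#include <linux/sched.h>
#include <linux/vmalloc.h>

static void *noio_vmalloc_sketch(size_t size)
{
	unsigned int noio_flag;
	void *ptr;

	/* Mark the whole allocation as "no I/O allowed" at the process level,
	 * since __vmalloc() does not propagate GFP_NOIO to every internal
	 * allocation it makes. */
	noio_flag = memalloc_noio_save();
	ptr = __vmalloc(size, GFP_NOIO | __GFP_HIGHMEM, PAGE_KERNEL);
	memalloc_noio_restore(noio_flag);

	return ptr;
}
```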
-
Committed by Hannes Reinecke

When multipath needs to retry an ioctl, the reference to the current live table needs to be dropped. Otherwise a deadlock occurs when all paths are down:
- dm_blk_ioctl takes a reference to the current table and spins in multipath_ioctl().
- A new table is being loaded, but upon resume the process hangs in dm_table_destroy() waiting for references to drop to zero.

With this patch the reference to the old table is dropped prior to the retry, thereby avoiding the deadlock.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
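The fix boils down to releasing the table reference before sleeping and looping, roughly like this. This is a simplified sketch of a dm_blk_ioctl-style retry loop; the helper names and the retry condition shown are assumptions, not the verbatim code:

```c
static int dm_blk_ioctl_sketch(struct mapped_device *md, unsigned int cmd,
			       unsigned long arg)
{
	int r;

	for (;;) {
		struct dm_table *map = get_live_table_ref(md);	/* assumed helper */

		r = do_target_ioctl(map, cmd, arg);	/* e.g. multipath_ioctl() */
		put_live_table_ref(md, map);		/* drop ref BEFORE retrying */

		if (r != -ENOTCONN || fatal_signal_pending(current))
			break;
		msleep(10);	/* all paths down: wait, then try again */
	}
	return r;
}
```

With the reference dropped each time around the loop, a concurrent table reload can finish and the next iteration picks up the new table.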
-
- 04 July 2013, 2 commits
-
-
Committed by NeilBrown

The recent commit 7e83ccbe ("md/raid10: Allow skipping recovery when clean arrays are assembled") causes raid10 to skip a recovery in certain cases where it is safe to do so. Unfortunately it also causes a reshape to be skipped, which is never safe. The result is that an attempt to reshape a RAID10 will appear to complete instantly, but no data will have been moved, so the array will now contain garbage. (If nothing is written, you can recover by simply performing the reverse reshape, which will also complete instantly.)

The bug was introduced in 3.10, so this is suitable for 3.10-stable.

Cc: stable@vger.kernel.org (3.10)
Cc: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
-
Committed by NeilBrown

There is a bug in 'check_reshape' for raid5.c. It checks that the new minimum number of devices is large enough (which is good), but it does so even after the reshape has started (bad). This is bad because:
- the calculation is now wrong, as mddev->raid_disks has changed already, and
- it is pointless, because it is now too late to stop.

So only perform that test when the reshape has not yet been committed to.

Signed-off-by: NeilBrown <neilb@suse.de>
-
- 03 July 2013, 1 commit
-
-
Committed by NeilBrown

1/ If a RAID10 is being reshaped to a smaller number of devices and is stopped while this is ongoing, then when the array is reassembled the 'mirrors' array will be allocated too small. This will lead to an access error or memory corruption.

2/ A sanity test for when a reshaping RAID10 array is restarted is slightly incorrect.

Due to the first bug, this is suitable for any -stable kernel since 3.5, where this code was introduced.

Cc: stable@vger.kernel.org (v3.5+)
Signed-off-by: NeilBrown <neilb@suse.de>
-
- 02 July 2013, 3 commits
-
-
Committed by Kent Overstreet

Some of bcache's utility code has made it into the rest of the kernel, so drop the bcache versions.

Bcache used to have a workaround for allocating from a bio set under generic_make_request() (if you allocated more than once, the bios you already allocated would get stuck on current->bio_list when you submitted, and you'd risk deadlock) - bcache would mask out __GFP_WAIT when allocating bios under generic_make_request() so that allocation could fail and it could retry from a workqueue. But bio_alloc_bioset() has a workaround now, so we can drop this hack and the associated error handling.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
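The workaround being removed looked roughly like this - an illustrative reconstruction from the description above, using the old three-argument bio_alloc_bioset() and the then-current __GFP_WAIT flag; the retry helper is hypothetical:

```c
static struct bio *bch_bio_alloc_sketch(struct bio_set *bs, int nr_iovecs)
{
	gfp_t gfp = GFP_NOIO;
	struct bio *bio;

	/* Under generic_make_request() a blocking allocation could deadlock
	 * on bios parked on current->bio_list, so allow it to fail instead. */
	if (current->bio_list)
		gfp &= ~__GFP_WAIT;

	bio = bio_alloc_bioset(gfp, nr_iovecs, bs);
	if (!bio)
		punt_to_workqueue();	/* hypothetical: retry from process context */
	return bio;
}
```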
-
Committed by Kent Overstreet

This code has rotted and it hasn't been used in ages anyway.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
-
Committed by Kent Overstreet

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
-