- 12 October 2012, 2 commits
-
-
By Wei Yongjun
Use list_move() instead of list_del() + list_add(). spatch with a semantic match was used to find this. (http://coccinelle.lip6.fr/) Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
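As a minimal illustration of the transformation (the `item` and `dest` names below are hypothetical, not taken from the patch):

    /* Before: open-coded delete-and-reinsert */
    list_del(&item->list);
    list_add(&item->list, &dest);

    /* After: a single helper from <linux/list.h> with the same effect */
    list_move(&item->list, &dest);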
-
By Wei Yongjun
The mpio dereference should be moved below the BUG_ON NULL test in multipath_end_io(). spatch with a semantic match was used to find this. (http://coccinelle.lip6.fr/) Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
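A sketch of the ordering issue; the surrounding code and field names here are illustrative rather than the exact dm-mpath source:

    struct dm_mpath_io *mpio = map_context->ptr;
    struct pgpath *pgpath;

    BUG_ON(!mpio);              /* the NULL test must come first...      */
    pgpath = mpio->pgpath;      /* ...only then is the dereference safe  */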
-
- 27 September 2012, 9 commits
-
-
By NeilBrown
The 'enough' function is written to work with 'near' arrays only, in that it implicitly assumes that the offset from one 'group' of devices to the next is the same as the number of copies. In reality it is the number of 'near' copies. So change it to make this number explicit. This bug makes it possible to run arrays without enough drives present, which is dangerous. It is appropriate for an -stable kernel, but will almost certainly need to be modified for some of them. Cc: stable@vger.kernel.org Reported-by: Jakub Husák <jakub@gooseman.cz> Signed-off-by: NeilBrown <neilb@suse.de>
-
By Mikulas Patocka
This patch fixes sector_t overflow checking in dm-verity. Without this patch, the code checks for overflow only if sector_t is smaller than long long, not if sector_t and long long have the same size. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>
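A hedged sketch of a round-trip check that catches truncation whether sector_t is narrower than, or the same width as, long long (the variable names are illustrative, not the dm-verity code):

    unsigned long long num_ll = value_from_table_line;   /* hypothetical input */
    sector_t s = (sector_t)(num_ll << SECTOR_SHIFT);

    /* If the shift overflowed or the cast truncated, the round trip differs. */
    if ((unsigned long long)(s >> SECTOR_SHIFT) != num_ll)
            return -EINVAL;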
-
By Mike Snitzer
The discard limits that get established for a thin-pool or thin device may be incompatible with the pool's data device. Avoid this by checking the discard limits of the pool's data device. If an incompatibility is found then the pool's 'discard passdown' feature is disabled. Change thin_io_hints to ensure that a thin device always uses the same queue limits as its pool device. Introduce requested_pf to track whether or not the table line originally contained the no_discard_passdown flag and use this directly for table output. We prepare the correct setting for discard_passdown directly in bind_control_target (called from pool_io_hints) and store it in adjusted_pf rather than waiting until we have access to pool->pf in pool_preresume. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
By Mike Snitzer
A little thin discard code refactoring to make the next patch (dm thin: fix discard support for data devices) more readable. Pull out a couple of functions (and use bools instead of unsigned for features). No functional changes. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
By Mike Snitzer
Add a safety net that will re-use the DM device's existing limits in the event that the DM device has a temporary table that doesn't have any component devices. This is to reduce the chance that requests not respecting the hardware limits will reach the device. DM recalculates queue limits based only on devices which currently exist in the table. This creates a problem in the event all devices are temporarily removed, such as all paths being lost in multipath. DM will reset the limits to the maximum permissible, which can then assemble requests which exceed the limits of the paths when the paths are restored. The request will fail the blk_rq_check_limits() test when sent to a path with lower limits, and will be retried without end by multipath. This became a much bigger issue after v3.6 commit fe86cdce ("block: do not artificially constrain max_sectors for stacking drivers"). Reported-by: David Jeffery <djeffery@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
By Milan Broz
Always clear QUEUE_FLAG_ADD_RANDOM if any underlying device does not have it set. Otherwise devices with predictable characteristics may contribute entropy. QUEUE_FLAG_ADD_RANDOM specifies whether or not queue IO timings contribute to the random pool. For bio-based targets this flag is always 0 because such devices have no real queue. For request-based devices this flag was always set to 1 by default. Now set it according to the flags on underlying devices. If there is at least one device which should not contribute, set the flag to zero: if a device, such as fast SSD storage, is not suitable for supplying entropy, a request-based queue stacked over it will not be either. Because the checking logic is exactly the same as for the rotational flag, share the iteration function with device_is_nonrot(). Signed-off-by: Milan Broz <mbroz@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>
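A sketch of what a per-device check used by the shared iteration could look like (the signature follows dm's iterate_devices callback convention; the function name and exact body are reproduced here as an assumption, not from the patch):

    static int device_is_not_random(struct dm_target *ti, struct dm_dev *dev,
                                    sector_t start, sector_t len, void *data)
    {
            struct request_queue *q = bdev_get_queue(dev->bdev);

            /* Report devices that must not contribute to the entropy pool. */
            return q && !blk_queue_add_random(q);
    }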
-
By Mike Snitzer
The access-beyond-end-of-device BUG_ON that was introduced to dm_request_fn via commit 29e4013d ("dm: implement REQ_FLUSH/FUA support for request-based dm") was an overly drastic (but simple) response to this situation. I have received a report that this BUG_ON was hit and now think it would be better to use dm_kill_unmapped_request() to fail the clone and original request with -EIO. map_request() will assign the valid target returned by dm_table_find_target to tio->ti. But when the target isn't valid tio->ti is never assigned (because map_request isn't called); so add a check for tio->ti != NULL to dm_done(). Reported-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: stable@vger.kernel.org # v2.6.37+ Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
By Mike Snitzer
When there are no paths and multipath receives an ioctl, it waits until a path becomes available. This behaviour is incorrect if the "queue_if_no_path" setting was not specified, as then the ioctl should be rejected immediately, which this patch now does. commit 35991652 ("dm mpath: allow ioctls to trigger pg init") should have checked if queue_if_no_path was configured before queueing IO. Checking for the queue_if_no_path feature, like is done in map_io(), allows the following table load to work without blocking in the multipath_ioctl retry loop: echo "0 1024 multipath 0 0 0 0" | dmsetup create mpath_nodevs. Without this fix the multipath_ioctl will block with the following stack trace:
blkid D 0000000000000002 0 23936 1 0x00000000
 ffff8802b89e5cd8 0000000000000082 ffff8802b89e5fd8 0000000000012440
 ffff8802b89e4010 0000000000012440 0000000000012440 0000000000012440
 ffff8802b89e5fd8 0000000000012440 ffff88030c2aab30 ffff880325794040
Call Trace:
 [<ffffffff814ce099>] schedule+0x29/0x70
 [<ffffffff814cc312>] schedule_timeout+0x182/0x2e0
 [<ffffffff8104dee0>] ? lock_timer_base+0x70/0x70
 [<ffffffff814cc48e>] schedule_timeout_uninterruptible+0x1e/0x20
 [<ffffffff8104f840>] msleep+0x20/0x30
 [<ffffffffa0000839>] multipath_ioctl+0x109/0x170 [dm_multipath]
 [<ffffffffa06bfb9c>] dm_blk_ioctl+0xbc/0xd0 [dm_mod]
 [<ffffffff8122a408>] __blkdev_driver_ioctl+0x28/0x30
 [<ffffffff8122a79e>] blkdev_ioctl+0xce/0x730
 [<ffffffff811970ac>] block_ioctl+0x3c/0x40
 [<ffffffff8117321c>] do_vfs_ioctl+0x8c/0x340
 [<ffffffff81166293>] ? sys_newfstat+0x33/0x40
 [<ffffffff81173571>] sys_ioctl+0xa1/0xb0
 [<ffffffff814d70a9>] system_call_fastpath+0x16/0x1b
Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.5+ Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
By Mike Snitzer
The dm thin pool target claims to support the zeroing of discarded data areas. This turns out to be incorrect when processing discards that do not exactly cover a complete number of blocks, so the target must always set discard_zeroes_data_unsupported. The thin pool target will zero blocks when they are allocated if the skip_block_zeroing feature is not specified. The block layer may send a discard that only partly covers a block. If a thin pool block is partially discarded then there is no guarantee that the discarded data will get zeroed before it is accessed again. Due to this, thin devices cannot claim discards will always zero data. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Cc: stable@vger.kernel.org # 3.4+ Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
- 24 September 2012, 1 commit
-
-
By NeilBrown
commit b17459c0 ("raid5: add a per-stripe lock") added a spin_lock to the 'stripe_head' struct. Unfortunately there are two places where this struct is allocated but the spin lock was only initialised in one of them. So add the missing spin_lock_init. Signed-off-by: NeilBrown <neilb@suse.de>
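Schematically, every allocation site of the structure needs the same initialisation; cache and field names below are illustrative, not the literal raid5 code:

    struct stripe_head *sh = kmem_cache_zalloc(conf->slab_cache, GFP_KERNEL);
    if (!sh)
            return 0;
    /* Both allocation paths must initialise the per-stripe lock. */
    spin_lock_init(&sh->stripe_lock);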
-
- 20 September 2012, 1 commit
-
-
By Martin K. Petersen
The WRITE SAME command supported on some SCSI devices allows the same block to be efficiently replicated throughout a block range. Only a single logical block is transferred from the host and the storage device writes the same data to all blocks described by the I/O. This patch implements support for WRITE SAME in the block layer. The blkdev_issue_write_same() function can be used by filesystems and block drivers to replicate a buffer across a block range. This can be used to efficiently initialize software RAID devices, etc. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
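A hedged usage sketch: the helper and wrapper names below assume a caller that already holds an opened block device and a 512-byte pattern page; only blkdev_issue_write_same() itself comes from the patch description.

    #include <linux/blkdev.h>

    /* Hypothetical helper: replicate one 512-byte payload over 2048 sectors (1 MiB). */
    static int fill_with_pattern(struct block_device *bdev, struct page *pattern_page)
    {
            int ret = blkdev_issue_write_same(bdev, 0, 2048, GFP_KERNEL, pattern_page);

            if (ret)
                    pr_warn("WRITE SAME failed: %d\n", ret);
            return ret;
    }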
-
- 19 September 2012, 3 commits
-
-
By NeilBrown
It isn't always necessary to update the metadata when spares are removed, as the presence-or-not of a spare isn't really important to the integrity of an array. Also activating a spare doesn't always require updating the metadata, as the update on 'recovery-completed' is usually sufficient. However the introduction of 'replacement' devices has made these transitions sometimes more important. For example the 'Replacement' flag isn't cleared until the original device is removed, so we need to ensure a metadata update after that 'spare' is removed. So set MD_CHANGE_DEVS whenever a spare is activated or removed, to complement the current situation where it is set when a spare is added or a device is failed (or a number of other less common situations). This is suitable for -stable as out-of-date metadata could lead to data corruption. This is only relevant for 3.3 and later, when 'replacement' was introduced. Cc: stable@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
-
By NeilBrown
When a replacement device becomes active, we mark the device that it replaces as 'faulty' so that it can subsequently get removed. However 'calc_degraded' only pays attention to the primary device, not the replacement, so the array appears to become degraded, which is wrong. So teach 'calc_degraded' to consider any replacement if a primary device is faulty. This is suitable for -stable as an incorrect 'degraded' value can confuse md and could lead to data corruption. This is only relevant for 3.3 and later. Cc: stable@vger.kernel.org Reported-by: Robin Hill <robin@robinhill.me.uk> Reported-by: John Drescher <drescherjm@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
-
By NeilBrown
This reverts commit 895e3c5c. While this patch seemed like a good idea and did help some workloads, it hurts other workloads. Large sequential O_DIRECT writes were faster; small random O_DIRECT writes were slower. Other changes (batching RAID5 writes) have improved the sequential writes using a different mechanism, so the net result of this patch is definitely negative. So revert it. Reported-by: Shaohua Li <shli@kernel.org> Tested-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
-
- 09 September 2012, 4 commits
-
-
By Kent Overstreet
Previously, there was bio_clone() but it only allocated from the fs bio set; as a result various users were open coding it and using __bio_clone(). This changes bio_clone() to become bio_clone_bioset(), and then we add bio_clone() and bio_clone_kmalloc() as wrappers around it, making use of the functionality the last patch added. This will also help in a later patch changing how bio cloning works. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: NeilBrown <neilb@suse.de> CC: Alasdair Kergon <agk@redhat.com> CC: Boaz Harrosh <bharrosh@panasas.com> CC: Jeff Garzik <jeff@garzik.org> Acked-by: Jeff Garzik <jgarzik@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
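The wrappers described above can be pictured roughly like this; a sketch under the assumption that fs_bio_set remains the default pool and a NULL bio_set means kmalloc-backed allocation, not necessarily the exact header code:

    static inline struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
    {
            return bio_clone_bioset(bio, gfp_mask, fs_bio_set);   /* old default pool */
    }

    static inline struct bio *bio_clone_kmalloc(struct bio *bio, gfp_t gfp_mask)
    {
            return bio_clone_bioset(bio, gfp_mask, NULL);         /* no bio_set: kmalloc */
    }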
-
By Kent Overstreet
Previously, dm_rq_clone_bio_info needed to be freed by the bio's destructor to avoid a memory leak in the blk_rq_prep_clone() error path. This gets rid of a memory allocation and means we can kill dm_rq_bio_destructor. The _rq_bio_info_cache kmem cache is unused now and needs to be deleted, but due to the way io_pool is used and overloaded this looks not quite trivial so I'm leaving it for a later patch. v6: Fix comment on struct dm_rq_clone_bio_info, per Tejun Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Alasdair Kergon <agk@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Kent Overstreet
Now that bios keep track of where they were allocated from, bio_integrity_alloc_bioset() becomes redundant. Remove bio_integrity_alloc_bioset(), drop the bio_set argument from the related functions and make them use bio->bi_pool. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Kent Overstreet
With the old code, when you allocate a bio from a bio pool you have to implement your own destructor that knows how to find the bio pool the bio was originally allocated from. This adds a new field to struct bio (bi_pool) and changes bio_alloc_bioset() to use it. This makes various bio destructors unnecessary, so they're then deleted. v6: Explain the temporary if statement in bio_put Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: NeilBrown <neilb@suse.de> CC: Alasdair Kergon <agk@redhat.com> CC: Nicholas Bellinger <nab@linux-iscsi.org> CC: Lars Ellenberg <lars.ellenberg@linbit.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 21 August 2012, 1 commit
-
-
By Tejun Heo
flush[_delayed]_work_sync() are now spurious. Mark them deprecated and convert all users to flush[_delayed]_work(). If you're cc'd and wondering what's going on: now all workqueues are non-reentrant and the regular flushes guarantee that the work item is not pending or running on any CPU on return, so there's no reason to use the sync flushes at all and they're going away. This patch doesn't make any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Mattia Dongili <malattia@linux.it> Cc: Kent Yoder <key@linux.vnet.ibm.com> Cc: David Airlie <airlied@linux.ie> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Karsten Keil <isdn@linux-pingi.de> Cc: Bryan Wu <bryan.wu@canonical.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mauro Carvalho Chehab <mchehab@infradead.org> Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Cc: David Woodhouse <dwmw2@infradead.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: linux-wireless@vger.kernel.org Cc: Anton Vorontsov <cbou@mail.ru> Cc: Sangbeom Kim <sbkim73@samsung.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Petr Vandrovec <petr@vandrovec.name> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Avi Kivity <avi@redhat.com>
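The conversion is mechanical; for example (the work item names are hypothetical):

    /* Before: deprecated sync variants */
    flush_work_sync(&priv->event_work);
    flush_delayed_work_sync(&priv->poll_work);

    /* After: the plain flushes now give the same guarantee */
    flush_work(&priv->event_work);
    flush_delayed_work(&priv->poll_work);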
-
- 18 August 2012, 1 commit
-
-
By NeilBrown
A 'struct r10bio' has an array of per-copy information at the end. This array is declared with size [0] and r10bio_pool_alloc allocates enough extra space to store the per-copy information depending on the number of copies needed. So declaring a 'struct r10bio' on the stack isn't going to work. It won't allocate enough space, and memory corruption will ensue. So in the two places where this is done, declare a sufficiently large structure and use that instead. The two call-sites of this bug were introduced in 3.4 and 3.5 so this is suitable for both those kernels. The patch will have to be modified for 3.4 as it only has one bug. Cc: stable@vger.kernel.org Reported-by: Ivan Vasilyev <ivan.vasilyev@gmail.com> Tested-by: Ivan Vasilyev <ivan.vasilyev@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
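The shape of the fix, sketched with an illustrative element count (the real patch sizes the array from the array's copy count; MAX_COPIES here is an assumption):

    /* Broken: no storage behind the trailing per-copy array */
    struct r10bio r10_bio;

    /* Fix: embed the r10bio in a wrapper that reserves room for the copies */
    struct {
            struct r10bio r10_bio;
            struct r10dev devs[MAX_COPIES];
    } on_stack;
    struct r10bio *r10b = &on_stack.r10_bio;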
-
- 16 August 2012, 1 commit
-
-
By NeilBrown
commit 27a7b260 ("md: Fix handling for devices from 2TB to 4TB in 0.90 metadata") changed 0.90 metadata handling to truncate the size to 4TB, as that is all that 0.90 can record. However for RAID0 and Linear, 0.90 doesn't need to record the size, so this truncation is not needed and causes working arrays to become too small. So avoid the truncation for RAID0 and Linear. This bug was introduced in 3.1 and is suitable for any stable kernels from then onwards. As the offending commit was tagged for 'stable', any stable kernel that it was applied to should also get this patch. That includes at least 2.6.32, 2.6.33 and 3.0. (Thanks to Ben Hutchings for providing that list.) Cc: stable@vger.kernel.org Signed-off-by: Neil Brown <neilb@suse.de>
-
- 02 August 2012, 4 commits
-
-
By NeilBrown
Now that DM_RAID supports raid10, it needs to select that code to ensure it is included. Cc: Jonathan Brassow <jbrassow@redhat.com> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
-
By NeilBrown
Queuing writes to the md thread means that all requests go through the one processor, which may not be able to keep up with very high request rates. So use the plugging infrastructure to submit all requests on unplug. If a 'schedule' is needed, we fall back on the old approach of handing the requests to the thread for it to handle. Signed-off-by: NeilBrown <neilb@suse.de>
-
By Shaohua Li
Let raid5d handle stripes in a batched way to reduce conf->device_lock locking. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
-
By Shaohua Li
make_request() does a stripe release for every stripe, and the stripe usually has a count of 1, which makes the previous release_stripe() optimization ineffective. In my test, this release_stripe() becomes the heaviest place to take conf->device_lock after the previous patches are applied. This patch batches stripe release: all the stripes are released at unplug time. The STRIPE_ON_UNPLUG_LIST bit protects concurrent access to the stripe lru. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
-
- 01 August 2012, 1 commit
-
-
By Jonathan Brassow
Support the MD RAID10 personality through dm-raid.c Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
-
- 31 July 2012, 12 commits
-
-
By NeilBrown
This will allow md/raid to know why the unplug was called, and it will be able to act accordingly - if !from_schedule it is safe to perform tasks which could themselves schedule. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By NeilBrown
Both md and umem have similar code for getting notified on a blk_finish_plug event. Centralize this code in block/ and allow each driver to provide its distinctive difference. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
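A sketch of how a driver might use the centralized hook, combined with the from_schedule flag from the previous entry; the callback and helper names below are illustrative, and the blk_check_plugged() signature is stated as an assumption:

    /* Per-plug callback: invoked when the plug is flushed. */
    static void my_unplug(struct blk_plug_cb *cb, bool from_schedule)
    {
            /* cb->data carries the pointer passed to blk_check_plugged();
             * if from_schedule, defer anything that might block. */
    }

    /* In the submission path: register (or find) this plug's callback. */
    static bool my_check_plugged(struct mddev *mddev)
    {
            return !!blk_check_plugged(my_unplug, mddev, sizeof(struct blk_plug_cb));
    }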
-
By NeilBrown
This seemed like a good idea at the time, but after further thought I cannot see it making a difference other than very occasionally, and testing to try to exercise the case it is most likely to help did not show any performance difference from removing it. So remove the counting of active plugs and allow 'pending writes' to be activated at any time, not just when no plugs are active. This is only relevant when there is a write-intent bitmap, and the updating of the bitmap will likely introduce enough delay that the single-threading of bitmap updates will be enough to collect large numbers of updates together. Removing this will make it easier to centralise the unplug code, and will clear the way for other unplug enhancements which have a measurable effect. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Alexander Lyakas
When doing resync or repair, attempt to correct bad blocks, according to the WriteErrorSeen policy. Signed-off-by: Alex Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
-
By Akinobu Mita
Use memweight() to count the total number of bits set in a memory area. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Alasdair Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
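Usage is essentially a one-liner; the buffer name and size variable below are illustrative:

    #include <linux/string.h>

    /* Count all set bits in a byte buffer, replacing an open-coded per-byte loop. */
    size_t nbits = memweight(bitmap_buf, bitmap_bytes);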
-
By majianpeng
'sync' writes set both REQ_SYNC and REQ_NOIDLE. O_DIRECT writes set REQ_SYNC but not REQ_NOIDLE. We currently assume that a REQ_SYNC request will not be followed by more requests and so set STRIPE_PREREAD_ACTIVE to expedite the request. This is appropriate for sync requests, but not for O_DIRECT requests. So make the setting of STRIPE_PREREAD_ACTIVE conditional on REQ_NOIDLE rather than REQ_SYNC. This is consistent with the documented meaning of REQ_NOIDLE: __REQ_NOIDLE, /* don't anticipate more IO after this one */ Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
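Schematically the change is a one-flag swap; this is a sketch of the condition, not the literal raid5 diff:

    /* Before: any REQ_SYNC bio, including O_DIRECT writes, expedited preread */
    if (bi->bi_rw & REQ_SYNC)
            set_bit(STRIPE_PREREAD_ACTIVE, &sh->state);

    /* After: only bios that promise no follow-up IO */
    if (bi->bi_rw & REQ_NOIDLE)
            set_bit(STRIPE_PREREAD_ACTIVE, &sh->state);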
-
By NeilBrown
If a resync of a RAID1 array with 2 devices finds a known bad block on one device, it will neither read from, nor write to, that device for this block offset. So there will be one read_target (the other device) and zero write targets. This condition causes md/raid1 to abort the resync, assuming that it has finished - without known bad blocks this would be true. When there are no write targets because of the presence of bad blocks we should only skip over the area covered by the bad block. RAID10 already gets this right, raid1 doesn't. Or didn't. As this can cause a 'sync' to abort early and appear to have succeeded, it could lead to some data corruption, so it is suitable for -stable. Cc: stable@vger.kernel.org Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
-
By NeilBrown
do_md_stop tests mddev->openers while holding ->open_mutex, and fails if this count is too high. So callers do not need to check mddev->openers, and doing so isn't very meaningful as they don't hold ->open_mutex, so the number could change. So remove the unnecessary tests on mddev->openers. These are not called often enough for there to be any gain in an early test on ->open_mutex to avoid the need for a slightly more costly mutex_lock call. Signed-off-by: NeilBrown <neilb@suse.de>
-
By majianpeng
Because bios are merged at the block layer, a bio error may be caused by another bio that was merged into the same request. Using this flag, the exact error sector can be found, avoiding redundant operations like re-write and re-read. V0->V1: Use REQ_FLUSH instead of REQ_NOMERGE to avoid bio merging at the block layer. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
-
By Shaohua Li
For an SSD, if the request size exceeds a specific value (the optimal io size), request size isn't important for bandwidth. In such a condition, if making the request size bigger will cause some disks to be idle, the total throughput will actually drop. A good example is doing a readahead in a two-disk raid1 setup. So when should we split big requests? We absolutely don't want to split a big request into very small requests. Even on an SSD, big request transfers are more efficient. This patch only considers requests with size above the optimal io size. If all disks are busy, is it worth doing a split? Say the optimal io size is 16k, with two 32k requests and two disks. We can let each disk run one 32k request, or split the requests into four 16k requests and have each disk run two. It's hard to say which case is better, depending on hardware. So only consider the case where there are idle disks. For readahead, splitting is always better in this case. And in my test, the patch below can improve throughput by more than 30%. Hmm, not 100%, because the disk isn't 100% busy. Such a case can happen not just in readahead; for example, in directio. But I suppose directio usually will have bigger IO depth and make all disks busy, so I ignored it. Note: if the raid uses any hard disk, we don't prevent merging. That would make performance worse. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
-
By Shaohua Li
An SSD has no spindle, so the distance between requests means nothing. And the original distance-based algorithm can sometimes cause severe performance issues for SSD raid. Consider two thread groups, one accessing file A, the other accessing file B. The first group will access one disk and the second will access the other disk, because requests are near within one group and far between groups. In this case, read balance might keep one disk very busy but the other relatively idle. For SSD, we should try our best to distribute requests to as many disks as possible. There is no spindle move penalty anyway. With the patch below, I can see more than 50% throughput improvement sometimes, depending on workloads. The only exception is small requests that can be merged into a big request, which typically can drive higher throughput for SSD too. Such small requests are sequential reads. Unlike a hard disk, sequential reads which can't be merged (for example direct IO, or reads without readahead) can be ignored for SSD. Again there is no spindle move penalty. readahead dispatches small requests and such requests can be merged. The last patch can help detect sequential reads well, at least if the concurrent read number isn't greater than the raid disk number. In that case, the distance-based algorithm doesn't work well either. V2: For hard disk and SSD mixed raid, don't use the distance-based algorithm for random IO either. This makes the algorithm generic for raid with SSD. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
-
By Shaohua Li
Currently the sequential read detection is global. It's natural to make it per-disk, which can improve the detection for concurrent multiple sequential reads. And the next patch will make SSD read balance not use the distance-based algorithm, where this change helps detect truly sequential reads for SSD. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
-