1. 12 October 2012, 2 commits
  2. 27 September 2012, 9 commits
    • md/raid10: fix "enough" function for detecting if array is failed. · 80b48124
      NeilBrown committed
      The 'enough' function is written to work with 'near' arrays only
      in that it implicitly assumes that the offset from one 'group' of
      devices to the next is the same as the number of copies.
      In reality it is the number of 'near' copies.
      
      So change it to make this number explicit.
      
      This bug makes it possible to run arrays without enough drives
      present, which is dangerous.
      It is appropriate for an -stable kernel, but will almost certainly
      need to be modified for some of them.
      
      Cc: stable@vger.kernel.org
      Reported-by: Jakub Husák <jakub@gooseman.cz>
      Signed-off-by: NeilBrown <neilb@suse.de>
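
      A minimal C sketch of the corrected check, under simplified
      assumptions (a working[] flag array stands in for the rdev checks;
      this is an illustration, not the kernel's exact code):

        /* Every group of devices holding the same data must keep at
         * least one working member.  The bug was stepping to the next
         * group by the total copy count; the step must be near_copies. */
        static int enough_sketch(int raid_disks, int near_copies, int copies,
                                 const int *working /* 1 = device usable */)
        {
                int first = 0;
                do {
                        int n = copies, this = first, cnt = 0;
                        while (n--) {
                                if (working[this])
                                        cnt++;
                                this = (this + 1) % raid_disks;
                        }
                        if (cnt == 0)
                                return 0;       /* whole group failed */
                        first = (first + near_copies) % raid_disks;
                } while (first != 0);
                return 1;
        }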
    • dm verity: fix overflow check · 1d55f6bc
      Mikulas Patocka committed
      This patch fixes sector_t overflow checking in dm-verity.
      
      Without this patch, the code checks for overflow only if sector_t is
      smaller than long long, not if sector_t and long long have the same size.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
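
      A hedged sketch of the class of check involved (the typedef and
      function are illustrative, not dm-verity's actual code): casting
      back and comparing catches overflow whether sector_t is narrower
      than unsigned long long or exactly the same width, whereas a
      width-conditional test misses the equal-width case.

        typedef unsigned long sector_t;  /* width depends on kernel config */

        /* Return 0 and store the value if v fits in sector_t,
         * -1 on overflow. */
        static int fits_in_sector_t(unsigned long long v, sector_t *out)
        {
                sector_t s = (sector_t)v;
                if ((unsigned long long)s != v)
                        return -1;      /* overflow detected */
                *out = s;
                return 0;
        }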
    • dm thin: fix discard support for data devices · 0424caa1
      Mike Snitzer committed
      The discard limits that get established for a thin-pool or thin device
      may be incompatible with the pool's data device.  Avoid this by checking
      the discard limits of the pool's data device.  If an incompatibility is
      found then the pool's 'discard passdown' feature is disabled.
      
      Change thin_io_hints to ensure that a thin device always uses the same
      queue limits as its pool device.
      
      Introduce requested_pf to track whether or not the table line originally
      contained the no_discard_passdown flag and use this directly for table
      output.  We prepare the correct setting for discard_passdown directly in
      bind_control_target (called from pool_io_hints) and store it in
      adjusted_pf rather than waiting until we have access to pool->pf in
      pool_preresume.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
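
      A rough sketch of the compatibility test described above; the
      structure and names are invented for illustration and do not match
      dm-thin's internals:

        struct discard_caps {
                unsigned max_discard_sectors;   /* 0 = no discard support */
        };

        /* Can discards be passed straight down to the data device? */
        static int data_dev_supports_discard(const struct discard_caps *dd,
                                             unsigned pool_block_sectors)
        {
                if (!dd->max_discard_sectors)
                        return 0;   /* data device cannot discard */
                if (dd->max_discard_sectors < pool_block_sectors)
                        return 0;   /* cannot cover one pool block */
                return 1;
        }

        /* caller, roughly: if passdown was requested but unsupported,
         * clear adjusted_pf.discard_passdown and log a warning. */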
    • dm thin: tidy discard support · 9bc142dd
      Mike Snitzer committed
      A little thin discard code refactoring to make the next patch (dm thin:
      fix discard support for data devices) more readable.
      Pull out a couple of functions, and use bools instead of unsigned
      for feature flags.
      
      No functional changes.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: retain table limits when swapping to new table with no devices · 3ae70656
      Mike Snitzer committed
      Add a safety net that re-uses the DM device's existing limits in the
      event that the DM device has a temporary table without any component
      devices.  This reduces the chance that requests which do not respect
      the hardware limits will reach the device.
      
      DM recalculates queue limits based only on devices which currently
      exist in the table.  This creates a problem in the event all devices
      are temporarily removed, such as all paths being lost in multipath.
      DM will then reset the limits to the maximum permissible, allowing
      requests to be assembled which exceed the limits of the paths once
      they are restored.  Such a request fails the blk_rq_check_limits()
      test when sent to a path with lower limits, and is retried without
      end by multipath.  This became a much bigger issue after v3.6 commit
      fe86cdce ("block: do not artificially constrain max_sectors for
      stacking drivers").
      Reported-by: David Jeffery <djeffery@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
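
      A hedged sketch of the safety net, with stand-in types (all names
      invented for illustration):

        struct queue_limits { unsigned max_sectors; unsigned max_segments; };

        /* stand-in for DM's normal limit stacking over table devices */
        static void stack_limits_from_devices(struct queue_limits *out)
        {
                out->max_sectors = ~0u;      /* placeholder values */
                out->max_segments = 128;
        }

        /* With no component devices, retain the live queue's limits
         * instead of resetting them to the maximum. */
        static void compute_new_limits(unsigned table_device_count,
                                       const struct queue_limits *live,
                                       struct queue_limits *out)
        {
                if (table_device_count == 0)
                        *out = *live;
                else
                        stack_limits_from_devices(out);
        }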
    • dm table: clear add_random unless all devices have it set · c3c4555e
      Milan Broz committed
      Always clear QUEUE_FLAG_ADD_RANDOM if any underlying device does not
      have it set. Otherwise devices with predictable characteristics may
      contribute entropy.
      
      QUEUE_FLAG_ADD_RANDOM specifies whether or not queue IO timings
      contribute to the random pool.
      
      For bio-based targets this flag is always 0 because such devices have no
      real queue.
      
      For request-based devices this flag was always set to 1 by default.
      
      Now set it according to the flags on underlying devices.  If there is
      at least one device which should not contribute, set the flag to
      zero: if a device, such as fast SSD storage, is not suitable for
      supplying entropy, a request-based queue stacked over it will not be
      suitable either.
      
      Because the checking logic is exactly the same as for the rotational
      flag, share the iteration function with device_is_nonrot().
      Signed-off-by: Milan Broz <mbroz@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
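
      A small sketch of the rule (illustrative only): the flag survives
      only if every underlying device has it set.

        /* Keep QUEUE_FLAG_ADD_RANDOM only when all table devices have it;
         * one unsuitable device (e.g. an SSD) clears it for the stack. */
        static int all_devices_add_random(const int *dev_add_random,
                                          unsigned ndevs)
        {
                unsigned i;
                for (i = 0; i < ndevs; i++)
                        if (!dev_add_random[i])
                                return 0;
                return 1;
        }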
    • dm: handle requests beyond end of device instead of using BUG_ON · ba1cbad9
      Mike Snitzer committed
      The access-beyond-end-of-device BUG_ON that was introduced to
      dm_request_fn via commit 29e4013d ("dm: implement
      REQ_FLUSH/FUA support for request-based dm") was an overly
      drastic (but simple) response to this situation.
      
      I have received a report that this BUG_ON was hit and now think
      it would be better to use dm_kill_unmapped_request() to fail the clone
      and original request with -EIO.
      
      map_request() will assign the valid target returned by
      dm_table_find_target to tio->ti.  But when the target isn't valid,
      tio->ti is never assigned (because map_request isn't called); so
      add a check for tio->ti != NULL to dm_done().
      Reported-by: Mike Christie <michaelc@cs.wisc.edu>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: stable@vger.kernel.org # v2.6.37+
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
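
      A self-contained sketch of the gentler failure mode (names and the
      errno handling are illustrative; the real code fails the clone via
      dm_kill_unmapped_request()):

        /* Fail an out-of-range request with -EIO instead of BUG_ON. */
        static int check_request_range(unsigned long long pos_sectors,
                                       unsigned long long len_sectors,
                                       unsigned long long dev_sectors)
        {
                if (pos_sectors + len_sectors > dev_sectors)
                        return -5;      /* -EIO: kill clone and original */
                return 0;               /* safe: proceed to map_request() */
        }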
    • dm mpath: only retry ioctl when no paths if queue_if_no_path set · 7ba10aa6
      Mike Snitzer committed
      When there are no paths and multipath receives an ioctl, it waits until
      a path becomes available.  This behaviour is incorrect if the
      "queue_if_no_path" setting was not specified, as then the ioctl should
      be rejected immediately, which this patch now does.
      
      commit 35991652 ("dm mpath: allow ioctls to trigger pg init") should
      have checked if queue_if_no_path was configured before queueing IO.
      
      Checking for the queue_if_no_path feature, as is done in map_io(),
      allows the following table load to work without blocking in the
      multipath_ioctl retry loop:
      
        echo "0 1024 multipath 0 0 0 0" | dmsetup create mpath_nodevs
      
      Without this fix the multipath_ioctl will block with the following stack
      trace:
      
        blkid           D 0000000000000002     0 23936      1 0x00000000
         ffff8802b89e5cd8 0000000000000082 ffff8802b89e5fd8 0000000000012440
         ffff8802b89e4010 0000000000012440 0000000000012440 0000000000012440
         ffff8802b89e5fd8 0000000000012440 ffff88030c2aab30 ffff880325794040
        Call Trace:
         [<ffffffff814ce099>] schedule+0x29/0x70
         [<ffffffff814cc312>] schedule_timeout+0x182/0x2e0
         [<ffffffff8104dee0>] ? lock_timer_base+0x70/0x70
         [<ffffffff814cc48e>] schedule_timeout_uninterruptible+0x1e/0x20
         [<ffffffff8104f840>] msleep+0x20/0x30
         [<ffffffffa0000839>] multipath_ioctl+0x109/0x170 [dm_multipath]
         [<ffffffffa06bfb9c>] dm_blk_ioctl+0xbc/0xd0 [dm_mod]
         [<ffffffff8122a408>] __blkdev_driver_ioctl+0x28/0x30
         [<ffffffff8122a79e>] blkdev_ioctl+0xce/0x730
         [<ffffffff811970ac>] block_ioctl+0x3c/0x40
         [<ffffffff8117321c>] do_vfs_ioctl+0x8c/0x340
         [<ffffffff81166293>] ? sys_newfstat+0x33/0x40
         [<ffffffff81173571>] sys_ioctl+0xa1/0xb0
         [<ffffffff814d70a9>] system_call_fastpath+0x16/0x1b
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 3.5+
      Acked-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
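
      A sketch of the corrected retry decision (names and the errno value
      are illustrative, not the dm-mpath source):

        /* 0 = issue the ioctl now, -5 (-EIO) = fail immediately,
         * 1 = sleep and retry while waiting for a path. */
        static int ioctl_retry_decision(int has_usable_path,
                                        int queue_if_no_path)
        {
                if (has_usable_path)
                        return 0;
                if (!queue_if_no_path)
                        return -5;   /* the fix: reject instead of waiting */
                return 1;
        }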
    • dm thin: do not set discard_zeroes_data · 307615a2
      Mike Snitzer committed
      The dm thin pool target claims to support the zeroing of discarded
      data areas.  This turns out to be incorrect when processing discards
      that do not exactly cover a complete number of blocks, so the target
      must always set discard_zeroes_data_unsupported.
      
      The thin pool target will zero blocks when they are allocated if the
      skip_block_zeroing feature is not specified.  The block layer
      may send a discard that only partly covers a block.  If a thin pool
      block is partially discarded then there is no guarantee that the
      discarded data will get zeroed before it is accessed again.
      Due to this, thin devices cannot claim that discards will always zero data.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Cc: stable@vger.kernel.org # 3.4+
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
  3. 24 September 2012, 1 commit
    • md/raid5: add missing spin_lock_init. · cb13ff69
      NeilBrown committed
      commit b17459c0
         raid5: add a per-stripe lock
      
      added a spin_lock to the 'stripe_head' struct.
      Unfortunately there are two places where this struct is allocated
      but the spin lock was only initialised in one of them.
      
      So add the missing spin_lock_init.
      Signed-off-by: NeilBrown <neilb@suse.de>
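
      A userspace analogue of the rule the fix restores (a pthread mutex
      stands in for the kernel spinlock): every allocation path must
      initialise the embedded lock.

        #include <pthread.h>
        #include <stdlib.h>

        struct stripe_sketch {
                pthread_mutex_t stripe_lock;
                /* ... other per-stripe state ... */
        };

        static struct stripe_sketch *alloc_stripe(void)
        {
                struct stripe_sketch *sh = calloc(1, sizeof(*sh));
                if (sh)
                        pthread_mutex_init(&sh->stripe_lock, NULL);
                return sh;      /* a second allocator skipping the init
                                 * was exactly the bug */
        }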
  4. 20 September 2012, 1 commit
  5. 19 September 2012, 3 commits
    • md: make sure metadata is updated when spares are activated or removed. · 6dafab6b
      NeilBrown committed
      It isn't always necessary to update the metadata when spares are
      removed as the presence-or-not of a spare isn't really important to
      the integrity of an array.
      Also activating a spare doesn't always require updating the metadata
      as the update on 'recovery-completed' is usually sufficient.
      
      However the introduction of 'replacement' devices has made these
      transitions sometimes more important.  For example the 'Replacement'
      flag isn't cleared until the original device is removed, so we need
      to ensure a metadata update after that 'spare' is removed.
      
      So set MD_CHANGE_DEVS whenever a spare is activated or removed, to
      complement the current situation where it is set when a spare is added
      or a device is failed (or a number of other less common situations).
      
      This is suitable for -stable as out-of-date metadata could lead
      to data corruption.
      This is only relevant for 3.3 and later, when 'replacement' was
      introduced.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: fix calculation of 'degraded' when a replacement becomes active. · e5c86471
      NeilBrown committed
      When a replacement device becomes active, we mark the device that it
      replaces as 'faulty' so that it can subsequently get removed.
      However 'calc_degraded' only pays attention to the primary device, not
      the replacement, so the array appears to become degraded, which is
      wrong.
      
      So teach 'calc_degraded' to consider any replacement if a primary
      device is faulty.
      
      This is suitable for -stable as an incorrect 'degraded' value can
      confuse md and could lead to data corruption.
      This is only relevant for 3.3 and later.
      
      Cc: stable@vger.kernel.org
      Reported-by: Robin Hill <robin@robinhill.me.uk>
      Reported-by: John Drescher <drescherjm@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
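
      A sketch of the corrected counting rule (arrays of flags stand in
      for the rdev/replacement pointers):

        /* A slot is degraded only when neither the primary device nor
         * its replacement is usable. */
        static int calc_degraded_sketch(const int *primary_ok,
                                        const int *replacement_ok,
                                        int raid_disks)
        {
                int i, degraded = 0;
                for (i = 0; i < raid_disks; i++)
                        if (!primary_ok[i] && !replacement_ok[i])
                                degraded++;
                return degraded;
        }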
    • Revert "md/raid5: For odirect-write performance, do not set STRIPE_PREREAD_ACTIVE." · a852d7b8
      NeilBrown committed
      This reverts commit 895e3c5c.
      
      While this patch seemed like a good idea and did help some workloads,
      it hurts other workloads.
      Large sequential O_DIRECT writes were faster; small random O_DIRECT
      writes were slower.
      
      Other changes (batching RAID5 writes) have improved the sequential
      writes using a different mechanism, so the net result of this patch
      is definitely negative.  So revert it.
      Reported-by: Shaohua Li <shli@kernel.org>
      Tested-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  6. 09 September 2012, 4 commits
  7. 21 August 2012, 1 commit
    • workqueue: deprecate flush[_delayed]_work_sync() · 43829731
      Tejun Heo committed
      flush[_delayed]_work_sync() are now spurious.  Mark them deprecated
      and convert all users to flush[_delayed]_work().
      
      If you're cc'd and wondering what's going on: Now all workqueues are
      non-reentrant and the regular flushes guarantee that the work item is
      not pending or running on any CPU on return, so there's no reason to
      use the sync flushes at all and they're going away.
      
      This patch doesn't make any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Ian Campbell <ian.campbell@citrix.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Mattia Dongili <malattia@linux.it>
      Cc: Kent Yoder <key@linux.vnet.ibm.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Karsten Keil <isdn@linux-pingi.de>
      Cc: Bryan Wu <bryan.wu@canonical.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
      Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: linux-wireless@vger.kernel.org
      Cc: Anton Vorontsov <cbou@mail.ru>
      Cc: Sangbeom Kim <sbkim73@samsung.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Petr Vandrovec <petr@vandrovec.name>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Avi Kivity <avi@redhat.com> 
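
      The conversion is mechanical, for example:

        flush_work_sync(&work);            /* before */
        flush_work(&work);                 /* after  */

        flush_delayed_work_sync(&dwork);   /* before */
        flush_delayed_work(&dwork);        /* after  */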
  8. 18 August 2012, 1 commit
    • md/raid10: fix problem with on-stack allocation of r10bio structure. · e0ee7785
      NeilBrown committed
      A 'struct r10bio' has an array of per-copy information at the end.
      This array is declared with size [0] and r10bio_pool_alloc allocates
      enough extra space to store the per-copy information depending on the
      number of copies needed.
      
      So declaring a 'struct r10bio' on the stack isn't going to work.  It
      won't allocate enough space, and memory corruption will ensue.
      
      So in the two places where this is done, declare a sufficiently large
      structure and use that instead.
      
      The two call-sites of this bug were introduced in 3.4 and 3.5
      so this is suitable for both those kernels.  The patch will have to
      be modified for 3.4 as it only has one bug.
      
      Cc: stable@vger.kernel.org
      Reported-by: Ivan Vasilyev <ivan.vasilyev@gmail.com>
      Tested-by: Ivan Vasilyev <ivan.vasilyev@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
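
      A compilable model of both the bug and the fix pattern (field names
      are invented; the kernel struct is r10bio with a zero-length devs[]
      array, a GNU C idiom):

        struct copy_info { void *bio; };

        struct r10bio_model {
                int copies;
                struct copy_info devs[0];  /* space added at allocation */
        };

        /* Bug: sizeof(struct r10bio_model) excludes the per-copy slots,
         * so an on-stack variable is too small and writes to devs[]
         * corrupt the stack.
         *
         * Fix pattern: embed explicit trailing space when the structure
         * must live on the stack. */
        #define MAX_COPIES 16
        struct on_stack_r10bio {
                struct r10bio_model r10bio;
                struct copy_info devs[MAX_COPIES];
        };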
  9. 16 August 2012, 1 commit
    • md: Don't truncate size at 4TB for RAID0 and Linear · 667a5313
      NeilBrown committed
      commit 27a7b260
         md: Fix handling for devices from 2TB to 4TB in 0.90 metadata.
      
      changed 0.90 metadata handling to truncate the size to 4TB as that is
      all that 0.90 can record.
      However for RAID0 and Linear, 0.90 doesn't need to record the size, so
      this truncation is not needed and causes working arrays to become too small.
      
      So avoid the truncation for RAID0 and Linear.
      
      This bug was introduced in 3.1 and is suitable for any stable kernels
      from then onwards.
      As the offending commit was tagged for 'stable', any stable kernel
      that it was applied to should also get this patch.  That includes
      at least 2.6.32, 2.6.33 and 3.0. (Thanks to Ben Hutchings for
      providing that list).
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Neil Brown <neilb@suse.de>
  10. 02 August 2012, 4 commits
  11. 01 August 2012, 1 commit
  12. 31 July 2012, 12 commits
    • blk: pass from_schedule to non-request unplug functions. · 74018dc3
      NeilBrown committed
      This will allow md/raid to know why the unplug was called, and so
      it will be able to act accordingly: if !from_schedule it is safe to
      perform tasks which could themselves schedule.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
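
      A sketch of a callback honouring the new argument (stand-in types;
      the real callback receives a struct blk_plug_cb * and a bool):

        static void flush_now(void *data)
        { (void)data; /* may sleep: do the real work */ }

        static void defer_to_worker(void *data)
        { (void)data; /* atomic-safe: punt the work to a thread */ }

        static void unplug_cb(void *cb_data, int from_schedule)
        {
                if (from_schedule)
                        defer_to_worker(cb_data);  /* cannot sleep here */
                else
                        flush_now(cb_data);        /* safe to block */
        }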
    • blk: centralize non-request unplug handling. · 9cbb1750
      NeilBrown committed
      Both md and umem have similar code for getting notified on a
      blk_finish_plug event.
      Centralize this code in block/ and allow each driver to
      provide its distinctive difference.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md: remove plug_cnt feature of plugging. · 0021b7bc
      NeilBrown committed
      This seemed like a good idea at the time, but after further thought I
      cannot see it making a difference other than very occasionally and
      testing to try to exercise the case it is most likely to help did not
      show any performance difference by removing it.
      
      So remove the counting of active plugs and allow 'pending writes' to
      be activated at any time, not just when no plugs are active.
      
      This is only relevant when there is a write-intent bitmap, and the
      updating of the bitmap will likely introduce enough delay that
      the single-threading of bitmap updates will be enough to collect large
      numbers of updates together.
      
      Removing this will make it easier to centralise the unplug code, and
      will clear the way for other unplug enhancements which have a
      measurable effect.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md/RAID1: Add missing case for attempting to repair known bad blocks. · d57368af
      Alexander Lyakas committed
      When doing resync or repair, attempt to correct bad blocks according
      to the WriteErrorSeen policy.
      Signed-off-by: Alex Lyakas <alex.bolshoy@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • dm: use memweight() · 8fb980e3
      Akinobu Mita committed
      Use memweight() to count the total number of bits set in a memory area.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
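
      A userspace model of what lib/memweight.c provides, for readers
      unfamiliar with the helper (the kernel version is word-optimised;
      this shows the semantics only):

        #include <stddef.h>

        /* Count all set bits in an arbitrary memory area. */
        static size_t memweight_model(const void *ptr, size_t bytes)
        {
                const unsigned char *p = ptr;
                size_t bits = 0;
                while (bytes--) {
                        unsigned char b = *p++;
                        while (b) {
                                b &= b - 1;  /* clear lowest set bit */
                                bits++;
                        }
                }
                return bits;
        }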
    • md/raid5: For odirect-write performance, do not set STRIPE_PREREAD_ACTIVE. · 895e3c5c
      majianpeng committed
      'sync' writes set both REQ_SYNC and REQ_NOIDLE.
      O_DIRECT writes set REQ_SYNC but not REQ_NOIDLE.
      
      We currently assume that a REQ_SYNC request will not be followed by
      more requests and so set STRIPE_PREREAD_ACTIVE to expedite the
      request.
      This is appropriate for sync requests, but not for O_DIRECT requests.
      
      So make the setting of STRIPE_PREREAD_ACTIVE conditional on REQ_NOIDLE
      rather than REQ_SYNC.  This is consistent with the documented meaning
      of REQ_NOIDLE:
      
              __REQ_NOIDLE,           /* don't anticipate more IO after this one */
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
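
      A self-contained sketch of the changed condition (flag values are
      illustrative, not the kernel's):

        #define REQ_SYNC   (1u << 0)
        #define REQ_NOIDLE (1u << 1)

        /* Expedite the stripe only when no further I/O is anticipated.
         * O_DIRECT writes set REQ_SYNC but not REQ_NOIDLE, so they no
         * longer trigger STRIPE_PREREAD_ACTIVE. */
        static int should_expedite(unsigned bio_flags)
        {
                return (bio_flags & REQ_NOIDLE) != 0;  /* was: REQ_SYNC */
        }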
    • md/raid1: don't abort a resync on the first badblock. · b7219ccb
      NeilBrown committed
      If a resync of a RAID1 array with 2 devices finds a known bad block
      on one device it will neither read from, nor write to, that device
      for this block offset.
      So there will be one read_target (the other device) and zero write
      targets.
      This condition causes md/raid1 to abort the resync, assuming that it
      has finished - without known bad blocks this would be true.
      
      When there are no write targets because of the presence of bad blocks
      we should only skip over the area covered by the bad block.
      RAID10 already gets this right, raid1 doesn't.  Or didn't.
      
      As this can cause a 'sync' to abort early and appear to have succeeded
      it could lead to some data corruption, so it is suitable for -stable.
      
      Cc: stable@vger.kernel.org
      Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: remove duplicated test on ->openers when calling do_md_stop() · 90cf195d
      NeilBrown committed
      do_md_stop tests mddev->openers while holding ->open_mutex,
      and fails if this count is too high.
      So callers do not need to check mddev->openers and doing so isn't
      very meaningful as they don't hold ->open_mutex so the number could
      change.
      
      So remove the unnecessary tests on mddev->openers.
      These are not called often enough for there to be any gain in
      an early test on ->open_mutex to avoid the need for a slightly more
      costly mutex_lock call.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: Add R5_ReadNoMerge flag which prevents bio from merging at block layer · 3f9e7c14
      majianpeng committed
      Because bios are merged at the block layer, a bio error may be
      caused by another bio that was merged into the same request.  With
      this flag set, the exact error sector can be found, avoiding
      redundant operations such as re-writes and re-reads.
      
      V0->V1: use REQ_FLUSH instead of REQ_NOMERGE to avoid bio merging
      at the block layer.
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
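
      A sketch of how such a flag is applied when resubmitting a read
      (flag values are illustrative; the patch reuses REQ_FLUSH because
      the block layer never merges flush requests):

        #define REQ_FLUSH      (1ul << 2)
        #define R5_ReadNoMerge 3    /* per-device flag bit */

        /* When retrying a failed read to isolate the bad sector, keep
         * the bio un-merged so any error maps to exactly one bio. */
        static unsigned long retry_rw_flags(unsigned long rw,
                                            unsigned long dev_flags)
        {
                if (dev_flags & (1ul << R5_ReadNoMerge))
                        rw |= REQ_FLUSH;
                return rw;
        }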
    • md/raid1: prevent merging too large request · 12cee5a8
      Shaohua Li committed
      For SSDs, if the request size exceeds a specific value (the optimal
      IO size), request size isn't important for bandwidth.  In that
      condition, if making the request bigger causes some disks to idle,
      total throughput will actually drop.  A good example is doing a
      readahead in a two-disk raid1 setup.
      
      So when should we split big requests?  We absolutely don't want to
      split big requests into very small requests; even on SSD, big
      request transfers are more efficient.  This patch only considers
      requests with size above the optimal IO size.
      
      If all disks are busy, is it worth doing a split?  Say the optimal
      IO size is 16k, with two 32k requests and two disks.  We can let
      each disk run one 32k request, or split the requests into four 16k
      requests so each disk runs two.  It's hard to say which case is
      better; it depends on the hardware.
      
      So only consider the case where there are idle disks.  For
      readahead, a split is always better in this case.  And in my test,
      the patch below can improve throughput by more than 30%.  Hmm, not
      100%, because the disk isn't 100% busy.
      
      Such a case can happen not just with readahead; direct IO is
      another example.  But I suppose direct IO usually has a bigger IO
      depth and keeps all disks busy, so I ignored it.
      
      Note: if the raid uses any hard disk, we don't prevent merging.
      That would make performance worse.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
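
      A sketch of the resulting heuristic (thresholds and names are
      illustrative):

        /* Decide how many sectors of a large read to issue to one
         * mirror; the remainder is balanced onto another disk. */
        static unsigned sectors_to_issue(unsigned req_sectors,
                                         unsigned opt_io_sectors,
                                         int array_has_spinning_disk,
                                         int an_idle_disk_exists)
        {
                if (array_has_spinning_disk)
                        return req_sectors;  /* HDDs: merging always helps */
                if (req_sectors <= opt_io_sectors || !an_idle_disk_exists)
                        return req_sectors;  /* nothing to gain by splitting */
                return opt_io_sectors;       /* split above the optimal size */
        }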
    • md/raid1: read balance chooses idlest disk for SSD · 9dedf603
      Shaohua Li committed
      An SSD has no spindle, so the distance between requests means
      nothing.  And the original distance-based algorithm can sometimes
      cause severe performance issues for SSD raid.
      
      Consider two thread groups: one accesses file A, the other accesses
      file B.  The first group will access one disk and the second will
      access the other disk, because requests from one group are near each
      other and requests between groups are far apart.  In this case, read
      balance might keep one disk very busy while the other stays
      relatively idle.  For SSD, we should try our best to distribute
      requests across as many disks as possible; there is no spindle-move
      penalty anyway.
      
      With the patch below, I can see more than 50% throughput improvement
      sometimes, depending on the workload.
      
      The only exception is small requests that can be merged into a big
      request, which typically drives higher throughput for SSD too.  Such
      small requests are sequential reads.  Unlike on a hard disk,
      sequential reads which can't be merged (for example direct IO, or
      reads without readahead) can be ignored for SSD; again there is no
      spindle-move penalty.  Readahead dispatches small requests, and such
      requests can be merged.
      
      The last patch can help detect sequential reads well, at least if the
      number of concurrent reads isn't greater than the number of raid
      disks.  In that case the distance-based algorithm doesn't work well
      either.
      
      V2: for a mixed hard disk and SSD raid, don't use the distance-based
      algorithm for random IO either.  This makes the algorithm generic
      for raid arrays with SSDs.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
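
      A sketch of the SSD selection rule (illustrative only):

        /* For SSDs, ignore seek distance and pick the mirror with the
         * fewest in-flight requests. */
        static int choose_idlest_disk(const unsigned *nr_pending, int ndisks)
        {
                int d, best = 0;
                for (d = 1; d < ndisks; d++)
                        if (nr_pending[d] < nr_pending[best])
                                best = d;
                return best;
        }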
    • md/raid1: make sequential read detection per disk based · be4d3280
      Shaohua Li committed
      Currently the sequential read detection is global. It's natural to
      make it per-disk based, which can improve the detection for
      concurrent multiple sequential reads. And the next patch will make
      SSD read balance not use the distance-based algorithm, where this
      change helps detect truly sequential reads for SSD.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
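
      A sketch of the per-disk detection state (names illustrative):

        /* Track the next expected sector per mirror; a read is
         * sequential on disk d if it starts where d's previous
         * read ended. */
        struct mirror_state { unsigned long long head_position; };

        static int is_sequential(struct mirror_state *m,
                                 unsigned long long this_sector,
                                 unsigned sectors)
        {
                int seq = (this_sector == m->head_position);
                m->head_position = this_sector + sectors;
                return seq;
        }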