1. 02 Aug 2012, 1 commit
    • md/raid1: submit IO from originating thread instead of md thread. · f54a9d0e
      Committed by NeilBrown
      queuing writes to the md thread means that all requests go through the
      one processor which may not be able to keep up with very high request
      rates.
      
      So use the plugging infrastructure to submit all requests on unplug.
      If a 'schedule' is needed, we fall back on the old approach of handing
      the requests to the thread for it to handle (sketched after this entry).
      Signed-off-by: NeilBrown <neilb@suse.de>
      f54a9d0e
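      The decision above can be pictured with a small user-space sketch (all
      names here are hypothetical stand-ins, not the kernel API; the real code
      hooks into the block layer's per-task plugging callbacks):

        #include <stdbool.h>
        #include <stdio.h>

        /* Hypothetical model: writes queued on a per-task "plug" list are
         * issued by the same task at unplug time; only when the unplug runs
         * from a context that must not submit IO directly (from_schedule)
         * are they handed to the single md thread, as before. */
        struct plug_list { long sectors[16]; int n; };

        static void issue_io(long sector)          { printf("issue %ld from this CPU\n", sector); }
        static void hand_to_md_thread(long sector) { printf("hand %ld to md thread\n", sector); }

        static void unplug_sketch(struct plug_list *pl, bool from_schedule)
        {
                for (int i = 0; i < pl->n; i++) {
                        if (from_schedule)
                                hand_to_md_thread(pl->sectors[i]);
                        else
                                issue_io(pl->sectors[i]);
                }
                pl->n = 0;
        }

        int main(void)
        {
                struct plug_list pl = { .sectors = {8, 16, 24}, .n = 3 };
                unplug_sketch(&pl, false);      /* normal unplug: submit here */
                return 0;
        }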
  2. 31 Jul 2012, 8 commits
    • md: remove plug_cnt feature of plugging. · 0021b7bc
      Committed by NeilBrown
      This seemed like a good idea at the time, but after further thought I
      cannot see it making a difference other than very occasionally, and
      testing intended to exercise the case it is most likely to help did not
      show any performance difference from removing it.
      
      So remove the counting of active plugs and allow 'pending writes' to
      be activated at any time, not just when no plugs are active.
      
      This is only relevant when there is a write-intent bitmap, and the
      updating of the bitmap will likely introduce enough delay that
      the single-threading of bitmap updates will be enough to collect large
      numbers of updates together.
      
      Removing this will make it easier to centralise the unplug code, and
      will clear the way for other unplug enhancements which have a
      measurable effect.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0021b7bc
    • md/RAID1: Add missing case for attempting to repair known bad blocks. · d57368af
      Committed by Alexander Lyakas
      When doing a resync or repair, attempt to correct known bad blocks,
      according to the WriteErrorSeen policy.
      Signed-off-by: Alex Lyakas <alex.bolshoy@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      d57368af
    • md/raid1: don't abort a resync on the first badblock. · b7219ccb
      Committed by NeilBrown
      If a resync of a RAID1 array with 2 devices finds a known bad block on
      one device, it will neither read from nor write to that device for
      this block offset.
      So there will be one read target (the other device) and zero write
      targets.
      This condition causes md/raid1 to abort the resync, assuming that it
      has finished - without known bad blocks this would be true.
      
      When there are no write targets because of the presence of bad blocks
      we should only skip over the area covered by the bad block (see the
      sketch after this entry).
      RAID10 already gets this right, raid1 doesn't.  Or didn't.
      
      As this can cause a 'sync' to abort early and appear to have succeeded
      it could lead to some data corruption, so it is suitable for -stable.
      
      Cc: stable@vger.kernel.org
      Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      b7219ccb
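      A rough user-space model of that rule (the names are invented for the
      sketch, not taken from the driver): with zero write targets because of a
      known bad block, advance past the bad range instead of treating the whole
      resync as complete.

        /* Where should the next resync window start?  Before the fix, a
         * window with no write targets was taken to mean the resync had
         * finished; instead we just step over the known bad block. */
        long next_resync_pos(long pos, long window, int write_targets,
                             long bad_start, long bad_len)
        {
                if (write_targets > 0)
                        return pos + window;     /* normal progress */
                if (pos < bad_start)
                        return bad_start;        /* sync the good sectors first */
                return bad_start + bad_len;      /* then skip over the bad block */
        }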
    • md/raid1: prevent merging too large request · 12cee5a8
      Committed by Shaohua Li
      For an SSD, once the request size exceeds a certain value (the optimal io
      size), request size no longer matters for bandwidth. In that situation, if
      making requests bigger leaves some disks idle, total throughput actually
      drops. A good example is readahead in a two-disk raid1 setup.
      
      So when should we split big requests? We absolutely don't want to split big
      requests into very small ones; even on an SSD, big transfers are more
      efficient. This patch only considers requests larger than the optimal io size.
      
      If all disks are busy, is it worth splitting? Say the optimal io size is 16k,
      and we have two 32k requests and two disks. We can let each disk run one 32k
      request, or split them into four 16k requests so each disk runs two. Which
      case is better is hard to say and depends on the hardware.
      
      So only consider the case where there are idle disks. For readahead, a split
      is always better in this case, and in my tests the patch below improves
      throughput by more than 30% (not 100%, because the disks aren't 100% busy).
      
      Such a case can happen not just with readahead but also, for example, with
      direct IO. However, direct IO usually has a bigger IO depth and keeps all
      disks busy, so I ignored it.
      
      Note: if the raid uses any hard disk, we don't prevent merging; that would
      make performance worse (the rule is sketched after this entry).
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      12cee5a8
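      A hedged sketch of the rule in plain C (names invented): only keep a
      request from growing when every member is an SSD, it already exceeds the
      optimal IO size, and at least one disk is idle to take the split-off part.

        #include <stdbool.h>

        struct disk_state {
                bool nonrot;      /* non-rotational, i.e. SSD */
                int  pending;     /* requests currently in flight on this disk */
        };

        /* Should a request already 'sectors' long be prevented from merging
         * further?  Any spinning disk disables the limit entirely; otherwise
         * limit only past the optimal IO size and only when an idle disk
         * exists that could service the other half. */
        bool limit_merge(const struct disk_state *disks, int ndisks,
                         int sectors, int opt_io_sectors)
        {
                bool any_idle = false;

                for (int i = 0; i < ndisks; i++) {
                        if (!disks[i].nonrot)
                                return false;    /* hard disk present: never limit */
                        if (disks[i].pending == 0)
                                any_idle = true;
                }
                return any_idle && sectors > opt_io_sectors;
        }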
    • md/raid1: read balance chooses idlest disk for SSD · 9dedf603
      Committed by Shaohua Li
      An SSD has no spindle, so the distance between requests means nothing, and
      the original distance-based algorithm can sometimes cause severe performance
      problems for an SSD raid.
      
      Consider two thread groups, one accessing file A and the other accessing
      file B. The first group will hit one disk and the second the other disk,
      because requests within a group are close together and requests between
      groups are far apart. In this case read balance can keep one disk very busy
      while the other stays relatively idle. For SSDs we should instead try to
      distribute requests across as many disks as possible; there is no spindle
      move penalty anyway.
      
      With the patch below I sometimes see more than 50% throughput improvement,
      depending on the workload.
      
      The only exception is small requests that can be merged into a big request,
      which typically drives higher throughput for SSDs too. Such small requests
      are sequential reads. Unlike on a hard disk, sequential reads that can't be
      merged (for example direct IO, or reads without readahead) can be ignored
      for SSDs; again there is no spindle move penalty. Readahead dispatches small
      requests, and such requests can be merged.
      
      The last patch helps detect sequential reads well, at least when the number
      of concurrent readers isn't greater than the number of raid disks. In that
      case the distance-based algorithm doesn't work well either.
      
      V2: for a raid mixing hard disks and SSDs, don't use the distance-based
      algorithm for random IO either. This makes the algorithm generic for any
      raid containing an SSD (the selection rule is sketched after this entry).
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      9dedf603
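      The selection rule can be sketched like this (user-space model with
      invented names): a read that continues a sequential stream stays on its
      disk so it can still be merged; everything else goes to the SSD with the
      fewest pending requests rather than the "closest" one.

        #include <limits.h>
        #include <stdbool.h>

        struct mirror {
                bool nonrot;         /* SSD? */
                int  pending;        /* requests in flight */
                long next_seq_sect;  /* where a sequential stream on this disk continues */
        };

        /* Return the disk to read 'sector' from, or -1 to fall back to the
         * distance-based heuristic used for spinning disks. */
        int choose_read_disk(const struct mirror *m, int ndisks, long sector)
        {
                int best = -1, min_pending = INT_MAX;

                for (int i = 0; i < ndisks; i++) {
                        if (m[i].next_seq_sect == sector)
                                return i;        /* sequential read: keep it here */
                        if (m[i].nonrot && m[i].pending < min_pending) {
                                min_pending = m[i].pending;
                                best = i;
                        }
                }
                return best;
        }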
    • md/raid1: make sequential read detection per disk based · be4d3280
      Committed by Shaohua Li
      Currently sequential read detection is global. It's natural to make it per
      disk based, which improves the detection of multiple concurrent sequential
      reads. The next patch will make SSD read balance stop using the distance
      based algorithm, where this change helps detect truly sequential reads for
      an SSD (a sketch follows this entry).
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      be4d3280
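      A minimal sketch of per-disk detection (field names are illustrative): a
      read is sequential on a disk exactly when it starts where that disk's
      previous read ended.

        #include <stdbool.h>

        struct mirror_seq {
                long next_seq_sect;  /* sector at which a sequential read would continue */
        };

        /* Record this read against one disk and report whether it continued
         * that disk's stream.  Tracking this per disk, rather than globally,
         * stops concurrent sequential readers from masking each other. */
        bool note_read(struct mirror_seq *m, long sector, int sectors)
        {
                bool seq = (m->next_seq_sect == sector);

                m->next_seq_sect = sector + sectors;
                return seq;
        }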
    • MD: Move macros from raid1*.h to raid1*.c · 473e87ce
      Committed by Jonathan Brassow
      MD RAID1/RAID10: Move some macros from .h file to .c file
      
      There are three macros (IO_BLOCKED, IO_MADE_GOOD, BIO_SPECIAL) which are
      defined in both raid1.h and raid10.h.  They are only used in their
      respective .c files.  However, if we wish to make RAID10 accessible to the
      device-mapper RAID target (dm-raid.c), then we need to move these macros
      into the .c files where they are used so that they do not conflict with
      each other.
      
      The macros from the two files are identical and could be moved into md.h,
      but I chose to leave the duplication and have them remain in the
      personality files (the definitions are quoted after this entry).
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      473e87ce
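      For reference, the duplicated definitions are small sentinel values along
      these lines (quoted from memory of the RAID1/RAID10 personality code, so
      treat the exact comments as approximate):

        /* Special values stored in place of a real bio pointer so the
         * personality can tell "no bio, for a reason" apart from a genuine
         * request. */
        #define IO_BLOCKED       ((struct bio *)1)  /* device unusable for this IO */
        #define IO_MADE_GOOD     ((struct bio *)2)  /* write succeeded; bad block can be cleared */
        #define BIO_SPECIAL(bio) ((unsigned long)(bio) <= 2)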
    • MD RAID1: rename mirror_info structure · 0eaf822c
      Committed by Jonathan Brassow
      MD RAID1: Rename the structure 'mirror_info' to 'raid1_info'
      
      The same structure name ('mirror_info') is used by raid10.  Each of these
      structures is defined in its respective header file.  If dm-raid is
      to support both RAID1 and RAID10, both header files will be included and
      the structure names must not collide.  While only one of these structure
      names needs to change, this patch keeps the naming of the two structures
      consistent.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      0eaf822c
  3. 19 Jul 2012, 1 commit
  4. 09 Jul 2012, 1 commit
    • md/raid1: fix use-after-free bug in RAID1 data-check code. · 2d4f4f33
      Committed by NeilBrown
      This bug has been present ever since data-check was introduced
      in 2.6.16.  However it would only fire if a data-check were
      done on a degraded array, which was only possible if the array
      had 3 or more devices.  This is certainly possible, but is quite
      uncommon.
      
      Since hot-replace was added in 3.3 it can happen more often as
      the same condition can arise if not all possible replacements are
      present.
      
      The problem is that as soon as we submit the last read request, the
      'r1_bio' structure could be freed at any time, so we really should
      stop looking at it.  If the last device is being read from we will
      stop looking at it.  However if the last device is not due to be read
      from, we will still check the bio pointer in the r1_bio, but the
      r1_bio might already have been freed.
      
      So use the read_targets counter to make sure we stop looking for bios
      to submit as soon as we have submitted them all (sketched after this
      entry).
      
      This fix is suitable for any -stable kernel since 2.6.16.
      
      Cc: stable@vger.kernel.org
      Reported-by: Arnold Schulz <arnysch@gmx.net>
      Signed-off-by: NeilBrown <neilb@suse.de>
      2d4f4f33
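      The shape of the fix, as a hedged user-space sketch (the helper names are
      invented): count submissions down and never touch the shared descriptor
      after the last one, since its completion may free it.

        struct bio;                                /* opaque in this sketch */
        void submit(struct bio *bio);              /* hands the bio to the lower layer */
        int  is_check_read(const struct bio *bio); /* does this slot hold a data-check read? */

        /* Submit every outstanding data-check read.  'read_targets' counts
         * how many there are; once it reaches zero we must stop scanning,
         * because the final completion may free the structure owning the
         * array we are scanning. */
        void submit_check_reads(struct bio *bios[], int ndisks, int read_targets)
        {
                for (int i = 0; i < ndisks && read_targets; i++) {
                        if (!bios[i] || !is_check_read(bios[i]))
                                continue;
                        read_targets--;
                        submit(bios[i]);           /* after the last one: don't look again */
                }
        }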
  5. 03 Jul 2012, 3 commits
    • md: fix up plugging (again). · b357f04a
      Committed by NeilBrown
      The value returned by "mddev_check_plug" is only valid until the
      next 'schedule' as that will unplug things.  This could happen at any
      call to mempool_alloc.
      So just calling mddev_check_plug at the start doesn't really make
      sense.
      
      So call it just before, or just after, queuing things for the thread.
      As the action that happens at unplug is to wake the thread, this makes
      lots of sense.
      If we cannot add a plug (which requires a small GFP_ATOMIC alloc) we
      wake the thread immediately.
      
      RAID5 is a bit different.  Requests are queued for the thread and the
      thread is woken by release_stripe.  So we don't need to wake the
      thread on failure.
      However the thread doesn't perform certain actions when there is any
      active plug, so it is important to install a plug before waking the
      thread.  So for RAID5 we install the plug *before* queuing the request
      and waking the thread.
      
      Without this patch it is possible for raid1 or raid10 to queue a
      request without then waking the thread, resulting in the array locking
      up.
      
      Also change raid10 to only flush_pending_write when there are no
      active plugs, just like raid1.
      
      This patch is suitable for 3.0 or later.  I plan to submit it to
      -stable, but I'd like to let it spend a few weeks in mainline
      first to be sure it is completely safe.
      Signed-off-by: NeilBrown <neilb@suse.de>
      b357f04a
    • md/raid1: fix bug in read_balance introduced by hot-replace · 32644afd
      Committed by NeilBrown
      When we added hot_replace we doubled the number of devices
      that could be in a RAID1 array.  So we doubled how far read_balance
      would search.  Unfortunately we didn't double the point at which
      it looped back to the beginning - so it effectively loops over
      all non-replacement disks twice.
      This doesn't cause bad behaviour, but it is pointless and means we
      never read from replacement devices (see the sketch after this entry).
      Signed-off-by: NeilBrown <neilb@suse.de>
      32644afd
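      The off-by-a-factor can be pictured with a one-line helper (invented
      name): with hot-replace the candidate list holds 2 * raid_disks entries,
      so the wrap-around point has to double as well.

        /* Advance to the next candidate slot.  Slots 0..raid_disks-1 are the
         * original devices and raid_disks..2*raid_disks-1 their replacements.
         * Wrapping at raid_disks (the pre-fix behaviour) scans the first half
         * twice and never reaches a replacement device. */
        int next_candidate(int disk, int raid_disks)
        {
                if (++disk >= raid_disks * 2)
                        disk = 0;
                return disk;
        }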
    • md: make 'name' arg to md_register_thread non-optional. · 0232605d
      Committed by NeilBrown
      Having the 'name' arg optional and defaulting to the current
      personality name is not necessary and leads to errors, as when
      changing the level of an array we can end up using the
      name of the old level instead of the new one.
      
      So make it non-optional and always explicitly pass the name
      of the level that the array will be.
      Reported-by: majianpeng <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      0232605d
  6. 31 May 2012, 1 commit
    • md: raid1/raid10: fix problem with merge_bvec_fn · aba336bd
      Committed by NeilBrown
      The new merge_bvec_fn which calls the corresponding function
      in subsidiary devices requires that mddev->merge_check_needed
      be set if any child has a merge_bvec_fn.
      
      However we were only setting that when a device was hot-added,
      not when a device was present from the start.
      
      This bug was introduced in 3.4, so the patch is suitable for 3.4.y
      kernels.  However, there are conflicts in raid10.c so a separate
      patch will be needed for 3.4.y.
      
      Cc: stable@vger.kernel.org
      Reported-by: Sebastian Riemer <sebastian.riemer@profitbricks.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      aba336bd
  7. 22 May 2012, 3 commits
  8. 21 May 2012, 1 commit
    • md: add possibility to change data-offset for devices. · c6563a8c
      Committed by NeilBrown
      When reshaping we can avoid costly intermediate backup by
      changing the 'start' address of the array on the device
      (if there is enough room).
      
      So as a first step, allow such a change to be requested
      through sysfs, and recorded in v1.x metadata.
      
      (As we didn't previously check that all 'pad' fields were zero,
       we need a new FEATURE flag for this.
       We also (belatedly) check that all remaining 'pad' fields are
       zero to avoid a repeat of this.)
      
      The new data offset must be requested separately for each device.
      This allows each to have a different change in the data offset.
      This is not likely to be used often but as data_offset can be
      set per-device, new_data_offset should be too.
      
      This patch also removes the 'acknowledged' arg to rdev_set_badblocks as
      it is never used and never will be.  At the same time we add a new
      arg ('in_new') which is currently always zero but will be used more
      soon.
      
      When a reshape finishes we will need to update the data_offset
      and rdev->sectors.  So provide an exported function to do that.
      Signed-off-by: NeilBrown <neilb@suse.de>
      c6563a8c
  9. 12 Apr 2012, 1 commit
  10. 03 Apr 2012, 2 commits
  11. 02 Apr 2012, 1 commit
  12. 19 Mar 2012, 3 commits
    • md/raid1: handle merge_bvec_fn in member devices. · 6b740b8d
      Committed by NeilBrown
      Currently we don't honour merge_bvec_fn in member devices so if there
      is one, we force all requests to be single-page at most.
      This is not ideal.
      
      So create a raid1 merge_bvec_fn to check that function in children
      as well (the idea is sketched after this entry).
      
      This introduces a small problem.  There is no locking around calls
      to ->merge_bvec_fn and subsequent calls to ->make_request.  So a
      device added between these could end up getting a request which
      violates its merge_bvec_fn.
      
      Currently the best we can do is synchronize_sched().  This will work
      providing no preemption happens.  If there is preemption, we just
      have to hope that new devices are largely consistent with old devices.
      Signed-off-by: NeilBrown <neilb@suse.de>
      6b740b8d
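      The idea, sketched with invented names: ask every member that has its own
      merge_bvec_fn how much it would accept at this offset and allow only the
      minimum, so no child's limit is violated.

        /* Each member may export its own limit on how many bytes can be added
         * to a bio at 'sector'; a NULL callback means "no restriction". */
        struct member {
                int (*max_bytes_at)(long sector, int want);
        };

        int raid1_mergeable_bytes(const struct member *m, int nmembers,
                                  long sector, int want)
        {
                int allowed = want;

                for (int i = 0; i < nmembers; i++) {
                        if (!m[i].max_bytes_at)
                                continue;
                        int child = m[i].max_bytes_at(sector, want);
                        if (child < allowed)
                                allowed = child;
                }
                return allowed;
        }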
    • md: tidy up rdev_for_each usage. · dafb20fa
      Committed by NeilBrown
      md.h has an 'rdev_for_each()' macro for iterating the rdevs in an
      mddev.  However it uses the 'safe' version of list_for_each_entry,
      and so requires the extra variable, but doesn't include 'safe' in the
      name, which is useful documentation.
      
      Consequently some places use this safe version without needing it, and
      many use an explicit list_for_each_entry.
      
      So:
       - rename rdev_for_each to rdev_for_each_safe
       - create a new rdev_for_each which uses the plain
         list_for_each_entry,
       - use the 'safe' version only where needed, and convert all other
         list_for_each_entry calls to use rdev_for_each (the resulting pair
         is quoted after this entry).
      Signed-off-by: NeilBrown <neilb@suse.de>
      dafb20fa
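      The resulting pair looks roughly like this (quoted from memory of md.h,
      so the list head and member names should be treated as approximate):

        /* Plain iteration over an mddev's component devices; fine as long as
         * no rdev is removed while iterating. */
        #define rdev_for_each(rdev, mddev)                              \
                list_for_each_entry(rdev, &((mddev)->disks), same_set)

        /* 'Safe' variant: needs the extra cursor 'tmp', but tolerates
         * removing the current entry inside the loop body. */
        #define rdev_for_each_safe(rdev, tmp, mddev)                    \
                list_for_each_entry_safe(rdev, tmp, &((mddev)->disks), same_set)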
    • md/raid1,raid10: avoid deadlock during resync/recovery. · d6b42dcb
      Committed by NeilBrown
      If RAID1 or RAID10 is used under LVM or some other stacking
      block device, it is possible to enter a deadlock during
      resync or recovery.
      This can happen if the upper level block device creates
      two requests to the RAID1 or RAID10.  The first request gets
      processed, blocks recovery and queues requests for the underlying
      devices in current->bio_list.  A resync request then starts
      which will wait for those requests and block new IO.
      
      But then the second request to the RAID1/10 will be attempted
      and it cannot progress until the resync request completes,
      which cannot progress until the underlying device requests complete,
      which are on a queue behind that second request.
      
      So allow that second request to proceed even though there is
      a resync request about to start (the relaxed condition is sketched
      after this entry).
      
      This is suitable for any -stable kernel.
      
      Cc: stable@vger.kernel.org
      Reported-by: Ray Morris <support@bettercgi.com>
      Tested-by: Ray Morris <support@bettercgi.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      d6b42dcb
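      The relaxed admission test can be written as a small predicate (names
      invented for the sketch): a raised resync barrier no longer blocks a
      request from a task that already has bios queued on its own bio_list,
      because the resync may be waiting on exactly those bios.

        #include <stdbool.h>

        /* May this request start now?  Without the last clause, the second
         * request from the stacked device waits on the barrier, the resync
         * waits on the first request's queued bios, and those bios sit
         * behind the waiting request - a three-way deadlock. */
        bool may_start_io(bool barrier_raised, int nr_pending,
                          bool caller_has_queued_bios)
        {
                if (!barrier_raised)
                        return true;
                return nr_pending > 0 && caller_has_queued_bios;
        }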
  13. 13 Feb 2012, 1 commit
  14. 11 Jan 2012, 1 commit
  15. 23 Dec 2011, 8 commits
  16. 01 Nov 2011, 1 commit
  17. 26 Oct 2011, 1 commit
    • md: Fix some bugs in recovery_disabled handling. · d890fa2b
      Committed by NeilBrown
      In 3.0 we changed the way recovery_disabled was handled so that instead
      of testing against zero, we test an mddev-> value against a conf->
      value.
      Two problems:
        1/ one place in raid1 was missed and still sets it to '1'.
        2/ We didn't explicitly set the conf-> value at array creation
           time.
           It defaulted to '0' just like the mddev value does so they
           could appear equal and thus disable recovery.
           This did not affect normal 'md' as it calls bind_rdev_to_array
           which changes the mddev value.  However the dmraid interface
           doesn't call this and so doesn't change ->recovery_disabled; so at
           array start all recovery is incorrectly disabled.
      
      So initialise the 'conf' value to one less than the mddev value, so
      they will only be the same when explicitly set that way.
      Reported-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      d890fa2b
  18. 24 Oct 2011, 1 commit
    • block: Remove the control of complete cpu from bio. · 9562ad9a
      Committed by Tao Ma
      bio originally has the functionality to set the complete cpu, but
      it is broken.
      
      Christoph said that "This code is unused, and from all the
      discussions lately pretty obviously broken.  The only thing keeping
      it serves is creating more confusion and possibly more bugs."
      
      And Jens replied with "We can kill bio_set_completion_cpu(). I'm fine
      with leaving cpu control to the request based drivers, they are the
      only ones that can toggle the setting anyway".
      
      So this patch removes all the work of controlling the completion cpu
      from a bio.
      
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Tao Ma <boyu.mt@taobao.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9562ad9a
  19. 11 Oct 2011, 1 commit
    • md: add proper write-congestion reporting to RAID1 and RAID10. · 34db0cd6
      Committed by NeilBrown
      RAID1 and RAID10 handle write requests by queuing them for handling by
      a separate thread.  This is because when a write-intent-bitmap is
      active we might need to update the bitmap first, so it is good to
      queue a lot of writes, then do one big bitmap update for them all.
      
      However, writeback expects devices to appear to be congested after a
      while so it can make some guesstimate of throughput.  The infinite
      queue defeats that (note that RAID5 already has a finite queue so
      it doesn't suffer from this problem).
      
      So impose a limit on the number of pending write requests.  By default
      it is 1024, which seems to be generally suitable.  Make it configurable
      via a module option just in case someone finds a regression (the cap is
      sketched after this entry).
      Signed-off-by: NeilBrown <neilb@suse.de>
      34db0cd6
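      A minimal sketch of the cap (the default value is the one described
      above; the surrounding names are invented): report the array as
      write-congested once the backlog queued for the md thread reaches the
      limit.

        #include <stdbool.h>

        /* Default limit on writes queued for the md thread; the real driver
         * exposes this as a module parameter so it can be tuned if the limit
         * ever causes a regression. */
        static int max_queued_requests = 1024;

        /* Writeback polls congestion state to throttle itself; an unbounded
         * queue would never look congested and defeat its throughput
         * estimate. */
        bool write_congested(int pending_writes)
        {
                return pending_writes >= max_queued_requests;
        }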