1. 26 2月, 2013 9 次提交
    • N
      md/raid1,raid10: fix deadlock with freeze_array() · ee0b0244
      NeilBrown 提交于
      When raid1/raid10 needs to fix a read error, it first drains
      all pending requests by calling freeze_array().
      This calls flush_pending_writes() if it needs to sleep,
      but some writes may be pending in a per-process plug rather
      than in the per-array request queue.
      
      When raid1{,0}_unplug() moves the request from the per-process
      plug to the per-array request queue (from which
      flush_pending_writes() can flush them), it needs to wake up
      freeze_array(), or freeze_array() will never flush them and so
      it will block forever.
      
      So add the requires wake_up() calls.
      
      This bug was introduced by commit
         f54a9d0e
      for raid1 and a similar commit for RAID10, and so has been present
      since linux-3.6.  As the bug causes a deadlock I believe this fix is
      suitable for -stable.
      
      Cc: stable@vger.kernel.org (3.6.y 3.7.y 3.8.y)
      Reported-by: NTregaron Bayly <tbayly@bluehost.com>
      Tested-by: NTregaron Bayly <tbayly@bluehost.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      ee0b0244
    • N
      md/raid0: improve error message when converting RAID4-with-spares to RAID0 · f96c9f30
      NeilBrown 提交于
      Mentioning "bad disk number -1" exposes irrelevant internal detail.
      Just say they are inactive and must be removed.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      f96c9f30
    • N
      md: raid0: fix error return from create_stripe_zones. · 58ebb34c
      NeilBrown 提交于
      Create_stripe_zones returns an error slightly differently to
      raid0_run and to raid0_takeover_*.
      
      The error returned used by the second was wrong and an error would
      result in mddev->private being set to NULL and sooner or later a
      crash.
      
      So never return NULL, return ERR_PTR(err), not NULL from
      create_stripe_zones.
      
      This bug has been present since 2.6.35 so the fix is suitable
      for any kernel since then.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.de>
      58ebb34c
    • N
      md: fix two bugs when attempting to resize RAID0 array. · a6468539
      NeilBrown 提交于
      You cannot resize a RAID0 array (in terms of making the devices
      bigger), but the code doesn't entirely stop you.
      So:
      
       disable setting of the available size on each device for
       RAID0 and Linear devices.  This must not change as doing so
       can change the effective layout of data.
      
       Make sure that the size that raid0_size() reports is accurate,
       but rounding devices sizes to chunk sizes.  As the device sizes
       cannot change now, this isn't so important, but it is best to be
       safe.
      
      Without this change:
        mdadm --grow /dev/md0 -z max
        mdadm --grow /dev/md0 -Z max
        then read to the end of the array
      
      can cause a BUG in a RAID0 array.
      
      These bugs have been present ever since it became possible
      to resize any device, which is a long time.  So the fix is
      suitable for any -stable kerenl.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.de>
      a6468539
    • J
      DM RAID: Add support for MD's RAID10 "far" and "offset" algorithms · fe5d2f4a
      Jonathan Brassow 提交于
      DM RAID:  Add support for MD's RAID10 "far" and "offset" algorithms
      
      Until now, dm-raid.c only supported the "near" algorthm of MD's RAID10
      implementation.  This patch adds support for the "far" and "offset"
      algorithms, but only with the improved redundancy that is brought with
      the introduction of the 'use_far_sets' bit, which shifts copied stripes
      according to smaller sets vs the entire array.  That is, the 17th bit
      of the 'layout' variable that defines the RAID10 implementation will
      always be set.   (More information on how the 'layout' variable selects
      the RAID10 algorithm can be found in the opening comments of
      drivers/md/raid10.c.)
      Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      fe5d2f4a
    • J
      MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 2) · 9a3152ab
      Jonathan Brassow 提交于
      MD RAID10:  Improve redundancy for 'far' and 'offset' algorithms (part 2)
      
      This patch addresses raid arrays that have a number of devices that cannot
      be evenly divided by 'far_copies'.  (E.g. 5 devices, far_copies = 2)  This
      case must be handled differently because it causes that last set to be of
      a different size than the rest of the sets.  We must compute a new modulo
      for this last set so that copied chunks are properly wrapped around.
      
      Example use_far_sets=1, far_copies=2, near_copies=1, devices=5:
                      "far" algorithm
              dev1 dev2 dev3 dev4 dev5
      	==== ==== ==== ==== ====
      	[ A   B ] [ C    D   E ]
              [ G   H ] [ I    J   K ]
                          ...
              [ B   A ] [ E    C   D ] --> nominal set of 2 and last set of 3
              [ H   G ] [ K    I   J ]     []'s show far/offset sets
      Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      9a3152ab
    • J
      MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 1) · 475901af
      Jonathan Brassow 提交于
      The MD RAID10 'far' and 'offset' algorithms make copies of entire stripe
      widths - copying them to a different location on the same devices after
      shifting the stripe.  An example layout of each follows below:
      
      	        "far" algorithm
      	dev1 dev2 dev3 dev4 dev5 dev6
      	==== ==== ==== ==== ==== ====
      	 A    B    C    D    E    F
      	 G    H    I    J    K    L
      	            ...
      	 F    A    B    C    D    E  --> Copy of stripe0, but shifted by 1
      	 L    G    H    I    J    K
      	            ...
      
      		"offset" algorithm
      	dev1 dev2 dev3 dev4 dev5 dev6
      	==== ==== ==== ==== ==== ====
      	 A    B    C    D    E    F
      	 F    A    B    C    D    E  --> Copy of stripe0, but shifted by 1
      	 G    H    I    J    K    L
      	 L    G    H    I    J    K
      	            ...
      
      Redundancy for these algorithms is gained by shifting the copied stripes
      one device to the right.  This patch proposes that array be divided into
      sets of adjacent devices and when the stripe copies are shifted, they wrap
      on set boundaries rather than the array size boundary.  That is, for the
      purposes of shifting, the copies are confined to their sets within the
      array.  The sets are 'near_copies * far_copies' in size.
      
      The above "far" algorithm example would change to:
      	        "far" algorithm
      	dev1 dev2 dev3 dev4 dev5 dev6
      	==== ==== ==== ==== ==== ====
      	 A    B    C    D    E    F
      	 G    H    I    J    K    L
      	            ...
      	 B    A    D    C    F    E  --> Copy of stripe0, shifted 1, 2-dev sets
      	 H    G    J    I    L    K      Dev sets are 1-2, 3-4, 5-6
      	            ...
      
      This has the affect of improving the redundancy of the array.  We can
      always sustain at least one failure, but sometimes more than one can
      be handled.  In the first examples, the pairs of devices that CANNOT fail
      together are:
      	(1,2) (2,3) (3,4) (4,5) (5,6) (1, 6) [40% of possible pairs]
      In the example where the copies are confined to sets, the pairs of
      devices that cannot fail together are:
      	(1,2) (3,4) (5,6)                    [20% of possible pairs]
      
      We cannot simply replace the old algorithms, so the 17th bit of the 'layout'
      variable is used to indicate whether we use the old or new method of computing
      the shift.  (This is similar to the way the 16th bit indicates whether the
      "far" algorithm or the "offset" algorithm is being used.)
      
      This patch only handles the cases where the number of total raid disks is
      a multiple of 'far_copies'.  A follow-on patch addresses the condition where
      this is not true.
      Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      475901af
    • J
      MD RAID10: Minor non-functional code changes · 4c0ca26b
      Jonathan Brassow 提交于
      Changes include assigning 'addr' from 's' instead of 'sector' to be
      consistent with the way the code does it just a few lines later and
      using '%=' vs a conditional and subtraction.
      Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      4c0ca26b
    • J
      md: raid1,10: Handle REQ_WRITE_SAME flag in write bios · c8dc9c65
      Joe Lawrence 提交于
      Set mddev queue's max_write_same_sectors to its chunk_sector value (before
      disk_stack_limits merges the underlying disk limits.)  With that in place,
      be sure to handle writes coming down from the block layer that have the
      REQ_WRITE_SAME flag set.  That flag needs to be copied into any newly cloned
      write bio.
      Signed-off-by: NJoe Lawrence <joe.lawrence@stratus.com>
      Acked-by: N"Martin K. Petersen" <martin.petersen@oracle.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      c8dc9c65
  2. 21 2月, 2013 1 次提交
  3. 31 1月, 2013 2 次提交
    • A
      dm: fix write same requests counting · fe7af2d3
      Alasdair G Kergon 提交于
      When processing write same requests, fix dm to send the configured
      number of WRITE SAME requests to the target rather than the number of
      discards, which is not always the same.
      
      Device-mapper WRITE SAME support was introduced by commit
      23508a96 ("dm: add WRITE SAME support").
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      Acked-by: NMike Snitzer <snitzer@redhat.com>
      fe7af2d3
    • M
      dm thin: fix queue limits stacking · 0f640dca
      Mike Snitzer 提交于
      thin_io_hints() is blindly copying the queue limits from the thin-pool
      which can lead to incorrect limits being set.  The fix here simply
      deletes the thin_io_hints() hook which leaves the existing stacking
      infrastructure to set the limits correctly.
      
      When a thin-pool uses an MD device for the data device a thin device
      from the thin-pool must respect MD's constraints about disallowing a bio
      from spanning multiple chunks.  Otherwise we can see problems.  If the raid0
      chunksize is 1152K and thin-pool chunksize is 256K I see the following
      md/raid0 error (with extra debug tracing added to thin_endio) when
      mkfs.xfs is executed against the thin device:
      
      md/raid0:md99: make_request bug: can't convert block across chunks or bigger than 1152k 6688 127
      device-mapper: thin: bio sector=2080 err=-5 bi_size=130560 bi_rw=17 bi_vcnt=32 bi_idx=0
      
      This extra DM debugging shows that the failing bio is spanning across
      the first and second logical 1152K chunk (sector 2080 + 255 takes the
      bio beyond the first chunk's boundary of sector 2304).  So the bio
      splitting that DM is doing clearly isn't respecting the MD limits.
      
      max_hw_sectors_kb is 127 for both the thin-pool and thin device
      (queue_max_hw_sectors returns 255 so we'll excuse sysfs's lack of
      precision).  So this explains why bi_size is 130560.
      
      But the thin device's max_hw_sectors_kb should be 4 (PAGE_SIZE) given
      that it doesn't have a .merge function (for bio_add_page to consult
      indirectly via dm_merge_bvec) yet the thin-pool does sit above an MD
      device that has a compulsory merge_bvec_fn.  This scenario is exactly
      why DM must resort to sending single PAGE_SIZE bios to the underlying
      layer. Some additional context for this is available in the header for
      commit 8cbeb67a ("dm: avoid unsupported spanning of md stripe boundaries").
      
      Long story short, the reason a thin device doesn't properly get
      configured to have a max_hw_sectors_kb of 4 (PAGE_SIZE) is that
      thin_io_hints() is blindly copying the queue limits from the thin-pool
      device directly to the thin device's queue limits.
      
      Fix this by eliminating thin_io_hints.  Doing so is safe because the
      block layer's queue limits stacking already enables the upper level thin
      device to inherit the thin-pool device's discard and minimum_io_size and
      optimal_io_size limits that get set in pool_io_hints.  But avoiding the
      queue limits copy allows the thin and thin-pool limits to be different
      where it is important, namely max_hw_sectors_kb.
      Reported-by: NDaniel Browning <db@kavod.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      0f640dca
  4. 24 1月, 2013 1 次提交
    • J
      DM-RAID: Fix RAID10's check for sufficient redundancy · 55ebbb59
      Jonathan Brassow 提交于
      Before attempting to activate a RAID array, it is checked for sufficient
      redundancy.  That is, we make sure that there are not too many failed
      devices - or devices specified for rebuild - to undermine our ability to
      activate the array.  The current code performs this check twice - once to
      ensure there were not too many devices specified for rebuild by the user
      ('validate_rebuild_devices') and again after possibly experiencing a failure
      to read the superblock ('analyse_superblocks').  Neither of these checks are
      sufficient.  The first check is done properly but with insufficient
      information about the possible failure state of the devices to make a good
      determination if the array can be activated.  The second check is simply
      done wrong in the case of RAID10 because it doesn't account for the
      independence of the stripes (i.e. mirror sets).  The solution is to use the
      properly written check ('validate_rebuild_devices'), but perform the check
      after the superblocks have been read and we know which devices have failed.
      This gives us one check instead of two and performs it in a location where
      it can be done right.
      
      Only RAID10 was affected and it was affected in the following ways:
      - the code did not properly catch the condition where a user specified
        a device for rebuild that already had a failed device in the same mirror
        set.  (This condition would, however, be caught at a deeper level in MD.)
      - the code triggers a false positive and denies activation when devices in
        independent mirror sets have failed - counting the failures as though they
        were all in the same set.
      
      The most likely place this error was introduced (or this patch should have
      been included) is in commit 4ec1e369 - first introduced in v3.7-rc1.
      Consequently this fix should also go in v3.7.y, however there is a
      small conflict on the .version in raid_target, so I'll submit a
      separate patch to -stable.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      55ebbb59
  5. 22 12月, 2012 27 次提交