1. 13 June 2013, 1 commit
  2. 30 April 2013, 3 commits
  3. 24 April 2013, 11 commits
    • DM RAID: Add message/status support for changing sync action · be83651f
      Jonathan Brassow committed
      
      This patch adds a message interface to dm-raid to allow the user to more
      finely control the sync actions being performed by the MD driver.  This
      gives the user the ability to initiate "check" and "repair" (i.e. scrubbing).
      Two additional fields have been appended to the status output to provide more
      information about the type of sync action occurring and the results of those
      actions, specifically: <sync_action> and <mismatch_cnt>.  These new fields
      will always be populated.  This is essentially the device-mapper way of doing
      what MD controls through the 'sync_action' sysfs file and shows through the
      'mismatch_cnt' sysfs file.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      be83651f
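
      The message interface is reached through the dm target's .message hook
      (e.g. "dmsetup message <dev> 0 check").  Below is a minimal, illustrative
      sketch of such a handler, assuming dm-raid's usual raid_set layout in
      ti->private and the standard MD recovery flags; it is not the exact
      upstream code.

        static int raid_message(struct dm_target *ti, unsigned argc, char **argv)
        {
                struct raid_set *rs = ti->private;   /* assumed dm-raid private data */
                struct mddev *mddev = &rs->md;

                if (argc != 1)
                        return -EINVAL;

                if (!strcasecmp(argv[0], "check") || !strcasecmp(argv[0], "repair")) {
                        /* "check" scrubs read-only; "repair" also rewrites mismatches */
                        if (!strcasecmp(argv[0], "check"))
                                set_bit(MD_RECOVERY_CHECK, &mddev->recovery);
                        set_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
                        set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
                        set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
                        md_wakeup_thread(mddev->thread);  /* md thread acts on the flags */
                        return 0;
                }
                return -EINVAL;
        }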
    • MD: Export 'md_reap_sync_thread' function · a91d5ac0
      Jonathan Brassow committed
      
      Make 'md_reap_sync_thread' available to other files, specifically dm-raid.c.
      - rename reap_sync_thread to md_reap_sync_thread
      - move the fn after md_check_recovery to match md.h declaration placement
      - export md_reap_sync_thread
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      a91d5ac0
    • md: don't update metadata when stopping a read-only array. · b6d428c6
      NeilBrown committed
      Read-only arrays should stay that way as much as possible.
      Updating the metadata - which could be triggered by a re-add
      while assembling the array - should be avoided.
      Signed-off-by: NeilBrown <neilb@suse.de>
      b6d428c6
    • md: Allow devices to be re-added to a read-only array. · 7ceb17e8
      NeilBrown committed
      When assembling an array incrementally we might want to make
      the array available when "enough" devices are present, but maybe
      not "all" devices are present.
      If the remaining devices appear before the array is actually used,
      they should be added transparently.
      
      We do this by using the "read-auto" mode where the array acts like
      it is read-only until a write request arrives.
      
      Currently, an add-device request switches a read-auto array to active.
      This means that only one device can be added after the array is first
      made read-auto.  This isn't a problem for RAID5, but is not ideal for
      RAID6 or RAID10.
      Also we don't really want to switch the array to read-auto at all
      when re-adding a device as this doesn't really imply any change.
      
      So:
       - remove the "md_update_sb()" call from add_new_disk().  This isn't
         really needed as just adding a disk doesn't require a metadata
         update.  Instead, just set MD_CHANGE_DEVS.  This will effect a
         metadata update soon enough, once the array is not read-only.
      
       - Allow the ADD_NEW_DISK ioctl to succeed without activating a
         read-auto array, providing the MD_DISK_SYNC flag is set.
         In this case, the device will be rejected if it cannot be added
         with the correct device number, or has an incorrect event count.
      
       - Teach remove_and_add_spares() to be careful about adding spares
         when the array is read-only (or read-mostly) - only add devices
         that are thought to be in-sync, and only do it if the array is
         in-sync itself.
      
       - In md_check_recovery, use remove_and_add_spares in the read-only
         case, rather than open coding just the 'remove' part of it.
      Reported-by: Martin Wilck <mwilck@arcor.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
      7ceb17e8
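
      A hedged sketch of the "be careful about adding spares when read-only"
      rule described above, using the md flag names of this era; the loop and
      conditions are simplified, not the exact upstream diff.

        /* Inside a remove_and_add_spares()-style pass over the array's rdevs. */
        rdev_for_each(rdev, mddev) {
                if (rdev->raid_disk >= 0 || test_bit(Faulty, &rdev->flags))
                        continue;
                if (mddev->ro) {
                        /* read-only/read-auto: only re-attach devices believed to
                         * be in-sync, and only while the array itself is clean */
                        if (!test_bit(In_sync, &rdev->flags) ||
                            mddev->recovery_cp != MaxSector)
                                continue;
                }
                if (mddev->pers->hot_add_disk(mddev, rdev) == 0)
                        /* no md_update_sb() here; just note that metadata changed */
                        set_bit(MD_CHANGE_DEVS, &mddev->flags);
        }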
    • md/raid10: Allow skipping recovery when clean arrays are assembled · 7e83ccbe
      Martin Wilck committed
      When an array is assembled incrementally with mdadm -I -R
      and the array switches to "active" mode, md starts a recovery.
      
      If the array was clean, the "fullsync" flag will be 0. Skip
      the full recovery in this case, as RAID1 does (the code was
      actually copied from the sync_request() method of RAID1).
      Signed-off-by: Martin Wilck <mwilck@arcor.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
      7e83ccbe
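
      A sketch of the early exit described above, modelled on the equivalent
      test RAID1 already performs at the top of its sync_request(); the exact
      condition set in the raid10 patch may differ slightly.

        /* On a clean, unbitmapped array with no explicit sync requested,
         * report the whole range as done instead of recovering it. */
        if (mddev->bitmap == NULL &&
            mddev->recovery_cp == MaxSector &&
            mddev->reshape_position == MaxSector &&
            !test_bit(MD_RECOVERY_SYNC, &mddev->recovery) &&
            !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
            !test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
            conf->fullsync == 0) {
                *skipped = 1;
                return mddev->dev_sectors - sector_nr;
        }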
    • md/raid5: avoid an extra write when writing to a known-bad-block. · c0b32972
      NeilBrown committed
      If we write to a known-bad-block it will be flagged as having
      a ReadError by analyse_stripe, but the write will proceed anyway
      (as it should).  Then the read-error handling will kick in and
      write again, then re-read.
      
      We don't need that 'write-again', so set R5_ReWrite so it looks like
      it has already been done.  Then we will just get the re-read, which we
      want.
      Reported-by: majianpeng <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      c0b32972
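
      The trick is purely a flag manipulation in analyse_stripe(); a simplified
      sketch (the surrounding bad-block lookup is abbreviated, so treat the
      condition as illustrative rather than the actual patch):

        if (is_badblock(rdev, sh->sector, STRIPE_SECTORS,
                        &first_bad, &bad_sectors)) {
                /* mark it like a read error, but pretend the corrective
                 * write has already happened so only the re-read remains */
                set_bit(R5_ReadError, &dev->flags);
                set_bit(R5_ReWrite, &dev->flags);
        }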
    • md/raid5: Change the order of some tests to improve efficiency. · 6f608040
      majianpeng committed
      As the function call is the most expensive of these tests it should be
      done later in the chain so that it can be avoided in some cases.
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      6f608040
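
      The reasoning relies on C's short-circuit evaluation of &&: cheap flag
      tests should come first so the expensive function call only runs when it
      can still affect the result.  A generic illustration (expensive_check()
      and do_work() are made-up names, not the actual raid5 condition):

        /* Before: the costly call happens on every evaluation. */
        if (expensive_check(sh) && test_bit(STRIPE_HANDLE, &sh->state))
                do_work(sh);

        /* After: the cheap bit test short-circuits, so expensive_check()
         * is only called when the flag is actually set. */
        if (test_bit(STRIPE_HANDLE, &sh->state) && expensive_check(sh))
                do_work(sh);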
    • md: use set_bit_le and clear_bit_le · 3f810b6c
      Akinobu Mita committed
      The value returned by test_and_set_bit_le() in drivers/md/bitmap.c is not
      used.  So just use set_bit_le().  The same goes for test_and_clear_bit_le().
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: linux-raid@vger.kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
      3f810b6c
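
      The change is mechanical: where the test_and_* return value is thrown
      away, the plain setter/clearer is enough (and avoids returning the old
      bit value).  Roughly:

        /* Before: return values ignored */
        test_and_set_bit_le(bit, kaddr);
        test_and_clear_bit_le(bit, paddr);

        /* After: same effect, no unused return value */
        set_bit_le(bit, kaddr);
        clear_bit_le(bit, paddr);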
    • md: HOT_DISK_REMOVE shouldn't make a read-auto device active. · 3ea8929d
      NeilBrown committed
      If a failed device or a spare is removed from an array, there is
      no need to make the array 'active'.  If/when the array does become
      active for some other reason the metadata will be updated to reflect
      the removal.
      If that never happens and the array is stopped while still read-auto,
      then there is no loss in forgetting that the device had 'failed'.
      
      A read-only array will leave failed devices attached to
      the array personality, so we need to explicitly call
      remove_and_add_spares() to free them (clearing Blocked just
      like we do in slot_store()).
      Signed-off-by: NeilBrown <neilb@suse.de>
      3ea8929d
    • md: use common code for all calls to ->hot_remove_disk() · 746d3207
      NeilBrown committed
      slot_store and remove_and_add_spares both call ->hot_remove_disk(),
      but with slightly different tests and consequences, which is
      at least untidy and might be buggy.
      
      So modify remove_and_add_spares() so that it can be asked
      to remove a specific device, and call it from slot_store().
      
      We also clear the Blocked flag to ensure that doesn't prevent
      removal.  The purpose of Blocked is to prevent automatic removal
      by the kernel before an error is acknowledged.
      If the array is read/write then user-space would have no reason
      to remove a device unless it was known to be 'spare' or 'faulty', in
      which case it would have already cleared the Blocked flag.
      If the array is read-only, the Blocked flag might still be set, but
      there is no harm in clearing it for read-only arrays.
      Signed-off-by: NeilBrown <neilb@suse.de>
      746d3207
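
      A sketch of the reworked helper's shape as described above: an optional
      'this' argument restricts the pass to a single rdev so slot_store() can
      reuse it.  Parameter names and the exact tests are assumptions, not the
      verbatim upstream signature.

        /* Remove pass of remove_and_add_spares(), callable for one device. */
        static int remove_and_add_spares(struct mddev *mddev, struct md_rdev *this)
        {
                struct md_rdev *rdev;
                int removed = 0;

                rdev_for_each(rdev, mddev) {
                        if (this && this != rdev)
                                continue;       /* caller asked about one device only */
                        if (rdev->raid_disk >= 0 &&
                            !test_bit(Blocked, &rdev->flags) &&
                            (test_bit(Faulty, &rdev->flags) ||
                             !test_bit(In_sync, &rdev->flags)) &&
                            atomic_read(&rdev->nr_pending) == 0 &&
                            mddev->pers->hot_remove_disk(mddev, rdev) == 0) {
                                sysfs_unlink_rdev(mddev, rdev);
                                rdev->raid_disk = -1;
                                removed++;
                        }
                }
                return removed;
        }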
    • md: never update metadata when array is read-only. · d87f064f
      NeilBrown committed
      Normally we don't even try to update the metadata if
      the array is read-only.  However future patches
      will increase the number of things that can happen on a read-only
      array, so it is safest to explicitly disable this.
      
      Every time that mddev->ro is set to 0, either
       - md_update_sb will be called again (at least if MD_CHANGE_DEVS
         is set) or
       - the mddev->thread is scheduled, which will also run
         md_update_sb if needed.
      
      So this is safe: if the array ever becomes read-write the
      metadata will be updated.
      Signed-off-by: NeilBrown <neilb@suse.de>
      d87f064f
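
      The guard amounts to an early exit at the top of md_update_sb(); a
      minimal sketch (the real function also has to remember that an update
      is still wanted, as noted above):

        void md_update_sb(struct mddev *mddev, int force_change)
        {
                if (mddev->ro) {
                        /* never touch on-disk metadata while read-only; just
                         * leave a flag so the update happens once read-write */
                        if (force_change)
                                set_bit(MD_CHANGE_DEVS, &mddev->flags);
                        return;
                }
                /* ... normal superblock update path ... */
        }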
  4. 19 April 2013, 1 commit
  5. 05 April 2013, 2 commits
    • dm cache: reduce bio front_pad size in writeback mode · 19b0092e
      Mike Snitzer committed
      A recent patch to fix the dm cache target's writethrough mode extended
      the bio's front_pad to include a 1056-byte struct dm_bio_details.
      Writeback mode doesn't need this, so this patch reduces the
      per_bio_data_size to 16 bytes in this case instead of 1096.
      
      The dm_bio_details structure was added in "dm cache: fix writes to
      cache device in writethrough mode" which fixed commit e2e74d61 ("dm
      cache: fix race in writethrough implementation").  In writeback mode
      we avoid allocating the writethrough-specific members of the
      per_bio_data structure (the dm_bio_details structure included).
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      19b0092e
    • dm cache: fix writes to cache device in writethrough mode · b844fe69
      Darrick J. Wong committed
      The dm-cache writethrough strategy introduced by commit e2e74d61
      ("dm cache: fix race in writethrough implementation") issues a bio to
      the origin device, remaps and then issues the bio to the cache device.
      This more conservative in-series approach was selected to favor
      correctness over performance (of the previous parallel writethrough).
      However, this in-series implementation that reuses the same bio to write
      both the origin and cache device didn't take into account that the block
      layer's req_bio_endio() modifies a completing bio's bi_sector and
      bi_size.  So the new writethrough strategy needs to preserve these bio
      fields, and restore them before submission to the cache device,
      otherwise nothing gets written to the cache (because bi_size is 0).
      
      This patch adds a struct dm_bio_details field to struct per_bio_data,
      and uses dm_bio_record() and dm_bio_restore() to ensure the bio is
      restored before reissuing to the cache device.  Adding such a large
      structure to the per_bio_data is not ideal, but we can improve this
      later; for now, correctness is the important thing.
      
      This problem initially went unnoticed because the dm-cache test-suite
      uses a linear DM device for the dm-cache device's origin device.
      Writethrough worked as expected because DM submits a *clone* of the
      original bio, so the original bio which was reused for the cache was
      never touched.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      b844fe69
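
      The core of the fix is the dm_bio_record()/dm_bio_restore() pair from
      dm-bio-record.h: snapshot the bio before the origin write completes and
      mangles it, then restore it before reissuing to the cache.  A simplified
      sketch; get_per_bio_data() and the pb->cache/pb->cblock fields are
      assumed from context rather than quoted.

        #include "dm-bio-record.h"

        /* per_bio_data (writethrough mode) gains a snapshot of the shared bio:
         *     struct dm_bio_details bio_details;
         * dm_bio_record(&pb->bio_details, bio) is called in the map path,
         * before the bio is issued to the origin device. */

        static void writethrough_endio(struct bio *bio, int err)
        {
                struct per_bio_data *pb = get_per_bio_data(bio);  /* assumed helper */

                /* req_bio_endio() has advanced bi_sector and zeroed bi_size;
                 * put the original values back before reusing the bio. */
                dm_bio_restore(&pb->bio_details, bio);
                remap_to_cache(pb->cache, bio, pb->cblock);
                defer_writethrough_bio(pb->cache, bio); /* worker will submit it */
        }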
  6. 21 March 2013, 10 commits
    • dm cache: policy ignore hints if generated by different version · ea2dd8c1
      Mike Snitzer committed
      When reading the dm cache metadata from disk, ignore the policy hints
      unless they were generated by the same major version number of the same
      policy module.
      
      The hints are considered to be private data belonging to the specific
      module that generated them and there is no requirement for them to make
      sense to different versions of the policy that generated them.
      Policy modules are all required to work fine if no previous hints are
      supplied (or if existing hints are lost).
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      ea2dd8c1
    • dm cache: policy change version from string to integer set · 4e7f506f
      Mike Snitzer committed
      Separate dm cache policy version string into 3 unsigned numbers
      corresponding to major, minor and patchlevel and store them at the end
      of the on-disk metadata so we know which version of the policy generated
      the hints in case a future version wants to use them differently.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      4e7f506f
    • dm cache: fix race in writethrough implementation · e2e74d61
      Joe Thornber committed
      We have found a race in the optimisation used in the dm cache
      writethrough implementation.  Currently, dm core sends the cache target
      two bios, one for the origin device and one for the cache device and
      these are processed in parallel.  This patch avoids the race by
      changing the code back to a simpler (slower) implementation which
      processes the two writes in series, one after the other, until we can
      develop a complete fix for the problem.
      
      When the cache is in writethrough mode it needs to send WRITE bios to
      both the origin and cache devices.
      
      Previously we've been implementing this by having dm core query the
      cache target on every write to find out how many copies of the bio it
      wants.  The cache will ask for two bios if the block is in the cache,
      and one otherwise.
      
      The main problem with this is that it's racy.  At the time this check is
      made the bio hasn't yet been submitted and so isn't being taken into
      account when quiescing a block for migration (promotion or demotion).
      This means a single bio may be submitted when two were needed because
      the block has since been promoted to the cache (catastrophic), or two
      bios submitted where only one is needed (harmless).
      
      I really don't want to start entering bios into the quiescing system
      (deferred_set) in the get_num_write_bios callback.  Instead this patch
      simplifies things; only one bio is submitted by the core, and it is
      first written to the origin and then to the cache device in series.
      Obviously this will have a latency impact.
      
      deferred_writethrough_bios is introduced to record bios that must be
      later issued to the cache device from the worker thread.  This deferred
      submission, after the origin bio completes, is required given that we're
      in interrupt context (writethrough_endio).
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      e2e74d61
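
      The deferral mentioned above follows the usual dm pattern: the endio
      handler (interrupt context) must not call generic_make_request(), so it
      parks the bio on a spinlock-protected list and wakes the worker, which
      submits it from process context.  A sketch with assumed field and helper
      names (cache->lock, wake_worker(), ...):

        static void defer_writethrough_bio(struct cache *cache, struct bio *bio)
        {
                unsigned long flags;

                spin_lock_irqsave(&cache->lock, flags);
                bio_list_add(&cache->deferred_writethrough_bios, bio);
                spin_unlock_irqrestore(&cache->lock, flags);

                wake_worker(cache);
        }

        /* Worker side: drain the list and issue the bios to the cache device. */
        static void process_deferred_writethrough_bios(struct cache *cache)
        {
                unsigned long flags;
                struct bio_list bios;
                struct bio *bio;

                bio_list_init(&bios);
                spin_lock_irqsave(&cache->lock, flags);
                bio_list_merge(&bios, &cache->deferred_writethrough_bios);
                bio_list_init(&cache->deferred_writethrough_bios);
                spin_unlock_irqrestore(&cache->lock, flags);

                while ((bio = bio_list_pop(&bios)))
                        generic_make_request(bio);  /* safe: process context */
        }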
    • dm cache: metadata clear dirty bits on clean shutdown · 79ed9caf
      Joe Thornber committed
      When writing the dirty bitset to the metadata device on a clean
      shutdown, clear the dirty bits.  Previously they were left indicating
      the cache was dirty. This led to confusion about whether there really
      was dirty data in the cache or not.  (This was a harmless bug.)
      Reported-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      79ed9caf
    • dm cache: avoid calling policy destructor twice on error · b978440b
      Heinz Mauelshagen committed
      If the cache policy's config values are not able to be set we must
      set the policy to NULL after destroying it in create_cache_policy()
      so we don't attempt to destroy it a second time later.
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      b978440b
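
      The fix is the classic "destroy and clear the pointer" pattern so the
      target destructor cannot free the policy a second time; a sketch, with
      set_config_values() and the error message assumed from context:

        r = set_config_values(cache, ca->policy_argc, ca->policy_argv);
        if (r) {
                *error = "Error setting cache policy's config values";
                dm_cache_policy_destroy(cache->policy);
                cache->policy = NULL;   /* prevent a second destroy in the dtr path */
        }
        return r;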
    • dm cache: detect cache_create failure · 617a0b89
      Heinz Mauelshagen committed
      Return error if cache_create() fails.
      
      A missing return check made cache_ctr continue even after an error in
      cache_create() resulting in the cache object being destroyed.  So a
      simple failure like an odd number of cache policy config value arguments
      would result in an oops.
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      617a0b89
    • dm cache: avoid 64 bit division on 32 bit · 414dd67d
      Joe Thornber committed
      Squash various 32-bit link errors.
      
        >> on i386:
        >> drivers/built-in.o: In function `is_discarded_oblock':
        >> dm-cache-target.c:(.text+0x1ea28e): undefined reference to `__udivdi3'
        ...
      Reported-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      414dd67d
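
      On 32-bit targets a plain '/' or '%' on 64-bit operands is lowered to
      libgcc helpers such as __udivdi3, which the kernel does not provide; the
      usual cure is the div64 helpers.  An illustrative (not verbatim) example:

        #include <linux/math64.h>

        static u64 blocks_used(u64 nr_bytes, u32 block_size)
        {
                /* nr_bytes / block_size would need __udivdi3 on i386 */
                return div_u64(nr_bytes, block_size);   /* fine on 32- and 64-bit */
        }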
    • dm verity: avoid deadlock · 3b6b7813
      Mikulas Patocka committed
      A deadlock was found in the prefetch code in the dm verity map
      function.  This patch fixes this by transferring the prefetch
      to a worker thread and skipping it completely if kmalloc fails.
      
      If generic_make_request is called recursively, it queues the I/O
      request on the current->bio_list without making the I/O request
      and returns. The routine making the recursive call cannot wait
      for the I/O to complete.
      
      The deadlock occurs when one thread grabs the bufio_client
      mutex and waits for an I/O to complete but the I/O is queued
      on another thread's current->bio_list and is waiting to get
      the mutex held by the first thread.
      
      The fix recognises that prefetching is not essential.  If memory
      can be allocated, it queues the prefetch request to the worker thread,
      but if not, it does nothing.
      Signed-off-by: Paul Taysom <taysom@chromium.org>
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      Cc: stable@kernel.org
      3b6b7813
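
      A sketch of the deferral described above, using the standard workqueue
      API; the structure and field names here are assumptions for illustration,
      not a quote of the dm-verity patch.

        struct verity_prefetch_work {
                struct work_struct work;
                struct dm_verity *v;
                sector_t block;
                unsigned n_blocks;
        };

        static void verity_prefetch_io(struct work_struct *work)
        {
                struct verity_prefetch_work *pw =
                        container_of(work, struct verity_prefetch_work, work);

                /* process context: no longer on anyone's current->bio_list */
                dm_bufio_prefetch(pw->v->bufio, pw->block, pw->n_blocks);
                kfree(pw);
        }

        static void verity_submit_prefetch(struct dm_verity *v, sector_t block,
                                           unsigned n_blocks)
        {
                struct verity_prefetch_work *pw;

                pw = kmalloc(sizeof(*pw), GFP_NOIO | __GFP_NORETRY | __GFP_NOWARN);
                if (!pw)
                        return;         /* prefetch is only an optimisation: skip it */

                INIT_WORK(&pw->work, verity_prefetch_io);
                pw->v = v;
                pw->block = block;
                pw->n_blocks = n_blocks;
                queue_work(v->verify_wq, &pw->work);
        }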
    • dm thin: fix non power of two discard granularity calc · 58051b94
      Joe Thornber committed
      Fix a discard granularity calculation to work for non power of 2 block sizes.
      
      In order for thinp to pass down discard bios to the underlying data
      device, the data device must have a discard granularity that is a
      factor of the thinp block size.  Originally this check was done by
      using bitops since the block_size was known to be a power of two.
      
      Introduced by commit f13945d7
      ("dm thin: support a non power of 2 discard_granularity").
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      58051b94
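
      With a non-power-of-two block size the "is a factor of" test has to use
      real division rather than bit masks; roughly (sector_div() avoids the
      64-bit '%' problem on 32-bit builds):

        static bool is_factor(sector_t block_size, uint32_t n)
        {
                return !n || (sector_div(block_size, n) == 0);
        }

        /* A power-of-two-only check such as (block_size & (n - 1)) gives
         * wrong answers once block_size is not a power of two. */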
    • dm thin: fix discard corruption · f046f89a
      Joe Thornber committed
      Fix a bug in dm_btree_remove that could leave leaf values with incorrect
      reference counts.  The effect of this was that removal of a shared block
      could result in the space maps thinking the block was no longer used.
      More concretely, if you have a thin device and a snapshot of it, sending
      a discard to a shared region of the thin could corrupt the snapshot.
      
      Thinp uses a 2-level nested btree to store its mappings.  The first
      level is indexed by thin device, and the second level by logical
      block.
      
      Often when we're removing an entry in this mapping tree we need to
      rebalance nodes, which can involve shadowing them, possibly creating a
      copy if the block is shared.  If we do create a copy then children of
      that node need to have their reference counts incremented.  In this
      way reference counts percolate down the tree as shared trees diverge.
      
      The rebalance functions were incrementing the children at the
      appropriate time, but they were always assuming the children were
      internal nodes.  This meant the leaf values (in our case packed
      block/flags entries) were not being incremented.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      f046f89a
  7. 20 March 2013, 5 commits
    • md: remove CONFIG_MULTICORE_RAID456 entirely · 238f5908
      Paul Bolle committed
      One instance of this Kconfig macro remained after commit
      51acbcec ("md: remove CONFIG_MULTICORE_RAID456").  Remove that one
      too.  And, while we're at it,
      also remove it from the defconfig files that carry it.
      Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
      Signed-off-by: NeilBrown <neilb@suse.de>
      238f5908
    • md/raid5: ensure sync and DISCARD don't happen at the same time. · f8dfcffd
      NeilBrown committed
      A number of problems can occur due to races between
      resync/recovery and discard.
      
      - if sync_request calls handle_stripe() while a discard is
        happening on the stripe, it might call handle_stripe_clean_event
        before all of the individual discard requests have completed
        (so some devices are still locked, but not all).
        Since commit ca64cae9 ("md/raid5: Make sure we clear R5_Discard when
        discard is finished."), this will cause R5_Discard to be cleared for
        the parity device,
        so handle_stripe_clean_event() will not be called when the other
        devices do become unlocked, so their ->written will not be cleared.
        This ultimately leads to a WARN_ON in init_stripe and a lock-up.
      
      - If handle_stripe_clean_event() does clear R5_UPTODATE at an awkward
        time for resync, it can lead to s->uptodate being less than disks
        in handle_parity_checks5(), which triggers a BUG (because it is
        one).
      
      So:
       - keep R5_Discard on the parity device until all other devices have
         completed their discard request
       - make sure we don't try to have a 'discard' and a 'sync' action at
         the same time.
         This involves a new stripe flag so we know when a 'discard' is
         happening, and the use of R5_Overlap on the parity disk so that,
         when a discard is wanted while a sync is active, we know to wake
         up the discard at the appropriate time.
      
      Discard support for RAID5 was added in 3.7, so this is suitable for
      any -stable kernel since 3.7.
      
      Cc: stable@vger.kernel.org (v3.7+)
      Reported-by: Jes Sorensen <Jes.Sorensen@redhat.com>
      Tested-by: Jes Sorensen <Jes.Sorensen@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      f8dfcffd
    • MD: Prevent sysfs operations on uninitialized kobjects · 90584fc9
      Jonathan Brassow committed
      
      Device-mapper does not use sysfs, but when device-mapper is leveraging
      MD's RAID personalities, MD sometimes attempts to update sysfs.  This
      patch adds checks for 'mddev->kobj.sd' in sysfs_[un]link_rdev to ensure
      it is about to operate on something valid.  This patch also checks for
      'mddev->kobj.sd' before calling 'sysfs_notify' in 'remove_and_add_spares'.
      Although 'sysfs_notify' already makes this check, doing so in
      'remove_and_add_spares' prevents an additional mutex operation.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      90584fc9
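
      The guards boil down to checking that the mddev's kobject actually has a
      sysfs directory before using it; a sketch of the two places mentioned
      (helper shapes assumed from md.c of this era, not quoted):

        static int sysfs_link_rdev(struct mddev *mddev, struct md_rdev *rdev)
        {
                char nm[20];

                if (!mddev->kobj.sd)
                        return 0;               /* dm-raid case: no sysfs presence */
                sprintf(nm, "rd%d", rdev->raid_disk);
                return sysfs_create_link(&mddev->kobj, &rdev->kobj, nm);
        }

        /* remove_and_add_spares(): skip sysfs_notify() (and its mutex) when
         * there is no sysfs directory to notify against. */
        if (removed && mddev->kobj.sd)
                sysfs_notify(&mddev->kobj, NULL, "degraded");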
    • MD RAID5: Avoid accessing gendisk or queue structs when not available · e3620a3a
      Jonathan Brassow committed
      MD RAID5:  Fix kernel oops when RAID4/5/6 is used via device-mapper
      
      Commit a9add5d9 (v3.8-rc1) added blktrace calls to the RAID4/5/6 driver.
      However, when device-mapper is used to create RAID4/5/6 arrays, the
      mddev->gendisk and mddev->queue fields are not set up.  Therefore, calling
      things like trace_block_bio_remap will cause a kernel oops.  This patch
      makes those calls conditional on the proper fields being available.
      (Device-mapper will call trace_block_bio_remap on its own.)
      
      This patch is suitable for the 3.8.y stable kernel.
      
      Cc: stable@vger.kernel.org (v3.8+)
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      e3620a3a
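
      The conditionalisation is just a NULL check on the fields device-mapper
      never populates; illustrative form (the real call sites pass raid5's own
      sector arithmetic):

        /* Only emit blktrace events when md owns a gendisk/queue, i.e. the
         * array was not created through device-mapper. */
        if (mddev->gendisk)
                trace_block_bio_remap(bdev_get_queue(bi->bi_bdev), bi,
                                      disk_devt(mddev->gendisk),
                                      sh->dev[i].sector);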
    • md/raid5: schedule_construction should abort if nothing to do. · ce7d363a
      NeilBrown committed
      Since commit 1ed850f3 ("md/raid5: make sure to_read and to_write never
      go negative"), it has been possible for handle_stripe_dirtying to be
      called when there isn't actually any work to do.
      It then calls schedule_reconstruction() which will set R5_LOCKED
      on the parity block(s) even when nothing else is happening.
      This then causes problems in do_release_stripe().
      
      So add checks to schedule_reconstruction() so that if it doesn't
      find anything to do, it just aborts.
      
      This bug was introduced in v3.7, so the patch is suitable
      for -stable kernels since then.
      
      Cc: stable@vger.kernel.org (v3.7+)
      Reported-by: majianpeng <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      ce7d363a
  8. 02 March 2013, 7 commits