1. 22 6月, 2017 1 次提交
    • N
      md: use a separate bio_set for synchronous IO. · 5a85071c
      NeilBrown 提交于
      md devices allocate a bio_set and use it for two
      distinct purposes.
      mddev->bio_set is used to clone bios as part of sending
      upper level requests down to lower level devices,
      and it is also use for synchronous IO such as superblock
      and bitmap updates, and for correcting read errors.
      
      This multiple usage can lead to deadlocks.  It is likely
      that cloned bios might be queued for write and to be
      waiting for a metadata update before the write can be permitted.
      If the cloning exhausted mddev->bio_set, the metadata update
      may not be able to proceed.
      
      This scenario has been seen during heavy testing, with lots of IO and
      lots of memory pressure.
      
      Address this by adding a new bio_set specifically for synchronous IO.
      All synchronous IO goes directly to the underlying device and is not
      queued at the md level, so request using entries from the new
      mddev->sync_set will complete in a timely fashion.
      Requests that use mddev->bio_set will sometimes need to wait
      for synchronous IO, but will no longer risk deadlocking that iO.
      
      Also: small simplification in mddev_put(): there is no need to
      wait until the spinlock is released before calling bioset_free().
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      5a85071c
  2. 17 6月, 2017 1 次提交
  3. 14 6月, 2017 1 次提交
    • N
      md: fix deadlock between mddev_suspend() and md_write_start() · cc27b0c7
      NeilBrown 提交于
      If mddev_suspend() races with md_write_start() we can deadlock
      with mddev_suspend() waiting for the request that is currently
      in md_write_start() to complete the ->make_request() call,
      and md_write_start() waiting for the metadata to be updated
      to mark the array as 'dirty'.
      As metadata updates done by md_check_recovery() only happen then
      the mddev_lock() can be claimed, and as mddev_suspend() is often
      called with the lock held, these threads wait indefinitely for each
      other.
      
      We fix this by having md_write_start() abort if mddev_suspend()
      is happening, and ->make_request() aborts if md_write_start()
      aborted.
      md_make_request() can detect this abort, decrease the ->active_io
      count, and wait for mddev_suspend().
      Reported-by: NNix <nix@esperi.org.uk>
      Fix: 68866e42(MD: no sync IO while suspended)
      Cc: stable@vger.kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      cc27b0c7
  4. 06 6月, 2017 1 次提交
  5. 01 6月, 2017 1 次提交
    • J
      md: Make flush bios explicitely sync · 5a8948f8
      Jan Kara 提交于
      Commit b685d3d6 "block: treat REQ_FUA and REQ_PREFLUSH as
      synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
      definitions.  generic_make_request_checks() however strips REQ_FUA and
      REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
      write cache and thus write effectively becomes asynchronous which can
      lead to performance regressions
      
      Fix the problem by making sure all bios which are synchronous are
      properly marked with REQ_SYNC.
      
      CC: linux-raid@vger.kernel.org
      CC: Shaohua Li <shli@kernel.org>
      Fixes: b685d3d6
      CC: stable@vger.kernel.org
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NShaohua Li <shli@fb.com>
      5a8948f8
  6. 09 5月, 2017 1 次提交
  7. 21 4月, 2017 1 次提交
  8. 13 4月, 2017 2 次提交
    • N
      md: support disabling of create-on-open semantics. · 78b6350d
      NeilBrown 提交于
      md allows a new array device to be created by simply
      opening a device file.  This make it difficult to
      remove the device and udev is likely to open the device file
      as part of processing the REMOVE event.
      
      There is an alternate mechanism for creating arrays
      by writing to the new_array module parameter.
      When using tools that work with this parameter, it is
      best to disable the old semantics.
      This new module parameter allows that.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Acted-by: NColy Li <colyli@suse.de>
      Signed-off-by: NShaohua Li <shli@fb.com>
      78b6350d
    • N
      md: allow creation of mdNNN arrays via md_mod/parameters/new_array · 039b7225
      NeilBrown 提交于
      The intention when creating the "new_array" parameter and the
      possibility of having array names line "md_HOME" was to transition
      away from the old way of creating arrays and to eventually only use
      this new way.
      
      The "old" way of creating array is to create a device node in /dev
      and then open it.  The act of opening creates the array.
      This is problematic because sometimes the device node can be opened
      when we don't want to create an array.  This can easily happen
      when some rule triggered by udev looks at a device as it is being
      destroyed.  The node in /dev continues to exist for a short period
      after an array is stopped, and opening it during this time recreates
      the array (as an inactive array).
      
      Unfortunately no clear plan for the transition was created.  It is now
      time to fix that.
      
      This patch allows devices with numeric names, like "md999" to be
      created by writing to "new_array".  This will only work if the minor
      number given is not already in use.  This will allow mdadm to
      support the creation of arrays with numbers > 511 (currently not
      possible) by writing to new_array.
      mdadm can, at some point, use this approach to create *all* arrays,
      which will allow the transition to only using the new-way.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Acted-by: NColy Li <colyli@suse.de>
      Signed-off-by: NShaohua Li <shli@fb.com>
      039b7225
  9. 11 4月, 2017 2 次提交
  10. 23 3月, 2017 4 次提交
    • N
      MD: use per-cpu counter for writes_pending · 4ad23a97
      NeilBrown 提交于
      The 'writes_pending' counter is used to determine when the
      array is stable so that it can be marked in the superblock
      as "Clean".  Consequently it needs to be updated frequently
      but only checked for zero occasionally.  Recent changes to
      raid5 cause the count to be updated even more often - once
      per 4K rather than once per bio.  This provided
      justification for making the updates more efficient.
      
      So we replace the atomic counter a percpu-refcount.
      This can be incremented and decremented cheaply most of the
      time, and can be switched to "atomic" mode when more
      precise counting is needed.  As it is possible for multiple
      threads to want a precise count, we introduce a
      "sync_checker" counter to count the number of threads
      in "set_in_sync()", and only switch the refcount back
      to percpu mode when that is zero.
      
      We need to be careful about races between set_in_sync()
      setting ->in_sync to 1, and md_write_start() setting it
      to zero.  md_write_start() holds the rcu_read_lock()
      while checking if the refcount is in percpu mode.  If
      it is, then we know a switch to 'atomic' will not happen until
      after we call rcu_read_unlock(), in which case set_in_sync()
      will see the elevated count, and not set in_sync to 1.
      If it is not in percpu mode, we take the mddev->lock to
      ensure proper synchronization.
      
      It is no longer possible to quickly check if the count is zero, which
      we previously did to update a timer or to schedule the md_thread.
      So now we do these every time we decrement that counter, but make
      sure they are fast.
      
      mod_timer() already optimizes the case where the timeout value doesn't
      actually change.  We leverage that further by always rounding off the
      jiffies to the timeout value.  This may delay the marking of 'clean'
      slightly, but ensure we only perform atomic operation here when absolutely
      needed.
      
      md_wakeup_thread() current always calls wake_up(), even if
      THREAD_WAKEUP is already set.  That too can be optimised to avoid
      calls to wake_up().
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      4ad23a97
    • N
      md: close a race with setting mddev->in_sync · 55cc39f3
      NeilBrown 提交于
      If ->in_sync is being set just as md_write_start() is being called,
      it is possible that set_in_sync() won't see the elevated
      ->writes_pending, and md_write_start() won't see the set ->in_sync.
      
      To close this race, re-test ->writes_pending after setting ->in_sync,
      and add memory barriers to ensure the increment of ->writes_pending
      will be seen by the time of this second test, or the new ->in_sync
      will be seen by md_write_start().
      
      Add a spinlock to array_state_show() to ensure this temporary
      instability is never visible from userspace.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      55cc39f3
    • N
      md: factor out set_in_sync() · 6497709b
      NeilBrown 提交于
      Three separate places in md.c check if the number of active
      writes is zero and, if so, sets mddev->in_sync.
      
      There are a few differences, but there shouldn't be:
      - it is always appropriate to notify the change in
        sysfs_state, and there is no need to do this outside a
        spin-locked region.
      - we never need to check ->recovery_cp.  The state of resync
        is not relevant for whether there are any pending writes
        or not (which is what ->in_sync reports).
      
      So create set_in_sync() which does the correct tests and
      makes the correct changes, and call this in all three
      places.
      
      Any behaviour changes here a minor and cosmetic.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      6497709b
    • N
      md/raid5: use md_write_start to count stripes, not bios · 49728050
      NeilBrown 提交于
      We use md_write_start() to increase the count of pending writes, and
      md_write_end() to decrement the count.  We currently count bios
      submitted to md/raid5.  Change it count stripe_heads that a WRITE bio
      has been attached to.
      
      So now, raid5_make_request() calls md_write_start() and then
      md_write_end() to keep the count elevated during the setup of the
      request.
      
      add_stripe_bio() calls md_write_start() for each stripe_head, and the
      completion routines always call md_write_end(), instead of only
      calling it when raid5_dec_bi_active_stripes() returns 0.
      make_discard_request also calls md_write_start/end().
      
      The parallel between md_write_{start,end} and use of bi_phys_segments
      can be seen in that:
       Whenever we set bi_phys_segments to 1, we now call md_write_start.
       Whenever we increment it on non-read requests with
         raid5_inc_bi_active_stripes(), we now call md_write_start().
       Whenever we decrement bi_phys_segments on non-read requsts with
          raid5_dec_bi_active_stripes(), we now call md_write_end().
      
      This reduces our dependence on keeping a per-bio count of active
      stripes in bi_phys_segments.
      
      md_write_inc() is added which parallels md_write_start(), but requires
      that a write has already been started, and is certain never to sleep.
      This can be used inside a spinlocked region when adding to a write
      request.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      49728050
  11. 17 3月, 2017 6 次提交
    • G
      md: move bitmap_destroy to the beginning of __md_stop · 48df498d
      Guoqing Jiang 提交于
      Since we have switched to sync way to handle METADATA_UPDATED
      msg for md-cluster, then process_metadata_update is depended
      on mddev->thread->wqueue.
      
      With the new change, clustered raid could possible hang if
      array received a METADATA_UPDATED msg after array unregistered
      mddev->thread, so we need to stop clustered raid (bitmap_destroy
      -> bitmap_free -> md_cluster_stop) earlier than unregister
      thread (mddev_detach -> md_unregister_thread).
      
      And this change should be safe for non-clustered raid since
      all writes are stopped before the destroy. Also in md_run,
      we activate the personality (pers->run()) before activating
      the bitmap (bitmap_create()). So it is pleasingly symmetric
      to stop the bitmap (bitmap_destroy()) before stopping the
      personality (__md_stop() calls pers->free()), we achieve this
      by move bitmap_destroy to the beginning of __md_stop.
      
      But we don't want to break the codes for waiting behind IO as
      Shaohua mentioned, so introduce bitmap_wait_behind_writes to
      call the codes, and call the new fun in both mddev_detach and
      bitmap_destroy, then we will not break original behind IO code
      and also fit the new condition well.
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      48df498d
    • A
      raid5-ppl: runtime PPL enabling or disabling · ba903a3e
      Artur Paszkiewicz 提交于
      Allow writing to 'consistency_policy' attribute when the array is
      active. Add a new function 'change_consistency_policy' to the
      md_personality operations structure to handle the change in the
      personality code. Values "ppl" and "resync" are accepted and
      turn PPL on and off respectively.
      
      When enabling PPL its location and size should first be set using
      'ppl_sector' and 'ppl_size' attributes and a valid PPL header should be
      written at this location on each member device.
      
      Enabling or disabling PPL is performed under a suspended array.  The
      raid5_reset_stripe_cache function frees the stripe cache and allocates
      it again in order to allocate or free the ppl_pages for the stripes in
      the stripe cache.
      Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      ba903a3e
    • A
      md: add sysfs entries for PPL · 664aed04
      Artur Paszkiewicz 提交于
      Add 'consistency_policy' attribute for array. It indicates how the array
      maintains consistency in case of unexpected shutdown.
      
      Add 'ppl_sector' and 'ppl_size' for rdev, which describe the location
      and size of the PPL space on the device. They can't be changed for
      active members if the array is started and PPL is enabled, so in the
      setter functions only basic checks are performed. More checks are done
      in ppl_validate_rdev() when starting the log.
      
      These attributes are writable to allow enabling PPL for external
      metadata arrays and (later) to enable/disable PPL for a running array.
      Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      664aed04
    • A
      md: superblock changes for PPL · ea0213e0
      Artur Paszkiewicz 提交于
      Include information about PPL location and size into mdp_superblock_1
      and copy it to/from rdev. Because PPL is mutually exclusive with bitmap,
      put it in place of 'bitmap_offset'. Add a new flag MD_FEATURE_PPL for
      'feature_map', analogically to MD_FEATURE_BITMAP_OFFSET. Add MD_HAS_PPL
      to mddev->flags to indicate that PPL is enabled on an array.
      Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      ea0213e0
    • G
      md-cluster: add the support for resize · 818da59f
      Guoqing Jiang 提交于
      To update size for cluster raid, we need to make
      sure all nodes can perform the change successfully.
      However, it is possible that some of them can't do
      it due to failure (bitmap_resize could fail). So
      we need to consider the issue before we set the
      capacity unconditionally, and we use below steps
      to perform sanity check.
      
      1. A change the size, then broadcast METADATA_UPDATED
         msg.
      2. B and C receive METADATA_UPDATED change the size
         excepts call set_capacity, sync_size is not update
         if the change failed. Also call bitmap_update_sb
         to sync sb to disk.
      3. A checks other node's sync_size, if sync_size has
         been updated in all nodes, then send CHANGE_CAPACITY
         msg otherwise send msg to revert previous change.
      4. B and C call set_capacity if receive CHANGE_CAPACITY
         msg, otherwise pers->resize will be called to restore
         the old value.
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      818da59f
    • G
      md-cluster: use sync way to handle METADATA_UPDATED msg · 0ba95977
      Guoqing Jiang 提交于
      Previously, when node received METADATA_UPDATED msg, it just
      need to wakeup mddev->thread, then md_reload_sb will be called
      eventually.
      
      We taken the asynchronous way to avoid a deadlock issue, the
      deadlock issue could happen when one node is receiving the
      METADATA_UPDATED msg (wants reconfig_mutex) and trying to run
      the path:
      
      md_check_recovery -> mddev_trylock(hold reconfig_mutex)
                        -> md_update_sb-metadata_update_start
      		     (want EX on token however token is
      		      got by the sending node)
      
      Since we will support resizing for clustered raid, and we
      need the metadata update handling to be synchronous so that
      the initiating node can detect failure, so we need to change
      the way for handling METADATA_UPDATED msg.
      
      But, we obviously need to avoid above deadlock with the
      sync way. To make this happen, we considered to not hold
      reconfig_mutex to call md_reload_sb, if some other thread
      has already taken reconfig_mutex and waiting for the 'token',
      then process_recvd_msg() can safely call md_reload_sb()
      without taking the mutex. This is because we can be certain
      that no other thread will take the mutex, and we also certain
      that the actions performed by md_reload_sb() won't interfere
      with anything that the other thread is in the middle of.
      
      To make this more concrete, we added a new cinfo->state bit
              MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
      
      Which is set in lock_token() just before dlm_lock_sync() is
      called, and cleared just after. As lock_token() is always
      called with reconfig_mutex() held (the specific case is the
      resync_info_update which is distinguished well in previous
      patch), if process_recvd_msg() finds that the new bit is set,
      then the mutex must be held by some other thread, and it will
      keep waiting.
      
      So process_metadata_update() can call md_reload_sb() if either
      mddev_trylock() succeeds, or if MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
      is set. The tricky bit is what to do if neither of these apply.
      We need to wait. Fortunately mddev_unlock() always calls wake_up()
      on mddev->thread->wqueue. So we can get lock_token() to call
      wake_up() on that when it sets the bit.
      
      There are also some related changes inside this commit:
      1. remove RELOAD_SB related codes since there are not valid anymore.
      2. mddev is added into md_cluster_info then we can get mddev inside
         lock_token.
      3. add new parameter for lock_token to distinguish reconfig_mutex
         is held or not.
      
      And, we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in below:
      1. set it before unregister thread, otherwise a deadlock could
         appear if stop a resyncing array.
         This is because md_unregister_thread(&cinfo->recv_thread) is
         blocked by recv_daemon -> process_recvd_msg
      			  -> process_metadata_update.
         To resolve the issue, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
         also need to be set before unregister thread.
      2. set it in metadata_update_start to fix another deadlock.
      	a. Node A sends METADATA_UPDATED msg (held Token lock).
      	b. Node B wants to do resync, and is blocked since it can't
      	   get Token lock, but MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
      	   not set since the callchain
      	   (md_do_sync -> sync_request
              	       -> resync_info_update
      		       -> sendmsg
      		       -> lock_comm -> lock_token)
      	   doesn't hold reconfig_mutex.
      	c. Node B trys to update sb (held reconfig_mutex), but stopped
      	   at wait_event() in metadata_update_start since we have set
      	   MD_CLUSTER_SEND_LOCK flag in lock_comm (step 2).
      	d. Then Node B receives METADATA_UPDATED msg from A, of course
      	   recv_daemon is blocked forever.
         Since metadata_update_start always calls lock_token with reconfig_mutex,
         we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD here as well, and
         lock_token don't need to set it twice unless lock_token is invoked from
         lock_comm.
      
      Finally, thanks to Neil for his great idea and help!
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      0ba95977
  12. 11 3月, 2017 2 次提交
  13. 10 3月, 2017 3 次提交
  14. 02 3月, 2017 1 次提交
  15. 16 2月, 2017 3 次提交
  16. 14 2月, 2017 1 次提交
    • N
      md: ensure md devices are freed before module is unloaded. · 9356863c
      NeilBrown 提交于
      Commit: cbd19983 ("md: Fix unfortunate interaction with evms")
      change mddev_put() so that it would not destroy an md device while
      ->ctime was non-zero.
      
      Unfortunately, we didn't make sure to clear ->ctime when unloading
      the module, so it is possible for an md device to remain after
      module unload.  An attempt to open such a device will trigger
      an invalid memory reference in:
        get_gendisk -> kobj_lookup -> exact_lock -> get_disk
      
      when tring to access disk->fops, which was in the module that has
      been removed.
      
      So ensure we clear ->ctime in md_exit(), and explain how that is useful,
      as it isn't immediately obvious when looking at the code.
      
      Fixes: cbd19983 ("md: Fix unfortunate interaction with evms")
      Tested-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      9356863c
  17. 02 2月, 2017 1 次提交
  18. 25 1月, 2017 1 次提交
    • S
      md/r5cache: flush data only stripes in r5l_recovery_log() · a85dd7b8
      Song Liu 提交于
      For safer operation, all arrays start in write-through mode, which has been
      better tested and is more mature. And actually the write-through/write-mode
      isn't persistent after array restarted, so we always start array in
      write-through mode. However, if recovery found data-only stripes before the
      shutdown (from previous write-back mode), it is not safe to start the array in
      write-through mode, as write-through mode can not handle stripes with data in
      write-back cache. To solve this problem, we flush all data-only stripes in
      r5l_recovery_log(). When r5l_recovery_log() returns, the array starts with
      empty cache in write-through mode.
      
      This logic is implemented in r5c_recovery_flush_data_only_stripes():
      
      1. enable write back cache
      2. flush all stripes
      3. wake up conf->mddev->thread
      4. wait for all stripes get flushed (reuse wait_for_quiescent)
      5. disable write back cache
      
      The wait in 4 will be waked up in release_inactive_stripe_list()
      when conf->active_stripes reaches 0.
      
      It is safe to wake up mddev->thread here because all the resource
      required for the thread has been initialized.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      a85dd7b8
  19. 09 12月, 2016 2 次提交
  20. 06 12月, 2016 1 次提交
  21. 24 11月, 2016 2 次提交
  22. 23 11月, 2016 2 次提交
    • N
      md: Use REQ_FAILFAST_* on metadata writes where appropriate · 46533ff7
      NeilBrown 提交于
      This can only be supported on personalities which ensure
      that md_error() never causes an array to enter the 'failed'
      state.  i.e. if marking a device Faulty would cause some
      data to be inaccessible, the device is status is left as
      non-Faulty.  This is true for RAID1 and RAID10.
      
      If we get a failure writing metadata but the device doesn't
      fail, it must be the last device so we re-write without
      FAILFAST to improve chance of success.  We also flag the
      device as LastDev so that future metadata updates don't
      waste time on failfast writes.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      46533ff7
    • N
      md/failfast: add failfast flag for md to be used by some personalities. · 688834e6
      NeilBrown 提交于
      This patch just adds a 'failfast' per-device flag which can be stored
      in v0.90 or v1.x metadata.
      The flag is not used yet but the intent is that it can be used for
      mirrored (raid1/raid10) arrays where low latency is more important
      than keeping all devices on-line.
      
      Setting the flag for a device effectively gives permission for that
      device to be marked as Faulty and excluded from the array on the first
      error.  The underlying driver will be directed not to retry requests
      that result in failures.  There is a proviso that the device must not
      be marked faulty if that would cause the array as a whole to fail, it
      may only be marked Faulty if the array remains functional, but is
      degraded.
      
      Failures on read requests will cause the device to be marked
      as Faulty immediately so that further reads will avoid that
      device.  No attempt will be made to correct read errors by
      over-writing with the correct data.
      
      It is expected that if transient errors, such as cable unplug, are
      possible, then something in user-space will revalidate failed
      devices and re-add them when they appear to be working again.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      688834e6