1. 02 11月, 2017 3 次提交
    • S
      md: use lockdep_assert_held · efa4b77b
      Shaohua Li 提交于
      lockdep_assert_held is a better way to assert lock held, and it works
      for UP.
      Signed-off-by: NShaohua Li <shli@fb.com>
      efa4b77b
    • N
      md: remove special meaning of ->quiesce(.., 2) · b03e0ccb
      NeilBrown 提交于
      The '2' argument means "wake up anything that is waiting".
      This is an inelegant part of the design and was added
      to help support management of suspend_lo/suspend_hi setting.
      Now that suspend_lo/hi is managed in mddev_suspend/resume,
      that need is gone.
      These is still a couple of places where we call 'quiesce'
      with an argument of '2', but they can safely be changed to
      call ->quiesce(.., 1); ->quiesce(.., 0) which
      achieve the same result at the small cost of pausing IO
      briefly.
      
      This removes a small "optimization" from suspend_{hi,lo}_store,
      but it isn't clear that optimization served a useful purpose.
      The code now is a lot clearer.
      Suggested-by: NShaohua Li <shli@kernel.org>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      b03e0ccb
    • N
      md: allow metadata update while suspending. · 35bfc521
      NeilBrown 提交于
      There are various deadlocks that can occur
      when a thread holds reconfig_mutex and calls
      ->quiesce(mddev, 1).
      As some write request block waiting for
      metadata to be updated (e.g. to record device
      failure), and as the md thread updates the metadata
      while the reconfig mutex is held, holding the mutex
      can stop write requests completing, and this prevents
      ->quiesce(mddev, 1) from completing.
      
      ->quiesce() is now usually called from mddev_suspend(),
      and it is always called with reconfig_mutex held.  So
      at this time it is safe for the thread to update metadata
      without explicitly taking the lock.
      
      So add 2 new flags, one which says the unlocked updates is
      allowed, and one which ways it is happening.  Then allow it
      while the quiesce completes, and then wait for it to finish.
      Reported-and-tested-by: NXiao Ni <xni@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      35bfc521
  2. 28 9月, 2017 1 次提交
  3. 28 8月, 2017 1 次提交
  4. 24 8月, 2017 1 次提交
    • C
      block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Christoph Hellwig 提交于
      This way we don't need a block_device structure to submit I/O.  The
      block_device has different life time rules from the gendisk and
      request_queue and is usually only available when the block device node
      is open.  Other callers need to explicitly create one (e.g. the lightnvm
      passthrough code, or the new nvme multipathing code).
      
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      
      Note that all the block drivers generally want request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      74d46992
  5. 22 7月, 2017 2 次提交
  6. 11 7月, 2017 1 次提交
  7. 22 6月, 2017 1 次提交
    • N
      md: use a separate bio_set for synchronous IO. · 5a85071c
      NeilBrown 提交于
      md devices allocate a bio_set and use it for two
      distinct purposes.
      mddev->bio_set is used to clone bios as part of sending
      upper level requests down to lower level devices,
      and it is also use for synchronous IO such as superblock
      and bitmap updates, and for correcting read errors.
      
      This multiple usage can lead to deadlocks.  It is likely
      that cloned bios might be queued for write and to be
      waiting for a metadata update before the write can be permitted.
      If the cloning exhausted mddev->bio_set, the metadata update
      may not be able to proceed.
      
      This scenario has been seen during heavy testing, with lots of IO and
      lots of memory pressure.
      
      Address this by adding a new bio_set specifically for synchronous IO.
      All synchronous IO goes directly to the underlying device and is not
      queued at the md level, so request using entries from the new
      mddev->sync_set will complete in a timely fashion.
      Requests that use mddev->bio_set will sometimes need to wait
      for synchronous IO, but will no longer risk deadlocking that iO.
      
      Also: small simplification in mddev_put(): there is no need to
      wait until the spinlock is released before calling bioset_free().
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      5a85071c
  8. 14 6月, 2017 1 次提交
    • N
      md: fix deadlock between mddev_suspend() and md_write_start() · cc27b0c7
      NeilBrown 提交于
      If mddev_suspend() races with md_write_start() we can deadlock
      with mddev_suspend() waiting for the request that is currently
      in md_write_start() to complete the ->make_request() call,
      and md_write_start() waiting for the metadata to be updated
      to mark the array as 'dirty'.
      As metadata updates done by md_check_recovery() only happen then
      the mddev_lock() can be claimed, and as mddev_suspend() is often
      called with the lock held, these threads wait indefinitely for each
      other.
      
      We fix this by having md_write_start() abort if mddev_suspend()
      is happening, and ->make_request() aborts if md_write_start()
      aborted.
      md_make_request() can detect this abort, decrease the ->active_io
      count, and wait for mddev_suspend().
      Reported-by: NNix <nix@esperi.org.uk>
      Fix: 68866e42(MD: no sync IO while suspended)
      Cc: stable@vger.kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      cc27b0c7
  9. 06 6月, 2017 1 次提交
  10. 09 5月, 2017 1 次提交
  11. 09 4月, 2017 1 次提交
  12. 25 3月, 2017 2 次提交
  13. 23 3月, 2017 2 次提交
    • N
      MD: use per-cpu counter for writes_pending · 4ad23a97
      NeilBrown 提交于
      The 'writes_pending' counter is used to determine when the
      array is stable so that it can be marked in the superblock
      as "Clean".  Consequently it needs to be updated frequently
      but only checked for zero occasionally.  Recent changes to
      raid5 cause the count to be updated even more often - once
      per 4K rather than once per bio.  This provided
      justification for making the updates more efficient.
      
      So we replace the atomic counter a percpu-refcount.
      This can be incremented and decremented cheaply most of the
      time, and can be switched to "atomic" mode when more
      precise counting is needed.  As it is possible for multiple
      threads to want a precise count, we introduce a
      "sync_checker" counter to count the number of threads
      in "set_in_sync()", and only switch the refcount back
      to percpu mode when that is zero.
      
      We need to be careful about races between set_in_sync()
      setting ->in_sync to 1, and md_write_start() setting it
      to zero.  md_write_start() holds the rcu_read_lock()
      while checking if the refcount is in percpu mode.  If
      it is, then we know a switch to 'atomic' will not happen until
      after we call rcu_read_unlock(), in which case set_in_sync()
      will see the elevated count, and not set in_sync to 1.
      If it is not in percpu mode, we take the mddev->lock to
      ensure proper synchronization.
      
      It is no longer possible to quickly check if the count is zero, which
      we previously did to update a timer or to schedule the md_thread.
      So now we do these every time we decrement that counter, but make
      sure they are fast.
      
      mod_timer() already optimizes the case where the timeout value doesn't
      actually change.  We leverage that further by always rounding off the
      jiffies to the timeout value.  This may delay the marking of 'clean'
      slightly, but ensure we only perform atomic operation here when absolutely
      needed.
      
      md_wakeup_thread() current always calls wake_up(), even if
      THREAD_WAKEUP is already set.  That too can be optimised to avoid
      calls to wake_up().
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      4ad23a97
    • N
      md/raid5: use md_write_start to count stripes, not bios · 49728050
      NeilBrown 提交于
      We use md_write_start() to increase the count of pending writes, and
      md_write_end() to decrement the count.  We currently count bios
      submitted to md/raid5.  Change it count stripe_heads that a WRITE bio
      has been attached to.
      
      So now, raid5_make_request() calls md_write_start() and then
      md_write_end() to keep the count elevated during the setup of the
      request.
      
      add_stripe_bio() calls md_write_start() for each stripe_head, and the
      completion routines always call md_write_end(), instead of only
      calling it when raid5_dec_bi_active_stripes() returns 0.
      make_discard_request also calls md_write_start/end().
      
      The parallel between md_write_{start,end} and use of bi_phys_segments
      can be seen in that:
       Whenever we set bi_phys_segments to 1, we now call md_write_start.
       Whenever we increment it on non-read requests with
         raid5_inc_bi_active_stripes(), we now call md_write_start().
       Whenever we decrement bi_phys_segments on non-read requsts with
          raid5_dec_bi_active_stripes(), we now call md_write_end().
      
      This reduces our dependence on keeping a per-bio count of active
      stripes in bi_phys_segments.
      
      md_write_inc() is added which parallels md_write_start(), but requires
      that a write has already been started, and is certain never to sleep.
      This can be used inside a spinlocked region when adding to a write
      request.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      49728050
  14. 17 3月, 2017 3 次提交
    • A
      raid5-ppl: runtime PPL enabling or disabling · ba903a3e
      Artur Paszkiewicz 提交于
      Allow writing to 'consistency_policy' attribute when the array is
      active. Add a new function 'change_consistency_policy' to the
      md_personality operations structure to handle the change in the
      personality code. Values "ppl" and "resync" are accepted and
      turn PPL on and off respectively.
      
      When enabling PPL its location and size should first be set using
      'ppl_sector' and 'ppl_size' attributes and a valid PPL header should be
      written at this location on each member device.
      
      Enabling or disabling PPL is performed under a suspended array.  The
      raid5_reset_stripe_cache function frees the stripe cache and allocates
      it again in order to allocate or free the ppl_pages for the stripes in
      the stripe cache.
      Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      ba903a3e
    • A
      md: superblock changes for PPL · ea0213e0
      Artur Paszkiewicz 提交于
      Include information about PPL location and size into mdp_superblock_1
      and copy it to/from rdev. Because PPL is mutually exclusive with bitmap,
      put it in place of 'bitmap_offset'. Add a new flag MD_FEATURE_PPL for
      'feature_map', analogically to MD_FEATURE_BITMAP_OFFSET. Add MD_HAS_PPL
      to mddev->flags to indicate that PPL is enabled on an array.
      Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      ea0213e0
    • G
      md-cluster: use sync way to handle METADATA_UPDATED msg · 0ba95977
      Guoqing Jiang 提交于
      Previously, when node received METADATA_UPDATED msg, it just
      need to wakeup mddev->thread, then md_reload_sb will be called
      eventually.
      
      We taken the asynchronous way to avoid a deadlock issue, the
      deadlock issue could happen when one node is receiving the
      METADATA_UPDATED msg (wants reconfig_mutex) and trying to run
      the path:
      
      md_check_recovery -> mddev_trylock(hold reconfig_mutex)
                        -> md_update_sb-metadata_update_start
      		     (want EX on token however token is
      		      got by the sending node)
      
      Since we will support resizing for clustered raid, and we
      need the metadata update handling to be synchronous so that
      the initiating node can detect failure, so we need to change
      the way for handling METADATA_UPDATED msg.
      
      But, we obviously need to avoid above deadlock with the
      sync way. To make this happen, we considered to not hold
      reconfig_mutex to call md_reload_sb, if some other thread
      has already taken reconfig_mutex and waiting for the 'token',
      then process_recvd_msg() can safely call md_reload_sb()
      without taking the mutex. This is because we can be certain
      that no other thread will take the mutex, and we also certain
      that the actions performed by md_reload_sb() won't interfere
      with anything that the other thread is in the middle of.
      
      To make this more concrete, we added a new cinfo->state bit
              MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
      
      Which is set in lock_token() just before dlm_lock_sync() is
      called, and cleared just after. As lock_token() is always
      called with reconfig_mutex() held (the specific case is the
      resync_info_update which is distinguished well in previous
      patch), if process_recvd_msg() finds that the new bit is set,
      then the mutex must be held by some other thread, and it will
      keep waiting.
      
      So process_metadata_update() can call md_reload_sb() if either
      mddev_trylock() succeeds, or if MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
      is set. The tricky bit is what to do if neither of these apply.
      We need to wait. Fortunately mddev_unlock() always calls wake_up()
      on mddev->thread->wqueue. So we can get lock_token() to call
      wake_up() on that when it sets the bit.
      
      There are also some related changes inside this commit:
      1. remove RELOAD_SB related codes since there are not valid anymore.
      2. mddev is added into md_cluster_info then we can get mddev inside
         lock_token.
      3. add new parameter for lock_token to distinguish reconfig_mutex
         is held or not.
      
      And, we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in below:
      1. set it before unregister thread, otherwise a deadlock could
         appear if stop a resyncing array.
         This is because md_unregister_thread(&cinfo->recv_thread) is
         blocked by recv_daemon -> process_recvd_msg
      			  -> process_metadata_update.
         To resolve the issue, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
         also need to be set before unregister thread.
      2. set it in metadata_update_start to fix another deadlock.
      	a. Node A sends METADATA_UPDATED msg (held Token lock).
      	b. Node B wants to do resync, and is blocked since it can't
      	   get Token lock, but MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
      	   not set since the callchain
      	   (md_do_sync -> sync_request
              	       -> resync_info_update
      		       -> sendmsg
      		       -> lock_comm -> lock_token)
      	   doesn't hold reconfig_mutex.
      	c. Node B trys to update sb (held reconfig_mutex), but stopped
      	   at wait_event() in metadata_update_start since we have set
      	   MD_CLUSTER_SEND_LOCK flag in lock_comm (step 2).
      	d. Then Node B receives METADATA_UPDATED msg from A, of course
      	   recv_daemon is blocked forever.
         Since metadata_update_start always calls lock_token with reconfig_mutex,
         we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD here as well, and
         lock_token don't need to set it twice unless lock_token is invoked from
         lock_comm.
      
      Finally, thanks to Neil for his great idea and help!
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      0ba95977
  15. 10 3月, 2017 1 次提交
  16. 16 2月, 2017 1 次提交
    • M
      md: fast clone bio in bio_clone_mddev() · d7a10308
      Ming Lei 提交于
      Firstly bio_clone_mddev() is used in raid normal I/O and isn't
      in resync I/O path.
      
      Secondly all the direct access to bvec table in raid happens on
      resync I/O except for write behind of raid1, in which we still
      use bio_clone() for allocating new bvec table.
      
      So this patch replaces bio_clone() with bio_clone_fast()
      in bio_clone_mddev().
      
      Also kill bio_clone_mddev() and call bio_clone_fast() directly, as
      suggested by Christoph Hellwig.
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NMing Lei <tom.leiming@gmail.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      d7a10308
  17. 14 2月, 2017 1 次提交
  18. 06 1月, 2017 1 次提交
  19. 09 12月, 2016 1 次提交
    • S
      md: separate flags for superblock changes · 2953079c
      Shaohua Li 提交于
      The mddev->flags are used for different purposes. There are a lot of
      places we check/change the flags without masking unrelated flags, we
      could check/change unrelated flags. These usage are most for superblock
      write, so spearate superblock related flags. This should make the code
      clearer and also fix real bugs.
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      2953079c
  20. 23 11月, 2016 2 次提交
    • N
      md: Use REQ_FAILFAST_* on metadata writes where appropriate · 46533ff7
      NeilBrown 提交于
      This can only be supported on personalities which ensure
      that md_error() never causes an array to enter the 'failed'
      state.  i.e. if marking a device Faulty would cause some
      data to be inaccessible, the device is status is left as
      non-Faulty.  This is true for RAID1 and RAID10.
      
      If we get a failure writing metadata but the device doesn't
      fail, it must be the last device so we re-write without
      FAILFAST to improve chance of success.  We also flag the
      device as LastDev so that future metadata updates don't
      waste time on failfast writes.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      46533ff7
    • N
      md/failfast: add failfast flag for md to be used by some personalities. · 688834e6
      NeilBrown 提交于
      This patch just adds a 'failfast' per-device flag which can be stored
      in v0.90 or v1.x metadata.
      The flag is not used yet but the intent is that it can be used for
      mirrored (raid1/raid10) arrays where low latency is more important
      than keeping all devices on-line.
      
      Setting the flag for a device effectively gives permission for that
      device to be marked as Faulty and excluded from the array on the first
      error.  The underlying driver will be directed not to retry requests
      that result in failures.  There is a proviso that the device must not
      be marked faulty if that would cause the array as a whole to fail, it
      may only be marked Faulty if the array remains functional, but is
      degraded.
      
      Failures on read requests will cause the device to be marked
      as Faulty immediately so that further reads will avoid that
      device.  No attempt will be made to correct read errors by
      over-writing with the correct data.
      
      It is expected that if transient errors, such as cable unplug, are
      possible, then something in user-space will revalidate failed
      devices and re-add them when they appear to be working again.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      688834e6
  21. 10 11月, 2016 1 次提交
  22. 08 11月, 2016 1 次提交
    • T
      md: add bad block support for external metadata · 35b785f7
      Tomasz Majchrzak 提交于
      Add new rdev flag which external metadata handler can use to switch
      on/off bad block support. If new bad block is encountered, notify it via
      rdev 'unacknowledged_bad_blocks' sysfs file. If bad block has been
      cleared, notify update to rdev 'bad_blocks' sysfs file.
      
      When bad blocks support is being removed, just clear rdev flag. It is
      not necessary to reset badblocks->shift field. If there are bad blocks
      cleared or added at the same time, it is ok for those changes to be
      applied to the structure. The array is in blocked state and the drive
      which cannot handle bad blocks any more will be removed from the array
      before it is unlocked.
      
      Simplify state_show function by adding a separator at the end of each
      string and overwrite last separator with new line.
      Signed-off-by: NTomasz Majchrzak <tomasz.majchrzak@intel.com>
      Reviewed-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      35b785f7
  23. 22 9月, 2016 1 次提交
    • G
      md: changes for MD_STILL_CLOSED flag · af8d8e6f
      Guoqing Jiang 提交于
      When stop clustered raid while it is pending on resync,
      MD_STILL_CLOSED flag could be cleared since udev rule
      is triggered to open the mddev. So obviously array can't
      be stopped soon and returns EBUSY.
      
      	mdadm -Ss          md-raid-arrays.rules
        set MD_STILL_CLOSED          md_open()
      	... ... ...          clear MD_STILL_CLOSED
      	do_md_stop
      
      We make below changes to resolve this issue:
      
      1. rename MD_STILL_CLOSED to MD_CLOSING since it is set
         when stop array and it means we are stopping array.
      2. let md_open returns early if CLOSING is set, so no
         other threads will open array if one thread is trying
         to close it.
      3. no need to clear CLOSING bit in md_open because 1 has
         ensure the bit is cleared, then we also don't need to
         test CLOSING bit in do_md_stop.
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      af8d8e6f
  24. 20 7月, 2016 1 次提交
    • A
      md: use seconds granularity for error logging · 0e3ef49e
      Arnd Bergmann 提交于
      The md code stores the exact time of the last error in the
      last_read_error variable using a timespec structure. It only
      ever uses the seconds portion of that though, so we can
      use a scalar for it.
      
      There won't be an overflow in 2038 here, because it already
      used monotonic time and 32-bit is enough for that, but I've
      decided to use time64_t for consistency in the conversion.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NShaohua Li <shli@fb.com>
      0e3ef49e
  25. 14 6月, 2016 1 次提交
    • N
      md: reduce the number of synchronize_rcu() calls when multiple devices fail. · d787be40
      NeilBrown 提交于
      Every time a device is removed with ->hot_remove_disk() a synchronize_rcu() call is made
      which can delay several milliseconds in some case.
      If lots of devices fail at once - as could happen with a large RAID10 where one set
      of devices are removed all at once - these delays can add up to be very inconcenient.
      
      As failure is not reversible we can check for that first, setting a
      separate flag if it is found, and then all synchronize_rcu() once for
      all the flagged devices.  Then ->hot_remove_disk() function can skip the
      synchronize_rcu() step if the flag is set.
      
      fix build error(Shaohua)
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      d787be40
  26. 08 6月, 2016 2 次提交
  27. 04 6月, 2016 1 次提交
    • G
      md-cluster: fix deadlock issue when add disk to an recoverying array · bb8bf15b
      Guoqing Jiang 提交于
      Add a disk to an array which is performing recovery
      is a little complicated, we need to do both reap the
      sync thread and perform add disk for the case, then
      it caused deadlock as follows.
      
      linux44:~ # ps aux|grep md|grep D
      root      1822  0.0  0.0      0     0 ?        D    16:50   0:00 [md127_resync]
      root      1848  0.0  0.0  19860   952 pts/0    D+   16:50   0:00 mdadm --manage /dev/md127 --re-add /dev/vdb
      linux44:~ # cat /proc/1848/stack
      [<ffffffff8107afde>] kthread_stop+0x6e/0x120
      [<ffffffffa051ddb0>] md_unregister_thread+0x40/0x80 [md_mod]
      [<ffffffffa0526e45>] md_reap_sync_thread+0x15/0x150 [md_mod]
      [<ffffffffa05271e0>] action_store+0x260/0x270 [md_mod]
      [<ffffffffa05206b4>] md_attr_store+0xb4/0x100 [md_mod]
      [<ffffffff81214a7e>] sysfs_write_file+0xbe/0x140
      [<ffffffff811a6b98>] vfs_write+0xb8/0x1e0
      [<ffffffff811a75b8>] SyS_write+0x48/0xa0
      [<ffffffff8152a5c9>] system_call_fastpath+0x16/0x1b
      [<00007f068ea1ed30>] 0x7f068ea1ed30
      linux44:~ # cat /proc/1822/stack
      [<ffffffffa05251a6>] md_do_sync+0x846/0xf40 [md_mod]
      [<ffffffffa052402d>] md_thread+0x16d/0x180 [md_mod]
      [<ffffffff8107ad94>] kthread+0xb4/0xc0
      [<ffffffff8152a518>] ret_from_fork+0x58/0x90
      
                              Task1848                                Task1822
      md_attr_store (held reconfig_mutex by call mddev_lock())
                              action_store
      			md_reap_sync_thread
      			md_unregister_thread
      			kthread_stop                    md_wakeup_thread(mddev->thread);
      						wait_event(mddev->sb_wait, !test_bit(MD_CHANGE_PENDING))
      
      md_check_recovery is triggered by wakeup mddev->thread,
      but it can't clear MD_CHANGE_PENDING flag since it can't
      get lock which was held by md_attr_store already.
      
      To solve the deadlock problem, we move "->resync_finish()"
      from md_do_sync to md_reap_sync_thread (after md_update_sb),
      also MD_HELD_RESYNC_LOCK is introduced since it is possible
      that node can't get resync lock in md_do_sync.
      
      Then we do not need to wait for MD_CHANGE_PENDING is cleared
      or not since metadata should be updated after md_update_sb,
      so just call resync_finish if MD_HELD_RESYNC_LOCK is set.
      
      We also unified the code after skip label, since set PENDING
      for non-clustered case should be harmless.
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      bb8bf15b
  28. 14 1月, 2016 1 次提交
    • D
      md/raid: only permit hot-add of compatible integrity profiles · 1501efad
      Dan Williams 提交于
      It is not safe for an integrity profile to be changed while i/o is
      in-flight in the queue.  Prevent adding new disks or otherwise online
      spares to an array if the device has an incompatible integrity profile.
      
      The original change to the blk_integrity_unregister implementation in
      md, commmit c7bfced9 "md: suspend i/o during runtime
      blk_integrity_unregister" introduced an immediate hang regression.
      
      This policy of disallowing changes the integrity profile once one has
      been established is shared with DM.
      
      Here is an abbreviated log from a test run that:
      1/ Creates a degraded raid1 with an integrity-enabled device (pmem0s) [   59.076127]
      2/ Tries to add an integrity-disabled device (pmem1m) [   90.489209]
      3/ Retries with an integrity-enabled device (pmem1s) [  205.671277]
      
      [   59.076127] md/raid1:md0: active with 1 out of 2 mirrors
      [   59.078302] md: data integrity enabled on md0
      [..]
      [   90.489209] md0: incompatible integrity profile for pmem1m
      [..]
      [  205.671277] md: super_written gets error=-5
      [  205.677386] md/raid1:md0: Disk failure on pmem1m, disabling device.
      [  205.677386] md/raid1:md0: Operation continuing on 1 devices.
      [  205.683037] RAID1 conf printout:
      [  205.684699]  --- wd:1 rd:2
      [  205.685972]  disk 0, wo:0, o:1, dev:pmem0s
      [  205.687562]  disk 1, wo:1, o:1, dev:pmem1s
      [  205.691717] md: recovery of RAID array md0
      
      Fixes: c7bfced9 ("md: suspend i/o during runtime blk_integrity_unregister")
      Cc: <stable@vger.kernel.org>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Reported-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      1501efad
  29. 10 1月, 2016 1 次提交
  30. 07 1月, 2016 1 次提交
    • N
      md: Remove 'ready' field from mddev. · 274d8cbd
      NeilBrown 提交于
      This field is always set in tandem with ->pers, and when it is tested
      ->pers is also tested.  So ->ready is not needed.
      
      It was needed once, but code rearrangement and locking changes have
      removed that needed.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      274d8cbd
  31. 06 1月, 2016 1 次提交
    • D
      drivers: md: use ktime_get_real_seconds() · 9ebc6ef1
      Deepa Dinamani 提交于
      get_seconds() API is not y2038 safe on 32 bit systems and the API
      is deprecated. Replace it with calls to ktime_get_real_seconds()
      API instead. Change mddev structure types to time64_t accordingly.
      
      32 bit signed timestamps will overflow in the year 2038.
      
      Change the user interface mdu_array_info_s structure timestamps:
      ctime and utime values used in ioctls GET_ARRAY_INFO and
      SET_ARRAY_INFO to unsigned int. This will extend the field to last
      until the year 2106.
      The long term plan is to get rid of ctime and utime values in
      this structure as this information can be read from the on-disk
      meta data directly.
      
      Clamp the tim64_t timestamps to positive values with a max of U32_MAX
      when returning from GET_ARRAY_INFO ioctl to accommodate above changes
      in the data type of timestamps to unsigned int.
      
      v0.90 on disk meta data uses u32 for maintaining time stamps.
      So this will also last until year 2106.
      Assumption is that the usage of v0.90 will be deprecated by
      year 2106.
      
      Timestamp fields in the on disk meta data for v1.0 version already
      use 64 bit data types. Remove the truncation of the bits while
      writing to or reading from these from the disk.
      Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com>
      Reviewed-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      9ebc6ef1