1. 10 Jan, 2016 2 commits
  2. 08 Nov, 2015 1 commit
  3. 01 Nov, 2015 8 commits
  4. 31 Oct, 2015 1 commit
    • N
      Revert "md: allow a partially recovered device to be hot-added to an array." · d01552a7
      NeilBrown committed
      This reverts commit 7eb41885.
      
      This commit is poorly justified, I can find no discussion in email,
      and it clearly causes a problem.
      
      If a device which is being recovered fails and is subsequently
      re-added to an array, there could easily have been changes to the
      array *before* the point where the recovery was up to.  So the
      recovery must start again from the beginning.
      
      If a spare is being recovered and fails, then when it is re-added we
      really should do a bitmap-based recovery up to the recovery-offset,
      and then a full recovery from there.  Before this reversion, we only
      did the "full recovery from there" which is not correct.  After this
      reversion we will do a full recovery from the start, which is safer
      but not ideal.
      
      It will be left to a future patch to arrange the two different styles
      of recovery.
      Reported-and-tested-by: Nate Dailey <nate.dailey@stratus.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      Cc: stable@vger.kernel.org (3.14+)
      Fixes: 7eb41885 ("md: allow a partially recovered device to be hot-added to an array.")
      d01552a7
  5. 24 Oct, 2015 4 commits
  6. 22 Oct, 2015 3 commits
  7. 13 Oct, 2015 2 commits
  8. 12 Oct, 2015 7 commits
    • G
      md-cluster: Fix adding of new disk with new reload code · dbb64f86
      Goldwyn Rodrigues committed
      Adding the disk worked incorrectly with the new reload code. Fix it:
      
       - No operation should be performed on rdev marked as Candidate
       - After a metadata update operation, kick disk if role is 0xfffe
         else clear Candidate bit and continue with the regular change check.
       - Save the mode of the lock resource to check whether the token lock is
         already held, because it can be called twice while adding a disk.
         However, unlock_comm() must be called only once.
       - add_new_disk() is called by the node initiating the --add operation.
         If it needs to be canceled, call add_new_disk_cancel(). The operation
         is completed by md_update_sb() which will write and unlock the
         communication.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      dbb64f86
    • G
      md-cluster: Perform resync/recovery under a DLM lock · c186b128
      Goldwyn Rodrigues committed
      Resync or recovery must be performed by only one node at a time.
      A DLM lock resource, resync_lockres, provides the mutual exclusion
      so that only one node performs the recovery/resync at a time.
      
      If a node is unable to get the resync_lockres, because recovery is
      being performed by another node, it sets MD_RECOVERY_NEEDED so as
      to schedule recovery in the future.
      
      Remove the debug message in resync_info_update()
      used during development.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      c186b128
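The mutual exclusion described in this commit can be sketched in a few lines of user-space C. This is only an illustration under stated assumptions: `struct cluster_lock`, `struct node`, `try_start_resync()` and `finish_resync()` are hypothetical names, and a plain owner field stands in for the DLM lock resource. None of it is the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the MD_RECOVERY_NEEDED flag. */
enum { RECOVERY_NEEDED = 1 << 0 };
#define LOCK_FREE (-1)

/* Assumption: one owner field models the cluster-wide resync lock. */
struct cluster_lock { int owner; };
struct node { int id; unsigned flags; };

/* Try to become the single node that performs resync/recovery. */
bool try_start_resync(struct node *n, struct cluster_lock *l)
{
    if (l->owner != LOCK_FREE && l->owner != n->id) {
        /* Another node holds the lock: remember to retry later. */
        n->flags |= RECOVERY_NEEDED;
        return false;
    }
    l->owner = n->id;
    n->flags &= ~(unsigned)RECOVERY_NEEDED;
    return true;
}

/* Release the lock once this node's resync has finished. */
void finish_resync(struct node *n, struct cluster_lock *l)
{
    assert(l->owner == n->id);
    l->owner = LOCK_FREE;
}
```

A node that loses the race keeps `RECOVERY_NEEDED` set, so a later call to `try_start_resync()` can pick the work up once the owner has called `finish_resync()`.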
    • G
      md-cluster: Perform a lazy update · 2aa82191
      Goldwyn Rodrigues committed
      In a clustered environment, a change such as marking a device faulty,
      can be recorded by any of the nodes. This is communicated to all the
      nodes and re-recording such a change is unnecessary, and quite often
      pretty disruptive.
      
      With this patch, just before the update, we check for changes,
      and if the changes are already in the superblock, we abort the update
      after clearing all the flags.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      2aa82191
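A minimal sketch of the lazy-update idea follows: compare the state a node wants to record with what the superblock already holds, and skip the write if nothing differs. All names here (`struct superblock`, `struct array_state`, `update_superblock()`, the flag bits, the fixed `NDEVS`) are assumptions made for illustration and do not mirror the kernel's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NDEVS 4

/* Hypothetical flag bits; the kernel's MD_CHANGE_* flags are not modelled. */
enum { CHANGE_DEVS = 1 << 0, CHANGE_PENDING = 1 << 1 };

struct superblock { unsigned short dev_roles[NDEVS]; };  /* recorded state */
struct array_state {
    unsigned short roles[NDEVS];  /* state this node wants recorded */
    unsigned flags;               /* pending-change flags */
    int writes;                   /* number of superblock writes issued */
};

/* Returns true if a write was issued, false if the update was skipped. */
bool update_superblock(struct array_state *st, struct superblock *sb)
{
    assert(st && sb);
    if (memcmp(st->roles, sb->dev_roles, sizeof sb->dev_roles) == 0) {
        /* Another node already recorded this change: clear flags, no write. */
        st->flags = 0;
        return false;
    }
    memcpy(sb->dev_roles, st->roles, sizeof sb->dev_roles);
    st->writes++;
    st->flags = 0;
    return true;
}
```

In this model the second node to notice the same failure calls `update_superblock()`, finds the roles already match, and returns without touching the disk.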
    • G
      md-cluster: Improve md_reload_sb to be less error prone · 70bcecdb
      Goldwyn Rodrigues committed
      md_reload_sb is too simplistic and it explicitly needs to determine
      the changes made by the writing node. However, there are multiple areas
      where a simple reload could fail.
      
      Instead, read the superblock of one of the "good" rdevs and update
      the necessary information:
      
      - read the superblock into a newly allocated page, by temporarily
        swapping out rdev->sb_page and calling ->load_super.
      - if that fails, return
      - if it succeeds, call check_sb_changes, which:
        1. iterates over the list of active devices and checks the matching
           dev_roles[] value.
           If that is 'faulty', the device must be marked as faulty:
            - call md_error to mark the device as faulty. Make sure
              not to set CHANGE_DEVS and wake up mddev->thread, or else
              it would initiate a resync process, which is the responsibility
              of the "primary" node.
            - clear the Blocked bit
            - call remove_and_add_spares() to hot remove the device.
           If the device is 'spare':
            - call remove_and_add_spares() to get the number of spares
              added in this operation.
            - reduce mddev->degraded to mark the array as not degraded.
        2. resets recovery_cp
      - read the rest of the rdevs to update recovery_offset. If recovery_offset
        is equal to MaxSector, call spare_active() to set it In_sync
      
      This required that recovery_offset be initialized to MaxSector, as
      opposed to zero so as to communicate the end of sync for a rdev.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      70bcecdb
    • G
      md: remove_and_add_spares() to activate specific rdev · 2910ff17
      Goldwyn Rodrigues committed
      remove_and_add_spares() checks all devices when activating spares.
      Change it to activate a specific device if a non-null rdev
      argument is passed.
      
      remove_and_add_spares() can be used to activate spares in
      slot_store() as well.
      
      For hot_remove_disk(), check if rdev->raid_disk == -1 before
      calling remove_and_add_spares()
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      2910ff17
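The "all devices, or one specific device" calling convention can be illustrated with a small sketch. Everything below is hypothetical: `struct rdev` is reduced to two fields, and `add_spares()` with its `first_free_slot` parameter is an invented helper that only models the optional-filter argument, not what remove_and_add_spares() does in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Reduced, hypothetical device record: raid_disk == -1 means "not in a slot". */
struct rdev { int raid_disk; bool faulty; };

/*
 * Activate spare devices. If `only` is NULL every eligible device is
 * considered; otherwise just the device that `only` points at.
 * Returns the number of devices that were given a slot.
 */
int add_spares(struct rdev *devs, size_t n, const struct rdev *only,
               int first_free_slot)
{
    int added = 0;

    assert(devs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        struct rdev *r = &devs[i];

        if (only && r != only)
            continue;               /* caller asked for one specific device */
        if (r->raid_disk >= 0 || r->faulty)
            continue;               /* already active, or unusable */
        r->raid_disk = first_free_slot + added;
        added++;
    }
    return added;
}
```

Passing `NULL` keeps the old scan-everything behaviour, while passing a device pointer restricts activation to that device, which is the shape of change the commit message describes.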
    • G
      md-cluster: Use a small window for resync · c40f341f
      Goldwyn Rodrigues committed
      Suspending the entire device for resync could take too long. Resync
      in small chunks.
      
      The cluster's resync window (32M) is maintained in r1conf as
      cluster_sync_low and cluster_sync_high and processed in
      raid1's sync_request(). If the current resync is outside the cluster
      resync window:
      
      1. Set the cluster_sync_low to curr_resync_completed.
      2. Check if the sync will fit in the new window, if not issue a
         wait_barrier() and set cluster_sync_low to sector_nr.
      3. Set cluster_sync_high to cluster_sync_low + resync_window.
      4. Send a message to all nodes so they may add it in their suspension
         list.
      
      bitmap_cond_end_sync is modified to allow forcing a sync in order
      to bring curr_resync_completed up to date with the sector passed.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      c40f341f
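The four window-moving steps listed in this commit message can be sketched as a pure function. This is a simplified model, not raid1's sync_request(): `struct sync_window`, `advance_window()` and the `messages_sent` counter are hypothetical, the 32M window is assumed to mean 32 MiB expressed in 512-byte sectors, and the wait_barrier() call is represented only by a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sector_t;

/* Assumption: 32 MiB window in 512-byte sectors (65536 sectors). */
#define WINDOW_SECTORS ((sector_t)(32u * 1024 * 1024 / 512))

struct sync_window {
    sector_t low, high;   /* models cluster_sync_low / cluster_sync_high */
    int messages_sent;    /* stands in for the message sent to other nodes */
};

/*
 * Move the window if [sector_nr, sector_nr + nr_sectors) falls outside it.
 * `completed` models curr_resync_completed. Returns true if it moved.
 */
bool advance_window(struct sync_window *w, sector_t completed,
                    sector_t sector_nr, sector_t nr_sectors)
{
    assert(completed <= sector_nr);
    if (sector_nr >= w->low && sector_nr + nr_sectors <= w->high)
        return false;                         /* still inside the window */

    w->low = completed;                       /* step 1 */
    if (sector_nr + nr_sectors > w->low + WINDOW_SECTORS)
        w->low = sector_nr;                   /* step 2 (after wait_barrier()) */
    w->high = w->low + WINDOW_SECTORS;        /* step 3 */
    w->messages_sent++;                       /* step 4 */
    return true;
}
```

Only the sectors between `low` and `high` would need to be suspended on the other nodes, which is why a small window is cheaper than suspending the whole device.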
    • G
      md: Increment version for clustered bitmaps · 3c462c88
      Goldwyn Rodrigues committed
      Add BITMAP_MAJOR_CLUSTERED as 5, in order to prevent older kernels
      from assembling a clustered device.
      
      In order to maximize compatibility, the major version is set to
      BITMAP_MAJOR_CLUSTERED *only* if the bitmap is clustered.
      
      Added MD_FEATURE_CLUSTERED in order to return error for older
      kernels which would assemble MD even if the bitmap is corrupted.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      3c462c88
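The version gating described here can be sketched as two small helpers. The value 5 for BITMAP_MAJOR_CLUSTERED comes from the commit message; the values 3 and 4 for the older majors are an assumption about the existing bitmap versions, and `bitmap_major_to_write()` and `bitmap_major_acceptable()` are hypothetical helper names, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

enum {
    BITMAP_MAJOR_LO = 3,         /* assumed lowest accepted major */
    BITMAP_MAJOR_HI = 4,         /* assumed highest pre-cluster major */
    BITMAP_MAJOR_CLUSTERED = 5,  /* from the commit message */
};

/* Only clustered bitmaps get the new major, to maximize compatibility. */
int bitmap_major_to_write(bool clustered)
{
    return clustered ? BITMAP_MAJOR_CLUSTERED : BITMAP_MAJOR_HI;
}

/* A kernel accepts a bitmap only if its major is within the range it knows. */
bool bitmap_major_acceptable(int on_disk, int highest_supported)
{
    assert(highest_supported >= BITMAP_MAJOR_LO);
    return on_disk >= BITMAP_MAJOR_LO && on_disk <= highest_supported;
}
```

In this model a kernel whose highest known major is 4 rejects a clustered bitmap written with major 5, while non-clustered arrays keep major 4 and remain readable by it.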
  9. 02 Oct, 2015 2 commits
    • S
      md: clear CHANGE_PENDING in readonly array · d4929add
      Shaohua Li committed
      If an array has more faulty disks than its allowed degraded number, the
      array enters error handling. It will be marked as read-only with
      MD_CHANGE_PENDING/RECOVERY_NEEDED set. But currently recovery doesn't
      clear the CHANGE_PENDING bit for a read-only array.  If MD_CHANGE_PENDING
      is set for a raid5 array, all returned IO will be held on a list until the
      bit is cleared. But recovery never clears this bit, so the IO stays in the
      pending state and never finishes. This has bad effects: the upper layer
      can't get an IO error and the array can't be stopped.
      
      Fixes: c3cce6cd ("md/raid5: ensure device failure recorded before write request returns.")
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      d4929add
    • N
      md: wait for pending superblock updates before switching to read-only · 88724bfa
      NeilBrown committed
      If a superblock update is pending, wait for it to complete before
      letting md_set_readonly() switch to readonly.
      Otherwise we might lose important information about a device having
      failed.
      
      For external arrays, waiting for superblock updates can wait on
      user-space, so in that case, just return an error.
      Reported-and-tested-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      88724bfa
  10. 01 Sep, 2015 9 commits
    • N
      md/raid1: ensure device failure recorded before write request returns. · 55ce74d4
      NeilBrown committed
      When a write to one of the legs of a RAID1 fails, the failure is
      recorded in the metadata of the other leg(s) so that after a restart
      the data on the failed drive won't be trusted even if that drive seems
      to be working again (maybe a cable was unplugged).
      
      Similarly when we record a bad-block in response to a write failure,
      we must not let the write complete until the bad-block update is safe.
      
      Currently there is no interlock between the write request completing
      and the metadata update.  So it is possible that the write will
      complete, the app will confirm success in some way, and then the
      machine will crash before the metadata update completes.
      
      This is an extremely small hole for a race to fit in, but it is
      theoretically possible and so should be closed.
      
      So:
       - set MD_CHANGE_PENDING when requesting a metadata update for a
         failed device, so we can know with certainty when it completes
       - queue requests that experienced an error on a new queue which
         is only processed after the metadata update completes
       - call raid_end_bio_io() on bios in that queue when the time comes.
      Signed-off-by: NeilBrown <neilb@suse.com>
      55ce74d4
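The interlock in this commit (hold failed writes back until the metadata update has completed) can be modelled with a simple singly linked queue. This is a user-space sketch under stated assumptions: `struct request`, `struct array`, `write_failed()` and `metadata_update_done()` are hypothetical, and setting `completed` stands in for calling raid_end_bio_io() on a bio.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct request { bool completed; struct request *next; };

struct array {
    bool change_pending;    /* models MD_CHANGE_PENDING */
    struct request *held;   /* writes waiting for the metadata update */
};

/* A write hit an error: request a metadata update and hold the write back. */
void write_failed(struct array *a, struct request *r)
{
    assert(a && r);
    a->change_pending = true;
    r->next = a->held;
    a->held = r;
}

/* The metadata update is on disk: now the held writes may complete. */
int metadata_update_done(struct array *a)
{
    int released = 0;

    a->change_pending = false;
    while (a->held) {
        struct request *r = a->held;

        a->held = r->next;
        r->next = NULL;
        r->completed = true;   /* stands in for raid_end_bio_io() */
        released++;
    }
    return released;
}
```

Because no held request is marked complete before `metadata_update_done()` runs, an application in this model cannot observe a successful write while the record of the device failure is still unwritten.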
    • N
      md: extend spinlock protection in register_md_cluster_operations · 6022e75b
      NeilBrown committed
      This code looks racy.
      
      The only possible race is if two modules try to register at the same
      time and that won't happen.  But make the code look safe anyway.
      Signed-off-by: NeilBrown <neilb@suse.com>
      6022e75b
    • G
      md-cluster: transfer the resync ownership to another node · dc737d7c
      Guoqing Jiang committed
      When node A stops an array while the array is doing a resync, we need
      to let another node B take over the resync task.
      
      To achieve this, node A must send an explicit BITMAP_NEEDS_SYNC
      message to the cluster. Node B, on receiving that message, will
      invoke __recover_slot to do the resync.
      Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      dc737d7c
    • S
      md: setup safemode_timer before it's being used · 25b2edfa
      Sasha Levin committed
      We used to set up the safemode_timer timer in md_run. If md_run
      would fail before the timer was set up we'd end up trying to modify
      a timer that doesn't have a callback function when we access safe_delay_store,
      which would trigger a BUG.
      
      neilb: delete init_timer() call as setup_timer() does that.
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      25b2edfa
    • N
      md: sync sync_completed has correct value as recovery finishes. · 5ed1df2e
      NeilBrown committed
      There can be a small window between the moment that recovery
      actually writes the last block and the time when various sysfs
      and /proc/mdstat attributes report that it has finished.
      During this time, 'sync_completed' can have the wrong value.
      This can confuse monitoring software.
      
      So:
       - don't set curr_resync_completed beyond the end of the devices,
       - set it correctly when resync/recovery has completed.
      Signed-off-by: NeilBrown <neilb@suse.com>
      5ed1df2e
    • N
      md: be careful when testing resync_max against curr_resync_completed. · c5e19d90
      NeilBrown committed
      While it generally shouldn't happen, it is not impossible for
      curr_resync_completed to exceed resync_max.
      This can particularly happen when reshaping RAID5 - the current
      status isn't copied to curr_resync_completed promptly, so when it
      is, it can exceed resync_max.
      This happens when the reshape is 'frozen', resync_max is set low,
      and reshape is re-enabled.
      
      Taking a difference between two unsigned numbers is always dangerous
      anyway, so add a test to behave correctly if
         curr_resync_completed > resync_max
      Signed-off-by: NeilBrown <neilb@suse.com>
      c5e19d90
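The hazard named in this commit, subtracting two unsigned values when the second may be larger, is easy to show in isolation. The helpers below are hypothetical (`remaining_naive()` and `remaining_checked()` are not kernel functions) and `sector_t` is assumed to be a 64-bit unsigned type. They only illustrate why the extra comparison is needed.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sector_t;

/* Unsafe: wraps around to a huge value when completed > resync_max. */
sector_t remaining_naive(sector_t resync_max, sector_t completed)
{
    return resync_max - completed;
}

/* Safe: test the ordering first, then subtract. */
sector_t remaining_checked(sector_t resync_max, sector_t completed)
{
    sector_t left;

    if (completed > resync_max)
        return 0;                 /* already past the limit */
    left = resync_max - completed;
    assert(left <= resync_max);   /* no wrap-around can have happened */
    return left;
}
```

With `resync_max = 100` and `completed = 150`, the naive version yields a value close to the maximum of the type, while the checked version reports that nothing remains.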
    • N
      md: set MD_RECOVERY_RECOVER when starting a degraded array. · a4a3d26d
      NeilBrown committed
      This ensures that 'sync_action' will show 'recover' immediately the
      array is started.  If there is no spare the status will change to
      'idle' once that is detected.
      
      Clear MD_RECOVERY_RECOVER for a read-only array to ensure this change
      happens.
      
      This allows scripts which monitor status not to get confused -
      particularly my test scripts.
      Signed-off-by: NeilBrown <neilb@suse.com>
      a4a3d26d
    • N
      md: close some races between setting and checking sync_action. · 985ca973
      NeilBrown committed
      When checking sync_action in a script, we want to be sure it is
      as accurate as possible.
      As resync/reshape etc doesn't always start immediately (a separate
      thread is scheduled to do it), it is best if 'action_show'
      checks if MD_RECOVERY_NEEDED is set (which it does) and in that
      case reports what is likely to start soon (which it only sometimes
      does).
      
      So:
       - report 'reshape' if reshape_position suggests one might start.
       - set MD_RECOVERY_RECOVER in raid1_reshape(), because that is very
         likely to happen next.
      Signed-off-by: NeilBrown <neilb@suse.com>
      985ca973
    • N
      md: Keep /proc/mdstat reporting recovery until fully DONE. · f7851be7
      NeilBrown committed
      Currently when a recovery completes, mdstat shows that it has finished
      before the new device is marked as a full member.  Because of this it
      can appear to a script that the recovery finished but the array isn't
      in sync.
      
      So while MD_RECOVERY_DONE is still set, keep mdstat reporting "recovery".
      Once md_reap_sync_thread() completes, the spare will be active and then
      MD_RECOVERY_DONE will be cleared.
      
      To ensure this is race-free, set MD_RECOVERY_DONE before clearing
      curr_resync.
      Signed-off-by: NeilBrown <neilb@suse.com>
      f7851be7
  11. 14 Aug, 2015 1 commit
    • K
      block: kill merge_bvec_fn() completely · 8ae12666
      Kent Overstreet committed
      As generic_make_request() is now able to handle arbitrarily sized bios,
      it's no longer necessary for each individual block driver to define its
      own ->merge_bvec_fn() callback. Remove every invocation completely.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: drbd-user@lists.linbit.com
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@kernel.org>
      Cc: ceph-devel@vger.kernel.org
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Neil Brown <neilb@suse.de>
      Cc: linux-raid@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits)
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
      [dpark: also remove ->merge_bvec_fn() in dm-thin as well as
       dm-era-target, and resolve merge conflicts]
      Signed-off-by: Dongsu Park <dpark@posteo.net>
      Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      8ae12666