1. 06 Jan, 2016 (2 commits)
  2. 21 Dec, 2015 (1 commit)
    • md: remove check for MD_RECOVERY_NEEDED in action_store. · 312045ee
      Committed by NeilBrown
      md currently doesn't allow a 'sync_action' such as 'reshape' to be set
      while MD_RECOVERY_NEEDED is set.
      
      This is a problem, particularly since commit 738a2738 as that can
      cause ->check_reshape to call mddev_resume() which sets
      MD_RECOVERY_NEEDED.  So by the time we come to start 'reshape' it is
      very likely that MD_RECOVERY_NEEDED is still set.
      
      Testing for this flag is not really needed and is in any case very
      racy as it can be set at any moment - asynchronously.  Any race
      between setting a sync_action and setting MD_RECOVERY_NEEDED must
      already be handled properly in some locked code, probably
      md_check_recovery(), so remove the test here.
      
      The test on MD_RECOVERY_RUNNING is also racy in the 'reshape' case
      so we should test it again after getting mddev_lock().
      
      As this fixes a race and a regression which can cause 'reshape' to
      fail, it is suitable for -stable kernels since 4.1.
      Reported-by: Xiao Ni <xni@redhat.com>
      Fixes: 738a2738 ("md/raid5: fix allocation of 'scribble' array.")
      Cc: stable@vger.kernel.org (v4.1+)
      Signed-off-by: NeilBrown <neilb@suse.com>
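
      A minimal user-space sketch of the locking pattern this change relies on: a
      check of an asynchronously set flag made outside the lock is only advisory,
      and the authoritative test has to happen after taking the lock (mddev_lock()
      in md). The struct, flag and function names below are illustrative, not the
      kernel's.

        /* Illustrative sketch, not kernel code. */
        #include <pthread.h>
        #include <stdbool.h>

        #define RECOVERY_RUNNING (1UL << 0)

        struct dev_state {
            pthread_mutex_t lock;   /* stands in for mddev_lock() */
            unsigned long   flags;  /* set/cleared asynchronously elsewhere */
        };

        /* Returns true if the action was started, false if recovery is running. */
        static bool start_action(struct dev_state *s)
        {
            pthread_mutex_lock(&s->lock);
            if (s->flags & RECOVERY_RUNNING) {   /* authoritative test, under the lock */
                pthread_mutex_unlock(&s->lock);
                return false;
            }
            /* ...start the reshape/resync here... */
            pthread_mutex_unlock(&s->lock);
            return true;
        }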
  3. 18 Dec, 2015 (3 commits)
    • Fix remove_and_add_spares removes drive added as spare in slot_store · cb01c549
      Committed by Goldwyn Rodrigues
      Commit 2910ff17 introduced a regression which would remove a recently
      added spare via slot_store. Revert the part of the patch which touches
      slot_store() and add the disk directly using pers->hot_add_disk().
      
      Fixes: 2910ff17 ("md: remove_and_add_spares() to activate specific rdev")
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md: fix bug due to nested suspend · 0dc10e50
      Committed by Mikulas Patocka
      The patch c7bfced9, committed to 4.4-rc, causes a crash in the LVM test
      shell/lvchange-raid.sh. The kernel crashes with the BUG below; the
      reason is that we attempt to suspend a device that is already
      suspended. See also
      https://bugzilla.redhat.com/show_bug.cgi?id=1283491
      
      This patch fixes the bug by changing functions mddev_suspend and
      mddev_resume to always nest.
      The number of nested suspend calls is kept in the
      variable mddev->suspended.
      [neilb: made mddev_suspend() always nest instead of introduce mddev_nested_suspend]
      
      kernel BUG at drivers/md/md.c:317!
      CPU: 3 PID: 32754 Comm: lvm Not tainted 4.4.0-rc2 #1
      task: 0000000047076040 ti: 0000000047014000 task.ti: 0000000047014000
      
           YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
      PSW: 00001000000001000000000000001111 Not tainted
      r00-03  000000000804000f 00000000102c5280 0000000010c7522c 000000007e3d1810
      r04-07  0000000010c6f000 000000004ef37f20 000000007e3d1dd0 000000007e3d1810
      r08-11  000000007c9f1600 0000000000000000 0000000000000001 ffffffffffffffff
      r12-15  0000000010c1d000 0000000000000041 00000000f98d63c8 00000000f98e49e4
      r16-19  00000000f98e49e4 00000000c138fd06 00000000f98d63c8 0000000000000001
      r20-23  0000000000000002 000000004ef37f00 00000000000000b0 00000000000001d1
      r24-27  00000000424783a0 000000007e3d1dd0 000000007e3d1810 00000000102b2000
      r28-31  0000000000000001 0000000047014840 0000000047014930 0000000000000001
      sr00-03  0000000007040800 0000000000000000 0000000000000000 0000000007040800
      sr04-07  0000000000000000 0000000000000000 0000000000000000 0000000000000000
      
      IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000102c538c 00000000102c5390
       IIR: 03ffe01f    ISR: 0000000000000000  IOR: 00000000102b2748
       CPU:        3   CR30: 0000000047014000 CR31: 0000000000000000
       ORIG_R28: 00000000000000b0
       IAOQ[0]: mddev_suspend+0x10c/0x160 [md_mod]
       IAOQ[1]: mddev_suspend+0x110/0x160 [md_mod]
       RP(r2): raid1_add_disk+0xd4/0x2c0 [raid1]
      Backtrace:
       [<0000000010c7522c>] raid1_add_disk+0xd4/0x2c0 [raid1]
       [<0000000010c20078>] raid_resume+0x390/0x418 [dm_raid]
       [<00000000105833e8>] dm_table_resume_targets+0xc0/0x188 [dm_mod]
       [<000000001057f784>] dm_resume+0x144/0x1e0 [dm_mod]
       [<0000000010587dd4>] dev_suspend+0x1e4/0x568 [dm_mod]
       [<0000000010589278>] ctl_ioctl+0x1e8/0x428 [dm_mod]
       [<0000000010589518>] dm_compat_ctl_ioctl+0x18/0x68 [dm_mod]
       [<0000000040377b88>] compat_SyS_ioctl+0xd0/0x1558
      
      Fixes: c7bfced9 ("md: suspend i/o during runtime blk_integrity_unregister")
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
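
      A simplified user-space sketch of the nesting idea, assuming a plain depth
      counter plays the role of mddev->suspended: only the outermost suspend
      quiesces I/O and only the matching outermost resume restarts it, so nested
      callers no longer trip the BUG. The names are illustrative, not the md API.

        #include <assert.h>

        struct dev {
            int suspend_depth;   /* plays the role of mddev->suspended */
        };

        static void dev_suspend(struct dev *d)
        {
            if (d->suspend_depth++ == 0) {
                /* outermost suspend: actually quiesce and drain I/O */
            }
            /* nested calls just deepen the count instead of hitting a BUG */
        }

        static void dev_resume(struct dev *d)
        {
            assert(d->suspend_depth > 0);   /* resumes must be balanced */
            if (--d->suspend_depth == 0) {
                /* outermost resume: restart I/O and kick off recovery checks */
            }
        }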
    • MD: change journal disk role to disk 0 · 9b15603d
      Committed by Shaohua Li
      Neil pointed out that setting the journal disk role to raid_disks will
      confuse reshape if we ever support reshape. Switch the role to 0 (any
      value >= 0 is fine) and skip the sysfs file creation to avoid errors.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
  4. 08 Nov, 2015 (1 commit)
  5. 01 Nov, 2015 (8 commits)
  6. 31 Oct, 2015 (1 commit)
    • Revert "md: allow a partially recovered device to be hot-added to an array." · d01552a7
      Committed by NeilBrown
      This reverts commit 7eb41885.
      
      This commit is poorly justified: I can find no discussion of it in
      email, and it clearly causes a problem.
      
      If a device which is being recovered fails and is subsequently
      re-added to an array, there could easily have been changes to the
      array *before* the point where the recovery was up to.  So the
      recovery must start again from the beginning.
      
      If a spare is being recovered and fails, then when it is re-added we
      really should do a bitmap-based recovery up to the recovery-offset,
      and then a full recovery from there.  Before this reversion, we only
      did the "full recovery from there", which is not correct.  After this
      reversion we will do a full recovery from the start, which is safer
      but not ideal.
      
      It will be left to a future patch to arrange the two different styles
      of recovery.
      Reported-and-tested-by: Nate Dailey <nate.dailey@stratus.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      Cc: stable@vger.kernel.org (3.14+)
      Fixes: 7eb41885 ("md: allow a partially recovered device to be hot-added to an array.")
  7. 24 Oct, 2015 (4 commits)
  8. 22 Oct, 2015 (3 commits)
  9. 13 Oct, 2015 (2 commits)
  10. 12 Oct, 2015 (7 commits)
    • md-cluster: Fix adding of new disk with new reload code · dbb64f86
      Committed by Goldwyn Rodrigues
      Adding the disk worked incorrectly with the new reload code. Fix it:
      
       - No operation should be performed on an rdev marked as Candidate.
       - After a metadata update operation, kick the disk if its role is 0xfffe,
         else clear the Candidate bit and continue with the regular change check.
       - Save the mode of the lock resource to check whether the token lock is
         already held, because it can be taken twice while adding a disk.
         However, unlock_comm() must be called only once.
       - add_new_disk() is called by the node initiating the --add operation.
         If it needs to be canceled, call add_new_disk_cancel(). The operation
         is completed by md_update_sb() which will write and unlock the
         communication.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
    • md-cluster: Perform resync/recovery under a DLM lock · c186b128
      Committed by Goldwyn Rodrigues
      Resync or recovery must be performed by only one node at a time.
      A DLM lock resource, resync_lockres provides the mutual exclusion
      so that only one node performs the recovery/resync at a time.
      
      If a node is unable to get the resync_lockres because recovery is
      being performed by another node, it sets MD_RECOVERY_NEEDED so as
      to schedule recovery in the future.
      
      Remove the debug message in resync_info_update()
      used during development.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
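
      A rough sketch of the "one node at a time" rule, with a pthread trylock
      standing in for the DLM resync_lockres: if the lock is busy, the node records
      that recovery is still needed and retries later rather than blocking. The
      names below are illustrative, not the md-cluster API.

        #include <pthread.h>
        #include <stdbool.h>

        static pthread_mutex_t resync_lockres = PTHREAD_MUTEX_INITIALIZER;
        static bool recovery_needed;   /* stands in for MD_RECOVERY_NEEDED */

        static void try_resync(void)
        {
            if (pthread_mutex_trylock(&resync_lockres) != 0) {
                /* another node owns the resync: defer instead of blocking */
                recovery_needed = true;
                return;
            }
            /* ...perform the resync/recovery... */
            pthread_mutex_unlock(&resync_lockres);
        }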
    • md-cluster: Perform a lazy update · 2aa82191
      Committed by Goldwyn Rodrigues
      In a clustered environment, a change such as marking a device faulty
      can be recorded by any of the nodes. This is communicated to all the
      nodes, and re-recording such a change is unnecessary, and quite often
      pretty disruptive.

      With this patch, just before the update, we check for the changes, and
      if the changes are already in the superblock, we abort the update
      after clearing all the flags.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
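
      A condensed sketch of the lazy-update check, assuming a simple per-device
      faulty array stands in for the superblock state: if the pending change is
      already recorded on disk, clear the pending flags and skip the write. Field
      names here are made up for illustration.

        #include <stdbool.h>

        #define MAX_DEVS 16

        struct array_state {
            bool sb_faulty[MAX_DEVS];       /* what the superblock already records */
            bool pending_faulty[MAX_DEVS];  /* change reported by another node */
            unsigned long change_flags;     /* pending-update flags */
        };

        /* Return true only if the superblock actually needs to be re-written. */
        static bool sb_update_needed(struct array_state *a, int ndevs)
        {
            for (int i = 0; i < ndevs; i++)
                if (a->pending_faulty[i] != a->sb_faulty[i])
                    return true;            /* something new to record */

            a->change_flags = 0;            /* already recorded: abort the update */
            return false;
        }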
    • md-cluster: Improve md_reload_sb to be less error prone · 70bcecdb
      Committed by Goldwyn Rodrigues
      md_reload_sb is too simplistic: it needs to explicitly determine the
      changes made by the writing node, and there are multiple areas
      where a simple reload could fail.
      
      Instead, read the superblock of one of the "good" rdevs and update
      the necessary information:
      
      - read the superblock into a newly allocated page, by temporarily
        swapping out rdev->sb_page and calling ->load_super.
      - if that fails, return
      - if it succeeds, call check_sb_changes, which:
        1. iterates over the list of active devices and checks the matching
           dev_roles[] value.
           If that is 'faulty', the device must be marked as faulty:
            - call md_error to mark the device as faulty. Make sure
              not to set CHANGE_DEVS and wake up mddev->thread, or else
              it would initiate a resync process, which is the responsibility
              of the "primary" node.
            - clear the Blocked bit
            - call remove_and_add_spares() to hot remove the device.
           If the device is 'spare':
            - call remove_and_add_spares() to get the number of spares
              added in this operation.
            - reduce mddev->degraded to mark the array as not degraded.
        2. reset recovery_cp
      - read the rest of the rdevs to update recovery_offset. If recovery_offset
        is equal to MaxSector, call spare_active() to set it In_sync.
      
      This required that recovery_offset be initialized to MaxSector, as
      opposed to zero, so as to communicate the end of sync for an rdev.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
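
      A rough sketch of the role handling the list above describes, using
      placeholder names (ROLE_FAULTY, apply_remote_roles) rather than the
      md-cluster API: the receiving node mirrors the writing node's state but never
      starts a resync of its own.

        #define ROLE_FAULTY 0xfffe

        struct rdev_view {
            int remote_role;      /* dev_roles[] value written by the other node */
            int is_local_spare;   /* this node still sees the device as a spare */
        };

        static void apply_remote_roles(struct rdev_view *devs, int n, int *degraded)
        {
            for (int i = 0; i < n; i++) {
                if (devs[i].remote_role == ROLE_FAULTY) {
                    /* mark faulty, clear Blocked and hot-remove it; do not set
                     * CHANGE_DEVS or wake the md thread - resync is the
                     * "primary" node's responsibility */
                } else if (devs[i].is_local_spare) {
                    /* the writing node activated this spare: activate it here
                     * too and reduce the degraded count accordingly */
                    (*degraded)--;
                }
            }
        }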
    • md: remove_and_add_spares() to activate specific rdev · 2910ff17
      Committed by Goldwyn Rodrigues
      remove_and_add_spares() checks all devices when activating spares.
      Change it to activate a specific device if a non-NULL rdev
      argument is passed.
      
      remove_and_add_spares() can be used to activate spares in
      slot_store() as well.
      
      For hot_remove_disk(), check if rdev->raid_disk == -1 before
      calling remove_and_add_spares().
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
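
      A small sketch of the interface change, under the assumption that a NULL
      argument keeps the old "scan everything" behaviour while a non-NULL argument
      restricts the work to that one device. The struct and function names are
      illustrative only.

        struct rdev_s  { int raid_disk; int is_spare; };
        struct array_s { struct rdev_s *devs; int ndevs; };

        /* Activate spares; if 'which' is non-NULL, only consider that device. */
        static int add_spares(struct array_s *a, struct rdev_s *which)
        {
            int activated = 0;

            for (int i = 0; i < a->ndevs; i++) {
                struct rdev_s *r = &a->devs[i];

                if (which && r != which)
                    continue;            /* caller asked for one specific rdev */
                if (r->is_spare && r->raid_disk < 0) {
                    r->raid_disk = i;    /* ...hot-add into the array... */
                    activated++;
                }
            }
            return activated;
        }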
    • md-cluster: Use a small window for resync · c40f341f
      Committed by Goldwyn Rodrigues
      Suspending the entire device for resync could take too long. Resync
      in small chunks.
      
      The cluster's resync window (32M) is maintained in r1conf as
      cluster_sync_low and cluster_sync_high and processed in
      raid1's sync_request(). If the current resync is outside the cluster
      resync window:
      
      1. Set the cluster_sync_low to curr_resync_completed.
      2. Check if the sync will fit in the new window, if not issue a
         wait_barrier() and set cluster_sync_low to sector_nr.
      3. Set cluster_sync_high to cluster_sync_low + resync_window.
      4. Send a message to all nodes so they may add it in their suspension
         list.
      
      bitmap_cond_end_sync is modified to allow forcing a sync, in order
      to get curr_resync_completed up to date with the sector passed in.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
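
      A sketch of the sliding-window bookkeeping described in steps 1-4 above,
      with illustrative names for the window fields; the wait_barrier() call and
      the cluster message are reduced to comments and the exact raid1 code will
      differ.

        typedef unsigned long long sector_t;

        /* 32M window expressed in 512-byte sectors */
        #define RESYNC_WINDOW_SECTORS (32ULL * 1024 * 1024 / 512)

        struct sync_window { sector_t low, high; };  /* cluster_sync_low/high */

        static void advance_window(struct sync_window *w,
                                   sector_t curr_resync_completed,
                                   sector_t sector_nr)
        {
            if (sector_nr < w->high)
                return;                             /* still inside the window */

            w->low = curr_resync_completed;         /* step 1 */
            if (sector_nr >= w->low + RESYNC_WINDOW_SECTORS) {
                /* step 2: would not fit - wait_barrier() and restart the window */
                w->low = sector_nr;
            }
            w->high = w->low + RESYNC_WINDOW_SECTORS;   /* step 3 */
            /* step 4: send (low, high) to all nodes for their suspension lists */
        }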
    • md: Increment version for clustered bitmaps · 3c462c88
      Committed by Goldwyn Rodrigues
      Add BITMAP_MAJOR_CLUSTERED as 5, in order to prevent older kernels
      from assembling a clustered device.
      
      In order to maximize compatibility, the major version is set to
      BITMAP_MAJOR_CLUSTERED *only* if the bitmap is clustered.
      
      Added MD_FEATURE_CLUSTERED in order to return an error for older
      kernels which would assemble MD even if the bitmap is corrupted.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
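
      A tiny sketch of the compatibility gate this creates: a kernel that does not
      know the new major version sees an out-of-range value and refuses to assemble
      the array. BITMAP_MAJOR_HI below is assumed to be the highest non-clustered
      version the kernel accepts; the check itself is a simplified stand-in.

        #include <errno.h>

        #define BITMAP_MAJOR_HI        4
        #define BITMAP_MAJOR_CLUSTERED 5   /* only used when the bitmap is clustered */

        static int check_bitmap_version(int major, int kernel_supports_clustered)
        {
            if (major == BITMAP_MAJOR_CLUSTERED)
                return kernel_supports_clustered ? 0 : -EINVAL;  /* old kernel: reject */
            return major <= BITMAP_MAJOR_HI ? 0 : -EINVAL;
        }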
  11. 02 Oct, 2015 (2 commits)
    • md: clear CHANGE_PENDING in readonly array · d4929add
      Committed by Shaohua Li
      If the number of faulty disks in an array exceeds the allowed degraded
      count, the array enters error handling. It will be marked read-only
      with MD_CHANGE_PENDING/RECOVERY_NEEDED set. But currently recovery doesn't
      clear the CHANGE_PENDING bit for a read-only array.  If MD_CHANGE_PENDING is
      set for a raid5 array, all returned IO will be held on a list till the
      bit is cleared. But recovery never clears this bit, so the IO stays in
      the pending state and never finishes. This has bad effects: the upper layer
      can't get an IO error and the array can't be stopped.
      
      Fixes: c3cce6cd ("md/raid5: ensure device failure recorded before write request returns.")
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md: wait for pending superblock updates before switching to read-only · 88724bfa
      Committed by NeilBrown
      If a superblock update is pending, wait for it to complete before
      letting md_set_readonly() switch to readonly.
      Otherwise we might lose important information about a device having
      failed.
      
      For external arrays, waiting for superblock updates can wait on
      user-space, so in that case, just return an error.
      Reported-and-tested-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
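
      A minimal sketch of the check this adds to the read-only switch, assuming a
      boolean stands in for the pending-superblock-update state: with internal
      metadata we can wait for the write to finish, with external (user-space
      managed) metadata we return an error instead of blocking.

        #include <errno.h>
        #include <stdbool.h>

        static int set_readonly_sketch(bool sb_update_pending, bool external_metadata)
        {
            if (sb_update_pending) {
                if (external_metadata)
                    return -EBUSY;   /* would wait on user space: fail instead */
                /* wait here until the pending superblock write completes */
            }
            /* ...now it is safe to switch the array to read-only... */
            return 0;
        }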
  12. 01 Sep, 2015 (6 commits)
    • md/raid1: ensure device failure recorded before write request returns. · 55ce74d4
      Committed by NeilBrown
      When a write to one of the legs of a RAID1 fails, the failure is
      recorded in the metadata of the other leg(s) so that after a restart
      the data on the failed drive won't be trusted even if that drive seems
      to be working again (maybe a cable was unplugged).
      
      Similarly when we record a bad-block in response to a write failure,
      we must not let the write complete until the bad-block update is safe.
      
      Currently there is no interlock between the write request completing
      and the metadata update.  So it is possible that the write will
      complete, the app will confirm success in some way, and then the
      machine will crash before the metadata update completes.
      
      This is an extremely small hole for a race to fit in, but it is
      theoretically possible and so should be closed.
      
      So:
       - set MD_CHANGE_PENDING when requesting a metadata update for a
         failed device, so we can know with certainty when it completes
       - queue requests that experienced an error on a new queue which
         is only processed after the metadata update completes
       - call raid_end_bio_io() on bios in that queue when the time comes.
      Signed-off-by: NeilBrown <neilb@suse.com>
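
      A user-space sketch of the interlock the three bullets describe, with a
      singly linked list standing in for the bio queue: failed writes are parked
      until the metadata update is known to be safe, and only then completed.
      None of the names below belong to the raid1 code.

        struct pending_bio { struct pending_bio *next; /* the failed write */ };

        struct r1_state {
            int change_pending;                  /* metadata update in flight */
            struct pending_bio *bio_end_io_list; /* writes waiting on that update */
        };

        static void park_failed_write(struct r1_state *s, struct pending_bio *b)
        {
            /* don't complete the write yet: queue it behind the metadata update */
            b->next = s->bio_end_io_list;
            s->bio_end_io_list = b;
        }

        static void metadata_update_done(struct r1_state *s)
        {
            s->change_pending = 0;
            while (s->bio_end_io_list) {
                struct pending_bio *b = s->bio_end_io_list;

                s->bio_end_io_list = b->next;
                /* now it is safe to complete the write back to the caller */
            }
        }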
    • md: extend spinlock protection in register_md_cluster_operations · 6022e75b
      Committed by NeilBrown
      This code looks racy.
      
      The only possible race is if two modules try to register at the same
      time and that won't happen.  But make the code look safe anyway.
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md-cluster: transfer the resync ownership to another node · dc737d7c
      Committed by Guoqing Jiang
      When node A stops an array while the array is doing a resync, we need
      to let another node B take over the resync task.
      
      To achieve this, node A sends an explicit BITMAP_NEEDS_SYNC
      message to the cluster, and the node B which receives that message
      will invoke __recover_slot to do the resync.
      Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md: setup safemode_timer before it's being used · 25b2edfa
      Committed by Sasha Levin
      We used to set up the safemode_timer in md_run. If md_run failed
      before the timer was set up, we'd end up trying to modify a timer
      that doesn't have a callback function when we access safe_delay_store,
      which would trigger a BUG.
      
      neilb: delete init_timer() call as setup_timer() does that.
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
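
      A generic sketch of the ordering fix, assuming a plain function-pointer
      "timer" object rather than the kernel timer API: wire up the callback when
      the owning object is allocated, so an early failure in the run path (or a
      sysfs write such as safe_delay_store) can never touch an uninitialised timer.

        #include <stdlib.h>

        struct sketch_timer { void (*fn)(void *data); void *data; };
        struct sketch_dev   { struct sketch_timer safemode_timer; /* ... */ };

        static void safemode_timeout(void *data)
        {
            (void)data;   /* ...flush pending metadata, mark array safe... */
        }

        static struct sketch_dev *dev_alloc(void)
        {
            struct sketch_dev *d = calloc(1, sizeof(*d));

            if (!d)
                return NULL;
            /* initialise the timer here, at allocation time, so every later
             * path finds a valid callback even if the run path fails early */
            d->safemode_timer.fn   = safemode_timeout;
            d->safemode_timer.data = d;
            return d;
        }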
    • md: sync sync_completed has correct value as recovery finishes. · 5ed1df2e
      Committed by NeilBrown
      There can be a small window between the moment that recovery
      actually writes the last block and the time when various sysfs
      and /proc/mdstat attributes report that it has finished.
      During this time, 'sync_completed' can have the wrong value.
      This can confuse monitoring software.
      
      So:
       - don't set curr_resync_completed beyond the end of the devices,
       - set it correctly when resync/recovery has completed.
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md: be careful when testing resync_max against curr_resync_completed. · c5e19d90
      Committed by NeilBrown
      While it generally shouldn't happen, it is not impossible for
      curr_resync_completed to exceed resync_max.
      This can particularly happen when reshaping RAID5 - the current
      status isn't copied to curr_resync_completed promptly, so when it
      is, it can exceed resync_max.
      This happens when the reshape is 'frozen', resync_max is set low,
      and reshape is re-enabled.
      
      Taking a difference between two unsigned numbers is always dangerous
      anyway, so add a test to behave correctly if
         curr_resync_completed > resync_max
      Signed-off-by: NeilBrown <neilb@suse.com>
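
      A standalone illustration of the unsigned-arithmetic hazard mentioned above
      (the values are made up, not taken from the code): once curr_resync_completed
      creeps past resync_max, the naive difference wraps to a huge number, which is
      why the guarded form is needed.

        #include <stdio.h>

        int main(void)
        {
            unsigned long long resync_max = 1000, curr_resync_completed = 1010;

            unsigned long long naive = resync_max - curr_resync_completed; /* wraps */
            unsigned long long safe  = curr_resync_completed < resync_max
                                     ? resync_max - curr_resync_completed : 0;

            printf("naive=%llu safe=%llu\n", naive, safe); /* naive is close to 2^64 */
            return 0;
        }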