1. 01 9月, 2015 16 次提交
    • G
      md-cluster: transfer the resync ownership to another node · dc737d7c
      Guoqing Jiang 提交于
      When node A stops an array while the array is doing a resync, we need
      to let another node B take over the resync task.
      
      To achieve the goal, we need the A send an explicit BITMAP_NEEDS_SYNC
      message to the cluster. And the node B which received that message will
      invoke __recover_slot to do resync.
      Reviewed-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      dc737d7c
    • G
      md-cluster: split recover_slot for future code reuse · 05cd0e51
      Guoqing Jiang 提交于
      Make recover_slot as a wraper to __recover_slot, since the
      logic of __recover_slot can be reused for the condition
      when other nodes need to take over the resync job.
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      05cd0e51
    • G
      md-cluster: use %pU to print UUIDs · b89f704a
      Guoqing Jiang 提交于
      Reviewed-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      b89f704a
    • S
      md: setup safemode_timer before it's being used · 25b2edfa
      Sasha Levin 提交于
      We used to set up the safemode_timer timer in md_run. If md_run
      would fail before the timer was set up we'd end up trying to modify
      a timer that doesn't have a callback function when we access safe_delay_store,
      which would trigger a BUG.
      
      neilb: delete init_timer() call as setup_timer() does that.
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      25b2edfa
    • N
      md/raid5: handle possible race as reshape completes. · 6cbd8148
      NeilBrown 提交于
      It is possible (though unlikely) for a reshape to be
      interrupted between the time that end_reshape is called
      and the time when raid5_finish_reshape is called.
      
      This can leave conf->reshape_progress set to MaxSector,
      but mddev->reshape_position not.
      
      This combination confused reshape_request() when ->reshape_backwards.
      As conf->reshape_progress is so high, it seems the reshape hasn't
      really begun.  But assuming MaxSector is a valid address only
      leads to sorrow.
      
      So ensure reshape_position and reshape_progress both agree,
      and add an extra check in reshape_request() just in case they don't.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      6cbd8148
    • N
      md: sync sync_completed has correct value as recovery finishes. · 5ed1df2e
      NeilBrown 提交于
      There can be a small window between the moment that recovery
      actually writes the last block and the time when various sysfs
      and /proc/mdstat attributes report that it has finished.
      During this time, 'sync_completed' can have the wrong value.
      This can confuse monitoring software.
      
      So:
       - don't set curr_resync_completed beyond the end of the devices,
       - set it correctly when resync/recovery has completed.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      5ed1df2e
    • N
      md: be careful when testing resync_max against curr_resync_completed. · c5e19d90
      NeilBrown 提交于
      While it generally shouldn't happen, it is not impossible for
      curr_resync_completed to exceed resync_max.
      This can particularly happen when reshaping RAID5 - the current
      status isn't copied to curr_resync_completed promptly, so when it
      is, it can exceed resync_max.
      This happens when the reshape is 'frozen', resync_max is set low,
      and reshape is re-enabled.
      
      Taking a difference between two unsigned numbers is always dangerous
      anyway, so add a test to behave correctly if
         curr_resync_completed > resync_max
      Signed-off-by: NNeilBrown <neilb@suse.com>
      c5e19d90
    • N
      md: set MD_RECOVERY_RECOVER when starting a degraded array. · a4a3d26d
      NeilBrown 提交于
      This ensures that 'sync_action' will show 'recover' immediately the
      array is started.  If there is no spare the status will change to
      'idle' once that is detected.
      
      Clear MD_RECOVERY_RECOVER for a read-only array to ensure this change
      happens.
      
      This allows scripts which monitor status not to get confused -
      particularly my test scripts.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      a4a3d26d
    • N
      md/raid5: remove incorrect "min_t()" when calculating writepos. · c74c0d76
      NeilBrown 提交于
      This code is calculating:
        writepos, which is the furthest along address (device-space) that we
           *will* be writing to
        readpos, which is the earliest address that we *could* possible read
           from, and
        safepos, which is the earliest address in the 'old' section that we
           might read from after a crash when the reshape position is
           recovered from metadata.
      
        The first is a precise calculation, so clipping at zero doesn't
        make sense.  As the reshape position is now guaranteed to always be
        a multiple of reshape_sectors and as we already BUG_ON when
        reshape_progress is zero, there is no point in this min_t() call.
      
        The readpos and safepos are worst case - actual value depends on
        precise geometry.  That worst case could be negative, which is only
        a problem because we are storing the value in an unsigned.
        So leave the min_t() for those.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      
      c74c0d76
    • N
      md/raid5: strengthen check on reshape_position at run. · 05256d98
      NeilBrown 提交于
      When reshaping, we work in units of the largest chunk size.
      If changing from a larger to a smaller chunk size, that means we
      reshape more than one stripe at a time.  So the required alignment
      of reshape_position needs to take into account both the old
      and new chunk size.
      
      This means that both 'here_new' and 'here_old' are calculated with
      respect to the same (maximum) chunk size, so testing if they are the
      same when delta_disks is zero becomes pointless.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      05256d98
    • N
      md/raid5: switch to use conf->chunk_sectors in place of mddev->chunk_sectors where possible · 3cb5edf4
      NeilBrown 提交于
      The chunk_sectors and new_chunk_sectors fields of mddev can be changed
      any time (via sysfs) that the reconfig mutex can be taken.  So raid5
      keeps internal copies in 'conf' which are stable except for a short
      locked moment when reshape stops/starts.
      
      So any access that does not hold reconfig_mutex should use the 'conf'
      values, not the 'mddev' values.
      Several don't.
      
      This could result in corruption if new values were written at awkward
      times.
      
      Also use min() or max() rather than open-coding.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      3cb5edf4
    • N
      md/raid5: always set conf->prev_chunk_sectors and ->prev_algo · 5cac6bcb
      NeilBrown 提交于
      These aren't really needed when no reshape is happening,
      but it is safer to have them always set to a meaningful value.
      The next patch will use ->prev_chunk_sectors without checking
      if a reshape is happening (because that makes the code simpler),
      and this patch makes that safe.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      5cac6bcb
    • N
      md/raid10: fix a few typos in comments · 02ec5026
      NeilBrown 提交于
      Signed-off-by: NNeilBrown <neilb@suse.com>
      02ec5026
    • N
      md/raid5: consider updating reshape_position at start of reshape. · 92140480
      NeilBrown 提交于
      md/raid5 only updates ->reshape_position (which is stored in
      metadata and is authoritative) occasionally, but particularly
      when getting closed to ->resync_max as it must be correct
      when ->resync_max is reached.
      
      When mdadm tries to stop an array which is reshaping it will:
       - freeze the reshape,
       - set resync_max to where the reshape has reached.
       - unfreeze the reshape.
      When this happens, the reshape is aborted and then restarted.
      
      The restart doesn't check that resync_max is close, and so doesn't
      update ->reshape_position like it should.
      This results in the reshape stopping, but ->reshape_position being
      incorrect.
      
      So on that first call to reshape_request, make sure ->reshape_position
      is updated if needed.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      92140480
    • N
      md: close some races between setting and checking sync_action. · 985ca973
      NeilBrown 提交于
      When checking sync_action in a script, we want to be sure it is
      as accurate as possible.
      As resync/reshape etc doesn't always start immediately (a separate
      thread is scheduled to do it), it is best if 'action_show'
      checks if MD_RECOVER_NEEDED is set (which it does) and in that
      case reports what is likely to start soon (which it only sometimes
      does).
      
      So:
       - report 'reshape' if reshape_position suggests one might start.
       - set MD_RECOVERY_RECOVER in raid1_reshape(), because that is very
         likely to happen next.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      985ca973
    • N
      md: Keep /proc/mdstat reporting recovery until fully DONE. · f7851be7
      NeilBrown 提交于
      Currently when a recovery completes, mdstat shows that it has finished
      before the new device is marked as a full member.  Because of this it
      can appear to a script that the recovery finished but the array isn't
      in sync.
      
      So while MD_RECOVERY_DONE is still set, keep mdstat reporting "recovery".
      Once md_reap_sync_thread() completes, the spare will be active and then
      MD_RECOVERY_DONE will be cleared.
      
      To ensure this is race-free, set MD_RECOVERY_DONE before clearning
      curr_resync.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      f7851be7
  2. 03 8月, 2015 5 次提交
    • N
      md/raid0: update queue parameter in a safer location. · 199dc6ed
      NeilBrown 提交于
      When a (e.g.) RAID5 array is reshaped to RAID0, the updating
      of queue parameters (e.g. max number of sectors per bio) is
      done in the wrong place.
      It should be part of ->run, but it is actually part of ->takeover.
      This means it happens before level_store() calls:
      
      	blk_set_stacking_limits(&mddev->queue->limits);
      
      and so it ineffective.  This can lead to errors from underlying
      devices.
      
      So move all the relevant settings out of create_stripe_zones()
      and into raid0_run().
      
      As this can lead to a bug-on it is suitable for any -stable
      kernel which supports reshape to RAID0.  So 2.6.35 or later.
      As the bug has been present for five years there is no urgency,
      so no need to rush into -stable.
      
      Fixes: 9af204cf ("md: Add support for Raid5->Raid0 and Raid10->Raid0 takeover")
      Cc: stable@vger.kernel.org (v2.6.35+ - please delay until after -final release).
      Reported-by: NYi Zhang <yizhan@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      199dc6ed
    • B
      md: simplify get_bitmap_file now that "file" is zeroed. · 25eafe1a
      Benjamin Randazzo 提交于
      There is no point assigning '\0' to file->pathname[0] as
      file is now zeroed out, so remove that branch and
      simplify the code.
      
      [Original patch combined this with the change to use
       kzalloc.  I split the two so that the change to kzalloc
       is easier to backport. - neilb]
      Signed-off-by: NBenjamin Randazzo <benjamin@randazzo.fr>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      25eafe1a
    • N
      md/raid5: don't let shrink_slab shrink too far. · 49895bcc
      NeilBrown 提交于
      I have a report of drop_one_stripe() called from
      raid5_cache_scan() apparently finding ->max_nr_stripes == 0.
      
      This should not be allowed.
      
      So add a test to keep max_nr_stripes above min_nr_stripes.
      
      Also use a 'mask' rather than a 'mod' in drop_one_stripe
      to ensure 'hash' is valid even if max_nr_stripes does reach zero.
      
      
      Fixes: edbe83ab ("md/raid5: allow the stripe_cache to grow and shrink.")
      Cc: stable@vger.kernel.org (4.1 - please release with 2d5b569b)
      Reported-by: NTomas Papan <tomas.papan@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      49895bcc
    • B
      md: use kzalloc() when bitmap is disabled · b6878d9e
      Benjamin Randazzo 提交于
      In drivers/md/md.c get_bitmap_file() uses kmalloc() for creating a
      mdu_bitmap_file_t called "file".
      
      5769         file = kmalloc(sizeof(*file), GFP_NOIO);
      5770         if (!file)
      5771                 return -ENOMEM;
      
      This structure is copied to user space at the end of the function.
      
      5786         if (err == 0 &&
      5787             copy_to_user(arg, file, sizeof(*file)))
      5788                 err = -EFAULT
      
      But if bitmap is disabled only the first byte of "file" is initialized
      with zero, so it's possible to read some bytes (up to 4095) of kernel
      space memory from user space. This is an information leak.
      
      5775         /* bitmap disabled, zero the first byte and copy out */
      5776         if (!mddev->bitmap_info.file)
      5777                 file->pathname[0] = '\0';
      Signed-off-by: NBenjamin Randazzo <benjamin@randazzo.fr>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      b6878d9e
    • N
      md/raid1: extend spinlock to protect raid1_end_read_request against inconsistencies · 423f04d6
      NeilBrown 提交于
      raid1_end_read_request() assumes that the In_sync bits are consistent
      with the ->degaded count.
      raid1_spare_active updates the In_sync bit before the ->degraded count
      and so exposes an inconsistency, as does error()
      So extend the spinlock in raid1_spare_active() and error() to hide those
      inconsistencies.
      
      This should probably be part of
        Commit: 34cab6f4 ("md/raid1: fix test for 'was read error from
        last working device'.")
      as it addresses the same issue.  It fixes the same bug and should go
      to -stable for same reasons.
      
      Fixes: 76073054 ("md/raid1: clean up read_balance.")
      Cc: stable@vger.kernel.org (v3.0+)
      Signed-off-by: NNeilBrown <neilb@suse.com>
      423f04d6
  3. 30 7月, 2015 2 次提交
    • M
      dm cache: fix device destroy hang due to improper prealloc_used accounting · 795e633a
      Mike Snitzer 提交于
      Commit 665022d7 ("dm cache: avoid calls to prealloc_free_structs() if
      possible") introduced a regression that caused the removal of a DM cache
      device to hang in cache_postsuspend()'s call to wait_for_migrations()
      with the following stack trace:
      
        [<ffffffff81651457>] schedule+0x37/0x80
        [<ffffffffa041e21b>] cache_postsuspend+0xbb/0x470 [dm_cache]
        [<ffffffff810ba970>] ? prepare_to_wait_event+0xf0/0xf0
        [<ffffffffa0006f77>] dm_table_postsuspend_targets+0x47/0x60 [dm_mod]
        [<ffffffffa0001eb5>] __dm_destroy+0x215/0x250 [dm_mod]
        [<ffffffffa0004113>] dm_destroy+0x13/0x20 [dm_mod]
        [<ffffffffa00098cd>] dev_remove+0x10d/0x170 [dm_mod]
        [<ffffffffa00097c0>] ? dev_suspend+0x240/0x240 [dm_mod]
        [<ffffffffa0009f85>] ctl_ioctl+0x255/0x4d0 [dm_mod]
        [<ffffffff8127ac00>] ? SYSC_semtimedop+0x280/0xe10
        [<ffffffffa000a213>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
        [<ffffffff811fd432>] do_vfs_ioctl+0x2d2/0x4b0
        [<ffffffff81117d5f>] ? __audit_syscall_entry+0xaf/0x100
        [<ffffffff81022636>] ? do_audit_syscall_entry+0x66/0x70
        [<ffffffff811fd689>] SyS_ioctl+0x79/0x90
        [<ffffffff81023e58>] ? syscall_trace_leave+0xb8/0x110
        [<ffffffff81654f6e>] entry_SYSCALL_64_fastpath+0x12/0x71
      
      Fix this by accounting for the call to prealloc_data_structs()
      immediately _before_ the call as opposed to after.  This is needed
      because it is possible to break out of the control loop after the call
      to prealloc_data_structs() but before prealloc_used was set to true.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      795e633a
    • M
      Revert "dm cache: do not wake_worker() in free_migration()" · 3508e659
      Mike Snitzer 提交于
      This reverts commit 386cb7cd.
      
      Taking the wake_worker() out of free_migration() will slow writeback
      dramatically, and hence adaptability.
      
      Say we have 10k blocks that need writing back, but are only able to
      issue 5 concurrently due to the migration bandwidth: it's imperative
      that we wake_worker() immediately after migration completion; waiting
      for the next 1 second wake up (via do_waker) means it'll take a long
      time to write that all back.
      Reported-by: NJoe Thornber <ejt@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      3508e659
  4. 27 7月, 2015 3 次提交
  5. 24 7月, 2015 6 次提交
  6. 23 7月, 2015 1 次提交
    • G
      md: Skip cluster setup for dm-raid · d3b178ad
      Goldwyn Rodrigues 提交于
      There is a bug that the bitmap superblock isn't initialised properly for
      dm-raid, so a new field can have garbage in new fields.
      (dm-raid does initialisation in the kernel - md initialised the
       superblock in mdadm).
      
      This means that for dm-raid we cannot currently trust the new ->nodes
      field. So:
       - use __GFP_ZERO to initialise the superblock properly for all new
          arrays
       - initialise all fields in bitmap_info in bitmap_new_disk_sb
       - ignore ->nodes for dm arrays (yes, this is a hack)
      
      This bug exposes dm-raid to bug in the (still experimental) md-cluster
      code, so it is suitable for -stable.  It does cause crashes.
      
      References: https://bugzilla.kernel.org/show_bug.cgi?id=100491
      Cc: stable@vger.kernel.org (v4.1)
      Signed-off-By: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      d3b178ad
  7. 22 7月, 2015 3 次提交
    • N
      md: flush ->event_work before stopping array. · ee5d004f
      NeilBrown 提交于
      The 'event_work' worker used by dm-raid may still be running
      when the array is stopped.  This can result in an oops.
      
      So flush the workqueue on which it is run after detaching
      and before destroying the device.
      Reported-by: NHeinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Cc: stable@vger.kernel.org (2.6.38+ please delay 2 weeks after -final release)
      Fixes: 9d09e663 ("dm: raid456 basic support")
      ee5d004f
    • N
      md/raid10: always set reshape_safe when initializing reshape_position. · 299b0685
      NeilBrown 提交于
      'reshape_position' tracks where in the reshape we have reached.
      'reshape_safe' tracks where in the reshape we have safely recorded
      in the metadata.
      
      These are compared to determine when to update the metadata.
      So it is important that reshape_safe is initialised properly.
      Currently it isn't.  When starting a reshape from the beginning
      it usually has the correct value by luck.  But when reducing the
      number of devices in a RAID10, it has the wrong value and this leads
      to the metadata not being updated correctly.
      This can lead to corruption if the reshape is not allowed to complete.
      
      This patch is suitable for any -stable kernel which supports RAID10
      reshape, which is 3.5 and later.
      
      Fixes: 3ea7daa5 ("md/raid10: add reshape support")
      Cc: stable@vger.kernel.org (v3.5+ please wait for -final to be out for 2 weeks)
      Signed-off-by: NNeilBrown <neilb@suse.com>
      299b0685
    • N
      md/raid5: avoid races when changing cache size. · 2d5b569b
      NeilBrown 提交于
      Cache size can grow or shrink due to various pressures at
      any time.  So when we resize the cache as part of a 'grow'
      operation (i.e. change the size to allow more devices) we need
      to blocks that automatic growing/shrinking.
      
      So introduce a mutex.  auto grow/shrink uses mutex_trylock()
      and just doesn't bother if there is a blockage.
      Resizing the whole cache holds the mutex to ensure that
      the correct number of new stripes is allocated.
      
      This bug can result in some stripes not being freed when an
      array is stopped.  This leads to the kmem_cache not being
      freed and a subsequent array can try to use the same kmem_cache
      and get confused.
      
      Fixes: edbe83ab ("md/raid5: allow the stripe_cache to grow and shrink.")
      Cc: stable@vger.kernel.org (4.1 - please delay until 2 weeks after release of 4.2)
      Signed-off-by: NNeilBrown <neilb@suse.com>
      2d5b569b
  8. 17 7月, 2015 3 次提交
  9. 16 7月, 2015 1 次提交