1. 02 Oct, 2015: 5 commits
  2. 12 Sep, 2015: 1 commit
  3. 01 Sep, 2015: 33 commits
    • J
      dm cache: fix use after freeing migrations · cc7da0ba
      Committed by Joe Thornber
      Both free_io_migration() and issue_discard() dereference a migration
      that was just freed.  Fix those by saving off the migration's cache
      object before freeing the migration.  Also clean up needless mg->cache
      dereferences now that the cache object is available directly.
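
      A minimal sketch of the corrected pattern, with simplified helpers (the
      real code lives in drivers/md/dm-cache-target.c):

          static void free_io_migration(struct dm_cache_migration *mg)
          {
                  struct cache *cache = mg->cache; /* save before freeing */

                  free_migration(mg);   /* mg must not be touched after this */
                  wake_worker(cache);   /* use the saved pointer, not mg->cache */
          }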
      
      Fixes: e44b6a5a ("dm cache: move wake_waker() from free_migrations() to where it is needed")
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      cc7da0ba
    • M
      dm cache: small cleanups related to deferred prison cell cleanup · dc9cee5d
      Committed by Mike Snitzer
      Eliminate __cell_release() since it only had one caller that always
      released the cell holder.
      
      Switch cell_error_with_code() to using free_prison_cell() for the sake
      of consistency.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      dc9cee5d
    • J
      dm cache: fix leaking of deferred bio prison cells · 9153df74
      Committed by Joe Thornber
      There were two cases where dm_cell_visit_release() was being called,
      which removes the cell from the prison's rbtree, but the callers didn't
      also return the cell to the mempool.  Fix this by having them call
      free_prison_cell().
      
      This leak manifested as the 'kmalloc-96' slab growing until OOM.
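
      A hedged sketch of the corrected pattern (dm_cell_visit_release() is the
      bio-prison API; free_prison_cell() is the dm-cache helper that returns
      the cell to the mempool; fn and context are placeholders):

          dm_cell_visit_release(cache->prison, fn, context, cell);
          free_prison_cell(cache, cell);  /* previously missing: the cell leaked */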
      
      Fixes: 651f5fa2 ("dm cache: defer whole cells")
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 4.1+
      9153df74
    • N
      md/raid5: ensure device failure recorded before write request returns. · c3cce6cd
      Committed by NeilBrown
      When a write to one of the devices of a RAID5/6 array fails, the failure
      is recorded in the metadata of the other devices so that after a restart
      the data on the failed drive won't be trusted even if that drive seems
      to be working again (maybe a cable was unplugged).
      
      Similarly when we record a bad-block in response to a write failure,
      we must not let the write complete until the bad-block update is safe.
      
      Currently there is no interlock between the write request completing
      and the metadata update.  So it is possible that the write will
      complete, the app will confirm success in some way, and then the
      machine will crash before the metadata update completes.
      
      This is an extremely small hole for a race to fit in, but it is
      theoretically possible and so should be closed.
      
      So:
       - set MD_CHANGE_PENDING when requesting a metadata update for a
         failed device, so we can know with certainty when it completes
       - queue requests that complete while MD_CHANGE_PENDING is set, and
         only process them after the metadata update completes
       - call raid_end_bio_io() on the bios in that queue when the time comes.
      A rough sketch of the interlock follows.
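
      The sketch below uses simplified names; the real code keeps the held-back
      bios on a bio_list and drains it from the raid5 thread once the
      superblock write is known to be safe:

          /* on a write error: request a metadata update and hold the bio back */
          set_bit(MD_CHANGE_PENDING, &mddev->flags);
          md_error(mddev, rdev);
          bio_list_add(&conf->return_bi, bi);      /* deferred, not completed yet */

          /* later, once the metadata update has cleared MD_CHANGE_PENDING */
          if (!test_bit(MD_CHANGE_PENDING, &mddev->flags)) {
                  struct bio *bi;

                  while ((bi = bio_list_pop(&conf->return_bi)))
                          bio_endio(bi);           /* now safe to report completion */
          }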
      Signed-off-by: NeilBrown <neilb@suse.com>
      c3cce6cd
    • N
      md/raid5: use bio_list for the list of bios to return. · 34a6f80e
      Committed by NeilBrown
      This will make it easier to splice two lists together, which will
      be needed in a future patch.
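
      For reference, a hedged example of the bio_list API this switch enables
      (the return_bi field name is illustrative):

          struct bio_list tmp;

          bio_list_init(&tmp);
          bio_list_add(&tmp, bio);                  /* collect bios to return */
          bio_list_merge(&conf->return_bi, &tmp);   /* splice one list onto another */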
      Signed-off-by: NeilBrown <neilb@suse.com>
      34a6f80e
    • N
      md/raid10: ensure device failure recorded before write request returns. · 95af587e
      Committed by NeilBrown
      When a write to one of the legs of a RAID10 array fails, the failure is
      recorded in the metadata of the other legs so that after a restart
      the data on the failed drive won't be trusted even if that drive seems
      to be working again (maybe a cable was unplugged).
      
      Currently there is no interlock between the write request completing
      and the metadata update.  So it is possible that the write will
      complete, the app will confirm success in some way, and then the
      machine will crash before the metadata update completes.
      
      This is an extremely small hole for a race to fit in, but it is
      theoretically possible and so should be closed.
      
      So:
       - set MD_CHANGE_PENDING when requesting a metadata update for a
         failed device, so we can know with certainty when it completes
       - queue requests that experienced an error on a new queue which
         is only processed after the metadata update completes
       - call raid_end_bio_io() on bios in that queue when the time comes.
      Signed-off-by: NeilBrown <neilb@suse.com>
      95af587e
    • N
      md/raid1: ensure device failure recorded before write request returns. · 55ce74d4
      Committed by NeilBrown
      When a write to one of the legs of a RAID1 array fails, the failure is
      recorded in the metadata of the other leg(s) so that after a restart
      the data on the failed drive won't be trusted even if that drive seems
      to be working again (maybe a cable was unplugged).
      
      Similarly when we record a bad-block in response to a write failure,
      we must not let the write complete until the bad-block update is safe.
      
      Currently there is no interlock between the write request completing
      and the metadata update.  So it is possible that the write will
      complete, the app will confirm success in some way, and then the
      machine will crash before the metadata update completes.
      
      This is an extremely small hole for a race to fit in, but it is
      theoretically possible and so should be closed.
      
      So:
       - set MD_CHANGE_PENDING when requesting a metadata update for a
         failed device, so we can know with certainty when it completes
       - queue requests that experienced an error on a new queue which
         is only processed after the metadata update completes
       - call raid_end_bio_io() on bios in that queue when the time comes.
      Signed-off-by: NeilBrown <neilb@suse.com>
      55ce74d4
    • N
      md-cluster: remove inappropriate try_module_get from join() · 18b9f679
      Committed by NeilBrown
      md_setup_cluster already calls try_module_get(), so this
      try_module_get isn't needed.
      Also, there is no matching module_put (except in the error path),
      so this leaves an unbalanced module count.
      Signed-off-by: NeilBrown <neilb@suse.com>
      18b9f679
    • N
      md: extend spinlock protection in register_md_cluster_operations · 6022e75b
      Committed by NeilBrown
      This code looks racy.
      
      The only possible race is if two modules try to register at the same
      time and that won't happen.  But make the code look safe anyway.
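
      Roughly what the protected registration looks like (a sketch of the idea,
      not necessarily the exact hunk):

          int register_md_cluster_operations(struct md_cluster_operations *ops,
                                             struct module *module)
          {
                  int ret = 0;

                  spin_lock(&pers_lock);
                  if (md_cluster_ops != NULL)
                          ret = -EALREADY;
                  else {
                          md_cluster_ops = ops;
                          md_cluster_mod = module;
                  }
                  spin_unlock(&pers_lock);

                  return ret;
          }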
      Signed-off-by: NeilBrown <neilb@suse.com>
      6022e75b
    • G
      md-cluster: Read the disk bitmap sb and check if it needs recovery · abb9b22a
      Committed by Guoqing Jiang
      In gather_all_resync_info, we need to read the disk bitmap sb and
      check if it needs recovery.
      Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      abb9b22a
    • G
      md-cluster: only call complete(&cinfo->completion) when node join cluster · eece075c
      Committed by Guoqing Jiang
      Introduce the MD_CLUSTER_BEGIN_JOIN_CLUSTER flag to make sure
      complete(&cinfo->completion) is only invoked when a node joins the
      cluster. Otherwise a node failure could also trigger the complete(),
      and it doesn't make sense to do it there.
      Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      eece075c
    • G
      md-cluster: add missed lockres_free · 6e6d9f2c
      Committed by Guoqing Jiang
      We also need to free the lock resource before the goto out.
      Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      6e6d9f2c
    • G
      md-cluster: remove the unused sb_lock · b2b9bfff
      Committed by Guoqing Jiang
      The sb_lock is not used anywhere, so let's remove it.
      Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      b2b9bfff
    • G
      md-cluster: init suspend_list and suspend_lock early in join · 9e3072e3
      Committed by Guoqing Jiang
      If a node has just joined the cluster and receives a message from other
      nodes before suspend_list is initialized, the kernel crashes with a NULL
      pointer dereference, so move the initializations earlier to fix the bug.
      
      md-cluster: Joined cluster 3578507b-e0cb-6d4f-6322-696cd7b1b10c slot 3
      BUG: unable to handle kernel NULL pointer dereference at           (null)
      ... ... ...
      Call Trace:
      [<ffffffffa0444924>] process_recvd_msg+0x2e4/0x330 [md_cluster]
      [<ffffffffa0444a06>] recv_daemon+0x96/0x170 [md_cluster]
      [<ffffffffa045189d>] md_thread+0x11d/0x170 [md_mod]
      [<ffffffff810768c4>] kthread+0xb4/0xc0
      [<ffffffff8151927c>] ret_from_fork+0x7c/0xb0
      ... ... ...
      RIP  [<ffffffffa0443581>] __remove_suspend_info+0x11/0xa0 [md_cluster]
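
      A minimal sketch of the fix (suspend_list and suspend_lock are real
      md-cluster fields; the placement inside join() is simplified):

          static int join(struct mddev *mddev, int nodes)
          {
                  struct md_cluster_info *cinfo;

                  cinfo = kzalloc(sizeof(*cinfo), GFP_KERNEL);
                  if (!cinfo)
                          return -ENOMEM;

                  /* initialize before the lockspace is created, so a message
                   * arriving mid-join finds a valid (empty) list */
                  INIT_LIST_HEAD(&cinfo->suspend_list);
                  spin_lock_init(&cinfo->suspend_lock);

                  /* dlm_new_lockspace(), recv thread setup, etc. follow */
                  return 0;
          }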
      Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      9e3072e3
    • G
      md-cluster: add the error check if failed to get dlm lock · b5ef5678
      Committed by Guoqing Jiang
      In a complicated cluster environment it is possible that the dlm lock
      cannot be acquired or converted; add the related error reporting to make
      potential issues easier to debug.

      For lockres_free, if the lock is blocked by a pending lock or conversion
      request, dlm_unlock just puts it back on the grant queue, so we need to
      make sure the lock is actually freed in the end.
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      b5ef5678
    • G
      md-cluster: init completion within lockres_init · b83d51c0
      Committed by Guoqing Jiang
      We should initialize the completion within lockres_init; otherwise the
      completion could be initialized more than once during its life cycle.
      Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      b83d51c0
    • G
      md-cluster: fix deadlock issue on message lock · 66099bb0
      Committed by Guoqing Jiang
      There is a problem with the previous communication mechanism: we hit the
      deadlock scenario below on a cluster with 3 nodes.
      
          Sender                       Receiver                   Receiver

          token(EX)
          message(EX)
          writes message
          downconverts message(CR)
          requests ack(EX)
                                       get message(CR)            gets message(CR)
                                       reads message              reads message
                                       requests EX on message     requests EX on message
      
      To fix this problem, we make the following changes:

      1. the sender downconverts MESSAGE to CW rather than CR.
      2. the receivers request a PR lock, not an EX lock, on message.

      And in case we fail to down-convert EX to CW on message, it is better to
      unlock message rather than keep holding the lock.
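
      In terms of lock modes (a hedged sketch; dlm_lock_sync() is the small
      synchronous wrapper around dlm_lock() used throughout md-cluster.c, and
      the error path shown is simplified):

          /* sender: after writing, drop MESSAGE to CW instead of CR */
          error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_CW);
          if (error)
                  goto failed_message;   /* unlock MESSAGE rather than hold EX */

          /* receivers: take MESSAGE in PR (shared) rather than EX */
          error = dlm_lock_sync(message_lockres, DLM_LOCK_PR);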
      Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Lidong Zhong <ldzhong@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      66099bb0
    • G
      md-cluster: transfer the resync ownership to another node · dc737d7c
      Committed by Guoqing Jiang
      When node A stops an array while the array is doing a resync, we need
      to let another node B take over the resync task.

      To achieve this, node A sends an explicit BITMAP_NEEDS_SYNC message to
      the cluster. The node B that receives that message invokes
      __recover_slot to do the resync.
      Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      dc737d7c
    • G
      md-cluster: split recover_slot for future code reuse · 05cd0e51
      Committed by Guoqing Jiang
      Make recover_slot a wrapper around __recover_slot, since the
      logic of __recover_slot can be reused when another node needs
      to take over the resync job.
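
      A hedged sketch of the resulting split (the DLM recovery callback keeps
      its signature and simply delegates):

          static void recover_slot(void *arg, struct dlm_slot *slot)
          {
                  struct mddev *mddev = arg;

                  /* dlm slot numbers start at 1, md-cluster slots at 0 */
                  __recover_slot(mddev, slot->slot - 1);
          }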
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      05cd0e51
    • G
      md-cluster: use %pU to print UUIDs · b89f704a
      Committed by Guoqing Jiang
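      For reference, the %pU printk extension formats a 16-byte UUID directly;
      an illustrative use (the fields shown are for illustration only):

          pr_info("md-cluster: joined cluster %pU slot %d\n",
                  mddev->uuid, cinfo->slot_number);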
      Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      b89f704a
    • S
      md: setup safemode_timer before it's being used · 25b2edfa
      Committed by Sasha Levin
      We used to set up the safemode_timer in md_run. If md_run failed before
      the timer was set up, we'd end up trying to modify a timer that doesn't
      have a callback function when we access safe_delay_store, which would
      trigger a BUG.
      
      neilb: delete init_timer() call as setup_timer() does that.
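
      The setup itself, sketched (per this patch it happens when the mddev is
      allocated rather than in md_run(); placement here is illustrative):

          setup_timer(&mddev->safemode_timer, md_safemode_timeout,
                      (unsigned long)mddev);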
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      25b2edfa
    • N
      md/raid5: handle possible race as reshape completes. · 6cbd8148
      Committed by NeilBrown
      It is possible (though unlikely) for a reshape to be
      interrupted between the time that end_reshape is called
      and the time when raid5_finish_reshape is called.
      
      This can leave conf->reshape_progress set to MaxSector,
      but mddev->reshape_position not.
      
      This combination confuses reshape_request() when ->reshape_backwards is set.
      As conf->reshape_progress is so high, it seems the reshape hasn't
      really begun.  But assuming MaxSector is a valid address only
      leads to sorrow.
      
      So ensure reshape_position and reshape_progress both agree,
      and add an extra check in reshape_request() just in case they don't.
      Signed-off-by: NeilBrown <neilb@suse.com>
      6cbd8148
    • N
      md: sync sync_completed has correct value as recovery finishes. · 5ed1df2e
      Committed by NeilBrown
      There can be a small window between the moment that recovery
      actually writes the last block and the time when various sysfs
      and /proc/mdstat attributes report that it has finished.
      During this time, 'sync_completed' can have the wrong value.
      This can confuse monitoring software.
      
      So:
       - don't set curr_resync_completed beyond the end of the devices,
       - set it correctly when resync/recovery has completed.
      Signed-off-by: NeilBrown <neilb@suse.com>
      5ed1df2e
    • N
      md: be careful when testing resync_max against curr_resync_completed. · c5e19d90
      Committed by NeilBrown
      While it generally shouldn't happen, it is not impossible for
      curr_resync_completed to exceed resync_max.
      This can particularly happen when reshaping RAID5 - the current
      status isn't copied to curr_resync_completed promptly, so when it
      is, it can exceed resync_max.
      This happens when the reshape is 'frozen', resync_max is set low,
      and reshape is re-enabled.
      
      Taking a difference between two unsigned numbers is always dangerous
      anyway, so add a test to behave correctly if
         curr_resync_completed > resync_max
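
      The hazard, in general terms (a sketch rather than the exact hunk):

          sector_t gap;

          if (mddev->curr_resync_completed > mddev->resync_max)
                  gap = 0;   /* subtracting blindly would wrap the unsigned value */
          else
                  gap = mddev->resync_max - mddev->curr_resync_completed;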
      Signed-off-by: NeilBrown <neilb@suse.com>
      c5e19d90
    • N
      md: set MD_RECOVERY_RECOVER when starting a degraded array. · a4a3d26d
      Committed by NeilBrown
      This ensures that 'sync_action' will show 'recover' as soon as the
      array is started.  If there is no spare the status will change to
      'idle' once that is detected.
      
      Clear MD_RECOVERY_RECOVER for a read-only array to ensure this change
      happens.
      
      This allows scripts which monitor status not to get confused -
      particularly my test scripts.
      Signed-off-by: NeilBrown <neilb@suse.com>
      a4a3d26d
    • N
      md/raid5: remove incorrect "min_t()" when calculating writepos. · c74c0d76
      Committed by NeilBrown
      This code is calculating:
        writepos, which is the furthest along address (device-space) that we
           *will* be writing to
        readpos, which is the earliest address that we *could* possibly read
           from, and
        safepos, which is the earliest address in the 'old' section that we
           might read from after a crash when the reshape position is
           recovered from metadata.
      
        The first is a precise calculation, so clipping at zero doesn't
        make sense.  As the reshape position is now guaranteed to always be
        a multiple of reshape_sectors and as we already BUG_ON when
        reshape_progress is zero, there is no point in this min_t() call.
      
        The readpos and safepos are worst case - actual value depends on
        precise geometry.  That worst case could be negative, which is only
        a problem because we are storing the value in an unsigned.
        So leave the min_t() for those.
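
      The relevant calculation, roughly as it stands after this change (from
      reshape_request() in drivers/md/raid5.c, simplified):

          writepos = conf->reshape_progress;
          sector_div(writepos, new_data_disks);
          readpos = conf->reshape_progress;
          sector_div(readpos, data_disks);
          safepos = conf->reshape_safe;
          sector_div(safepos, data_disks);

          if (mddev->reshape_backwards) {
                  BUG_ON(writepos < reshape_sectors);
                  writepos -= reshape_sectors;   /* exact, cannot go negative */
                  readpos  += reshape_sectors;
                  safepos  += reshape_sectors;
          } else {
                  writepos += reshape_sectors;
                  /* worst-case estimates: clip at zero, they are unsigned */
                  readpos -= min_t(sector_t, reshape_sectors, readpos);
                  safepos -= min_t(sector_t, reshape_sectors, safepos);
          }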
      Signed-off-by: NeilBrown <neilb@suse.com>
      
      c74c0d76
    • N
      md/raid5: strengthen check on reshape_position at run. · 05256d98
      Committed by NeilBrown
      When reshaping, we work in units of the largest chunk size.
      If changing from a larger to a smaller chunk size, that means we
      reshape more than one stripe at a time.  So the required alignment
      of reshape_position needs to take into account both the old
      and new chunk size.
      
      This means that both 'here_new' and 'here_old' are calculated with
      respect to the same (maximum) chunk size, so testing if they are the
      same when delta_disks is zero becomes pointless.
      Signed-off-by: NeilBrown <neilb@suse.com>
      05256d98
    • N
      md/raid5: switch to use conf->chunk_sectors in place of mddev->chunk_sectors where possible · 3cb5edf4
      Committed by NeilBrown
      The chunk_sectors and new_chunk_sectors fields of mddev can be changed
      any time (via sysfs) that the reconfig mutex can be taken.  So raid5
      keeps internal copies in 'conf' which are stable except for a short
      locked moment when reshape stops/starts.
      
      So any access that does not hold reconfig_mutex should use the 'conf'
      values, not the 'mddev' values.
      Several don't.
      
      This could result in corruption if new values were written at awkward
      times.
      
      Also use min() or max() rather than open-coding.
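
      The access pattern this enforces, sketched (illustrative, not a literal
      hunk from the patch):

          /* in paths that may run without reconfig_mutex, use the stable copies */
          unsigned int chunk_sectors = max(conf->chunk_sectors,
                                           conf->prev_chunk_sectors);
          /* ...instead of mddev->chunk_sectors / mddev->new_chunk_sectors */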
      Signed-off-by: NeilBrown <neilb@suse.com>
      3cb5edf4
    • N
      md/raid5: always set conf->prev_chunk_sectors and ->prev_algo · 5cac6bcb
      Committed by NeilBrown
      These aren't really needed when no reshape is happening,
      but it is safer to have them always set to a meaningful value.
      The next patch will use ->prev_chunk_sectors without checking
      if a reshape is happening (because that makes the code simpler),
      and this patch makes that safe.
      Signed-off-by: NeilBrown <neilb@suse.com>
      5cac6bcb
    • N
      md/raid10: fix a few typos in comments · 02ec5026
      Committed by NeilBrown
      Signed-off-by: NeilBrown <neilb@suse.com>
      02ec5026
    • N
      md/raid5: consider updating reshape_position at start of reshape. · 92140480
      Committed by NeilBrown
      md/raid5 only updates ->reshape_position (which is stored in
      metadata and is authoritative) occasionally, but particularly
      when getting close to ->resync_max, as it must be correct
      when ->resync_max is reached.
      
      When mdadm tries to stop an array which is reshaping it will:
       - freeze the reshape,
       - set resync_max to where the reshape has reached.
       - unfreeze the reshape.
      When this happens, the reshape is aborted and then restarted.
      
      The restart doesn't check that resync_max is close, and so doesn't
      update ->reshape_position like it should.
      This results in the reshape stopping, but ->reshape_position being
      incorrect.
      
      So on that first call to reshape_request, make sure ->reshape_position
      is updated if needed.
      Signed-off-by: NeilBrown <neilb@suse.com>
      92140480
    • N
      md: close some races between setting and checking sync_action. · 985ca973
      Committed by NeilBrown
      When checking sync_action in a script, we want to be sure it is
      as accurate as possible.
      As resync/reshape etc. don't always start immediately (a separate
      thread is scheduled to do it), it is best if 'action_show'
      checks if MD_RECOVERY_NEEDED is set (which it does) and in that
      case reports what is likely to start soon (which it only sometimes
      does).
      
      So:
       - report 'reshape' if reshape_position suggests one might start.
       - set MD_RECOVERY_RECOVER in raid1_reshape(), because that is very
         likely to happen next.
      Signed-off-by: NeilBrown <neilb@suse.com>
      985ca973
    • N
      md: Keep /proc/mdstat reporting recovery until fully DONE. · f7851be7
      Committed by NeilBrown
      Currently when a recovery completes, mdstat shows that it has finished
      before the new device is marked as a full member.  Because of this it
      can appear to a script that the recovery finished but the array isn't
      in sync.
      
      So while MD_RECOVERY_DONE is still set, keep mdstat reporting "recovery".
      Once md_reap_sync_thread() completes, the spare will be active and then
      MD_RECOVERY_DONE will be cleared.
      
      To ensure this is race-free, set MD_RECOVERY_DONE before clearing
      curr_resync.
      Signed-off-by: NeilBrown <neilb@suse.com>
      f7851be7
  4. 29 Aug, 2015: 1 commit