1. 17 Mar, 2017 3 commits
    •
      md-cluster: introduce cluster_check_sync_size · b98938d1
      Guoqing Jiang committed
      Supporting resize is a little complex for clustered
      raid, since we need to ensure that all the nodes share
      the same knowledge of the size of the raid.
      
      We achieve this by checking the sync_size stored in
      each node's bitmap; the capacity can only be changed
      after cluster_check_sync_size returns 0.
      
      Also, get_bitmap_from_slot is added to get a slot's
      bitmap, and some functions are exported since they are
      used in cluster_check_sync_size().
      
      We can also reuse get_bitmap_from_slot to remove the
      redundant code in bitmap_copy_from_slot.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
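The agreement check described in this commit can be pictured with a small user-space sketch. It is not the kernel implementation: `struct slot_bitmap` and `check_sync_size` are hypothetical names, and the only convention taken from the commit message is that a return value of 0 means every node reports the same sync_size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the sync_size kept in one node's bitmap. */
struct slot_bitmap {
    unsigned long long sync_size;
};

/*
 * Return 0 when every slot reports the same sync_size, -1 otherwise.
 * Only the "0 means all nodes agree" convention comes from the commit
 * message; the rest is an illustrative assumption.
 */
int check_sync_size(const struct slot_bitmap *slots, size_t nr_slots)
{
    assert(slots != NULL && nr_slots > 0);

    unsigned long long expected = slots[0].sync_size;

    for (size_t i = 1; i < nr_slots; i++) {
        if (slots[i].sync_size != expected)
            return -1; /* at least one node still has a different size */
    }
    return 0;
}
```

A caller in this model would change the capacity only after `check_sync_size()` returns 0.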
    •
      md-cluster: add CHANGE_CAPACITY message type · 7da3d203
      Guoqing Jiang committed
      The msg type CHANGE_CAPACITY is introduced to support
      resizing clustered raid in a later patch. It is sent
      after all the nodes have the same sync_size, so the
      receiving node just needs to set the new capacity once
      it receives this msg.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
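A rough sketch of the receiver side follows. The names `msg_type`, `cluster_msg`, `array_state` and `handle_msg` are hypothetical. The sketch also assumes that the receiving node already knows the new array size, for example from reloaded metadata, and only has to publish it as the visible capacity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical message types; the real enum has more members. */
enum msg_type { MSG_METADATA_UPDATED, MSG_CHANGE_CAPACITY };

struct cluster_msg {
    enum msg_type type;
};

/* Hypothetical per-array state on the receiving node. */
struct array_state {
    unsigned long long array_sectors; /* size the node already agreed on */
    unsigned long long capacity;      /* size currently exposed to users */
};

/* Return 0 if the message was handled here, -1 if it is not ours. */
int handle_msg(struct array_state *st, const struct cluster_msg *msg)
{
    assert(st != NULL && msg != NULL);

    switch (msg->type) {
    case MSG_CHANGE_CAPACITY:
        /* All nodes agree on sync_size by now, so just apply the size. */
        st->capacity = st->array_sectors;
        return 0;
    default:
        return -1;
    }
}
```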
    •
      md-cluster: use sync way to handle METADATA_UPDATED msg · 0ba95977
      Guoqing Jiang committed
      Previously, when a node received a METADATA_UPDATED msg, it
      just needed to wake up mddev->thread, and md_reload_sb would
      be called eventually.
      
      We took the asynchronous approach to avoid a deadlock, which
      could happen when one node is receiving the METADATA_UPDATED
      msg (wants reconfig_mutex) while trying to run the path:
      
      md_check_recovery -> mddev_trylock(hold reconfig_mutex)
                        -> md_update_sb-metadata_update_start
      		     (want EX on token however token is
      		      got by the sending node)
      
      Since we will support resizing for clustered raid, we need
      the metadata update handling to be synchronous so that the
      initiating node can detect failure. So we need to change the
      way the METADATA_UPDATED msg is handled.
      
      But we obviously need to avoid the above deadlock with the
      synchronous approach. To make this happen, we consider calling
      md_reload_sb without holding reconfig_mutex: if some other
      thread has already taken reconfig_mutex and is waiting for the
      'token', then process_recvd_msg() can safely call md_reload_sb()
      without taking the mutex. This is because we can be certain
      that no other thread will take the mutex, and we are also
      certain that the actions performed by md_reload_sb() won't
      interfere with anything that the other thread is in the middle of.
      
      To make this more concrete, we added a new cinfo->state bit
              MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
      
      which is set in lock_token() just before dlm_lock_sync() is
      called, and cleared just after. As lock_token() is always
      called with reconfig_mutex held (the exception is
      resync_info_update, which the previous patch handles
      separately), if process_recvd_msg() finds that the new bit is
      set, then the mutex must be held by some other thread, and
      that thread will keep waiting.
      
      So process_metadata_update() can call md_reload_sb() if either
      mddev_trylock() succeeds, or if MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
      is set. The tricky bit is what to do if neither of these applies:
      we need to wait. Fortunately mddev_unlock() always calls wake_up()
      on mddev->thread->wqueue, so we can get lock_token() to call
      wake_up() on that when it sets the bit.
      
      There are also some related changes in this commit:
      1. remove the RELOAD_SB related code since it is no longer valid.
      2. mddev is added to md_cluster_info so that we can get mddev
         inside lock_token.
      3. add a new parameter to lock_token to indicate whether
         reconfig_mutex is held.
      
      And we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in the places below:
      1. set it before unregistering the thread, otherwise a deadlock
         could appear when stopping a resyncing array.
         This is because md_unregister_thread(&cinfo->recv_thread) is
         blocked by recv_daemon -> process_recvd_msg
      			  -> process_metadata_update.
         To resolve the issue, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
         also needs to be set before unregistering the thread.
      2. set it in metadata_update_start to fix another deadlock.
        a. Node A sends METADATA_UPDATED msg (holding the Token lock).
        b. Node B wants to do resync, and is blocked since it can't
           get the Token lock, but MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
           not set since the callchain
           (md_do_sync -> sync_request
                       -> resync_info_update
                       -> sendmsg
                       -> lock_comm -> lock_token)
           doesn't hold reconfig_mutex.
        c. Node B tries to update sb (holding reconfig_mutex), but stops
           at wait_event() in metadata_update_start since we have set
           the MD_CLUSTER_SEND_LOCK flag in lock_comm (step b).
        d. Then Node B receives the METADATA_UPDATED msg from A, and
           recv_daemon is blocked forever.
         Since metadata_update_start always calls lock_token with reconfig_mutex
         held, we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD here as well, and
         lock_token doesn't need to set it twice unless lock_token is invoked from
         lock_comm.
      
      Finally, thanks to Neil for his great idea and help!
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
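The rule this commit message gives for the receiving thread can be condensed into a small decision function. This is a simplified model, not the kernel code: `recv_action` and `metadata_update_action` are hypothetical names, and the two booleans stand in for the result of mddev_trylock() and the state of the MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD bit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes for the thread that received METADATA_UPDATED. */
enum recv_action {
    RELOAD_WITH_MUTEX,    /* trylock succeeded: reload, then unlock        */
    RELOAD_WITHOUT_MUTEX, /* holder is parked waiting for the token        */
    KEEP_WAITING          /* neither applies: sleep until woken, recheck   */
};

/*
 * trylock_ok     models the result of mddev_trylock().
 * holding_for_rx models the MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD bit.
 */
enum recv_action metadata_update_action(bool trylock_ok, bool holding_for_rx)
{
    if (trylock_ok)
        return RELOAD_WITH_MUTEX;
    if (holding_for_rx)
        return RELOAD_WITHOUT_MUTEX;
    return KEEP_WAITING;
}
```

In this model the `KEEP_WAITING` case is the one that relies on the wake-ups described above: both releasing the mutex and setting the bit have to wake the waiter so that it evaluates the condition again.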
  2. 10 Mar, 2017 2 commits
  3. 22 Sep, 2016 6 commits
  4. 25 Aug, 2016 1 commit
  5. 10 May, 2016 2 commits
  6. 05 May, 2016 5 commits
  7. 25 Jan, 2016 1 commit
    •
      md-cluster: fix missing memory free · 4ac7a65f
      Shaohua Li committed
      There are several places where we allocate a dlm_lock_resource
      but do not free it.
      
      leave() needs to free a lock resource too (from Guoqing)
      Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Cc: Guoqing Jiang <gqjiang@suse.com>
      Cc: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  8. 06 Jan, 2016 7 commits
  9. 24 Oct, 2015 4 commits
  10. 16 Oct, 2015 1 commit
  11. 13 Oct, 2015 6 commits
  12. 12 Oct, 2015 2 commits
    •
      md-cluster: Fix adding of new disk with new reload code · dbb64f86
      Goldwyn Rodrigues committed
      Adding the disk worked incorrectly with the new reload code. Fix it:
      
       - No operation should be performed on rdev marked as Candidate
       - After a metadata update operation, kick disk if role is 0xfffe
         else clear Candidate bit and continue with the regular change check.
       - Save the mode of the lock resource to check whether the token lock is
         already locked, because it can be called twice while adding a disk.
         However, unlock_comm() must be called only once.
       - add_new_disk() is called by the node initiating the --add operation.
         If it needs to be canceled, call add_new_disk_cancel(). The operation
         is completed by md_update_sb() which will write and unlock the
         communication.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
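The Candidate handling listed in this commit can be sketched as follows. This is an illustrative model only: `struct disk`, `settle_candidate` and `ROLE_REJECTED` are hypothetical names, and the only values taken from the commit message are the role 0xfffe and the kick-or-clear rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Role value that, per the commit message, means "kick this disk". */
#define ROLE_REJECTED 0xfffeu

/* Hypothetical, heavily reduced view of a member device. */
struct disk {
    unsigned int role;
    bool candidate; /* disk is being added, not yet a confirmed member */
    bool kicked;    /* disk was removed from the array                 */
};

/*
 * Apply the post-metadata-update rule to a Candidate disk: kick it if
 * the superblock role is 0xfffe, otherwise clear the Candidate mark so
 * that the regular change check can continue. Disks that are not
 * Candidates are left alone by this helper.
 */
void settle_candidate(struct disk *d, unsigned int role_from_sb)
{
    assert(d != NULL);

    if (!d->candidate)
        return;

    if (role_from_sb == ROLE_REJECTED) {
        d->kicked = true;
    } else {
        d->candidate = false;
        d->role = role_from_sb;
    }
}
```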
    •
      md-cluster: Perform resync/recovery under a DLM lock · c186b128
      Goldwyn Rodrigues committed
      Resync or recovery must be performed by only one node at a time.
      A DLM lock resource, resync_lockres, provides the mutual exclusion
      so that only one node performs the recovery/resync at a time.
      
      If a node is unable to get the resync_lockres because recovery is
      being performed by another node, it sets MD_RECOVER_NEEDED so as
      to schedule recovery in the future.
      
      Remove the debug message in resync_info_update()
      used during development.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
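The mutual exclusion described in this commit can be modelled in a single process as a try-lock with a deferred-retry flag. This is a sketch under strong assumptions: a real DLM lock is cluster-wide, whereas `struct resync_lock`, `struct node` and `try_start_resync` here are hypothetical and live in one address space. The `recover_needed` field stands in for MD_RECOVER_NEEDED.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for resync_lockres; owner is -1 when free. */
struct resync_lock {
    int owner;
};

/* Hypothetical per-node state. */
struct node {
    int id;
    bool recover_needed; /* models MD_RECOVER_NEEDED */
};

/*
 * Try to become the one node that performs resync/recovery.
 * On failure, remember that recovery must be scheduled again later.
 */
bool try_start_resync(struct resync_lock *lk, struct node *n)
{
    assert(lk != NULL && n != NULL);

    if (lk->owner != -1 && lk->owner != n->id) {
        n->recover_needed = true; /* another node is resyncing */
        return false;
    }
    lk->owner = n->id;
    n->recover_needed = false;
    return true;
}

/* Release the lock so that a waiting node can retry. */
void finish_resync(struct resync_lock *lk, const struct node *n)
{
    assert(lk != NULL && n != NULL);

    if (lk->owner == n->id)
        lk->owner = -1;
}
```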