1. 26 Mar 2016, 3 commits
    • ocfs2/dlm: move lock to the tail of grant queue while doing in-place convert · e5054c9a
      Committed by xuejiufei
      We found a bug when two nodes do umount one after another.
      
      1) Node 1 migrates a lockres with 3 locks on its grant queue, e.g.
         N2(PR)<->N3(NL)<->N4(PR), to N2.  After migration, the lvbs of
         locks N3(NL) and N4(PR) are empty on node 2, because the migration
         target does not copy the lvb for these two locks.
      
      2) Node 3 wants to convert to PR.  The convert is granted in place in
         __dlmconvert_master(), and the order of these locks stays
         unchanged.  The lvb of lock N3(PR) on node 2 is copied from the
         lockres in dlm_update_lvb(), while the lvb of lock N4(PR) is still
         empty.
      
      3) Node 2 wants to leave the domain, so it migrates this lockres to
         node 3.  Node 2 then triggers the BUG in
         dlm_prepare_lvb_for_migration() when adding lock N4(PR) to mres,
         with the following message, because the lvb of mres has already
         been copied from lock N3(PR) but the lvb of lock N4(PR) is empty:
      
      "Mismatched lvb in lock cookie=%u:%llu, name=%.*s, node=%u"
      
      [akpm@linux-foundation.org: tweak comment]
      Signed-off-by: xuejiufei <xuejiufei@huawei.com>
      Acked-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2/dlm: fix BUG in dlm_move_lockres_to_recovery_list · be12b299
      Committed by Joseph Qi
      When the master handles a convert request, it queues the ast first
      and then returns the status.  Because these two messages are sent by
      different threads, the ast may arrive before the request status.  If
      the master goes down right after the ast is sent, the requesting node
      may trigger the BUG in dlm_move_lockres_to_recovery_list(), because
      the ast handler has already moved the lock to the grant list without
      clearing lock->convert_pending.  So remove the BUG_ON statement and
      instead check in dlmconvert_remote() whether the ast has already been
      processed, as sketched below.
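
      A rough sketch of such a check (illustrative only; the exact
      condition used upstream may differ).  It assumes the requested level
      is already reflected in lock->ml.type once the ast has been processed:

          /* in dlmconvert_remote(), before queueing a new convert */
          spin_lock(&res->spinlock);
          if (lock->ml.type == type && lock->ml.convert_type == LKM_IVMODE) {
                  /*
                   * The ast for the previous convert request was already
                   * received and processed, so the lock is already at the
                   * requested level; do not send the convert again.
                   */
                  spin_unlock(&res->spinlock);
                  return DLM_NORMAL;
          }
          spin_unlock(&res->spinlock);
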
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Reported-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Tariq Saeed <tariq.x.saeed@oracle.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2/dlm: fix race between convert and recovery · ac7cf246
      Committed by Joseph Qi
      There is a race window between dlmconvert_remote() and
      dlm_move_lockres_to_recovery_list() which can leave a lock on the
      grant list with OCFS2_LOCK_BUSY still set, so the system hangs.
      
      dlmconvert_remote
      {
              spin_lock(&res->spinlock);
              list_move_tail(&lock->list, &res->converting);
              lock->convert_pending = 1;
              spin_unlock(&res->spinlock);
      
              status = dlm_send_remote_convert_request();
        >>>>>> race window: the master has queued the ast and returned
               DLM_NORMAL, then goes down before sending the ast.
               This node detects the master is down and calls
               dlm_move_lockres_to_recovery_list(), which reverts the
               lock to the grant list.
               OCFS2_LOCK_BUSY is then never cleared, because the new
               master will not send an ast; it considers the convert
               already granted.
      
              spin_lock(&res->spinlock);
              lock->convert_pending = 0;
              if (status != DLM_NORMAL)
                      dlm_revert_pending_convert(res, lock);
              spin_unlock(&res->spinlock);
      }
      
      To handle this case, check whether res->state has the
      DLM_LOCK_RES_RECOVERING bit set (the resource is still recovering) or
      the resource master has changed (the new master has finished
      recovery).  If so, reset the status to DLM_RECOVERING so that the
      convert is retried, as sketched below.
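
      A minimal sketch of the idea in dlmconvert_remote(), after the remote
      request returns; it assumes the owner seen before the request was
      saved in a local old_owner (illustrative, not the literal patch):

          spin_lock(&res->spinlock);
          if (status != DLM_NORMAL) {
                  dlm_revert_pending_convert(res, lock);
          } else if ((res->state & DLM_LOCK_RES_RECOVERING) ||
                     (old_owner != res->owner)) {
                  /*
                   * The master granted the convert but died before its ast
                   * arrived, or recovery already picked a new master.
                   * Revert and report DLM_RECOVERING so the caller retries
                   * the convert once recovery has settled.
                   */
                  dlm_revert_pending_convert(res, lock);
                  status = DLM_RECOVERING;
          }
          lock->convert_pending = 0;
          spin_unlock(&res->spinlock);
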
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Reported-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Tariq Saeed <tariq.x.saeed@oracle.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 16 Mar 2016, 7 commits
  3. 06 Feb 2016, 1 commit
  4. 15 Jan 2016, 7 commits
  5. 30 Dec 2015, 1 commit
    • ocfs2/dlm: clear migration_pending when migration target goes down · cc28d6d8
      Committed by xuejiufei
      We hit a BUG on res->migration_pending when migrating lock resources.
      The situation is as follows.
      
      dlm_mark_lockres_migrating
        res->migration_pending = 1;
        __dlm_lockres_reserve_ast
        dlm_lockres_release_ast returns with res->migration_pending still
            set, because other threads have reserved asts
        wait until dlm_migration_can_proceed returns 1
        >>>>>>> o2hb finds that the target has gone down and removes the
                target from domain_map
        dlm_migration_can_proceed returns 1
        dlm_mark_lockres_migrating returns -ESHUTDOWN with
            res->migration_pending still set.
      
      When dlm_mark_lockres_migrating() is entered again, it triggers the
      BUG_ON on res->migration_pending.  So clear migration_pending when
      the target goes down, as sketched below.
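
      A minimal sketch of the fix idea; the target_is_down condition below
      is illustrative, standing in for however the function learns that the
      migration target has left the domain:

          /* error path of dlm_mark_lockres_migrating() when the target
           * node went down during the wait */
          if (target_is_down) {
                  spin_lock(&res->spinlock);
                  /* clear the flag so a later migration attempt does not
                   * hit BUG_ON(res->migration_pending) */
                  res->migration_pending = 0;
                  spin_unlock(&res->spinlock);
                  ret = -ESHUTDOWN;
          }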
      Signed-off-by: Jiufei Xue <xuejiufei@huawei.com>
      Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 06 Nov 2015, 1 commit
  7. 23 Oct 2015, 1 commit
  8. 23 Sep 2015, 1 commit
  9. 12 Sep 2015, 1 commit
  10. 05 Sep 2015, 5 commits
  11. 25 Jun 2015, 1 commit
  12. 06 May 2015, 1 commit
    • ocfs2: dlm: fix race between purge and get lock resource · b1432a2a
      Committed by Junxiao Bi
      There is a race window in dlm_get_lock_resource() that may return a
      lock resource which has already been purged.  This causes the process
      to hang forever in dlmlock(), because the ast message cannot be
      handled once its lock resource no longer exists.  A sketch of the fix
      idea follows the trace below.
      
          dlm_get_lock_resource {
              ...
              spin_lock(&dlm->spinlock);
              tmpres = __dlm_lookup_lockres_full(dlm, lockid, namelen, hash);
              if (tmpres) {
                   spin_unlock(&dlm->spinlock);
                   >>>>>>>> race window, dlm_run_purge_list() may run and purge
                                    the lock resource
                   spin_lock(&tmpres->spinlock);
                   ...
                   spin_unlock(&tmpres->spinlock);
              }
          }
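
      One way to close the window is to recheck, under tmpres->spinlock,
      whether the lockres was purged while dlm->spinlock was dropped, and
      restart the lookup if so.  A sketch of this idea, assuming a purged
      lockres is unhashed and that the lookup sits under a lookup: label
      (illustrative; the exact upstream check may differ):

          spin_lock(&dlm->spinlock);
          tmpres = __dlm_lookup_lockres_full(dlm, lockid, namelen, hash);
          if (tmpres) {
                  spin_unlock(&dlm->spinlock);
                  spin_lock(&tmpres->spinlock);
                  /* the purge list may have run after dlm->spinlock was
                   * dropped; if the lockres got unhashed, start over */
                  if (hlist_unhashed(&tmpres->hash_node)) {
                          spin_unlock(&tmpres->spinlock);
                          dlm_lockres_put(tmpres);
                          tmpres = NULL;
                          goto lookup;
                  }
                  ...
                  spin_unlock(&tmpres->spinlock);
          }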
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 11 Feb 2015, 4 commits
  14. 09 Jan 2015, 1 commit
  15. 19 Dec 2014, 1 commit
  16. 11 Dec 2014, 3 commits
  17. 10 Oct 2014, 1 commit
    • ocfs2: fix deadlock between o2hb thread and o2net_wq · 70e82a12
      Committed by Joseph Qi
      The following case may lead to a deadlock between o2net_wq and the
      o2hb thread on o2hb_callback_sem.
      Assume there are two nodes, N1 and N2, in the cluster.  N2 goes down
      and, at the same time, N3 tries to join the cluster, so N1 handles
      the node down (N2) and the join (N3) simultaneously.
          o2hb                               o2net_wq
          ->o2hb_do_disk_heartbeat
          ->o2hb_check_slot
          ->o2hb_run_event_list
          ->o2hb_fire_callbacks
          ->down_write(&o2hb_callback_sem)
          ->o2net_hb_node_down_cb
          ->flush_workqueue(o2net_wq)
                                             ->o2net_process_message
                                             ->dlm_query_join_handler
                                             ->o2hb_check_node_heartbeating
                                             ->o2hb_fill_node_map
                                             ->down_read(&o2hb_callback_sem)
      
      There is no need to take o2hb_callback_sem in dlm_query_join_handler();
      o2hb_live_lock is enough to protect the live node map.  A sketch of
      the idea follows.
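
      A rough sketch; the _no_sem helper name below is illustrative of a
      heartbeat check that fills the live node map under o2hb_live_lock (a
      spinlock) only, instead of taking o2hb_callback_sem:

          /* in dlm_query_join_handler(): do not take o2hb_callback_sem,
           * which the o2hb thread holds while it flushes o2net_wq */
          if (!o2hb_check_node_heartbeating_no_sem(query->node_idx)) {
                  /* the joining node is not heartbeating; refuse the join */
                  ...
          }
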
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: jiangyiwen <jiangyiwen@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>