1. 24 Jun 2014, 1 commit
    • ocfs2/dlm: do not purge lockres that is queued for assert master · ac4fef4d
      Authored by Xue jiufei
      When the workqueue is delayed, a lockres may be purged while it is still
      queued for a master assert.  This may trigger a BUG() as follows.
      
      N1                                         N2
      dlm_get_lockres()
      ->dlm_do_master_requery
                                        is the master of lockres,
                                        so queue assert_master work
      
                                        dlm_thread() start running
                                        and purge the lockres
      
                                        dlm_assert_master_worker()
                                        send assert master message
                                        to other nodes
      receiving the assert_master
      message, set master to N2
      
      dlmlock_remote() sends a create_lock message to N2 but receives DLM_IVLOCKID;
      if it is the RECOVERY lockres, this triggers the BUG().
      
      Another BUG() is triggered when N3 becomes the new master and sends
      assert_master to N1: N1 hits the BUG() because the owner doesn't match.
      So we should not purge a lockres while it is queued for assert master.
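
      A minimal sketch of the idea behind the fix (the counter field and helper
      placement are illustrative assumptions, not necessarily the upstream code):
      a lockres with assert-master work still queued is not treated as unused, so
      dlm_run_purge_list() will skip it.

          /* when an assert_master work item is queued for this lockres */
          spin_lock(&res->spinlock);
          res->inflight_assert_workers++;       /* illustrative counter */
          spin_unlock(&res->spinlock);

          /* ...the worker decrements it again when the work item finishes... */

          /* in __dlm_lockres_unused(), called with res->spinlock held */
          if (res->inflight_assert_workers)
                  return 0;                     /* assert still pending: do not purge */
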
      Signed-off-by: joyce.xue <xuejiufei@huawei.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ac4fef4d
  2. 05 Jun 2014, 1 commit
    • ocfs2/dlm: fix possible conversion deadlock · 6718cb5e
      Authored by Xue jiufei
      We found a conversion deadlock that occurs when the owner of a lockres
      happens to crash before sending DLM_PROXY_AST_MSG for a downconverting
      lock.  The situation is as follows:
      
      Node1                            Node2                  Node3
                                 the owner of lockresA
      lock_1 granted at EX mode
      and call ocfs2_cluster_unlock
      to decrease ex_holders.
                                                       converting lock_3 from
                                                       NL to EX
                                 send DLM_PROXY_AST_MSG
                                 to Node1, asking Node 1
                                 to downconvert.
      receiving DLM_PROXY_AST_MSG,
      thread ocfs2dc send
      DLM_CONVERT_LOCK_MSG
      to Node2 to downconvert
      lock_1(EX->NL).
                                 lock_1 can be granted and
                                 put it into pending_asts
                                 list, return DLM_NORMAL.
                                 then something happened
                                 and Node2 crashed.
      received DLM_NORMAL, waiting
      for DLM_PROXY_AST_MSG.
                                                     selected as the recovery
                                                       master, receiving migrate
                                                     lock from Node1, queue
                                                     lock_1 to the tail of
                                                     converting list.
      
      After dlm recovery, the converting list on the master of lockresA (Node3)
      will be: converting list head <-> lock_3 (NL->EX) <-> lock_1 (EX->NL).
      The requested mode of lock_3 is not compatible with the granted mode of
      lock_1, so lock_3 cannot be granted, and lock_1 cannot downconvert because
      the converting queue is strictly FIFO.  So a deadlock is created.
      We think dlm_process_recovery_data() should queue_ast for lock_1 or alter
      the order of lock_1 and lock_3, so dlm_thread can process lock_1 first.
      And if there are multiple downconverting locks, they must be converting
      from PR to NL, so there is no need to sort them.
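
      A sketch of the ordering idea in dlm_process_recovery_data(); the local
      names (ml for the migratable lock, newlock for the lock being recreated,
      queue for the target list) and the mode comparison are assumptions about
      the implementation, not the literal upstream code.

          /* "downconvert" here means the requested mode is weaker than the
           * currently granted mode, e.g. EX->NL or PR->NL */
          if (j == DLM_CONVERTING_LIST && ml->convert_type < ml->type)
                  /* a downconvert can always be granted, so it must not be
                   * stuck behind an incompatible upconvert such as lock_3 */
                  list_add(&newlock->list, queue);        /* head of converting list */
          else
                  list_add_tail(&newlock->list, queue);   /* normal FIFO placement */
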
      Signed-off-by: joyce.xue <xuejiufei@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6718cb5e
  3. 04 Apr 2014, 2 commits
    • ocfs2: dlm: fix recovery hung · ded2cf71
      Authored by Junxiao Bi
      There is a race window in dlm_do_recovery() between dlm_remaster_locks()
      and dlm_reset_recovery() when the recovery master has nearly finished the
      recovery process for a dead node.  After the master sends the FINALIZE_RECO
      message in dlm_remaster_locks(), another node may become the recovery
      master for another dead node and send the BEGIN_RECO message to all nodes,
      including the old master.  In the old master's handler for this message,
      dlm_begin_reco_handler(), dlm->reco.dead_node and dlm->reco.new_master are
      set to the second dead node and the new master; then dlm_reset_recovery()
      resets these two variables to their default values.  As a result the new
      recovery master cannot finish the recovery process and hangs, and in the
      end the whole cluster hangs in recovery.
      
      old recovery master:                                 new recovery master:
      dlm_remaster_locks()
                                                        become recovery master for
                                                        another dead node.
                                                        dlm_send_begin_reco_message()
      dlm_begin_reco_handler()
      {
       if (dlm->reco.state & DLM_RECO_STATE_FINALIZE) {
        return -EAGAIN;
       }
       dlm_set_reco_master(dlm, br->node_idx);
       dlm_set_reco_dead_node(dlm, br->dead_node);
      }
      dlm_reset_recovery()
      {
       dlm_set_reco_dead_node(dlm, O2NM_INVALID_NODE_NUM);
       dlm_set_reco_master(dlm, O2NM_INVALID_NODE_NUM);
      }
                                                         will hang in dlm_remaster_locks(),
                                                         waiting for the requested dlm lock info
      
      Before sending the FINALIZE_RECO message, the recovery master should set
      DLM_RECO_STATE_FINALIZE for itself and clear it after the recovery is done.
      This closes the race window, because BEGIN_RECO messages are not handled
      until the DLM_RECO_STATE_FINALIZE flag is cleared.

      A similar race may happen between the new recovery master and a normal node
      that is in dlm_finalize_reco_handler(); fix that as well.
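
      A sketch of the sender-side change in dlm_remaster_locks(), assuming the
      existing DLM_RECO_STATE_FINALIZE flag and dlm->spinlock protection; the
      exact placement in the real function may differ.

          spin_lock(&dlm->spinlock);
          dlm->reco.state |= DLM_RECO_STATE_FINALIZE;   /* BEGIN_RECO handlers now return -EAGAIN */
          spin_unlock(&dlm->spinlock);

          ret = dlm_send_finalize_reco_message(dlm);
          /* ... */
          dlm_reset_recovery(dlm);      /* reset while the flag still blocks BEGIN_RECO */

          spin_lock(&dlm->spinlock);
          dlm->reco.state &= ~DLM_RECO_STATE_FINALIZE;  /* now a new recovery may begin */
          spin_unlock(&dlm->spinlock);
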
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
      Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ded2cf71
    • ocfs2: dlm: fix lock migration crash · 34aa8dac
      Authored by Junxiao Bi
      This issue was introduced by commit 800deef3 ("ocfs2: use
      list_for_each_entry where benefical") in 2007, which replaced
      list_for_each with list_for_each_entry.  The variable "lock" will point
      to invalid data if the "tmpq" list is empty, and a panic will be triggered
      because of this.  Sunil advised reverting it, but the old version was not
      right either: at the end of the outer for loop, list_for_each_entry also
      leaves "lock" pointing at invalid data, so in the next iteration, if the
      "tmpq" list is empty, "lock" holds stale invalid data and causes the panic.
      So revert back to list_for_each and reset "lock" to NULL to fix this issue.
      
      Another concern is that this seems like it cannot happen, because the "tmpq"
      list should not be empty.  Let me describe how it can.
      
      old lock resource owner (node 1):                                 migration target (node 2):
      imagine there's a lockres with an EX lock from node 2 in the
      granted list, and an NR lock from node x with convert_type
      EX in the converting list.
      dlm_empty_lockres() {
       dlm_pick_migration_target() {
         pick node 2 as target as its lock is the first one
         in granted list.
       }
       dlm_migrate_lockres() {
         dlm_mark_lockres_migrating() {
           res->state |= DLM_LOCK_RES_BLOCK_DIRTY;
           wait_event(dlm->ast_wq, !dlm_lockres_is_dirty(dlm, res));
      	 //after the above code, we can not dirty lockres any more,
           // so dlm_thread shuffle list will not run
                                                                         downconvert lock from EX to NR
                                                                         upconvert lock from NR to EX
      <<< migration may schedule out here, then
      <<< node 2 send down convert request to convert type from EX to
      <<< NR, then send up convert request to convert type from NR to
      <<< EX, at this time, lockres granted list is empty, and two locks
      <<< in the converting list, node x up convert lock followed by
      <<< node 2 up convert lock.
      
      	 // will set lockres RES_MIGRATING flag, the following
      	 // lock/unlock can not run
           dlm_lockres_release_ast(dlm, res);
         }
      
         dlm_send_one_lockres()
                                                                       dlm_process_recovery_data()
                                                                         for (i=0; i<mres->num_locks; i++)
                                                                           if (ml->node == dlm->node_num)
                                                                             for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) {
                                                                              list_for_each_entry(lock, tmpq, list)
                                                                              if (lock) break; <<< lock is invalid as grant list is empty.
                                                                             }
                                                                             if (lock->ml.node != ml->node)
                                                                               BUG() >>> crash here
       }
      
      I saw the above lock status in a vmcore from one of our internal bug reports.
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
      Cc: Sunil Mushran <sunil.mushran@gmail.com>
      Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      34aa8dac
  4. 13 Nov 2013, 1 commit
  5. 12 Sep 2013, 2 commits
    • ocfs2/dlm: force clean refmap when doing local cleanup · 69b2bd16
      Authored by Xue jiufei
      dlm_do_local_recovery_cleanup() should force-clean the refmap if the owner
      of the lockres is UNKNOWN.  Otherwise the node may hang when unmounting the
      filesystem.  Here's the situation:
      
      	Node1                                    Node2
      dlmlock()
        -> dlm_get_lock_resource()
      send DLM_MASTER_REQUEST_MSG to
      other nodes.
      
                                             trying to master this lockres,
                                             return MAYBE.
      
      selected as the master of lockresA,
      set mle->master to Node1,
      and do assert_master,
      send DLM_ASSERT_MASTER_MSG to Node2.
                                             Node 2 has interest on lockresA
                                             and return
                                             DLM_ASSERT_RESPONSE_MASTERY_REF
                                             then something happened and
                                             Node2 crashed.
      
      On receiving DLM_ASSERT_RESPONSE_MASTERY_REF, Node1 sets Node2 in the refmap
      and keeps sending DLM_ASSERT_MASTER_MSG to other nodes.

      o2hb finds Node2 down and calls dlm_hb_node_down() -->
      dlm_do_local_recovery_cleanup().  The master of lockresA is still UNKNOWN,
      so dlm_free_dead_locks() is not called.

      The master of lockresA is then set to Node1, but Node2 still remains in the
      refmap.

      When Node1 unmounts, it finds that the refmap of lockresA is not empty and
      attempts to migrate it to Node2.  But Node2 is already down, so the unmount
      hangs, trying to migrate lockresA again and again.
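
      A sketch of the forced cleanup in dlm_do_local_recovery_cleanup();
      dlm_lockres_clear_refmap_bit() is the existing helper, but the exact branch
      structure here is an assumption.

          /* master still unknown: the dead node can never take this lockres
           * over, so drop its reference unconditionally */
          if (res->owner == DLM_LOCK_RES_OWNER_UNKNOWN)
                  dlm_lockres_clear_refmap_bit(dlm, res, dead_node);
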
      Signed-off-by: joyce <xuejiufei@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Jie Liu <jeff.liu@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      69b2bd16
    • ocfs2: dlm_request_all_locks() should deal with the status sent from target node · 98ac9125
      Authored by Xue jiufei
      dlm_request_all_locks() should deal with the status sent back from the
      target node when DLM_LOCK_REQUEST_MSG is sent successfully; otherwise the
      recovery master falls into an endless loop, waiting for the other node to
      send its locks and DLM_RECO_DATA_DONE_MSG.
      
              NodeA                                  NodeB
                                           selected as recovery master
                                           dlm_remaster_locks()
                                           ->dlm_request_all_locks()
                                           send DLM_LOCK_REQUEST_MSG to nodeA
      
      It happened that NodeA could not allocate memory while processing this
      message.  dlm_request_all_locks_handler() does not queue
      dlm_request_all_locks_worker and returns -ENOMEM, so NodeA will never send
      its locks and DLM_RECO_DATA_DONE_MSG to NodeB.
      
                                          NodeB does not deal with the status
                                          sent from NodeA, and falls into an
                                          endless loop waiting for the
                                          recovery state of NodeA to change.
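
      A sketch of the fix in dlm_request_all_locks(), assuming the usual
      o2net_send_message() calling convention where the handler's return value
      comes back through the final status argument; variable names follow my
      reading of that function.

          ret = o2net_send_message(DLM_LOCK_REQUEST_MSG, dlm->key,
                                   &lr, sizeof(lr), request_from, &status);
          if (ret < 0)
                  mlog(ML_ERROR, "error %d sending lock request to node %u\n",
                       ret, request_from);
          else
                  /* the handler ran but may have failed (e.g. -ENOMEM);
                   * propagate that instead of waiting forever for locks
                   * that will never arrive */
                  ret = status;
          return ret;
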
      Signed-off-by: joyce <xuejiufei@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Jeff Liu <jeff.liu@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      98ac9125
  6. 04 Jul 2013, 3 commits
  7. 13 Jun 2013, 1 commit
  8. 30 Apr 2013, 1 commit
  9. 28 Feb 2013, 1 commit
    • hlist: drop the node parameter from iterators · b67bfe0d
      Authored by Sasha Levin
      I'm not sure why, but while the list for-each-entry iterator was conceived
      as

              list_for_each_entry(pos, head, member)

      the hlist ones were greedy and wanted an extra parameter:

              hlist_for_each_entry(tpos, pos, head, member)

      Why did they need an extra pos parameter?  I'm not quite sure.  Not only do
      they not really need it, it also prevents the iterator from looking exactly
      like the list iterator, which is unfortunate.
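
      A small before/after illustration using the o2dlm lockres hash as an
      example (the bucket and member names here are illustrative):

          /* before: the iterator needed a separate struct hlist_node cursor */
          struct dlm_lock_resource *res;
          struct hlist_node *node;

          hlist_for_each_entry(res, node, bucket, hash_node) {
                  /* ... use res ... */
          }

          /* after: the cursor is gone and it reads like the list iterator */
          hlist_for_each_entry(res, bucket, hash_node) {
                  /* ... use res ... */
          }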
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small number of places were using the 'node' parameter; these
       were modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foundation.org: redo intrusive kvm changes]
      Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b67bfe0d
  10. 25 Jul 2011, 4 commits
  11. 26 May 2011, 1 commit
    • ocfs2/dlm: Add new dlm message DLM_BEGIN_EXIT_DOMAIN_MSG · bddefdee
      Authored by Sunil Mushran
      This patch adds a new dlm message DLM_BEGIN_EXIT_DOMAIN_MSG and ups the dlm
      protocol to 1.2.
      
      o2dlm sends this new message in dlm_unregister_domain() to mark the beginning
      of the exit domain. This message is sent to all nodes in the domain.
      
      Currently o2dlm has no way of informing other nodes of its impending exit.
      This information is useful as the other nodes could disregard the exiting
      node in certain operations. For example, in resource migration. If two or
      more nodes were umounting in parallel, it would be more efficient if o2dlm
      were to choose a non-exiting node to be the new master node rather than an
      exiting one.
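
      A sketch of how the new message might be broadcast from
      dlm_unregister_domain(); the helper name dlm_send_one_begin_exit_domain()
      and the protocol-version guard are assumptions about the implementation,
      not the literal code.

          /* only nodes speaking dlm protocol >= 1.2 understand the new message */
          if (dlm->dlm_locking_proto.pv_minor < 2)
                  return;

          spin_lock(&dlm->spinlock);
          for (node = 0; node < O2NM_MAX_NODES; node++) {
                  if (node == dlm->node_num || !test_bit(node, dlm->domain_map))
                          continue;
                  spin_unlock(&dlm->spinlock);
                  /* best effort: an error only means that node won't know early */
                  dlm_send_one_begin_exit_domain(dlm, node);
                  spin_lock(&dlm->spinlock);
          }
          spin_unlock(&dlm->spinlock);
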
      Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <jlbec@evilplan.org>
      bddefdee
  12. 07 Mar 2011, 1 commit
    • ocfs2: Remove EXIT from masklog. · c1e8d35e
      Authored by Tao Ma
      mlog_exit is used to record the exit status of a function.  But because it
      is added to so many functions, enabling it fills the system logs quickly
      and causes too much I/O, so in practice nobody can turn it on for a
      production system or even for a test.

      This patch removes it or changes it:
      1. If all the error paths already use mlog_errno, it is simply removed;
         otherwise it is replaced by mlog_errno.
      2. If it is used to print some return value, it is replaced with
         mlog(0, ...).
      mlog_exit_ptr is changed to mlog(0, ...).
      All these mlog(0, ...) calls will be replaced with trace events later.
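
      A before/after illustration of the two cases described above (the
      surrounding function and the status variable are made up for the example):

          /* before */
          if (status < 0)
                  mlog_errno(status);
          mlog_exit(status);
          return status;

          /* after, case 1: the error path already logs, so mlog_exit() is dropped */
          if (status < 0)
                  mlog_errno(status);
          return status;

          /* after, case 2: the return value is still worth recording */
          mlog(0, "returning %d\n", status);
          return status;
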
      Signed-off-by: Tao Ma <boyu.mt@taobao.com>
      c1e8d35e
  13. 21 Feb 2011, 1 commit
    • ocfs2: Remove ENTRY from masklog. · ef6b689b
      Authored by Tao Ma
      ENTRY is used to record the entry of a function.  But because it is added
      to so many functions, enabling it fills the system logs quickly and causes
      too much I/O, so in practice nobody can turn it on for a production system
      or even for a test.

      So mlog_entry_void is simply removed, and mlog_entry(...) is replaced with
      mlog(0, ...); these will be replaced by trace events later.
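
      The corresponding before/after for the entry markers (the argument is an
      illustrative example, not taken from a specific function):

          /* before */
          mlog_entry("(blkno %llu)\n", (unsigned long long)blkno);

          /* after: keep the useful arguments, now as a plain mlog */
          mlog(0, "(blkno %llu)\n", (unsigned long long)blkno);

          /* a bare mlog_entry_void() is simply removed */
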
      Signed-off-by: Tao Ma <boyu.mt@taobao.com>
      ef6b689b
  14. 08 Aug 2010, 1 commit
    • ocfs2/dlm: avoid incorrect bit set in refmap on recovery master · a524812b
      Authored by Wengang Wang
      In the following situation, an incorrect bit remains set in the refmap on
      the recovery master.  The recovery master will eventually fail to purge the
      lockres because of that incorrect bit.

      1) node A has no interest in lockres A any longer, so it is purging it.
      2) the owner of lockres A is node B, so node A is sending a deref message
      to node B.
      3) at this time, node B crashes.  node C becomes the recovery master and
      recovers lockres A (because the master is the dead node B).
      4) node A migrates lockres A to node C with a refbit there.
      5) node A fails to send the deref message to node B because it crashed.  The
      failure is ignored, and no other action is taken for lockres A any more.

      Normally, re-sending the deref message, now to the recovery master, would
      fix this.  But ignoring the failed deref to the original master and simply
      not recovering the lockres to the recovery master has the same effect, and
      the latter is simpler.
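
      A sketch of the "don't recover it at all" approach: while picking locks to
      hand to the recovery master, a lockres that is already dropping its ref to
      the dead master is skipped, so no stale refbit can reach the recovery
      master.  DLM_LOCK_RES_DROPPING_REF is the existing state flag; the exact
      call site is an assumption.

          /* in the loop that selects lockres to recover for the dead node */
          if (res->state & DLM_LOCK_RES_DROPPING_REF) {
                  /* we were already de-refing this lockres when the master
                   * died; just let the purge finish instead of recovering it */
                  continue;
          }
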
      Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
      Acked-by: Srinivas Eeda <srinivas.eeda@oracle.com>
      Cc: stable@kernel.org
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      a524812b
  15. 13 Jul 2010, 1 commit
  16. 06 May 2010, 1 commit
  17. 27 Feb 2010, 1 commit
  18. 04 Feb 2010, 1 commit
  19. 03 Feb 2010, 1 commit
  20. 26 Jan 2010, 3 commits
  21. 03 Dec 2009, 1 commit
  22. 24 Sep 2009, 1 commit
  23. 09 Jul 2009, 1 commit
  24. 11 Mar 2008, 4 commits
  25. 26 Jan 2008, 2 commits
    • ocfs2/dlm: Clear joining_node on heartbeat node down · 2d4b1cbb
      Authored by Tao Ma
      Currently the dlm join process has 2 steps: query join and assert join.
      After query join, the joined node sets its joining_node.  So if the joining
      node happens to panic before the 2nd step, the joined node fails to clear
      its joining_node flag because that node isn't in the domain map.  This
      causes at least 2 problems:
      1. All new join requests fail, so no new node can mount the volume.
      2. The joined node can't umount the volume, since during the umount process
         it has to wait for the joining_node to become unknown, so the umount hangs.

      The solution is to clear the joining_node before we check the domain map.
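
      A sketch of the idea in the heartbeat node-down path, using the existing
      __dlm_set_joining_node() helper; the exact call site and surrounding
      locking are assumptions.

          /* clear the joining node first, even though it never made it into
           * the domain map, so later joins and the umount path don't wait on it */
          if (idx == dlm->joining_node)
                  __dlm_set_joining_node(dlm, DLM_LOCK_RES_OWNER_UNKNOWN);

          if (!test_bit(idx, dlm->domain_map)) {
                  spin_unlock(&dlm->spinlock);
                  return;          /* node was not (yet) part of the domain */
          }
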
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
      2d4b1cbb
    • ocfs2_dlm: Call node eviction callbacks from heartbeat handler · 6561168c
      Authored by Mark Fasheh
      With this, a dlm client can take advantage of the group protocol in the dlm
      to get full notification whenever a node within the dlm domain leaves
      unexpectedly.
      Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
      6561168c
  26. 20 Oct 2007, 1 commit
  27. 11 Jul 2007, 1 commit