1. 18 Feb 2021, 1 commit
    • gfs2: Allow node-wide exclusive glock sharing · 06e908cd
      Bob Peterson authored
      Introduce a new LM_FLAG_NODE_SCOPE glock holder flag: when taking a
      glock in LM_ST_EXCLUSIVE (EX) mode and with the LM_FLAG_NODE_SCOPE flag
      set, the exclusive lock is shared among all local processes that are
      holding the glock in EX mode and have the LM_FLAG_NODE_SCOPE flag set.
      From the point of view of other nodes, the lock is still held
      exclusively.
      
      A future patch will start using this flag to improve performance with
      rgrp sharing.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      06e908cd
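      As a hedged illustration of what the flag means for holder compatibility
      (the helper name is hypothetical; gh_state, gh_flags, LM_ST_EXCLUSIVE and
      LM_FLAG_NODE_SCOPE are the identifiers from the patch):

          /* Hypothetical helper: two EX holders may share the glock locally
           * when both carry LM_FLAG_NODE_SCOPE. */
          static bool holders_may_share_ex(const struct gfs2_holder *gh_a,
                                           const struct gfs2_holder *gh_b)
          {
                  return gh_a->gh_state == LM_ST_EXCLUSIVE &&
                         gh_b->gh_state == LM_ST_EXCLUSIVE &&
                         (gh_a->gh_flags & LM_FLAG_NODE_SCOPE) &&
                         (gh_b->gh_flags & LM_FLAG_NODE_SCOPE);
          }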
  2. 01 Dec 2020, 1 commit
  3. 25 Nov 2020, 1 commit
    • gfs2: set lockdep subclass for iopen glocks · 515b269d
      Alexander Aring authored
      This patch introduces a new glops attribute to define the subclass of the
      glock lockref spinlock. This avoids the following lockdep warning, which
      occurs when we lock an inode glock while an iopen glock is held:
      
      ============================================
      WARNING: possible recursive locking detected
      5.10.0-rc3+ #4990 Not tainted
      --------------------------------------------
      kworker/0:1/12 is trying to acquire lock:
      ffff9067d45672d8 (&gl->gl_lockref.lock){+.+.}-{3:3}, at: lockref_get+0x9/0x20
      
      but task is already holding lock:
      ffff9067da308588 (&gl->gl_lockref.lock){+.+.}-{3:3}, at: delete_work_func+0x164/0x260
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(&gl->gl_lockref.lock);
        lock(&gl->gl_lockref.lock);
      
       *** DEADLOCK ***
      
       May be due to missing lock nesting notation
      
      3 locks held by kworker/0:1/12:
       #0: ffff9067c1bfdd38 ((wq_completion)delete_workqueue){+.+.}-{0:0}, at: process_one_work+0x1b7/0x540
       #1: ffffac594006be70 ((work_completion)(&(&gl->gl_delete)->work)){+.+.}-{0:0}, at: process_one_work+0x1b7/0x540
       #2: ffff9067da308588 (&gl->gl_lockref.lock){+.+.}-{3:3}, at: delete_work_func+0x164/0x260
      
      stack backtrace:
      CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.10.0-rc3+ #4990
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
      Workqueue: delete_workqueue delete_work_func
      Call Trace:
       dump_stack+0x8b/0xb0
       __lock_acquire.cold+0x19e/0x2e3
       lock_acquire+0x150/0x410
       ? lockref_get+0x9/0x20
       _raw_spin_lock+0x27/0x40
       ? lockref_get+0x9/0x20
       lockref_get+0x9/0x20
       delete_work_func+0x188/0x260
       process_one_work+0x237/0x540
       worker_thread+0x4d/0x3b0
       ? process_one_work+0x540/0x540
       kthread+0x127/0x140
       ? __kthread_bind_mask+0x60/0x60
       ret_from_fork+0x22/0x30
      Suggested-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Alexander Aring <aahringo@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      515b269d
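      The fix boils down to giving the iopen glock's lockref spinlock its own
      lockdep subclass so the two spinlocks no longer share a lock class. A
      minimal sketch, assuming the new glops field is named go_subclass as in
      the commit (the exact wiring in gfs2_glock_get may differ):

          /* After creating the glock: iopen glops carry a nonzero
           * go_subclass, so their lockref spinlock nests under a
           * different lockdep subclass than inode glocks. */
          lockdep_set_subclass(&gl->gl_lockref.lock, glops->go_subclass);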
  4. 03 Nov 2020, 1 commit
  5. 21 Oct 2020, 2 commits
  6. 15 Oct 2020, 3 commits
  7. 03 Aug 2020, 2 commits
  8. 30 Jun 2020, 1 commit
    • gfs2: Don't sleep during glock hash walk · 34244d71
      Andreas Gruenbacher authored
      In flush_delete_work, instead of flushing each individual pending
      delayed work item, cancel and re-queue them for immediate execution.
      The waiting isn't needed here because we're already waiting for all
      queued work items to complete in gfs2_flush_delete_work.  This makes the
      code more efficient, but more importantly, it avoids sleeping during a
      rhashtable walk, inside rcu_read_lock().
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      34244d71
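      A sketch of the idea, assuming the per-glock delayed work item is
      gl->gl_delete and a gfs2 delete workqueue exists (names as in the gfs2
      sources of that era):

          /* Instead of flush_delayed_work(&gl->gl_delete), which can
           * sleep inside the rcu-protected rhashtable walk: cancel the
           * pending timer and requeue for immediate execution. */
          if (cancel_delayed_work(&gl->gl_delete))
                  queue_delayed_work(gfs2_delete_workqueue, &gl->gl_delete, 0);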
  9. 06 Jun 2020, 8 commits
  10. 05 Jun 2020, 2 commits
  11. 09 May 2020, 2 commits
    • Revert "gfs2: Don't demote a glock until its revokes are written" · b14c9490
      Bob Peterson authored
      This reverts commit df5db5f9.
      
      This patch fixes a regression: patch df5db5f9 allowed function
      run_queue() to bypass its call to do_xmote() if revokes were queued for
      the glock. That's wrong because its call to do_xmote() is what is
      responsible for calling the go_sync() glops functions to sync both
      the ail list and any revokes queued for it. By bypassing the call,
      gfs2 could get into a stand-off where the glock could not be demoted
      until its revokes are written back, but the revokes would not be
      written back because do_xmote() was never called.
      
      It "sort of" works, however, because there are other mechanisms like
      the log flush daemon (logd) that can sync the ail items and revokes,
      if it deems it necessary. The problem is: without file system pressure,
      it might never deem it necessary.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      b14c9490
    • gfs2: If go_sync returns error, withdraw but skip invalidate · b11e1a84
      Bob Peterson authored
      Before this patch, if the go_sync operation returned an error during
      the do_xmote process (such as unable to sync metadata to the journal)
      the code did goto out. That kept the glock locked, so it could not be
      given away, which correctly avoids file system corruption. However,
      it never set the withdraw bit or requeued the glock work. So it would
      hang forever, unable to ever demote the glock.
      
      This patch changes the goto to a new label, skip_inval, so that errors
      from go_sync are treated the same way as errors from go_inval:
      The delayed withdraw bit is set and the work is requeued. That way,
      the logd should eventually figure out there's a problem and withdraw
      properly there.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      b11e1a84
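      A hedged sketch of the reworked error path in do_xmote (control flow
      only; gfs2_withdraw_delayed sets the delayed withdraw bit, and the
      requeue of the glock work is paraphrased):

          ret = glops->go_sync(gl);
          if (ret) {
                  /* Treat a sync failure like a go_inval failure: flag a
                   * delayed withdraw and requeue the glock work instead
                   * of leaving the glock stuck forever. */
                  gfs2_withdraw_delayed(sdp);
                  goto skip_inval;
          }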
  12. 08 May 2020, 1 commit
    • gfs2: Fix error exit in do_xmote · a8b7528b
      Bob Peterson authored
      Before this patch, if an error was detected from glock function go_sync
      by function do_xmote, it would return.  But the function had temporarily
      unlocked the gl_lockref spin_lock, and it never re-locked it.  When the
      caller of do_xmote tried to unlock it again, it was already unlocked,
      which resulted in a corrupted spin_lock value.
      
      This patch makes sure the gl_lockref spin_lock is re-locked after it is
      unlocked.
      
      Thanks to Wu Bo <wubo40@huawei.com> for reporting this problem.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      a8b7528b
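      A minimal sketch of the fix shape: do_xmote runs with the lockref
      spinlock held and drops it around the sync, so every exit path must
      re-take it for the caller:

          spin_unlock(&gl->gl_lockref.lock);
          ret = glops->go_sync(gl);
          if (ret) {
                  spin_lock(&gl->gl_lockref.lock);  /* was missing before
                                                     * this fix */
                  return;
          }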
  13. 28 Mar 2020, 1 commit
  14. 27 Feb 2020, 5 commits
    • gfs2: Do proper error checking for go_sync family of glops functions · 1c634f94
      Bob Peterson authored
      Before this patch, function do_xmote would try to sync out the glock
      dirty data by calling the appropriate glops function XXX_go_sync()
      but it did not check the return code. If the sync failed due to an
      I/O error or the like, do_xmote would continue on
      and call go_inval and release the glock to other cluster nodes.
      When those nodes go to replay the journal, they may already be holding
      glocks for the journal records that should have been synced, but were
      not due to the ignored error.
      
      This patch introduces proper error code checking to the go_sync
      family of glops functions.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
      1c634f94
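      The change is essentially a signature change plus propagation: the
      go_sync glops now return int instead of void, and do_xmote acts on the
      result. A sketch with a hypothetical sync helper:

          /* Before: static void rgrp_go_sync(struct gfs2_glock *gl); */
          static int rgrp_go_sync(struct gfs2_glock *gl)
          {
                  int error = sync_rgrp_metadata(gl);  /* hypothetical */

                  return error;  /* do_xmote now checks this */
          }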
    • gfs2: Don't demote a glock until its revokes are written · df5db5f9
      Bob Peterson authored
      Before this patch, run_queue would demote glocks based on whether
      there were any more holders. But if the glock has pending revokes that
      haven't been written to the media, giving up the glock might end in
      file system corruption if the revokes never get written due to
      I/O errors, node crashes and fences, etc. In that case, another node
      will replay the metadata blocks associated with the glock, but
      because the revoke was never written, it could replay that block
      even though the glock had since been granted to another node who
      might have made changes.
      
      This patch changes the logic in run_queue so that it never demotes
      a glock until its count of pending revokes reaches zero.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
      df5db5f9
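      The gist of the run_queue change, assuming the pending-revoke count is
      the glock's gl_revokes field (a real field at the time):

          /* Don't give the glock away while revokes for it are still
           * waiting to reach the journal. */
          if (atomic_read(&gl->gl_revokes))
                  return;  /* not safe to demote yet */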
    • gfs2: Check for log write errors before telling dlm to unlock · d93ae386
      Bob Peterson authored
      Before this patch, function do_xmote just assumed all the writes
      submitted to the journal were finished and successful, and it
      called the go_unlock function to release the dlm lock. But if
      they're not, and a revoke failed to make its way to the journal,
      a journal replay on another node will cause corruption if we
      let the go_inval function continue and tell dlm to release the
      glock to another node. This patch adds a couple of checks for errors
      in do_xmote after the calls to go_sync and go_inval. If an error
      is found, we cannot withdraw yet, because the withdraw itself
      uses glocks to make the file system read-only. Instead, we flag
      the error. Later, asserts should cause another node to replay
      the journal before continuing, thus protecting rgrp and dinode
      glocks and maintaining the integrity of the metadata. Note that
      we only need to do this for journaled glocks. System glocks
      should be able to progress even under withdrawn conditions.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
      d93ae386
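      A hedged sketch of the added checks (the "journaled glock" predicate is
      hypothetical; the exact tests in do_xmote differ):

          /* After go_sync/go_inval: if log writes failed, don't tell dlm
           * to unlock journaled glocks; flag the error and let asserts
           * force a journal replay on another node first. */
          if (sdp->sd_log_error && glock_needs_journal(gl))  /* hypothetical */
                  return;  /* keep the glock; skip the lm_lock call */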
    • gfs2: fix infinite loop when checking ail item count before go_inval · 33dbd1e4
      Bob Peterson authored
      Before this patch, the rgrp_go_inval and inode_go_inval functions each
      checked whether any items were left on the ail list (by way of a
      count) and, if so, did a withdraw. But the withdraw code now uses
      glocks when changing the file system to read-only status. So we
      cannot have glock functions withdrawing, or a hang will likely result:
      The glocks can't be serviced by the work_func if the work_func is
      busy doing its own withdraw.
      
      This patch removes the checks from the go_inval functions and adds
      a centralized check in do_xmote to warn about the problem and not
      withdraw, but flag the error so that it's caught later when the logd
      daemon runs.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
      33dbd1e4
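      A sketch of the centralized check (gl_ail_count is the real per-glock
      AIL item count; the flag-only handling follows the commit text):

          /* In do_xmote, before go_inval: never withdraw from glock
           * context; warn and flag the problem for logd instead. */
          if (atomic_read(&gl->gl_ail_count)) {
                  gfs2_assert_warn(sdp, !atomic_read(&gl->gl_ail_count));
                  gfs2_withdraw_delayed(sdp);  /* flag only, no withdraw */
          }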
    • gfs2: Force withdraw to replay journals and wait for it to finish · 601ef0d5
      Bob Peterson authored
      When a node withdraws from a file system, it often leaves its journal
      in an incomplete state. This is especially true when the withdraw is
      caused by io errors writing to the journal. Before this patch, a
      withdraw would try to write a "shutdown" record to the journal, tell
      dlm it's done with the file system, and none of the other nodes
      would know about the problem. Later, when the problem is fixed and the
      withdrawn node is rebooted, it would then discover that its own
      journal was incomplete, and replay it. However, replaying it at this
      point is almost guaranteed to introduce corruption because the other
      nodes are likely to have used affected resource groups that appeared
      in the journal since the time of the withdraw. Replaying the journal
      later will overwrite any changes made, and not through any fault of
      dlm, which was instructed during the withdraw to release those
      resources.
      
      This patch makes file system withdraws visible to the entire cluster.
      Withdrawing nodes dequeue their journal glock to allow recovery.
      
      The remaining nodes check all the journals to see if they are
      clean or in need of replay. They try to replay dirty journals, but
      only the journals of withdrawn nodes will be "not busy" and
      therefore available for replay.
      
      Until the journal replay is complete, no i/o related glocks may be
      given out, to ensure that the replay does not cause the
      aforementioned corruption: We cannot allow any journal replay to
      overwrite blocks associated with a glock once it is held.
      
      The "live" glock which is now used to signal when a withdraw
      occurs. When a withdraw occurs, the node signals its withdraw by
      dequeueing the "live" glock and trying to enqueue it in EX mode,
      thus forcing the other nodes to all see a demote request, by way
      of a "1CB" (one callback) try lock. The "live" glock is not
      granted in EX; the callback is only just used to indicate a
      withdraw has occurred.
      
      Note that all nodes in the cluster must wait for the recovering
      node to finish replaying the withdrawing node's journal before
      continuing. To this end, it checks that the journals are clean
      multiple times in a retry loop.
      
      Also note that the withdraw function may be called from a wide
      variety of situations, and therefore, we need to take extra
      precautions to make sure pointers are valid before using them in
      many circumstances.
      
      We also need to take care when glocks decide to withdraw, since
      the withdraw code now uses glocks.
      
      Also, before this patch, if a process encountered an error and
      decided to withdraw, if another process was already withdrawing,
      the second withdraw would be silently ignored, which set it free
      to unlock its glocks. That's correct behavior if the original
      withdrawer encounters further errors down the road. But if
      secondary waiters don't wait for the journal replay, unlocking
      glocks will allow other nodes to use them, despite the fact that
      the journal containing those blocks is being replayed. The
      replay needs to finish before our glocks are released to other
      nodes. IOW, secondary withdraws need to wait for the first
      withdraw to finish.
      
      For example, if an rgrp glock is unlocked by a process that didn't
      wait for the first withdraw, a journal replay could introduce file
      system corruption by replaying a rgrp block that has already been
      granted to a different cluster node.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      601ef0d5
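      A hedged sketch of the withdraw signal, using the file system's real
      sd_live_gh holder (error handling and retries omitted):

          /* Signal the withdraw cluster-wide: drop our hold on the
           * "live" glock and re-request it in EX with a one-callback
           * try lock, so every other node sees a demote request. */
          gfs2_glock_dq_wait(&sdp->sd_live_gh);
          gfs2_holder_reinit(LM_ST_EXCLUSIVE,
                             LM_FLAG_TRY_1CB | LM_FLAG_NOEXP,
                             &sdp->sd_live_gh);
          gfs2_glock_nq(&sdp->sd_live_gh);  /* EX is never granted */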
  15. 21 Feb 2020, 1 commit
    • gfs2: Allow some glocks to be used during withdraw · a72d2401
      Bob Peterson authored
      We need to allow some glocks to be enqueued, dequeued, promoted, and demoted
      when we're withdrawn. For example, to maintain metadata integrity, we should
      disallow the use of inode and rgrp glocks when withdrawn. Other glocks, like
      iopen or the transaction glocks, may be safely used because none of their
      metadata goes through the journal. So in general, we should disallow all
      glocks with an address space, and allow all the others. One exception is:
      we need to allow our active journal to be demoted so others may recover it.
      
      Allowing glocks after withdraw gives us the ability to take appropriate
      action (in a following patch) to have our journal properly replayed by
      another node rather than just abandoning the current transactions and
      pretending nothing bad happened, leaving the other nodes free to modify
      the blocks we had in our journal, which may result in file system
      corruption.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      a72d2401
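      The rule of thumb from this text, as a hedged sketch (the helper name
      is hypothetical; gfs2_glock2aspace and sd_jinode_gl are real):

          /* A glock stays usable while withdrawn if it has no address
           * space (its metadata never goes through the journal), or if
           * it is our own journal glock, which must remain demotable so
           * another node can recover the journal. */
          static bool glock_usable_when_withdrawn(struct gfs2_glock *gl,
                                                  struct gfs2_sbd *sdp)
          {
                  return gfs2_glock2aspace(gl) == NULL ||
                         gl == sdp->sd_jinode_gl;
          }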
  16. 10 Feb 2020, 1 commit
    • gfs2: Rework how rgrp buffer_heads are managed · b3422cac
      Bob Peterson authored
      Before this patch, the rgrp code had a serious problem related to
      how it managed buffer_heads for resource groups. The problem caused
      file system corruption, especially in cases of journal replay.
      
      When an rgrp glock was demoted to transfer ownership to a
      different cluster node, do_xmote() first called rgrp_go_sync and then
      rgrp_go_inval, as expected. rgrp_go_sync called gfs2_rgrp_brelse(),
      which dropped the buffer_head reference count.
      In most cases, the reference count went to zero, which is right.
      However, there were other places where the buffers were handled
      differently.
      
      After rgrp_go_sync, do_xmote called rgrp_go_inval, which called
      gfs2_rgrp_brelse a second time; then rgrp_go_inval's call to
      truncate_inode_pages_range would get rid of the pages in memory,
      but only if the reference count dropped to 0.
      
      Unfortunately, gfs2_rgrp_brelse was setting bi->bi_bh = NULL.
      So when rgrp_go_sync called gfs2_rgrp_brelse, it lost the pointer
      to the buffer_heads in cases where the reference count was still 1.
      Therefore, when rgrp_go_inval called gfs2_rgrp_brelse a second time,
      it failed the check for "if (bi->bi_bh)" and thus failed to call
      brelse a second time. Because of that, the reference count on those
      buffers sometimes failed to drop from 1 to 0. And that caused
      function truncate_inode_pages_range to keep the pages in page cache
      rather than freeing them.
      
      The next time the rgrp glock was acquired, the metadata read of
      the rgrp buffers re-used the pages in memory, which were now
      wrong because they were likely modified by the other node that
      acquired the glock in EX (which is why we demoted the glock).
      This re-use of the page cache caused corruption because changes
      made by the other nodes were never seen, so the bitmaps were
      inaccurate.
      
      For some reason, the problem became most apparent when journal
      replay forced the replay of rgrps in memory, which caused newer
      rgrp data to be overwritten by the older in-core pages.
      
      A big part of the problem was that the rgrp buffers were released
      in multiple places: the go_unlock function would release them when
      the glock was released rather than when the glock was demoted,
      which is clearly wrong because our intent was to cache them until
      the glock is demoted from SH or EX.
      
      This patch attempts to clean up the mess and make one consistent
      and centralized mechanism for managing the rgrp buffer_heads by
      implementing several changes:
      
      1. It eliminates the call to gfs2_rgrp_brelse() from rgrp_go_sync.
         We don't want to release the buffers or zero the pointers when
         syncing for the reasons stated above. It only makes sense to
         release them when the glock is actually invalidated (go_inval).
         And when we do, then we set the bh pointers to NULL.
      2. The go_unlock function (which was only used for rgrps) is
         eliminated, as we've talked about doing many times before.
         The go_unlock function was called too early in the glock dq
         process, and should not happen until the glock is invalidated.
      3. It also eliminates the call to rgrp_brelse in gfs2_clear_rgrpd.
         That will now happen automatically when the rgrp glocks are
         demoted, and shouldn't happen any sooner or later than that.
         Instead, function gfs2_clear_rgrpd has been modified to demote
         the rgrp glocks, and therefore, free those pages, before the
         remaining glocks are culled by gfs2_gl_hash_clear. This
         prevents the gl_object from hanging around when the glocks are
         culled.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
      b3422cac
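      After the rework, the buffers are released in exactly one place. A
      hedged sketch of the resulting rgrp_go_inval shape (close to, but not
      verbatim, the kernel code):

          static void rgrp_go_inval(struct gfs2_glock *gl, int flags)
          {
                  struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
                  struct address_space *mapping = &sdp->sd_aspace;
                  struct gfs2_rgrpd *rgd = gfs2_glock2rgrp(gl);

                  if (!rgd)
                          return;
                  gfs2_rgrp_brelse(rgd);  /* drop bhs, NULL the pointers */
                  truncate_inode_pages_range(mapping, gl->gl_vm.start,
                                             gl->gl_vm.end);
                  rgd->rd_flags &= ~GFS2_RDF_UPTODATE;
          }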
  17. 20 Jan 2020, 1 commit
  18. 16 Nov 2019, 1 commit
    • gfs2: Close timing window with GLF_INVALIDATE_IN_PROGRESS · d99724c3
      Bob Peterson authored
      This patch closes a timing window in which two processes compete
      and overlap in the execution of do_xmote for the same glock:
      
                   Process A                              Process B
         ------------------------------------   -----------------------------
      1. Grabs gl_lockref and calls do_xmote
      2.                                        Grabs gl_lockref but is blocked
      3. Sets GLF_INVALIDATE_IN_PROGRESS
      4. Unlocks gl_lockref
      5.                                        Calls do_xmote
      6. Call glops->go_sync
      7. test_and_clear_bit GLF_DIRTY
      8. Call gfs2_log_flush                    Call glops->go_sync
      9. (slow IO, so it blocks a long time)    test_and_clear_bit GLF_DIRTY
                                                It's not dirty (step 7) returns
      10.                                       Tests GLF_INVALIDATE_IN_PROGRESS
      11.                                       Calls go_inval (rgrp_go_inval)
      12.                                       gfs2_rgrp_brelse does brelse
      13.                                       truncate_inode_pages_range
      14.                                       Calls lm_lock UN
      
      In step 14 we've just told dlm to give the glock to another node
      when, in fact, process A has not finished the IO, synced all
      buffer_heads to disk, and made sure their revokes are done.
      
      This patch fixes the problem by changing the GLF_INVALIDATE_IN_PROGRESS
      check to use test_and_set_bit, and if the bit is already set, process B just
      ignores it and trusts that process A will do the do_xmote in the proper
      order.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      d99724c3
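      The fix in sketch form (GLF_INVALIDATE_IN_PROGRESS and gl_flags are
      real; the early-return placement inside do_xmote is paraphrased):

          /* Only one process may run the sync/invalidate sequence. */
          if (test_and_set_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags))
                  return;  /* process B: trust process A's do_xmote */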
  19. 15 Nov 2019, 1 commit
  20. 05 Sep 2019, 2 commits
    • gfs2: Use async glocks for rename · ad26967b
      Bob Peterson authored
      Because s_vfs_rename_mutex is not cluster-wide, multiple nodes can
      reverse the roles of which directories are "old" and which are "new" for
      the purposes of rename. This can cause deadlocks where two nodes end up
      waiting for each other.
      
      There can be several layers of directory dependencies across many nodes.
      
      This patch fixes the problem by acquiring all of gfs2_rename's inode glocks
      asynchronously and waiting for all glocks to be acquired. That way all
      inodes are locked regardless of the order.
      
      The timeout value for multiple asynchronous glocks is calculated to be
      the total of the individual wait times for each glock times two.
      
      Since gfs2_exchange is very similar to gfs2_rename, both functions are
      patched in the same way.
      
      A new async glock wait queue, sd_async_glock_wait, keeps a list of
      waiters for these events. If gfs2's holder_wake function detects an
      async holder, it wakes up any waiters for the event. The waiter only
      tests whether any of its requests are still pending.
      
      Since the glocks are sent to dlm asynchronously, the wait function needs
      to check which glocks, if any, were granted.
      
      If a glock is granted by dlm (and therefore held), its minimum hold time
      is checked and adjusted as necessary, as other glock grants do.
      
      If the event times out, all glocks held thus far must be dequeued to
      resolve any existing deadlocks.  Then, if there are any outstanding
      locking requests, we need to loop around and wait for dlm to respond to
      those requests too.  After we release all requests, we return -ESTALE to
      the caller (vfs rename) which loops around and retries the request.
      
          Node1           Node2
          ---------       ---------
      1.  Enqueue A       Enqueue B
      2.  Enqueue B       Enqueue A
      3.  A granted
      6.                  B granted
      7.  Wait for B
      8.                  Wait for A
      9.                  A times out (since Node 1 holds A)
      10.                 Dequeue B (since it was granted)
      11.                 Wait for all requests from DLM
      12. B Granted (since Node2 released it in step 10)
      13. Rename
      14. Dequeue A
      15.                 DLM Grants A
      16.                 Dequeue A (due to the timeout and since we
                          no longer have B held for our task).
      17. Dequeue B
      18.                 Return -ESTALE to vfs
      19.                 VFS retries the operation, goto step 1.
      
      This release-all-locks / acquire-all-locks may slow rename / exchange
      down as both nodes struggle in the same way and do the same thing.
      However, this will only happen when there is contention for the same
      inodes, which ought to be rare.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      ad26967b
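      A hedged sketch of the pattern for two inodes (GL_ASYNC is a real
      holder flag; gfs2_glock_async_wait is the wait function this patch
      introduces, returning an error such as -ESTALE on the timeout path):

          struct gfs2_holder ghs[2];
          int error;

          /* Enqueue both glocks without blocking, then wait for both. */
          gfs2_holder_init(ip1->i_gl, LM_ST_EXCLUSIVE, GL_ASYNC, &ghs[0]);
          gfs2_holder_init(ip2->i_gl, LM_ST_EXCLUSIVE, GL_ASYNC, &ghs[1]);
          error = gfs2_glock_nq(&ghs[0]);
          if (!error)
                  error = gfs2_glock_nq(&ghs[1]);
          if (!error)
                  error = gfs2_glock_async_wait(2, ghs);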
    • gfs2: create function gfs2_glock_update_hold_time · 01123cf1
      Andreas Gruenbacher authored
      This patch moves the code that updates glock minimum hold
      time to a separate function. This will be called by a future
      patch.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      01123cf1
  21. 03 Sep 2019, 1 commit
  22. 28 Jun 2019, 1 commit
    • gfs2: dump fsid when dumping glock problems · 3792ce97
      Bob Peterson authored
      Before this patch, if a glock error was encountered, the glock with
      the problem was dumped. But you may have lots of file systems
      mounted, and the dump alone doesn't tell you which file system it was for.
      
      This patch adds a new boolean parameter fsid to the dump_glock family
      of functions. For non-error cases, such as dumping the glocks debugfs
      file, the fsid is not dumped in order to keep lock dumps and glocktop
      as clean as possible. For all error cases, such as GLOCK_BUG_ON, the
      file system id is now printed. This will make it easier to debug.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      3792ce97
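      The resulting interface, per the commit (fsid is printed only when the
      caller passes true, i.e. in error paths):

          /* Error paths such as GLOCK_BUG_ON pass fsid == true; the
           * debugfs glocks dump passes false to keep output clean. */
          void gfs2_dump_glock(struct seq_file *seq, struct gfs2_glock *gl,
                               bool fsid);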