1. 04 May, 2022 · 2 commits
  2. 23 October, 2021 · 2 commits
  3. 15 October, 2021 · 2 commits
  4. 07 October, 2020 · 8 commits
    • xfs: only relog deferred intent items if free space in the log gets low · 74f4d6a1
      Committed by Darrick J. Wong
      Now that we have the ability to ask the log how far the tail needs to be
      pushed to maintain its free space targets, augment the decision to relog
      an intent item so that we only do it if the log has hit the 75% full
      threshold.  There's no point in relogging an intent into the same
      checkpoint, and there's no need to relog if there's plenty of free space
      in the log.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
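The relog decision described in this commit can be sketched as a small predicate. This is a toy model in Python, not the kernel code: the function name, the checkpoint sequence arguments, and the `log_used_fraction` parameter are all assumptions made for illustration. Only the 75% threshold and the same-checkpoint rule come from the commit message.

```python
def should_relog(intent_checkpoint, current_checkpoint,
                 log_used_fraction, threshold=0.75):
    """Decide whether to relog an intent item (hypothetical helper).

    intent_checkpoint / current_checkpoint are illustrative sequence
    numbers; log_used_fraction is how full the log is, from 0.0 to 1.0.
    """
    # Relogging into the checkpoint that already holds the intent
    # would not move the log tail, so skip it.
    if intent_checkpoint == current_checkpoint:
        return False
    # With plenty of free log space there is no need to relog.
    return log_used_fraction >= threshold
```

Under these assumptions, an intent from an older checkpoint is relogged only once the log is at least three quarters full.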
    • xfs: periodically relog deferred intent items · 4e919af7
      Committed by Darrick J. Wong
      There's a subtle design flaw in the deferred log item code that can lead
      to pinning the log tail.  Taking up the defer ops chain examples from
      the previous commit, we can get trapped in sequences like this:
      
      Caller hands us a transaction t0 with D0-D3 attached.  The defer ops
      chain will look like the following if the transaction rolls succeed:
      
      t1: D0(t0), D1(t0), D2(t0), D3(t0)
      t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
      t3: d5(t1), D1(t0), D2(t0), D3(t0)
      ...
      t9: d9(t7), D3(t0)
      t10: D3(t0)
      t11: d10(t10), d11(t10)
      t12: d11(t10)
      
      In transaction 9, we finish d9 and try to roll to t10 while holding onto
      an intent item for D3 that we logged in t0.
      
      The previous commit changed the order in which we place new defer ops in
      the defer ops processing chain to reduce the maximum chain length.  Now
      make xfs_defer_finish_noroll capable of relogging the entire chain
      periodically so that we can always move the log tail forward.  Most
      chains will never get relogged, except for operations that generate very
      long chains (large extents containing many blocks with different sharing
      levels) or are on filesystems with small logs and a lot of ongoing
      metadata updates.
      
      Callers are now required to ensure that the transaction reservation is
      large enough to handle logging done items and new intent items for the
      maximum possible chain length.  Most callers are careful to keep the
      chain lengths low, so the overhead should be minimal.
      
      The decision to relog an intent item is made based on whether the intent
      was logged in a previous checkpoint, since there's no point in relogging
      an intent into the same checkpoint.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
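To make the tail-pinning effect concrete, here is a toy simulation of the first example chain (four original items, each queuing two simple children, new items placed at the head). It measures how far the oldest logged intent lags behind the current transaction, with and without relogging. The `relog_after` trigger, counted in transactions, is an assumption for illustration only; per the commit message, the real decision is based on checkpoints.

```python
def max_tail_lag(nr_original, children_per_item, relog_after=None):
    """Return the worst (current txn - oldest logged intent txn) seen."""
    # Each pending item is [name, children it queues, txn it was logged in].
    pending = [[f"D{i}", children_per_item, 0] for i in range(nr_original)]
    txn = 1
    next_id = nr_original
    worst = 0
    while pending:
        if relog_after is not None:
            for item in pending:
                if txn - item[2] >= relog_after:
                    item[2] = txn  # relog the intent into the current txn
        worst = max(worst, txn - min(item[2] for item in pending))
        _, nr_children, _ = pending.pop(0)  # finish the head item
        children = []
        for _ in range(nr_children):
            children.append([f"d{next_id}", 0, txn])
            next_id += 1
        pending[:0] = children  # new items go to the head of the chain
        txn += 1                # roll the transaction
    return worst
```

Without relogging, the model reproduces the `t10: D3(t0)` state, a lag of ten transactions. With `relog_after=3`, the lag never exceeds two.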
    • xfs: change the order in which child and parent defer ops are finished · 27dada07
      Committed by Darrick J. Wong
      The defer ops code has been finishing items in the wrong order -- if a
      top level defer op creates items A and B, and finishing item A creates
      more defer ops A1 and A2, we'll put the new items on the end of the
      chain and process them in the order A B A1 A2.  This is kind of weird,
      since it's convenient for programmers to be able to think of A and B as
      an ordered sequence where all the sub-tasks for A must finish before we
      move on to B, e.g. A A1 A2 B.
      
      Right now, our log intent items are not so complex that this matters,
      but this will become important for the atomic extent swapping patchset.
      In order to maintain correct reference counting of extents, we have to
      unmap and remap extents in that order, and we want to complete that work
      before moving on to the next range that the user wants to swap.  This
      patch fixes defer ops to satisfy that requirement.
      
      The primary symptom of the incorrect order was noticed in an early
      performance analysis of the atomic extent swap code.  An astonishingly
      large number of deferred work items accumulated when userspace requested
      an atomic update of two very fragmented files.  The cause of this was
      traced to the same ordering bug in the inner loop of
      xfs_defer_finish_noroll.
      
      If the ->finish_item method of a deferred operation queues new deferred
      operations, those new deferred ops are appended to the tail of the
      pending work list.  To illustrate, say that a caller creates a
      transaction t0 with four deferred operations D0-D3.  The first thing
      defer ops does is roll the transaction to t1, leaving us with:
      
      t1: D0(t0), D1(t0), D2(t0), D3(t0)
      
      Let's say that finishing each of D0-D3 will create two new deferred ops.
      After we finish D0 and roll, we'll have the following chain:
      
      t2: D1(t0), D2(t0), D3(t0), d4(t1), d5(t1)
      
      d4 and d5 were logged to t1.  Notice that while we're about to start
      work on D1, we haven't actually completed all the work implied by D0
      being finished.  So far we've been careful (or lucky) to structure the
      dfops callers such that D1 doesn't depend on d4 or d5 being finished,
      but this is a potential logic bomb.
      
      There's a second problem lurking.  Let's see what happens as we finish
      D1-D3:
      
      t3: D2(t0), D3(t0), d4(t1), d5(t1), d6(t2), d7(t2)
      t4: D3(t0), d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3)
      t5: d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
      
      Let's say that d4-d11 are simple work items that don't queue any other
      operations, which means that we can complete d4 and roll to t6:
      
      t6: d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
      t7: d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
      ...
      t11: d10(t4), d11(t4)
      t12: d11(t4)
      <done>
      
      When we try to roll to transaction #12, we're holding defer op d11,
      which we logged way back in t4.  This means that the tail of the log is
      pinned at t4.  If the log is very small or there are a lot of other
      threads updating metadata, this means that we might have wrapped the log
      and cannot roll to t12 because there isn't enough space left before
      we'd run into t4.
      
      Let's shift back to the original failure.  I mentioned before that I
      discovered this flaw while developing the atomic file update code.  In
      that scenario, we have a defer op (D0) that finds a range of file blocks
      to remap, creates a handful of new defer ops to do that, and then asks
      to be continued with however much work remains.
      
      So, D0 is the original swapext deferred op.  The first thing defer ops
      does is roll to t1:
      
      t1: D0(t0)
      
      We try to finish D0, logging d1 and d2 in the process, but can't get all
      the work done.  We log a done item and a new intent item for the work
      that D0 still has to do, and roll to t2:
      
      t2: D0'(t1), d1(t1), d2(t1)
      
      We roll and try to finish D0', but still can't get all the work done, so
      we log a done item and a new intent item for it, requeue D0 a second
      time, and roll to t3:
      
      t3: D0''(t2), d1(t1), d2(t1), d3(t2), d4(t2)
      
      If it takes 48 more rolls to complete D0, then we'll finally dispense
      with D0 in t50:
      
      t50: D<fifty primes>(t49), d1(t1), ..., d102(t50)
      
      We then try to roll again to get a chain like this:
      
      t51: d1(t1), d2(t1), ..., d101(t50), d102(t50)
      ...
      t152: d102(t50)
      <done>
      
      Notice that in rolling to transaction #51, we're holding on to a log
      intent item for d1 that was logged in transaction #1.  This means that
      the tail of the log is pinned at t1.  If the log is very small or there
      are a lot of other threads updating metadata, this means that we might
      have wrapped the log and cannot roll to t51 because there isn't enough
      space left before we'd run into t1.  This is of course problem #2 again.
      
      But notice the third problem with this scenario: we have 102 defer ops
      tied to this transaction!  Each of these items is backed by pinned
      kernel memory, which means that we risk OOM if the chains get too long.
      
      Yikes.  Problem #1 is a subtle logic bomb that could hit someone in the
      future; problem #2 applies (rarely) to the current upstream, and problem
      #3 applies to work under development.
      
      This is not how incremental deferred operations were supposed to work.
      The dfops design of logging in the same transaction an intent-done item
      and a new intent item for the work remaining was to make it so that we
      only have to juggle enough deferred work items to finish that one small
      piece of work.  Deferred log item recovery will find that first
      unfinished work item and restart it, no matter how many other intent
      items might follow it in the log.  Therefore, it's ok to put the new
      intents at the start of the dfops chain.
      
      For the first example, the chains look like this:
      
      t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
      t3: d5(t1), D1(t0), D2(t0), D3(t0)
      ...
      t9: d9(t7), D3(t0)
      t10: D3(t0)
      t11: d10(t10), d11(t10)
      t12: d11(t10)
      
      For the second example, the chains look like this:
      
      t1: D0(t0)
      t2: d1(t1), d2(t1), D0'(t1)
      t3: d2(t1), D0'(t1)
      t4: D0'(t1)
      t5: d1(t4), d2(t4), D0''(t4)
      ...
      t148: D0<50 primes>(t147)
      t149: d101(t148), d102(t148)
      t150: d102(t148)
      <done>
      
      This actually sucks more for pinning the log tail (we try to roll to t10
      while holding an intent item that was logged in t1) but we've solved
      problem #1.  We've also reduced the maximum chain length from:
      
          sum(all the new items) + nr_original_items
      
      to:
      
          max(new items that each original item creates) + nr_original_items
      
      This solves problem #3 by sharply reducing the number of defer ops that
      can be attached to a transaction at any given time.  The change makes
      the problem of log tail pinning worse, but is an improvement we need to
      solve problem #2.  Actually solving #2, however, is left to the next
      patch.
      
      Note that a subsequent analysis of some hard-to-trigger reflink and COW
      livelocks on extremely fragmented filesystems (or systems running a lot
      of IO threads) showed the same symptoms -- uncomfortably large numbers
      of incore deferred work items and occasional stalls in the transaction
      grant code while waiting for log reservations.  I think this patch and
      the next one will also solve these problems.
      
      As originally written, the code used list_splice_tail_init instead of
      list_splice_init, so change that, and leave a short comment explaining
      our actions.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
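The difference between appending new defer ops to the tail and splicing them at the head can be reproduced with a short simulation of the first example (D0-D3, each queuing two simple children). This is an illustrative Python model, not the kernel implementation: `simulate` and its parameters are hypothetical names, and the list handling only stands in for `list_splice_tail_init` versus `list_splice_init`.

```python
from collections import deque

def simulate(nr_original, children_per_item, splice_at_head):
    """Return (processing order, longest pending chain seen)."""
    # Each work item is (name, number of children it queues when finished).
    pending = deque((f"D{i}", children_per_item) for i in range(nr_original))
    next_id = nr_original
    order = []
    max_len = len(pending)
    while pending:
        name, nr_children = pending.popleft()
        order.append(name)
        new_items = []
        for _ in range(nr_children):
            new_items.append((f"d{next_id}", 0))
            next_id += 1
        if splice_at_head:
            # extendleft reverses its input, so reverse first to keep order.
            pending.extendleft(reversed(new_items))
        else:
            pending.extend(new_items)
        max_len = max(max_len, len(pending))
    return order, max_len

tail_order, tail_max = simulate(4, 2, splice_at_head=False)
head_order, head_max = simulate(4, 2, splice_at_head=True)
```

In this model the tail variant peaks at eight pending items (the t5 chain above), while the head variant never holds more than five and finishes each parent's children before moving on to the next parent.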
    • xfs: fix an incore inode UAF in xfs_bui_recover · ff4ab5e0
      Committed by Darrick J. Wong
      In xfs_bui_item_recover, there exists a use-after-free bug with regard
      to the inode that is involved in the bmap replay operation.  If the
      mapping operation does not complete, we call xfs_bmap_unmap_extent to
      create a deferred op to finish the unmapping work, and we retain a
      pointer to the incore inode.
      
      Unfortunately, the very next thing we do is commit the transaction and
      drop the inode.  If reclaim tears down the inode before we try to finish
      the defer ops, we dereference garbage and blow up.  Therefore, create a
      way to join inodes to the defer ops freezer so that we can maintain the
      xfs_inode reference until we're done with the inode.
      
      Note: This imposes the requirement that there be enough memory to keep
      every incore inode in memory throughout recovery.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: xfs_defer_capture should absorb remaining transaction reservation · 929b92f6
      Committed by Darrick J. Wong
      When xfs_defer_capture extracts the deferred ops and transaction state
      from a transaction, it should record the transaction reservation type
      from the old transaction so that when we continue the dfops chain, we
      still use the same reservation parameters.
      
      Doing this means that the log item recovery functions get to determine
      the transaction reservation instead of abusing tr_itruncate in yet
      another part of xfs.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: xfs_defer_capture should absorb remaining block reservations · 4f9a60c4
      Committed by Darrick J. Wong
      When xfs_defer_capture extracts the deferred ops and transaction state
      from a transaction, it should record the remaining block reservations so
      that when we continue the dfops chain, we can reserve the same number of
      blocks to use.  We capture the reservations for both data and realtime
      volumes.
      
      This adds the requirement that every log intent item recovery function
      must be careful to reserve enough blocks to handle both itself and all
      defer ops that it can queue.  On the other hand, this enables us to do
      away with the handwaving block estimation nonsense that was going on in
      xlog_finish_defer_ops.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
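The idea of absorbing the unused block reservation can be pictured with a small Python model. All class and field names here are hypothetical and only loosely mirror the concepts in the commit message (data blocks and realtime extents); it is a sketch of the bookkeeping, not the kernel's `xfs_defer_capture`.

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    blk_res: int = 0        # data device blocks reserved
    blk_res_used: int = 0   # data device blocks consumed so far
    rtx_res: int = 0        # realtime extents reserved
    rtx_res_used: int = 0   # realtime extents consumed so far

@dataclass
class Capture:
    blk_res: int   # unused data blocks taken from the old transaction
    rtx_res: int   # unused realtime extents taken from the old transaction

def capture_reservations(tp):
    """Move the unused reservation out of tp and into a capture record."""
    cap = Capture(tp.blk_res - tp.blk_res_used, tp.rtx_res - tp.rtx_res_used)
    # The old transaction keeps only what it has already used.
    tp.blk_res = tp.blk_res_used
    tp.rtx_res = tp.rtx_res_used
    return cap

def continue_chain(cap):
    """Start the continuation transaction with the captured reservation."""
    return Transaction(blk_res=cap.blk_res, rtx_res=cap.rtx_res)
```

In this sketch, a transaction that reserved 100 blocks and used 30 hands the remaining 70 to whichever transaction later finishes the captured chain.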
    • xfs: proper replay of deferred ops queued during log recovery · e6fff81e
      Committed by Darrick J. Wong
      When we replay unfinished intent items that have been recovered from the
      log, it's possible that the replay will cause the creation of more
      deferred work items.  As outlined in commit 50995582 ("xfs: log
      recovery should replay deferred ops in order"), later work items have an
      implicit ordering dependency on earlier work items.  Therefore, recovery
      must replay the items (both recovered and created) in the same order
      that they would have been during normal operation.
      
      For log recovery, we enforce this ordering by using an empty transaction
      to collect deferred ops that get created in the process of recovering a
      log intent item to prevent them from being committed before the rest of
      the recovered intent items.  After we finish committing all the
      recovered log items, we allocate a transaction with an enormous block
      reservation, splice our huge list of created deferred ops into that
      transaction, and commit it, thereby finishing all those ops.
      
      This is /really/ hokey -- it's the one place in XFS where we allow
      nested transactions; the splicing of the defer ops list is inelegant
      and has to be done twice per recovery function; and the broken way we
      handle inode pointers and block reservations causes subtle use-after-free
      and allocator problems that will be fixed by this patch and the two
      patches after it.
      
      Therefore, replace the hokey empty transaction with a structure designed
      to capture each chain of deferred ops that are created as part of
      recovering a single unfinished log intent.  Finally, refactor the loop
      that replays those chains to do so using one transaction per chain.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
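The structure this commit describes, one capture per recovered intent followed by one transaction per captured chain, might look roughly like the following Python model. `CapturedChain`, `recover_intents`, and `replay_captures` are invented names for illustration; the transaction is reduced to an integer id so the ordering is easy to inspect.

```python
from dataclasses import dataclass, field

@dataclass
class CapturedChain:
    source_intent: str                      # the recovered intent item
    deferred_ops: list = field(default_factory=list)

def recover_intents(recovered, finish_one):
    """Recover each intent in log order, capturing any ops it queues."""
    captures = []
    for intent in recovered:
        new_ops = finish_one(intent)  # ops created while recovering intent
        if new_ops:
            captures.append(CapturedChain(intent, list(new_ops)))
    return captures

def replay_captures(captures):
    """Finish each captured chain in its own transaction, in order."""
    finished = []
    for txn_id, chain in enumerate(captures, start=1):
        for op in chain.deferred_ops:
            finished.append((txn_id, chain.source_intent, op))
    return finished

# Hypothetical recovery run: intent-B queues nothing, the others do.
queued = {"intent-A": ["a1"], "intent-C": ["c1", "c2"]}
captures = recover_intents(["intent-A", "intent-B", "intent-C"],
                           lambda intent: queued.get(intent, []))
replayed = replay_captures(captures)
```

The created ops are replayed only after every recovered intent has been processed, and in the same relative order as the intents that produced them.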
    • xfs: remove xfs_defer_reset · b80b29d6
      Committed by Darrick J. Wong
      Remove this one-line helper since the assert is trivially true in one
      call site and the rest obscures a bitmask operation.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
  5. 23 September, 2020 · 1 commit
    • xfs: log new intent items created as part of finishing recovered intent items · 93293bcb
      Committed by Darrick J. Wong
      During a code inspection, I found a serious bug in the log intent item
      recovery code when an intent item cannot complete all the work and
      decides to requeue itself to get that done.  When this happens, the
      item recovery creates a new incore deferred op representing the
      remaining work and attaches it to the transaction that it allocated.  At
      the end of _item_recover, it moves the entire chain of deferred ops to
      the dummy parent_tp that xlog_recover_process_intents passed to it, but
      fails to log a new intent item for the remaining work before committing
      the transaction for the single unit of work.
      
      xlog_finish_defer_ops logs those new intent items once recovery has
      finished dealing with the intent items that it recovered, but this isn't
      sufficient.  If the log is forced to disk after a recovered log item
      decides to requeue itself and the system goes down before we call
      xlog_finish_defer_ops, the second log recovery will never see the new
      intent item and therefore has no idea that there was more work to do.
      It will finish recovery leaving the filesystem in a corrupted state.
      
      The same logic applies to /any/ deferred ops added during intent item
      recovery, not just the one handling the remaining work.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
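The failure mode can be demonstrated with a toy log made of (kind, id) records. This Python sketch is an assumption-laden model rather than the XFS log format: `recover` simply pairs intent records with done records, and `finish_partially` stands in for an `_item_recover` function that completes one unit of work and requeues the rest.

```python
def recover(log):
    """Return the intents in the log that have no matching done record."""
    open_intents = []
    for kind, ident in log:
        if kind == "intent":
            open_intents.append(ident)
        elif kind == "done":
            open_intents.remove(ident)
    return open_intents

def finish_partially(log, ident, log_new_intent):
    """Commit one unit of work for ident; the remainder is requeued."""
    log.append(("done", ident))
    remainder = ident + "'"
    if log_new_intent:
        # Logged in the same transaction as the done item.
        log.append(("intent", remainder))
    return remainder

# Buggy behaviour: the requeued work exists only in memory.
buggy_log = [("intent", "D0")]
finish_partially(buggy_log, "D0", log_new_intent=False)

# Fixed behaviour: the new intent reaches the log with the done item.
fixed_log = [("intent", "D0")]
finish_partially(fixed_log, "D0", log_new_intent=True)
```

If the system goes down right after either call, a second recovery of `buggy_log` finds nothing left to do, while `fixed_log` still shows the outstanding `D0'`.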
  6. 20 May, 2020 · 1 commit
    • xfs: use ordered buffers to initialize dquot buffers during quotacheck · 78bba5c8
      Committed by Darrick J. Wong
      While QAing the new xfs_repair quotacheck code, I uncovered a quota
      corruption bug resulting from a bad interaction between dquot buffer
      initialization and quotacheck.  The bug can be reproduced with the
      following sequence:
      
      # mkfs.xfs -f /dev/sdf
      # mount /dev/sdf /opt -o usrquota
      # su nobody -s /bin/bash -c 'touch /opt/barf'
      # sync
      # xfs_quota -x -c 'report -ahi' /opt
      User quota on /opt (/dev/sdf)
                              Inodes
      User ID      Used   Soft   Hard Warn/Grace
      ---------- ---------------------------------
      root            3      0      0  00 [------]
      nobody          1      0      0  00 [------]
      
      # xfs_io -x -c 'shutdown' /opt
      # umount /opt
      # mount /dev/sdf /opt -o usrquota
      # touch /opt/man2
      # xfs_quota -x -c 'report -ahi' /opt
      User quota on /opt (/dev/sdf)
                              Inodes
      User ID      Used   Soft   Hard Warn/Grace
      ---------- ---------------------------------
      root            1      0      0  00 [------]
      nobody          1      0      0  00 [------]
      
      # umount /opt
      
      Notice how the initial quotacheck set the root dquot icount to 3
      (rootino, rbmino, rsumino), but after shutdown -> remount -> recovery,
      xfs_quota reports that the root dquot has only 1 icount.  We haven't
      deleted anything from the filesystem, which means that quota is now
      under-counting.  This behavior is not limited to icount or the root
      dquot, but this is the shortest reproducer.
      
      I traced the cause of this discrepancy to the way that we handle ondisk
      dquot updates during quotacheck vs. regular fs activity.  Normally, when
      we allocate a disk block for a dquot, we log the buffer as a regular
      (dquot) buffer.  Subsequent updates to the dquots backed by that block
      are done via separate dquot log item updates, which means that they
      depend on the logged buffer update being written to disk before the
      dquot items.  Because individual dquots have their own LSN fields, that
      initial dquot buffer must always be recovered.
      
      However, the story changes for quotacheck, which can cause dquot block
      allocations but persists the final dquot counter values via a delwri
      list.  Because recovery doesn't gate dquot buffer replay on an LSN, this
      means that the initial dquot buffer can be replayed over the (newer)
      contents that were delwritten at the end of quotacheck.  In effect, this
      re-initializes the dquot counters after they've been updated.  If the
      log does not contain any other dquot items to recover, the obsolete
      dquot contents will not be corrected by log recovery.
      
      Because quotacheck uses a transaction to log the setting of the CHKD
      flags in the superblock, we skip quotacheck during the second mount
      call, which allows the incorrect icount to remain.
      
      Fix this by changing the ondisk dquot initialization function to use
      ordered buffers to write out fresh dquot blocks if it detects that we're
      running quotacheck.  If the system goes down before quotacheck can
      complete, the CHKD flags will not be set in the superblock and the next
      mount will run quotacheck again, which can fix uninitialized dquot
      buffers.  This requires amending the defer code to maintain ordered
      buffer state across defer rolls for the sake of the dquot allocation
      code.
      
      For regular operations we preserve the current behavior since the dquot
      items require properly initialized ondisk dquot records.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
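The interaction between the logged dquot buffer and the delwri'd counters can be modelled in a few lines of Python. Everything here is a simplification chosen for illustration: a single `root_icount` counter stands in for the dquot block, the log is a list of records, and an ordered buffer is modelled as one that puts no replayable contents into the log.

```python
def run_quotacheck(log, ordered):
    """Return the on-disk dquot block once quotacheck has finished."""
    fresh = {"root_icount": 0}
    if not ordered:
        # Regular buffer: the freshly initialized contents are logged
        # and can later be replayed by recovery.
        log.append(("dquot_buf", dict(fresh)))
    # Ordered buffer: this toy logs no replayable contents for the block.
    # Either way, quotacheck writes the final counts straight to disk
    # (standing in for the delwri list).
    return {"root_icount": 3}

def recover(disk_block, log):
    """Replay logged dquot buffers without any LSN check."""
    for kind, contents in log:
        if kind == "dquot_buf":
            disk_block = dict(contents)
    return disk_block

def touch_as_root(disk_block):
    """Create one more root-owned inode."""
    return {"root_icount": disk_block["root_icount"] + 1}

regular_log, ordered_log = [], []
regular = touch_as_root(recover(run_quotacheck(regular_log, False), regular_log))
fixed = touch_as_root(recover(run_quotacheck(ordered_log, True), ordered_log))
```

With the regular buffer, replay resets the counter and the model ends at an icount of 1, matching the second `xfs_quota` report in the reproducer. With the ordered buffer, the counts written by quotacheck survive recovery.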
  7. 05 May, 2020 · 6 commits
  8. 27 August, 2019 · 1 commit
  9. 29 June, 2019 · 1 commit
  10. 30 April, 2019 · 1 commit
    • xfs: always rejoin held resources during defer roll · 710d707d
      Committed by Darrick J. Wong
      During testing of xfs/141 on a V4 filesystem, I observed some
      inconsistent behavior with regard to resources that are held (i.e.
      remain locked) across a defer roll.  The transaction roll always gives
      the defer roll function a new transaction, even if committing the old
      transaction fails.  However, the defer roll function only rejoins the
      held resources if the transaction commit succeeded.  This means that
      callers of defer roll have to figure out whether the held resources are
      attached to the transaction being passed back.
      
      Worse yet, if the defer roll was part of a defer finish call, we have a
      third possibility: the defer finish could pass back a dirty transaction
      with dirty held resources and an error code.
      
      The only sane way to handle all of these scenarios is to require that
      the code that held the resource either cancel the transaction before
      unlocking and releasing the resources, or use functions that detach
      resources from a transaction properly (e.g.  xfs_trans_brelse) if they
      need to drop the reference before committing or cancelling the
      transaction.
      
      In order to make this so, change the defer roll code to join held
      resources to the new transaction unconditionally and fix all the bhold
      callers to release the held buffers correctly.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
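The behavioural change can be illustrated with a minimal Python model of a defer roll. The `Transaction` class, the `defer_roll` signature, and the `always_rejoin` switch are hypothetical; they exist only to contrast the old conditional rejoin with the unconditional one this commit introduces.

```python
import errno

class Transaction:
    def __init__(self):
        self.joined = []   # resources attached to this transaction

    def join(self, resource):
        self.joined.append(resource)

def defer_roll(held, commit_succeeds, always_rejoin=True):
    """Roll to a new transaction and return (new_tp, error)."""
    # The caller always gets a new transaction, even if the commit fails.
    new_tp = Transaction()
    error = 0 if commit_succeeds else -errno.EIO
    if commit_succeeds or always_rejoin:
        for resource in held:
            new_tp.join(resource)
    return new_tp, error

old_tp, old_err = defer_roll(["held_buf"], commit_succeeds=False,
                             always_rejoin=False)
new_tp, new_err = defer_roll(["held_buf"], commit_succeeds=False)
```

With the old behaviour, a failed commit leaves the caller holding a transaction that does not own the held buffer. With the unconditional rejoin, the caller can always cancel the returned transaction and then release the resource the same way.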
  11. 13 December, 2018 · 2 commits
  12. 03 August, 2018 · 11 commits
  13. 27 July, 2018 · 2 commits