1. 27 12月, 2021 2 次提交
    • D
      xfs: update superblock counters correctly for !lazysbcount · 20560d6e
      Dave Chinner 提交于
      mainline-inclusion
      from mainline-v5.12-rc4
      commit 6543990a
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6543990a168acf366f4b6174d7bd46ba15a8a2a6
      
      -------------------------------------------------
      
      Keep the mount superblock counters up to date for !lazysbcount
      filesystems so that when we log the superblock they do not need
      updating in any way because they are already correct.
      
      It's found by what Zorro reported:
      1. mkfs.xfs -f -l lazy-count=0 -m crc=0 $dev
      2. mount $dev $mnt
      3. fsstress -d $mnt -p 100 -n 1000 (maybe need more or less io load)
      4. umount $mnt
      5. xfs_repair -n $dev
      and I've seen no problem with this patch.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reported-by: NZorro Lang <zlang@redhat.com>
      Reviewed-by: NGao Xiang <hsiangkao@redhat.com>
      Signed-off-by: NGao Xiang <hsiangkao@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NGuo Xuenan <guoxuenan@huawei.com>
      Reviewed-by: NLihong Kou <koulihong@huawei.com>
      Reviewed-by: NZhang Yi <yi.zhang@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      20560d6e
    • D
      xfs: remove obsolete AGF counter debugging · dc0c7ae0
      Darrick J. Wong 提交于
      mainline-inclusion
      from mainline-v5.12-rc4
      commit 1aec7c3d
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1aec7c3d05670b92b7339b19999009a93808efb9
      
      -------------------------------------------------
      
      In commit f8f2835a we changed the behavior of XFS to use EFIs to
      remove blocks from an overfilled AGFL because there were complaints
      about transaction overruns that stemmed from trying to free multiple
      blocks in a single transaction.
      
      Unfortunately, that commit missed a subtlety in the debug-mode
      transaction accounting when a realtime volume is attached.  If a
      realtime file undergoes a data fork mapping change such that realtime
      extents are allocated (or freed) in the same transaction that a data
      device block is also allocated (or freed), we can trip a debugging
      assertion.  This can happen (for example) if a realtime extent is
      allocated and it is necessary to reshape the bmbt to hold the new
      mapping.
      
      When we go to allocate a bmbt block from an AG, the first thing the data
      device block allocator does is ensure that the freelist is the proper
      length.  If the freelist is too long, it will trim the freelist to the
      proper length.
      
      In debug mode, trimming the freelist calls xfs_trans_agflist_delta() to
      record the decrement in the AG free list count.  Prior to f8f28 we would
      put the free block back in the free space btrees in the same
      transaction, which calls xfs_trans_agblocks_delta() to record the
      increment in the AG free block count.  Since AGFL blocks are included in
      the global free block count (fdblocks), there is no corresponding
      fdblocks update, so the AGFL free satisfies the following condition in
      xfs_trans_apply_sb_deltas:
      
      	/*
      	 * Check that superblock mods match the mods made to AGF counters.
      	 */
      	ASSERT((tp->t_fdblocks_delta + tp->t_res_fdblocks_delta) ==
      	       (tp->t_ag_freeblks_delta + tp->t_ag_flist_delta +
      		tp->t_ag_btree_delta));
      
      The comparison here used to be: (X + 0) == ((X+1) + -1 + 0), where X is
      the number blocks that were allocated.
      
      After commit f8f28 we defer the block freeing to the next chained
      transaction, which means that the calls to xfs_trans_agflist_delta and
      xfs_trans_agblocks_delta occur in separate transactions.  The (first)
      transaction that shortens the free list trips on the comparison, which
      has now become:
      
      (X + 0) == ((X) + -1 + 0)
      
      because we haven't freed the AGFL block yet; we've only logged an
      intention to free it.  When the second transaction (the deferred free)
      commits, it will evaluate the expression as:
      
      (0 + 0) == (1 + 0 + 0)
      
      and trip over that in turn.
      
      At this point, the astute reader may note that the two commits tagged by
      this patch have been in the kernel for a long time but haven't generated
      any bug reports.  How is it that the author became aware of this bug?
      
      This originally surfaced as an intermittent failure when I was testing
      realtime rmap, but a different bug report by Zorro Lang reveals the same
      assertion occuring on !lazysbcount filesystems.
      
      The common factor to both reports (and why this problem wasn't
      previously reported) becomes apparent if we consider when
      xfs_trans_apply_sb_deltas is called by __xfs_trans_commit():
      
      	if (tp->t_flags & XFS_TRANS_SB_DIRTY)
      		xfs_trans_apply_sb_deltas(tp);
      
      With a modern lazysbcount filesystem, transactions update only the
      percpu counters, so they don't need to set XFS_TRANS_SB_DIRTY, hence
      xfs_trans_apply_sb_deltas is rarely called.
      
      However, updates to the count of free realtime extents are not part of
      lazysbcount, so XFS_TRANS_SB_DIRTY will be set on transactions adding or
      removing data fork mappings to realtime files; similarly,
      XFS_TRANS_SB_DIRTY is always set on !lazysbcount filesystems.
      
      Dave mentioned in response to an earlier version of this patch:
      
      "IIUC, what you are saying is that this debug code is simply not
      exercised in normal testing and hasn't been for the past decade?  And it
      still won't be exercised on anything other than realtime device testing?
      
      "...it was debugging code from 1994 that was largely turned into dead
      code when lazysbcounters were introduced in 2007. Hence I'm not sure it
      holds any value anymore."
      
      This debugging code isn't especially helpful - you can modify the
      flcount on one AG and the freeblks of another AG, and it won't trigger.
      Add the fact that nobody noticed for a decade, and let's just get rid of
      it (and start testing realtime :P).
      
      This bug was found by running generic/051 on either a V4 filesystem
      lacking lazysbcount; or a V5 filesystem with a realtime volume.
      
      Cc: bfoster@redhat.com, zlang@redhat.com
      Fixes: f8f2835a ("xfs: defer agfl block frees when dfops is available")
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NGuo Xuenan <guoxuenan@huawei.com>
      Reviewed-by: NLihong Kou <koulihong@huawei.com>
      Reviewed-by: NZhang Yi <yi.zhang@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      dc0c7ae0
  2. 03 6月, 2021 1 次提交
  3. 20 11月, 2020 1 次提交
    • D
      xfs: revert "xfs: fix rmap key and record comparison functions" · eb840907
      Darrick J. Wong 提交于
      This reverts commit 6ff646b2.
      
      Your maintainer committed a major braino in the rmap code by adding the
      attr fork, bmbt, and unwritten extent usage bits into rmap record key
      comparisons.  While XFS uses the usage bits *in the rmap records* for
      cross-referencing metadata in xfs_scrub and xfs_repair, it only needs
      the owner and offset information to distinguish between reverse mappings
      of the same physical extent into the data fork of a file at multiple
      offsets.  The other bits are not important for key comparisons for index
      lookups, and never have been.
      
      Eric Sandeen reports that this causes regressions in generic/299, so
      undo this patch before it does more damage.
      Reported-by: NEric Sandeen <sandeen@sandeen.net>
      Fixes: 6ff646b2 ("xfs: fix rmap key and record comparison functions")
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      eb840907
  4. 19 11月, 2020 1 次提交
    • G
      xfs: fix forkoff miscalculation related to XFS_LITINO(mp) · ada49d64
      Gao Xiang 提交于
      Currently, commit e9e2eae8 dropped a (int) decoration from
      XFS_LITINO(mp), and since sizeof() expression is also involved,
      the result of XFS_LITINO(mp) is simply as the size_t type
      (commonly unsigned long).
      
      Considering the expression in xfs_attr_shortform_bytesfit():
        offset = (XFS_LITINO(mp) - bytes) >> 3;
      let "bytes" be (int)340, and
          "XFS_LITINO(mp)" be (unsigned long)336.
      
      on 64-bit platform, the expression is
        offset = ((unsigned long)336 - (int)340) >> 3 =
                 (int)(0xfffffffffffffffcUL >> 3) = -1
      
      but on 32-bit platform, the expression is
        offset = ((unsigned long)336 - (int)340) >> 3 =
                 (int)(0xfffffffcUL >> 3) = 0x1fffffff
      instead.
      
      so offset becomes a large positive number on 32-bit platform, and
      cause xfs_attr_shortform_bytesfit() returns maxforkoff rather than 0.
      
      Therefore, one result is
        "ASSERT(new_size <= XFS_IFORK_SIZE(ip, whichfork));"
      
      assertion failure in xfs_idata_realloc(), which was also the root
      cause of the original bugreport from Dennis, see:
         https://bugzilla.redhat.com/show_bug.cgi?id=1894177
      
      And it can also be manually triggered with the following commands:
        $ touch a;
        $ setfattr -n user.0 -v "`seq 0 80`" a;
        $ setfattr -n user.1 -v "`seq 0 80`" a
      
      on 32-bit platform.
      
      Fix the case in xfs_attr_shortform_bytesfit() by bailing out
      "XFS_LITINO(mp) < bytes" in advance suggested by Eric and a misleading
      comment together with this bugfix suggested by Darrick. It seems the
      other users of XFS_LITINO(mp) are not impacted.
      
      Fixes: e9e2eae8 ("xfs: only check the superblock version for dinode size calculation")
      Cc: <stable@vger.kernel.org> # 5.7+
      Reported-and-tested-by: NDennis Gilmore <dgilmore@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NGao Xiang <hsiangkao@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      ada49d64
  5. 11 11月, 2020 2 次提交
    • D
      xfs: fix rmap key and record comparison functions · 6ff646b2
      Darrick J. Wong 提交于
      Keys for extent interval records in the reverse mapping btree are
      supposed to be computed as follows:
      
      (physical block, owner, fork, is_btree, is_unwritten, offset)
      
      This provides users the ability to look up a reverse mapping from a bmbt
      record -- start with the physical block; then if there are multiple
      records for the same block, move on to the owner; then the inode fork
      type; and so on to the file offset.
      
      However, the key comparison functions incorrectly remove the
      fork/btree/unwritten information that's encoded in the on-disk offset.
      This means that lookup comparisons are only done with:
      
      (physical block, owner, offset)
      
      This means that queries can return incorrect results.  On consistent
      filesystems this hasn't been an issue because blocks are never shared
      between forks or with bmbt blocks; and are never unwritten.  However,
      this bug means that online repair cannot always detect corruption in the
      key information in internal rmapbt nodes.
      
      Found by fuzzing keys[1].attrfork = ones on xfs/371.
      
      Fixes: 4b8ed677 ("xfs: add rmap btree operations")
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      6ff646b2
    • D
      xfs: fix flags argument to rmap lookup when converting shared file rmaps · ea843989
      Darrick J. Wong 提交于
      Pass the same oldext argument (which contains the existing rmapping's
      unwritten state) to xfs_rmap_lookup_le_range at the start of
      xfs_rmap_convert_shared.  At this point in the code, flags is zero,
      which means that we perform lookups using the wrong key.
      
      Fixes: 3f165b33 ("xfs: convert unwritten status of reverse mappings for shared files")
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      ea843989
  6. 29 10月, 2020 1 次提交
  7. 17 10月, 2020 1 次提交
    • D
      xfs: fix high key handling in the rt allocator's query_range function · d88850bd
      Darrick J. Wong 提交于
      Fix some off-by-one errors in xfs_rtalloc_query_range.  The highest key
      in the realtime bitmap is always one less than the number of rt extents,
      which means that the key clamp at the start of the function is wrong.
      The 4th argument to xfs_rtfind_forw is the highest rt extent that we
      want to probe, which means that passing 1 less than the high key is
      wrong.  Finally, drop the rem variable that controls the loop because we
      can compare the iteration point (rtstart) against the high key directly.
      
      The sordid history of this function is that the original commit (fb3c3)
      incorrectly passed (high_rec->ar_startblock - 1) as the 'limit' parameter
      to xfs_rtfind_forw.  This was wrong because the "high key" is supposed
      to be the largest key for which the caller wants result rows, not the
      key for the first row that could possibly be outside the range that the
      caller wants to see.
      
      A subsequent attempt (8ad56) to strengthen the parameter checking added
      incorrect clamping of the parameters to the number of rt blocks in the
      system (despite the bitmap functions all taking units of rt extents) to
      avoid querying ranges past the end of rt bitmap file but failed to fix
      the incorrect _rtfind_forw parameter.  The original _rtfind_forw
      parameter error then survived the conversion of the startblock and
      blockcount fields to rt extents (a0e5c), and the most recent off-by-one
      fix (a3a37) thought it was patching a problem when the end of the rt
      volume is not in use, but none of these fixes actually solved the
      original problem that the author was confused about the "limit" argument
      to xfs_rtfind_forw.
      
      Sadly, all four of these patches were written by this author and even
      his own usage of this function and rt testing were inadequate to get
      this fixed quickly.
      
      Original-problem: fb3c3de2 ("xfs: add a couple of queries to iterate free extents in the rtbitmap")
      Not-fixed-by: 8ad560d2 ("xfs: strengthen rtalloc query range checks")
      Not-fixed-by: a0e5c435 ("xfs: fix xfs_rtalloc_rec units")
      Fixes: a3a374bf ("xfs: fix off-by-one error in xfs_rtalloc_query_range")
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChandan Babu R <chandanrlinux@gmail.com>
      d88850bd
  8. 07 10月, 2020 8 次提交
    • D
      xfs: only relog deferred intent items if free space in the log gets low · 74f4d6a1
      Darrick J. Wong 提交于
      Now that we have the ability to ask the log how far the tail needs to be
      pushed to maintain its free space targets, augment the decision to relog
      an intent item so that we only do it if the log has hit the 75% full
      threshold.  There's no point in relogging an intent into the same
      checkpoint, and there's no need to relog if there's plenty of free space
      in the log.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      74f4d6a1
    • D
      xfs: periodically relog deferred intent items · 4e919af7
      Darrick J. Wong 提交于
      There's a subtle design flaw in the deferred log item code that can lead
      to pinning the log tail.  Taking up the defer ops chain examples from
      the previous commit, we can get trapped in sequences like this:
      
      Caller hands us a transaction t0 with D0-D3 attached.  The defer ops
      chain will look like the following if the transaction rolls succeed:
      
      t1: D0(t0), D1(t0), D2(t0), D3(t0)
      t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
      t3: d5(t1), D1(t0), D2(t0), D3(t0)
      ...
      t9: d9(t7), D3(t0)
      t10: D3(t0)
      t11: d10(t10), d11(t10)
      t12: d11(t10)
      
      In transaction 9, we finish d9 and try to roll to t10 while holding onto
      an intent item for D3 that we logged in t0.
      
      The previous commit changed the order in which we place new defer ops in
      the defer ops processing chain to reduce the maximum chain length.  Now
      make xfs_defer_finish_noroll capable of relogging the entire chain
      periodically so that we can always move the log tail forward.  Most
      chains will never get relogged, except for operations that generate very
      long chains (large extents containing many blocks with different sharing
      levels) or are on filesystems with small logs and a lot of ongoing
      metadata updates.
      
      Callers are now required to ensure that the transaction reservation is
      large enough to handle logging done items and new intent items for the
      maximum possible chain length.  Most callers are careful to keep the
      chain lengths low, so the overhead should be minimal.
      
      The decision to relog an intent item is made based on whether the intent
      was logged in a previous checkpoint, since there's no point in relogging
      an intent into the same checkpoint.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      4e919af7
    • D
      xfs: change the order in which child and parent defer ops are finished · 27dada07
      Darrick J. Wong 提交于
      The defer ops code has been finishing items in the wrong order -- if a
      top level defer op creates items A and B, and finishing item A creates
      more defer ops A1 and A2, we'll put the new items on the end of the
      chain and process them in the order A B A1 A2.  This is kind of weird,
      since it's convenient for programmers to be able to think of A and B as
      an ordered sequence where all the sub-tasks for A must finish before we
      move on to B, e.g. A A1 A2 D.
      
      Right now, our log intent items are not so complex that this matters,
      but this will become important for the atomic extent swapping patchset.
      In order to maintain correct reference counting of extents, we have to
      unmap and remap extents in that order, and we want to complete that work
      before moving on to the next range that the user wants to swap.  This
      patch fixes defer ops to satsify that requirement.
      
      The primary symptom of the incorrect order was noticed in an early
      performance analysis of the atomic extent swap code.  An astonishingly
      large number of deferred work items accumulated when userspace requested
      an atomic update of two very fragmented files.  The cause of this was
      traced to the same ordering bug in the inner loop of
      xfs_defer_finish_noroll.
      
      If the ->finish_item method of a deferred operation queues new deferred
      operations, those new deferred ops are appended to the tail of the
      pending work list.  To illustrate, say that a caller creates a
      transaction t0 with four deferred operations D0-D3.  The first thing
      defer ops does is roll the transaction to t1, leaving us with:
      
      t1: D0(t0), D1(t0), D2(t0), D3(t0)
      
      Let's say that finishing each of D0-D3 will create two new deferred ops.
      After finish D0 and roll, we'll have the following chain:
      
      t2: D1(t0), D2(t0), D3(t0), d4(t1), d5(t1)
      
      d4 and d5 were logged to t1.  Notice that while we're about to start
      work on D1, we haven't actually completed all the work implied by D0
      being finished.  So far we've been careful (or lucky) to structure the
      dfops callers such that D1 doesn't depend on d4 or d5 being finished,
      but this is a potential logic bomb.
      
      There's a second problem lurking.  Let's see what happens as we finish
      D1-D3:
      
      t3: D2(t0), D3(t0), d4(t1), d5(t1), d6(t2), d7(t2)
      t4: D3(t0), d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3)
      t5: d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
      
      Let's say that d4-d11 are simple work items that don't queue any other
      operations, which means that we can complete each d4 and roll to t6:
      
      t6: d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
      t7: d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
      ...
      t11: d10(t4), d11(t4)
      t12: d11(t4)
      <done>
      
      When we try to roll to transaction #12, we're holding defer op d11,
      which we logged way back in t4.  This means that the tail of the log is
      pinned at t4.  If the log is very small or there are a lot of other
      threads updating metadata, this means that we might have wrapped the log
      and cannot get roll to t11 because there isn't enough space left before
      we'd run into t4.
      
      Let's shift back to the original failure.  I mentioned before that I
      discovered this flaw while developing the atomic file update code.  In
      that scenario, we have a defer op (D0) that finds a range of file blocks
      to remap, creates a handful of new defer ops to do that, and then asks
      to be continued with however much work remains.
      
      So, D0 is the original swapext deferred op.  The first thing defer ops
      does is rolls to t1:
      
      t1: D0(t0)
      
      We try to finish D0, logging d1 and d2 in the process, but can't get all
      the work done.  We log a done item and a new intent item for the work
      that D0 still has to do, and roll to t2:
      
      t2: D0'(t1), d1(t1), d2(t1)
      
      We roll and try to finish D0', but still can't get all the work done, so
      we log a done item and a new intent item for it, requeue D0 a second
      time, and roll to t3:
      
      t3: D0''(t2), d1(t1), d2(t1), d3(t2), d4(t2)
      
      If it takes 48 more rolls to complete D0, then we'll finally dispense
      with D0 in t50:
      
      t50: D<fifty primes>(t49), d1(t1), ..., d102(t50)
      
      We then try to roll again to get a chain like this:
      
      t51: d1(t1), d2(t1), ..., d101(t50), d102(t50)
      ...
      t152: d102(t50)
      <done>
      
      Notice that in rolling to transaction #51, we're holding on to a log
      intent item for d1 that was logged in transaction #1.  This means that
      the tail of the log is pinned at t1.  If the log is very small or there
      are a lot of other threads updating metadata, this means that we might
      have wrapped the log and cannot roll to t51 because there isn't enough
      space left before we'd run into t1.  This is of course problem #2 again.
      
      But notice the third problem with this scenario: we have 102 defer ops
      tied to this transaction!  Each of these items are backed by pinned
      kernel memory, which means that we risk OOM if the chains get too long.
      
      Yikes.  Problem #1 is a subtle logic bomb that could hit someone in the
      future; problem #2 applies (rarely) to the current upstream, and problem
      #3 applies to work under development.
      
      This is not how incremental deferred operations were supposed to work.
      The dfops design of logging in the same transaction an intent-done item
      and a new intent item for the work remaining was to make it so that we
      only have to juggle enough deferred work items to finish that one small
      piece of work.  Deferred log item recovery will find that first
      unfinished work item and restart it, no matter how many other intent
      items might follow it in the log.  Therefore, it's ok to put the new
      intents at the start of the dfops chain.
      
      For the first example, the chains look like this:
      
      t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
      t3: d5(t1), D1(t0), D2(t0), D3(t0)
      ...
      t9: d9(t7), D3(t0)
      t10: D3(t0)
      t11: d10(t10), d11(t10)
      t12: d11(t10)
      
      For the second example, the chains look like this:
      
      t1: D0(t0)
      t2: d1(t1), d2(t1), D0'(t1)
      t3: d2(t1), D0'(t1)
      t4: D0'(t1)
      t5: d1(t4), d2(t4), D0''(t4)
      ...
      t148: D0<50 primes>(t147)
      t149: d101(t148), d102(t148)
      t150: d102(t148)
      <done>
      
      This actually sucks more for pinning the log tail (we try to roll to t10
      while holding an intent item that was logged in t1) but we've solved
      problem #1.  We've also reduced the maximum chain length from:
      
          sum(all the new items) + nr_original_items
      
      to:
      
          max(new items that each original item creates) + nr_original_items
      
      This solves problem #3 by sharply reducing the number of defer ops that
      can be attached to a transaction at any given time.  The change makes
      the problem of log tail pinning worse, but is improvement we need to
      solve problem #2.  Actually solving #2, however, is left to the next
      patch.
      
      Note that a subsequent analysis of some hard-to-trigger reflink and COW
      livelocks on extremely fragmented filesystems (or systems running a lot
      of IO threads) showed the same symptoms -- uncomfortably large numbers
      of incore deferred work items and occasional stalls in the transaction
      grant code while waiting for log reservations.  I think this patch and
      the next one will also solve these problems.
      
      As originally written, the code used list_splice_tail_init instead of
      list_splice_init, so change that, and leave a short comment explaining
      our actions.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      27dada07
    • D
      xfs: fix an incore inode UAF in xfs_bui_recover · ff4ab5e0
      Darrick J. Wong 提交于
      In xfs_bui_item_recover, there exists a use-after-free bug with regards
      to the inode that is involved in the bmap replay operation.  If the
      mapping operation does not complete, we call xfs_bmap_unmap_extent to
      create a deferred op to finish the unmapping work, and we retain a
      pointer to the incore inode.
      
      Unfortunately, the very next thing we do is commit the transaction and
      drop the inode.  If reclaim tears down the inode before we try to finish
      the defer ops, we dereference garbage and blow up.  Therefore, create a
      way to join inodes to the defer ops freezer so that we can maintain the
      xfs_inode reference until we're done with the inode.
      
      Note: This imposes the requirement that there be enough memory to keep
      every incore inode in memory throughout recovery.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      ff4ab5e0
    • D
      xfs: xfs_defer_capture should absorb remaining transaction reservation · 929b92f6
      Darrick J. Wong 提交于
      When xfs_defer_capture extracts the deferred ops and transaction state
      from a transaction, it should record the transaction reservation type
      from the old transaction so that when we continue the dfops chain, we
      still use the same reservation parameters.
      
      Doing this means that the log item recovery functions get to determine
      the transaction reservation instead of abusing tr_itruncate in yet
      another part of xfs.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      929b92f6
    • D
      xfs: xfs_defer_capture should absorb remaining block reservations · 4f9a60c4
      Darrick J. Wong 提交于
      When xfs_defer_capture extracts the deferred ops and transaction state
      from a transaction, it should record the remaining block reservations so
      that when we continue the dfops chain, we can reserve the same number of
      blocks to use.  We capture the reservations for both data and realtime
      volumes.
      
      This adds the requirement that every log intent item recovery function
      must be careful to reserve enough blocks to handle both itself and all
      defer ops that it can queue.  On the other hand, this enables us to do
      away with the handwaving block estimation nonsense that was going on in
      xlog_finish_defer_ops.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      4f9a60c4
    • D
      xfs: proper replay of deferred ops queued during log recovery · e6fff81e
      Darrick J. Wong 提交于
      When we replay unfinished intent items that have been recovered from the
      log, it's possible that the replay will cause the creation of more
      deferred work items.  As outlined in commit 50995582 ("xfs: log
      recovery should replay deferred ops in order"), later work items have an
      implicit ordering dependency on earlier work items.  Therefore, recovery
      must replay the items (both recovered and created) in the same order
      that they would have been during normal operation.
      
      For log recovery, we enforce this ordering by using an empty transaction
      to collect deferred ops that get created in the process of recovering a
      log intent item to prevent them from being committed before the rest of
      the recovered intent items.  After we finish committing all the
      recovered log items, we allocate a transaction with an enormous block
      reservation, splice our huge list of created deferred ops into that
      transaction, and commit it, thereby finishing all those ops.
      
      This is /really/ hokey -- it's the one place in XFS where we allow
      nested transactions; the splicing of the defer ops list is is inelegant
      and has to be done twice per recovery function; and the broken way we
      handle inode pointers and block reservations cause subtle use-after-free
      and allocator problems that will be fixed by this patch and the two
      patches after it.
      
      Therefore, replace the hokey empty transaction with a structure designed
      to capture each chain of deferred ops that are created as part of
      recovering a single unfinished log intent.  Finally, refactor the loop
      that replays those chains to do so using one transaction per chain.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      e6fff81e
    • D
      xfs: remove xfs_defer_reset · b80b29d6
      Darrick J. Wong 提交于
      Remove this one-line helper since the assert is trivially true in one
      call site and the rest obscures a bitmask operation.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      b80b29d6
  9. 26 9月, 2020 5 次提交
  10. 23 9月, 2020 2 次提交
    • D
      xfs: log new intent items created as part of finishing recovered intent items · 93293bcb
      Darrick J. Wong 提交于
      During a code inspection, I found a serious bug in the log intent item
      recovery code when an intent item cannot complete all the work and
      decides to requeue itself to get that done.  When this happens, the
      item recovery creates a new incore deferred op representing the
      remaining work and attaches it to the transaction that it allocated.  At
      the end of _item_recover, it moves the entire chain of deferred ops to
      the dummy parent_tp that xlog_recover_process_intents passed to it, but
      fail to log a new intent item for the remaining work before committing
      the transaction for the single unit of work.
      
      xlog_finish_defer_ops logs those new intent items once recovery has
      finished dealing with the intent items that it recovered, but this isn't
      sufficient.  If the log is forced to disk after a recovered log item
      decides to requeue itself and the system goes down before we call
      xlog_finish_defer_ops, the second log recovery will never see the new
      intent item and therefore has no idea that there was more work to do.
      It will finish recovery leaving the filesystem in a corrupted state.
      
      The same logic applies to /any/ deferred ops added during intent item
      recovery, not just the one handling the remaining work.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      93293bcb
    • D
      xfs: don't free rt blocks when we're doing a REMAP bunmapi call · 8df0fa39
      Darrick J. Wong 提交于
      When callers pass XFS_BMAPI_REMAP into xfs_bunmapi, they want the extent
      to be unmapped from the given file fork without the extent being freed.
      We do this for non-rt files, but we forgot to do this for realtime
      files.  So far this isn't a big deal since nobody makes a bunmapi call
      to a rt file with the REMAP flag set, but don't leave a logic bomb.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      8df0fa39
  11. 16 9月, 2020 16 次提交