1. 07 7月, 2022 1 次提交
  2. 29 4月, 2022 1 次提交
  3. 21 4月, 2022 1 次提交
  4. 12 4月, 2022 1 次提交
    • D
      xfs: use a separate frextents counter for rt extent reservations · 2229276c
      Darrick J. Wong 提交于
      As mentioned in the previous commit, the kernel misuses sb_frextents in
      the incore mount to reflect both incore reservations made by running
      transactions as well as the actual count of free rt extents on disk.
      This results in the superblock being written to the log with an
      underestimate of the number of rt extents that are marked free in the
      rtbitmap.
      
      Teaching XFS to recompute frextents after log recovery avoids
      operational problems in the current mount, but it doesn't solve the
      problem of us writing undercounted frextents which are then recovered by
      an older kernel that doesn't have that fix.
      
      Create an incore percpu counter to mirror the ondisk frextents.  This
      new counter will track transaction reservations and the only time we
      will touch the incore super counter (i.e the one that gets logged) is
      when those transactions commit updates to the rt bitmap.  This is in
      contrast to the lazysbcount counters (e.g. fdblocks), where we know that
      log recovery will always fix any incorrect counter that we log.
      As a bonus, we only take m_sb_lock at transaction commit time.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      2229276c
  5. 30 3月, 2022 1 次提交
    • D
      xfs: xfs_trans_commit() path must check for log shutdown · 3c4cb76b
      Dave Chinner 提交于
      If a shut races with xfs_trans_commit() and we have shut down the
      filesystem but not the log, we will still cancel the transaction.
      This can result in aborting dirty log items instead of committing and
      pinning them whilst the log is still running. Hence we can end up
      with dirty, unlogged metadata that isn't in the AIL in memory that
      can be flushed to disk via writeback clustering.
      
      This was discovered from a g/388 trace where an inode log item was
      having IO completed on it and it wasn't in the AIL, hence tripping
      asserts xfs_ail_check(). Inode cluster writeback started long after
      the filesystem shutdown started, and long after the transaction
      containing the dirty inode was aborted and the log item marked
      XFS_LI_ABORTED. The inode was seen as dirty and unpinned, so it
      was flushed. IO completion tried to remove the inode from the AIL,
      at which point stuff went bad:
      
       XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/xfs/xfs_fsops.c:500).  Shutting down filesystem.
       XFS: Assertion failed: in_ail, file: fs/xfs/xfs_trans_ail.c, line: 67
       XFS (pmem1): Please unmount the filesystem and rectify the problem(s)
       Workqueue: xfs-buf/pmem1 xfs_buf_ioend_work
       RIP: 0010:assfail+0x27/0x2d
       Call Trace:
        <TASK>
        xfs_ail_check+0xa8/0x180
        xfs_ail_delete_one+0x3b/0xf0
        xfs_buf_inode_iodone+0x329/0x3f0
        xfs_buf_ioend+0x1f8/0x530
        xfs_buf_ioend_work+0x15/0x20
        process_one_work+0x1ac/0x390
        worker_thread+0x56/0x3c0
        kthread+0xf6/0x120
        ret_from_fork+0x1f/0x30
        </TASK>
      
      xfs_trans_commit() needs to check log state for shutdown, not mount
      state. It cannot abort dirty log items while the log is still
      running as dirty items must remained pinned in memory until they are
      either committed to the journal or the log has shut down and they
      can be safely tossed away. Hence if the log has not shut down, the
      xfs_trans_commit() path must allow completed transactions to commit
      to the CIL and pin the dirty items even if a mount shutdown has
      started.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      3c4cb76b
  6. 20 3月, 2022 2 次提交
  7. 15 3月, 2022 1 次提交
    • D
      xfs: reserve quota for dir expansion when linking/unlinking files · 871b9316
      Darrick J. Wong 提交于
      XFS does not reserve quota for directory expansion when linking or
      unlinking children from a directory.  This means that we don't reject
      the expansion with EDQUOT when we're at or near a hard limit, which
      means that unprivileged userspace can use link()/unlink() to exceed
      quota.
      
      The fix for this is nuanced -- link operations don't always expand the
      directory, and we allow a link to proceed with no space reservation if
      we don't need to add a block to the directory to handle the addition.
      Unlink operations generally do not expand the directory (you'd have to
      free a block and then cause a btree split) and we can defer the
      directory block freeing if there is no space reservation.
      
      Moreover, there is a further bug in that we do not trigger the blockgc
      workers to try to clear space when we're out of quota.
      
      To fix both cases, create a new xfs_trans_alloc_dir function that
      allocates the transaction, locks and joins the inodes, and reserves
      quota for the directory.  If there isn't sufficient space or quota,
      we'll switch the caller to reservationless mode.  This should prevent
      quota usage overruns with the least restriction in functionality.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      871b9316
  8. 22 12月, 2021 1 次提交
    • D
      xfs: shut down filesystem if we xfs_trans_cancel with deferred work items · 47a6df7c
      Darrick J. Wong 提交于
      While debugging some very strange rmap corruption reports in connection
      with the online directory repair code.  I root-caused the error to the
      following incorrect sequence:
      
      <start repair transaction>
      <expand directory, causing a deferred rmap to be queued>
      <roll transaction>
      <cancel transaction>
      
      Obviously, we should have committed the transaction instead of
      cancelling it.  Thinking more broadly, however, xfs_trans_cancel should
      have warned us that we were throwing away work item that we already
      committed to performing.  This is not correct, and we need to shut down
      the filesystem.
      
      Change xfs_trans_cancel to complain in the loudest manner if we're
      cancelling any transaction with deferred work items attached.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      47a6df7c
  9. 23 10月, 2021 2 次提交
  10. 15 10月, 2021 1 次提交
  11. 20 8月, 2021 2 次提交
  12. 17 8月, 2021 2 次提交
    • D
      xfs: AIL needs asynchronous CIL forcing · 0020a190
      Dave Chinner 提交于
      The AIL pushing is stalling on log forces when it comes across
      pinned items. This is happening on removal workloads where the AIL
      is dominated by stale items that are removed from AIL when the
      checkpoint that marks the items stale is committed to the journal.
      This results is relatively few items in the AIL, but those that are
      are often pinned as directories items are being removed from are
      still being logged.
      
      As a result, many push cycles through the CIL will first issue a
      blocking log force to unpin the items. This can take some time to
      complete, with tracing regularly showing push delays of half a
      second and sometimes up into the range of several seconds. Sequences
      like this aren't uncommon:
      
      ....
       399.829437:  xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20
      <wanted 20ms, got 270ms delay>
       400.099622:  xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0
       400.099623:  xfsaild: first lsn 0x11002f3600
       400.099679:  xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50
      <wanted 50ms, got 500ms delay>
       400.589348:  xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0
       400.589349:  xfsaild: first lsn 0x1100305000
       400.589595:  xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50
      <wanted 50ms, got 460ms delay>
       400.950341:  xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0
       400.950343:  xfsaild: first lsn 0x1100317c00
       400.950436:  xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20
      <wanted 20ms, got 200ms delay>
       401.142333:  xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0
       401.142334:  xfsaild: first lsn 0x110032e600
       401.142535:  xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10
      <wanted 10ms, got 10ms delay>
       401.154323:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000
       401.154328:  xfsaild: first lsn 0x1100353000
       401.154389:  xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20
      <wanted 20ms, got 300ms delay>
       401.451525:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
       401.451526:  xfsaild: first lsn 0x1100353000
       401.451804:  xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50
      <wanted 50ms, got 500ms delay>
       401.933581:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
      ....
      
      In each of these cases, every AIL pass saw 101 log items stuck on
      the AIL (pinned) with very few other items being found. Each pass, a
      log force was issued, and delay between last/first is the sleep time
      + the sync log force time.
      
      Some of these 101 items pinned the tail of the log. The tail of the
      log does slowly creep forward (first lsn), but the problem is that
      the log is actually out of reservation space because it's been
      running so many transactions that stale items that never reach the
      AIL but consume log space. Hence we have a largely empty AIL, with
      long term pins on items that pin the tail of the log that don't get
      pushed frequently enough to keep log space available.
      
      The problem is the hundreds of milliseconds that we block in the log
      force pushing the CIL out to disk. The AIL should not be stalled
      like this - it needs to run and flush items that are at the tail of
      the log with minimal latency. What we really need to do is trigger a
      log flush, but then not wait for it at all - we've already done our
      waiting for stuff to complete when we backed off prior to the log
      force being issued.
      
      Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
      still do a blocking flush of the CIL and that is what is causing the
      issue. Hence we need a new interface for the CIL to trigger an
      immediate background push of the CIL to get it moving faster but not
      to wait on that to occur. While the CIL is pushing, the AIL can also
      be pushing.
      
      We already have an internal interface to do this -
      xlog_cil_push_now() - but we need a wrapper for it to be used
      externally. xlog_cil_force_seq() can easily be extended to do what
      we need as it already implements the synchronous CIL push via
      xlog_cil_push_now(). Add the necessary flags and "push current
      sequence" semantics to xlog_cil_force_seq() and convert the AIL
      pushing to use it.
      
      One of the complexities here is that the CIL push does not guarantee
      that the commit record for the CIL checkpoint is written to disk.
      The current log force ensures this by submitting the current ACTIVE
      iclog that the commit record was written to. We need the CIL to
      actually write this commit record to disk for an async push to
      ensure that the checkpoint actually makes it to disk and unpins the
      pinned items in the checkpoint on completion. Hence we need to pass
      down to the CIL push that we are doing an async flush so that it can
      switch out the commit_iclog if necessary to get written to disk when
      the commit iclog is finally released.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NAllison Henderson <allison.henderson@oracle.com>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      0020a190
    • D
      xfs: convert XLOG_FORCED_SHUTDOWN() to xlog_is_shutdown() · 2039a272
      Dave Chinner 提交于
      Make it less shouty and a static inline before adding more calls
      through the log code.
      
      Also convert internal log code that uses XFS_FORCED_SHUTDOWN(mount)
      to use xlog_is_shutdown(log) as well.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      2039a272
  13. 10 8月, 2021 1 次提交
  14. 22 6月, 2021 1 次提交
    • D
      xfs: xfs_log_force_lsn isn't passed a LSN · 5f9b4b0d
      Dave Chinner 提交于
      In doing an investigation into AIL push stalls, I was looking at the
      log force code to see if an async CIL push could be done instead.
      This lead me to xfs_log_force_lsn() and looking at how it works.
      
      xfs_log_force_lsn() is only called from inode synchronisation
      contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn
      value as the LSN to sync the log to. This gets passed to
      xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the
      journal, and then used by xfs_log_force_lsn() to flush the iclogs to
      the journal.
      
      The problem is that ip->i_itemp->ili_last_lsn does not store a
      log sequence number. What it stores is passed to it from the
      ->iop_committing method, which is called by xfs_log_commit_cil().
      The value this passes to the iop_committing method is the CIL
      context sequence number that the item was committed to.
      
      As it turns out, xlog_cil_force_lsn() converts the sequence to an
      actual commit LSN for the related context and returns that to
      xfs_log_force_lsn(). xfs_log_force_lsn() overwrites it's "lsn"
      variable that contained a sequence with an actual LSN and then uses
      that to sync the iclogs.
      
      This caused me some confusion for a while, even though I originally
      wrote all this code a decade ago. ->iop_committing is only used by
      a couple of log item types, and only inode items use the sequence
      number it is passed.
      
      Let's clean up the API, CIL structures and inode log item to call it
      a sequence number, and make it clear that the high level code is
      using CIL sequence numbers and not on-disk LSNs for integrity
      synchronisation purposes.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NAllison Henderson <allison.henderson@oracle.com>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      5f9b4b0d
  15. 29 4月, 2021 2 次提交
    • D
      xfs: update superblock counters correctly for !lazysbcount · 6543990a
      Dave Chinner 提交于
      Keep the mount superblock counters up to date for !lazysbcount
      filesystems so that when we log the superblock they do not need
      updating in any way because they are already correct.
      
      It's found by what Zorro reported:
      1. mkfs.xfs -f -l lazy-count=0 -m crc=0 $dev
      2. mount $dev $mnt
      3. fsstress -d $mnt -p 100 -n 1000 (maybe need more or less io load)
      4. umount $mnt
      5. xfs_repair -n $dev
      and I've seen no problem with this patch.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reported-by: NZorro Lang <zlang@redhat.com>
      Reviewed-by: NGao Xiang <hsiangkao@redhat.com>
      Signed-off-by: NGao Xiang <hsiangkao@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      6543990a
    • D
      xfs: remove obsolete AGF counter debugging · 1aec7c3d
      Darrick J. Wong 提交于
      In commit f8f2835a we changed the behavior of XFS to use EFIs to
      remove blocks from an overfilled AGFL because there were complaints
      about transaction overruns that stemmed from trying to free multiple
      blocks in a single transaction.
      
      Unfortunately, that commit missed a subtlety in the debug-mode
      transaction accounting when a realtime volume is attached.  If a
      realtime file undergoes a data fork mapping change such that realtime
      extents are allocated (or freed) in the same transaction that a data
      device block is also allocated (or freed), we can trip a debugging
      assertion.  This can happen (for example) if a realtime extent is
      allocated and it is necessary to reshape the bmbt to hold the new
      mapping.
      
      When we go to allocate a bmbt block from an AG, the first thing the data
      device block allocator does is ensure that the freelist is the proper
      length.  If the freelist is too long, it will trim the freelist to the
      proper length.
      
      In debug mode, trimming the freelist calls xfs_trans_agflist_delta() to
      record the decrement in the AG free list count.  Prior to f8f28 we would
      put the free block back in the free space btrees in the same
      transaction, which calls xfs_trans_agblocks_delta() to record the
      increment in the AG free block count.  Since AGFL blocks are included in
      the global free block count (fdblocks), there is no corresponding
      fdblocks update, so the AGFL free satisfies the following condition in
      xfs_trans_apply_sb_deltas:
      
      	/*
      	 * Check that superblock mods match the mods made to AGF counters.
      	 */
      	ASSERT((tp->t_fdblocks_delta + tp->t_res_fdblocks_delta) ==
      	       (tp->t_ag_freeblks_delta + tp->t_ag_flist_delta +
      		tp->t_ag_btree_delta));
      
      The comparison here used to be: (X + 0) == ((X+1) + -1 + 0), where X is
      the number blocks that were allocated.
      
      After commit f8f28 we defer the block freeing to the next chained
      transaction, which means that the calls to xfs_trans_agflist_delta and
      xfs_trans_agblocks_delta occur in separate transactions.  The (first)
      transaction that shortens the free list trips on the comparison, which
      has now become:
      
      (X + 0) == ((X) + -1 + 0)
      
      because we haven't freed the AGFL block yet; we've only logged an
      intention to free it.  When the second transaction (the deferred free)
      commits, it will evaluate the expression as:
      
      (0 + 0) == (1 + 0 + 0)
      
      and trip over that in turn.
      
      At this point, the astute reader may note that the two commits tagged by
      this patch have been in the kernel for a long time but haven't generated
      any bug reports.  How is it that the author became aware of this bug?
      
      This originally surfaced as an intermittent failure when I was testing
      realtime rmap, but a different bug report by Zorro Lang reveals the same
      assertion occuring on !lazysbcount filesystems.
      
      The common factor to both reports (and why this problem wasn't
      previously reported) becomes apparent if we consider when
      xfs_trans_apply_sb_deltas is called by __xfs_trans_commit():
      
      	if (tp->t_flags & XFS_TRANS_SB_DIRTY)
      		xfs_trans_apply_sb_deltas(tp);
      
      With a modern lazysbcount filesystem, transactions update only the
      percpu counters, so they don't need to set XFS_TRANS_SB_DIRTY, hence
      xfs_trans_apply_sb_deltas is rarely called.
      
      However, updates to the count of free realtime extents are not part of
      lazysbcount, so XFS_TRANS_SB_DIRTY will be set on transactions adding or
      removing data fork mappings to realtime files; similarly,
      XFS_TRANS_SB_DIRTY is always set on !lazysbcount filesystems.
      
      Dave mentioned in response to an earlier version of this patch:
      
      "IIUC, what you are saying is that this debug code is simply not
      exercised in normal testing and hasn't been for the past decade?  And it
      still won't be exercised on anything other than realtime device testing?
      
      "...it was debugging code from 1994 that was largely turned into dead
      code when lazysbcounters were introduced in 2007. Hence I'm not sure it
      holds any value anymore."
      
      This debugging code isn't especially helpful - you can modify the
      flcount on one AG and the freeblks of another AG, and it won't trigger.
      Add the fact that nobody noticed for a decade, and let's just get rid of
      it (and start testing realtime :P).
      
      This bug was found by running generic/051 on either a V4 filesystem
      lacking lazysbcount; or a V5 filesystem with a realtime volume.
      
      Cc: bfoster@redhat.com, zlang@redhat.com
      Fixes: f8f2835a ("xfs: defer agfl block frees when dfops is available")
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      1aec7c3d
  16. 08 4月, 2021 1 次提交
  17. 26 3月, 2021 2 次提交
  18. 26 2月, 2021 2 次提交
    • D
      xfs: use current->journal_info for detecting transaction recursion · 756b1c34
      Dave Chinner 提交于
      Because the iomap code using PF_MEMALLOC_NOFS to detect transaction
      recursion in XFS is just wrong. Remove it from the iomap code and
      replace it with XFS specific internal checks using
      current->journal_info instead.
      
      [djwong: This change also realigns the lifetime of NOFS flag changes to
      match the incore transaction, instead of the inconsistent scheme we have
      now.]
      
      Fixes: 9070733b ("xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS")
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      756b1c34
    • D
      xfs: don't nest transactions when scanning for eofblocks · 9febcda6
      Darrick J. Wong 提交于
      Brian Foster reported a lockdep warning on xfs/167:
      
      ============================================
      WARNING: possible recursive locking detected
      5.11.0-rc4 #35 Tainted: G        W I
      --------------------------------------------
      fsstress/17733 is trying to acquire lock:
      ffff8e0fd1d90650 (sb_internal){++++}-{0:0}, at: xfs_free_eofblocks+0x104/0x1d0 [xfs]
      
      but task is already holding lock:
      ffff8e0fd1d90650 (sb_internal){++++}-{0:0}, at: xfs_trans_alloc_inode+0x5f/0x160 [xfs]
      
      stack backtrace:
      CPU: 38 PID: 17733 Comm: fsstress Tainted: G        W I       5.11.0-rc4 #35
      Hardware name: Dell Inc. PowerEdge R740/01KPX8, BIOS 1.6.11 11/20/2018
      Call Trace:
       dump_stack+0x8b/0xb0
       __lock_acquire.cold+0x159/0x2ab
       lock_acquire+0x116/0x370
       xfs_trans_alloc+0x1ad/0x310 [xfs]
       xfs_free_eofblocks+0x104/0x1d0 [xfs]
       xfs_blockgc_scan_inode+0x24/0x60 [xfs]
       xfs_inode_walk_ag+0x202/0x4b0 [xfs]
       xfs_inode_walk+0x66/0xc0 [xfs]
       xfs_trans_alloc+0x160/0x310 [xfs]
       xfs_trans_alloc_inode+0x5f/0x160 [xfs]
       xfs_alloc_file_space+0x105/0x300 [xfs]
       xfs_file_fallocate+0x270/0x460 [xfs]
       vfs_fallocate+0x14d/0x3d0
       __x64_sys_fallocate+0x3e/0x70
       do_syscall_64+0x33/0x40
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The cause of this is the new code that spurs a scan to garbage collect
      speculative preallocations if we fail to reserve enough blocks while
      allocating a transaction.  While the warning itself is a fairly benign
      lockdep complaint, it does expose a potential livelock if the rwsem
      behavior ever changes with regards to nesting read locks when someone's
      waiting for a write lock.
      
      Fix this by freeing the transaction and jumping back to xfs_trans_alloc
      like this patch in the V4 submission[1].
      
      [1] https://lore.kernel.org/linux-xfs/161142798066.2171939.9311024588681972086.stgit@magnolia/
      
      Fixes: a1a7d05a ("xfs: flush speculative space allocations when we run out of space")
      Reported-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      9febcda6
  19. 04 2月, 2021 9 次提交
  20. 17 12月, 2020 1 次提交
  21. 26 9月, 2020 1 次提交
  22. 16 9月, 2020 1 次提交
  23. 29 7月, 2020 1 次提交
  24. 07 7月, 2020 1 次提交
    • B
      xfs: preserve rmapbt swapext block reservation from freed blocks · f74681ba
      Brian Foster 提交于
      The rmapbt extent swap algorithm remaps individual extents between
      the source inode and the target to trigger reverse mapping metadata
      updates. If either inode straddles a format or other bmap allocation
      boundary, the individual unmap and map cycles can trigger repeated
      bmap block allocations and frees as the extent count bounces back
      and forth across the boundary. While net block usage is bound across
      the swap operation, this behavior can prematurely exhaust the
      transaction block reservation because it continuously drains as the
      transaction rolls. Each allocation accounts against the reservation
      and each free returns to global free space on transaction roll.
      
      The previous workaround to this problem attempted to detect this
      boundary condition and provide surplus block reservation to
      acommodate it. This is insufficient because more remaps can occur
      than implied by the extent counts; if start offset boundaries are
      not aligned between the two inodes, for example.
      
      To address this problem more generically and dynamically, add a
      transaction accounting mode that returns freed blocks to the
      transaction reservation instead of the superblock counters on
      transaction roll and use it when the rmapbt based algorithm is
      active. This allows the chain of remap transactions to preserve the
      block reservation based own its own frees and prevent premature
      exhaustion regardless of the remap pattern. Note that this is only
      safe for superblocks with lazy sb accounting, but the latter is
      required for v5 supers and the rmap feature depends on v5.
      
      Fixes: b3fed434 ("xfs: account format bouncing into rmapbt swapext tx reservation")
      Root-caused-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      f74681ba
  25. 27 5月, 2020 1 次提交
    • D
      xfs: remove the m_active_trans counter · b41b46c2
      Dave Chinner 提交于
      It's a global atomic counter, and we are hitting it at a rate of
      half a million transactions a second, so it's bouncing the counter
      cacheline all over the place on large machines. We don't actually
      need it anymore - it used to be required because the VFS freeze code
      could not track/prevent filesystem transactions that were running,
      but that problem no longer exists.
      
      Hence to remove the counter, we simply have to ensure that nothing
      calls xfs_sync_sb() while we are trying to quiesce the filesytem.
      That only happens if the log worker is still running when we call
      xfs_quiesce_attr(). The log worker is cancelled at the end of
      xfs_quiesce_attr() by calling xfs_log_quiesce(), so just call it
      early here and then we can remove the counter altogether.
      
      Concurrent create, 50 million inodes, identical 16p/16GB virtual
      machines on different physical hosts. Machine A has twice the CPU
      cores per socket of machine B:
      
      		unpatched	patched
      machine A:	3m16s		2m00s
      machine B:	4m04s		4m05s
      
      Create rates:
      		unpatched	patched
      machine A:	282k+/-31k	468k+/-21k
      machine B:	231k+/-8k	233k+/-11k
      
      Concurrent rm of same 50 million inodes:
      
      		unpatched	patched
      machine A:	6m42s		2m33s
      machine B:	4m47s		4m47s
      
      The transaction rate on the fast machine went from just under
      300k/sec to 700k/sec, which indicates just how much of a bottleneck
      this atomic counter was.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      b41b46c2