1. 31 3月, 2020 1 次提交
    • D
      xfs: ratelimit inode flush on buffered write ENOSPC · c6425702
      Darrick J. Wong 提交于
      A customer reported rcu stalls and softlockup warnings on a computer
      with many CPU cores and many many more IO threads trying to write to a
      filesystem that is totally out of space.  Subsequent analysis pointed to
      the many many IO threads calling xfs_flush_inodes -> sync_inodes_sb,
      which causes a lot of wb_writeback_work to be queued.  The writeback
      worker spends so much time trying to wake the many many threads waiting
      for writeback completion that it trips the softlockup detector, and (in
      this case) the system automatically reboots.
      
      In addition, they complain that the lengthy xfs_flush_inodes scan traps
      all of those threads in uninterruptible sleep, which hampers their
      ability to kill the program or do anything else to escape the situation.
      
      If there's thousands of threads trying to write to files on a full
      filesystem, each of those threads will start separate copies of the
      inode flush scan.  This is kind of pointless since we only need one
      scan, so rate limit the inode flush.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      c6425702
  2. 14 11月, 2019 3 次提交
  3. 11 11月, 2019 2 次提交
  4. 06 11月, 2019 2 次提交
  5. 30 10月, 2019 7 次提交
  6. 27 8月, 2019 1 次提交
    • D
      xfs: add kmem allocation trace points · 0ad95687
      Dave Chinner 提交于
      When trying to correlate XFS kernel allocations to memory reclaim
      behaviour, it is useful to know what allocations XFS is actually
      attempting. This information is not directly available from
      tracepoints in the generic memory allocation and reclaim
      tracepoints, so these new trace points provide a high level
      indication of what the XFS memory demand actually is.
      
      There is no per-filesystem context in this code, so we just trace
      the type of allocation, the size and the allocation constraints.
      The kmem code also doesn't include much of the common XFS headers,
      so there are a few definitions that need to be added to the trace
      headers and a couple of types that need to be made common to avoid
      needing to include the whole world in the kmem code.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      0ad95687
  7. 29 6月, 2019 1 次提交
  8. 12 6月, 2019 2 次提交
  9. 27 4月, 2019 1 次提交
  10. 17 4月, 2019 1 次提交
  11. 15 4月, 2019 2 次提交
  12. 21 2月, 2019 1 次提交
    • C
      xfs: introduce an always_cow mode · 66ae56a5
      Christoph Hellwig 提交于
      Add a mode where XFS never overwrites existing blocks in place.  This
      is to aid debugging our COW code, and also put infatructure in place
      for things like possible future support for zoned block devices, which
      can't support overwrites.
      
      This mode is enabled globally by doing a:
      
          echo 1 > /sys/fs/xfs/debug/always_cow
      
      Note that the parameter is global to allow running all tests in xfstests
      easily in this mode, which would not easily be possible with a per-fs
      sysfs file.
      
      In always_cow mode persistent preallocations are disabled, and fallocate
      will fail when called with a 0 mode (with our without
      FALLOC_FL_KEEP_SIZE), and not create unwritten extent for zeroed space
      when called with FALLOC_FL_ZERO_RANGE or FALLOC_FL_UNSHARE_RANGE.
      
      There are a few interesting xfstests failures when run in always_cow
      mode:
      
       - generic/392 fails because the bytes used in the file used to test
         hole punch recovery are less after the log replay.  This is
         because the blocks written and then punched out are only freed
         with a delay due to the logging mechanism.
       - xfs/170 will fail as the already fragile file streams mechanism
         doesn't seem to interact well with the COW allocator
       - xfs/180 xfs/182 xfs/192 xfs/198 xfs/204 and xfs/208 will claim
         the file system is badly fragmented, but there is not much we
         can do to avoid that when always writing out of place
       - xfs/205 fails because overwriting a file in always_cow mode
         will require new space allocation and the assumption in the
         test thus don't work anymore.
       - xfs/326 fails to modify the file at all in always_cow mode after
         injecting the refcount error, leading to an unexpected md5sum
         after the remount, but that again is expected
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      66ae56a5
  13. 15 2月, 2019 1 次提交
  14. 12 2月, 2019 1 次提交
    • D
      xfs: cache unlinked pointers in an rhashtable · 9b247179
      Darrick J. Wong 提交于
      Use a rhashtable to cache the unlinked list incore.  This should speed
      up unlinked processing considerably when there are a lot of inodes on
      the unlinked list because iunlink_remove no longer has to traverse an
      entire bucket list to find which inode points to the one being removed.
      
      The incore list structure records "X.next_unlinked = Y" relations, with
      the rhashtable using Y to index the records.  This makes finding the
      inode X that points to a inode Y very quick.  If our cache fails to find
      anything we can always fall back on the old method.
      
      FWIW this drastically reduces the amount of time it takes to remove
      inodes from the unlinked list.  I wrote a program to open a lot of
      O_TMPFILE files and then close them in the same order, which takes
      a very long time if we have to traverse the unlinked lists.  With the
      ptach, I see:
      
      + /d/t/tmpfile/tmpfile
      Opened 193531 files in 6.33s.
      Closed 193531 files in 5.86s
      
      real    0m12.192s
      user    0m0.064s
      sys     0m11.619s
      + cd /
      + umount /mnt
      
      real    0m0.050s
      user    0m0.004s
      sys     0m0.030s
      
      And without the patch:
      
      + /d/t/tmpfile/tmpfile
      Opened 193588 files in 6.35s.
      Closed 193588 files in 751.61s
      
      real    12m38.853s
      user    0m0.084s
      sys     12m34.470s
      + cd /
      + umount /mnt
      
      real    0m0.086s
      user    0m0.000s
      sys     0m0.060s
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      9b247179
  15. 13 12月, 2018 3 次提交
    • O
      xfs: cache minimum realtime summary level · 355e3532
      Omar Sandoval 提交于
      The realtime summary is a two-dimensional array on disk, effectively:
      
      u32 rsum[log2(number of realtime extents) + 1][number of blocks in the bitmap]
      
      rsum[log][bbno] is the number of extents of size 2**log which start in
      bitmap block bbno.
      
      xfs_rtallocate_extent_near() uses xfs_rtany_summary() to check whether
      rsum[log][bbno] != 0 for any log level. However, the summary array is
      stored in row-major order (i.e., like an array in C), so all of these
      entries are not adjacent, but rather spread across the entire summary
      file. In the worst case (a full bitmap block), xfs_rtany_summary() has
      to check every level.
      
      This means that on a moderately-used realtime device, an allocation will
      waste a lot of time finding, reading, and releasing buffers for the
      realtime summary. In particular, one of our storage services (which runs
      on servers with 8 very slow CPUs and 15 8 TB XFS realtime filesystems)
      spends almost 5% of its CPU cycles in xfs_rtbuf_get() and
      xfs_trans_brelse() called from xfs_rtany_summary().
      
      One solution would be to also store the summary with the dimensions
      swapped. However, this would require a disk format change to a very old
      component of XFS.
      
      Instead, we can cache the minimum size which contains any extents. We do
      so lazily; rather than guaranteeing that the cache contains the precise
      minimum, it always contains a loose lower bound which we tighten when we
      read or update a summary block. This only uses a few kilobytes of memory
      and is already serialized via the realtime bitmap and summary inode
      locks, so the cost is minimal. With this change, the same workload only
      spends 0.2% of its CPU cycles in the realtime allocator.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      355e3532
    • D
      xfs: precalculate cluster alignment in inodes and blocks · c1b4a321
      Darrick J. Wong 提交于
      Store the inode cluster alignment information in units of inodes and
      blocks in the mount data so that we don't have to keep recalculating
      them.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      c1b4a321
    • D
      xfs: precalculate inodes and blocks per inode cluster · 83dcdb44
      Darrick J. Wong 提交于
      Store the number of inodes and blocks per inode cluster in the mount
      data so that we don't have to keep recalculating them.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      83dcdb44
  16. 27 7月, 2018 1 次提交
  17. 24 7月, 2018 2 次提交
  18. 09 6月, 2018 1 次提交
  19. 07 6月, 2018 1 次提交
    • D
      xfs: convert to SPDX license tags · 0b61f8a4
      Dave Chinner 提交于
      Remove the verbose license text from XFS files and replace them
      with SPDX tags. This does not change the license of any of the code,
      merely refers to the common, up-to-date license files in LICENSES/
      
      This change was mostly scripted. fs/xfs/Makefile and
      fs/xfs/libxfs/xfs_fs.h were modified by hand, the rest were detected
      and modified by the following command:
      
      for f in `git grep -l "GNU General" fs/xfs/` ; do
      	echo $f
      	cat $f | awk -f hdr.awk > $f.new
      	mv -f $f.new $f
      done
      
      And the hdr.awk script that did the modification (including
      detecting the difference between GPL-2.0 and GPL-2.0+ licenses)
      is as follows:
      
      $ cat hdr.awk
      BEGIN {
      	hdr = 1.0
      	tag = "GPL-2.0"
      	str = ""
      }
      
      /^ \* This program is free software/ {
      	hdr = 2.0;
      	next
      }
      
      /any later version./ {
      	tag = "GPL-2.0+"
      	next
      }
      
      /^ \*\// {
      	if (hdr > 0.0) {
      		print "// SPDX-License-Identifier: " tag
      		print str
      		print $0
      		str=""
      		hdr = 0.0
      		next
      	}
      	print $0
      	next
      }
      
      /^ \* / {
      	if (hdr > 1.0)
      		next
      	if (hdr > 0.0) {
      		if (str != "")
      			str = str "\n"
      		str = str $0
      		next
      	}
      	print $0
      	next
      }
      
      /^ \*/ {
      	if (hdr > 0.0)
      		next
      	print $0
      	next
      }
      
      // {
      	if (hdr > 0.0) {
      		if (str != "")
      			str = str "\n"
      		str = str $0
      		next
      	}
      	print $0
      }
      
      END { }
      $
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      0b61f8a4
  20. 24 3月, 2018 1 次提交
    • B
      xfs: detect agfl count corruption and reset agfl · a27ba260
      Brian Foster 提交于
      The struct xfs_agfl v5 header was originally introduced with
      unexpected padding that caused the AGFL to operate with one less
      slot than intended. The header has since been packed, but the fix
      left an incompatibility for users who upgrade from an old kernel
      with the unpacked header to a newer kernel with the packed header
      while the AGFL happens to wrap around the end. The newer kernel
      recognizes one extra slot at the physical end of the AGFL that the
      previous kernel did not. The new kernel will eventually attempt to
      allocate a block from that slot, which contains invalid data, and
      cause a crash.
      
      This condition can be detected by comparing the active range of the
      AGFL to the count. While this detects a padding mismatch, it can
      also trigger false positives for unrelated flcount corruption. Since
      we cannot distinguish a size mismatch due to padding from unrelated
      corruption, we can't trust the AGFL enough to simply repopulate the
      empty slot.
      
      Instead, avoid unnecessarily complex detection logic and and use a
      solution that can handle any form of flcount corruption that slips
      through read verifiers: distrust the entire AGFL and reset it to an
      empty state. Any valid blocks within the AGFL are intentionally
      leaked. This requires xfs_repair to rectify (which was already
      necessary based on the state the AGFL was found in). The reset
      mitigates the side effect of the padding mismatch problem from a
      filesystem crash to a free space accounting inconsistency. The
      generic approach also means that this patch can be safely backported
      to kernels with or without a packed struct xfs_agfl.
      
      Check the AGF for an invalid freelist count on initial read from
      disk. If detected, set a flag on the xfs_perag to indicate that a
      reset is required before the AGFL can be used. In the first
      transaction that attempts to use a flagged AGFL, reset it to empty,
      warn the user about the inconsistency and allow the freelist fixup
      code to repopulate the AGFL with new blocks. The xfs_perag flag is
      cleared to eliminate the need for repeated checks on each block
      allocation operation.
      
      This allows kernels that include the packing fix commit 96f859d5
      ("libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct")
      to handle older unpacked AGFL formats without a filesystem crash.
      Suggested-by: NDave Chinner <david@fromorbit.com>
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by Dave Chiluk <chiluk+linuxxfs@indeed.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      a27ba260
  21. 12 3月, 2018 3 次提交
    • B
      xfs: account only rmapbt-used blocks against rmapbt perag res · 0ab32086
      Brian Foster 提交于
      The rmapbt perag metadata reservation reserves blocks for the
      reverse mapping btree (rmapbt). Since the rmapbt uses blocks from
      the agfl and perag accounting is updated as blocks are allocated
      from the allocation btrees, the reservation actually accounts blocks
      as they are allocated to (or freed from) the agfl rather than the
      rmapbt itself.
      
      While this works for blocks that are eventually used for the rmapbt,
      not all agfl blocks are destined for the rmapbt. Blocks that are
      allocated to the agfl (and thus "reserved" for the rmapbt) but then
      used by another structure leads to a growing inconsistency over time
      between the runtime tracking of rmapbt usage vs. actual rmapbt
      usage. Since the runtime tracking thinks all agfl blocks are rmapbt
      blocks, it essentially believes that less future reservation is
      required to satisfy the rmapbt than what is actually necessary.
      
      The inconsistency is rectified across mount cycles because the perag
      reservation is initialized based on the actual rmapbt usage at mount
      time. The problem, however, is that the excessive drain of the
      reservation at runtime opens a window to allocate blocks for other
      purposes that might be required for the rmapbt on a subsequent
      mount. This problem can be demonstrated by a simple test that runs
      an allocation workload to consume agfl blocks over time and then
      observe the difference in the agfl reservation requirement across an
      unmount/mount cycle:
      
        mount ...: xfs_ag_resv_init: ... resv 3193 ask 3194 len 3194
        ...
        ...      : xfs_ag_resv_alloc_extent: ... resv 2957 ask 3194 len 1
        umount...: xfs_ag_resv_free: ... resv 2956 ask 3194 len 0
        mount ...: xfs_ag_resv_init: ... resv 3052 ask 3194 len 3194
      
      As the above tracepoints show, the reservation requirement reduces
      from 3194 blocks to 2956 blocks as the workload runs.  Without any
      other changes in the filesystem, the same reservation requirement
      jumps from 2956 to 3052 blocks over a umount/mount cycle.
      
      To address this divergence, update the RMAPBT reservation to account
      blocks used for the rmapbt only rather than all blocks filled into
      the agfl. This patch makes several high-level changes toward that
      end:
      
      1.) Reintroduce an AGFL reservation type to serve as an accounting
          no-op for blocks allocated to (or freed from) the AGFL.
      2.) Invoke RMAPBT usage accounting from the actual rmapbt block
          allocation path rather than the AGFL allocation path.
      
      The first change is required because agfl blocks are considered free
      blocks throughout their lifetime. The perag reservation subsystem is
      invoked unconditionally by the allocation subsystem, so we need a
      way to tell the perag subsystem (via the allocation subsystem) to
      not make any accounting changes for blocks filled into the AGFL.
      
      The second change causes the in-core RMAPBT reservation usage
      accounting to remain consistent with the on-disk state at all times
      and eliminates the risk of leaving the rmapbt reservation
      underfilled.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      0ab32086
    • B
      xfs: rename agfl perag res type to rmapbt · 21592863
      Brian Foster 提交于
      The AGFL perag reservation type accounts all allocations that feed
      into (or are released from) the allocation group free list (agfl).
      The purpose of the reservation is to support worst case conditions
      for the reverse mapping btree (rmapbt). As such, the agfl
      reservation usage accounting only considers rmapbt usage when the
      in-core counters are initialized at mount time.
      
      This implementation inconsistency leads to divergence of the in-core
      and on-disk usage accounting over time. In preparation to resolve
      this inconsistency and adjust the AGFL reservation into an rmapbt
      specific reservation, rename the AGFL reservation type and
      associated accounting fields to something more rmapbt-specific. Also
      fix up a couple tracepoints that incorrectly use the AGFL
      reservation type to pass the agfl state of the associated extent
      where the raw reservation type is expected.
      
      Note that this patch does not change perag reservation behavior.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      21592863
    • E
      xfs: remove unused m_dmevmask from xfs_mount struct · 4603fa74
      Eric Sandeen 提交于
      The dmevmask structure member is a dmapi leftover; it's
      set here and there but never actually used.  Remove it.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NBill O'Donnell <billodo@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      4603fa74
  22. 28 6月, 2017 2 次提交