1. 27 Aug, 2019 1 commit
    • D
      xfs: add kmem allocation trace points · 0ad95687
      Dave Chinner authored
      When trying to correlate XFS kernel allocations to memory reclaim
      behaviour, it is useful to know what allocations XFS is actually
      attempting. This information is not directly available from
      the generic memory allocation and reclaim tracepoints, so
      these new trace points provide a high level
      indication of what the XFS memory demand actually is.
      
      There is no per-filesystem context in this code, so we just trace
      the type of allocation, the size and the allocation constraints.
      The kmem code also doesn't include much of the common XFS headers,
      so there are a few definitions that need to be added to the trace
      headers and a couple of types that need to be made common to avoid
      needing to include the whole world in the kmem code.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  2. 29 Jun, 2019 1 commit
  3. 12 Jun, 2019 2 commits
  4. 27 Apr, 2019 1 commit
  5. 17 Apr, 2019 1 commit
  6. 15 Apr, 2019 2 commits
  7. 21 Feb, 2019 1 commit
    • C
      xfs: introduce an always_cow mode · 66ae56a5
      Christoph Hellwig authored
      Add a mode where XFS never overwrites existing blocks in place.  This
      is to aid debugging our COW code, and also to put infrastructure in place
      for things like possible future support for zoned block devices, which
      can't support overwrites.
      
      This mode is enabled globally by doing a:
      
          echo 1 > /sys/fs/xfs/debug/always_cow
      
      Note that the parameter is global to allow running all tests in xfstests
      easily in this mode, which would not easily be possible with a per-fs
      sysfs file.
      
      In always_cow mode persistent preallocations are disabled, and fallocate
      will fail when called with a 0 mode (with or without
      FALLOC_FL_KEEP_SIZE), and will not create unwritten extents for zeroed
      space when called with FALLOC_FL_ZERO_RANGE or FALLOC_FL_UNSHARE_RANGE.
      
      There are a few interesting xfstests failures when run in always_cow
      mode:
      
       - generic/392 fails because the bytes used in the file used to test
         hole punch recovery are less after the log replay.  This is
         because the blocks written and then punched out are only freed
         with a delay due to the logging mechanism.
       - xfs/170 will fail as the already fragile file streams mechanism
         doesn't seem to interact well with the COW allocator
       - xfs/180 xfs/182 xfs/192 xfs/198 xfs/204 and xfs/208 will claim
         the file system is badly fragmented, but there is not much we
         can do to avoid that when always writing out of place
       - xfs/205 fails because overwriting a file in always_cow mode
         will require new space allocation and the assumptions in the
         test thus don't work anymore.
       - xfs/326 fails to modify the file at all in always_cow mode after
         injecting the refcount error, leading to an unexpected md5sum
         after the remount, but that again is expected
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  8. 15 Feb, 2019 1 commit
  9. 12 Feb, 2019 1 commit
    • D
      xfs: cache unlinked pointers in an rhashtable · 9b247179
      Darrick J. Wong authored
      Use a rhashtable to cache the unlinked list incore.  This should speed
      up unlinked processing considerably when there are a lot of inodes on
      the unlinked list because iunlink_remove no longer has to traverse an
      entire bucket list to find which inode points to the one being removed.
      
      The incore list structure records "X.next_unlinked = Y" relations, with
      the rhashtable using Y to index the records.  This makes finding the
      inode X that points to an inode Y very quick.  If our cache fails to find
      anything we can always fall back on the old method.
      
      FWIW this drastically reduces the amount of time it takes to remove
      inodes from the unlinked list.  I wrote a program to open a lot of
      O_TMPFILE files and then close them in the same order, which takes
      a very long time if we have to traverse the unlinked lists.  With the
      patch, I see:
      
      + /d/t/tmpfile/tmpfile
      Opened 193531 files in 6.33s.
      Closed 193531 files in 5.86s
      
      real    0m12.192s
      user    0m0.064s
      sys     0m11.619s
      + cd /
      + umount /mnt
      
      real    0m0.050s
      user    0m0.004s
      sys     0m0.030s
      
      And without the patch:
      
      + /d/t/tmpfile/tmpfile
      Opened 193588 files in 6.35s.
      Closed 193588 files in 751.61s
      
      real    12m38.853s
      user    0m0.084s
      sys     12m34.470s
      + cd /
      + umount /mnt
      
      real    0m0.086s
      user    0m0.000s
      sys     0m0.060s
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
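The "X.next_unlinked = Y" back-reference idea described in this commit can be pictured with a small userspace sketch. This is not the kernel's rhashtable code: the fixed-size open-addressing table and every name in it (`backref_cache`, `backref_insert`, `backref_lookup`, `BACKREF_SLOTS`) are hypothetical, and removal is left out to keep it short. The point it illustrates is only that records are keyed by Y, so finding the predecessor X is a hash lookup rather than a list walk, and that a miss simply means falling back to the old traversal.

```c
#include <stdbool.h>
#include <stdint.h>

#define BACKREF_SLOTS 1024u	/* hypothetical fixed capacity, power of two */

/* One cached relation: prev_ino.next_unlinked == next_ino. */
struct backref {
	bool     used;
	uint32_t next_ino;	/* key: Y */
	uint32_t prev_ino;	/* value: X */
};

struct backref_cache {
	struct backref slot[BACKREF_SLOTS];
};

/* Multiplicative hash; unsigned overflow wraps by definition. */
static uint32_t backref_hash(uint32_t key)
{
	return (key * 2654435761u) & (BACKREF_SLOTS - 1);
}

/* Record "prev_ino.next_unlinked = next_ino". Returns false if the table is full. */
static bool backref_insert(struct backref_cache *c, uint32_t prev_ino,
			   uint32_t next_ino)
{
	uint32_t h = backref_hash(next_ino);

	for (uint32_t i = 0; i < BACKREF_SLOTS; i++) {
		struct backref *b = &c->slot[(h + i) & (BACKREF_SLOTS - 1)];

		if (!b->used || b->next_ino == next_ino) {
			b->used = true;
			b->next_ino = next_ino;
			b->prev_ino = prev_ino;
			return true;
		}
	}
	return false;
}

/*
 * Find X such that X.next_unlinked == next_ino. A false return is a cache
 * miss; a caller would then fall back to walking the bucket list.
 */
static bool backref_lookup(const struct backref_cache *c, uint32_t next_ino,
			   uint32_t *prev_ino)
{
	uint32_t h = backref_hash(next_ino);

	for (uint32_t i = 0; i < BACKREF_SLOTS; i++) {
		const struct backref *b = &c->slot[(h + i) & (BACKREF_SLOTS - 1)];

		if (!b->used)
			return false;
		if (b->next_ino == next_ino) {
			*prev_ino = b->prev_ino;
			return true;
		}
	}
	return false;
}
```

With a zero-initialised cache, inserting the relations 100 -> 200 and 200 -> 300 lets a lookup keyed on 300 return 200 directly, while a lookup keyed on 100 (the list head, which nothing points to) misses.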
  10. 13 Dec, 2018 3 commits
    • O
      xfs: cache minimum realtime summary level · 355e3532
      Omar Sandoval authored
      The realtime summary is a two-dimensional array on disk, effectively:
      
      u32 rsum[log2(number of realtime extents) + 1][number of blocks in the bitmap]
      
      rsum[log][bbno] is the number of extents of size 2**log which start in
      bitmap block bbno.
      
      xfs_rtallocate_extent_near() uses xfs_rtany_summary() to check whether
      rsum[log][bbno] != 0 for any log level. However, the summary array is
      stored in row-major order (i.e., like an array in C), so all of these
      entries are not adjacent, but rather spread across the entire summary
      file. In the worst case (a full bitmap block), xfs_rtany_summary() has
      to check every level.
      
      This means that on a moderately-used realtime device, an allocation will
      waste a lot of time finding, reading, and releasing buffers for the
      realtime summary. In particular, one of our storage services (which runs
      on servers with 8 very slow CPUs and 15 8 TB XFS realtime filesystems)
      spends almost 5% of its CPU cycles in xfs_rtbuf_get() and
      xfs_trans_brelse() called from xfs_rtany_summary().
      
      One solution would be to also store the summary with the dimensions
      swapped. However, this would require a disk format change to a very old
      component of XFS.
      
      Instead, we can cache the minimum size which contains any extents. We do
      so lazily; rather than guaranteeing that the cache contains the precise
      minimum, it always contains a loose lower bound which we tighten when we
      read or update a summary block. This only uses a few kilobytes of memory
      and is already serialized via the realtime bitmap and summary inode
      locks, so the cost is minimal. With this change, the same workload only
      spends 0.2% of its CPU cycles in the realtime allocator.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
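The row-major layout argument in this commit can be made concrete with a short sketch. It assumes the summary is viewed as one flat array of `u32` counters with `bbcount` bitmap blocks per level; the helper names `rsum_index` and `rsum_lowest_level` are hypothetical and are not the kernel's functions. It shows that the entries for one bitmap block at successive levels sit `bbcount` entries apart, which is why an "any level?" scan touches widely separated parts of the summary file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Flat index of rsum[log][bbno] in a row-major array that has `bbcount`
 * entries (one per bitmap block) in each level's row.
 */
static size_t rsum_index(unsigned int log, size_t bbno, size_t bbcount)
{
	assert(bbno < bbcount);
	return (size_t)log * bbcount + bbno;
}

/*
 * Probe every level for one bitmap block and return the lowest level with
 * a non-zero count, or -1 if all are zero. Each probe lands `bbcount`
 * entries away from the previous one.
 */
static int rsum_lowest_level(const uint32_t *rsum, unsigned int levels,
			     size_t bbno, size_t bbcount)
{
	for (unsigned int log = 0; log < levels; log++) {
		if (rsum[rsum_index(log, bbno, bbcount)] != 0)
			return (int)log;
	}
	return -1;
}
```

For example, with 1000 bitmap blocks the entries for block 3 at levels 1 and 2 are at flat indexes 1003 and 2003. Remembering a lower bound on the first level that can be non-empty, as the commit describes, lets such a scan skip the levels below that bound.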
    • D
      xfs: precalculate cluster alignment in inodes and blocks · c1b4a321
      Darrick J. Wong authored
      Store the inode cluster alignment information in units of inodes and
      blocks in the mount data so that we don't have to keep recalculating
      them.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
    • D
      xfs: precalculate inodes and blocks per inode cluster · 83dcdb44
      Darrick J. Wong authored
      Store the number of inodes and blocks per inode cluster in the mount
      data so that we don't have to keep recalculating them.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
  11. 27 Jul, 2018 1 commit
  12. 24 Jul, 2018 2 commits
  13. 09 Jun, 2018 1 commit
  14. 07 Jun, 2018 1 commit
    • D
      xfs: convert to SPDX license tags · 0b61f8a4
      Dave Chinner authored
      Remove the verbose license text from XFS files and replace them
      with SPDX tags. This does not change the license of any of the code,
      merely refers to the common, up-to-date license files in LICENSES/
      
      This change was mostly scripted. fs/xfs/Makefile and
      fs/xfs/libxfs/xfs_fs.h were modified by hand, the rest were detected
      and modified by the following command:
      
      for f in `git grep -l "GNU General" fs/xfs/` ; do
      	echo $f
      	cat $f | awk -f hdr.awk > $f.new
      	mv -f $f.new $f
      done
      
      And the hdr.awk script that did the modification (including
      detecting the difference between GPL-2.0 and GPL-2.0+ licenses)
      is as follows:
      
      $ cat hdr.awk
      BEGIN {
      	hdr = 1.0
      	tag = "GPL-2.0"
      	str = ""
      }
      
      /^ \* This program is free software/ {
      	hdr = 2.0;
      	next
      }
      
      /any later version./ {
      	tag = "GPL-2.0+"
      	next
      }
      
      /^ \*\// {
      	if (hdr > 0.0) {
      		print "// SPDX-License-Identifier: " tag
      		print str
      		print $0
      		str=""
      		hdr = 0.0
      		next
      	}
      	print $0
      	next
      }
      
      /^ \* / {
      	if (hdr > 1.0)
      		next
      	if (hdr > 0.0) {
      		if (str != "")
      			str = str "\n"
      		str = str $0
      		next
      	}
      	print $0
      	next
      }
      
      /^ \*/ {
      	if (hdr > 0.0)
      		next
      	print $0
      	next
      }
      
      // {
      	if (hdr > 0.0) {
      		if (str != "")
      			str = str "\n"
      		str = str $0
      		next
      	}
      	print $0
      }
      
      END { }
      $
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  15. 24 Mar, 2018 1 commit
    • B
      xfs: detect agfl count corruption and reset agfl · a27ba260
      Brian Foster authored
      The struct xfs_agfl v5 header was originally introduced with
      unexpected padding that caused the AGFL to operate with one less
      slot than intended. The header has since been packed, but the fix
      left an incompatibility for users who upgrade from an old kernel
      with the unpacked header to a newer kernel with the packed header
      while the AGFL happens to wrap around the end. The newer kernel
      recognizes one extra slot at the physical end of the AGFL that the
      previous kernel did not. The new kernel will eventually attempt to
      allocate a block from that slot, which contains invalid data, and
      cause a crash.
      
      This condition can be detected by comparing the active range of the
      AGFL to the count. While this detects a padding mismatch, it can
      also trigger false positives for unrelated flcount corruption. Since
      we cannot distinguish a size mismatch due to padding from unrelated
      corruption, we can't trust the AGFL enough to simply repopulate the
      empty slot.
      
      Instead, avoid unnecessarily complex detection logic and use a
      solution that can handle any form of flcount corruption that slips
      through read verifiers: distrust the entire AGFL and reset it to an
      empty state. Any valid blocks within the AGFL are intentionally
      leaked. This requires xfs_repair to rectify (which was already
      necessary based on the state the AGFL was found in). The reset
      mitigates the side effect of the padding mismatch problem from a
      filesystem crash to a free space accounting inconsistency. The
      generic approach also means that this patch can be safely backported
      to kernels with or without a packed struct xfs_agfl.
      
      Check the AGF for an invalid freelist count on initial read from
      disk. If detected, set a flag on the xfs_perag to indicate that a
      reset is required before the AGFL can be used. In the first
      transaction that attempts to use a flagged AGFL, reset it to empty,
      warn the user about the inconsistency and allow the freelist fixup
      code to repopulate the AGFL with new blocks. The xfs_perag flag is
      cleared to eliminate the need for repeated checks on each block
      allocation operation.
      
      This allows kernels that include the packing fix commit 96f859d5
      ("libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct")
      to handle older unpacked AGFL formats without a filesystem crash.
      Suggested-by: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Dave Chiluk <chiluk+linuxxfs@indeed.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
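The detection described in this commit, comparing the active range of the AGFL to its count, can be sketched in userspace as follows. This is an illustration under stated assumptions, not the kernel's implementation: the function name `agfl_needs_reset` and its plain-integer parameters are hypothetical, `first` and `last` are slot indexes into a circular list of `size` slots, and `count` is the recorded number of entries.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true when the recorded count disagrees with the number of slots
 * spanned by [first, last] in a circular list of `size` slots, or when any
 * value is out of bounds for that size.
 */
static bool agfl_needs_reset(uint32_t first, uint32_t last, uint32_t count,
			     uint32_t size)
{
	uint32_t active;

	if (first >= size || last >= size || count > size)
		return true;

	if (count == 0)
		active = 0;
	else if (last >= first)
		active = last - first + 1;		/* no wrap */
	else
		active = size - first + last + 1;	/* wrapped past the end */

	return active != count;
}
```

Taking 118 and 119 slots as purely illustrative sizes: a wrapped list with `first = 116`, `last = 1`, `count = 4` is consistent when the list has 118 slots, but the same fields imply five active slots when read with 119 slots, so the mismatch is flagged. A list that does not wrap gives the same answer under both sizes, which matches the commit's note that the problem only appears while the AGFL wraps around the end.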
  16. 12 Mar, 2018 3 commits
    • B
      xfs: account only rmapbt-used blocks against rmapbt perag res · 0ab32086
      Brian Foster authored
      The rmapbt perag metadata reservation reserves blocks for the
      reverse mapping btree (rmapbt). Since the rmapbt uses blocks from
      the agfl and perag accounting is updated as blocks are allocated
      from the allocation btrees, the reservation actually accounts blocks
      as they are allocated to (or freed from) the agfl rather than the
      rmapbt itself.
      
      While this works for blocks that are eventually used for the rmapbt,
      not all agfl blocks are destined for the rmapbt. Blocks that are
      allocated to the agfl (and thus "reserved" for the rmapbt) but then
      used by another structure lead to a growing inconsistency over time
      between the runtime tracking of rmapbt usage vs. actual rmapbt
      usage. Since the runtime tracking thinks all agfl blocks are rmapbt
      blocks, it essentially believes that less future reservation is
      required to satisfy the rmapbt than what is actually necessary.
      
      The inconsistency is rectified across mount cycles because the perag
      reservation is initialized based on the actual rmapbt usage at mount
      time. The problem, however, is that the excessive drain of the
      reservation at runtime opens a window to allocate blocks for other
      purposes that might be required for the rmapbt on a subsequent
      mount. This problem can be demonstrated by a simple test that runs
      an allocation workload to consume agfl blocks over time and then
      observes the difference in the agfl reservation requirement across an
      unmount/mount cycle:
      
        mount ...: xfs_ag_resv_init: ... resv 3193 ask 3194 len 3194
        ...
        ...      : xfs_ag_resv_alloc_extent: ... resv 2957 ask 3194 len 1
        umount...: xfs_ag_resv_free: ... resv 2956 ask 3194 len 0
        mount ...: xfs_ag_resv_init: ... resv 3052 ask 3194 len 3194
      
      As the above tracepoints show, the reservation requirement reduces
      from 3194 blocks to 2956 blocks as the workload runs.  Without any
      other changes in the filesystem, the same reservation requirement
      jumps from 2956 to 3052 blocks over a umount/mount cycle.
      
      To address this divergence, update the RMAPBT reservation to account
      blocks used for the rmapbt only rather than all blocks filled into
      the agfl. This patch makes several high-level changes toward that
      end:
      
      1.) Reintroduce an AGFL reservation type to serve as an accounting
          no-op for blocks allocated to (or freed from) the AGFL.
      2.) Invoke RMAPBT usage accounting from the actual rmapbt block
          allocation path rather than the AGFL allocation path.
      
      The first change is required because agfl blocks are considered free
      blocks throughout their lifetime. The perag reservation subsystem is
      invoked unconditionally by the allocation subsystem, so we need a
      way to tell the perag subsystem (via the allocation subsystem) to
      not make any accounting changes for blocks filled into the AGFL.
      
      The second change causes the in-core RMAPBT reservation usage
      accounting to remain consistent with the on-disk state at all times
      and eliminates the risk of leaving the rmapbt reservation
      underfilled.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • B
      xfs: rename agfl perag res type to rmapbt · 21592863
      Brian Foster authored
      The AGFL perag reservation type accounts all allocations that feed
      into (or are released from) the allocation group free list (agfl).
      The purpose of the reservation is to support worst case conditions
      for the reverse mapping btree (rmapbt). As such, the agfl
      reservation usage accounting only considers rmapbt usage when the
      in-core counters are initialized at mount time.
      
      This implementation inconsistency leads to divergence of the in-core
      and on-disk usage accounting over time. In preparation to resolve
      this inconsistency and adjust the AGFL reservation into an rmapbt
      specific reservation, rename the AGFL reservation type and
      associated accounting fields to something more rmapbt-specific. Also
      fix up a couple tracepoints that incorrectly use the AGFL
      reservation type to pass the agfl state of the associated extent
      where the raw reservation type is expected.
      
      Note that this patch does not change perag reservation behavior.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • E
      xfs: remove unused m_dmevmask from xfs_mount struct · 4603fa74
      Eric Sandeen authored
      The dmevmask structure member is a dmapi leftover; it's
      set here and there but never actually used.  Remove it.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Bill O'Donnell <billodo@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  17. 28 Jun, 2017 3 commits
  18. 20 Jun, 2017 1 commit
    • D
      xfs: remove double-underscore integer types · c8ce540d
      Darrick J. Wong authored
      This is a purely mechanical patch that removes the private
      __{u,}int{8,16,32,64}_t typedefs in favor of using the system
      {u,}int{8,16,32,64}_t typedefs.  This is the sed script used to perform
      the transformation and fix the resulting whitespace and indentation
      errors:
      
      s/typedef\t__uint8_t/typedef __uint8_t\t/g
      s/typedef\t__uint/typedef __uint/g
      s/typedef\t__int\([0-9]*\)_t/typedef int\1_t\t/g
      s/__uint8_t\t/__uint8_t\t\t/g
      s/__uint/uint/g
      s/__int\([0-9]*\)_t\t/__int\1_t\t\t/g
      s/__int/int/g
      /^typedef.*int[0-9]*_t;$/d
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
  19. 26 Apr, 2017 1 commit
    • E
      xfs: more do_div cleanups · 4f1adf33
      Eric Sandeen authored
      On some architectures do_div does the pointer compare
      trick to make sure that we've sent it an unsigned 64-bit
      number.  (Why unsigned?  I don't know.)
      
      Fix up the few places that squawk about this; in
      xfs_bmap_wants_extents() we just used a bare int64_t so change
      that to unsigned.
      
      In xfs_adjust_extent_unmap_boundaries() all we wanted was the
      mod, and we have an xfs-specific function to handle that w/o
      side effects, which includes proper casting for do_div.
      
      In xfs_daddr_to_ag[b]no, we were using the wrong type anyway;
      XFS_BB_TO_FSBT returns a block in the filesystem, so use
      xfs_rfsblock_t not xfs_daddr_t, and gain the unsignedness
      from that type as a bonus.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  20. 04 Apr, 2017 1 commit
    • B
      xfs: use dedicated log worker wq to avoid deadlock with cil wq · 696a5620
      Brian Foster authored
      The log covering background task used to be part of the xfssyncd
      workqueue. That workqueue was removed as of commit 5889608d ("xfs:
      syncd workqueue is no more") and the associated work item scheduled
      to the xfs-log wq. The latter is used for log buffer I/O completion.
      
      Since xfs_log_worker() can invoke a log flush, a deadlock is
      possible between the xfs-log and xfs-cil workqueues. Consider the
      following codepath from xfs_log_worker():
      
      xfs_log_worker()
        xfs_log_force()
          _xfs_log_force()
            xlog_cil_force()
              xlog_cil_force_lsn()
                xlog_cil_push_now()
                  flush_work()
      
      The above is in xfs-log wq context and blocked waiting on the
      completion of an xfs-cil work item. Concurrently, the cil push in
      progress can end up blocked here:
      
      xlog_cil_push_work()
        xlog_cil_push()
          xlog_write()
            xlog_state_get_iclog_space()
              xlog_wait(&log->l_flush_wait, ...)
      
      The above is in xfs-cil context waiting on log buffer I/O
      completion, which executes in xfs-log wq context. In this scenario
      both workqueues are deadlocked waiting on each other.
      
      Add a new workqueue specifically for the high level log covering and
      ail pushing worker, as was the case prior to commit 5889608d.
      Diagnosed-by: David Jeffery <djeffery@redhat.com>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  21. 17 Feb, 2017 1 commit
    • B
      xfs: resurrect debug mode drop buffered writes mechanism · 9dbddd7b
      Brian Foster authored
      A debug mode write failure mechanism was introduced to XFS in commit
      801cc4e1 ("xfs: debug mode forced buffered write failure") to
      facilitate targeted testing of delalloc indirect reservation management
      from userspace. This code was subsequently rendered ineffective by the
      move to iomap based buffered writes in commit 68a9f5e7 ("xfs:
      implement iomap based buffered write path"). This likely went unnoticed
      because the associated userspace code had not made it into xfstests.
      
      Resurrect this mechanism to facilitate effective indlen reservation
      testing from xfstests. The move to iomap based buffered writes relocated
      the hook this mechanism needs to return write failure from XFS to
      generic code. The failure trigger must remain in XFS. Given that
      limitation, convert this from a write failure mechanism to one that
      simply drops writes without returning failure to userspace. Rename all
      "fail_writes" references to "drop_writes" to illustrate the point. This
      is more hacky than preferred, but still triggers the XFS error handling
      behavior required to drive the indlen tests. This is only available in
      DEBUG mode and for testing purposes only.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  22. 10 Feb, 2017 1 commit
  23. 25 Jan, 2017 1 commit
    • C
      xfs: use per-AG reservations for the finobt · 76d771b4
      Christoph Hellwig authored
      Currently we try to rely on the global reserved block pool for block
      allocations for the free inode btree, but I have customer reports
      (fairly complex workload, need to find an easier reproducer) where that
      is not enough as the AG where we free an inode that requires a new
      finobt block is entirely full.  This causes us to cancel a dirty
      transaction and thus a file system shutdown.
      
      I think the right way to guard against this is to treat the finobt the same
      way as the refcount btree and have a per-AG reservation for the possible
      worst case size of it, and the patch below implements that.
      
      Note that this could increase mount times with large finobt trees.  In
      an ideal world we would have added a field for the number of finobt
      fields to the AGI, similar to what we did for the refcount blocks.
      We should add it next time we rev the AGI or AGF format by adding
      new fields.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  24. 07 Dec, 2016 1 commit
    • L
      xfs: use rhashtable to track buffer cache · 6031e73a
      Lucas Stach authored
      On filesystems with a lot of metadata and in metadata intensive workloads
      xfs_buf_find() is showing up at the top of the CPU cycles trace. Most of
      the CPU time is spent on CPU cache misses while traversing the rbtree.
      
      As the buffer cache does not need any kind of ordering, only fast lookups,
      a hashtable is the natural data structure to use. The rhashtable
      infrastructure provides a self-scaling hashtable implementation and
      allows lookups to proceed while the table is going through a resize
      operation.
      
      This reduces the CPU-time spent for the lookups to 1/3 even for small
      filesystems with a relatively small number of cached buffers, with
      possibly much larger gains on higher loaded filesystems.
      
      [dchinner: reduce minimum hash size to an acceptable size for large
      	   filesystems with many AGs with no active use.]
      [dchinner: remove stale rbtree asserts.]
      [dchinner: use xfs_buf_map for compare function argument.]
      [dchinner: make functions static.]
      [dchinner: remove redundant comments.]
      Signed-off-by: Lucas Stach <dev@lynxeye.de>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  25. 06 Oct, 2016 1 commit
    • D
      xfs: garbage collect old cowextsz reservations · 83104d44
      Darrick J. Wong authored
      Trim CoW reservations made on behalf of a cowextsz hint if they get too
      old or we run low on quota, so long as we don't have dirty data awaiting
      writeback or directio operations in progress.
      
      Garbage collection of the cowextsize extents is kept separate from
      prealloc extent reaping because setting the CoW prealloc lifetime to a
      (much) higher value than the regular prealloc extent lifetime has been
      useful for combatting CoW fragmentation on VM hosts where the VMs
      experience bursty write behaviors and we can keep the utilization ratios
      low enough that we don't start to run out of space.  IOWs, it benefits
      us to keep the CoW fork reservations around for as long as we can unless
      we run out of blocks or hit inode reclaim.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
  26. 04 Oct, 2016 2 commits
  27. 19 Sep, 2016 1 commit
    • D
      xfs: set up per-AG free space reservations · 3fd129b6
      Darrick J. Wong authored
      One unfortunate quirk of the reference count and reverse mapping
      btrees -- they can expand in size when blocks are written to *other*
      allocation groups if, say, one large extent becomes a lot of tiny
      extents.  Since we don't want to start throwing errors in the middle
      of CoWing, we need to reserve some blocks to handle future expansion.
      The transaction block reservation counters aren't sufficient here
      because we have to have a reserve of blocks in every AG, not just
      somewhere in the filesystem.
      
      Therefore, create two per-AG block reservation pools.  One feeds the
      AGFL so that rmapbt expansion always succeeds, and the other feeds all
      other metadata so that refcountbt expansion never fails.
      
      Use the count of how many reserved blocks we need to have on hand to
      create a virtual reservation in the AG.  Through selective clamping of
      the maximum length of allocation requests and of the length of the
      longest free extent, we can make it look like there's less free space
      in the AG unless the reservation owner is asking for blocks.
      
      In other words, play some accounting tricks in-core to make sure that
      we always have blocks available.  On the plus side, there's nothing to
      clean up if we crash, which is in contrast to the strategy that the rough
      draft used (actually removing extents from the freespace btrees).
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
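The "virtual reservation" described in this commit, where clamping makes the AG look smaller to everyone except the reservation owner, might be pictured roughly like this. It is a deliberately simplified sketch and does not reproduce the kernel's actual calculation: the `ag_state` structure, its field names, and both helper functions are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical in-core view of one allocation group. */
struct ag_state {
	uint64_t free_blocks;	/* blocks really free in the AG */
	uint64_t longest_free;	/* length of the longest free extent */
	uint64_t reserved;	/* blocks still held back for metadata */
};

/*
 * Longest free extent a caller is allowed to see. Callers that do not own
 * the reservation see the AG as if the reserved blocks were already gone.
 */
static uint64_t ag_visible_longest(const struct ag_state *ag, bool is_owner)
{
	uint64_t hidden = is_owner ? 0 : ag->reserved;
	uint64_t avail = ag->free_blocks > hidden ?
			 ag->free_blocks - hidden : 0;

	return ag->longest_free < avail ? ag->longest_free : avail;
}

/* Clamp a requested allocation length to what the caller may see. */
static uint64_t ag_clamp_request(const struct ag_state *ag, uint64_t len,
				 bool is_owner)
{
	uint64_t longest = ag_visible_longest(ag, is_owner);

	return len < longest ? len : longest;
}
```

With 100 free blocks, a longest extent of 80 and 30 blocks reserved, an ordinary caller sees at most 70 blocks while the reservation owner still sees the full 80-block extent. Nothing is written to disk to achieve this, which is the property the commit highlights: there is nothing to clean up after a crash.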
  28. 14 Sep, 2016 1 commit
    • E
      xfs: normalize "infinite" retries in error configs · 77169812
      Eric Sandeen authored
      As it stands today, the "fail immediately" vs. "retry forever"
      values for max_retries and retry_timeout_seconds in the xfs metadata
      error configurations are not consistent.
      
      A retry_timeout_seconds of 0 means "retry forever," but a
      max_retries of 0 means "fail immediately."
      
      retry_timeout_seconds < 0 is disallowed, while max_retries == -1
      means "retry forever."
      
      Make this consistent across the error configs, such that a value of
      0 means "fail immediately" (i.e. wait 0 seconds, or retry 0 times),
      and a value of -1 always means "retry forever."
      
      This makes retry_timeout a signed long to accommodate the -1, even
      though it stores jiffies.  Given our limit of a 1 day maximum
      timeout, this should be sufficient even at much higher HZ values
      than we have available today.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
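The normalized semantics described in this commit (0 means "fail immediately", -1 means "retry forever", for both knobs) can be summarised in a short sketch. The function names, the `RETRY_FOREVER` constant, and the use of seconds rather than jiffies are assumptions made for illustration; this is not the kernel's error-handling code.

```c
#include <stdbool.h>

#define RETRY_FOREVER (-1)	/* hypothetical name for the -1 sentinel */

/* True if another attempt is allowed after `retries_done` retries. */
static bool retries_remain(int max_retries, int retries_done)
{
	if (max_retries == RETRY_FOREVER)
		return true;
	return retries_done < max_retries;	/* 0 => never retry */
}

/* True if the retry window is still open after `elapsed_secs`. */
static bool time_remains(long timeout_secs, long elapsed_secs)
{
	if (timeout_secs == RETRY_FOREVER)
		return true;
	return elapsed_secs < timeout_secs;	/* 0 => no time to wait */
}

/* Retry only while both limits allow it. */
static bool should_retry(int max_retries, int retries_done,
			 long timeout_secs, long elapsed_secs)
{
	return retries_remain(max_retries, retries_done) &&
	       time_remains(timeout_secs, elapsed_secs);
}
```

Under this reading, a value of 0 in either setting stops retries at once, and only -1 in both settings retries without limit. A signed type is needed for the timeout precisely because -1 has to be representable, as the commit notes.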
  29. 03 Aug, 2016 2 commits
    • D
      xfs: rmap btree requires more reserved free space · 52548852
      Darrick J. Wong authored
      Originally-From: Dave Chinner <dchinner@redhat.com>
      
      The rmap btree is allocated from the AGFL, which means we have to
      ensure ENOSPC is reported to userspace before we run out of free
      space in each AG. The last allocation in an AG can cause a full
      height rmap btree split, and that means we have to reserve at least
      this many blocks *in each AG* to be placed on the AGFL at ENOSPC.
      Update the various space calculation functions to handle this.
      
      Also, because the macros are now executing conditional code and are
      called quite frequently, convert them to functions that initialise
      variables in the struct xfs_mount, use the new variables everywhere
      and document the calculations better.
      
      [darrick.wong@oracle.com: don't reserve blocks if !rmap]
      [dchinner@redhat.com: update m_ag_max_usable after growfs]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • D
      xfs: define the on-disk rmap btree format · 035e00ac
      Darrick J. Wong authored
      Originally-From: Dave Chinner <dchinner@redhat.com>
      
      Now we have all the surrounding call infrastructure in place, we can
      start filling out the rmap btree implementation. Start with the
      on-disk btree format; add everything needed to read, write and
      manipulate rmap btree blocks. This prepares the way for adding the
      btree operations implementation.
      
      [darrick: record owner and offset info in rmap btree]
      [darrick: fork, bmbt and unwritten state in rmap btree]
      [darrick: flags are a separate field in xfs_rmap_irec]
      [darrick: calculate maxlevels separately]
      [darrick: move the 'unwritten' bit into unused parts of rm_offset]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>