1. 13 4月, 2022 1 次提交
  2. 12 4月, 2022 2 次提交
    • D
      xfs: use a separate frextents counter for rt extent reservations · 2229276c
      Darrick J. Wong 提交于
      As mentioned in the previous commit, the kernel misuses sb_frextents in
      the incore mount to reflect both incore reservations made by running
      transactions as well as the actual count of free rt extents on disk.
      This results in the superblock being written to the log with an
      underestimate of the number of rt extents that are marked free in the
      rtbitmap.
      
      Teaching XFS to recompute frextents after log recovery avoids
      operational problems in the current mount, but it doesn't solve the
      problem of us writing undercounted frextents which are then recovered by
      an older kernel that doesn't have that fix.
      
      Create an incore percpu counter to mirror the ondisk frextents.  This
      new counter will track transaction reservations and the only time we
      will touch the incore super counter (i.e the one that gets logged) is
      when those transactions commit updates to the rt bitmap.  This is in
      contrast to the lazysbcount counters (e.g. fdblocks), where we know that
      log recovery will always fix any incorrect counter that we log.
      As a bonus, we only take m_sb_lock at transaction commit time.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      2229276c
    • D
      xfs: recalculate free rt extents after log recovery · 5a605fd6
      Darrick J. Wong 提交于
      I've been observing periodic corruption reports from xfs_scrub involving
      the free rt extent counter (frextents) while running xfs/141.  That test
      uses an error injection knob to induce a torn write to the log, and an
      arbitrary number of recovery mounts, frextents will count fewer free rt
      extents than can be found the rtbitmap.
      
      The root cause of the problem is a combination of the misuse of
      sb_frextents in the incore mount to reflect both incore reservations
      made by running transactions as well as the actual count of free rt
      extents on disk.  The following sequence can reproduce the undercount:
      
      Thread 1			Thread 2
      xfs_trans_alloc(rtextents=3)
      xfs_mod_frextents(-3)
      <blocks>
      				xfs_attr_set()
      				xfs_bmap_attr_addfork()
      				xfs_add_attr2()
      				xfs_log_sb()
      				xfs_sb_to_disk()
      				xfs_trans_commit()
      <log flushed to disk>
      <log goes down>
      
      Note that thread 1 subtracts 3 from sb_frextents even though it never
      commits to using that space.  Thread 2 writes the undercounted value to
      the ondisk superblock and logs it to the xattr transaction, which is
      then flushed to disk.  At next mount, log recovery will find the logged
      superblock and write that back into the filesystem.  At the end of log
      recovery, we reread the superblock and install the recovered
      undercounted frextents value into the incore superblock.  From that
      point on, we've effectively leaked thread 1's transaction reservation.
      
      The correct fix for this is to separate the incore reservation from the
      ondisk usage, but that's a matter for the next patch.  Because the
      kernel has been logging superblocks with undercounted frextents for a
      very long time and we don't demand that sysadmins run xfs_repair after a
      crash, fix the undercount by recomputing frextents after log recovery.
      
      Gating this on log recovery is a reasonable balance (I think) between
      correcting the problem and slowing down every mount attempt.  Note that
      xfs_repair will fix undercounted frextents.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      5a605fd6
  3. 20 8月, 2021 2 次提交
    • D
      xfs: replace xfs_sb_version checks with feature flag checks · 38c26bfd
      Dave Chinner 提交于
      Convert the xfs_sb_version_hasfoo() to checks against
      mp->m_features. Checks of the superblock itself during disk
      operations (e.g. in the read/write verifiers and the to/from disk
      formatters) are not converted - they operate purely on the
      superblock state. Everything else should use the mount features.
      
      Large parts of this conversion were done with sed with commands like
      this:
      
      for f in `git grep -l xfs_sb_version_has fs/xfs/*.c`; do
      	sed -i -e 's/xfs_sb_version_has\(.*\)(&\(.*\)->m_sb)/xfs_has_\1(\2)/' $f
      done
      
      With manual cleanups for things like "xfs_has_extflgbit" and other
      little inconsistencies in naming.
      
      The result is ia lot less typing to check features and an XFS binary
      size reduced by a bit over 3kB:
      
      $ size -t fs/xfs/built-in.a
      	text	   data	    bss	    dec	    hex	filenam
      before	1130866  311352     484 1442702  16038e (TOTALS)
      after	1127727  311352     484 1439563  15f74b (TOTALS)
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      38c26bfd
    • D
      xfs: reflect sb features in xfs_mount · a1d86e8d
      Dave Chinner 提交于
      Currently on-disk feature checks require decoding the superblock
      fileds and so can be non-trivial. We have almost 400 hundred
      individual feature checks in the XFS code, so this is a significant
      amount of code. To reduce runtime check overhead, pre-process all
      the version flags into a features field in the xfs_mount at mount
      time so we can convert all the feature checks to a simple flag
      check.
      
      There is also a need to convert the dynamic feature flags to update
      the m_features field. This is required for attr, attr2 and quota
      features. New xfs_mount based wrappers are added for this.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      a1d86e8d
  4. 16 7月, 2021 2 次提交
  5. 08 4月, 2021 2 次提交
  6. 23 1月, 2021 1 次提交
  7. 17 12月, 2020 1 次提交
  8. 13 10月, 2020 3 次提交
    • D
      xfs: annotate grabbing the realtime bitmap/summary locks in growfs · ace74e79
      Darrick J. Wong 提交于
      Use XFS_ILOCK_RT{BITMAP,SUM} to annotate grabbing the rt bitmap and
      summary locks when we grow the realtime volume, just like we do most
      everywhere else.  This shuts up lockdep warnings about grabbing the
      ILOCK class of locks recursively:
      
      ============================================
      WARNING: possible recursive locking detected
      5.9.0-rc4-djw #rc4 Tainted: G           O
      --------------------------------------------
      xfs_growfs/4841 is trying to acquire lock:
      ffff888035acc230 (&xfs_nondir_ilock_class){++++}-{3:3}, at: xfs_ilock+0xac/0x1a0 [xfs]
      
      but task is already holding lock:
      ffff888035acedb0 (&xfs_nondir_ilock_class){++++}-{3:3}, at: xfs_ilock+0xac/0x1a0 [xfs]
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(&xfs_nondir_ilock_class);
        lock(&xfs_nondir_ilock_class);
      
       *** DEADLOCK ***
      
       May be due to missing lock nesting notation
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChandan Babu R <chandanrlinux@gmail.com>
      ace74e79
    • D
      xfs: make xfs_growfs_rt update secondary superblocks · 7249c95a
      Darrick J. Wong 提交于
      When we call growfs on the data device, we update the secondary
      superblocks to reflect the updated filesystem geometry.  We need to do
      this for growfs on the realtime volume too, because a future xfs_repair
      run could try to fix the filesystem using a backup superblock.
      
      This was observed by the online superblock scrubbers while running
      xfs/233.  One can also trigger this by growing an rt volume, cycling the
      mount, and creating new rt files.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChandan Babu R <chandanrlinux@gmail.com>
      7249c95a
    • D
      xfs: fix realtime bitmap/summary file truncation when growing rt volume · f4c32e87
      Darrick J. Wong 提交于
      The realtime bitmap and summary files are regular files that are hidden
      away from the directory tree.  Since they're regular files, inode
      inactivation will try to purge what it thinks are speculative
      preallocations beyond the incore size of the file.  Unfortunately,
      xfs_growfs_rt forgets to update the incore size when it resizes the
      inodes, with the result that inactivating the rt inodes at unmount time
      will cause their contents to be truncated.
      
      Fix this by updating the incore size when we change the ondisk size as
      part of updating the superblock.  Note that we don't do this when we're
      allocating blocks to the rt inodes because we actually want those blocks
      to get purged if the growfs fails.
      
      This fixes corruption complaints from the online rtsummary checker when
      running xfs/233.  Since that test requires rmap, one can also trigger
      this by growing an rt volume, cycling the mount, and creating rt files.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChandan Babu R <chandanrlinux@gmail.com>
      f4c32e87
  9. 23 9月, 2020 1 次提交
    • C
      xfs: Set xfs_buf's b_ops member when zeroing bitmap/summary files · c54e14d1
      Chandan Babu R 提交于
      In xfs_growfs_rt(), we enlarge bitmap and summary files by allocating
      new blocks for both files. For each of the new blocks allocated, we
      allocate an xfs_buf, zero the payload, log the contents and commit the
      transaction. Hence these buffers will eventually find themselves
      appended to list at xfs_ail->ail_buf_list.
      
      Later, xfs_growfs_rt() loops across all of the new blocks belonging to
      the bitmap inode to set the bitmap values to 1. In doing so, it
      allocates a new transaction and invokes the following sequence of
      functions,
        - xfs_rtfree_range()
          - xfs_rtmodify_range()
            - xfs_rtbuf_get()
              We pass '&xfs_rtbuf_ops' as the ops pointer to xfs_trans_read_buf().
              - xfs_trans_read_buf()
      	  We find the xfs_buf of interest in per-ag hash table, invoke
      	  xfs_buf_reverify() which ends up assigning '&xfs_rtbuf_ops' to
      	  xfs_buf->b_ops.
      
      On the other hand, if xfs_growfs_rt_alloc() had allocated a few blocks
      for the bitmap inode and returned with an error, all the xfs_bufs
      corresponding to the new bitmap blocks that have been allocated would
      continue to be on xfs_ail->ail_buf_list list without ever having a
      non-NULL value assigned to their b_ops members. An AIL flush operation
      would then trigger the following warning message to be printed on the
      console,
      
        XFS (loop0): _xfs_buf_ioapply: no buf ops on daddr 0x58 len 8
        00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        CPU: 3 PID: 449 Comm: xfsaild/loop0 Not tainted 5.8.0-rc4-chandan-00038-g4d8c2b9de9ab-dirty #37
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
        Call Trace:
         dump_stack+0x57/0x70
         _xfs_buf_ioapply+0x37c/0x3b0
         ? xfs_rw_bdev+0x1e0/0x1e0
         ? xfs_buf_delwri_submit_buffers+0xd4/0x210
         __xfs_buf_submit+0x6d/0x1f0
         xfs_buf_delwri_submit_buffers+0xd4/0x210
         xfsaild+0x2c8/0x9e0
         ? __switch_to_asm+0x42/0x70
         ? xfs_trans_ail_cursor_first+0x80/0x80
         kthread+0xfe/0x140
         ? kthread_park+0x90/0x90
         ret_from_fork+0x22/0x30
      
      This message indicates that the xfs_buf had its b_ops member set to
      NULL.
      
      This commit fixes the issue by assigning "&xfs_rtbuf_ops" to b_ops
      member of each of the xfs_bufs logged by xfs_growfs_rt_alloc().
      Signed-off-by: NChandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      c54e14d1
  10. 22 9月, 2020 1 次提交
    • C
      xfs: Set xfs_buf type flag when growing summary/bitmap files · 72cc9513
      Chandan Babu R 提交于
      The following sequence of commands,
      
        mkfs.xfs -f -m reflink=0 -r rtdev=/dev/loop1,size=10M /dev/loop0
        mount -o rtdev=/dev/loop1 /dev/loop0 /mnt
        xfs_growfs  /mnt
      
      ... causes the following call trace to be printed on the console,
      
      XFS: Assertion failed: (bip->bli_flags & XFS_BLI_STALE) || (xfs_blft_from_flags(&bip->__bli_format) > XFS_BLFT_UNKNOWN_BUF && xfs_blft_from_flags(&bip->__bli_format) < XFS_BLFT_MAX_BUF), file: fs/xfs/xfs_buf_item.c, line: 331
      Call Trace:
       xfs_buf_item_format+0x632/0x680
       ? kmem_alloc_large+0x29/0x90
       ? kmem_alloc+0x70/0x120
       ? xfs_log_commit_cil+0x132/0x940
       xfs_log_commit_cil+0x26f/0x940
       ? xfs_buf_item_init+0x1ad/0x240
       ? xfs_growfs_rt_alloc+0x1fc/0x280
       __xfs_trans_commit+0xac/0x370
       xfs_growfs_rt_alloc+0x1fc/0x280
       xfs_growfs_rt+0x1a0/0x5e0
       xfs_file_ioctl+0x3fd/0xc70
       ? selinux_file_ioctl+0x174/0x220
       ksys_ioctl+0x87/0xc0
       __x64_sys_ioctl+0x16/0x20
       do_syscall_64+0x3e/0x70
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This occurs because the buffer being formatted has the value of
      XFS_BLFT_UNKNOWN_BUF assigned to the 'type' subfield of
      bip->bli_formats->blf_flags.
      
      This commit fixes the issue by assigning one of XFS_BLFT_RTSUMMARY_BUF
      and XFS_BLFT_RTBITMAP_BUF to the 'type' subfield of
      bip->bli_formats->blf_flags before committing the corresponding
      transaction.
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NChandan Babu R <chandanrlinux@gmail.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      72cc9513
  11. 16 9月, 2020 2 次提交
  12. 27 1月, 2020 1 次提交
  13. 24 10月, 2019 1 次提交
    • B
      xfs: don't set bmapi total block req where minleft is · da781e64
      Brian Foster 提交于
      xfs_bmapi_write() takes a total block requirement parameter that is
      passed down to the block allocation code and is used to specify the
      total block requirement of the associated transaction. This is used
      to try and select an AG that can not only satisfy the requested
      extent allocation, but can also accommodate subsequent allocations
      that might be required to complete the transaction. For example,
      additional bmbt block allocations may be required on insertion of
      the resulting extent to an inode data fork.
      
      While it's important for callers to calculate and reserve such extra
      blocks in the transaction, it is not necessary to pass the total
      value to xfs_bmapi_write() in all cases. The latter automatically
      sets minleft to ensure that sufficient free blocks remain after the
      allocation attempt to expand the format of the associated inode
      (i.e., such as extent to btree conversion, btree splits, etc).
      Therefore, any callers that pass a total block requirement of the
      bmap mapping length plus worst case bmbt expansion essentially
      specify the additional reservation requirement twice. These callers
      can pass a total of zero to rely on the bmapi minleft policy.
      
      Beyond being superfluous, the primary motivation for this change is
      that the total reservation logic in the bmbt code is dubious in
      scenarios where minlen < maxlen and a maxlen extent cannot be
      allocated (which is more common for data extent allocations where
      contiguity is not required). The total value is based on maxlen in
      the xfs_bmapi_write() caller. If the bmbt code falls back to an
      allocation between minlen and maxlen, that allocation will not
      succeed until total is reset to minlen, which essentially throws
      away any additional reservation included in total by the caller. In
      addition, the total value is not reset until after alignment is
      dropped, which means that such callers drop alignment far too
      aggressively than necessary.
      
      Update all callers of xfs_bmapi_write() that pass a total block
      value of the mapping length plus bmbt reservation to instead pass
      zero and rely on xfs_bmapi_minleft() to enforce the bmbt reservation
      requirement. This trades off slightly less conservative AG selection
      for the ability to preserve alignment in more scenarios.
      xfs_bmapi_write() callers that incorporate unrelated or additional
      reservations in total beyond what is already included in minleft
      must continue to use the former.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      da781e64
  14. 27 8月, 2019 1 次提交
  15. 29 6月, 2019 1 次提交
  16. 22 12月, 2018 1 次提交
  17. 14 12月, 2018 1 次提交
  18. 13 12月, 2018 1 次提交
    • O
      xfs: cache minimum realtime summary level · 355e3532
      Omar Sandoval 提交于
      The realtime summary is a two-dimensional array on disk, effectively:
      
      u32 rsum[log2(number of realtime extents) + 1][number of blocks in the bitmap]
      
      rsum[log][bbno] is the number of extents of size 2**log which start in
      bitmap block bbno.
      
      xfs_rtallocate_extent_near() uses xfs_rtany_summary() to check whether
      rsum[log][bbno] != 0 for any log level. However, the summary array is
      stored in row-major order (i.e., like an array in C), so all of these
      entries are not adjacent, but rather spread across the entire summary
      file. In the worst case (a full bitmap block), xfs_rtany_summary() has
      to check every level.
      
      This means that on a moderately-used realtime device, an allocation will
      waste a lot of time finding, reading, and releasing buffers for the
      realtime summary. In particular, one of our storage services (which runs
      on servers with 8 very slow CPUs and 15 8 TB XFS realtime filesystems)
      spends almost 5% of its CPU cycles in xfs_rtbuf_get() and
      xfs_trans_brelse() called from xfs_rtany_summary().
      
      One solution would be to also store the summary with the dimensions
      swapped. However, this would require a disk format change to a very old
      component of XFS.
      
      Instead, we can cache the minimum size which contains any extents. We do
      so lazily; rather than guaranteeing that the cache contains the precise
      minimum, it always contains a loose lower bound which we tighten when we
      read or update a summary block. This only uses a few kilobytes of memory
      and is already serialized via the realtime bitmap and summary inode
      locks, so the cost is minimal. With this change, the same workload only
      spends 0.2% of its CPU cycles in the realtime allocator.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      355e3532
  19. 27 7月, 2018 2 次提交
  20. 12 7月, 2018 6 次提交
  21. 09 6月, 2018 1 次提交
  22. 07 6月, 2018 1 次提交
    • D
      xfs: convert to SPDX license tags · 0b61f8a4
      Dave Chinner 提交于
      Remove the verbose license text from XFS files and replace them
      with SPDX tags. This does not change the license of any of the code,
      merely refers to the common, up-to-date license files in LICENSES/
      
      This change was mostly scripted. fs/xfs/Makefile and
      fs/xfs/libxfs/xfs_fs.h were modified by hand, the rest were detected
      and modified by the following command:
      
      for f in `git grep -l "GNU General" fs/xfs/` ; do
      	echo $f
      	cat $f | awk -f hdr.awk > $f.new
      	mv -f $f.new $f
      done
      
      And the hdr.awk script that did the modification (including
      detecting the difference between GPL-2.0 and GPL-2.0+ licenses)
      is as follows:
      
      $ cat hdr.awk
      BEGIN {
      	hdr = 1.0
      	tag = "GPL-2.0"
      	str = ""
      }
      
      /^ \* This program is free software/ {
      	hdr = 2.0;
      	next
      }
      
      /any later version./ {
      	tag = "GPL-2.0+"
      	next
      }
      
      /^ \*\// {
      	if (hdr > 0.0) {
      		print "// SPDX-License-Identifier: " tag
      		print str
      		print $0
      		str=""
      		hdr = 0.0
      		next
      	}
      	print $0
      	next
      }
      
      /^ \* / {
      	if (hdr > 1.0)
      		next
      	if (hdr > 0.0) {
      		if (str != "")
      			str = str "\n"
      		str = str $0
      		next
      	}
      	print $0
      	next
      }
      
      /^ \*/ {
      	if (hdr > 0.0)
      		next
      	print $0
      	next
      }
      
      // {
      	if (hdr > 0.0) {
      		if (str != "")
      			str = str "\n"
      		str = str $0
      		next
      	}
      	print $0
      }
      
      END { }
      $
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      0b61f8a4
  23. 02 9月, 2017 1 次提交
  24. 20 6月, 2017 1 次提交
    • D
      xfs: remove double-underscore integer types · c8ce540d
      Darrick J. Wong 提交于
      This is a purely mechanical patch that removes the private
      __{u,}int{8,16,32,64}_t typedefs in favor of using the system
      {u,}int{8,16,32,64}_t typedefs.  This is the sed script used to perform
      the transformation and fix the resulting whitespace and indentation
      errors:
      
      s/typedef\t__uint8_t/typedef __uint8_t\t/g
      s/typedef\t__uint/typedef __uint/g
      s/typedef\t__int\([0-9]*\)_t/typedef int\1_t\t/g
      s/__uint8_t\t/__uint8_t\t\t/g
      s/__uint/uint/g
      s/__int\([0-9]*\)_t\t/__int\1_t\t\t/g
      s/__int/int/g
      /^typedef.*int[0-9]*_t;$/d
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      c8ce540d
  25. 18 2月, 2017 1 次提交
  26. 03 8月, 2016 2 次提交