1. 04 April 2023, 2 commits
    • xfs: don't include bnobt blocks when reserving free block pool · 36dce62e
      Committed by Darrick J. Wong
      mainline inclusion
      from mainline-v5.18-rc1
      commit c8c56825
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c8c568259772751a14e969b7230990508de73d9d
      
      --------------------------------
      
      xfs_reserve_blocks controls the size of the user-visible free space
      reserve pool.  Given the difference between the current and requested
      pool sizes, it will try to reserve free space from fdblocks.  However,
      the amount requested from fdblocks is also constrained by the amount of
      space that we think xfs_mod_fdblocks will give us.  If we forget to
      subtract m_allocbt_blks before calling xfs_mod_fdblocks, it will
      return ENOSPC and we'll hang the kernel at mount due to the infinite
      loop.
      
      In commit fd43cf60, we decided that xfs_mod_fdblocks should not hand
      out the "free space" used by the free space btrees, because some portion
      of the free space btrees hold in reserve space for future btree
      expansion.  Unfortunately, xfs_reserve_blocks' estimation of the number
      of blocks that it could request from xfs_mod_fdblocks was not updated to
      include m_allocbt_blks, so if space is extremely low, the caller hangs.
      
      Fix this by creating a function to estimate the number of blocks that
      can be reserved from fdblocks, which needs to exclude the set-aside and
      m_allocbt_blks.
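      
      A minimal sketch of such a helper, modeled on the mainline commit
      (the name xfs_fdblocks_unavailable and the m_allocbt_blks field come
      from upstream; treat this as an illustration, not the exact
      backported diff):
      
       /*
        * Estimate the free space that is not available to userspace:
        * the metadata set-aside plus the blocks currently tied up in
        * the free space btrees.
        */
       uint64_t
       xfs_fdblocks_unavailable(
       	struct xfs_mount	*mp)
       {
       	return mp->m_alloc_set_aside + atomic64_read(&mp->m_allocbt_blks);
       }
      
      xfs_reserve_blocks() can then cap its request at
      percpu_counter_sum(&mp->m_fdblocks) minus this value, so it never
      asks xfs_mod_fdblocks for space that would be refused.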
      
      Found by running xfs/306 (which formats a single-AG 20MB filesystem)
      with an fstests configuration that specifies a 1k blocksize and a
      specially crafted log size that will consume 7/8 of the space (17920
      blocks, specifically) in that AG.
      
      Cc: Brian Foster <bfoster@redhat.com>
      Fixes: fd43cf60 ("xfs: set aside allocation btree blocks from block reservation")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Conflicts:
      	fs/xfs/xfs_mount.h
      	[ 15f04fdc ("xfs: remove infinite loop when reserving
      	  free block pool") applied earlier. ]
      Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
      Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
      36dce62e
    • xfs: set aside allocation btree blocks from block reservation · d530ebb5
      Committed by Brian Foster
      mainline inclusion
      from mainline-v5.13-rc1
      commit fd43cf60
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fd43cf600cf61c66ae0a1021aca2f636115c7fcb
      
      --------------------------------
      
      The blocks used for allocation btrees (bnobt and countbt) are
      technically considered free space. This is because as free space is
      used, allocbt blocks are removed and naturally become available for
      traditional allocation. However, this means that a significant
      portion of free space may consist of in-use btree blocks if free
      space is severely fragmented.
      
      On large filesystems with large perag reservations, this can lead to
      a rare but nasty condition where a significant amount of physical
      free space is available, but the majority of actual usable blocks
      consist of in-use allocbt blocks. We have a record of a (~12TB, 32
      AG) filesystem with multiple AGs in a state with ~2.5GB or so free
      blocks tracked across ~300 total allocbt blocks, but effectively at
      100% full because the free space is entirely consumed by
      refcountbt perag reservation.
      
      Such a large perag reservation is by design on large filesystems.
      The problem is that because the free space is so fragmented, this AG
      contributes the 300 or so allocbt blocks to the global counters as
      free space. If this pattern repeats across enough AGs, the
      filesystem lands in a state where global block reservation can
      outrun physical block availability. For example, a streaming
      buffered write on the affected filesystem continues to allow delayed
      allocation beyond the point where writeback starts to fail due to
      physical block allocation failures. The expected behavior is for the
      delalloc block reservation to fail gracefully with -ENOSPC before
      physical block allocation failure is a possibility.
      
      To address this problem, set aside in-use allocbt blocks at
      reservation time and thus ensure they cannot be reserved until truly
      available for physical allocation. This allows alloc btree metadata
      to continue to reside in free space, but dynamically adjusts
      reservation availability based on internal state. Note that the
      logic requires that the allocbt counter is fully populated at
      reservation time before it is fully effective. We currently rely on
      the mount time AGF scan in the perag reservation initialization code
      for this dependency on filesystems where it's most important (i.e.
      with active perag reservations).
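      
      Conceptually, the reservation-time check folds the allocbt block
      count into the existing set-aside. A sketch of the comparison in
      xfs_mod_fdblocks(), simplified from the mainline patch:
      
       	/*
       	 * Blocks held in the allocation btrees are unavailable for
       	 * reservation, so add them to the set-aside that m_fdblocks
       	 * must stay above.
       	 */
       	set_aside = mp->m_alloc_set_aside + atomic64_read(&mp->m_allocbt_blks);
       	percpu_counter_add_batch(&mp->m_fdblocks, delta, batch);
       	if (__percpu_counter_compare(&mp->m_fdblocks, set_aside,
       				     XFS_FDBLOCKS_BATCH) >= 0) {
       		/* we had space! */
       		return 0;
       	}
      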
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
      Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
      d530ebb5
  2. 21 November 2022, 2 commits
    • xfs: fix sb write verify for lazysbcount · 902d1c12
      Committed by Long Li
      Offering: HULK
      hulk inclusion
      category: bugfix
      bugzilla: 186982, https://gitee.com/openeuler/kernel/issues/I4KIAO
      
      --------------------------------
      
      When lazysbcount is enabled, an fsstress and loop mount/unmount test
      reports the following problem:
      
      XFS (loop0): SB summary counter sanity check failed
      XFS (loop0): Metadata corruption detected at xfs_sb_write_verify+0x13b/0x460,
      	xfs_sb block 0x0
      XFS (loop0): Unmount and run xfs_repair
      XFS (loop0): First 128 bytes of corrupted metadata buffer:
      00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00  XFSB.........(..
      00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a  i.|._.D..t....4Z
      00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80  ..... ..........
      00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
      00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00  ................
      00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00  ................
      00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19  ................
      XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply
      	+0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580).  Shutting down filesystem.
      XFS (loop0): Please unmount the filesystem and rectify the problem(s)
      XFS (loop0): log mount/recovery failed: error -117
      XFS (loop0): log mount failed
      
      This corruption will shut down the file system, and the file system will
      no longer be mountable. The following script can reproduce the problem,
      but it may take a long time.
      
       #!/bin/bash
      
       device=/dev/sda
       testdir=/mnt/test
       round=0
      
       function fail()
       {
      	 echo "$*"
      	 exit 1
       }
      
       mkdir -p $testdir
       while [ $round -lt 10000 ]
       do
      	 echo "******* round $round ********"
      	 mkfs.xfs -f $device
      	 mount $device $testdir || fail "mount failed!"
      	 fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null &
      	 sleep 4
      	 killall -w fsstress
      	 umount $testdir
      	 xfs_repair -e $device > /dev/null
       	 if [ $? -eq 2 ]; then
      		 echo "ERR CODE 2: Dirty log exception during repair."
      		 exit 1
      	 fi
      	 round=$(($round+1))
       done
      
      With lazysbcount enabled, there is no additional lock protection for
      reading m_ifree and m_icount in xfs_log_sb(). If another CPU modifies
      m_ifree concurrently, the summed m_ifree can exceed the summed
      m_icount. For example, consider the following sequence, where
      ifreedelta is positive:
      
       CPU0				 CPU1
       xfs_log_sb			 xfs_trans_unreserve_and_mod_sb
       ----------			 ------------------------------
       percpu_counter_sum(&mp->m_icount)
      				 percpu_counter_add_batch(&mp->m_icount,
      						idelta, XFS_ICOUNT_BATCH)
      				 percpu_counter_add(&mp->m_ifree, ifreedelta);
       percpu_counter_sum(&mp->m_ifree)
      
      After this, an incorrect inode count (sb_ifree > sb_icount) is written
      to the log. When the superblock is subsequently written, the incorrect
      inode count fails the boundary check in xfs_validate_sb_write(), which
      shuts down the file system.
      
      When lazysbcount is enabled, we don't need to guarantee that lazy sb
      counters are completely correct, but we do need to guarantee that
      sb_ifree <= sb_icount. On the other hand, the constraint that m_ifree
      <= m_icount must be satisfied any time that there /cannot/ be other
      threads allocating or freeing inode chunks. If the constraint is
      violated under these circumstances, sb_i{count,free} (the ondisk
      superblock inode counters) may be incorrect; they need to be marked
      sick at unmount, and the counts will be rebuilt on the next mount.
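      
      A sketch of the resulting xfs_log_sb() change (following the
      upstream fix; the exact lazysbcount feature predicate differs
      slightly across kernel versions):
      
       	/*
       	 * Lazy sb counters don't need to be exact, but sb_ifree must
       	 * never exceed sb_icount or the write verifier will trip, so
       	 * clamp ifree to the icount sampled in the same call.
       	 */
       	if (xfs_sb_version_haslazysbcount(&mp->m_sb)) {
       		mp->m_sb.sb_icount = percpu_counter_sum(&mp->m_icount);
       		mp->m_sb.sb_ifree = min_t(uint64_t,
       				percpu_counter_sum(&mp->m_ifree),
       				mp->m_sb.sb_icount);
       	}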
      
      Fixes: 8756a5af ("libxfs: add more bounds checking to sb sanity checks")
      Signed-off-by: Long Li <leo.lilong@huawei.com>
      Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
      Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      902d1c12
    • xfs: only run COW extent recovery when there are no live extents · 8efeef76
      Committed by Darrick J. Wong
      mainline inclusion
      from mainline-v5.16-rc5
      commit 7993f1a4
      category: bugfix
      bugzilla: 186901, https://gitee.com/openeuler/kernel/issues/I4KIAO
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7993f1a431bc5271369d359941485a9340658ac3
      
      --------------------------------
      
      As part of multiple customer escalations due to file data corruption
      after copy on write operations, I wrote some fstests that use fsstress
      to hammer on COW to shake things loose.  Regrettably, I caught some
      filesystem shutdowns due to incorrect rmap operations with the following
      loop:
      
      mount <filesystem>				# (0)
      fsstress <run only readonly ops> &		# (1)
      while true; do
      	fsstress <run all ops>
      	mount -o remount,ro			# (2)
      	fsstress <run only readonly ops>
      	mount -o remount,rw			# (3)
      done
      
      When (2) happens, notice that (1) is still running.  xfs_remount_ro will
      call xfs_blockgc_stop to walk the inode cache to free all the COW
      extents, but the blockgc mechanism races with (1)'s reader threads to
      take IOLOCKs and loses, which means that it doesn't clean them all out.
      Call such a file (A).
      
      When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
      walks the ondisk refcount btree and frees any COW extent that it finds.
      This function does not check the inode cache, which means that the
      incore COW fork of inode (A) is now inconsistent with the ondisk
      metadata.  If one of those former COW extents is allocated and mapped
      into another file (B) and someone triggers a COW to the stale
      reservation in (A), A's
      dirty data will be written into (B) and once that's done, those blocks
      will be transferred to (A)'s data fork without bumping the refcount.
      
      The results are catastrophic -- file (B) and the refcount btree are now
      corrupt.  In the first patch, we fixed the race condition in (2) so that
      (A) will always flush the COW fork.  In this second patch, we move the
      _recover_cow call to the initial mount call in (0) for safety.
      
      As mentioned previously, xfs_reflink_recover_cow walks the refcount
      btree looking for COW staging extents, and frees them.  This was
      intended to be run at mount time (when we know there are no live inodes)
      to clean up any leftover staging events that may have been left behind
      during an unclean shutdown.  As a time "optimization" for readonly
      mounts, we deferred this to the ro->rw transition, not realizing that
      any failure to clean all COW forks during a rw->ro transition would
      result in catastrophic corruption.
      
      Therefore, remove this optimization and only run the recovery routine
      when we're guaranteed not to have any COW staging extents anywhere,
      which means we always run this at mount time.  While we're at it, move
      the callsite to xfs_log_mount_finish because any refcount btree
      expansion (however unlikely given that we're removing records from the
      right side of the index) must be fed by a per-AG reservation, which
      doesn't exist in its current location.
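      
      The mechanical change is small; roughly (error handling elided,
      call placement per the mainline commit):
      
       	/*
       	 * In xfs_log_mount_finish(), after log recovery has run and
       	 * before the filesystem goes live: no inode can hold an
       	 * incore COW extent yet, so freeing ondisk COW staging
       	 * extents is safe here, and the refcount btree work can be
       	 * fed by the per-AG reservations.
       	 */
       	error = xfs_reflink_recover_cow(mp);
      
      The corresponding call in the ro->rw remount path is removed.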
      
      Fixes: 174edb0e ("xfs: store in-progress CoW allocations in the refcount btree")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
      Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      8efeef76
  3. 29 September 2022, 1 commit
  4. 07 January 2022, 9 commits
  5. 27 December 2021, 1 commit
  6. 19 November 2020, 1 commit
  7. 16 September 2020, 1 commit
  8. 07 September 2020, 2 commits
  9. 07 July 2020, 2 commits
    • xfs: remove SYNC_WAIT from xfs_reclaim_inodes() · 4d0bab3a
      Committed by Dave Chinner
      Clean up xfs_reclaim_inodes() callers. Most callers want blocking
      behaviour, so just make the existing SYNC_WAIT behaviour the
      default.
      
      For the xfs_reclaim_worker(), just call xfs_reclaim_inodes_ag()
      directly because we just want optimistic clean inode reclaim to be
      done in the background.
      
      For xfs_quiesce_attr() we can just remove the inode reclaim calls as
      they are a historic relic that was required to flush dirty inodes
      that contained unlogged changes. We now log all changes to the
      inodes, so the sync AIL push from xfs_log_quiesce() called by
      xfs_quiesce_attr() will do all the required inode writeback for
      freeze.
      
      Seeing as we now want to loop until all reclaimable inodes have been
      reclaimed, make xfs_reclaim_inodes() loop on the XFS_ICI_RECLAIM_TAG
      tag rather than having xfs_reclaim_inodes_ag() tell it that inodes
      were skipped. This is much more reliable and will always loop until
      all reclaimable inodes are reclaimed.
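      
      Roughly, the function becomes (a sketch following the mainline
      commit; the xfs_reclaim_inodes_ag() argument shape varies across
      versions):
      
       void
       xfs_reclaim_inodes(
       	struct xfs_mount	*mp)
       {
       	int			nr_to_scan = INT_MAX;
      
       	/*
       	 * Loop on the reclaim tag itself rather than on a "skipped"
       	 * return value: push the AIL to write back dirty inodes,
       	 * then reclaim, until no AG carries XFS_ICI_RECLAIM_TAG.
       	 */
       	while (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_RECLAIM_TAG)) {
       		xfs_ail_push_all_sync(mp->m_ail);
       		xfs_reclaim_inodes_ag(mp, &nr_to_scan);
       	}
       }
      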
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      4d0bab3a
    • xfs: allow multiple reclaimers per AG · 0e8e2c63
      Committed by Dave Chinner
      Inode reclaim will still throttle direct reclaim on the per-ag
      reclaim locks. This is no longer necessary as reclaim can run
      non-blocking now. Hence we can remove these locks so that we don't
      arbitrarily block reclaimers just because there are more direct
      reclaimers than there are AGs.
      
      This can result in multiple reclaimers working on the same range of
      an AG, but this doesn't cause any apparent issues. Optimising the
      spread of concurrent reclaimers for best efficiency can be done in a
      future patchset.
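      
      For reference, the dropped serialization: direct reclaimers used to
      skip any AG whose reclaim mutex was already held, capping
      concurrency at one reclaimer per AG. A sketch of the removed
      pattern:
      
       	if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) {
       		/* AG busy: another reclaimer owns it, so skip it. */
       		skipped++;
       		xfs_perag_put(pag);
       		continue;
       	}
      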
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      0e8e2c63
  10. 27 May 2020, 1 commit
    • xfs: reduce free inode accounting overhead · f18c9a90
      Committed by Dave Chinner
      Shaokun Zhang reported that XFS was using substantial CPU time in
      percpu_counter_sum(), called from xfs_mod_ifree(), when running a
      single threaded benchmark on a high CPU count (128p) machine. The
      issue is that the filesystem is empty when the benchmark runs, so
      inode allocation is running with a very low inode free count.
      
      With the percpu counter batching, this means comparisons when the
      counter is less than 128 * 256 = 32768 use the slow path of adding
      up all the counters across the CPUs, and this is expensive on high
      CPU count machines.
      
      The summing in xfs_mod_ifree() is only used to fire an assert if an
      underrun occurs. The error is ignored by the higher level code.
      Hence this is really just debug code and we don't need to run it
      on production kernels, nor do we need such debug checks to return
      error values just to trigger an assert.
      
      Finally, xfs_mod_icount/xfs_mod_ifree are only called from
      xfs_trans_unreserve_and_mod_sb(), so get rid of them and just
      directly call the percpu_counter_add/percpu_counter_compare
      functions. The compare functions are now run only on debug builds as
      they are internal to ASSERT() checks and so only compiled in when
      ASSERTs are active (CONFIG_XFS_DEBUG=y or CONFIG_XFS_WARN=y).
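      
      After the change, xfs_trans_unreserve_and_mod_sb() applies the
      deltas directly; a sketch per the mainline commit (the ASSERT()
      comparisons compile away unless debug checks are built in):
      
       	if (idelta)
       		percpu_counter_add_batch(&mp->m_icount, idelta,
       					 XFS_ICOUNT_BATCH);
       	if (ifreedelta)
       		percpu_counter_add(&mp->m_ifree, ifreedelta);
      
       	/* Underruns indicate a bug; only debug builds pay for the
       	 * expensive cross-CPU summation these comparisons may do. */
       	ASSERT(__percpu_counter_compare(&mp->m_icount, 0,
       					XFS_ICOUNT_BATCH) >= 0);
       	ASSERT(percpu_counter_compare(&mp->m_ifree, 0) >= 0);
      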
      Reported-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      f18c9a90
  11. 05 May 2020, 1 commit
  12. 12 March 2020, 1 commit
  13. 19 December 2019, 2 commits
    • xfs: don't commit sunit/swidth updates to disk if that would cause repair failures · 13eaec4b
      Committed by Darrick J. Wong
      Alex Lyakas reported[1] that mounting an xfs filesystem with new sunit
      and swidth values could cause xfs_repair to fail loudly.  The problem
      here is that repair calculates where mkfs should have allocated the
      root inode, based on the superblock geometry.  The allocation decisions
      depend on sunit, which means that we really can't go updating sunit if
      it would lead to a subsequent repair failure on an otherwise correct
      filesystem.
      
      Port from xfs_repair some code that computes the location of the root
      inode and teach mount to skip the ondisk update if it would cause
      problems for repair.  Along the way we'll update the documentation,
      provide a function for computing the minimum AGFL size instead of
      open-coding it, and cut down some indenting in the mount code.
      
      Note that we allow the mount to proceed (and new allocations will
      reflect this new geometry) because we've never screened this kind of
      thing before.  We'll have to wait for a new future incompat feature to
      enforce correct behavior, alas.
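      
      A sketch of the mount-time screen (xfs_ialloc_calc_rootino is the
      helper name from the mainline patch; its body is abbreviated away
      here and the warning text is paraphrased):
      
       	/*
       	 * If the root inode isn't where mkfs would have put it under
       	 * the new sunit, keep the new geometry incore but skip the
       	 * ondisk sb_unit/sb_width update so repair stays happy.
       	 */
       	calc_ino = xfs_ialloc_calc_rootino(mp, new_dalign);
       	if (mp->m_sb.sb_rootino != calc_ino) {
       		xfs_warn(mp,
        "Cannot change stripe alignment; root inode would move. Not updating superblock.");
       		return 0;	/* mount proceeds; no ondisk update */
       	}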
      
      Note that the geometry reporting always uses the superblock values, not
      the incore ones, so that is what xfs_info and xfs_growfs will report.
      
      [1] https://lore.kernel.org/linux-xfs/20191125130744.GA44777@bfoster/T/#m00f9594b511e076e2fcdd489d78bc30216d72a7d
      Reported-by: Alex Lyakas <alex@zadara.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      13eaec4b
    • xfs: split the sunit parameter update into two parts · 4f5b1b3a
      Committed by Darrick J. Wong
      If the administrator provided a sunit= mount option, we need to validate
      the raw parameter, convert the mount option units (512b blocks) into the
      internal unit (fs blocks), and then validate that the (now cooked)
      parameter doesn't screw anything up on disk.  The incore inode geometry
      computation can depend on the new sunit option, but a subsequent patch
      will make validating the cooked value depend on the computed inode
      geometry, so break the sunit update into two steps.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      4f5b1b3a
  14. 14 November 2019, 1 commit
  15. 06 November 2019, 1 commit
  16. 30 October 2019, 4 commits
  17. 06 September 2019, 1 commit
  18. 27 August 2019, 1 commit
  19. 29 June 2019, 1 commit
  20. 12 June 2019, 4 commits
  21. 27 April 2019, 1 commit