1. 04 April 2023, 2 commits
    • xfs: don't include bnobt blocks when reserving free block pool · 36dce62e
      Committed by Darrick J. Wong
      mainline inclusion
      from mainline-v5.18-rc1
      commit c8c56825
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c8c568259772751a14e969b7230990508de73d9d
      
      --------------------------------
      
      xfs_reserve_blocks controls the size of the user-visible free space
      reserve pool.  Given the difference between the current and requested
      pool sizes, it will try to reserve free space from fdblocks.  However,
      the amount requested from fdblocks is also constrained by the amount of
      space that we think xfs_mod_fdblocks will give us.  If we forget to
      subtract m_allocbt_blks before calling xfs_mod_fdblocks, it will
      return ENOSPC and we'll hang the kernel at mount due to the infinite
      loop.
      
      In commit fd43cf60, we decided that xfs_mod_fdblocks should not hand
      out the "free space" used by the free space btrees, because some portion
      of the free space btrees hold in reserve space for future btree
      expansion.  Unfortunately, xfs_reserve_blocks' estimation of the number
      of blocks that it could request from xfs_mod_fdblocks was not updated to
      include m_allocbt_blks, so if space is extremely low, the caller hangs.
      
      Fix this by creating a function to estimate the number of blocks that
      can be reserved from fdblocks, which needs to exclude the set-aside and
      m_allocbt_blks.
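      
      A minimal sketch of such a helper, modeled on the mainline commit
      (the name xfs_fdblocks_unavailable and the m_allocbt_blks field come
      from upstream; treat this as an illustration, not the exact
      backported diff):
      
       /*
        * Estimate the free space that is not available to userspace:
        * the metadata set-aside plus the blocks currently tied up in
        * the free space btrees.
        */
       uint64_t
       xfs_fdblocks_unavailable(
       	struct xfs_mount	*mp)
       {
       	return mp->m_alloc_set_aside + atomic64_read(&mp->m_allocbt_blks);
       }
      
      xfs_reserve_blocks() can then cap its request at
      percpu_counter_sum(&mp->m_fdblocks) minus this value, so it never
      asks xfs_mod_fdblocks for space that would be refused.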
      
      Found by running xfs/306 (which formats a single-AG 20MB filesystem)
      with an fstests configuration that specifies a 1k blocksize and a
      specially crafted log size that will consume 7/8 of the space (17920
      blocks, specifically) in that AG.
      
      Cc: Brian Foster <bfoster@redhat.com>
      Fixes: fd43cf60 ("xfs: set aside allocation btree blocks from block reservation")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Conflicts:
      	fs/xfs/xfs_mount.h
      	[ 15f04fdc ("xfs: remove infinite loop when reserving
      	  free block pool") applied earlier. ]
      Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
      Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
      36dce62e
    • xfs: set aside allocation btree blocks from block reservation · d530ebb5
      Committed by Brian Foster
      mainline inclusion
      from mainline-v5.13-rc1
      commit fd43cf60
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fd43cf600cf61c66ae0a1021aca2f636115c7fcb
      
      --------------------------------
      
      The blocks used for allocation btrees (bnobt and countbt) are
      technically considered free space. This is because as free space is
      used, allocbt blocks are removed and naturally become available for
      traditional allocation. However, this means that a significant
      portion of free space may consist of in-use btree blocks if free
      space is severely fragmented.
      
      On large filesystems with large perag reservations, this can lead to
      a rare but nasty condition where a significant amount of physical
      free space is available, but the majority of actual usable blocks
      consist of in-use allocbt blocks. We have a record of a (~12TB, 32
      AG) filesystem with multiple AGs in a state with ~2.5GB or so free
      blocks tracked across ~300 total allocbt blocks, but effectively at
      100% full because the free space is entirely consumed by
      refcountbt perag reservation.
      
      Such a large perag reservation is by design on large filesystems.
      The problem is that because the free space is so fragmented, this AG
      contributes the 300 or so allocbt blocks to the global counters as
      free space. If this pattern repeats across enough AGs, the
      filesystem lands in a state where global block reservation can
      outrun physical block availability. For example, a streaming
      buffered write on the affected filesystem continues to allow delayed
      allocation beyond the point where writeback starts to fail due to
      physical block allocation failures. The expected behavior is for the
      delalloc block reservation to fail gracefully with -ENOSPC before
      physical block allocation failure is a possibility.
      
      To address this problem, set aside in-use allocbt blocks at
      reservation time and thus ensure they cannot be reserved until truly
      available for physical allocation. This allows alloc btree metadata
      to continue to reside in free space, but dynamically adjusts
      reservation availability based on internal state. Note that the
      logic requires that the allocbt counter is fully populated at
      reservation time before it is fully effective. We currently rely on
      the mount time AGF scan in the perag reservation initialization code
      for this dependency on filesystems where it's most important (i.e.
      with active perag reservations).
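      
      Conceptually, the reservation-time check folds the allocbt block
      count into the existing set-aside. A sketch of the comparison in
      xfs_mod_fdblocks(), simplified from the mainline patch:
      
       	/*
       	 * Blocks held in the allocation btrees are unavailable for
       	 * reservation, so add them to the set-aside that m_fdblocks
       	 * must stay above.
       	 */
       	set_aside = mp->m_alloc_set_aside + atomic64_read(&mp->m_allocbt_blks);
       	percpu_counter_add_batch(&mp->m_fdblocks, delta, batch);
       	if (__percpu_counter_compare(&mp->m_fdblocks, set_aside,
       				     XFS_FDBLOCKS_BATCH) >= 0) {
       		/* we had space! */
       		return 0;
       	}
      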
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
      Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
      d530ebb5
  2. 21 November 2022, 2 commits
    • xfs: fix sb write verify for lazysbcount · 902d1c12
      Committed by Long Li
      Offering: HULK
      hulk inclusion
      category: bugfix
      bugzilla: 186982, https://gitee.com/openeuler/kernel/issues/I4KIAO
      
      --------------------------------
      
      When lazysbcount is enabled, an fsstress and loop mount/unmount test
      reports the following problem:
      
      XFS (loop0): SB summary counter sanity check failed
      XFS (loop0): Metadata corruption detected at xfs_sb_write_verify+0x13b/0x460,
      	xfs_sb block 0x0
      XFS (loop0): Unmount and run xfs_repair
      XFS (loop0): First 128 bytes of corrupted metadata buffer:
      00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00  XFSB.........(..
      00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a  i.|._.D..t....4Z
      00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80  ..... ..........
      00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
      00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00  ................
      00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00  ................
      00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19  ................
      XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply
      	+0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580).  Shutting down filesystem.
      XFS (loop0): Please unmount the filesystem and rectify the problem(s)
      XFS (loop0): log mount/recovery failed: error -117
      XFS (loop0): log mount failed
      
      This corruption will shut down the file system, and the file system will
      no longer be mountable. The following script can reproduce the problem,
      but it may take a long time.
      
       #!/bin/bash
      
       device=/dev/sda
       testdir=/mnt/test
       round=0
      
       function fail()
       {
      	 echo "$*"
      	 exit 1
       }
      
       mkdir -p $testdir
       while [ $round -lt 10000 ]
       do
      	 echo "******* round $round ********"
      	 mkfs.xfs -f $device
      	 mount $device $testdir || fail "mount failed!"
      	 fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null &
      	 sleep 4
      	 killall -w fsstress
      	 umount $testdir
      	 xfs_repair -e $device > /dev/null
       	 if [ $? -eq 2 ]; then
      		 echo "ERR CODE 2: Dirty log exception during repair."
      		 exit 1
      	 fi
      	 round=$(($round+1))
       done
      
      With lazysbcount enabled, there is no additional lock protection for
      reading m_ifree and m_icount in xfs_log_sb(). If another CPU modifies
      m_ifree concurrently, the summed m_ifree can exceed the summed
      m_icount. For example, consider the following sequence, where
      ifreedelta is positive:
      
       CPU0				 CPU1
       xfs_log_sb			 xfs_trans_unreserve_and_mod_sb
       ----------			 ------------------------------
       percpu_counter_sum(&mp->m_icount)
      				 percpu_counter_add_batch(&mp->m_icount,
      						idelta, XFS_ICOUNT_BATCH)
      				 percpu_counter_add(&mp->m_ifree, ifreedelta);
       percpu_counter_sum(&mp->m_ifree)
      
      After this, an incorrect inode count (sb_ifree > sb_icount) is written
      to the log. When the superblock is subsequently written, the incorrect
      inode count fails the boundary check in xfs_validate_sb_write(), which
      shuts down the file system.
      
      When lazysbcount is enabled, we don't need to guarantee that lazy sb
      counters are completely correct, but we do need to guarantee that
      sb_ifree <= sb_icount. On the other hand, the constraint that m_ifree
      <= m_icount must be satisfied any time that there /cannot/ be other
      threads allocating or freeing inode chunks. If the constraint is
      violated under these circumstances, sb_i{count,free} (the ondisk
      superblock inode counters) may be incorrect; they need to be marked
      sick at unmount, and the counts will be rebuilt on the next mount.
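      
      A sketch of the resulting xfs_log_sb() change (following the
      upstream fix; the exact lazysbcount feature predicate differs
      slightly across kernel versions):
      
       	/*
       	 * Lazy sb counters don't need to be exact, but sb_ifree must
       	 * never exceed sb_icount or the write verifier will trip, so
       	 * clamp ifree to the icount sampled in the same call.
       	 */
       	if (xfs_sb_version_haslazysbcount(&mp->m_sb)) {
       		mp->m_sb.sb_icount = percpu_counter_sum(&mp->m_icount);
       		mp->m_sb.sb_ifree = min_t(uint64_t,
       				percpu_counter_sum(&mp->m_ifree),
       				mp->m_sb.sb_icount);
       	}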
      
      Fixes: 8756a5af ("libxfs: add more bounds checking to sb sanity checks")
      Signed-off-by: Long Li <leo.lilong@huawei.com>
      Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
      Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      902d1c12
    • xfs: only run COW extent recovery when there are no live extents · 8efeef76
      Committed by Darrick J. Wong
      mainline inclusion
      from mainline-v5.16-rc5
      commit 7993f1a4
      category: bugfix
      bugzilla: 186901, https://gitee.com/openeuler/kernel/issues/I4KIAO
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7993f1a431bc5271369d359941485a9340658ac3
      
      --------------------------------
      
      As part of multiple customer escalations due to file data corruption
      after copy on write operations, I wrote some fstests that use fsstress
      to hammer on COW to shake things loose.  Regrettably, I caught some
      filesystem shutdowns due to incorrect rmap operations with the following
      loop:
      
      mount <filesystem>				# (0)
      fsstress <run only readonly ops> &		# (1)
      while true; do
      	fsstress <run all ops>
      	mount -o remount,ro			# (2)
      	fsstress <run only readonly ops>
      	mount -o remount,rw			# (3)
      done
      
      When (2) happens, notice that (1) is still running.  xfs_remount_ro will
      call xfs_blockgc_stop to walk the inode cache to free all the COW
      extents, but the blockgc mechanism races with (1)'s reader threads to
      take IOLOCKs and loses, which means that it doesn't clean them all out.
      Call such a file (A).
      
      When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
      walks the ondisk refcount btree and frees any COW extent that it finds.
      This function does not check the inode cache, which means that the
      incore COW fork of inode (A) is now inconsistent with the ondisk
      metadata.  If one of those former COW extents is allocated and mapped
      into another file (B) and someone triggers a COW to the stale
      reservation in (A), A's
      dirty data will be written into (B) and once that's done, those blocks
      will be transferred to (A)'s data fork without bumping the refcount.
      
      The results are catastrophic -- file (B) and the refcount btree are now
      corrupt.  In the first patch, we fixed the race condition in (2) so that
      (A) will always flush the COW fork.  In this second patch, we move the
      _recover_cow call to the initial mount call in (0) for safety.
      
      As mentioned previously, xfs_reflink_recover_cow walks the refcount
      btree looking for COW staging extents, and frees them.  This was
      intended to be run at mount time (when we know there are no live inodes)
      to clean up any leftover staging events that may have been left behind
      during an unclean shutdown.  As a time "optimization" for readonly
      mounts, we deferred this to the ro->rw transition, not realizing that
      any failure to clean all COW forks during a rw->ro transition would
      result in catastrophic corruption.
      
      Therefore, remove this optimization and only run the recovery routine
      when we're guaranteed not to have any COW staging extents anywhere,
      which means we always run this at mount time.  While we're at it, move
      the callsite to xfs_log_mount_finish because any refcount btree
      expansion (however unlikely given that we're removing records from the
      right side of the index) must be fed by a per-AG reservation, which
      doesn't exist in its current location.
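      
      The mechanical change is small; roughly (error handling elided,
      call placement per the mainline commit):
      
       	/*
       	 * In xfs_log_mount_finish(), after log recovery has run and
       	 * before the filesystem goes live: no inode can hold an
       	 * incore COW extent yet, so freeing ondisk COW staging
       	 * extents is safe here, and the refcount btree work can be
       	 * fed by the per-AG reservations.
       	 */
       	error = xfs_reflink_recover_cow(mp);
      
      The corresponding call in the ro->rw remount path is removed.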
      
      Fixes: 174edb0e ("xfs: store in-progress CoW allocations in the refcount btree")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
      Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      8efeef76
  3. 29 September 2022, 1 commit
  4. 07 January 2022, 9 commits
  5. 27 December 2021, 1 commit
  6. 19 November 2020, 1 commit
  7. 16 September 2020, 1 commit
  8. 07 September 2020, 2 commits
  9. 07 July 2020, 2 commits
    • xfs: remove SYNC_WAIT from xfs_reclaim_inodes() · 4d0bab3a
      Committed by Dave Chinner
      Clean up xfs_reclaim_inodes() callers. Most callers want blocking
      behaviour, so just make the existing SYNC_WAIT behaviour the
      default.
      
      For the xfs_reclaim_worker(), just call xfs_reclaim_inodes_ag()
      directly because we just want optimistic clean inode reclaim to be
      done in the background.
      
      For xfs_quiesce_attr() we can just remove the inode reclaim calls as
      they are a historic relic that was required to flush dirty inodes
      that contained unlogged changes. We now log all changes to the
      inodes, so the sync AIL push from xfs_log_quiesce() called by
      xfs_quiesce_attr() will do all the required inode writeback for
      freeze.
      
      Seeing as we now want to loop until all reclaimable inodes have been
      reclaimed, make xfs_reclaim_inodes() loop on the XFS_ICI_RECLAIM_TAG
      tag rather than having xfs_reclaim_inodes_ag() tell it that inodes
      were skipped. This is much more reliable and will always loop until
      all reclaimable inodes are reclaimed.
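      
      Roughly, the function becomes (a sketch following the mainline
      commit; the xfs_reclaim_inodes_ag() argument shape varies across
      versions):
      
       void
       xfs_reclaim_inodes(
       	struct xfs_mount	*mp)
       {
       	int			nr_to_scan = INT_MAX;
      
       	/*
       	 * Loop on the reclaim tag itself rather than on a "skipped"
       	 * return value: push the AIL to write back dirty inodes,
       	 * then reclaim, until no AG carries XFS_ICI_RECLAIM_TAG.
       	 */
       	while (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_RECLAIM_TAG)) {
       		xfs_ail_push_all_sync(mp->m_ail);
       		xfs_reclaim_inodes_ag(mp, &nr_to_scan);
       	}
       }
      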
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      4d0bab3a
    • xfs: allow multiple reclaimers per AG · 0e8e2c63
      Committed by Dave Chinner
      Inode reclaim will still throttle direct reclaim on the per-ag
      reclaim locks. This is no longer necessary as reclaim can run
      non-blocking now. Hence we can remove these locks so that we don't
      arbitrarily block reclaimers just because there are more direct
      reclaimers than there are AGs.
      
      This can result in multiple reclaimers working on the same range of
      an AG, but this doesn't cause any apparent issues. Optimising the
      spread of concurrent reclaimers for best efficiency can be done in a
      future patchset.
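      
      For reference, the dropped serialization: direct reclaimers used to
      skip any AG whose reclaim mutex was already held, capping
      concurrency at one reclaimer per AG. A sketch of the removed
      pattern:
      
       	if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) {
       		/* AG busy: another reclaimer owns it, so skip it. */
       		skipped++;
       		xfs_perag_put(pag);
       		continue;
       	}
      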
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      0e8e2c63
  10. 27 May 2020, 1 commit
    • xfs: reduce free inode accounting overhead · f18c9a90
      Committed by Dave Chinner
      Shaokun Zhang reported that XFS was using substantial CPU time in
      percpu_counter_sum(), called from xfs_mod_ifree(), when running a
      single threaded benchmark on a high CPU count (128p) machine. The
      issue is that the filesystem is empty when the benchmark runs, so
      inode allocation is running with a very low inode free count.
      
      With the percpu counter batching, this means comparisons when the
      counter is less than 128 * 256 = 32768 use the slow path of adding
      up all the counters across the CPUs, and this is expensive on high
      CPU count machines.
      
      The summing in xfs_mod_ifree() is only used to fire an assert if an
      underrun occurs. The error is ignored by the higher level code.
      Hence this is really just debug code and we don't need to run it
      on production kernels, nor do we need such debug checks to return
      error values just to trigger an assert.
      
      Finally, xfs_mod_icount/xfs_mod_ifree are only called from
      xfs_trans_unreserve_and_mod_sb(), so get rid of them and just
      directly call the percpu_counter_add/percpu_counter_compare
      functions. The compare functions are now run only on debug builds as
      they are internal to ASSERT() checks and so only compiled in when
      ASSERTs are active (CONFIG_XFS_DEBUG=y or CONFIG_XFS_WARN=y).
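      
      After the change, xfs_trans_unreserve_and_mod_sb() applies the
      deltas directly; a sketch per the mainline commit (the ASSERT()
      comparisons compile away unless debug checks are built in):
      
       	if (idelta)
       		percpu_counter_add_batch(&mp->m_icount, idelta,
       					 XFS_ICOUNT_BATCH);
       	if (ifreedelta)
       		percpu_counter_add(&mp->m_ifree, ifreedelta);
      
       	/* Underruns indicate a bug; only debug builds pay for the
       	 * expensive cross-CPU summation these comparisons may do. */
       	ASSERT(__percpu_counter_compare(&mp->m_icount, 0,
       					XFS_ICOUNT_BATCH) >= 0);
       	ASSERT(percpu_counter_compare(&mp->m_ifree, 0) >= 0);
      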
      Reported-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      f18c9a90
  11. 05 May 2020, 1 commit
  12. 12 March 2020, 1 commit
  13. 19 December 2019, 2 commits
    • xfs: don't commit sunit/swidth updates to disk if that would cause repair failures · 13eaec4b
      Committed by Darrick J. Wong
      Alex Lyakas reported[1] that mounting an xfs filesystem with new sunit
      and swidth values could cause xfs_repair to fail loudly.  The problem
      here is that repair calculates where mkfs should have allocated the
      root inode, based on the superblock geometry.  The allocation decisions
      depend on sunit, which means that we really can't go updating sunit if
      it would lead to a subsequent repair failure on an otherwise correct
      filesystem.
      
      Port from xfs_repair some code that computes the location of the root
      inode and teach mount to skip the ondisk update if it would cause
      problems for repair.  Along the way we'll update the documentation,
      provide a function for computing the minimum AGFL size instead of
      open-coding it, and cut down some indenting in the mount code.
      
      Note that we allow the mount to proceed (and new allocations will
      reflect this new geometry) because we've never screened this kind of
      thing before.  We'll have to wait for a new future incompat feature to
      enforce correct behavior, alas.
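      
      A sketch of the mount-time screen (xfs_ialloc_calc_rootino is the
      helper name from the mainline patch; its body is abbreviated away
      here and the warning text is paraphrased):
      
       	/*
       	 * If the root inode isn't where mkfs would have put it under
       	 * the new sunit, keep the new geometry incore but skip the
       	 * ondisk sb_unit/sb_width update so repair stays happy.
       	 */
       	calc_ino = xfs_ialloc_calc_rootino(mp, new_dalign);
       	if (mp->m_sb.sb_rootino != calc_ino) {
       		xfs_warn(mp,
        "Cannot change stripe alignment; root inode would move. Not updating superblock.");
       		return 0;	/* mount proceeds; no ondisk update */
       	}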
      
      Note that the geometry reporting always uses the superblock values, not
      the incore ones, so that is what xfs_info and xfs_growfs will report.
      
      [1] https://lore.kernel.org/linux-xfs/20191125130744.GA44777@bfoster/T/#m00f9594b511e076e2fcdd489d78bc30216d72a7d
      Reported-by: Alex Lyakas <alex@zadara.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      13eaec4b
    • xfs: split the sunit parameter update into two parts · 4f5b1b3a
      Committed by Darrick J. Wong
      If the administrator provided a sunit= mount option, we need to validate
      the raw parameter, convert the mount option units (512b blocks) into the
      internal unit (fs blocks), and then validate that the (now cooked)
      parameter doesn't screw anything up on disk.  The incore inode geometry
      computation can depend on the new sunit option, but a subsequent patch
      will make validating the cooked value depend on the computed inode
      geometry, so break the sunit update into two steps.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      4f5b1b3a
  14. 14 November 2019, 1 commit
  15. 06 November 2019, 1 commit
  16. 30 October 2019, 4 commits
  17. 06 September 2019, 1 commit
  18. 27 August 2019, 1 commit
  19. 29 June 2019, 1 commit
  20. 12 June 2019, 4 commits
  21. 27 April 2019, 1 commit