提交 · 9fa47bdcd33b117599e9ee3f2e315cb47939ac2d · openeuler / Kernel

20 10月, 2021 22 次提交

xfs: use separate btree cursor cache for each btree type · 9fa47bdc

由 Darrick J. Wong 提交于 9月 23, 2021

Now that we have the infrastructure to track the max possible height of
each btree type, we can create a separate slab cache for cursors of each
type of btree.  For smaller indices like the free space btrees, this
means that we can pack more cursors into a slab page, improving slab
utilization.
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
Reviewed-by: NDave Chinner <dchinner@redhat.com>

9fa47bdc

xfs: compute absolute maximum nlevels for each btree type · 0ed5f735

由 Darrick J. Wong 提交于 9月 23, 2021

Add code for all five btree types so that we can compute the absolute
maximum possible btree height for each btree type.  This is a setup for
the next patch, which makes every btree type have its own cursor cache.

The functions are exported so that we can have xfs_db report the
absolute maximum btree heights for each btree type, rather than making
everyone run their own ad-hoc computations.
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
Reviewed-by: NDave Chinner <dchinner@redhat.com>

0ed5f735

xfs: kill XFS_BTREE_MAXLEVELS · bc8883eb

由 Darrick J. Wong 提交于 9月 16, 2021

Nobody uses this symbol anymore, so kill it.
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
Reviewed-by: NChandan Babu R <chandan.babu@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>

bc8883eb

xfs: compute the maximum height of the rmap btree when reflink enabled · 9ec69120

由 Darrick J. Wong 提交于 9月 16, 2021

Instead of assuming that the hardcoded XFS_BTREE_MAXLEVELS value is big
enough to handle the maximally tall rmap btree when all blocks are in
use and maximally shared, let's compute the maximum height assuming the
rmapbt consumes as many blocks as possible.
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
Reviewed-by: NChandan Babu R <chandan.babu@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>

9ec69120

xfs: clean up xfs_btree_{calc_size,compute_maxlevels} · 1b236ad7

由 Darrick J. Wong 提交于 10月 13, 2021

During review of the next patch, Dave remarked that he found these two
btree geometry calculation functions lacking in documentation and that
they performed more work than was really necessary.

These functions take the same parameters and have nearly the same logic;
the only real difference is in the return values. Reword the function
comment to make it clearer what each function does, and move them to be
adjacent to reinforce their relation.

Clean up both of them to stop opencoding the howmany functions, stop
using the uint typedefs, and make them both support computations for
more than 2^32 leaf records, since we're going to need all of the above
for files with large data forks and large rmap btrees.
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
Reviewed-by: NDave Chinner <dchinner@redhat.com>

1b236ad7

xfs: compute maximum AG btree height for critical reservation calculation · b74e15d7

由 Darrick J. Wong 提交于 9月 16, 2021

Compute the actual maximum AG btree height for deciding if a per-AG
block reservation is critically low.  This only affects the sanity check
condition, since we /generally/ will trigger on the 10% threshold.  This
is a long-winded way of saying that we're removing one more usage of
XFS_BTREE_MAXLEVELS.
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
Reviewed-by: NDave Chinner <dchinner@redhat.com>

b74e15d7

xfs: rename m_ag_maxlevels to m_allocbt_maxlevels · 7cb3efb4

由 Darrick J. Wong 提交于 10月 13, 2021

Years ago when XFS was thought to be much more simple, we introduced
m_ag_maxlevels to specify the maximum btree height of per-AG btrees for
a given filesystem mount.  Then we observed that inode btrees don't
actually have the same height and split that off; and now we have rmap
and refcount btrees with much different geometries and separate
maxlevels variables.

The 'ag' part of the name doesn't make much sense anymore, so rename
this to m_alloc_maxlevels to reinforce that this is the maximum height
of the *free space* btrees.  This sets us up for the next patch, which
will add a variable to track the maximum height of all AG btrees.

(Also take the opportunity to improve adjacent comments and fix minor
style problems.)
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
Reviewed-by: NDave Chinner <dchinner@redhat.com>

7cb3efb4

xfs: dynamically allocate cursors based on maxlevels · c940a0c5

由 Darrick J. Wong 提交于 9月 16, 2021

To support future btree code, we need to be able to size btree cursors
dynamically for very large btrees. Switch the maxlevels computation to
use the precomputed values in the superblock, and create cursors that
can handle a certain height. For now, we retain the btree cursor cache
that can handle up to 9-level btrees, though a subsequent patch
introduces separate caches for each btree type, where each cache's
objects will be exactly tall enough to handle the specific btree type.
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
Reviewed-by: NDave Chinner <dchinner@redhat.com>

c940a0c5

xfs: encode the max btree height in the cursor · c0643f6f

由 Darrick J. Wong 提交于 9月 16, 2021

Encode the maximum btree height in the cursor, since we're soon going to
allow smaller cursors for AG btrees and larger cursors for file btrees.
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
Reviewed-by: NDave Chinner <dchinner@redhat.com>

c0643f6f

xfs: refactor btree cursor allocation function · 56370ea6

由 Darrick J. Wong 提交于 9月 16, 2021

Refactor btree allocation to a common helper.
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
Reviewed-by: NChandan Babu R <chandan.babu@oracle.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDave Chinner <dchinner@redhat.com>

56370ea6

xfs: rearrange xfs_btree_cur fields for better packing · 69724d92

由 Darrick J. Wong 提交于 10月 12, 2021

Reduce the size of the btree cursor structure some more by rearranging
fields to eliminate unused space.  While we're at it, fix the ragged
indentation and a spelling error.
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
Reviewed-by: NDave Chinner <dchinner@redhat.com>

69724d92

xfs: prepare xfs_btree_cur for dynamic cursor heights · 6ca444cf

由 Darrick J. Wong 提交于 9月 16, 2021

Split out the btree level information into a separate struct and put it
at the end of the cursor structure as a VLA.  Files with huge data forks
(and in the future, the realtime rmap btree) will require the ability to
support many more levels than a per-AG btree cursor, which means that
we're going to create per-btree type cursor caches to conserve memory
for the more common case.

Note that a subsequent patch actually introduces dynamic cursor heights.
This one merely rearranges the structure to prepare for that.
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
Reviewed-by: NChandan Babu R <chandan.babu@oracle.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDave Chinner <dchinner@redhat.com>

6ca444cf

xfs: dynamically allocate btree scrub context structure · eae5db47

由 Darrick J. Wong 提交于 9月 16, 2021

Reorganize struct xchk_btree so that we can dynamically size the context
structure to fit the type of btree cursor that we have. This will
enable us to use memory more efficiently once we start adding very tall
btree types. Right-size the lastkey array to match the number of *node*
levels in the tree so that we stop wasting space.
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
Reviewed-by: NDave Chinner <dchinner@redhat.com>

eae5db47

xfs: don't track firstrec/firstkey separately in xchk_btree · d47fef93

由 Darrick J. Wong 提交于 9月 22, 2021

The btree scrubbing code checks that the records (or keys) that it finds
in a btree block are all in order by calling the btree cursor's
->recs_inorder function.  This of course makes no sense for the first
item in the block, so we switch that off with a separate variable in
struct xchk_btree.

Christoph helped me figure out that the variable is unnecessary, since
we just accessed bc_ptrs[level] and can compare that against zero.  Use
that, and save ourselves some memory space.
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
Reviewed-by: NDave Chinner <dchinner@redhat.com>

d47fef93

xfs: reduce the size of nr_ops for refcount btree cursors · efb79ea3

由 Darrick J. Wong 提交于 9月 23, 2021

We're never going to run more than 4 billion btree operations on a
refcount cursor, so shrink the field to an unsigned int to reduce the
structure size.  Fix whitespace alignment too.
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
Reviewed-by: NDave Chinner <dchinner@redhat.com>

efb79ea3

xfs: remove xfs_btree_cur.bc_blocklog · cc411740

由 Darrick J. Wong 提交于 9月 23, 2021

This field isn't used by anyone, so get rid of it.
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
Reviewed-by: NDave Chinner <dchinner@redhat.com>

cc411740

xfs: fix incorrect decoding in xchk_btree_cur_fsbno · 94a14cfd

由 Darrick J. Wong 提交于 10月 13, 2021

During review of subsequent patches, Dave and I noticed that this
function doesn't work quite right -- accessing cur->bc_ino depends on
the ROOT_IN_INODE flag, not LONG_PTRS.  Fix that and the parentheses
isssue.  While we're at it, remove the piece that accesses cur->bc_ag,
because block 0 of an AG is never part of a btree.

Note: This changes the btree scrubber tracepoints behavior -- if the
cursor has no buffer for a certain level, it will always report
NULLFSBLOCK.  It is assumed that anyone tracing the online fsck code
will also be tracing xchk_start/xchk_done or otherwise be aware of what
exactly is being scrubbed.
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
Reviewed-by: NDave Chinner <dchinner@redhat.com>

94a14cfd

xfs: fix perag reference leak on iteration race with growfs · 892a666f

由 Brian Foster 提交于 10月 14, 2021

The for_each_perag*() set of macros are hacky in that some (i.e.
those based on sb_agcount) rely on the assumption that perag
iteration terminates naturally with a NULL perag at the specified
end_agno. Others allow for the final AG to have a valid perag and
require the calling function to clean up any potential leftover
xfs_perag reference on termination of the loop.

Aside from providing a subtly inconsistent interface, the former
variant is racy with growfs because growfs can create discoverable
post-eofs perags before the final superblock update that completes
the grow operation and increases sb_agcount. This leads to the
following assert failure (reproduced by xfs/104) in the perag free
path during unmount:

 XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/libxfs/xfs_ag.c, line: 195

This occurs because one of the many for_each_perag() loops in the
code that is expected to terminate with a NULL pag (and thus has no
post-loop xfs_perag_put() check) raced with a growfs and found a
non-NULL post-EOFS perag, but terminated naturally based on the
end_agno check without releasing the post-EOFS perag.

Rework the iteration logic to lift the agno check from the main for
loop conditional to the iteration helper function. The for loop now
purely terminates on a NULL pag and xfs_perag_next() avoids taking a
reference to any perag beyond end_agno in the first place.

Fixes: f250eedc ("xfs: make for_each_perag... a first class citizen")
Signed-off-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>

892a666f

xfs: terminate perag iteration reliably on agcount · 8ed004eb

由 Brian Foster 提交于 10月 14, 2021

The for_each_perag_from() iteration macro relies on sb_agcount to
process every perag currently within EOFS from a given starting
point. It's perfectly valid to have perag structures beyond
sb_agcount, however, such as if a growfs is in progress. If a perag
loop happens to race with growfs in this manner, it will actually
attempt to process the post-EOFS perag where ->pag_agno ==
sb_agcount. This is reproduced by xfs/104 and manifests as the
following assert failure in superblock write verifier context:

 XFS: Assertion failed: agno < mp->m_sb.sb_agcount, file: fs/xfs/libxfs/xfs_types.c, line: 22

Update the corresponding macro to only process perags that are
within the current sb_agcount.

Fixes: 58d43a7e ("xfs: pass perags around in fsmap data dev functions")
Signed-off-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>

8ed004eb

xfs: rename the next_agno perag iteration variable · f1788b5e

由 Brian Foster 提交于 10月 14, 2021

Rename the next_agno variable to be consistent across the several
iteration macros and shorten line length.
Signed-off-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>

f1788b5e

xfs: fold perag loop iteration logic into helper function · bf2307b1

由 Brian Foster 提交于 10月 14, 2021

Fold the loop iteration logic into a helper in preparation for
further fixups. No functional change in this patch.
Signed-off-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>

bf2307b1

xfs: replace snprintf in show functions with sysfs_emit · 53eb47b4

由 Qing Wang 提交于 10月 14, 2021

coccicheck complains about the use of snprintf() in sysfs show functions.

Fix the coccicheck warning:
WARNING: use scnprintf or sprintf.

Use sysfs_emit instead of scnprintf or sprintf makes more sense.
Signed-off-by: NQing Wang <wangqing@vivo.com>
Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>

53eb47b4

15 10月, 2021 11 次提交

xfs: remove the xfs_dqblk_t typedef · 11a83f4c

由 Christoph Hellwig 提交于 10月 11, 2021

Remove the few leftover instances of the xfs_dinode_t typedef.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>

11a83f4c

xfs: remove the xfs_dsb_t typedef · ed67ebfd

由 Christoph Hellwig 提交于 10月 11, 2021

Remove the few leftover instances of the xfs_dinode_t typedef.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>

ed67ebfd

xfs: remove the xfs_dinode_t typedef · de38db72

由 Christoph Hellwig 提交于 10月 11, 2021

Remove the few leftover instances of the xfs_dinode_t typedef.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>

de38db72

xfs: check that bc_nlevels never overflows · 4c175af2

由 Darrick J. Wong 提交于 9月 16, 2021

Warn if we ever bump nlevels higher than the allowed maximum cursor
height.
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
Reviewed-by: NChandan Babu R <chandan.babu@oracle.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

4c175af2

xfs: stricter btree height checking when scanning for btree roots · 1ba6fd34

由 Darrick J. Wong 提交于 9月 16, 2021

When we're scanning for btree roots to rebuild the AG headers, make sure
that the proposed tree does not exceed the maximum height for that btree
type (and not just XFS_BTREE_MAXLEVELS).
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
Reviewed-by: NChandan Babu R <chandan.babu@oracle.com>

1ba6fd34

xfs: stricter btree height checking when looking for errors · f4585e82

由 Darrick J. Wong 提交于 9月 16, 2021

Since each btree type has its own precomputed maxlevels variable now,
use them instead of the generic XFS_BTREE_MAXLEVELS to check the level
of each per-AG btree.
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
Reviewed-by: NChandan Babu R <chandan.babu@oracle.com>

f4585e82

xfs: don't allocate scrub contexts on the stack · 510a28e1

由 Darrick J. Wong 提交于 9月 16, 2021

Convert the on-stack scrub context, btree scrub context, and da btree
scrub context into a heap allocation so that we reduce stack usage and
gain the ability to handle tall btrees without issue.

Specifically, this saves us ~208 bytes for the dabtree scrub, ~464 bytes
for the btree scrub, and ~200 bytes for the main scrub context.
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
Reviewed-by: NChandan Babu R <chandan.babu@oracle.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

510a28e1

xfs: remove xfs_btree_cur_t typedef · ae127f08

由 Darrick J. Wong 提交于 9月 16, 2021

Get rid of this old typedef before we start changing other things.
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
Reviewed-by: NChandan Babu R <chandan.babu@oracle.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

ae127f08

xfs: fix maxlevels comparisons in the btree staging code · 78e8ec83

由 Darrick J. Wong 提交于 9月 16, 2021

The btree geometry computation function has an off-by-one error in that
it does not allow maximally tall btrees (nlevels == XFS_BTREE_MAXLEVELS).
This can result in repairs failing unnecessarily on very fragmented
filesystems.  Subsequent patches to remove MAXLEVELS usage in favor of
the per-btree type computations will make this a much more likely
occurrence.
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
Reviewed-by: NChandan Babu R <chandan.babu@oracle.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

78e8ec83

xfs: port the defer ops capture and continue to resource capture · 512edfac

由 Darrick J. Wong 提交于 9月 16, 2021

When log recovery tries to recover a transaction that had log intent
items attached to it, it has to save certain parts of the transaction
state (reservation, dfops chain, inodes with no automatic unlock) so
that it can finish single-stepping the recovered transactions before
finishing the chains.

This is done with the xfs_defer_ops_capture and xfs_defer_ops_continue
functions.  Right now they open-code this functionality, so let's port
this to the formalized resource capture structure that we introduced in
the previous patch.  This enables us to hold up to two inodes and two
buffers during log recovery, the same way we do for regular runtime.

With this patch applied, we'll be ready to support atomic extent swap
which holds two inodes; and logged xattrs which holds one inode and one
xattr leaf buffer.
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
Reviewed-by: NAllison Henderson <allison.henderson@oracle.com>

512edfac

xfs: formalize the process of holding onto resources across a defer roll · c5db9f93

由 Darrick J. Wong 提交于 9月 16, 2021

Transaction users are allowed to flag up to two buffers and two inodes
for ownership preservation across a deferred transaction roll. Hoist
the variables and code responsible for this out of xfs_defer_trans_roll
so that we can use it for the defer capture mechanism.
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
Reviewed-by: NAllison Henderson <allison.henderson@oracle.com>

c5db9f93

12 10月, 2021 2 次提交

xfs: use kmem_cache_free() for kmem_cache objects · c30a0cbd

由 Rustam Kovhaev 提交于 10月 11, 2021

For kmalloc() allocations SLOB prepends the blocks with a 4-byte header,
and it puts the size of the allocated blocks in that header.
Blocks allocated with kmem_cache_alloc() allocations do not have that
header.

SLOB explodes when you allocate memory with kmem_cache_alloc() and then
try to free it with kfree() instead of kmem_cache_free().
SLOB will assume that there is a header when there is none, read some
garbage to size variable and corrupt the adjacent objects, which
eventually leads to hang or panic.

Let's make XFS work with SLOB by using proper free function.

Fixes: 9749fee8 ("xfs: enable the xfs_defer mechanism to process extents to free")
Signed-off-by: NRustam Kovhaev <rkovhaev@gmail.com>
Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>

c30a0cbd

xfs: Use kvcalloc() instead of kvzalloc() · a785fba7

由 Gustavo A. R. Silva 提交于 10月 11, 2021

Use 2-factor argument multiplication form kvcalloc() instead of
kvzalloc().

Link: https://github.com/KSPP/linux/issues/162Signed-off-by: NGustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>

a785fba7

27 8月, 2021 2 次提交

dax: remove bdev_dax_supported · bdd3c50d

由 Christoph Hellwig 提交于 8月 26, 2021

All callers already have a dax_device obtained from fs_dax_get_by_bdev
at hand, so just pass that to dax_supported() insted of doing another
lookup.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/r/20210826135510.6293-10-hch@lst.deSigned-off-by: NDan Williams <dan.j.williams@intel.com>

bdd3c50d

xfs: factor out a xfs_buftarg_is_dax helper · a384f088

由 Christoph Hellwig 提交于 8月 26, 2021

Refactor the DAX setup code in preparation of removing
bdev_dax_supported.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDan Williams <dan.j.williams@intel.com>
Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20210826135510.6293-9-hch@lst.deSigned-off-by: NDan Williams <dan.j.williams@intel.com>

a384f088

25 8月, 2021 1 次提交

xfs: fix I_DONTCACHE · f38a032b

由 Dave Chinner 提交于 8月 24, 2021

Yup, the VFS hoist broke it, and nobody noticed. Bulkstat workloads
make it clear that it doesn't work as it should.

Fixes: dae2f8ed ("fs: Lift XFS_IDONTCACHE to the VFS layer")
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>

f38a032b

24 8月, 2021 1 次提交

xfs: only set IOMAP_F_SHARED when providing a srcmap to a write · 72a048c1

由 Darrick J. Wong 提交于 8月 20, 2021

While prototyping a free space defragmentation tool, I observed an
unexpected IO error while running a sequence of commands that can be
recreated by the following sequence of commands:

# xfs_io -f -c "pwrite -S 0x58 -b 10m 0 10m" file1
# cp --reflink=always file1 file2
# punch-alternating -o 1 file2
# xfs_io -c "funshare 0 10m" file2
fallocate: Input/output error

I then scraped this (abbreviated) stack trace from dmesg:

WARNING: CPU: 0 PID: 30788 at fs/iomap/buffered-io.c:577 iomap_write_begin+0x376/0x450
CPU: 0 PID: 30788 Comm: xfs_io Not tainted 5.14.0-rc6-xfsx #rc6 5ef57b62a900814b3e4d885c755e9014541c8732
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:iomap_write_begin+0x376/0x450
RSP: 0018:ffffc90000c0fc20 EFLAGS: 00010297
RAX: 0000000000000001 RBX: ffffc90000c0fd10 RCX: 0000000000001000
RDX: ffffc90000c0fc54 RSI: 000000000000000c RDI: 000000000000000c
RBP: ffff888005d5dbd8 R08: 0000000000102000 R09: ffffc90000c0fc50
R10: 0000000000b00000 R11: 0000000000101000 R12: ffffea0000336c40
R13: 0000000000001000 R14: ffffc90000c0fd10 R15: 0000000000101000
FS:  00007f4b8f62fe40(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000056361c554108 CR3: 000000000524e004 CR4: 00000000001706f0
Call Trace:
 iomap_unshare_actor+0x95/0x140
 iomap_apply+0xfa/0x300
 iomap_file_unshare+0x44/0x60
 xfs_reflink_unshare+0x50/0x140 [xfs 61947ea9b3a73e79d747dbc1b90205e7987e4195]
 xfs_file_fallocate+0x27c/0x610 [xfs 61947ea9b3a73e79d747dbc1b90205e7987e4195]
 vfs_fallocate+0x133/0x330
 __x64_sys_fallocate+0x3e/0x70
 do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f4b8f79140a

Looking at the iomap tracepoints, I saw this:

iomap_iter:           dev 8:64 ino 0x100 pos 0 length 0 flags WRITE|0x80 (0x81) ops xfs_buffered_write_iomap_ops caller iomap_file_unshare
iomap_iter_dstmap:    dev 8:64 ino 0x100 bdev 8:64 addr -1 offset 0 length 131072 type DELALLOC flags SHARED
iomap_iter_srcmap:    dev 8:64 ino 0x100 bdev 8:64 addr 147456 offset 0 length 4096 type MAPPED flags
iomap_iter:           dev 8:64 ino 0x100 pos 0 length 4096 flags WRITE|0x80 (0x81) ops xfs_buffered_write_iomap_ops caller iomap_file_unshare
iomap_iter_dstmap:    dev 8:64 ino 0x100 bdev 8:64 addr -1 offset 4096 length 4096 type DELALLOC flags SHARED
console:              WARNING: CPU: 0 PID: 30788 at fs/iomap/buffered-io.c:577 iomap_write_begin+0x376/0x450

The first time funshare calls ->iomap_begin, xfs sees that the first
block is shared and creates a 128k delalloc reservation in the COW fork.
The delalloc reservation is returned as dstmap, and the shared block is
returned as srcmap.  So far so good.

funshare calls ->iomap_begin to try the second block.  This time there's
no srcmap (punch-alternating punched it out!) but we still have the
delalloc reservation in the COW fork.  Therefore, we again return the
reservation as dstmap and the hole as srcmap.  iomap_unshare_iter
incorrectly tries to unshare the hole, which __iomap_write_begin rejects
because shared regions must be fully written and therefore cannot
require zeroing.

Therefore, change the buffered write iomap_begin function not to set
IOMAP_F_SHARED when there isn't a source mapping to read from for the
unsharing.
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NChandan Babu R <chandanrlinux@gmail.com>

72a048c1

21 8月, 2021 1 次提交

xfs: fix perag structure refcounting error when scrub fails · 61e0d0cc

由 Darrick J. Wong 提交于 8月 19, 2021

The kernel test robot found the following bug when running xfs/355 to
scrub a bmap btree:

XFS: Assertion failed: !sa->pag, file: fs/xfs/scrub/common.c, line: 412
------------[ cut here ]------------
kernel BUG at fs/xfs/xfs_message.c:110!
invalid opcode: 0000 [#1] SMP PTI
CPU: 2 PID: 1415 Comm: xfs_scrub Not tainted 5.14.0-rc4-00021-g48c6615c #1
Hardware name: Hewlett-Packard p6-1451cx/2ADA, BIOS 8.15 02/05/2013
RIP: 0010:assfail+0x23/0x28 [xfs]
RSP: 0018:ffffc9000aacb890 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffffc9000aacbcc8 RCX: 0000000000000000
RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffc09e7dcd
RBP: ffffc9000aacbc80 R08: ffff8881fdf17d50 R09: 0000000000000000
R10: 000000000000000a R11: f000000000000000 R12: 0000000000000000
R13: ffff88820c7ed000 R14: 0000000000000001 R15: ffffc9000aacb980
FS:  00007f185b955700(0000) GS:ffff8881fdf00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7f6ef43000 CR3: 000000020de38002 CR4: 00000000001706e0
Call Trace:
 xchk_ag_read_headers+0xda/0x100 [xfs]
 xchk_ag_init+0x15/0x40 [xfs]
 xchk_btree_check_block_owner+0x76/0x180 [xfs]
 xchk_btree_get_block+0xd0/0x140 [xfs]
 xchk_btree+0x32e/0x440 [xfs]
 xchk_bmap_btree+0xd4/0x140 [xfs]
 xchk_bmap+0x1eb/0x3c0 [xfs]
 xfs_scrub_metadata+0x227/0x4c0 [xfs]
 xfs_ioc_scrub_metadata+0x50/0xc0 [xfs]
 xfs_file_ioctl+0x90c/0xc40 [xfs]
 __x64_sys_ioctl+0x83/0xc0
 do_syscall_64+0x3b/0xc0

The unusual handling of errors while initializing struct xchk_ag is the
root cause here.  Since the beginning of xfs_scrub, the goal of
xchk_ag_read_headers has been to read all three AG header buffers and
attach them both to the xchk_ag structure and the scrub transaction.
Corruption errors on any of the three headers doesn't necessarily
trigger an immediate return to userspace, because xfs_scrub can also
tell us to /fix/ the problem.

In other words, it's possible for the xchk_ag init functions to return
an error code and a partially filled out structure so that scrub can use
however much information it managed to pull.  Before 5.15, it was
sufficient to cancel (or commit) the scrub transaction on the way out of
the scrub code to release the buffers.

Ccommit 48c6615c added a reference to the perag structure to struct
xchk_ag.  Since perag structures are not attached to transactions like
buffers are, this adds the requirement that the perag ref be released
explicitly.  The scrub teardown function xchk_teardown was amended to do
this for the xchk_ag embedded in struct xfs_scrub.

Unfortunately, I forgot that certain parts of the scrub code probe
multiple AGs and therefore handle the initialization and cleanup on
their own.  Specifically, the bmbt scrubber will initialize it long
enough to cross-reference AG metadata for btree blocks and for the
extent mappings in the bmbt.

If one of the AG headers is corrupt, the init function returns with a
live perag structure reference and some of the AG header buffers.  If an
error occurs, the cross referencing will be noted as XCORRUPTion and
skipped, but the main scrub process will move on to the next record.
It is now necessary to release the perag reference before we try to
analyze something from a different AG, or else we'll trip over the
assertion noted above.

Fixes: 48c6615c ("xfs: grab active perag ref when reading AG headers")
Reported-by: Nkernel test robot <oliver.sang@intel.com>
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
Reviewed-by: NChandan Babu R <chandanrlinux@gmail.com>

61e0d0cc

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功