提交 · 75efa57d0bf5fcf650a183f0ce0dc011ba8c4bc8 · openeuler / Kernel

30 4月, 2019 3 次提交

xfs: add online scrub for superblock counters · 75efa57d

由 Darrick J. Wong 提交于 4月 25, 2019

Teach online scrub how to check the filesystem summary counters. We use
the incore delalloc block counter along with the incore AG headers to
compute expected values for fdblocks, icount, and ifree, and then check
that the percpu counter is within a certain threshold of the expected
value. This is done to avoid having to freeze or otherwise lock the
filesystem, which means that we're only checking that the counters are
fairly close, not that they're exactly correct.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>

75efa57d

xfs: don't parse the mtpt mount option · 94079285

由 Christoph Hellwig 提交于 4月 28, 2019

The text isn't really any more useful than the default unknown option
handling.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

94079285

xfs: always rejoin held resources during defer roll · 710d707d

由 Darrick J. Wong 提交于 4月 24, 2019

During testing of xfs/141 on a V4 filesystem, I observed some
inconsistent behavior with regards to resources that are held (i.e.
remain locked) across a defer roll.  The transaction roll always gives
the defer roll function a new transaction, even if committing the old
transaction fails.  However, the defer roll function only rejoins the
held resources if the transaction commit succeedied.  This means that
callers of defer roll have to figure out whether the held resources are
attached to the transaction being passed back.

Worse yet, if the defer roll was part of a defer finish call, we have a
third possibility: the defer finish could pass back a dirty transaction
with dirty held resources and an error code.

The only sane way to handle all of these scenarios is to require that
the code that held the resource either cancel the transaction before
unlocking and releasing the resources, or use functions that detach
resources from a transaction properly (e.g.  xfs_trans_brelse) if they
need to drop the reference before committing or cancelling the
transaction.

In order to make this so, change the defer roll code to join held
resources to the new transaction unconditionally and fix all the bhold
callers to release the held buffers correctly.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>

710d707d

27 4月, 2019 6 次提交

xfs: add missing error check in xfs_prepare_shift() · 1749d1ea

由 Brian Foster 提交于 4月 26, 2019

xfs_prepare_shift() fails to check the error return from
xfs_flush_unmap_range(). If the latter fails, that could lead to an
insert/collapse range operation over a delalloc range, which is not
supported.

Add an error check and return appropriately. This is reproduced
rarely by generic/475.

Fixes: 7f9f71be ("xfs: extent shifting doesn't fully invalidate page cache")
Signed-off-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NAllison Collins <allison.henderson@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>

1749d1ea

xfs: scrub should check incore counters against ondisk headers · 47cd97b5

由 Darrick J. Wong 提交于 4月 25, 2019

In theory, the incore per-AG structure counters should match the ones on
disk, so check that.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>

47cd97b5

xfs: allow scrubbers to pause background reclaim · 9a1f3049

由 Darrick J. Wong 提交于 4月 25, 2019

The forthcoming summary counter patch races with regular filesystem
activity to compute rough expected values for the counters. This design
was chosen to avoid having to freeze the entire filesystem to check the
counters, but while that's running we'd prefer to minimize background
reclamation activity to reduce the perturbations to the incore free
block count. Therefore, provide a way for scrubbers to disable
background posteof and cowblock reclamation.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>

9a1f3049

xfs: rename the speculative block allocation reclaim toggle functions · ed30dcbd

由 Darrick J. Wong 提交于 4月 25, 2019

"reclaim" is used throughout the icache code to mean reclamation of
incore inode structures. It's also used for two helper functions that
toggle background deletion of speculative preallocations. Separate
the second of the two uses to make things less confusing.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>

ed30dcbd

xfs: track delayed allocation reservations across the filesystem · 9fe82b8c

由 Darrick J. Wong 提交于 4月 25, 2019

Add a percpu counter to track the number of blocks directly reserved for
delayed allocations on the data device.  This counter (in contrast to
i_delayed_blks) does not track allocated CoW staging extents or anything
going on with the realtime device.  It will be used in the upcoming
summary counter scrub function to check the free block counts without
having to freeze the filesystem or walk all the inodes to find the
delayed allocations.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>

9fe82b8c

xfs: fix broken bhold behavior in xrep_roll_ag_trans · f60be90f

由 Darrick J. Wong 提交于 4月 24, 2019

In xrep_roll_ag_trans, the transaction roll will always set sc->tp to
the new transaction, even if committing the old one fails. A bare
transaction roll leaves the buffer(s) locked but not joined to the new
transaction, so it's not necessary to release the hold if the roll
fails. Remove the incorrect xfs_trans_bhold_release calls.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>

f60be90f

23 4月, 2019 7 次提交

xfs: unlock inode when xfs_ioctl_setattr_get_trans can't get transaction · 3de5eab3

由 Darrick J. Wong 提交于 4月 22, 2019

We passed an inode into xfs_ioctl_setattr_get_trans with join_flags
indicating which locks are held on that inode.  If we can't allocate a
transaction then we need to unlock the inode before we bail out, like
all the other error paths do.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NBrian Foster <bfoster@redhat.com>

3de5eab3

xfs: kill the xfs_dqtrx_t typedef · 078f4a7d

由 Darrick J. Wong 提交于 4月 17, 2019

There's only a few uses left, so just kill the typedef while we're at
it.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

078f4a7d

xfs: widen inode delalloc block counter to 64-bits · 394aafdc

由 Darrick J. Wong 提交于 4月 17, 2019

Widen the incore inode's i_delayed_blks counter to be a 64-bit integer.
This is necessary to fix an integer overflow problem that can be
reproduced easily now that we use the counter to track blocks that are
assigned to the inode in memory but not on disk.  This includes actual
delalloc reservations as well as real extents in the COW fork that
are waiting to be remapped into the data fork.

These 'delayed mapping' blocks can easily exceed 2^32 blocks if one
creates a very large sparse file of size approximately 2^33 bytes with
one byte written every 2^23 bytes, sets a very large COW extent size
hint of 2^23 blocks, reflinks the first file into a second file, and
then writes a single byte every 2^23 blocks in the original file.

When this happens, we'll try to create approximately 1024 2^23 extent
reservations in the COW fork, which will overflow the counter and cause
problems.

Note that on x64 we end up filling a 4-byte gap in the structure so this
doesn't increase the incore size.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NAllison Collins <allison.henderson@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

394aafdc

xfs: widen quota block counters to 64-bit integers · 903b1fc2

由 Darrick J. Wong 提交于 4月 17, 2019

Widen the incore quota transaction delta structure to treat block
counters as 64-bit integers.  This is a necessary addition so that we
can widen the i_delayed_blks counter to be a 64-bit integer.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NAllison Collins <allison.henderson@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

903b1fc2

xfs: abort unaligned nowait directio early · 1fdeaea4

由 Darrick J. Wong 提交于 4月 17, 2019

Dave Chinner noticed that xfs_file_dio_aio_write returns EAGAIN without
dropping the IOLOCK when its deciding not to wait, which means that we
leak the IOLOCK there.  Since we now make unaligned directio always
wait, we have the opportunity to bail out before trying to take the
lock, which should reduce the overhead of this never-gonna-work case
considerably while also solving the dropped lock problem.
Reported-by: NDave Chinner <david@fromorbit.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

1fdeaea4

xfs: assert that we don't enter agfl freeing with a non-permanent transaction · 362f5e74

由 Brian Foster 提交于 4月 23, 2019

Block allocation requires a permanent transaction for deferred AGFL
frees.  Add an assert in the block allocation path to make explicit and
obvious to future callers the requirement of a transaction with a
permanent reservation.
Reported-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
[darrick: split this out from the previous patch per hch request]
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

362f5e74

xfs: make tr_growdata a permanent transaction · 945c941f

由 Brian Foster 提交于 4月 17, 2019

The growdata transaction is used by growfs operations to increase
the data size of the filesystem. Part of this sequence involves
extending the size of the last preexisting AG in the fs, if
necessary. This is implemented by freeing the newly available
physical range to the AG.

tr_growdata is not a permanent transaction, however, and block
allocation transactions must be permanent to handle deferred frees
of AGFL blocks. If the grow operation extends an existing AG that
requires AGFL fixing, assert failures occur due to a populated dfops
list on a non-permanent transaction and the AGFL free does not
occur. This is reproduced (rarely) by xfs/104.

Change tr_growdata to a permanent transaction with a default log
count. This increases initial transaction reservation size, but
growfs is an infrequent and non-performance critical operation and
so should have minimal impact.
Reported-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
[darrick: add a comment to the assert]
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

945c941f

17 4月, 2019 8 次提交

xfs: merge adjacent io completions of the same type · 3994fc48

由 Darrick J. Wong 提交于 4月 15, 2019

It's possible for pagecache writeback to split up a large amount of work
into smaller pieces for throttling purposes or to reduce the amount of
time a writeback operation is pending.  Whatever the reason, XFS can end
up with a bunch of IO completions that call for the same operation to be
performed on a contiguous extent mapping.  Since mappings are extent
based in XFS, we'd prefer to run fewer transactions when we can.

When we're processing an ioend on the list of io completions, check to
see if the next items on the list are both adjacent and of the same
type.  If so, we can merge the completions to reduce transaction
overhead.

On fast storage this doesn't seem to make much of a difference in
performance, though the number of transactions for an overnight xfstests
run seems to drop by ~5%.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>

3994fc48

xfs: remove unused m_data_workqueue · 28408243

由 Darrick J. Wong 提交于 4月 15, 2019

Now that we're no longer using m_data_workqueue, remove it.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>

28408243

xfs: implement per-inode writeback completion queues · cb357bf3

由 Darrick J. Wong 提交于 4月 15, 2019

When scheduling writeback of dirty file data in the page cache, XFS uses
IO completion workqueue items to ensure that filesystem metadata only
updates after the write completes successfully.  This is essential for
converting unwritten extents to real extents at the right time and
performing COW remappings.

Unfortunately, XFS queues each IO completion work item to an unbounded
workqueue, which means that the kernel can spawn dozens of threads to
try to handle the items quickly.  These threads need to take the ILOCK
to update file metadata, which results in heavy ILOCK contention if a
large number of the work items target a single file, which is
inefficient.

Worse yet, the writeback completion threads get stuck waiting for the
ILOCK while holding transaction reservations, which can use up all
available log reservation space.  When that happens, metadata updates to
other parts of the filesystem grind to a halt, even if the filesystem
could otherwise have handled it.

Even worse, if one of the things grinding to a halt happens to be a
thread in the middle of a defer-ops finish holding the same ILOCK and
trying to obtain more log reservation having exhausted the permanent
reservation, we now have an ABBA deadlock - writeback completion has a
transaction reserved and wants the ILOCK, and someone else has the ILOCK
and wants a transaction reservation.

Therefore, we create a per-inode writeback io completion queue + work
item.  When writeback finishes, it can add the ioend to the per-inode
queue and let the single worker item process that queue.  This
dramatically cuts down on the number of kworkers and ILOCK contention in
the system, and seems to have eliminated an occasional deadlock I was
seeing while running generic/476.

Testing with a program that simulates a heavy random-write workload to a
single file demonstrates that the number of kworkers drops from
approximately 120 threads per file to 1, without dramatically changing
write bandwidth or pagecache access latency.

Note that we leave the xfs-conv workqueue's max_active alone because we
still want to be able to run ioend processing for as many inodes as the
system can handle.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>

cb357bf3

xfs: scrub should only cross-reference with healthy btrees · 4fb7951f

由 Darrick J. Wong 提交于 4月 16, 2019

Skip cross-referencing with a btree if the health report tells us that
it's known to be bad. This should reduce the dmesg spew considerably.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>

4fb7951f

xfs: scrub/repair should update filesystem metadata health · 4860a05d

由 Darrick J. Wong 提交于 4月 16, 2019

Now that we have the ability to track sick metadata in-core, make scrub
and repair update those health assessments after doing work.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>

4860a05d

xfs: hoist the already_fixed variable to the scrub context · 160b5a78

由 Darrick J. Wong 提交于 4月 16, 2019

Now that we no longer memset the scrub context, we can move the
already_fixed variable into the scrub context's state flags instead of
passing around pointers to separate stack variables.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>

160b5a78

xfs: collapse scrub bool state flags into a single unsigned int · f8c2a225

由 Darrick J. Wong 提交于 4月 16, 2019

Combine all the boolean state flags in struct xfs_scrub into a single
unsigned int, because we're going to be adding more state flags soon.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>

f8c2a225

xfs: refactor scrub context initialization · 9d71e155

由 Darrick J. Wong 提交于 4月 16, 2019

It's a little silly how the memset in scrub context initialization
forces us to declare stack variables to preserve context variables
across a retry.  Since the teardown functions already null out most of
the ephemeral state (buffer pointers, btree cursors, etc.), just skip
the memset and move the initialization as needed.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>

9d71e155

15 4月, 2019 14 次提交

xfs: report inode health via bulkstat · 89d139d5

由 Darrick J. Wong 提交于 4月 12, 2019

Use space in the bulkstat ioctl structure to report any problems
observed with the inode.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>

89d139d5

xfs: report AG health via AG geometry ioctl · 1302c6a2

由 Darrick J. Wong 提交于 4月 12, 2019

Use the AG geometry info ioctl to report health status too.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>

1302c6a2

xfs: report fs and rt health via geometry structure · c23232d4

由 Darrick J. Wong 提交于 4月 12, 2019

Use our newly expanded geometry structure to report the overall fs and
realtime health status.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>

c23232d4

xfs: add a new ioctl to describe allocation group geometry · 7cd5006b

由 Darrick J. Wong 提交于 4月 12, 2019

Add a new ioctl to describe an allocation group's geometry.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>

7cd5006b

xfs: bump XFS_IOC_FSGEOMETRY to v5 structures · 1b6d968d

由 Dave Chinner 提交于 4月 12, 2019

Unfortunately, the V4 XFS_IOC_FSGEOMETRY structure is out of space so we
can't just add a new field to it. Hence we need to bump the definition
to V5 and and treat the V4 ioctl and structure similar to v1 to v3.

While doing this, clean up all the definitions associated with the
XFS_IOC_FSGEOMETRY ioctl.
Signed-Off-By: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
[darrick: forward port to 5.1, expand structure size to 256 bytes]
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>

1b6d968d

xfs: clear BAD_SUMMARY if unmounting an unhealthy filesystem · 519841c2

由 Darrick J. Wong 提交于 4月 12, 2019

If we know the filesystem metadata isn't healthy during unmount, we want
to encourage the administrator to run xfs_repair right away. We can't
do this if BAD_SUMMARY will cause an unclean log unmount to force
summary recalculation, so turn it off if the fs is bad.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>

519841c2

xfs: replace the BAD_SUMMARY mount flag with the equivalent health code · 39353ff6

由 Darrick J. Wong 提交于 4月 12, 2019

Replace the BAD_SUMMARY mount flag with calls to the equivalent health
tracking code.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>

39353ff6

xfs: track metadata health status · 6772c1f1

由 Darrick J. Wong 提交于 4月 12, 2019

Add the necessary in-core metadata fields to keep track of which parts
of the filesystem have been observed and which parts were observed to be
unhealthy, and print a warning at unmount time if we have unfixed
problems.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>

6772c1f1

xfs,fstrim: fix to return correct minlen · 2bf9d264

由 Wang Shilong 提交于 4月 12, 2019

This patch tries to address two problems:

1) return @minlen we used to trim to
user space.

2) return EINVAL if granularity is larger than
avg size, even most of cases, granularity is small(4K),
but if devices return a lager granularity for some reaons
(testing, bugs etc), fstrim should return failure directly.
Signed-off-by: NWang Shilong <wshilong@ddn.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

2bf9d264

xfs: don't account extra agfl blocks as available · 1ca89fbc

由 Brian Foster 提交于 4月 12, 2019

The block allocation AG selection code has parameters that allow a
caller to perform multiple allocations from a single AG and
transaction (under certain conditions). The parameters specify the
total block allocation count required by the transaction and the AG
selection code selects and locks an AG that will be able to satisfy
the overall requirement. If the available block accounting
calculation turns out to be inaccurate and a subsequent allocation
call fails with -ENOSPC, the resulting transaction cancel leads to
filesystem shutdown because the transaction is dirty.

This exact problem can be reproduced with a highly parallel space
consumer and fsstress workload running long enough to a large
filesystem against -ENOSPC conditions. A bmbt block allocation
request made for inode extent to bmap format conversion after an
extent allocation is expected to be satisfied by the same AG and the
same transaction as the extent allocation. The bmbt block allocation
fails, however, because the block availability of the AG has changed
since the AG was selected (outside of the blocks used for the extent
itself).

The inconsistent block availability calculation is caused by the
deferred block freeing behavior of the AGFL. This immediately
removes extra blocks from the AGFL to free up AGFL slots, but rather
than immediately freeing such blocks as was done in the past, the
block free is deferred such that said blocks are not available for
allocation until the current transaction commits. The AG selection
logic currently considers all AGFL blocks as available and executes
shortly before any extra AGFL blocks are freed. This means the block
availability of the current AG can change before the first
allocation even occurs, but in practice a failure is more likely to
manifest via a subsequent allocation because extent allocation
usually has a contiguity requirement larger than a single block that
can't be satisfied from the AGFL.

In general, XFS prefers operational robustness to absolute
allocation efficiency. In other words, we prefer to return -ENOSPC
slightly earlier at the expense of not being able to allocate every
last block in an AG to avoid this kind of problem. As such, update
the AG block availability calculation to consider extra AGFL blocks
as unavailable since they are immediately removed following the
calculation and will not become available until the current
transaction commits.
Signed-off-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

1ca89fbc

xfs: shutdown after buf release in iflush cluster abort path · 22fedd80

由 Brian Foster 提交于 4月 12, 2019

If xfs_iflush_cluster() fails due to corruption, the error path
issues a shutdown and simulates an I/O completion to release the
buffer. This code has a couple small problems. First, the shutdown
sequence can issue a synchronous log force, which is unsafe to do
with buffer locks held. Second, the simulated I/O completion does not
guarantee the buffer is async and thus is unlocked and released.

For example, if the last operation on the buffer was a read off disk
prior to the corruption event, XBF_ASYNC is not set and the buffer
is left locked and held upon return. This results in a memory leak
as shown by the following message on module unload:

BUG xfs_buf (...): Objects remaining in xfs_buf on __kmem_cache_shutdown()

Fix both of these problems by setting XBF_ASYNC on the buffer prior
to the simulated I/O error and performing the shutdown immediately
after ioend processing when the buffer has been released.
Signed-off-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

22fedd80

xfs: wake commit waiters on CIL abort before log item abort · 545aa41f

由 Brian Foster 提交于 4月 12, 2019

XFS shutdown deadlocks have been reproduced by fstest generic/475.
The deadlock signature involves log I/O completion running error
handling to abort logged items and waiting for an inode cluster
buffer lock in the buffer item unpin handler. The buffer lock is
held by xfsaild attempting to flush an inode. The buffer happens to
be pinned and so xfs_iflush() triggers an async log force to begin
work required to get it unpinned. The log force is blocked waiting
on the commit completion, which never occurs and thus leaves the
filesystem deadlocked.

The root problem is that aborted log I/O completion pots commit
completion behind callback completion, which is unexpected for async
log forces. Under normal running conditions, an async log force
returns to the caller once the CIL ctx has been formatted/submitted
and the commit completion event triggered at the tail end of
xlog_cil_push(). If the filesystem has shutdown, however, we rely on
xlog_cil_committed() to trigger the completion event and it happens
to do so after running log item unpin callbacks. This makes it
unsafe to invoke an async log force from contexts that hold locks
that might also be required in log completion processing.

To address this problem, wake commit completion waiters before
aborting log items in the log I/O completion handler. This ensures
that an async log force will not deadlock on held locks if the
filesystem happens to shutdown. Note that it is still unsafe to
issue a sync log force while holding such locks because a sync log
force explicitly waits on the force completion, which occurs after
log I/O completion processing.
Signed-off-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

545aa41f

xfs: fix use after free in buf log item unlock assert · 4d09807f

由 Brian Foster 提交于 4月 12, 2019

The xfs_buf_log_item ->iop_unlock() callback asserts that the buffer
is unlocked when either non-stale or aborted. This assert occurs
after the bli refcount has been dropped and the log item potentially
freed. The aborted check is thus a potential use after free. This
problem has been reproduced with KASAN enabled via generic/475.

Fix up xfs_buf_item_unlock() to query aborted state before the bli
reference is dropped to prevent a potential use after free.
Signed-off-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

4d09807f

fs: prevent page refcount overflow in pipe_buf_get · 15fab63e

由 Matthew Wilcox 提交于 4月 05, 2019

Change pipe_buf_get() to return a bool indicating whether it succeeded
in raising the refcount of the page (if the thing in the pipe is a page).
This removes another mechanism for overflowing the page refcount.  All
callers converted to handle a failure.
Reported-by: NJann Horn <jannh@google.com>
Signed-off-by: NMatthew Wilcox <willy@infradead.org>
Cc: stable@kernel.org
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

15fab63e

12 4月, 2019 2 次提交

block: fix the return errno for direct IO · a89afe58

由 Jason Yan 提交于 4月 12, 2019

If the last bio returned is not dio->bio, the status of the bio will
not assigned to dio->bio if it is error. This will cause the whole IO
status wrong.

    ksoftirqd/21-117   [021] ..s.  4017.966090:   8,0    C   N 4883648 [0]
          <idle>-0     [018] ..s.  4017.970888:   8,0    C  WS 4924800 + 1024 [0]
          <idle>-0     [018] ..s.  4017.970909:   8,0    D  WS 4935424 + 1024 [<idle>]
          <idle>-0     [018] ..s.  4017.970924:   8,0    D  WS 4936448 + 321 [<idle>]
    ksoftirqd/21-117   [021] ..s.  4017.995033:   8,0    C   R 4883648 + 336 [65475]
    ksoftirqd/21-117   [021] d.s.  4018.001988: myprobe1: (blkdev_bio_end_io+0x0/0x168) bi_status=7
    ksoftirqd/21-117   [021] d.s.  4018.001992: myprobe: (aio_complete_rw+0x0/0x148) x0=0xffff802f2595ad80 res=0x12a000 res2=0x0

We always have to assign bio->bi_status to dio->bio.bi_status because we
will only check dio->bio.bi_status when we return the whole IO to
the upper layer.

Fixes: 542ff7bf ("block: new direct I/O implementation")
Cc: stable@vger.kernel.org
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Reviewed-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJason Yan <yanaijie@huawei.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a89afe58

NFSv4.1 fix incorrect return value in copy_file_range · 0769663b

由 Olga Kornievskaia 提交于 4月 11, 2019

According to the NFSv4.2 spec if the input and output file is the
same file, operation should fail with EINVAL. However, linux
copy_file_range() system call has no such restrictions. Therefore,
in such case let's return EOPNOTSUPP and allow VFS to fallback
to doing do_splice_direct(). Also when copy_file_range is called
on an NFSv4.0 or 4.1 mount (ie., a server that doesn't support
COPY functionality), we also need to return EOPNOTSUPP and
fallback to a regular copy.

Fixes xfstest generic/075, generic/091, generic/112, generic/263
for all NFSv4.x versions.
Signed-off-by: NOlga Kornievskaia <kolga@netapp.com>
Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>

0769663b

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功