提交 · b870553cdecb26d5291af09602352b763e323df2 · openeuler / raspberrypi-kernel

30 11月, 2012 3 次提交

xfs: fix stray dquot unlock when reclaiming dquots · b870553c

由 Dave Chinner 提交于 11月 28, 2012

When we fail to get a dquot lock during reclaim, we jump to an error
handler that unlocks the dquot. This is wrong as we didn't lock the
dquot, and unlocking it means who-ever is holding the lock has had
it silently taken away, and hence it results in a lock imbalance.

Found by inspection while modifying the code for the numa-lru
patchset. This fixes a random hang I've been seeing on xfstest 232
for the past several months.

cc: <stable@vger.kernel.org>
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NBen Myers <bpm@sgi.com>

b870553c

xfs: fix direct IO nested transaction deadlock. · 437a255a

由 Dave Chinner 提交于 11月 28, 2012

The direct IO path can do a nested transaction reservation when
writing past the EOF. The first transaction is the append
transaction for setting the filesize at IO completion, but we can
also need a transaction for allocation of blocks. If the log is low
on space due to reservations and small log, the append transaction
can be granted after wating for space as the only active transaction
in the system. This then attempts a reservation for an allocation,
which there isn't space in the log for, and the reservation sleeps.
The result is that there is nothing left in the system to wake up
all the processes waiting for log space to come free.

The stack trace that shows this deadlock is relatively innocuous:

 xlog_grant_head_wait
 xlog_grant_head_check
 xfs_log_reserve
 xfs_trans_reserve
 xfs_iomap_write_direct
 __xfs_get_blocks
 xfs_get_blocks_direct
 do_blockdev_direct_IO
 __blockdev_direct_IO
 xfs_vm_direct_IO
 generic_file_direct_write
 xfs_file_dio_aio_writ
 xfs_file_aio_write
 do_sync_write
 vfs_write

This was discovered on a filesystem with a log of only 10MB, and a
log stripe unit of 256k whih increased the base reservations by
512k. Hence a allocation transaction requires 1.2MB of log space to
be available instead of only 260k, and so greatly increased the
chance that there wouldn't be enough log space available for the
nested transaction to succeed. The key to reproducing it is this
mkfs command:

mkfs.xfs -f -d agcount=16,su=256k,sw=12 -l su=256k,size=2560b $SCRATCH_DEV

The test case was a 1000 fsstress processes running with random
freeze and unfreezes every few seconds. Thanks to Eryu Guan
(eguan@redhat.com) for writing the test that found this on a system
with a somewhat unique default configuration....

cc: <stable@vger.kernel.org>
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NAndrew Dahl <adahl@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

437a255a

xfs: byte range granularity for XFS_IOC_ZERO_RANGE · ef9d8733

由 Dave Chinner 提交于 11月 29, 2012

XFS_IOC_ZERO_RANGE simply does not work properly for non page cache
aligned ranges. Neither test 242 or 290 exercise this correctly, so
the behaviour is completely busted even though the tests pass.

Fix it to support full byte range granularity as was originally
intended for this ioctl.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NBen Myers <bpm@sgi.com>

ef9d8733

27 11月, 2012 1 次提交

xfs: inode allocation should use unmapped buffers. · 7c4cebe8

由 Dave Chinner 提交于 11月 23, 2012

Inode buffers do not need to be mapped as inodes are read or written
directly from/to the pages underlying the buffer. This fixes a
regression introduced by commit 611c9946 ("xfs: make XBF_MAPPED the
default behaviour").
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NMark Tinguely <tinguely@sgi.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

7c4cebe8

20 11月, 2012 2 次提交

xfs: add CRC checks to the log · 0e446be4

由 Christoph Hellwig 提交于 11月 12, 2012

Implement CRCs for the log buffers.  We re-use a field in
struct xlog_rec_header that was used for a weak checksum of the
log buffer payload in debug builds before.

The new checksumming uses the crc32c checksum we will use elsewhere
in XFS, and also protects the record header and addition cycle data.

Due to this there are some interesting changes in xlog_sync, as we
need to do the cycle wrapping for the split buffer case much earlier,
as we would touch the buffer after generating the checksum otherwise.

The CRC calculation is always enabled, even for non-CRC filesystems,
as adding this CRC does not change the log format. On non-CRC
filesystems, only issue an alert if a CRC mismatch is found and
allow recovery to continue - this will act as an indicator that
log recovery problems are a result of log corruption. On CRC enabled
filesystems, however, log recovery will fail.

Note that existing debug kernels will write a simple checksum value
to the log, so the first time this is run on a filesystem taht was
last used on a debug kernel it will through CRC mismatch warning
errors. These can be ignored.

Initially based on a patch from Dave Chinner, then modified
significantly by Christoph Hellwig.  Modified again by Dave Chinner
to get to this version.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NMark Tinguely <tinguely@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

0e446be4

xfs: add CRC infrastructure · bc02e869

由 Christoph Hellwig 提交于 11月 16, 2012

 - add a mount feature bit for CRC enabled filesystems
 - add some helpers for generating and verifying the CRCs
 - add a copy_uuid helper

The checksumming helpers are loosely based on similar ones in sctp,
all other bits come from Dave Chinner.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NMark Tinguely <tinguely@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

bc02e869

16 11月, 2012 22 次提交

xfs: convert buffer verifiers to an ops structure. · 1813dd64

由 Dave Chinner 提交于 11月 14, 2012

To separate the verifiers from iodone functions and associate read
and write verifiers at the same time, introduce a buffer verifier
operations structure to the xfs_buf.

This avoids the need for assigning the write verifier, clearing the
iodone function and re-running ioend processing in the read
verifier, and gets rid of the nasty "b_pre_io" name for the write
verifier function pointer. If we ever need to, it will also be
easier to add further content specific callbacks to a buffer with an
ops structure in place.

We also avoid needing to export verifier functions, instead we
can simply export the ops structures for those that are needed
outside the function they are defined in.

This patch also fixes a directory block readahead verifier issue
it exposed.

This patch also adds ops callbacks to the inode/alloc btree blocks
initialised by growfs. These will need more work before they will
work with CRCs.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NPhil White <pwhite@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

1813dd64

xfs: connect up write verifiers to new buffers · b0f539de

由 Dave Chinner 提交于 11月 14, 2012

Metadata buffers that are read from disk have write verifiers
already attached to them, but newly allocated buffers do not. Add
appropriate write verifiers to all new metadata buffers.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NBen Myers <bpm@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

b0f539de

xfs: add pre-write metadata buffer verifier callbacks · 612cfbfe

由 Dave Chinner 提交于 11月 14, 2012

These verifiers are essentially the same code as the read verifiers,
but do not require ioend processing. Hence factor the read verifier
functions and add a new write verifier wrapper that is used as the
callback.

This is done as one large patch for all verifiers rather than one
patch per verifier as the change is largely mechanical. This
includes hooking up the write verifier via the read verifier
function.

Hooking up the write verifier for buffers obtained via
xfs_trans_get_buf() will be done in a separate patch as that touches
code in many different places rather than just the verifier
functions.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NMark Tinguely <tinguely@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

612cfbfe

xfs: add buffer pre-write callback · cfb02852

由 Dave Chinner 提交于 11月 12, 2012

Add a callback to the buffer write path to enable verification of
the buffer and CRC calculation prior to issuing the write to the
underlying storage.

If the callback function detects some kind of failure or error
condition, it must mark the buffer with an error so that the caller
can take appropriate action. In the case of xfs_buf_ioapply(), a
corrupt metadta buffer willt rigger a shutdown of the filesystem,
because something is clearly wrong and we can't allow corrupt
metadata to be written to disk.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NPhil White <pwhite@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

cfb02852

xfs: Add verifiers to dir2 data readahead. · da6958c8

由 Dave Chinner 提交于 11月 12, 2012

Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NPhil White <pwhite@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

da6958c8

xfs: add xfs_da_node verification · d9392a4b

由 Dave Chinner 提交于 11月 12, 2012

Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NPhil White <pwhite@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

d9392a4b

xfs: factor and verify attr leaf reads · ad14c33a

由 Dave Chinner 提交于 11月 12, 2012

Some reads are not converted yet because it isn't obvious ahead of
time what the format of the block is going to be. Need to determine
how to tell if the first block in the tree is a node or leaf format
block. That will be done in later patches.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NPhil White <pwhite@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

ad14c33a

xfs: factor dir2 leaf read · e6f7667c

由 Dave Chinner 提交于 11月 12, 2012

Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NPhil White <pwhite@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

e6f7667c

xfs: factor out dir2 data block reading · e4813572

由 Dave Chinner 提交于 11月 12, 2012

And add a verifier callback function while there.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NPhil White <pwhite@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

e4813572

xfs: factor dir2 free block reading · 2025207c

由 Dave Chinner 提交于 11月 12, 2012

Also factor out the updating of the free block when removing entries
from leaf blocks, and add a verifier callback for reads.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NPhil White <pwhite@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

2025207c

xfs: verify dir2 block format buffers · 82025d7f

由 Dave Chinner 提交于 11月 12, 2012

Add a dir2 block format read verifier. To fully verify every block
when read, call xfs_dir2_data_check() on them. Change
xfs_dir2_data_check() to do runtime checking, convert ASSERT()
checks to XFS_WANT_CORRUPTED_RETURN(), which will trigger an ASSERT
failure on debug kernels, but on production kernels will dump an
error to dmesg and return EFSCORRUPTED to the caller.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NPhil White <pwhite@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

82025d7f

xfs: factor dir2 block read operations · 20f7e9f3

由 Dave Chinner 提交于 11月 12, 2012

In preparation for verifying dir2 block format buffers, factor
the read operations out of the block operations (lookup, addname,
getdents) and some of the additional logic to make it easier to
understand an dmodify the code.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NBen Myers <bpm@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

20f7e9f3

xfs: add verifier callback to directory read code · 4bb20a83

由 Dave Chinner 提交于 11月 12, 2012

Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NPhil White <pwhite@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

4bb20a83

xfs: verify dquot blocks as they are read from disk · c6319198

由 Dave Chinner 提交于 11月 14, 2012

Add a dquot buffer verify callback function and pass it into the
buffer read functions. This checks all the dquots in a buffer, but
cannot completely verify the dquot ids are correct. Also, errors
cannot be repaired, so an additional function is added to repair bad
dquots in the buffer if such an error is detected in a context where
repair is allowed.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NPhil White <pwhite@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

c6319198

xfs: verify btree blocks as they are read from disk · 3d3e6f64

由 Dave Chinner 提交于 11月 12, 2012

Add an btree block verify callback function and pass it into the
buffer read functions. Because each different btree block type
requires different verification, add a function to the ops structure
that is called from the generic code.

Also, propagate the verification callback functions through the
readahead functions, and into the external bmap and bulkstat inode
readahead code that uses the generic btree buffer read functions.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NPhil White <pwhite@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

3d3e6f64

xfs: verify inode buffers as they are read from disk · af133e86

由 Dave Chinner 提交于 11月 12, 2012

Add an inode buffer verify callback function and pass it into the
buffer read functions. Inodes are special in that the verbose checks
will be done when reading the inode, but we still need to sanity
check the buffer when that is first read. Always verify the magic
numbers in all inodes in the buffer, rather than jus ton debug
kernels.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NPhil White <pwhite@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

af133e86

xfs: verify AGFL blocks as they are read from disk · bb80c6d7

由 Dave Chinner 提交于 11月 12, 2012

Add an AGFL block verify callback function and pass it into the
buffer read functions.

While this commit adds verification code to the AGFL, it cannot be
used reliably until the CRC format change comes along as mkfs does
not initialise the full AGFL. Hence it can be full of garbage at the
first mount and will fail verification right now. CRC enabled
filesystems won't have this problem, so leave the code that has
already been written ifdef'd out until the proper time.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NPhil White <pwhite@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

bb80c6d7

xfs: verify AGI blocks as they are read from disk · 3702ce6e

由 Dave Chinner 提交于 11月 12, 2012

Add an AGI block verify callback function and pass it into the
buffer read functions. Remove the now redundant verification code
that is currently in use.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NBen Myers <bpm@sgi.com>

3702ce6e

xfs: verify AGF blocks as they are read from disk · 5d5f527d

由 Dave Chinner 提交于 11月 14, 2012

Add an AGF block verify callback function and pass it into the
buffer read functions. This replaces the existing verification that
is done after the read completes.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NMark Tinguely <tinguely@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

5d5f527d

xfs: verify superblocks as they are read from disk · 98021821

由 Dave Chinner 提交于 11月 12, 2012

Add a superblock verify callback function and pass it into the
buffer read functions. Remove the now redundant verification code
that is currently in use.

Adding verification shows that secondary superblocks never have
their "sb_inprogress" flag cleared by mkfs.xfs, so when validating
the secondary superblocks during a grow operation we have to avoid
checking this field. Even if we fix mkfs, we will still have to
ignore this field for verification purposes unless a version of mkfs
that does not have this bug was used.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NPhil White <pwhite@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

98021821

xfs: uncached buffer reads need to return an error · eab4e633

由 Dave Chinner 提交于 11月 12, 2012

With verification being done as an IO completion callback, different
errors can be returned from a read. Uncached reads only return a
buffer or NULL on failure, which means the verification error cannot
be returned to the caller.

Split the error handling for these reads into two - a failure to get
a buffer will still return NULL, but a read error will return a
referenced buffer with b_error set rather than NULL. The caller is
responsible for checking the error state of the buffer returned.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NPhil White <pwhite@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

eab4e633

xfs: make buffer read verication an IO completion function · c3f8fc73

由 Dave Chinner 提交于 11月 12, 2012

Add a verifier function callback capability to the buffer read
interfaces.  This will be used by the callers to supply a function
that verifies the contents of the buffer when it is read from disk.
This patch does not provide callback functions, but simply modifies
the interfaces to allow them to be called.

The reason for adding this to the read interfaces is that it is very
difficult to tell fom the outside is a buffer was just read from
disk or whether we just pulled it out of cache. Supplying a callbck
allows the buffer cache to use it's internal knowledge of the buffer
to execute it only when the buffer is read from disk.

It is intended that the verifier functions will mark the buffer with
an EFSCORRUPTED error when verification fails. This allows the
reading context to distinguish a verification error from an IO
error, and potentially take further actions on the buffer (e.g.
attempt repair) based on the error reported.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NPhil White <pwhite@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

c3f8fc73

15 11月, 2012 5 次提交

xfs: remove xfs_flushinval_pages · fb595814

由 Dave Chinner 提交于 11月 12, 2012

It's just a simple wrapper around VFS functionality, and is actually
bugging in that it doesn't remove mappings before invalidating the
page cache. Remove it and replace it with the correct VFS
functionality.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NAndrew Dahl <adahl@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

fb595814

xfs: remove xfs_flush_pages · 4bc1ea6b

由 Dave Chinner 提交于 11月 12, 2012

It is a complex wrapper around VFS functions, but there are VFS
functions that provide exactly the same functionality. Call the VFS
functions directly and remove the unnecessary indirection and
complexity.

We don't need to care about clearing the XFS_ITRUNCATED flag, as
that is done during .writepages. Hence is cleared by the VFS
writeback path if there is anything to write back during the flush.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NAndrew Dahl <adahl@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

4bc1ea6b

xfs: remove xfs_wait_on_pages() · 95eacf0f

由 Dave Chinner 提交于 11月 12, 2012

It's just a simple wrapper around a VFS function that is only called
by another function in xfs_fs_subr.c. Remove it and call the VFS
function directly.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NAndrew Dahl <adahl@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

95eacf0f

xfs: reverse the check on XFS_IOC_ZERO_RANGE · d6638ae2

由 Andrew Dahl 提交于 11月 14, 2012

Reversing the check on XFS_IOC_ZERO_RANGE.

Range should be zeroed if the start is less than or equal to the end.
Signed-off-by: NAndrew Dahl <adahl@sgi.com>
Reviewed-by: NMark Tinguely <tinguely@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

d6638ae2

xfs: remove xfs_tosspages · f5b8911b

由 Dave Chinner 提交于 11月 14, 2012

It's a buggy, unnecessary wrapper that is duplicating
truncate_pagecache_range().

When replacing the call in xfs_change_file_space(), also ensure that
the length being allocated/freed is always positive before making
any changes. These checks are done in the lower extent manipulation
functions, too, but we need to do them before any page cache
operations.
Reported-by: NAndrew Dahl <adahl@sgi.com>
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-By: NAndrew Dahl <adahl@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

f5b8911b

14 11月, 2012 7 次提交

xfs: make growfs initialise the AGFL header · de497688

由 Dave Chinner 提交于 11月 12, 2012

For verification purposes, AGFLs need to be initialised to a known
set of values. For upcoming CRC changes, they are also headers that
need to be initialised. Currently, growfs does neither for the AGFLs
- it ignores them completely. Add initialisation of the AGFL to be
full of invalid block numbers (NULLAGBLOCK) to put the
infrastructure in place needed for CRC support.

Includes a comment clarification from Jeff Liu.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by Rich Johnston <rjohnston@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

de497688

xfs: growfs: use uncached buffers for new headers · fd23683c

由 Dave Chinner 提交于 11月 12, 2012

When writing the new AG headers to disk, we can't attach write
verifiers because they have a dependency on the struct xfs-perag
being attached to the buffer to be fully initialised and growfs
can't fully initialise them until later in the process.

The simplest way to avoid this problem is to use uncached buffers
for writing the new headers. These buffers don't have the xfs-perag
attached to them, so it's simple to detect in the write verifier and
be able to skip the checks that need the xfs-perag.

This enables us to attach the appropriate buffer ops to the buffer
and hence calculate CRCs on the way to disk. IT also means that the
buffer is torn down immediately, and so the first access to the AG
headers will re-read the header from disk and perform full
verification of the buffer. This way we also can catch corruptions
due to problems that went undetected in growfs.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by Rich Johnston <rjohnston@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

fd23683c

xfs: use btree block initialisation functions in growfs · b64f3a39

由 Dave Chinner 提交于 11月 13, 2012

Factor xfs_btree_init_block() to be independent of the btree cursor,
and use the function to initialise btree blocks in the growfs code.
This makes adding support for different format btree blocks simple.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by Rich Johnston <rjohnston@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

b64f3a39

xfs: add more attribute tree trace points. · ee73259b

由 Dave Chinner 提交于 11月 12, 2012

Added when debugging recent attribute tree problems to more finely
trace code execution through the maze of twisty passages that makes
up the attr code.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NMark Tinguely <tinguely@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

ee73259b

xfs: drop buffer io reference when a bad bio is built · 37eb17e6

由 Dave Chinner 提交于 11月 12, 2012

Error handling in xfs_buf_ioapply_map() does not handle IO reference
counts correctly. We increment the b_io_remaining count before
building the bio, but then fail to decrement it in the failure case.
This leads to the buffer never running IO completion and releasing
the reference that the IO holds, so at unmount we can leak the
buffer. This leak is captured by this assert failure during unmount:

XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 273

This is not a new bug - the b_io_remaining accounting has had this
problem for a long, long time - it's just very hard to get a
zero length bio being built by this code...

Further, the buffer IO error can be overwritten on a multi-segment
buffer by subsequent bio completions for partial sections of the
buffer. Hence we should only set the buffer error status if the
buffer is not already carrying an error status. This ensures that a
partial IO error on a multi-segment buffer will not be lost. This
part of the problem is a regression, however.

cc: <stable@vger.kernel.org>
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NMark Tinguely <tinguely@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

37eb17e6

xfs: fix broken error handling in xfs_vm_writepage · 7bf7f352

由 Dave Chinner 提交于 11月 12, 2012

When we shut down the filesystem, it might first be detected in
writeback when we are allocating a inode size transaction. This
happens after we have moved all the pages into the writeback state
and unlocked them. Unfortunately, if we fail to set up the
transaction we then abort writeback and try to invalidate the
current page. This then triggers are BUG() in block_invalidatepage()
because we are trying to invalidate an unlocked page.

Fixing this is a bit of a chicken and egg problem - we can't
allocate the transaction until we've clustered all the pages into
the IO and we know the size of it (i.e. whether the last block of
the IO is beyond the current EOF or not). However, we don't want to
hold pages locked for long periods of time, especially while we lock
other pages to cluster them into the write.

To fix this, we need to make a clear delineation in writeback where
errors can only be handled by IO completion processing. That is,
once we have marked a page for writeback and unlocked it, we have to
report errors via IO completion because we've already started the
IO. We may not have submitted any IO, but we've changed the page
state to indicate that it is under IO so we must now use the IO
completion path to report errors.

To do this, add an error field to xfs_submit_ioend() to pass it the
error that occurred during the building on the ioend chain. When
this is non-zero, mark each ioend with the error and call
xfs_finish_ioend() directly rather than building bios. This will
immediately push the ioends through completion processing with the
error that has occurred.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NMark Tinguely <tinguely@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

7bf7f352

xfs: fix attr tree double split corruption · 07428d7f

由 Dave Chinner 提交于 11月 12, 2012

In certain circumstances, a double split of an attribute tree is
needed to insert or replace an attribute. In rare situations, this
can go wrong, leaving the attribute tree corrupted. In this case,
the attr being replaced is the last attr in a leaf node, and the
replacement is larger so doesn't fit in the same leaf node.
When we have the initial condition of a node format attribute
btree with two leaves at index 1 and 2. Call them L1 and L2.  The
leaf L1 is completely full, there is not a single byte of free space
in it. L2 is mostly empty.  The attribute being replaced - call it X
- is the last attribute in L1.

The way an attribute replace is executed is that the replacement
attribute - call it Y - is first inserted into the tree, but has an
INCOMPLETE flag set on it so that list traversals ignore it. Once
this transaction is committed, a second transaction it run to
atomically mark Y as COMPLETE and X as INCOMPLETE, so that a
traversal will now find Y and skip X. Once that transaction is
committed, attribute X is then removed.

So, the initial condition is:

     +--------+     +--------+
     |   L1   |     |   L2   |
     | fwd: 2 |---->| fwd: 0 |
     | bwd: 0 |<----| bwd: 1 |
     | fsp: 0 |     | fsp: N |
     |--------|     |--------|
     | attr A |     | attr 1 |
     |--------|     |--------|
     | attr B |     | attr 2 |
     |--------|     |--------|
     ..........     ..........
     |--------|     |--------|
     | attr X |     | attr n |
     +--------+     +--------+


So now we go to replace X, and see that L1:fsp = 0 - it is full so
we can't insert Y in the same leaf. So we record the the location of
attribute X so we can track it for later use, then we split L1 into
L1 and L3 and reblance across the two leafs. We end with:


     +--------+     +--------+     +--------+
     |   L1   |     |   L3   |     |   L2   |
     | fwd: 3 |---->| fwd: 2 |---->| fwd: 0 |
     | bwd: 0 |<----| bwd: 1 |<----| bwd: 3 |
     | fsp: M |     | fsp: J |     | fsp: N |
     |--------|     |--------|     |--------|
     | attr A |     | attr X |     | attr 1 |
     |--------|     +--------+     |--------|
     | attr B |                    | attr 2 |
     |--------|                    |--------|
     ..........                    ..........
     |--------|                    |--------|
     | attr W |                    | attr n |
     +--------+                    +--------+


And we track that the original attribute is now at L3:0.

We then try to insert Y into L1 again, and find that there isn't
enough room because the new attribute is larger than the old one.
Hence we have to split again to make room for Y. We end up with
this:


     +--------+     +--------+     +--------+     +--------+
     |   L1   |     |   L4   |     |   L3   |     |   L2   |
     | fwd: 4 |---->| fwd: 3 |---->| fwd: 2 |---->| fwd: 0 |
     | bwd: 0 |<----| bwd: 1 |<----| bwd: 4 |<----| bwd: 3 |
     | fsp: M |     | fsp: J |     | fsp: J |     | fsp: N |
     |--------|     |--------|     |--------|     |--------|
     | attr A |     | attr Y |     | attr X |     | attr 1 |
     |--------|     + INCOMP +     +--------+     |--------|
     | attr B |     +--------+                    | attr 2 |
     |--------|                                   |--------|
     ..........                                   ..........
     |--------|                                   |--------|
     | attr W |                                   | attr n |
     +--------+                                   +--------+

And now we have the new (incomplete) attribute @ L4:0, and the
original attribute at L3:0. At this point, the first transaction is
committed, and we move to the flipping of the flags.

This is where we are supposed to end up with this:

     +--------+     +--------+     +--------+     +--------+
     |   L1   |     |   L4   |     |   L3   |     |   L2   |
     | fwd: 4 |---->| fwd: 3 |---->| fwd: 2 |---->| fwd: 0 |
     | bwd: 0 |<----| bwd: 1 |<----| bwd: 4 |<----| bwd: 3 |
     | fsp: M |     | fsp: J |     | fsp: J |     | fsp: N |
     |--------|     |--------|     |--------|     |--------|
     | attr A |     | attr Y |     | attr X |     | attr 1 |
     |--------|     +--------+     + INCOMP +     |--------|
     | attr B |                    +--------+     | attr 2 |
     |--------|                                   |--------|
     ..........                                   ..........
     |--------|                                   |--------|
     | attr W |                                   | attr n |
     +--------+                                   +--------+

But that doesn't happen properly - the attribute tracking indexes
are not pointing to the right locations. What we end up with is both
the old attribute to be removed pointing at L4:0 and the new
attribute at L4:1.  On a debug kernel, this assert fails like so:

XFS: Assertion failed: args->index2 < be16_to_cpu(leaf2->hdr.count), file: fs/xfs/xfs_attr_leaf.c, line: 2725

because the new attribute location does not exist. On a production
kernel, this goes unnoticed and the code proceeds ahead merrily and
removes L4 because it thinks that is the block that is no longer
needed. This leaves the hash index node pointing to entries
L1, L4 and L2, but only blocks L1, L3 and L2 to exist. Further, the
leaf level sibling list is L1 <-> L4 <-> L2, but L4 is now free
space, and so everything is busted. This corruption is caused by the
removal of the old attribute triggering a join - it joins everything
correctly but then frees the wrong block.

xfs_repair will report something like:

bad sibling back pointer for block 4 in attribute fork for inode 131
problem with attribute contents in inode 131
would clear attr fork
bad nblocks 8 for inode 131, would reset to 3
bad anextents 4 for inode 131, would reset to 0

The problem lies in the assignment of the old/new blocks for
tracking purposes when the double leaf split occurs. The first split
tries to place the new attribute inside the current leaf (i.e.
"inleaf == true") and moves the old attribute (X) to the new block.
This sets up the old block/index to L1:X, and newly allocated
block to L3:0. It then moves attr X to the new block and tries to
insert attr Y at the old index. That fails, so it splits again.

With the second split, the rebalance ends up placing the new attr in
the second new block - L4:0 - and this is where the code goes wrong.
What is does is it sets both the new and old block index to the
second new block. Hence it inserts attr Y at the right place (L4:0)
but overwrites the current location of the attr to replace that is
held in the new block index (currently L3:0). It over writes it with
L4:1 - the index we later assert fail on.

Hopefully this table will show this in a foramt that is a bit easier
to understand:

Split		old attr index		new attr index
		vanilla	patched		vanilla	patched
before 1st	L1:26	L1:26		N/A	N/A
after 1st	L3:0	L3:0		L1:26	L1:26
after 2nd	L4:0	L3:0		L4:1	L4:0
                ^^^^			^^^^
		wrong			wrong

The fix is surprisingly simple, for all this analysis - just stop
the rebalance on the out-of leaf case from overwriting the new attr
index - it's already correct for the double split case.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NMark Tinguely <tinguely@sgi.com>
Signed-off-by: NBen Myers <bpm@sgi.com>

07428d7f