- 12 2月, 2019 36 次提交
-
-
由 Brian Foster 提交于
Most buffer verifiers have hardcoded magic value checks conditionalized on the version of the filesystem. The magic value field of the verifier structure facilitates abstraction of some of this code. Populate the ->magic field of various verifiers to take advantage of this abstraction. No functional changes. Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Brian Foster 提交于
The dir2 leaf verifiers share the same underlying structure verification code, but implement six accessor functions to multiplex the code across the two verifiers. Further, the magic value isn't sufficiently abstracted such that the common helper has to manually fix up the magic from the caller on v5 filesystems. Use the magic field in the verifier structure to eliminate the duplicate code and clean this all up. No functional change. Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Brian Foster 提交于
The allocation btree verifiers share code that is unable to detect cross-tree magic value corruptions such as a bnobt block with a cntbt magic value. Populate the b_ops->magic field of the associated verifier structures such that the structure verifier can check the magic value against the expected value based on tree type. The btree level check requires knowledge of the tree type to determine the appropriate maximum value. This was previously part of the hardcoded magic value checks. With that code removed, peek at the first magic value in the verifier to determine the expected tree type of the current block. Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Brian Foster 提交于
Similar to the inode btree verifier, the same allocation btree verifier structure is shared between the by-bno (bnobt) and by-size (cntbt) btrees. This prevents the ability to distinguish magic values between them. Separate the verifier into two, one for each tree, and assign them appropriately. No functional changes. Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Brian Foster 提交于
The inode btree verifier code is shared between the inode btree and free inode btree because the underlying metadata formats are essentially equivalent. A side effect of this is that the verifier cannot determine whether a particular btree block should have an inobt or finobt magic value. This logic allows an unfortunate xfs_repair bug to escape detection where certain level > 0 nodes of the finobt are stamped with inobt magic by xfs_repair finobt reconstruction. This is fortunately not a severe problem since the inode btree magic values do not contribute to any changes in kernel behavior, but we do need a means to detect and prevent this problem in the future. Add a field to xfs_buf_ops to store the v4 and v5 superblock magic values expected by a particular verifier. Add a helper to check an on-disk magic value against the value expected by the verifier. Call the helper from the shared [f]inobt verifier code for magic value verification. This ensures that the inode btree blocks each have the appropriate magic value based on specific tree type and superblock version. Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Brian Foster 提交于
The inobt verifier is reused for the inobt and finobt, which prevents the ability to distinguish between magic values on a per-tree basis. Create a separate finobt structure in preparation for changes to enforce the appropriate magic value for the associated tree. This patch has no functional change. Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Brian Foster 提交于
Most verifiers that check on-disk magic values convert the CPU endian magic value constant to disk endian to facilitate compile time optimization of the byte swap and reduce the need for runtime byte swaps in buffer verifiers. Several buffer verifiers do not follow this pattern. Update those verifiers for consistency. Also fix up a random typo in the inode readahead verifier name. Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Brian Foster 提交于
Improve the documentation around xfs_buf_ensure_ops, which is the function that is responsible for cleaning up the b_ops state of buffers that go through xrep_findroot_block but don't match anything. Rename the function to xfs_buf_reverify. [darrick: this started off as bfoster mods of a previous patch of mine, but the renaming part is now this separate patch.] Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NBrian Foster <bfoster@redhat.com>
-
由 Darrick J. Wong 提交于
Use a rhashtable to cache the unlinked list incore. This should speed up unlinked processing considerably when there are a lot of inodes on the unlinked list because iunlink_remove no longer has to traverse an entire bucket list to find which inode points to the one being removed. The incore list structure records "X.next_unlinked = Y" relations, with the rhashtable using Y to index the records. This makes finding the inode X that points to a inode Y very quick. If our cache fails to find anything we can always fall back on the old method. FWIW this drastically reduces the amount of time it takes to remove inodes from the unlinked list. I wrote a program to open a lot of O_TMPFILE files and then close them in the same order, which takes a very long time if we have to traverse the unlinked lists. With the ptach, I see: + /d/t/tmpfile/tmpfile Opened 193531 files in 6.33s. Closed 193531 files in 5.86s real 0m12.192s user 0m0.064s sys 0m11.619s + cd / + umount /mnt real 0m0.050s user 0m0.004s sys 0m0.030s And without the patch: + /d/t/tmpfile/tmpfile Opened 193588 files in 6.35s. Closed 193588 files in 751.61s real 12m38.853s user 0m0.084s sys 12m34.470s + cd / + umount /mnt real 0m0.086s user 0m0.000s sys 0m0.060s Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NBrian Foster <bfoster@redhat.com>
-
由 Darrick J. Wong 提交于
Add tracepoints so we can associate high level operations with low level updates. No functional changes. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NBrian Foster <bfoster@redhat.com>
-
由 Darrick J. Wong 提交于
In xfs_iunlink_remove we have two identical calls to xfs_iunlink_update_inode, so move it out of the if statement to simplify the code some more. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NBrian Foster <bfoster@redhat.com>
-
由 Darrick J. Wong 提交于
There's a loop that searches an unlinked bucket list to find the inode that points to a given inode. Hoist this into a separate function; later we'll use our iunlink backref cache to bypass the slow list operation. No functional changes. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NBrian Foster <bfoster@redhat.com>
-
由 Darrick J. Wong 提交于
Hoist the functions that update an inode's unlinked pointer updates into a helper. No functional changes. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NBrian Foster <bfoster@redhat.com>
-
由 Darrick J. Wong 提交于
Strengthen our checking of the AGI unlinked pointers when we start to use them for updating the metadata. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NBrian Foster <bfoster@redhat.com>
-
由 Darrick J. Wong 提交于
Split the AGI unlinked bucket updates into a separate function. No functional changes. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de>
-
由 Darrick J. Wong 提交于
Add a new helper to check that a per-AG inode pointer is either null or points somewhere valid within that AG. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de>
-
由 Darrick J. Wong 提交于
Fix some indentation issues with the iunlink functions and reorganize the tops of the functions to be identical. No functional changes. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de>
-
由 Brian Foster 提交于
The writeback delalloc conversion code is racy with respect to changes in the currently cached file mapping outside of the current page. This is because the ilock is cycled between the time the caller originally looked up the mapping and across each real allocation of the provided file range. This code has collected various hacks over the years to help combat the symptoms of these races (i.e., truncate race detection, allocation into hole detection, etc.), but none address the fundamental problem that the imap may not be valid at allocation time. Rather than continue to use race detection hacks, update writeback delalloc conversion to a model that explicitly converts the delalloc extent backing the current file offset being processed. The current file offset is the only block we can trust to remain once the ilock is dropped because any operation that can remove the block (truncate, hole punch, etc.) must flush and discard pagecache pages first. Modify xfs_iomap_write_allocate() to use the xfs_bmapi_delalloc() mechanism to request allocation of the entire delalloc extent backing the current offset instead of assuming the extent passed by the caller is unchanged. Record the range specified by the caller and apply it to the resulting allocated extent so previous checks by the caller for COW fork overlap are not lost. Finally, overload the bmapi delalloc flag with the range reval flag behavior since this is the only use case for both. This ensures that writeback always picks up the correct and current extent associated with the page, regardless of races with other extent modifying operations. If operating on a data fork and the COW overlap state has changed since the ilock was cycled, the caller revalidates against the COW fork sequence number before using the imap for the next block. Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Brian Foster 提交于
The writeback delalloc conversion code is racy with respect to changes in the currently cached file mapping. This stems from the fact that the bmapi allocation code requires a file range to allocate and the writeback conversion code assumes the range of the currently cached mapping is still valid with respect to the fork. It may not be valid, however, because the ilock is cycled (potentially multiple times) between the time the cached mapping was populated and the delalloc conversion occurs. To facilitate a solution to this problem, create a new xfs_bmapi_delalloc() wrapper to xfs_bmapi_write() that takes a file (FSB) offset and attempts to allocate whatever delalloc extent backs the offset. Use a new bmapi flag to cause xfs_bmapi_write() to set the range based on the extent backing the bno parameter unless bno lands in a hole. If bno does land in a hole, fall back to the current behavior (which may result in an error or quietly skipping holes in the specified range depending on other parameters). This patch does not change behavior. Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Brian Foster 提交于
Now that the cached writeback mapping is explicitly invalidated on data fork changes, the EOF trimming band-aid is no longer necessary. Remove xfs_trim_extent_eof() as well since it has no other users. Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Brian Foster 提交于
The writeback code caches the current extent mapping across multiple xfs_do_writepage() calls to avoid repeated lookups for sequential pages backed by the same extent. This is known to be slightly racy with extent fork changes in certain difficult to reproduce scenarios. The cached extent is trimmed to within EOF to help avoid the most common vector for this problem via speculative preallocation management, but this is a band-aid that does not address the fundamental problem. Now that we have an xfs_ifork sequence counter mechanism used to facilitate COW writeback, we can use the same mechanism to validate consistency between the data fork and cached writeback mappings. On its face, this is somewhat of a big hammer approach because any change to the data fork invalidates any mapping currently cached by a writeback in progress regardless of whether the data fork change overlaps with the range under writeback. In practice, however, the impact of this approach is minimal in most cases. First, data fork changes (delayed allocations) caused by sustained sequential buffered writes are amortized across speculative preallocations. This means that a cached mapping won't be invalidated by each buffered write of a common file copy workload, but rather only on less frequent allocation events. Second, the extent tree is always entirely in-core so an additional lookup of a usable extent mostly costs a shared ilock cycle and in-memory tree lookup. This means that a cached mapping reval is relatively cheap compared to the I/O itself. Third, spurious invalidations don't impact ioend construction. This means that even if the same extent is revalidated multiple times across multiple writepage instances, we still construct and submit the same size ioend (and bio) if the blocks are physically contiguous. Update struct xfs_writepage_ctx with a new field to hold the sequence number of the data fork associated with the currently cached mapping. Check the wpc seqno against the data fork when the mapping is validated and reestablish the mapping whenever the fork has changed since the mapping was cached. This ensures that writeback always uses a valid extent mapping and thus prevents lost writebacks and stale delalloc block problems. Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NAllison Henderson <allison.henderson@oracle.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Brian Foster 提交于
The sequence counter in the xfs_ifork structure is only updated on COW forks. This is because the counter is currently only used to optimize out repetitive COW fork checks at writeback time. Tweak the extent code to update the seq counter regardless of the fork type in preparation for using this counter on data forks as well. Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NAllison Henderson <allison.henderson@oracle.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Marco Benatto 提交于
Currently we have a few PTAGs in place allowing us to transform a filesystem error in a BUG() call. However, we don't have a panic tag for corrupt metadata, so introduce XFS_PTAG_VERIFIER_ERROR so that the administrator can use the fs.xfs.panic_mask sysctl knob to convert any error detected by buffer verifiers into a kernel panic. Signed-off-by: NMarco Benatto <mbenatto@redhat.com> Reviewed-by: NEric Sandeen <sandeen@redhat.com> [darrick: light editing of commit message] Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 YueHaibing 提交于
Remove duplicated include. Signed-off-by: NYueHaibing <yuehaibing@huawei.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Darrick J. Wong 提交于
Check extended attribute entry names for invalid characters. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NBrian Foster <bfoster@redhat.com>
-
由 Darrick J. Wong 提交于
Check directory entry names for invalid characters. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NBrian Foster <bfoster@redhat.com>
-
由 Darrick J. Wong 提交于
Fix an off-by-one error in the realtime bitmap "is used" cross-reference helper function if the realtime extent size is a single block. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NBrian Foster <bfoster@redhat.com>
-
由 Darrick J. Wong 提交于
Teach scrub to flag extent maps that exceed the range that can be mapped with a xfs_dablk_t. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NBrian Foster <bfoster@redhat.com>
-
由 Darrick J. Wong 提交于
The extended attribute scrubber should abort the "read all attrs" loop if there's a fatal signal pending on the process. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NBrian Foster <bfoster@redhat.com>
-
由 Darrick J. Wong 提交于
Move all the confusing dinode mapping code that's split between xchk_iallocbt_check_cluster and xchk_iallocbt_check_cluster_ifree into the first function so that it's clearer how we find the dinode for a given inode. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NBrian Foster <bfoster@redhat.com>
-
由 Darrick J. Wong 提交于
Teach scrub how to handle the case that there are one or more inobt records covering a given inode cluster. This fixes the operation on big block filesystems (e.g. 64k blocks, 512 byte inodes). Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NBrian Foster <bfoster@redhat.com>
-
由 Darrick J. Wong 提交于
The code to check inobt records against inode clusters is a mess of poorly named variables and unnecessary parameters. Clean the unnecessary inode number parameters out of _check_cluster_freemask in favor of computing them inside the function instead of making the caller do it. In xchk_iallocbt_check_cluster, rename the variables to make it more obvious just what chunk_ino and cluster_ino represent. Add a tracepoint to make it easier to track each inode cluster as we scrub it. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NBrian Foster <bfoster@redhat.com>
-
由 Darrick J. Wong 提交于
Hoist the inode cluster checks out of the inobt record check loop into a separate function in preparation for refactoring of that loop. No functional changes here; that's in the next patch. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBrian Foster <bfoster@redhat.com>
-
由 Darrick J. Wong 提交于
On a big block filesystem, there may be multiple inobt records covering a single inode cluster. These records obviously won't be aligned to cluster alignment rules, and they must cover the entire cluster. Teach scrub to check for these things. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBrian Foster <bfoster@redhat.com>
-
由 Darrick J. Wong 提交于
In xchk_iallocbt_rec, check the alignment of ir_startino by converting the inode cluster block alignment into units of inodes instead of the other way around (converting ir_startino to blocks). This prevents us from tripping over off-by-one errors in ir_startino which are obscured by the inode -> block conversion. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NBrian Foster <bfoster@redhat.com>
-
由 Darrick J. Wong 提交于
Make sure we never check more than XFS_INODES_PER_CHUNK inodes for any given inobt record since there can be more than one inobt record mapped to an inode cluster. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NBrian Foster <bfoster@redhat.com>
-
- 04 2月, 2019 3 次提交
-
-
由 Darrick J. Wong 提交于
In xrep_findroot_block, we work out the btree type and correctness of a given block by calling different btree verifiers on root block candidates. However, we leave the NULL b_ops while ->verify_read validates the block, which means that if the verifier calls xfs_buf_verifier_error it'll crash on the null b_ops. Fix it to set b_ops before calling the verifier and unsetting it if the verifier fails. Furthermore, improve the documentation around xfs_buf_ensure_ops, which is the function that is responsible for cleaning up the b_ops state of buffers that go through xrep_findroot_block but don't match anything. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NBrian Foster <bfoster@redhat.com>
-
由 Brian Foster 提交于
As of commit e339dd8d ("xfs: use sync buffer I/O for sync delwri queue submission"), the delwri submission code uses sync buffer I/O for sync delwri I/O. Instead of waiting on async I/O to unlock the buffer, it uses the underlying sync I/O completion mechanism. If delwri buffer submission fails due to a shutdown scenario, an error is set on the buffer and buffer completion never occurs. This can cause xfs_buf_delwri_submit() to deadlock waiting on a completion event. We could check the error state before waiting on such buffers, but that doesn't serialize against the case of an error set via a racing I/O completion. Instead, invoke I/O completion in the shutdown case regardless of buffer I/O type. Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Brian Foster 提交于
The cached writeback mapping is EOF trimmed to try and avoid races between post-eof block management and writeback that result in sending cached data to a stale location. The cached mapping is currently trimmed on the validation check, which leaves a race window between the time the mapping is cached and when it is trimmed against the current inode size. For example, if a new mapping is cached by delalloc conversion on a blocksize == page size fs, we could cycle various locks, perform memory allocations, etc. in the writeback codepath before the associated mapping is eventually trimmed to i_size. This leaves enough time for a post-eof truncate and file append before the cached mapping is trimmed. The former event essentially invalidates a range of the cached mapping and the latter bumps the inode size such the trim on the next writepage event won't trim all of the invalid blocks. fstest generic/464 reproduces this scenario occasionally and causes a lost writeback and stale delalloc blocks warning on inode inactivation. To work around this problem, trim the cached writeback mapping as soon as it is cached in addition to on subsequent validation checks. This is a minor tweak to tighten the race window as much as possible until a proper invalidation mechanism is available. Fixes: 40214d12 ("xfs: trim writepage mapping to within eof") Cc: <stable@vger.kernel.org> # v4.14+ Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NAllison Henderson <allison.henderson@oracle.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
- 30 12月, 2018 1 次提交
-
-
由 Julia Lawall 提交于
Drop LIST_HEAD where the variable it declares is never used. Commit 0410c3bb ("xfs: factor ag btree root block initialisation") stopped using buffer_list and started using a buffer list in an aghdr_init_data structure, but the declaration of buffer_list was not removed. The semantic patch that fixes this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ identifier x; @@ - LIST_HEAD(x); ... when != x // </smpl> Fixes: 0410c3bb ("xfs: factor ag btree root block initialisation") Signed-off-by: NJulia Lawall <Julia.Lawall@lip6.fr> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-