提交 · e28ac1fc31299d826b5dbf1d67e74720a95cda48 · openeuler / Kernel

12 1月, 2017 1 次提交

xfs: Timely free truncated dirty pages · 0a417b8d

由 Jan Kara 提交于 1月 11, 2017

Commit 99579cce "xfs: skip dirty pages in ->releasepage()" started
to skip dirty pages in xfs_vm_releasepage() which also has the effect
that if a dirty page is truncated, it does not get freed by
block_invalidatepage() and is lingering in LRU list waiting for reclaim.
So a simple loop like:

while true; do
	dd if=/dev/zero of=file bs=1M count=100
	rm file
done

will keep using more and more memory until we hit low watermarks and
start pagecache reclaim which will eventually reclaim also the truncate
pages. Keeping these truncated (and thus never usable) pages in memory
is just a waste of memory, is unnecessarily stressing page cache
reclaim, and reportedly also leads to anonymous mmap(2) returning ENOMEM
prematurely.

So instead of just skipping dirty pages in xfs_vm_releasepage(), return
to old behavior of skipping them only if they have delalloc or unwritten
buffers and fix the spurious warnings by warning only if the page is
clean.

CC: stable@vger.kernel.org
CC: Brian Foster <bfoster@redhat.com>
CC: Vlastimil Babka <vbabka@suse.cz>
Reported-by: NPetr Tůma <petr.tuma@d3s.mff.cuni.cz>
Fixes: 99579cceSigned-off-by: NJan Kara <jack@suse.cz>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

0a417b8d

10 1月, 2017 5 次提交

xfs: don't print warnings when xfs_log_force fails · 84a4620c

由 Christoph Hellwig 提交于 1月 09, 2017

There are only two reasons for xfs_log_force / xfs_log_force_lsn to fail:
one is an I/O error, for which xlog_bdstrat already logs a warning, and
the second is an already shutdown log due to a previous I/O errors.  In
the latter case we'll already have a previous indication for the actual
error, but the large stream of misleading warnings from xfs_log_force
will probably scroll it out of the message buffer.

Simply removing the warnings thus makes the XFS log reporting significantly
better.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

84a4620c

xfs: don't rely on ->total in xfs_alloc_space_available · 12ef8301

由 Christoph Hellwig 提交于 1月 09, 2017

->total is a bit of an odd parameter passed down to the low-level
allocator all the way from the high-level callers.  It's supposed to
contain the maximum number of blocks to be allocated for the whole
transaction [1].

But in xfs_iomap_write_allocate we only convert existing delayed
allocations and thus only have a minimal block reservation for the
current transaction, so xfs_alloc_space_available can't use it for
the allocation decisions.  Use the maximum of args->total and the
calculated block requirement to make a decision.  We probably should
get rid of args->total eventually and instead apply ->minleft more
broadly, but that will require some extensive changes all over.

[1] which creates lots of confusion as most callers don't decrement it
once doing a first allocation.  But that's for a separate series.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

12ef8301

xfs: adjust allocation length in xfs_alloc_space_available · 54fee133

由 Christoph Hellwig 提交于 1月 09, 2017

We must decide in xfs_alloc_fix_freelist if we can perform an
allocation from a given AG is possible or not based on the available
space, and should not fail the allocation past that point on a
healthy file system.

But currently we have two additional places that second-guess
xfs_alloc_fix_freelist: xfs_alloc_ag_vextent tries to adjust the
maxlen parameter to remove the reservation before doing the
allocation (but ignores the various minium freespace requirements),
and xfs_alloc_fix_minleft tries to fix up the allocated length
after we've found an extent, but ignores the reservations and also
doesn't take the AGFL into account (and thus fails allocations
for not matching minlen in some cases).

Remove all these later fixups and just correct the maxlen argument
inside xfs_alloc_fix_freelist once we have the AGF buffer locked.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

54fee133

xfs: fix bogus minleft manipulations · 255c5162

由 Christoph Hellwig 提交于 1月 09, 2017

We can't just set minleft to 0 when we're low on space - that's exactly
what we need minleft for: to protect space in the AG for btree block
allocations when we are low on free space.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

255c5162

xfs: bump up reserved blocks in xfs_alloc_set_aside · 5149fd32

由 Christoph Hellwig 提交于 1月 09, 2017

Setting aside 4 blocks globally for bmbt splits isn't all that useful,
as different threads can allocate space in parallel.  Bump it to 4
blocks per AG to allow each thread that is currently doing an
allocation to dip into it separately.  Without that we may no have
enough reserved blocks if there are enough parallel transactions
in an almost out space file system that all run into bmap btree
splits.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

5149fd32

04 1月, 2017 4 次提交

xfs: fix max_retries _show and _store functions · ff97f239

由 Carlos Maiolino 提交于 1月 03, 2017

max_retries _show and _store functions should test against cfg->max_retries,
not cfg->retry_timeout
Signed-off-by: NCarlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: NEric Sandeen <sandeen@redhat.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

ff97f239

xfs: fix crash and data corruption due to removal of busy COW extents · a1b7a4de

由 Christoph Hellwig 提交于 1月 03, 2017

There is a race window between write_cache_pages calling
clear_page_dirty_for_io and XFS calling set_page_writeback, in which
the mapping for an inode is tagged neither as dirty, nor as writeback.

If the COW shrinker hits in exactly that window we'll remove the delayed
COW extents and writepages trying to write it back, which in release
kernels will manifest as corruption of the bmap btree, and in debug
kernels will trip the ASSERT about now calling xfs_bmapi_write with the
COWFORK flag for holes. A complex customer load manages to hit this
window fairly reliably, probably by always having COW writeback in flight
while the cow shrinker runs.

This patch adds another check for having the I_DIRTY_PAGES flag set,
which is still set during this race window. While this fixes the problem
I'm still not overly happy about the way the COW shrinker works as it
still seems a bit fragile.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

a1b7a4de

xfs: use the actual AG length when reserving blocks · 20e73b00

由 Darrick J. Wong 提交于 1月 03, 2017

We need to use the actual AG length when making per-AG reservations,
since we could otherwise end up reserving more blocks out of the last
AG than there are actual blocks.
Complained-about-by: NBrian Foster <bfoster@redhat.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

20e73b00

xfs: fix double-cleanup when CUI recovery fails · 7a21272b

由 Darrick J. Wong 提交于 1月 03, 2017

Dan Carpenter reported a double-free of rcur if _defer_finish fails
while we're recovering CUI items.  Fix the error recovery to prevent
this.
Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

7a21272b

25 12月, 2016 1 次提交

Replace <asm/uaccess.h> with <linux/uaccess.h> globally · 7c0f6ba6

由 Linus Torvalds 提交于 12月 24, 2016

This was entirely automated, using the script by Al:

  PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
  sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
        $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.
Requested-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

7c0f6ba6

23 12月, 2016 1 次提交

vfs: fix isize/pos/len checks for reflink & dedupe · 22725ce4

由 Darrick J. Wong 提交于 12月 19, 2016

Strengthen the checking of pos/len vs. i_size, clarify the return values
for the clone prep function, and remove pointless code.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

22725ce4

10 12月, 2016 2 次提交

vfs: refactor clone/dedupe_file_range common functions · 876bec6f

由 Darrick J. Wong 提交于 12月 09, 2016

Hoist both the XFS reflink inode state and preparation code and the XFS
file blocks compare functions into the VFS so that ocfs2 can take
advantage of it for reflink and dedupe.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

876bec6f

fs: try to clone files first in vfs_copy_file_range · a76b5b04

由 Christoph Hellwig 提交于 12月 09, 2016

A clone is a perfectly fine implementation of a file copy, so most
file systems just implement the copy that way. Instead of duplicating
this logic move it to the VFS. Currently btrfs and XFS implement copies
the same way as clones and there is no behavior change for them, cifs
only implements clones and grow support for copy_file_range with this
patch. NFS implements both, so this will allow copy_file_range to work
on servers that only implement CLONE and be lot more efficient on servers
that implements CLONE and COPY.
Signed-off-by: NChristoph Hellwig <hch@lst.de>

a76b5b04

09 12月, 2016 8 次提交

vfs: remove ".readlink = generic_readlink" assignments · dfeef688

由 Miklos Szeredi 提交于 12月 09, 2016

If .readlink == NULL implies generic_readlink().

Generated by:

to_del="\.readlink.*=.*generic_readlink"
for i in `git grep -l $to_del`; do sed -i "/$to_del"/d $i; done
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

dfeef688

vfs: replace calling i_op->readlink with vfs_readlink() · fd4a0edf

由 Miklos Szeredi 提交于 12月 09, 2016

Also check d_is_symlink() in callers instead of inode->i_op->readlink
because following patches will allow NULL ->readlink for symlinks.
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

fd4a0edf

xfs: nuke unused tracepoint definitions · 9875258c

由 Eric Sandeen 提交于 12月 09, 2016

This is all unused code, so remove it.
Signed-off-by: NEric Sandeen <sandeen@redhat.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

9875258c

xfs: use GPF_NOFS when allocating btree cursors · b24a978c

由 Darrick J. Wong 提交于 12月 09, 2016

Use NOFS for allocating btree cursors, since they can be called
under the ilock.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

b24a978c

xfs: use xfs_vn_setattr_size to check on new size · 0c187dc5

由 Eryu Guan 提交于 12月 09, 2016

Commit 65523218 ("xfs: remove i_iolock and use i_rwsem in the
VFS inode instead") introduced a regression that truncate(2) doesn't
check on new size, so it succeeds even if the new size exceeds the
current resource limit. Because xfs_setattr_size() was used instead
of xfs_vn_setattr_size(), and the latter calls xfs_vn_change_ok()
first to do sanity check on permission and new size.

This is found by truncate03 test from ltp, and the following is a
simplified reproducer:

  #!/bin/bash
  dev=/dev/sda5
  mnt=/mnt/xfs

  mkfs -t xfs -f $dev
  mount $dev $mnt

  # set max file size to 16k
  ulimit -f 16
  truncate -s $((16 * 1024 + 1)) /mnt/xfs/testfile
  [ $? -eq 0 ] && echo "FAIL: truncate exceeded max file size"
  ulimit -f unlimited
  umount $mnt
Signed-off-by: NEryu Guan <eguan@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

0c187dc5

xfs: deprecate barrier/nobarrier mount option · 4cf4573d

由 Dave Chinner 提交于 12月 09, 2016

We always perform integrity operations now, so these mount options
don't do anything. Deprecate them and mark them for removal in
in a year.
Signed-Off-By: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

4cf4573d

xfs: Always flush caches when integrity is required · 2291dab2

由 Dave Chinner 提交于 12月 09, 2016

There is no reason anymore for not issuing device integrity
operations when teh filesystem requires ordering or data integrity
guarantees. We should always issue cache flushes and FUA writes
where necessary and let the underlying storage optimise them as
necessary for correct integrity operation.
Signed-Off-By: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

2291dab2

xfs: ignore leaf attr ichdr.count in verifier during log replay · 2e1d2337

由 Eric Sandeen 提交于 12月 09, 2016

When we create a new attribute, we first create a shortform
attribute, and try to fit the new attribute into it.
If that fails, we copy the (empty) attribute into a leaf attribute,
and do the copy again.  Thus there can be a transient state where
we have an empty leaf attribute.

If we encounter this during log replay, the verifier will fail.
So add a test to ignore this part of the leaf attr verification
during log replay.

Thanks as usual to dchinner for spotting the problem.
Signed-off-by: NEric Sandeen <sandeen@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

2e1d2337

07 12月, 2016 1 次提交

xfs: use rhashtable to track buffer cache · 6031e73a

由 Lucas Stach 提交于 12月 07, 2016

On filesystems with a lot of metadata and in metadata intensive workloads
xfs_buf_find() is showing up at the top of the CPU cycles trace. Most of
the CPU time is spent on CPU cache misses while traversing the rbtree.

As the buffer cache does not need any kind of ordering, but fast lookups
a hashtable is the natural data structure to use. The rhashtable
infrastructure provides a self-scaling hashtable implementation and
allows lookups to proceed while the table is going through a resize
operation.

This reduces the CPU-time spent for the lookups to 1/3 even for small
filesystems with a relatively small number of cached buffers, with
possibly much larger gains on higher loaded filesystems.

[dchinner: reduce minimum hash size to an acceptable size for large
	   filesystems with many AGs with no active use.]
[dchinner: remove stale rbtree asserts.]
[dchinner: use xfs_buf_map for compare function argument.]
[dchinner: make functions static.]
[dchinner: remove redundant comments.]
Signed-off-by: NLucas Stach <dev@lynxeye.de>
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

6031e73a

05 12月, 2016 14 次提交

xfs: optimise CRC updates · cae028df

由 Dave Chinner 提交于 12月 05, 2016

Nick Piggin reported that the CRC overhead in an fsync heavy
workload was higher than expected on a Power8 machine. Part of this
was to do with the fact that the power8 CRC implementation is not
efficient for CRC lengths of less than 512 bytes, and so the way we
split the CRCs over the CRC field means a lot of the CRCs are
reduced to being less than than optimal size.

To optimise this, change the CRC update mechanism to zero the CRC
field first, and then compute the CRC in one pass over the buffer
and write the result back into the buffer. We can do this safely
because anything writing a CRC has exclusive access to the buffer
the CRC is being calculated over.

We leave the CRC verify code the same - it still splits the CRC
calculation - because we do not want read-only operations modifying
the underlying buffer. This is because read-only operations may not
have an exclusive access to the buffer guaranteed, and so temporary
modifications could leak out to to other processes accessing the
buffer concurrently.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

cae028df

xfs: make xfs btree stats less huge · 11ef38af

由 Dave Chinner 提交于 12月 05, 2016

Embedding a switch statement in every btree stats inc/add adds a lot
of code overhead to the core btree infrastructure paths. Stats are
supposed to be small and lightweight, but the btree stats have
become big and bloated as we've added more btrees. It needs fixing
because the reflink code will just add more overhead again.

Convert the v2 btree stats to arrays instead of independent
variables, and instead use the type to index the specific btree
array via an enum. This allows us to use array based indexing
to update the stats, rather than having to derefence variables
specific to the btree type.

If we then wrap the xfsstats structure in a union and place uint32_t
array beside it, and calculate the correct btree stats array base
array index when creating a btree cursor,  we can easily access
entries in the stats structure without having to switch names based
on the btree type.

We then replace with the switch statement with a simple set of stats
wrapper macros, resulting in a significant simplification of the
btree stats code, and:

   text	   data	    bss	    dec	    hex	filename
  48905	    144	      8	  49057	   bfa1	fs/xfs/libxfs/xfs_btree.o.old
  36793	    144	      8	  36945	   9051	fs/xfs/libxfs/xfs_btree.o

it reduces the core btree infrastructure code size by close to 25%!
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

11ef38af

xfs: don't cap maximum dedupe request length · 1bb33a98

由 Darrick J. Wong 提交于 12月 05, 2016

After various discussions on linux-fsdevel, it has been decided that it
is not necessary to cap the length of a dedupe request, and that
correctly-written userspace client programs will be able to absorb the
change.  Therefore, remove the length clamping behavior.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

1bb33a98

xfs: don't allow di_size with high bit set · ef388e20

由 Darrick J. Wong 提交于 12月 05, 2016

The on-disk field di_size is used to set i_size, which is a signed
integer of loff_t.  If the high bit of di_size is set, we'll end up with
a negative i_size, which will cause all sorts of problems.  Since the
VFS won't let us create a file with such length, we should catch them
here in the verifier too.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

ef388e20

xfs: error out if trying to add attrs and anextents > 0 · 0f352f8e

由 Darrick J. Wong 提交于 12月 05, 2016

We shouldn't assert if somehow we end up trying to add an attr fork to
an inode that apparently already has attr extents because this is an
indication of on-disk corruption.  Instead, return an error code to
userspace.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

0f352f8e

xfs: don't crash if reading a directory results in an unexpected hole · 96a3aefb

由 Darrick J. Wong 提交于 12月 05, 2016

In xfs_dir3_data_read, we can encounter the situation where err == 0 and
*bpp == NULL if the given bno offset happens to be a hole; this leads to
a crash if we try to set the buffer type after the _da_read_buf call.
Holes can happen due to corrupt or malicious entries in the bmbt data,
so be a little more careful when we're handling buffers.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

96a3aefb

xfs: complain if we don't get nextents bmap records · 356a3225

由 Darrick J. Wong 提交于 12月 05, 2016

When reading into memory all extents of a btree-format inode fork,
complain if the number of extents we find is not the same as the number
of extents reported in the inode core.  This is needed to stop an IO
action from accessing the garbage areas of the in-core fork.

[dchinner: removed redundant assert]
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

356a3225

xfs: check for bogus values in btree block headers · bb3be7e7

由 Darrick J. Wong 提交于 12月 05, 2016

When we're reading a btree block, make sure that what we retrieved
matches the owner and level; and has a plausible number of records.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

bb3be7e7

xfs: forbid AG btrees with level == 0 · d2a047f3

由 Darrick J. Wong 提交于 12月 05, 2016

There is no such thing as a zero-level AG btree since even a single-node
zero-records btree has one level.  Btree cursor constructors read
cur_nlevels straight from disk and then access things like
cur_bufs[cur_nlevels - 1] which is /really/ bad if cur_nlevels is zero!
Therefore, strengthen the verifiers to prevent this possibility.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

d2a047f3

xfs: several xattr functions can be void · f7a136ae