提交 · 5d829300bee000980a09ac2ccb761cb25867b67c · openeuler / Kernel

08 11月, 2016 2 次提交

xfs: provide helper for counting extents from if_bytes · 5d829300

由 Eric Sandeen 提交于 11月 08, 2016

The open-coded pattern:

ifp->if_bytes / (uint)sizeof(xfs_bmbt_rec_t)

is all over the xfs code; provide a new helper
xfs_iext_count(ifp) to count the number of inline extents
in an inode fork.

[dchinner: pick up several missed conversions]
Signed-off-by: NEric Sandeen <sandeen@redhat.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

5d829300

xfs: fix up xfs_swap_extent_forks inline extent handling · 4dfce57d

由 Eric Sandeen 提交于 11月 08, 2016

There have been several reports over the years of NULL pointer
dereferences in xfs_trans_log_inode during xfs_fsr processes,
when the process is doing an fput and tearing down extents
on the temporary inode, something like:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
PID: 29439  TASK: ffff880550584fa0  CPU: 6   COMMAND: "xfs_fsr"
    [exception RIP: xfs_trans_log_inode+0x10]
 #9 [ffff8800a57bbbe0] xfs_bunmapi at ffffffffa037398e [xfs]
#10 [ffff8800a57bbce8] xfs_itruncate_extents at ffffffffa0391b29 [xfs]
#11 [ffff8800a57bbd88] xfs_inactive_truncate at ffffffffa0391d0c [xfs]
#12 [ffff8800a57bbdb8] xfs_inactive at ffffffffa0392508 [xfs]
#13 [ffff8800a57bbdd8] xfs_fs_evict_inode at ffffffffa035907e [xfs]
#14 [ffff8800a57bbe00] evict at ffffffff811e1b67
#15 [ffff8800a57bbe28] iput at ffffffff811e23a5
#16 [ffff8800a57bbe58] dentry_kill at ffffffff811dcfc8
#17 [ffff8800a57bbe88] dput at ffffffff811dd06c
#18 [ffff8800a57bbea8] __fput at ffffffff811c823b
#19 [ffff8800a57bbef0] ____fput at ffffffff811c846e
#20 [ffff8800a57bbf00] task_work_run at ffffffff81093b27
#21 [ffff8800a57bbf30] do_notify_resume at ffffffff81013b0c
#22 [ffff8800a57bbf50] int_signal at ffffffff8161405d

As it turns out, this is because the i_itemp pointer, along
with the d_ops pointer, has been overwritten with zeros
when we tear down the extents during truncate.  When the in-core
inode fork on the temporary inode used by xfs_fsr was originally
set up during the extent swap, we mistakenly looked at di_nextents
to determine whether all extents fit inline, but this misses extents
generated by speculative preallocation; we should be using if_bytes
instead.

This mistake corrupts the in-memory inode, and code in
xfs_iext_remove_inline eventually gets bad inputs, causing
it to memmove and memset incorrect ranges; this became apparent
because the two values in ifp->if_u2.if_inline_ext[1] contained
what should have been in d_ops and i_itemp; they were memmoved due
to incorrect array indexing and then the original locations
were zeroed with memset, again due to an array overrun.

Fix this by properly using i_df.if_bytes to determine the number
of extents, not di_nextents.

Thanks to dchinner for looking at this with me and spotting the
root cause.

Cc: stable@vger.kernel.org
Signed-off-by: NEric Sandeen <sandeen@redhat.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

4dfce57d

06 10月, 2016 7 次提交

xfs: implement swapext for rmap filesystems · 1f08af52

由 Darrick J. Wong 提交于 10月 03, 2016

Implement swapext for filesystems that have reverse mapping.  Back in
the reflink patches, we augmented the bmap code with a 'REMAP' flag
that updates only the bmbt and doesn't touch the allocator and
implemented log redo items for those two operations.  Now we can
rewrite extent swapping as a (looong) series of remap operations.

This is far less efficient than the fork swapping method implemented
in the past, so we only switch this on for rmap.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

1f08af52

xfs: refactor swapext code · 39aff5fd

由 Darrick J. Wong 提交于 10月 03, 2016

Refactor the swapext function to pull out the fork swapping piece
into a separate function.  In the next patch we'll add in the bit
we need to make it work with rmap filesystems.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

39aff5fd

xfs: various swapext cleanups · e06259aa

由 Darrick J. Wong 提交于 10月 03, 2016

Replace structure typedefs with struct expressions and fix some
whitespace issues that result.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

e06259aa

xfs: garbage collect old cowextsz reservations · 83104d44

由 Darrick J. Wong 提交于 10月 03, 2016

Trim CoW reservations made on behalf of a cowextsz hint if they get too
old or we run low on quota, so long as we don't have dirty data awaiting
writeback or directio operations in progress.

Garbage collection of the cowextsize extents are kept separate from
prealloc extent reaping because setting the CoW prealloc lifetime to a
(much) higher value than the regular prealloc extent lifetime has been
useful for combatting CoW fragmentation on VM hosts where the VMs
experience bursty write behaviors and we can keep the utilization ratios
low enough that we don't start to run out of space.  IOWs, it benefits
us to keep the CoW fork reservations around for as long as we can unless
we run out of blocks or hit inode reclaim.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

83104d44

xfs: create a separate cow extent size hint for the allocator · f7ca3522

由 Darrick J. Wong 提交于 10月 03, 2016

Create a per-inode extent size allocator hint for copy-on-write.  This
hint is separate from the existing extent size hint so that CoW can
take advantage of the fragmentation-reducing properties of extent size
hints without disabling delalloc for regular writes.

The extent size hint that's fed to the allocator during a copy on
write operation is the greater of the cowextsize and regular extsize
hint.

During reflink, if we're sharing the entire source file to the entire
destination file and the destination file doesn't already have a
cowextsize hint, propagate the source file's cowextsize hint to the
destination file.

Furthermore, zero the bulkstat buffer prior to setting the fields
so that we don't copy kernel memory contents into userspace.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

f7ca3522

xfs: swap inode reflink flags when swapping inode extents · f0bc4d13

由 Darrick J. Wong 提交于 10月 03, 2016

When we're swapping the extents of two inodes, be sure to swap the
reflink inode flag too.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

f0bc4d13

xfs: teach get_bmapx about shared extents and the CoW fork · f86f4037

由 Darrick J. Wong 提交于 10月 03, 2016

Teach xfs_getbmapx how to report shared extents and CoW fork contents
accurately in the bmap output by querying the refcount btree
appropriately.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

f86f4037

26 9月, 2016 1 次提交

xfs: remote attribute blocks aren't really userdata · 292378ed

由 Dave Chinner 提交于 9月 26, 2016

When adding a new remote attribute, we write the attribute to the
new extent before the allocation transaction is committed. This
means we cannot reuse busy extents as that violates crash
consistency semantics. Hence we currently treat remote attribute
extent allocation like userdata because it has the same overwrite
ordering constraints as userdata.

Unfortunately, this also allows the allocator to incorrectly apply
extent size hints to the remote attribute extent allocation. This
results in interesting failures, such as transaction block
reservation overruns and in-memory inode attribute fork corruption.

To fix this, we need to separate the busy extent reuse configuration
from the userdata configuration. This changes the definition of
XFS_BMAPI_METADATA slightly - it now means that allocation is
metadata and reuse of busy extents is acceptible due to the metadata
ordering semantics of the journal. If this flag is not set, it
means the allocation is that has unordered data writeback, and hence
busy extent reuse is not allowed. It no longer implies the
allocation is for user data, just that the data write will not be
strictly ordered. This matches the semantics for both user data
and remote attribute block allocation.

As such, This patch changes the "userdata" field to a "datatype"
field, and adds a "no busy reuse" flag to the field.
When we detect an unordered data extent allocation, we immediately set
the no reuse flag. We then set the "user data" flags based on the
inode fork we are allocating the extent to. Hence we only set
userdata flags on data fork allocations now and consider attribute
fork remote extents to be an unordered metadata extent.

The result is that remote attribute extents now have the expected
allocation semantics, and the data fork allocation behaviour is
completely unchanged.

It should be noted that there may be other ways to fix this (e.g.
use ordered metadata buffers for the remote attribute extent data
write) but they are more invasive and difficult to validate both
from a design and implementation POV. Hence this patch takes the
simple, obvious route to fixing the problem...
Reported-and-tested-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

292378ed

03 8月, 2016 6 次提交

xfs: disable XFS_IOC_SWAPEXT when rmap btree is enabled · 2b0eeb5e

由 Darrick J. Wong 提交于 8月 03, 2016

Swapping extents between two inodes requires the owner to be updated
in the rmap tree for all the extents that are swapped. This code
does not yet exist, so switch off the XFS_IOC_SWAPEXT ioctl until
support has been implemented. This will need to be done before the
rmap btree code can have the experimental tag removed.

This functionality will be provided in a (much) later patch, using
some of the reflink deferred block remapping functionality to
accomlish extent swapping with rmap updates.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

2b0eeb5e

xfs: propagate bmap updates to rmapbt · 9c194644

由 Darrick J. Wong 提交于 8月 03, 2016

When we map, unmap, or convert an extent in a file's data or attr
fork, schedule a respective update in the rmapbt.  Previous versions
of this patch required a 1:1 correspondence between bmap and rmap,
but this is no longer true as we now have ability to make interval
queries against the rmapbt.

We use the deferred operations code to handle redo operations
atomically and deadlock free.  This plumbs in all five rmap actions
(map, unmap, convert extent, alloc, free); we'll use the first three
now for file data, and reflink will want the last two.  We also add
an error injection site to test log recovery.

Finally, we need to fix the bmap shift extent code to adjust the
rmaps correctly.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

9c194644

xfs: rename flist/free_list to dfops · 2c3234d1

由 Darrick J. Wong 提交于 8月 03, 2016

Mechanical change of flist/free_list to dfops, since they're now
deferred ops, not just a freeing list.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

2c3234d1

xfs: change xfs_bmap_{finish,cancel,init,free} -> xfs_defer_* · 310a75a3

由 Darrick J. Wong 提交于 8月 03, 2016

Drop the compatibility shims that we were using to integrate the new
deferred operation mechanism into the existing code.  No new code.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

310a75a3

xfs: rework xfs_bmap_free callers to use xfs_defer_ops · 3ab78df2

由 Darrick J. Wong 提交于 8月 03, 2016

Restructure everything that used xfs_bmap_free to use xfs_defer_ops
instead.  For now we'll just remove the old symbols and play some
cpp magic to make it work; in the next patch we'll actually rename
everything.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

3ab78df2

xfs: fix locking of the rt bitmap/summary inodes · f4a0660d

由 Darrick J. Wong 提交于 8月 03, 2016

When we're deleting realtime extents, we need to lock the summary
inode in case we need to update the summary info to prevent an assert
on the rsumip inode lock on a debug kernel.  While we're at it, fix
the locking annotations so that we avoid triggering lockdep warnings.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

f4a0660d

21 6月, 2016 3 次提交

xfs: convert list of extents to free into a regular list · e66a4c67

由 Darrick J. Wong 提交于 6月 21, 2016

In struct xfs_bmap_free, convert the open-coded free extent list to
a regular list, then use list_sort to sort it prior to processing.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

e66a4c67

xfs: kill xfs_zero_remaining_bytes · 3c2bdc91

由 Christoph Hellwig 提交于 6月 21, 2016

Instead punch the whole first, and the use the our zeroing helper
to punch out the edge blocks.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

3c2bdc91

xfs: split xfs_free_file_space in manageable pieces · bdb0d04f

由 Christoph Hellwig 提交于 6月 21, 2016

Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

bdb0d04f

01 6月, 2016 2 次提交

xfs: make several functions static · 0d5a75e9

由 Eric Sandeen 提交于 6月 01, 2016

Al Viro noticed that xfs_lock_inodes should be static, and
that led to ... a few more.

These are just the easy ones, others require moving functions
higher in source files, so that's not done here to keep
this review simple.
Signed-off-by: NEric Sandeen <sandeen@sandeen.net>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

0d5a75e9

xfs: remove spurious shutdown type check from xfs_bmap_finish() · 0c871f9a

由 Brian Foster 提交于 6月 01, 2016

The static checker reports that after commit 8d99fe92 ("xfs: fix
efi/efd error handling to avoid fs shutdown hangs"), the code has been
reworked such that error == -EFSCORRUPTED is not possible in this
codepath.

Remove the spurious error check and just use SHUTDOWN_META_IO_ERROR
unconditionally.
Signed-off-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

0c871f9a

19 5月, 2016 1 次提交

dax: use sb_issue_zerout instead of calling dax_clear_sectors · 3dc29161

由 Matthew Wilcox 提交于 3月 15, 2016

dax_clear_sectors() cannot handle poisoned blocks.  These must be
zeroed using the BIO interface instead.  Convert ext2 and XFS to use
only sb_issue_zerout().
Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NMatthew Wilcox <matthew.r.wilcox@intel.com>
[vishal: Also remove the dax_clear_sectors function entirely]
Signed-off-by: NVishal Verma <vishal.l.verma@intel.com>

3dc29161

06 4月, 2016 1 次提交

xfs: better xfs_trans_alloc interface · 253f4911

由 Christoph Hellwig 提交于 4月 06, 2016

Merge xfs_trans_reserve and xfs_trans_alloc into a single function call
that returns a transaction with all the required log and block reservations,
and which allows passing transaction flags directly to avoid the cumbersome
_xfs_trans_alloc interface.

While we're at it we also get rid of the transaction type argument that has
been superflous since we stopped supporting the non-CIL logging mode.  The
guts of it will be removed in another patch.

[dchinner: fixed transaction leak in error path in xfs_setattr_nonsize]
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

253f4911

05 4月, 2016 1 次提交

mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf

由 Kirill A. Shutemov 提交于 4月 01, 2016

PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

09cbfeaf

28 2月, 2016 1 次提交

dax: give DAX clearing code correct bdev · 20a90f58

由 Ross Zwisler 提交于 2月 26, 2016

dax_clear_blocks() needs a valid struct block_device and previously it
was using inode->i_sb->s_bdev in all cases.  This is correct for normal
inodes on mounted ext2, ext4 and XFS filesystems, but is incorrect for
DAX raw block devices and for XFS real-time devices.

Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change
its arguments to take a bdev and a sector instead of an inode and a
block.  This better reflects what the function does, and it allows the
filesystem and raw block device code to pass in an appropriate struct
block_device.
Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: NDan Williams <dan.j.williams@intel.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

20a90f58

09 2月, 2016 1 次提交

xfs: mode di_mode to vfs inode · c19b3b05

由 Dave Chinner 提交于 2月 09, 2016

Move the di_mode value from the xfs_icdinode to the VFS inode, reducing
the xfs_icdinode byte another 2 bytes and collapsing another 2 byte hole
in the structure.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

c19b3b05

08 2月, 2016 1 次提交

xfs: lock rt summary inode on allocation · 4b680afb

由 Dave Chinner 提交于 2月 08, 2016

RT allocation can fail on a debug kernel with:

XFS: Assertion failed: xfs_isilocked(ip, XFS_ILOCK_SHARED|XFS_ILOCK_EXCL), file: fs/xfs/libxfs/xfs_bmap.c, line: 4039

When modifying the summary inode during allocation. This occurs
because the summary inode is never locked, and xfs_bmapi_*
operations expect it to be locked. The summary inode is effectively
protected byt he lock on the bitmap inode, so this really is only a
debug kernel issue.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Tested-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: NEric Sandeen <sandeen@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

4b680afb

11 1月, 2016 1 次提交

xfs: eliminate committed arg from xfs_bmap_finish · f6106efa

由 Eric Sandeen 提交于 1月 11, 2016

Calls to xfs_bmap_finish() and xfs_trans_ijoin(), and the
associated comments were replicated several times across
the attribute code, all dealing with what to do if the
transaction was or wasn't committed.

And in that replicated code, an ASSERT() test of an
uninitialized variable occurs in several locations:

	error = xfs_attr_thing(&args);
	if (!error) {
		error = xfs_bmap_finish(&args.trans, args.flist,
					&committed);
	}
	if (error) {
		ASSERT(committed);

If the first xfs_attr_thing() failed, we'd skip the xfs_bmap_finish,
never set "committed", and then test it in the ASSERT.

Fix this up by moving the committed state internal to xfs_bmap_finish,
and add a new inode argument.  If an inode is passed in, it is passed
through to __xfs_trans_roll() and joined to the transaction there if
the transaction was committed.

xfs_qm_dqalloc() was a little unique in that it called bjoin rather
than ijoin, but as Dave points out we can detect the committed state
but checking whether (*tpp != tp).

Addresses-Coverity-Id: 102360
Addresses-Coverity-Id: 102361
Addresses-Coverity-Id: 102363
Addresses-Coverity-Id: 102364
Signed-off-by: NEric Sandeen <sandeen@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

f6106efa

03 11月, 2015 1 次提交

xfs: introduce BMAPI_ZERO for allocating zeroed extents · 3fbbbea3

由 Dave Chinner 提交于 11月 03, 2015

To enable DAX to do atomic allocation of zeroed extents, we need to
drive the block zeroing deep into the allocator. Because
xfs_bmapi_write() can return merged extents on allocation that were
only partially allocated (i.e. requested range spans allocated and
hole regions, allocation into the hole was contiguous), we cannot
zero the extent returned from xfs_bmapi_write() as that can
overwrite existing data with zeros.

Hence we have to drive the extent zeroing into the allocation code,
prior to where we merge the extents into the BMBT and return the
resultant map. This means we need to propagate this need down to
the xfs_alloc_vextent() and issue the block zeroing at this point.

While this functionality is being introduced for DAX, there is no
reason why it is specific to DAX - we can per-zero blocks during the
allocation transaction on any type of device. It's just slow (and
usually slower than unwritten allocation and conversion) on
traditional block devices so doesn't tend to get used. We can,
however, hook hardware zeroing optimisations via sb_issue_zeroout()
to this operation, so it may be useful in future and hence the
"allocate zeroed blocks" API needs to be implementation neutral.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

3fbbbea3

12 10月, 2015 1 次提交

xfs: pass total block res. as total xfs_bmapi_write() parameter · dbd5c8c9

由 Brian Foster 提交于 10月 12, 2015

The total field from struct xfs_alloc_arg is a bit of an unknown
commodity. It is documented as the total block requirement for the
transaction and is used in this manner from most call sites by virtue of
passing the total block reservation of the transaction associated with
an allocation. Several xfs_bmapi_write() callers pass hardcoded values
of 0 or 1 for the total block requirement, which is a historical oddity
without any clear reasoning.

The xfs_iomap_write_direct() caller, for example, passes 0 for the total
block requirement. This has been determined to cause problems in the
form of ABBA deadlocks of AGF buffers due to incorrect AG selection in
the block allocator. Specifically, the xfs_alloc_space_available()
function incorrectly selects an AG that doesn't actually have sufficient
space for the allocation. This occurs because the args.total field is 0
and thus the remaining free space check on the AG doesn't actually
consider the size of the allocation request. This locks the AGF buffer,
the allocation attempt proceeds and ultimately fails (in
xfs_alloc_fix_minleft()), and xfs_alloc_vexent() moves on to the next
AG. In turn, this can lead to incorrect AG locking order (if the
allocator wraps around, attempting to lock AG 0 after acquiring AG N)
and thus deadlock if racing with another operation. This problem has
been reproduced via generic/299 on smallish (1GB) ramdisk test devices.

To avoid this problem, replace the undocumented hardcoded total
parameters from the iomap and utility callers to pass the block
reservation used for the associated transaction. This is consistent with
other xfs_bmapi_write() callers throughout XFS. The assumption is that
the total field allows the selection of an AG that can handle the entire
operation rather than simply the allocation/range being requested (e.g.,
resulting btree splits, etc.). This addresses the aforementioned
generic/299 hang by ensuring AG selection only occurs when the
allocation can be satisfied by the AG.
Reported-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

dbd5c8c9

19 8月, 2015 3 次提交

xfs: add missing bmap cancel calls in error paths · d4a97a04

由 Brian Foster 提交于 8月 19, 2015

If a failure occurs after the bmap free list is populated and before
xfs_bmap_finish() completes successfully (which returns a partial
list on failure), the bmap free list must be cancelled. Otherwise,
the extent items on the list are never freed and a memory leak
occurs.

Several random error paths throughout the code suffer this problem.
Fix these up such that xfs_bmap_cancel() is always called on error.
Signed-off-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

d4a97a04

xfs: ensure EFD trans aborts on log recovery extent free failure · 6bc43af3

由 Brian Foster 提交于 8月 19, 2015

Log recovery attempts to free extents with leftover EFIs in the AIL
after initial processing. If the extent free fails (e.g., due to
unrelated fs corruption), the transaction is cancelled, though it
might not be dirtied at the time. If this is the case, the EFD does
not abort and thus does not release the EFI. This can lead to hangs
as the EFI pins the AIL.

Update xlog_recover_process_efi() to log the EFD in the transaction
before xfs_free_extent() errors are handled to ensure the
transaction is dirty, aborts the EFD and releases the EFI on error.
Since this is a requirement for EFD processing (and consistent with
xfs_bmap_finish()), update the EFD logging helper to do the extent
free and unconditionally log the EFD. This encodes the required EFD
logging behavior into the helper and reduces the likelihood of
errors down the road.

[dchinner: re-add xfs_alloc.h to xfs_log_recover.c to fix build
 failure.]
Signed-off-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

6bc43af3

xfs: fix efi/efd error handling to avoid fs shutdown hangs · 8d99fe92

由 Brian Foster 提交于 8月 19, 2015

Freeing an extent in XFS involves logging an EFI (extent free
intention), freeing the actual extent, and logging an EFD (extent
free done). The EFI object is created with a reference count of 2:
one for the current transaction and one for the subsequently created
EFD. Under normal circumstances, the first reference is dropped when
the EFI is unpinned and the second reference is dropped when the EFD
is committed to the on-disk log.

In event of errors or filesystem shutdown, there are various
potential cleanup scenarios depending on the state of the EFI/EFD.
The cleanup scenarios are confusing and racy, as demonstrated by the
following test sequence:

	# mount $dev $mnt
	# fsstress -d $mnt -n 99999 -p 16 -z -f fallocate=1 \
		-f punch=1 -f creat=1 -f unlink=1 &
	# sleep 5
	# killall -9 fsstress; wait
	# godown -f $mnt
	# umount

... in which the final umount can hang due to the AIL being pinned
indefinitely by one or more EFI items. This can occur due to several
conditions. For example, if the shutdown occurs after the EFI is
committed to the on-disk log and the EFD committed to the CIL, but
before the EFD committed to the log, the EFD iop_committed() abort
handler does not drop its reference to the EFI. Alternatively,
manual error injection in the xfs_bmap_finish() codepath shows that
if an error occurs after the EFI transaction is committed but before
the EFD is constructed and logged, the EFI is never released from
the AIL.

Update the EFI/EFD item handling code to use a more straightforward
and reliable approach to error handling. If an error occurs after
the EFI transaction is committed and before the EFD is constructed,
release the EFI explicitly from xfs_bmap_finish(). If the EFI
transaction is cancelled, release the EFI in the unlock handler.

Once the EFD is constructed, it is responsible for releasing the EFI
under any circumstances (including whether the EFI item aborts due
to log I/O error). Update the EFD item handlers to release the EFI
if the transaction is cancelled or aborts due to log I/O error.
Finally, update xfs_bmap_finish() to log at least one EFD extent to
the transaction before xfs_free_extent() errors are handled to
ensure the transaction is dirty and EFD item error handling is
triggered.
Signed-off-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

8d99fe92

04 6月, 2015 4 次提交

xfs: saner xfs_trans_commit interface · 70393313

由 Christoph Hellwig 提交于 6月 04, 2015

The flags argument to xfs_trans_commit is not useful for most callers, as
a commit of a transaction without a permanent log reservation must pass
0 here, and all callers for a transaction with a permanent log reservation
except for xfs_trans_roll must pass XFS_TRANS_RELEASE_LOG_RES. So remove
the flags argument from the public xfs_trans_commit interfaces, and
introduce low-level __xfs_trans_commit variant just for xfs_trans_roll
that regrants a log reservation instead of releasing it.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

70393313

xfs: remove the flags argument to xfs_trans_cancel · 4906e215

由 Christoph Hellwig 提交于 6月 04, 2015

xfs_trans_cancel takes two flags arguments: XFS_TRANS_RELEASE_LOG_RES and
XFS_TRANS_ABORT.  Both of them are a direct product of the transaction
state, and can be deducted:

 - any dirty transaction needs XFS_TRANS_ABORT to be properly canceled,
   and XFS_TRANS_ABORT is a noop for a transaction that is not dirty.
 - any transaction with a permanent log reservation needs
   XFS_TRANS_RELEASE_LOG_RES to be properly canceled, and passing
   XFS_TRANS_RELEASE_LOG_RES for a transaction without a permanent
   log reservation is invalid.

So just remove the flags argument and do the right thing.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

4906e215

xfs: switch remaining xfs_trans_dup users to xfs_trans_roll · 2e6db6c4

由 Christoph Hellwig 提交于 6月 04, 2015

We have three remaining callers of xfs_trans_dup:

 - xfs_itruncate_extents which open codes xfs_trans_roll
 - xfs_bmap_finish doesn't have an xfs_inode argument and thus leaves
   attaching them to it's callers, but otherwise is identical to
   xfs_trans_roll
 - xfs_dir_ialloc looks at the log reservations in the old xfs_trans
   structure instead of the log reservation parameters, but otherwise
   is identical to xfs_trans_roll.

By allowing a NULL xfs_inode argument to xfs_trans_roll we can switch
these three remaining users over to xfs_trans_roll and mark xfs_trans_dup
static.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

2e6db6c4

xfs: add DAX block zeroing support · 4f69f578

由 Dave Chinner 提交于 6月 04, 2015

Add initial support for DAX block zeroing operations to XFS. DAX
cannot use buffered IO through the page cache for zeroing, nor do we
need to issue IO for uncached block zeroing. In both cases, we can
simply call out to the dax block zeroing function.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

4f69f578

13 4月, 2015 1 次提交

xfs: xfs_shift_file_space can be static · 72c1a739

由 kbuild test robot 提交于 4月 13, 2015

Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

72c1a739

25 3月, 2015 1 次提交

xfs: Add support FALLOC_FL_INSERT_RANGE for fallocate · a904b1ca

由 Namjae Jeon 提交于 3月 25, 2015

This patch implements fallocate's FALLOC_FL_INSERT_RANGE for XFS.

1) Make sure that both offset and len are block size aligned.
2) Update the i_size of inode by len bytes.
3) Compute the file's logical block number against offset. If the computed
   block number is not the starting block of the extent, split the extent
   such that the block number is the starting block of the extent.
4) Shift all the extents which are lying bewteen [offset, last allocated extent]
   towards right by len bytes. This step will make a hole of len bytes
   at offset.
Signed-off-by: NNamjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: NAshish Sangwan <a.sangwan@samsung.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

a904b1ca

23 2月, 2015 1 次提交

xfs: lock out page faults from extent swap operations · 723cac48

由 Dave Chinner 提交于 2月 23, 2015

Extent swap operations are another extent manipulation operation
that we need to ensure does not race against mmap page faults. The
current code returns if the file is mapped prior to the swap being
done, but it could potentially race against new page faults while
the swap is in progress. Hence we should use the XFS_MMAPLOCK_EXCL
for this operation, too.

While there, fix the error path handling that can result in double
unlocks of the inodes when cancelling the swapext transaction.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

723cac48

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功