1. 04 Oct 2016, 9 commits
  2. 03 Oct 2016, 1 commit
  3. 26 Sep 2016, 7 commits
    • xfs: log recovery tracepoints to track current lsn and buffer submission · 5cd9cee9
      Brian Foster authored
      Log recovery has particular rules around buffer submission along with
      tricky corner cases where independent transactions can share an LSN. As
      such, it can be difficult to follow when/why buffers are submitted
      during recovery.
      
      Add a couple tracepoints to post the current LSN of a record when a new
      record is being processed and when a buffer is being skipped due to LSN
      ordering. Also, update the recover item class to include the LSN of the
      current transaction for the item being processed.
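      
      As an illustration, a minimal sketch of the two hook points described
      above (tracepoint and variable names here are assumptions, not the
      literal patch):
      
      	/* when a new record is pulled from the log, post its LSN */
      	trace_xfs_log_recover_record(log, rhead, pass);
      
      	/* when LSN ordering says a buffer must not be replayed */
      	if (XFS_LSN_CMP(lsn, current_lsn) >= 0)
      		trace_xfs_log_recover_buf_skip(log, buf_f);
      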
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: update metadata LSN in buffers during log recovery · 60a4a222
      Brian Foster authored
      Log recovery is currently broken for v5 superblocks in that it never
      updates the metadata LSN of buffers written out during recovery. The
      metadata LSN is recorded in various bits of metadata to provide recovery
      ordering criteria that prevents transient corruption states reported by
      buffer write verifiers. Without such ordering logic, buffer updates can
      be replayed out of order and lead to false positive transient corruption
      states. This is generally not a corruption vector on its own, but
      corruption detection shuts down the filesystem and ultimately prevents a
      mount if it occurs during log recovery. This requires an xfs_repair run
      that clears the log and potentially loses filesystem updates.
      
      This problem is avoided in most cases as metadata writes during normal
      filesystem operation update the metadata LSN appropriately. The problem
      with log recovery not updating metadata LSNs manifests if the system
      happens to crash shortly after log recovery itself. In this scenario, it
      is possible for log recovery to complete all metadata I/O such that the
      filesystem is consistent. If a crash occurs after that point but before
      the log tail is pushed forward by subsequent operations, however, the
      next mount performs the same log recovery over again. If a buffer is
      updated multiple times in the dirty range of the log, an earlier update
      in the log might not be valid based on the current state of the
      associated buffer after all of the updates in the log had been replayed
      (before the previous crash). If a verifier happens to detect such a
      problem, the filesystem claims corruption and immediately shuts down.
      
      This commonly manifests in practice as directory block verifier failures
      such as the following, likely due to directory verifiers being
      particularly detailed in their checks as compared to most others:
      
        ...
        Mounting V5 Filesystem
        XFS (dm-0): Starting recovery (logdev: internal)
        XFS (dm-0): Internal error XFS_WANT_CORRUPTED_RETURN at line ... of \
          file fs/xfs/libxfs/xfs_dir2_data.c.  Caller xfs_dir3_data_verify ...
        ...
      
      Update log recovery to update the metadata LSN of recovered buffers.
      Since metadata LSNs are already updated by write verifier functions via
      attached log items, attach a dummy log item to the buffer during
      validation and explicitly set the LSN of the current transaction. This
      ensures that the metadata LSN of a buffer is updated based on whether
      the recovery I/O actually completes, and if so, that subsequent recovery
      attempts identify that the buffer is already up to date with respect to
      the current transaction.
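      
      A rough sketch of the approach (field and helper usage are best-effort
      assumptions, not the exact hunk):
      
      	/* attach a dummy log item so the I/O completion path stamps
      	 * the metadata LSN only if the recovery write completes */
      	if (current_lsn != NULLCOMMITLSN) {
      		struct xfs_log_item	*item;
      
      		xfs_buf_item_init(bp, mp);
      		item = bp->b_fspriv;
      		item->li_lsn = current_lsn;
      	}
      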
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: don't warn on buffers not being recovered due to LSN · 040c52c0
      Brian Foster authored
      The log recovery buffer validation function is invoked in cases where a
      buffer update may be skipped due to LSN ordering. If the validation
      function happens to come across directory conversion situations (e.g., a
      dir3 block to data conversion), it may warn about seeing a buffer log
      format of one type and a buffer with a magic number of another.
      
      This warning is not valid as the buffer update is ultimately skipped.
      This is indicated by a current_lsn of NULLCOMMITLSN provided by the
      caller. As such, update xlog_recover_validate_buf_type() to only warn in
      such cases when a buffer update is expected.
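      
      A minimal sketch of the resulting guard (the warnmsg plumbing is an
      illustration):
      
      	/* a skipped buffer update arrives with NULLCOMMITLSN; only
      	 * complain when the buffer will actually be replayed */
      	if (warnmsg && current_lsn != NULLCOMMITLSN)
      		xfs_warn(mp, "%s", warnmsg);
      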
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: pass current lsn to log recovery buffer validation · 22db9af2
      Brian Foster authored
      The current LSN must be available to the buffer validation function to
      provide the ability to update the metadata LSN of the buffer. Pass the
      current_lsn value down to xlog_recover_validate_buf_type() in
      preparation.
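      
      Sketched as a prototype, the plumbing amounts to one new argument
      (the exact signature is an assumption):
      
      	static void
      	xlog_recover_validate_buf_type(
      		struct xfs_mount	*mp,
      		struct xfs_buf		*bp,
      		xfs_buf_log_format_t	*buf_f,
      		xfs_lsn_t		current_lsn);	/* new argument */
      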
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: rework log recovery to submit buffers on LSN boundaries · 12818d24
      Brian Foster authored
      The fix to log recovery to update the metadata LSN in recovered buffers
      introduces the requirement that a buffer is submitted only once per
      current LSN. Log recovery currently submits buffers on transaction
      boundaries. This is not sufficient as the abstraction between log
      records and transactions allows for various scenarios where multiple
      transactions can share the same current LSN. If independent transactions
      share an LSN and both modify the same buffer, log recovery can
      incorrectly skip updates and leave the filesystem in an inconsistent
      state.
      
      In preparation for proper metadata LSN updates during log recovery,
      update log recovery to submit buffers for write on LSN change boundaries
      rather than transaction boundaries. Explicitly track the current LSN in
      a new struct xlog field to handle the various corner cases of when the
      current LSN may or may not change.
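      
      A sketch of the boundary handling (the l_recovery_lsn field name and
      surrounding code are assumptions):
      
      	/* on an LSN change boundary, flush buffers queued under the
      	 * previous LSN before replaying items at the new one */
      	if (log->l_recovery_lsn != current_lsn) {
      		error = xfs_buf_delwri_submit(&buffer_list);
      		if (error)
      			return error;
      		log->l_recovery_lsn = current_lsn;
      	}
      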
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: quiesce the filesystem after recovery on readonly mount · ddeb14f4
      Dave Chinner authored
      Recently we've had a number of reports where log recovery on a v5
      filesystem has reported corruptions that looked to be caused by
      recovery being re-run over the top of already-recovered
      metadata. This uncovered a bug in recovery (fixed elsewhere),
      but the vector that caused it was largely unknown.
      
      A kdump test started tripping over this problem - the system
      would be crashed, the kdump kernel and environment would boot and
      dump the kernel core image, and then the system would reboot. After
      reboot, the root filesystem was triggering log recovery and
      corruptions were being detected. The metadumps indicated the above
      log recovery issue.
      
      What is happening is that the kdump kernel and environment mount
      the root device read-only to find the binaries needed to do their
      work. The result of this is that log recovery is run.
      However, because there were unlinked files and EFIs to be processed
      by recovery, the completion of phase 1 of log recovery could not
      mark the log clean. And because it's a read-only mount, the unmount
      process does not write records to the log to mark it clean, either.
      Hence on the next mount of the filesystem, log recovery was run
      again across all the metadata that had already been recovered and
      this is what triggered corruption warnings.
      
      To avoid this problem, we need to ensure that a read-only mount
      always updates the log when it completes the second phase of
      recovery. We already handle this sort of issue with rw->ro remount
      transitions, so the solution is as simple as quiescing the
      filesystem at the appropriate time during the mount process. This
      results in the log being marked clean so the mount behaviour
      recorded in the logs on repeated RO mounts will change (i.e. log
      recovery will no longer be run on every mount until a RW mount is
      done). This is a user visible change in behaviour, but it is
      harmless.
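      
      A sketch of where the quiesce lands (placement and guard are
      assumptions based on the description above):
      
      	/* after the second phase of recovery on a read-only mount,
      	 * quiesce so the log is marked clean before mount completes */
      	if (mp->m_flags & XFS_MOUNT_RDONLY)
      		xfs_quiesce_attr(mp);
      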
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: remote attribute blocks aren't really userdata · 292378ed
      Dave Chinner authored
      When adding a new remote attribute, we write the attribute to the
      new extent before the allocation transaction is committed. This
      means we cannot reuse busy extents as that violates crash
      consistency semantics. Hence we currently treat remote attribute
      extent allocation like userdata because it has the same overwrite
      ordering constraints as userdata.
      
      Unfortunately, this also allows the allocator to incorrectly apply
      extent size hints to the remote attribute extent allocation. This
      results in interesting failures, such as transaction block
      reservation overruns and in-memory inode attribute fork corruption.
      
      To fix this, we need to separate the busy extent reuse configuration
      from the userdata configuration. This changes the definition of
      XFS_BMAPI_METADATA slightly - it now means that the allocation is
      metadata and reuse of busy extents is acceptable due to the metadata
      ordering semantics of the journal. If this flag is not set, it
      means the allocation has unordered data writeback, and hence
      busy extent reuse is not allowed. It no longer implies the
      allocation is for user data, just that the data write will not be
      strictly ordered. This matches the semantics for both user data
      and remote attribute block allocation.
      
      As such, this patch changes the "userdata" field to a "datatype"
      field, and adds a "no busy reuse" flag to the field.
      When we detect an unordered data extent allocation, we immediately set
      the no reuse flag. We then set the "user data" flags based on the
      inode fork we are allocating the extent to. Hence we only set
      userdata flags on data fork allocations now and consider attribute
      fork remote extents to be an unordered metadata extent.
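      
      Sketched in code, the shape of the change (flag names follow the
      description; exact values and call sites are assumptions):
      
      	args.datatype = XFS_ALLOC_NOBUSY;	/* unordered write: no busy reuse */
      	if (whichfork == XFS_DATA_FORK)
      		args.datatype |= XFS_ALLOC_USERDATA;	/* data fork only */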
      
      The result is that remote attribute extents now have the expected
      allocation semantics, and the data fork allocation behaviour is
      completely unchanged.
      
      It should be noted that there may be other ways to fix this (e.g.
      use ordered metadata buffers for the remote attribute extent data
      write) but they are more invasive and difficult to validate both
      from a design and implementation POV. Hence this patch takes the
      simple, obvious route to fixing the problem...
      Reported-and-tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  4. 19 Sep 2016, 15 commits
  5. 14 Sep 2016, 4 commits
    • xfs: normalize "infinite" retries in error configs · 77169812
      Eric Sandeen authored
      As it stands today, the "fail immediately" vs. "retry forever"
      values for max_retries and retry_timeout_seconds in the xfs metadata
      error configurations are not consistent.
      
      A retry_timeout_seconds of 0 means "retry forever," but a
      max_retries of 0 means "fail immediately."
      
      retry_timeout_seconds < 0 is disallowed, while max_retries == -1
      means "retry forever."
      
      Make this consistent across the error configs, such that a value of
      0 means "fail immediately" (i.e. wait 0 seconds, or retry 0 times),
      and a value of -1 always means "retry forever."
      
      This makes retry_timeout a signed long to accommodate the -1, even
      though it stores jiffies.  Given our limit of a 1 day maximum
      timeout, this should be sufficient even at much higher HZ values
      than we have available today.
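      
      For illustration, the normalized semantics in helper form (this
      helper is not in the patch, just a restatement of the rules):
      
      	/* -1 means retry forever, 0 means fail immediately */
      	static bool retries_exhausted(int max_retries, int tries_so_far)
      	{
      		if (max_retries == -1)
      			return false;
      		return tries_so_far >= max_retries;
      	}
      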
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: fix signed integer overflow · 79c350e4
      Xie XiuQi authored
      Use 1U for unsigned int to avoid an overflow warning from UBSAN.
      
      [   31.910858] UBSAN: Undefined behaviour in fs/xfs/xfs_buf_item.c:889:25
      [   31.911252] signed integer overflow:
      [   31.911478] -2147483648 - 1 cannot be represented in type 'int'
      [   31.911846] CPU: 1 PID: 1011 Comm: tuned Tainted: G    B          ---- -------   3.10.0-327.28.3.el7.x86_64 #1
      [   31.911857] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 01/07/2011
      [   31.911866]  1ffff1004069cd3b 0000000076bec3fd ffff8802034e69a0 ffffffff81ee3140
      [   31.911883]  ffff8802034e69b8 ffffffff81ee31fd ffffffffa0ad79e0 ffff8802034e6b20
      [   31.911898]  ffffffff81ee46e2 0000002d515470c0 0000000000000001 0000000041b58ab3
      [   31.911913] Call Trace:
      [   31.911932]  [<ffffffff81ee3140>] dump_stack+0x1e/0x20
      [   31.911947]  [<ffffffff81ee31fd>] ubsan_epilogue+0x12/0x55
      [   31.911964]  [<ffffffff81ee46e2>] handle_overflow+0x1ba/0x215
      [   31.912083]  [<ffffffff81ee4798>] __ubsan_handle_sub_overflow+0x2a/0x31
      [   31.912204]  [<ffffffffa08676fb>] xfs_buf_item_log+0x34b/0x3f0 [xfs]
      [   31.912314]  [<ffffffffa0880490>] xfs_trans_log_buf+0x120/0x260 [xfs]
      [   31.912402]  [<ffffffffa079a890>] xfs_btree_log_recs+0x80/0xc0 [xfs]
      [   31.912490]  [<ffffffffa07a29f8>] xfs_btree_delrec+0x11a8/0x2d50 [xfs]
      [   31.913589]  [<ffffffffa07a86f9>] xfs_btree_delete+0xc9/0x260 [xfs]
      [   31.913762]  [<ffffffffa075b5cf>] xfs_free_ag_extent+0x63f/0xe20 [xfs]
      [   31.914339]  [<ffffffffa075ec0f>] xfs_free_extent+0x2af/0x3e0 [xfs]
      [   31.914641]  [<ffffffffa0801b2b>] xfs_bmap_finish+0x32b/0x4b0 [xfs]
      [   31.914841]  [<ffffffffa083c2e7>] xfs_itruncate_extents+0x3b7/0x740 [xfs]
      [   31.915216]  [<ffffffffa08342fa>] xfs_setattr_size+0x60a/0x860 [xfs]
      [   31.915471]  [<ffffffffa08345ea>] xfs_vn_setattr+0x9a/0xe0 [xfs]
      [   31.915590]  [<ffffffff8149ad38>] notify_change+0x5c8/0x8a0
      [   31.915607]  [<ffffffff81450f22>] do_truncate+0x122/0x1d0
      [   31.915640]  [<ffffffff8147beee>] do_last+0x15de/0x2c80
      [   31.915707]  [<ffffffff8147d777>] path_openat+0x1e7/0xcc0
      [   31.915802]  [<ffffffff81480824>] do_filp_open+0xa4/0x160
      [   31.915848]  [<ffffffff81453127>] do_sys_open+0x1b7/0x3f0
      [   31.915879]  [<ffffffff81453392>] SyS_open+0x32/0x40
      [   31.915897]  [<ffffffff81f08989>] system_call_fastpath+0x16/0x1b
      
      [  240.086809] UBSAN: Undefined behaviour in fs/xfs/xfs_buf_item.c:866:34
      [  240.086820] signed integer overflow:
      [  240.086830] -2147483648 - 1 cannot be represented in type 'int'
      [  240.086846] CPU: 1 PID: 12969 Comm: rm Tainted: G    B          ---- -------   3.10.0-327.28.3.el7.x86_64 #1
      [  240.086857] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 01/07/2011
      [  240.086868]  1ffff10040491def 00000000e2ea59c1 ffff88020248ef40 ffffffff81ee3140
      [  240.086885]  ffff88020248ef58 ffffffff81ee31fd ffffffffa0ad79e0 ffff88020248f0c0
      [  240.086901]  ffffffff81ee46e2 0000002d02488000 0000000000000001 0000000041b58ab3
      [  240.086915] Call Trace:
      [  240.086938]  [<ffffffff81ee3140>] dump_stack+0x1e/0x20
      [  240.086953]  [<ffffffff81ee31fd>] ubsan_epilogue+0x12/0x55
      [  240.086971]  [<ffffffff81ee46e2>] handle_overflow+0x1ba/0x215
      ...
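      
      The pattern behind the warning is a 31-bit shift of a signed 1; a
      minimal sketch of the class of fix (the exact expressions in
      xfs_buf_item_log may differ):
      
      	int	bit = 31;
      	uint	mask;
      
      	/* before: (1 << 31) is INT_MIN and subtracting 1 from it is
      	 * signed overflow, i.e. undefined behaviour */
      	mask = (1 << bit) - 1;
      	/* after: do the shift and subtraction in unsigned arithmetic */
      	mask = (1U << bit) - 1;
      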
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • Make __xfs_xattr_put_listent properly report errors · 791cc43b
      Artem Savkov authored
      Commit 2a6fba6d "xfs: only return -errno or success from attr ->put_listent"
      changes the return value of __xfs_xattr_put_listent to 0 when there is
      insufficient space in the buffer, assuming that setting context->count to -1
      would be enough, but all of the ->put_listent callers only check seen_enough.
      This results in a failed assertion in the insufficient-buffer-size case:
      XFS: Assertion failed: context->count >= 0, file: fs/xfs/xfs_xattr.c, line: 175
      
      This is only reproducible with at least 2 xattrs and only when the buffer
      gets depleted before the last one.
      
      Furthermore, if the buffer size is enough to hold the last xattr's
      name but not the sum of the preceding xattr names, listxattr won't
      fail with ERANGE but will succeed, returning the last xattr's name
      without its first character. That first character ends up
      overwriting data stored at (context->alist - 1).
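      
      A sketch of the corrected bail-out (field names follow the report
      above; treat this as an illustration, not the exact hunk):
      
      	/* insufficient space: flag the error and stop the iteration,
      	 * since callers test seen_enough rather than the return value */
      	if (arraytop > context->firstu) {
      		context->count = -1;	/* surfaced as -ERANGE */
      		context->seen_enough = 1;
      		return 0;
      	}
      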
      Signed-off-by: Artem Savkov <asavkov@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: undo block reservation correctly in xfs_trans_reserve() · a27f6ef4
      Eryu Guan authored
      "blocks" should be added back to fdblocks at undo time, not taken
      away, i.e. the minus sign should not be used.
      
      This is a regression introduced by commit 0d485ada ("xfs: use
      generic percpu counters for free block counter"). It was found by
      code inspection; I didn't hit it in the real world, so there's no
      reproducer.
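      
      The fix, sketched (the surrounding undo path is paraphrased):
      
      	/* undo_blocks: give the reservation back to fdblocks; the
      	 * regression passed -((int64_t)blocks), taking them away again */
      	xfs_mod_fdblocks(tp->t_mountp, (int64_t)blocks, rsvd);
      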
      Signed-off-by: Eryu Guan <eguan@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  6. 30 Aug 2016, 1 commit
    • xfs: track log done items directly in the deferred pending work item · ea78d808
      Darrick J. Wong authored
      Christoph reports slab corruption when a deferred refcount update
      aborts during _defer_finish().  The cause of this was broken log item
      state tracking in xfs_defer_pending -- upon an abort,
      _defer_trans_abort() will call abort_intent on all intent items,
      including the ones that have already had a done item attached.
      
      This is incorrect because each intent item has two refcounts: the first
      is released when the intent item is committed to the log; and the
      second is released when the _done_ item is committed to the log, or
      by the intent creator if there is no done item.  In other words, once
      we log the done item, responsibility for releasing the intent item's
      second refcount is transferred to the done item and /must not/ be
      performed by anything else.
      
      The dfp_committed flag should have been tracking whether or not we had
      a done item so that _defer_trans_abort could decide if it needs to
      abort the intent item, but due to a thinko this was not the case.  Rip
      it out and track the done item directly so that we do the right thing
      w.r.t. intent item freeing.
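      
      A sketch of the reworked abort decision (field names are assumptions
      drawn from the description):
      
      	/* only intents that never had a done item logged still own
      	 * the intent item's second reference */
      	if (dfp->dfp_intent && !dfp->dfp_done)
      		dfp->dfp_type->abort_intent(dfp->dfp_intent);
      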
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reported-by: Christoph Hellwig <hch@infradead.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  7. 26 Aug 2016, 3 commits
    • xfs: prevent dropping ioend completions during buftarg wait · 800b2694
      Brian Foster authored
      xfs_wait_buftarg() waits for all pending I/O, drains the ioend
      completion workqueue and walks the LRU until all buffers in the cache
      have been released. This is traditionally an unmount operation, but the
      mechanism is also reused during filesystem freeze.
      
      xfs_wait_buftarg() invokes drain_workqueue() as part of the quiesce,
      which is intended more for a shutdown sequence in that it indicates to
      the queue that new operations are not expected once the drain has begun.
      New work jobs after this point result in a WARN_ON_ONCE() and are
      otherwise dropped.
      
      With filesystem freeze, however, read operations are allowed and can
      proceed during or after the workqueue drain. If such a read occurs
      during the drain sequence, the workqueue infrastructure complains about
      the queued ioend completion work item and drops it on the floor. As a
      result, the buffer remains on the LRU and the freeze never completes.
      
      Despite the fact that the overall buffer cache cleanup is not necessary
      during freeze, fix up this operation such that it is safe to invoke
      during non-unmount quiesce operations. Replace the drain_workqueue()
      call with flush_workqueue(), which runs a similar serialization on
      pending workqueue jobs without causing new jobs to be dropped. This is
      safe for unmount as unmount independently locks out new operations by
      the time xfs_wait_buftarg() is invoked.
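      
      The change itself, sketched (the workqueue pointer is an assumption):
      
      	/* flush waits for queued ioend completions but, unlike drain,
      	 * neither warns about nor drops work queued while it runs */
      	flush_workqueue(btp->bt_mount->m_buf_workqueue);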
      
      cc: <stable@vger.kernel.org>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: fix superblock inprogress check · f3d7ebde
      Dave Chinner authored
      From inspection, the superblock sb_inprogress check is done in the
      verifier and triggered only for the primary superblock via a
      "bp->b_bn == XFS_SB_DADDR" check.
      
      Unfortunately, the primary superblock is an uncached buffer, and
      hence it is configured by xfs_buf_read_uncached() with:
      
      	bp->b_bn = XFS_BUF_DADDR_NULL;  /* always null for uncached buffers */
      
      And so this check never triggers. Fix it.
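      
      A sketch of the corrected predicate (the exact form in the verifier
      is an assumption):
      
      	/* treat uncached reads as the primary superblock too */
      	if (sb->sb_inprogress &&
      	    (bp->b_bn == XFS_SB_DADDR || bp->b_bn == XFS_BUF_DADDR_NULL))
      		return -EFSCORRUPTED;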
      
      cc: <stable@vger.kernel.org>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: simple btree query range should look right if LE lookup fails · 5b5c2dbd
      Darrick J. Wong authored
      If the initial LOOKUP_LE in the simple query range fails to find
      anything, we should attempt to increment the btree cursor to see
      if there actually /are/ records for what we're trying to find.
      Without this patch, a bnobt range query of (0, $agsize) returns
      no results because the leftmost record never has a startblock
      of zero.
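      
      A sketch of the added step (cursor setup and record retrieval
      abbreviated):
      
      	/* if LOOKUP_LE finds nothing, look right once before
      	 * concluding the range is empty */
      	error = xfs_btree_lookup(cur, XFS_LOOKUP_LE, &stat);
      	if (error)
      		goto out;
      	if (!stat) {
      		error = xfs_btree_increment(cur, 0, &stat);
      		if (error)
      			goto out;
      	}
      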
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>