Commits · ed5c3e66a32883e2b3d119d358d23fd5990dc9c2 · openeuler / Kernel

10 May, 2018 6 commits

xfs: move generic_write_sync calls inwards · ed5c3e66

Committed by Dave Chinner on May 02, 2018

To prepare for iomap infrastructure based DSYNC optimisations.

While moving the code around, move the XFS write bytes metric
update for direct IO into the xfs_dio_write_end_io callback so that we
always capture the amount of data written via AIO+DIO. This fixes
the problem where queued AIO+DIO writes are not accounted to this
metric.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

xfs: don't retry xfs_buf_find on XBF_TRYLOCK failure · b027d4c9

Committed by Dave Chinner on Apr 18, 2018

When looking at an event trace recently, I noticed that non-blocking
buffer lookup attempts would fail on cached locked buffers and then
run the slow cache-miss path. This means we are doing an xfs_buf
allocation, lookup and free unnecessarily every time we avoid
blocking on a locked buffer.

Fix this by changing _xfs_buf_find() to return an error status to
the caller to indicate that we failed the lock attempt rather than
just returning a NULL. This allows the higher level code to
discriminate between a cache miss and a cache hit that we failed to
lock.
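The distinction between a miss and a failed lock attempt can be sketched in user-space C. This is only an illustration under stated assumptions: `struct demo_buf`, `demo_buf_find_trylock()` and `DEMO_EFSCORRUPTED` are hypothetical names, the "cache" is a plain array, and the specific error codes chosen here are not taken from the commit text. The sketch also rejects out-of-range block numbers with a corruption code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative value only; the kernel defines EFSCORRUPTED as EUCLEAN. */
#define DEMO_EFSCORRUPTED 117

/* Hypothetical stand-in for a cached buffer, not the kernel's struct xfs_buf. */
struct demo_buf {
	unsigned long blkno;	/* block number the buffer caches */
	bool locked;		/* true while some holder owns the buffer lock */
};

/*
 * Non-blocking lookup of blkno in a small array standing in for the cache.
 *
 * Returns 0 and stores the now-locked buffer in *bpp on success,
 * -ENOENT on a cache miss, -EAGAIN if the buffer is cached but its lock
 * is already held, and -DEMO_EFSCORRUPTED if blkno is beyond fs_blocks.
 */
int demo_buf_find_trylock(struct demo_buf *cache, size_t nr,
			  unsigned long fs_blocks, unsigned long blkno,
			  struct demo_buf **bpp)
{
	size_t i;

	*bpp = NULL;
	if (blkno >= fs_blocks)
		return -DEMO_EFSCORRUPTED;

	for (i = 0; i < nr; i++) {
		if (cache[i].blkno != blkno)
			continue;
		if (cache[i].locked)
			return -EAGAIN;	/* hit, but we would have to block */
		cache[i].locked = true;
		*bpp = &cache[i];
		return 0;
	}
	return -ENOENT;			/* genuine miss: caller may allocate */
}
```

A caller that receives `-EAGAIN` can give up immediately, while `-ENOENT` is the only result that justifies the allocate-and-insert slow path.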

This also allows us to return a -EFSCORRUPTED state if we are asked
to look up a block number outside the range of the filesystem in
_xfs_buf_find(), which moves us one step closer to being able to
handle such errors in a more graceful manner at the higher levels.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

xfs: make xfs_buf_incore out of line · 8925a3dc

Committed by Dave Chinner on Apr 18, 2018

Move xfs_buf_incore out of line and make it the only way to look up
a buffer in the buffer cache from outside the buffer cache. Convert
the external users of _xfs_buf_find() to xfs_buf_incore() and make
_xfs_buf_find() static.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: actually rename xfs_incore -> xfs_buf_incore]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

xfs: trace ATTR flags in xattr tracepoints · e443523d

Committed by Eric Sandeen on Apr 17, 2018

This will trace e.g. the ATTR_SECURE/ATTR_CREATE/ATTR_REPLACE
flags as well as the OP_FLAGS.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

xfs: validate allocated inode number · 8b26984d

Committed by Dave Chinner on Apr 17, 2018

When we have corrupted free inode btrees, we can attempt to
allocate inodes that we know are already allocated. Catch allocation
of these inodes and report corruption as early as possible to
prevent corruption propagation or deadlocks.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

xfs: validate cached inodes are free when allocated · afca6c5b

Committed by Dave Chinner on Apr 17, 2018

A recent fuzzed filesystem image caused random dcache corruption
when the reproducer was run. This often showed up as panics in
lookup_slow() on a null inode->i_ops pointer when doing pathwalks.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
....
Call Trace:
 lookup_slow+0x44/0x60
 walk_component+0x3dd/0x9f0
 link_path_walk+0x4a7/0x830
 path_lookupat+0xc1/0x470
 filename_lookup+0x129/0x270
 user_path_at_empty+0x36/0x40
 path_listxattr+0x98/0x110
 SyS_listxattr+0x13/0x20
 do_syscall_64+0xf5/0x280
 entry_SYSCALL_64_after_hwframe+0x42/0xb7

but had many different failure modes including deadlocks trying to
lock the inode that was just allocated or KASAN reports of
use-after-free violations.

The cause of the problem was a corrupt INOBT on a v4 fs where the
root inode was marked as free in the inobt record. Hence when we
allocated an inode, it chose the root inode to allocate, found it in
the cache and re-initialised it.

We recently fixed a similar inode allocation issue, caused by inobt
record corruption, in xfs_iget_cache_miss() in commit
ee457001 ("xfs: catch inode allocation state mismatch
corruption"). This change adds similar checks to the cache-hit path
to catch it, and turns the reproducer into a corruption shutdown
situation.

Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix typos in comment]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

03 May, 2018 1 commit

xfs: cap the length of deduplication requests · 021ba8e9

Committed by Darrick J. Wong on Apr 16, 2018

Since deduplication potentially has to read in all the pages in both
files in order to compare the contents, cap the deduplication request
length at MAX_RW_COUNT/2 (roughly 1GB) so that we have /some/ upper bound
on the request length and can't just lock up the kernel forever.  Found
by running generic/304 after commit 1ddae54555b62 ("common/rc: add
missing 'local' keywords").
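As a rough illustration of the cap described above, the following sketch clamps a request length to half of a MAX_RW_COUNT-style limit. The `DEMO_*` names and `demo_cap_dedupe_len()` are hypothetical, and a 4 KiB page size is assumed. The limit is modelled on the kernel's `(INT_MAX & PAGE_MASK)` definition of `MAX_RW_COUNT`. How the actual patch applies the cap is not shown in the text.

```c
#include <limits.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. */
#define DEMO_PAGE_SIZE		4096ULL
#define DEMO_PAGE_MASK		(~(DEMO_PAGE_SIZE - 1))

/* Modelled on the kernel's MAX_RW_COUNT, i.e. (INT_MAX & PAGE_MASK). */
#define DEMO_MAX_RW_COUNT	((uint64_t)INT_MAX & DEMO_PAGE_MASK)

/* Clamp a dedupe request length to half of the read/write limit. */
uint64_t demo_cap_dedupe_len(uint64_t len)
{
	const uint64_t cap = DEMO_MAX_RW_COUNT / 2;

	return len > cap ? cap : len;
}
```

With a 32-bit `int`, the cap works out to slightly under 1 GiB, which matches the "roughly 1GB" in the message.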

Reported-by: matorola@gmail.com
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>

18 Apr, 2018 4 commits

xfs: don't fail when converting shortform attr to long form during ATTR_REPLACE · 7b38460d

Committed by Darrick J. Wong on Apr 17, 2018

Kanda Motohiro reported that expanding a tiny xattr into a large xattr
fails on XFS because we remove the tiny xattr from a shortform fork and
then try to re-add it after converting the fork to extents format, having
not removed the ATTR_REPLACE flag.  This fails because the attr is no
longer present, causing a fs shutdown.

This is derived from the patch in his bug report, but we really
shouldn't ignore a nonzero retval from the remove call.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199119
Reported-by: kanda.motohiro@gmail.com
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

xfs: prevent creating negative-sized file via INSERT_RANGE · 7d83fb14

Committed by Darrick J. Wong on Apr 16, 2018

During the "insert range" fallocate operation, i_size grows by the
specified 'len' bytes.  XFS verifies that i_size + len < s_maxbytes, as
it should.  But this comparison is done using the signed 'loff_t', and
'i_size + len' can wrap around to a negative value, causing the check to
incorrectly pass, resulting in an inode with "negative" i_size.  This is
possible on 64-bit platforms, where XFS sets s_maxbytes = LLONG_MAX.
ext4 and f2fs don't run into this because they set a smaller s_maxbytes.

Fix it by using subtraction instead.

Reproducer:
    xfs_io -f file -c "truncate $(((1<<63)-1))" -c "finsert 0 4096"
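The difference between the additive check and the subtractive one can be shown in a small user-space sketch. Both function names are hypothetical. The buggy variant performs the addition in unsigned arithmetic purely so that the example models the wraparound without relying on signed overflow, which is undefined in C. The strict `<` follows the wording of the message above and is not taken from the patch itself.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Buggy form: add first, compare second. The conversion of the wrapped
 * unsigned sum back to int64_t is implementation-defined and yields a
 * negative value on common compilers, so the check passes by mistake.
 */
bool demo_insert_allowed_buggy(int64_t size, int64_t len, int64_t maxbytes)
{
	int64_t sum = (int64_t)((uint64_t)size + (uint64_t)len);

	return sum < maxbytes;
}

/*
 * Fixed form: compare len against the remaining headroom. With
 * 0 <= size <= maxbytes the subtraction cannot overflow.
 */
bool demo_insert_allowed(int64_t size, int64_t len, int64_t maxbytes)
{
	if (size < 0 || len < 0 || size > maxbytes)
		return false;
	return len < maxbytes - size;
}
```

Feeding both functions the reproducer's values (a size of 2^63 - 1, a length of 4096 and a limit of `INT64_MAX`) shows the first one accepting the request and the second one rejecting it.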

Fixes: a904b1ca ("xfs: Add support FALLOC_FL_INSERT_RANGE for fallocate")
Cc: <stable@vger.kernel.org> # v4.1+
Originally-From: Eric Biggers <ebiggers@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix signed integer addition overflow too]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

xfs: set format back to extents if xfs_bmap_extents_to_btree · 2c4306f7

Committed by Eric Sandeen on Apr 16, 2018

If xfs_bmap_extents_to_btree fails in a mode where we call
xfs_iroot_realloc(-1) to de-allocate the root, set the
format back to extents.

Otherwise we can assume we can dereference ifp->if_broot
based on the XFS_DINODE_FMT_BTREE format, and crash.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199423
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

xfs: enhance dinode verifier · b42db086

Committed by Eric Sandeen on Apr 16, 2018

Add several more validations to xfs_dinode_verify:

- For LOCAL data fork formats, di_nextents must be 0.
- For LOCAL attr fork formats, di_anextents must be 0.
- For inodes with no attr fork offset,
  - format must be XFS_DINODE_FMT_EXTENTS if set at all
  - di_anextents must be 0.
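
The listed rules can be read as a small predicate over the on-disk inode fields. The sketch below is a user-space approximation, not the body of xfs_dinode_verify: `struct demo_dinode`, the `DEMO_FMT_*` codes and `demo_dinode_forks_ok()` are hypothetical, and treating an attr format of 0 as "not set" is an assumption made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative format codes; the names echo XFS_DINODE_FMT_* but are local. */
enum demo_fmt {
	DEMO_FMT_UNSET = 0,	/* assumption: 0 means "no format set" */
	DEMO_FMT_LOCAL = 1,
	DEMO_FMT_EXTENTS = 2,
	DEMO_FMT_BTREE = 3,
};

/* Hypothetical subset of the on-disk inode fields the rules refer to. */
struct demo_dinode {
	uint8_t  format;	/* data fork format */
	uint8_t  aformat;	/* attr fork format */
	uint8_t  forkoff;	/* attr fork offset; 0 means no attr fork */
	uint32_t nextents;	/* data fork extent count */
	uint16_t anextents;	/* attr fork extent count */
};

/* Return true if the fork format and extent count fields are consistent. */
bool demo_dinode_forks_ok(const struct demo_dinode *dip)
{
	/* LOCAL data fork: no extents may be recorded. */
	if (dip->format == DEMO_FMT_LOCAL && dip->nextents != 0)
		return false;

	/* LOCAL attr fork: no attr extents may be recorded. */
	if (dip->aformat == DEMO_FMT_LOCAL && dip->anextents != 0)
		return false;

	/* No attr fork offset: format is unset or EXTENTS, and count is 0. */
	if (dip->forkoff == 0) {
		if (dip->aformat != DEMO_FMT_UNSET &&
		    dip->aformat != DEMO_FMT_EXTENTS)
			return false;
		if (dip->anextents != 0)
			return false;
	}
	return true;
}
```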

Thanks to dchinner for pointing out a couple of related checks I had
forgotten to add.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199377
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

12 Apr, 2018 1 commit

export __set_page_dirty · f82b3764

Committed by Matthew Wilcox on Apr 10, 2018

XFS currently contains a copy-and-paste of __set_page_dirty().  Export
it from buffer.c instead.

Link: http://lkml.kernel.org/r/20180313132639.17387-6-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

11 Apr, 2018 2 commits

Force log to disk before reading the AGF during a fstrim · 8c81dd46

Committed by Carlos Maiolino on Apr 10, 2018

Forcing the log to disk after reading the AGF is wrong: we might be
calling xfs_log_force with XFS_LOG_SYNC with a metadata lock held.

This can cause a deadlock when racing a fstrim with a filesystem
shutdown.

The deadlock was identified due to a miscalculation bug in device-mapper
dm-thin, which reports lack of space to its users earlier than the device itself
really runs out of space, changing the device-mapper volume into an error state.

The problem happened while filling the filesystem with a single file,
triggering the bug in device-mapper, consequently causing an IO error
and shutting down the filesystem.

If such a file is removed, and fstrim is executed before XFS finishes the
shutdown process, the fstrim process will end up holding the buffer
lock and going to sleep on the cil wait queue.

At this point, the shutdown process will try to wake up all the threads
waiting on the cil wait queue, but to do so it will try to take the
same buffer lock already held by the fstrim, locking up the filesystem.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Export __set_page_dirty · fbbb4509

Committed by Matthew Wilcox on Apr 10, 2018

XFS currently contains a copy-and-paste of __set_page_dirty().  Export
it from buffer.c instead.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

10 Apr, 2018 3 commits

xfs: only cancel cow blocks when truncating the data fork · 4919d42a

Committed by Darrick J. Wong on Apr 10, 2018

In xfs_itruncate_extents, only cancel cow blocks and clear the reflink
flag if we were asked to truncate the data fork.  Attr fork blocks
cannot be shared, so doing this for the attr fork makes no sense.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: non-scrub - remove unused function parameters · a1f69417

Committed by Eric Sandeen on Apr 06, 2018

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

xfs: remove filestream item xfs_inode reference · 7fcd3efa

Committed by Christoph Hellwig on Apr 09, 2018

The filestreams allocator stores an xfs_fstrm_item structure in the MRU to
cache inode number to agno mappings for a particular length of time.  Each
xfs_fstrm_item contains the internal MRU structure, an inode pointer and
agno value.

The inode pointer stored in the xfs_fstrm_item is not referenced, however,
which means the inode itself can be removed and reclaimed before the MRU
item is freed. If this occurs, xfs_fstrm_free_func() can access freed or
unrelated memory through xfs_fstrm_item->ip and crash.

The obvious solution is to grab an inode reference for xfs_fstrm_item.
The filestream mechanism only actually uses the inode pointer as a means
to access the xfs_mount, however.  Rather than add unnecessary
complexity, simplify the implementation to store an xfs_mount pointer in
struct xfs_mru_cache, and pass it to the free callback.  This also
requires updates to the tracepoint class to provide the associated data
via parameters rather than the inode and a minor hack to peek at the MRU
key to establish the inode number at free time.

Based on debugging work and an earlier patch from Brian Foster, who
also wrote most of this changelog.

Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

03 Apr, 2018 2 commits

xfs: fix intent use-after-free on abort · 0612d116

Committed by Dave Chinner on Apr 02, 2018

When an intent is aborted during its initial commit through
xfs_defer_trans_abort(), there is a use after free. The current
report is for a RUI through this path in generic/388:

 Freed by task 6274:
  __kasan_slab_free+0x136/0x180
  kmem_cache_free+0xe7/0x4b0
  xfs_trans_free_items+0x198/0x2e0
  __xfs_trans_commit+0x27f/0xcc0
  xfs_trans_roll+0x17b/0x2a0
  xfs_defer_trans_roll+0x6ad/0xe60
  xfs_defer_finish+0x2a6/0x2140
  xfs_alloc_file_space+0x53a/0xf90
  xfs_file_fallocate+0x5c6/0xac0
  vfs_fallocate+0x2f5/0x930
  ioctl_preallocate+0x1dc/0x320
  do_vfs_ioctl+0xfe4/0x1690

The problem is that the RUI has two active references - one in the
current transaction, and another held by the defer_ops structure
that is passed to the RUD (intent done) so that both the intent and
the intent done structures are freed on commit of the intent done.

Hence during abort, we need to release the intent item, because the
defer_ops reference is released separately via ->abort_intent
callback. Fix all the intent code to do this correctly.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

xfs: Remove "committed" argument of xfs_dir_ialloc · c959025e

Committed by Chandan Rajendra on Apr 02, 2018

xfs_dir_ialloc() rolls the current transaction when allocation of a new
inode requires the space manager to perform an allocation and replenish
the inode btree.

None of the callers of xfs_dir_ialloc() need to know if the
transaction was committed. Hence this commit removes the "committed"
argument of xfs_dir_ialloc.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

31 Mar, 2018 1 commit

xfs, dax: introduce xfs_dax_aops · 6e2608df

Committed by Dan Williams on Mar 07, 2018

In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings like the
following:

 WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
 xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]
 [..]
 CPU: 27 PID: 1783 Comm: dma-collision Tainted: G           O 4.15.0-rc2+ #984
 [..]
 Call Trace:
  set_page_dirty_lock+0x40/0x60
  bio_set_pages_dirty+0x37/0x50
  iomap_dio_actor+0x2b7/0x3b0
  ? iomap_dio_zero+0x110/0x110
  iomap_apply+0xa4/0x110
  iomap_dio_rw+0x29e/0x3b0
  ? iomap_dio_zero+0x110/0x110
  ? xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
  xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
  xfs_file_read_iter+0xa0/0xc0 [xfs]
  __vfs_read+0xf9/0x170
  vfs_read+0xa6/0x150
  SyS_pread64+0x93/0xb0
  entry_SYSCALL_64_fastpath+0x1f/0x96

...where the default set_page_dirty() handler assumes that dirty state
is being tracked in 'struct page' flags.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

30 Mar, 2018 1 commit

xfs: do not log/recover swapext extent owner changes for deleted inodes · dc1baa71

Committed by Eric Sandeen on Mar 28, 2018

Today if we run xfs_fsr and crash[1], log replay can fail because
the recovery code tries to instantiate the donor inode from
disk to replay the swapext, but it's been deleted and we get
verifier failures when we try to read the inode off disk with
i_mode == 0.

This fixes both sides: We don't log the swapext change if the
inode has been deleted, and we don't try to recover it either.

[1] or if systemd doesn't cleanly unmount root, as it is wont
    to do ...

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

26 Mar, 2018 2 commits

xfs: clean up xfs_mount allocation and dynamic initializers · 72c44e35

Committed by Brian Foster on Mar 23, 2018

Most of the generic data structures embedded in xfs_mount are
dynamically initialized immediately after mp is allocated. A few
fields are left out and initialized during the xfs_mountfs()
sequence, after mp has been attached to the superblock.

To clean this up and help prevent premature access of associated
fields, refactor xfs_mount allocation and all dependent init calls
into a new helper. This self-documents that all low level data
structures (i.e., locks, trees, etc.) should be initialized before
xfs_mount is attached to the superblock.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

treewide: Align function definition open/close braces · 447a5647

Committed by Joe Perches on Mar 21, 2018

Some function definitions have either the initial open brace and/or
the closing brace outside of column 1.

Move those braces to column 1.

This allows various function analyzers like gnu complexity to work
properly for these modified functions.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Nicolin Chen <nicoleotsuka@gmail.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>

24 Mar, 2018 17 commits

xfs: remove dead inode version setting code · fa4493f0

Committed by Dave Chinner on Mar 23, 2018

We can only get into the branch if CRCs are enabled, so there's no
need to check inside the branch for CRCs being enabled....

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

xfs: catch inode allocation state mismatch corruption · ee457001

Committed by Dave Chinner on Mar 23, 2018

We recently came across a V4 filesystem causing memory corruption
due to a newly allocated inode being setup twice and being added to
the superblock inode list twice. From code inspection, the only way
this could happen is if a newly allocated inode was not marked as
free on disk (i.e. di_mode wasn't zero).

Running the metadump on an upstream debug kernel fails during inode
allocation like so:

XFS: Assertion failed: ip->i_d.di_nblocks == 0, file: fs/xfs/xfs_inode.c, line: 838
 ------------[ cut here ]------------
kernel BUG at fs/xfs/xfs_message.c:114!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 11 PID: 3496 Comm: mkdir Not tainted 4.16.0-rc5-dgc #442
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
RIP: 0010:assfail+0x28/0x30
RSP: 0018:ffffc9000236fc80 EFLAGS: 00010202
RAX: 00000000ffffffea RBX: 0000000000004000 RCX: 0000000000000000
RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffff8227211b
RBP: ffffc9000236fce8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000bec R11: f000000000000000 R12: ffffc9000236fd30
R13: ffff8805c76bab80 R14: ffff8805c77ac800 R15: ffff88083fb12e10
FS:  00007fac8cbff040(0000) GS:ffff88083fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fffa6783ff8 CR3: 00000005c6e2b003 CR4: 00000000000606e0
Call Trace:
 xfs_ialloc+0x383/0x570
 xfs_dir_ialloc+0x6a/0x2a0
 xfs_create+0x412/0x670
 xfs_generic_create+0x1f7/0x2c0
 ? capable_wrt_inode_uidgid+0x3f/0x50
 vfs_mkdir+0xfb/0x1b0
 SyS_mkdir+0xcf/0xf0
 do_syscall_64+0x73/0x1a0
 entry_SYSCALL_64_after_hwframe+0x42/0xb7

Extracting the inode number we crashed on from an event trace and
looking at it with xfs_db:

xfs_db> inode 184452204
xfs_db> p
core.magic = 0x494e
core.mode = 0100644
core.version = 2
core.format = 2 (extents)
core.nlinkv2 = 1
core.onlink = 0
.....

Confirms that it is not a free inode on disk. xfs_repair
also trips over this inode:

.....
zero length extent (off = 0, fsbno = 0) in ino 184452204
correcting nextents for inode 184452204
bad attribute fork in inode 184452204, would clear attr fork
bad nblocks 1 for inode 184452204, would reset to 0
bad anextents 1 for inode 184452204, would reset to 0
imap claims in-use inode 184452204 is free, would correct imap
would have cleared inode 184452204
.....
disconnected inode 184452204, would move to lost+found

And so we have a situation where the directory structure and the
inobt think the inode is free, but the inode on disk thinks it is
still in use. Where this corruption came from is not possible to
diagnose, but we can detect it and prevent the kernel from oopsing
on lookup. The reproducer now results in:

$ sudo mkdir /mnt/scratch/{0,1,2,3,4,5}{0,1,2,3,4,5}
mkdir: cannot create directory ‘/mnt/scratch/00’: File exists
mkdir: cannot create directory ‘/mnt/scratch/01’: File exists
mkdir: cannot create directory ‘/mnt/scratch/03’: Structure needs cleaning
mkdir: cannot create directory ‘/mnt/scratch/04’: Input/output error
mkdir: cannot create directory ‘/mnt/scratch/05’: Input/output error
....

And this corruption shutdown:

[   54.843517] XFS (loop0): Corruption detected! Free inode 0xafe846c not marked free on disk
[   54.845885] XFS (loop0): Internal error xfs_trans_cancel at line 1023 of file fs/xfs/xfs_trans.c.  Caller xfs_create+0x425/0x670
[   54.848994] CPU: 10 PID: 3541 Comm: mkdir Not tainted 4.16.0-rc5-dgc #443
[   54.850753] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[   54.852859] Call Trace:
[   54.853531]  dump_stack+0x85/0xc5
[   54.854385]  xfs_trans_cancel+0x197/0x1c0
[   54.855421]  xfs_create+0x425/0x670
[   54.856314]  xfs_generic_create+0x1f7/0x2c0
[   54.857390]  ? capable_wrt_inode_uidgid+0x3f/0x50
[   54.858586]  vfs_mkdir+0xfb/0x1b0
[   54.859458]  SyS_mkdir+0xcf/0xf0
[   54.860254]  do_syscall_64+0x73/0x1a0
[   54.861193]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
[   54.862492] RIP: 0033:0x7fb73bddf547
[   54.863358] RSP: 002b:00007ffdaa553338 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[   54.865133] RAX: ffffffffffffffda RBX: 00007ffdaa55449a RCX: 00007fb73bddf547
[   54.866766] RDX: 0000000000000001 RSI: 00000000000001ff RDI: 00007ffdaa55449a
[   54.868432] RBP: 00007ffdaa55449a R08: 00000000000001ff R09: 00005623a8670dd0
[   54.870110] R10: 00007fb73be72d5b R11: 0000000000000246 R12: 00000000000001ff
[   54.871752] R13: 00007ffdaa5534b0 R14: 0000000000000000 R15: 00007ffdaa553500
[   54.873429] XFS (loop0): xfs_do_force_shutdown(0x8) called from line 1024 of file fs/xfs/xfs_trans.c.  Return address = ffffffff814cd050
[   54.882790] XFS (loop0): Corruption of in-memory data detected.  Shutting down filesystem
[   54.884597] XFS (loop0): Please umount the filesystem and rectify the problem(s)

Note that this crash is only possible on v4 filesystems or v5
filesystems mounted with the ikeep mount option. For all other V5
filesystems, this problem cannot occur because we don't read inodes
we are allocating from disk - we simply overwrite them with the new
inode information.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Tested-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

xfs: xfs_scrub_iallocbt_xref_rmap_inodes should use xref_set_corrupt · b83e4c3c

Committed by Darrick J. Wong on Mar 23, 2018

In xfs_scrub_iallocbt_xref_rmap_inodes we're checking inodes against
rmap records, so we should use xfs_scrub_btree_xref_set_corrupt if we
encounter discrepancies here so that we know that it's a cross
referencing error, not necessarily a corruption in the inobt itself.

The userspace xfs_scrub program will try to repair outright corruptions
in the agi/inobt prior to phase 3 so that the inode scan will proceed.
If only a cross-referencing error is noted, the repair program defers
the repair attempt until it can check the other space metadata at least
once.

It is therefore essential that the inobt scrubber can correctly
distinguish between corruptions and "unable to cross-reference something
else with this inobt".  The same reasoning applies to "xfs: record inode
buf errors as a xref error in inobt scrubber".

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>

xfs: flag inode corruption if parent ptr doesn't get us a real inode · 5927268f

Committed by Darrick J. Wong on Mar 23, 2018

If a directory's parent inode pointer doesn't point to an inode, the
directory should be flagged as corrupt. Enable IGET_UNTRUSTED here so
that _iget will return -EINVAL if the inobt does not confirm that the
inode is present and allocated and we can flag the directory corruption.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>

xfs: don't accept inode buffers with suspicious unlinked chains · 6a96c565

Committed by Darrick J. Wong on Mar 23, 2018

When we're verifying inode buffers, sanity-check the unlinked pointer.
We don't want to run the risk of trying to purge something that's
obviously broken.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>

xfs: move inode extent size hint validation to libxfs · 8bb82bc1

Committed by Darrick J. Wong on Mar 23, 2018

Extent size hint validation is used by scrub to decide if there's an
error, and it will be used by repair to decide to remove the hint.
Since these use the same validation functions, move them to libxfs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>

xfs: record inode buf errors as a xref error in inobt scrubber · 1b44a6ae

Committed by Darrick J. Wong on Mar 23, 2018

During the inode btree scrubs we try to confirm the freemask bits
against the inode records.  If the inode buffer read fails, this is a
cross-referencing error, not a corruption of the inode btree itself.
Use the xref_process_error call here.  Found via core.version middlebit
fuzz in xfs/415.

The userspace xfs_scrub program will try to repair outright corruptions
in the agi/inobt prior to phase 3 so that the inode scan will proceed.
If only a cross-referencing error is noted, the repair program defers
the repair attempt until it can check the other space metadata at least
once.

It is therefore essential that the inobt scrubber can correctly
distinguish between corruptions and "unable to cross-reference something
else with this inobt".

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>

xfs: remove xfs_buf parameter from inode scrub methods · 7e56d9ea

Committed by Darrick J. Wong on Mar 23, 2018

Now that we no longer do raw inode buffer scrubbing, the bp parameter is
no longer used anywhere we're dealing with an inode, so remove it and
all the useless NULL parameters that go with it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>

xfs: inode scrubber shouldn't bother with raw checks · d0018ad8

Committed by Darrick J. Wong on Mar 23, 2018

The inode scrubber tries to _iget the inode prior to running checks.
If that _iget call fails with corruption errors that's an automatic
fail, regardless of whether it was the inode buffer read verifier,
the ifork verifier, or the ifork formatter that errored out.

Therefore, get rid of the raw mode scrub code because it's not needed.
Found by trying to fix some test failures in xfs/379 and xfs/415.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>

xfs: bmap scrubber should do rmap xref with bmap for sparse files · 5e777b62

Committed by Darrick J. Wong on Mar 23, 2018

When we're scanning an extent mapping inode fork, ensure that every rmap
record for this ifork has a corresponding bmbt record too.  This
(mostly) provides the ability to cross-reference rmap records with bmap
data.  The rmap scrubber cannot do the xref on its own because that
requires taking an ilock with the agf lock held, which violates our
locking order rules (inode, then agf).

Note that we only do this for forks that are in btree format due to the
increased complexity; or forks that should have data but suspiciously
have zero extents because the inode could have just had its iforks
zapped by the inode repair code and now we need to reclaim the old
extents.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>

xfs: refactor inode buffer verifier error logging · 6edb1810

Committed by Darrick J. Wong on Mar 23, 2018

When the inode buffer verifier encounters an error, it's much more
helpful to print a buffer from the offending inode instead of just the
start of the inode chunk buffer.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>

xfs: refactor inode verifier error logging · 90a58f95

由 Darrick J. Wong 提交于 3月 23, 2018

Refactor some of the inode verifier failure logging call sites to use
the new xfs_inode_verifier_error method which dumps the offending buffer
as well as the code location of the failed check.  This trims the
output, makes it clearer to the admin that repair must be run, and gives
the developers more details to work from.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>

90a58f95

xfs: refactor bmap record validation · 30b0984d

Committed by Darrick J. Wong on Mar 23, 2018

Refactor the bmap validator into a more complete helper that looks for
extents that run off the end of the device, overflow into the next AG,
or have invalid flag states.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
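
The three classes of check named in this commit message can be illustrated with a small standalone sketch. It is not the kernel helper: it assumes a flat block address space in which allocation groups sit back to back, and every name in it (demo_geom, demo_extent, demo_validate_extent) is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified geometry: AGs are laid out back to back. */
struct demo_geom {
	uint64_t total_blocks;	/* blocks on the device */
	uint64_t ag_blocks;	/* blocks per allocation group */
};

enum demo_state { DEMO_NORM = 0, DEMO_UNWRITTEN = 1 };

struct demo_extent {
	uint64_t start;		/* first block of the extent */
	uint64_t len;		/* length in blocks */
	int state;		/* one of enum demo_state */
};

/* Return NULL if the record looks sane, else a short reason string. */
static const char *
demo_validate_extent(const struct demo_geom *g, const struct demo_extent *e,
		     bool unwritten_allowed)
{
	if (g->ag_blocks == 0)
		return "bad geometry";
	if (e->len == 0)
		return "zero-length extent";
	/* Runs off the end of the device (written to avoid overflow). */
	if (e->start >= g->total_blocks || e->len > g->total_blocks - e->start)
		return "extent runs off the end of the device";
	/* First and last block must fall in the same AG. */
	if (e->start / g->ag_blocks != (e->start + e->len - 1) / g->ag_blocks)
		return "extent overflows into the next AG";
	/* Flag state must be known and permitted for this fork. */
	if (e->state != DEMO_NORM && e->state != DEMO_UNWRITTEN)
		return "unknown extent state";
	if (e->state == DEMO_UNWRITTEN && !unwritten_allowed)
		return "unwritten extent not permitted here";
	return NULL;
}
```

Returning a reason rather than a bare boolean is a choice made for this sketch so that a caller can log which check failed.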

30b0984d

xfs: sanity-check the unused space before trying to use it · 6915ef35

Committed by Darrick J. Wong on Mar 23, 2018

In xfs_dir2_data_use_free, we examine on-disk metadata and ASSERT if
it doesn't make sense.  Since a carefully crafted fuzzed image can cause
the kernel to crash after blowing a bunch of assertions, let's move
those checks into a validator function and rig everything up to return
EFSCORRUPTED to userspace.  Found by lastbit fuzzing ltail.bestcount via
xfs/391.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
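
The pattern described here, replacing assertions on on-disk values with a validator that returns an error, might look roughly like the following. This is only a sketch: the structure, the function name, and the particular checks are hypothetical, and DEMO_EFSCORRUPTED is a placeholder for the kernel's EFSCORRUPTED code.

```c
#include <stdint.h>

/* Assumption: placeholder value standing in for EFSCORRUPTED. */
#define DEMO_EFSCORRUPTED 117

/* Hypothetical description of one unused region in a directory block. */
struct demo_unused {
	uint32_t offset;	/* byte offset of the free region in the block */
	uint32_t length;	/* length of the free region in bytes */
};

/*
 * Validate a request to claim [use_off, use_off + use_len) out of a free
 * region. Returns 0 if sane, or a negative error instead of asserting.
 */
static int
demo_check_use_free(const struct demo_unused *dup, uint32_t block_size,
		    uint32_t use_off, uint32_t use_len)
{
	/* The free region itself must lie inside the block. */
	if (dup->length == 0 || dup->offset >= block_size ||
	    dup->length > block_size - dup->offset)
		return -DEMO_EFSCORRUPTED;
	/* The range being claimed must lie inside the free region. */
	if (use_len == 0 || use_off < dup->offset ||
	    use_off - dup->offset >= dup->length ||
	    use_len > dup->length - (use_off - dup->offset))
		return -DEMO_EFSCORRUPTED;
	return 0;
}
```

A caller would run the validator before touching the region and propagate the error upward, so a fuzzed image produces a failed operation rather than a crash.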

6915ef35

xfs: detect agfl count corruption and reset agfl · a27ba260

Committed by Brian Foster on Mar 15, 2018

The struct xfs_agfl v5 header was originally introduced with
unexpected padding that caused the AGFL to operate with one less
slot than intended. The header has since been packed, but the fix
left an incompatibility for users who upgrade from an old kernel
with the unpacked header to a newer kernel with the packed header
while the AGFL happens to wrap around the end. The newer kernel
recognizes one extra slot at the physical end of the AGFL that the
previous kernel did not. The new kernel will eventually attempt to
allocate a block from that slot, which contains invalid data, and
cause a crash.

This condition can be detected by comparing the active range of the
AGFL to the count. While this detects a padding mismatch, it can
also trigger false positives for unrelated flcount corruption. Since
we cannot distinguish a size mismatch due to padding from unrelated
corruption, we can't trust the AGFL enough to simply repopulate the
empty slot.
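
As an illustration of the comparison described above, the following sketch computes the active range of a circular free list from its first and last indexes and checks it against the recorded count. The function name and parameters are hypothetical; this is not the kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if the recorded count disagrees with the active range
 * implied by the first/last indexes for a list of agfl_size slots.
 */
static bool
demo_agfl_needs_reset(uint32_t first, uint32_t last, uint32_t count,
		      uint32_t agfl_size)
{
	uint32_t active;

	/* Indexes or a count beyond the list size are never valid. */
	if (first >= agfl_size || last >= agfl_size || count > agfl_size)
		return true;

	if (count == 0)
		active = 0;
	else if (last >= first)
		active = last - first + 1;		/* contiguous range */
	else
		active = agfl_size - first + last + 1;	/* wrapped range */

	return active != count;
}
```

For example, a wrapped list with first = 7, last = 1 and count = 4 is consistent when the list has 9 slots, but the same values imply 5 active entries once the list is treated as having 10 slots, which is the size mismatch this commit is concerned with.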

Instead, avoid unnecessarily complex detection logic and use a
solution that can handle any form of flcount corruption that slips
through read verifiers: distrust the entire AGFL and reset it to an
empty state. Any valid blocks within the AGFL are intentionally
leaked. This requires xfs_repair to rectify (which was already
necessary based on the state the AGFL was found in). The reset
mitigates the side effect of the padding mismatch problem from a
filesystem crash to a free space accounting inconsistency. The
generic approach also means that this patch can be safely backported
to kernels with or without a packed struct xfs_agfl.

Check the AGF for an invalid freelist count on initial read from
disk. If detected, set a flag on the xfs_perag to indicate that a
reset is required before the AGFL can be used. In the first
transaction that attempts to use a flagged AGFL, reset it to empty,
warn the user about the inconsistency and allow the freelist fixup
code to repopulate the AGFL with new blocks. The xfs_perag flag is
cleared to eliminate the need for repeated checks on each block
allocation operation.
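
The flag-then-reset flow in the paragraph above can be sketched as follows. The structure and function are hypothetical stand-ins for the per-AG state, and the convention used here for an empty list (first = 0, last = size - 1, count = 0) is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical in-memory per-AG state. */
struct demo_perag {
	bool agfl_reset;	/* set at AGF read time if the count looks wrong */
	uint32_t flfirst;
	uint32_t fllast;
	uint32_t flcount;
};

/*
 * Called before a transaction uses the free list. Resets a flagged list
 * to empty exactly once; returns true if a reset was performed.
 */
static bool
demo_agfl_reset_if_needed(struct demo_perag *pag, uint32_t agfl_size)
{
	if (!pag->agfl_reset)
		return false;	/* common case: nothing to do */

	fprintf(stderr,
		"warning: free list reset, blocks may be leaked; run xfs_repair\n");

	/* Forget every entry; old blocks are intentionally leaked. */
	pag->flfirst = 0;
	pag->fllast = agfl_size - 1;
	pag->flcount = 0;

	/* Clear the flag so later allocations skip this path. */
	pag->agfl_reset = false;
	return true;
}
```

After the reset, the ordinary freelist fixup would see an empty list and refill it with newly allocated blocks.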

This allows kernels that include the packing fix commit 96f859d5
("libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct")
to handle older unpacked AGFL formats without a filesystem crash.
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chiluk <chiluk+linuxxfs@indeed.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

a27ba260

xfs: unwind the try_again loop in xfs_log_force · 3e4da466

Committed by Christoph Hellwig on Mar 13, 2018

Instead split out a __xfs_log_force_lsn helper that gets called again
with the already_slept flag set to true in case we had to sleep.

This prepares for aio_fsync support.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
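
The shape of this change, a retry loop replaced by a helper that is called at most twice, can be sketched in isolation. Everything below is hypothetical (the demo_log structure and both function names), and the "sleep" is only simulated by a counter.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical state standing in for the log. */
struct demo_log {
	int busy_rounds;	/* times the target buffer is still busy */
	int sleeps;		/* times we had to wait */
};

/*
 * One attempt at forcing the log. If we must wait and have not waited
 * before, wait and ask the caller to try again with already_slept set.
 */
static int
demo_log_force_once(struct demo_log *log, bool already_slept)
{
	if (log->busy_rounds > 0 && !already_slept) {
		log->busy_rounds--;
		log->sleeps++;		/* stand-in for sleeping */
		return -EAGAIN;
	}
	return 0;
}

/* Wrapper: a second call with already_slept = true replaces the loop. */
static int
demo_log_force(struct demo_log *log)
{
	int ret = demo_log_force_once(log, false);

	if (ret == -EAGAIN)
		ret = demo_log_force_once(log, true);
	return ret;
}
```

Because the second call passes already_slept = true, the helper cannot ask for another retry, so the wrapper is bounded without needing a backward goto.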

3e4da466

xfs: refactor xfs_log_force_lsn · 93806299

Committed by Christoph Hellwig on Mar 13, 2018

Use the smallest possible loop as a preamble to find the correct iclog
buffer, and then use gotos for unwinding to straighten the code.

Also fix the top of function comment while we're at it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

93806299
