1. 29 Jun 2019 (4 commits)
  2. 13 Jun 2019 (1 commit)
  3. 12 Jun 2019 (8 commits)
  4. 10 Jun 2019 (1 commit)
  5. 04 Jun 2019 (1 commit)
  6. 24 May 2019 (1 commit)
  7. 21 May 2019 (2 commits)
  8. 18 May 2019 (1 commit)
  9. 02 May 2019 (1 commit)
  10. 30 Apr 2019 (4 commits)
  11. 27 Apr 2019 (6 commits)
  12. 23 Apr 2019 (7 commits)
  13. 17 Apr 2019 (3 commits)
    • xfs: merge adjacent io completions of the same type · 3994fc48
      Darrick J. Wong committed
      It's possible for pagecache writeback to split up a large amount of work
      into smaller pieces for throttling purposes or to reduce the amount of
      time a writeback operation is pending.  Whatever the reason, XFS can end
      up with a bunch of IO completions that call for the same operation to be
      performed on a contiguous extent mapping.  Since mappings are extent
      based in XFS, we'd prefer to run fewer transactions when we can.
      
      When we're processing an ioend on the list of io completions, check
      whether the next items on the list are both adjacent and of the same
      type.  If so, we can merge the completions to reduce transaction
      overhead (a rough sketch of this idea follows the commit list below).
      
      On fast storage this doesn't seem to make much of a difference in
      performance, though the number of transactions for an overnight xfstests
      run seems to drop by ~5%.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
    • xfs: remove unused m_data_workqueue · 28408243
      Darrick J. Wong committed
      Now that we're no longer using m_data_workqueue, remove it.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
    • xfs: implement per-inode writeback completion queues · cb357bf3
      Darrick J. Wong committed
      When scheduling writeback of dirty file data in the page cache, XFS uses
      IO completion workqueue items to ensure that filesystem metadata only
      updates after the write completes successfully.  This is essential for
      converting unwritten extents to real extents at the right time and
      performing COW remappings.
      
      Unfortunately, XFS queues each IO completion work item to an unbounded
      workqueue, which means that the kernel can spawn dozens of threads to
      try to handle the items quickly.  These threads need to take the ILOCK
      to update file metadata, which results in heavy ILOCK contention if a
      large number of the work items target a single file, which is
      inefficient.
      
      Worse yet, the writeback completion threads get stuck waiting for the
      ILOCK while holding transaction reservations, which can use up all
      available log reservation space.  When that happens, metadata updates to
      other parts of the filesystem grind to a halt, even if the filesystem
      could otherwise have handled it.
      
      Even worse, if one of the things grinding to a halt happens to be a
      thread in the middle of a defer-ops finish holding the same ILOCK and
      trying to obtain more log reservation having exhausted the permanent
      reservation, we now have an ABBA deadlock - writeback completion has a
      transaction reserved and wants the ILOCK, and someone else has the ILOCK
      and wants a transaction reservation.
      
      Therefore, we create a per-inode writeback io completion queue + work
      item.  When writeback finishes, it can add the ioend to the per-inode
      queue and let the single work item process that queue.  This
      dramatically cuts down on the number of kworkers and ILOCK contention in
      the system, and seems to have eliminated an occasional deadlock I was
      seeing while running generic/476.  (A rough sketch of this queueing
      pattern follows the commit list below.)
      
      Testing with a program that simulates a heavy random-write workload to a
      single file demonstrates that the number of kworkers drops from
      approximately 120 threads per file to 1, without dramatically changing
      write bandwidth or pagecache access latency.
      
      Note that we leave the xfs-conv workqueue's max_active alone because we
      still want to be able to run ioend processing for as many inodes as the
      system can handle.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
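
The merging logic in 3994fc48 boils down to: while the next completion on the list starts exactly where the current one ends and calls for the same kind of work, fold it into the current one so a single transaction can cover the whole range. Below is a minimal userspace model of that idea; the types and names here (struct ioend, ioend_can_merge, ioend_try_merge) are invented for the sketch and are not the actual symbols added by the patch.

/*
 * Userspace model of "merge adjacent io completions of the same type".
 * Types and names are illustrative only, not the real XFS code.
 * Build: cc -o merge merge.c
 */
#include <stdbool.h>
#include <stdio.h>

enum ioend_type { IOEND_UNWRITTEN, IOEND_COW, IOEND_PLAIN };

struct ioend {
        enum ioend_type type;           /* kind of post-IO work needed */
        unsigned long long offset;      /* byte offset of completed range */
        unsigned long long size;        /* length of completed range */
        struct ioend *next;             /* singly linked completion list */
};

/* Two completions can merge if they touch and need the same work. */
static bool ioend_can_merge(const struct ioend *a, const struct ioend *b)
{
        return a->type == b->type && a->offset + a->size == b->offset;
}

/* Fold mergeable successors into @head so one transaction covers them. */
static void ioend_try_merge(struct ioend *head)
{
        while (head->next && ioend_can_merge(head, head->next)) {
                struct ioend *victim = head->next;

                head->size += victim->size;
                head->next = victim->next;
                /* the real code would complete and free @victim here */
        }
}

int main(void)
{
        struct ioend c = { IOEND_UNWRITTEN, 8192, 4096, NULL };
        struct ioend b = { IOEND_UNWRITTEN, 4096, 4096, &c };
        struct ioend a = { IOEND_UNWRITTEN, 0,    4096, &b };

        ioend_try_merge(&a);
        printf("merged extent: offset=%llu size=%llu\n", a.offset, a.size);
        return 0;
}

Run it and the three adjacent 4 KiB completions collapse into one 12 KiB range, which is the effect the patch aims for on the real ioend list.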
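The per-inode queue in cb357bf3 follows a familiar pattern: completions are added to a list hanging off the inode under a lock, and at most one worker per inode is scheduled to drain that list, so the inode lock is only taken by that single worker rather than by dozens of kworkers. The pthread sketch below models the pattern in userspace under those assumptions; fake_inode, queue_completion and completion_worker are illustrative names, not the kernel's.

/*
 * Userspace model of a per-inode completion queue with a single worker.
 * Names are invented for the sketch; they are not kernel symbols.
 * Build: cc -pthread -o queue queue.c
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct completion_item {
        unsigned long long offset;      /* completed range (order ignored) */
        struct completion_item *next;
};

struct fake_inode {
        pthread_mutex_t lock;           /* protects list and flag */
        struct completion_item *pending;/* per-inode completion list */
        bool worker_scheduled;          /* at most one worker per inode */
};

/* The single worker drains everything queued on this inode in one pass. */
static void *completion_worker(void *arg)
{
        struct fake_inode *ip = arg;
        struct completion_item *list;

        pthread_mutex_lock(&ip->lock);
        list = ip->pending;
        ip->pending = NULL;
        ip->worker_scheduled = false;
        pthread_mutex_unlock(&ip->lock);

        while (list) {
                struct completion_item *item = list;

                list = list->next;
                /* the real code would take the inode lock once here and
                 * run the metadata update for this completed range */
                printf("completing range at %llu\n", item->offset);
                free(item);
        }
        return NULL;
}

/* Called at "IO completion": queue the item and schedule a worker only
 * if one is not already pending for this inode. */
static void queue_completion(struct fake_inode *ip, unsigned long long off)
{
        struct completion_item *item = malloc(sizeof(*item));
        bool need_worker;

        if (!item)
                return;
        item->offset = off;

        pthread_mutex_lock(&ip->lock);
        item->next = ip->pending;
        ip->pending = item;
        need_worker = !ip->worker_scheduled;
        ip->worker_scheduled = true;
        pthread_mutex_unlock(&ip->lock);

        if (need_worker) {
                pthread_t t;

                pthread_create(&t, NULL, completion_worker, ip);
                pthread_detach(&t);
        }
}

int main(void)
{
        struct fake_inode ip = { PTHREAD_MUTEX_INITIALIZER, NULL, false };

        for (unsigned long long off = 0; off < 5; off++)
                queue_completion(&ip, off * 4096);

        sleep(1);       /* crude: let the detached worker finish */
        return 0;
}

Clearing worker_scheduled under the same lock that protects the list is what guarantees no completion is ever queued without a worker left to drain it, which is the property the per-inode queue relies on.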