- 08 4月, 2021 4 次提交
-
-
由 Christoph Hellwig 提交于
In preparation of removing the historic icinode struct, move the flags2 field into the containing xfs_inode structure. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Christoph Hellwig 提交于
In preparation of removing the historic icinode struct, move the flags field into the containing xfs_inode structure. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Christoph Hellwig 提交于
In preparation of removing the historic icinode struct, move the cowextsize field into the containing xfs_inode structure. Also switch to use the xfs_extlen_t instead of a uint32_t. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Christoph Hellwig 提交于
In preparation of removing the historic icinode struct, move the on-disk size field into the containing xfs_inode structure. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
-
- 04 2月, 2021 4 次提交
-
-
由 Darrick J. Wong 提交于
In anticipation of more restructuring of the eof/cowblocks gc code, refactor calling of those two functions into a single internal helper function, then present a new standard interface to purge speculative block preallocations and start shifting higher level code to use that. Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NBrian Foster <bfoster@redhat.com>
-
由 Darrick J. Wong 提交于
Change the signature of xfs_blockgc_free_quota in preparation for the next few patches. Callers can now pass EOF_FLAGS into the function to control scan parameters; and the function will now pass back any corruption errors seen while scanning, though for our retry loops we'll just try again unconditionally. Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NBrian Foster <bfoster@redhat.com>
-
由 Darrick J. Wong 提交于
Move this function further down in the file so that later cleanups won't have to declare static functions. Change the name because we're about to rework all the code that performs garbage collection of speculatively allocated file blocks. No functional changes. Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NBrian Foster <bfoster@redhat.com>
-
由 Darrick J. Wong 提交于
The functions to run an eof/cowblocks scan to try to reduce quota usage are kind of a mess -- the logic repeatedly initializes an eofb structure and there are logic bugs in the code that result in the cowblocks scan never actually happening. Replace all three functions with a single function that fills out an eofb and runs both eof and cowblocks scans. Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NBrian Foster <bfoster@redhat.com>
-
- 02 2月, 2021 8 次提交
-
-
由 Dave Chinner 提交于
Attempt shared locking for unaligned DIO, but only if the the underlying extent is already allocated and in written state. On failure, retry with the existing exclusive locking. Test case is fio randrw of 512 byte IOs using AIO and an iodepth of 32 IOs. Vanilla: READ: bw=4560KiB/s (4670kB/s), 4560KiB/s-4560KiB/s (4670kB/s-4670kB/s), io=134MiB (140MB), run=30001-30001msec WRITE: bw=4567KiB/s (4676kB/s), 4567KiB/s-4567KiB/s (4676kB/s-4676kB/s), io=134MiB (140MB), run=30001-30001msec Patched: READ: bw=37.6MiB/s (39.4MB/s), 37.6MiB/s-37.6MiB/s (39.4MB/s-39.4MB/s), io=1127MiB (1182MB), run=30002-30002msec WRITE: bw=37.6MiB/s (39.4MB/s), 37.6MiB/s-37.6MiB/s (39.4MB/s-39.4MB/s), io=1128MiB (1183MB), run=30002-30002msec That's an improvement from ~18k IOPS to a ~150k IOPS, which is about the IOPS limit of the VM block device setup I'm testing on. 4kB block IO comparison: READ: bw=296MiB/s (310MB/s), 296MiB/s-296MiB/s (310MB/s-310MB/s), io=8868MiB (9299MB), run=30002-30002msec WRITE: bw=296MiB/s (310MB/s), 296MiB/s-296MiB/s (310MB/s-310MB/s), io=8878MiB (9309MB), run=30002-30002msec Which is ~150k IOPS, same as what the test gets for sub-block AIO+DIO writes with this patch. Signed-off-by: NDave Chinner <dchinner@redhat.com> [hch: rebased, split unaligned from nowait] Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Dave Chinner 提交于
The unaligned DIO write path is more convolted than the normal path, and we are about to make it more complex. Keep the block aligned fast path dio write code trim and simple by splitting out the unaligned DIO code from it. Signed-off-by: NDave Chinner <dchinner@redhat.com> [hch: rebased, fixed a few minor nits] Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Christoph Hellwig 提交于
Use a more suitable event class. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Christoph Hellwig 提交于
Pass the iocb and iov_iter to the tracepoints and leave decoding of actual arguments to the code only run when tracing is enabled. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Christoph Hellwig 提交于
The iomap code has been designed from the start not to do magic fallback, so remove the assert in preparation for further code cleanups. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Christoph Hellwig 提交于
Drop a few pointless aio_ prefixes. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Christoph Hellwig 提交于
Ensure we don't block on the iolock, or waiting for I/O in xfs_file_aio_write_checks if the caller asked to avoid that. Fixes: 29a5d29e ("xfs: nowait aio support") Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Christoph Hellwig 提交于
Add a helper to factor out the nowait locking logical for the read/write helpers. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
-
- 24 1月, 2021 2 次提交
-
-
由 Christoph Hellwig 提交于
Enable idmapped mounts for xfs. This basically just means passing down the user_namespace argument from the VFS methods down to where it is passed to the relevant helpers. Note that full-filesystem bulkstat is not supported from inside idmapped mounts as it is an administrative operation that acts on the whole file system. The limitation is not applied to the bulkstat single operation that just operates on a single inode. Link: https://lore.kernel.org/r/20210121131959.646623-40-christian.brauner@ubuntu.comSigned-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
-
由 Christoph Hellwig 提交于
Pass a set of flags to iomap_dio_rw instead of the boolean wait_for_completion argument. The IOMAP_DIO_FORCE_WAIT flag replaces the wait_for_completion, but only needs to be passed when the iocb isn't synchronous to start with to simplify the callers. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> [djwong: rework xfs_file.c so that we can push iomap changes separately] Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
-
- 23 1月, 2021 3 次提交
-
-
由 Christoph Hellwig 提交于
If the inode is not pinned by the time fsync is called we don't need the ilock to protect against concurrent clearing of ili_fsync_fields as the inode won't need a log flush or clearing of these fields. Not taking the iolock allows for full concurrency of fsync and thus O_DSYNC completions with io_uring/aio write submissions. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NDave Chinner <dchinner@redhat.com>
-
由 Christoph Hellwig 提交于
Factor out the log syncing logic into two helpers to make the code easier to read and more maintainable. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NDave Chinner <dchinner@redhat.com>
-
由 Eric Biggers 提交于
The comment in xfs_file_aio_write_checks() about calling file_modified() after dropping the ilock doesn't make sense, because the code that unconditionally acquires and drops the ilock was removed by commit 467f7899 ("xfs: reduce ilock hold times in xfs_file_aio_write_checks"). Remove this outdated comment. Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NEric Biggers <ebiggers@google.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
- 20 1月, 2021 1 次提交
-
-
由 Kirill A. Shutemov 提交于
alloc_set_pte() has two users with different requirements: in the faultaround code, it called from an atomic context and PTE page table has to be preallocated. finish_fault() can sleep and allocate page table as needed. PTL locking rules are also strange, hard to follow and overkill for finish_fault(). Let's untangle the mess. alloc_set_pte() has gone now. All locking is explicit. The price is some code duplication to handle huge pages in faultaround path, but it should be fine, having overall improvement in readability. Link: https://lore.kernel.org/r/20201229132819.najtavneutnf7ajp@boxSigned-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> [will: s/from from/from/ in comment; spotted by willy] Signed-off-by: NWill Deacon <will@kernel.org>
-
- 22 10月, 2020 1 次提交
-
-
由 Darrick J. Wong 提交于
In commit fe341eb1, I forgot that xfs_free_file_space isn't strictly a "remove mapped blocks" function. It is actually a function to zero file space by punching out the middle and writing zeroes to the unaligned ends of the specified range. Therefore, putting a rtextsize alignment check in that function is wrong because that breaks unaligned ZERO_RANGE on the realtime volume. Furthermore, xfs_file_fallocate already has alignment checks for the functions require the file range to be aligned to the size of a fundamental allocation unit (which is 1 FSB on the data volume and 1 rt extent on the realtime volume). Create a new helper to check fallocate arguments against the realtiem allocation unit size, fix the fallocate frontend to use it, fix free_file_space to delete the correct range, and remove a now redundant check from insert_file_space. NOTE: The realtime extent size is not required to be a power of two! Fixes: fe341eb1 ("xfs: ensure that fpunch, fcollapse, and finsert operations are aligned to rt extent size") Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NChandan Babu R <chandanrlinux@gmail.com>
-
- 16 9月, 2020 1 次提交
-
-
由 Darrick J. Wong 提交于
Commit 5833112d tried to make it so that a remap operation would force the log out to disk if the filesystem is mounted with mandatory synchronous writes. Unfortunately, that commit failed to handle the case where the inode or the file descriptor require mandatory synchronous writes. Refactor the check into into a helper that will look for all three conditions, and now we can treat reflink just like any other synchronous write. Fixes: 5833112d ("xfs: reflink should force the log out if mounted with wsync") Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de>
-
- 06 9月, 2020 1 次提交
-
-
由 Mikulas Patocka 提交于
When running in a dax mode, if the user maps a page with MAP_PRIVATE and PROT_WRITE, the xfs filesystem would incorrectly update ctime and mtime when the user hits a COW fault. This breaks building of the Linux kernel. How to reproduce: 1. extract the Linux kernel tree on dax-mounted xfs filesystem 2. run make clean 3. run make -j12 4. run make -j12 at step 4, make would incorrectly rebuild the whole kernel (although it was already built in step 3). The reason for the breakage is that almost all object files depend on objtool. When we run objtool, it takes COW page fault on its .data section, and these faults will incorrectly update the timestamp of the objtool binary. The updated timestamp causes make to rebuild the whole tree. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 06 8月, 2020 2 次提交
-
-
由 Christoph Hellwig 提交于
Failing to invalid the page cache means data in incoherent, which is a very bad state for the system. Always fall back to buffered I/O through the page cache if we can't invalidate mappings. Signed-off-by: NChristoph Hellwig <hch@lst.de> Acked-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NGoldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Acked-by: NBob Peterson <rpeterso@redhat.com> Acked-by: NDamien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Theodore Ts'o <tytso@mit.edu> # for ext4 Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com> # for gfs2 Reviewed-by: NRitesh Harjani <riteshh@linux.ibm.com>
-
由 Christoph Hellwig 提交于
This is what the classic fs/direct-io.c implementation and thuse other file systems use. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
- 07 7月, 2020 5 次提交
-
-
由 Dave Chinner 提交于
The inode log item is kind of special in that it can be aggregating new changes in memory at the same time time existing changes are being written back to disk. This means there are fields in the log item that are accessed concurrently from contexts that don't share any locking at all. e.g. updating ili_last_fields occurs at flush time under the ILOCK_EXCL and flush lock at flush time, under the flush lock at IO completion time, and is read under the ILOCK_EXCL when the inode is logged. Hence there is no actual serialisation between reading the field during logging of the inode in transactions vs clearing the field in IO completion. We currently get away with this by the fact that we are only clearing fields in IO completion, and nothing bad happens if we accidentally log more of the inode than we actually modify. Worst case is we consume a tiny bit more memory and log bandwidth. However, if we want to do more complex state manipulations on the log item that requires updates at all three of these potential locations, we need to have some mechanism of serialising those operations. To do this, introduce a spinlock into the log item to serialise internal state. This could be done via the xfs_inode i_flags_lock, but this then leads to potential lock inversion issues where inode flag updates need to occur inside locks that best nest inside the inode log item locks (e.g. marking inodes stale during inode cluster freeing). Using a separate spinlock avoids these sorts of problems and simplifies future code. This does not touch the use of ili_fields in the item formatting code - that is entirely protected by the ILOCK_EXCL at this point in time, so it remains untouched. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Dave Chinner 提交于
The page faultround path ->map_pages is implemented in XFS via filemap_map_pages(). This function checks that pages found in page cache lookups have not raced with truncate based invalidation by checking page->mapping is correct and page->index is within EOF. However, we've known for a long time that this is not sufficient to protect against races with invalidations done by operations that do not change EOF. e.g. hole punching and other fallocate() based direct extent manipulations. The way we protect against these races is we wrap the page fault operations in a XFS_MMAPLOCK_SHARED lock so they serialise against fallocate and truncate before calling into the filemap function that processes the fault. Do the same for XFS's ->map_pages implementation to close this potential data corruption issue. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NAmir Goldstein <amir73il@gmail.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Darrick J. Wong 提交于
Move the double-inode locking helpers to xfs_inode.c since they're not specific to reflink. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NBrian Foster <bfoster@redhat.com>
-
由 Darrick J. Wong 提交于
Refactor the two functions that we use to lock and unlock two inodes to block userspace from initiating IO against a file, whether via system calls or mmap activity. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NBrian Foster <bfoster@redhat.com>
-
由 Darrick J. Wong 提交于
Fix the return value of xfs_reflink_remap_prep so that its return value conventions match the rest of xfs. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NBrian Foster <bfoster@redhat.com>
-
- 22 6月, 2020 1 次提交
-
-
由 Jens Axboe 提交于
XFS uses generic_file_read_iter(), which already supports this. Acked-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 10 6月, 2020 1 次提交
-
-
由 Michel Lespinasse 提交于
Convert comments that reference mmap_sem to reference mmap_lock instead. [akpm@linux-foundation.org: fix up linux-next leftovers] [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil] [akpm@linux-foundation.org: more linux-next fixups, per Michel] Signed-off-by: NMichel Lespinasse <walken@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NVlastimil Babka <vbabka@suse.cz> Reviewed-by: NDaniel Jordan <daniel.m.jordan@oracle.com> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 20 5月, 2020 1 次提交
-
-
由 Christoph Hellwig 提交于
There are there are three extents counters per inode, one for each of the forks. Two are in the legacy icdinode and one is directly in struct xfs_inode. Switch to a single counter in the xfs_ifork structure where it uses up padding at the end of the structure. This simplifies various bits of code that just wants the number of extents counter and can now directly dereference it. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NChandan Babu R <chandanrlinux@gmail.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
- 06 4月, 2020 2 次提交
-
-
由 Christoph Hellwig 提交于
Reflink should force the log out to disk if the filesystem was mounted with wsync, the same as most other operations in xfs. [Note: XFS_MOUNT_WSYNC is set when the admin mounts the filesystem with either the 'wsync' or 'sync' mount options, which effectively means that we're classifying reflink/dedupe as IO operations and making them synchronous when required.] Fixes: 3fc9f5e4 ("xfs: remove xfs_reflink_remap_range") Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NBrian Foster <bfoster@redhat.com> [darrick: add more to the changelog] Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Christoph Hellwig 提交于
Create a new helper to force the log up to the last LSN touching an inode. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
- 16 1月, 2020 1 次提交
-
-
由 Christoph Hellwig 提交于
Direct I/O reads can also be used with RWF_NOWAIT & co. Fix the inode locking in xfs_file_dio_aio_read to take IOCB_NOWAIT into account. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
- 23 11月, 2019 1 次提交
-
-
由 Christoph Hellwig 提交于
Replace the mappedbno argument with the simple flags for xfs_da_reada_buf and xfs_dir3_data_readahead. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
- 01 11月, 2019 1 次提交
-
-
由 Dave Chinner 提交于
AIO+DIO can extend the file size on IO completion, and it holds no inode locks while the IO is in flight. Therefore, a race condition exists in file size updates if we do something like this: aio-thread fallocate-thread lock inode submit IO beyond inode->i_size unlock inode ..... lock inode break layouts if (off + len > inode->i_size) new_size = off + len ..... inode_dio_wait() <blocks> ..... completes inode->i_size updated inode_dio_done() .... <wakes> <does stuff no long beyond EOF> if (new_size) xfs_vn_setattr(inode, new_size) Yup, that attempt to extend the file size in the fallocate code turns into a truncate - it removes the whatever the aio write allocated and put to disk, and reduced the inode size back down to where the fallocate operation ends. Fundamentally, xfs_file_fallocate() not compatible with racing AIO+DIO completions, so we need to move the inode_dio_wait() call up to where the lock the inode and break the layouts. Secondly, storing the inode size and then using it unchecked without holding the ILOCK is not safe; we can only do such a thing if we've locked out and drained all IO and other modification operations, which we don't do initially in xfs_file_fallocate. It should be noted that some of the fallocate operations are compound operations - they are made up of multiple manipulations that may zero data, and so we may need to flush and invalidate the file multiple times during an operation. However, we only need to lock out IO and other space manipulation operations once, as that lockout is maintained until the entire fallocate operation has been completed. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-