- 19 8月, 2015 12 次提交
-
-
由 Brian Foster 提交于
Several areas of code duplicate a pattern where we take the AIL lock, check whether an item is in the AIL and remove it if so. Create a new helper for this pattern and use it where appropriate. Signed-off-by: NBrian Foster <bfoster@redhat.com>
-
由 Brian Foster 提交于
The btree cursor cleanup function takes an error parameter that affects how buffers are released from the cursor. All buffers are released in the event of error. Several callers do not specify the XFS_BTREE_ERROR flag in the event of error, however. This can cause buffers to hang around locked or with an elevated hold count and thus lead to umount hangs in the event of errors. Fix up the xfs_btree_del_cursor() callers to pass XFS_BTREE_ERROR if the cursor is being torn down due to error. Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Brian Foster 提交于
The root inode is read as part of the xfs_mountfs() sequence and the reference is dropped in the event of failure after we grab the inode. The reference drop doesn't necessarily free the inode, however. It marks it for reclaim and potentially kicks off the reclaim workqueue. The workqueue is destroyed further up the error path, which means we are subject to crash if the workqueue job runs after this point or a memory leak which is identified if the xfs_inode_zone is destroyed (e.g., on module removal). Both of these outcomes are reproducible via manual instrumentation of a mount error after the root inode xfs_iget() call in xfs_mountfs(). Update the xfs_mountfs() error path to cancel any potential reclaim work items and to run a synchronous inode reclaim if the root inode is marked for reclaim. This ensures that no jobs remain on the queue before it is destroyed and that the root inode is freed before the reclaim mechanism is torn down. Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Brian Foster 提交于
The first 4 bytes of every basic block in the physical log is stamped with the current lsn. To support this mechanism, the log record header (first block of each new log record) contains space for the original first byte of each log record block before it is replaced with the lsn. The log record header has space for 32k worth of blocks. The version 2 log adds new extended record headers for each additional 32k worth of blocks beyond what is supported by the record header. The log record checksum incorporates the log record header, the extended headers and the record payload. xlog_cksum() checksums the extended headers based on log->l_iclog_heads, which specifies the number of extended headers in a log record based on the log buffer size mount option. The log buffer size is variable, however, and thus means the checksum can be calculated differently based on how a filesystem is mounted. This is problematic if a filesystem crashes and recovery occurs on a subsequent mount using a different log buffer size. For example, crash an active filesystem that is mounted with the default (32k) logbsize, attempt remount/recovery using '-o logbsize=64k' and the mount fails on or warns about log checksum failures. To avoid this problem, update xlog_cksum() to calculate the checksum based on the size of the log buffer according to the log record. The size is already included in the h_size field of the log record header and thus is available at log recovery time. Extended log record headers are also only written when the log record is large enough to require them. This makes checksum calculation of log records consistent with the extended record header mechanism as well as how on-disk records are checksummed with various log buffer size mount options. Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Brian Foster 提交于
Inode cluster buffers are invalidated and cancelled when inode chunks are freed to notify log recovery that previous logged updates to the metadata buffer should be skipped. This ensures that log recovery does not overwrite buffers that might have already been reused. On v4 filesystems, inode chunk allocation and inode updates are logged via the cluster buffers and thus cancellation is easily detected via buffer cancellation items. v5 filesystems use the new icreate transaction, which uses logical logging and ordered buffers to log a full inode chunk allocation at once. The resulting icreate item often spans multiple inode cluster buffers. Log recovery checks for cancelled buffers when processing icreate log items, but it has a couple problems. First, it uses the full length of the inode chunk rather than the cluster size. Second, it uses the length in FSB units rather than BB units. Either of these problems prevent icreate recovery from identifying cancelled buffers and thus inode initialization proceeds unconditionally. Update xlog_recover_do_icreate_pass2() to iterate the icreate range in cluster sized increments and check each increment for cancellation. Since icreate is currently only used for the minimum atomic inode chunk allocation, we expect that either all or none of the buffers will be cancelled. Cancel the icreate if at least one buffer is cancelled to avoid making a bad situation worse by initializing a partial inode chunk, but detect such anomalies and warn the user. Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Brian Foster 提交于
Various log items have recovery tracepoints to identify whether a particular log item is recovered or cancelled. Add the equivalent tracepoints for the icreate transaction. Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Brian Foster 提交于
Log recovery occurs in two phases at mount time. In the first phase, EFIs and EFDs are processed and potentially cancelled out. EFIs without EFD objects are inserted into the AIL for processing and recovery in the second phase. xfs_mountfs() runs various other operations between the phases and is thus subject to failure. If failure occurs after the first phase but before the second, pending EFIs sit on the AIL, pin it and cause the mount to hang. Update the mount sequence to ensure that pending EFIs are cancelled in the event of failure. Add a recovery cancellation mechanism to iterate the AIL and cancel all EFI items when requested. Plumb cancellation support through the log mount finish helper and update xfs_mountfs() to invoke cancellation in the event of failure after recovery has started. Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Brian Foster 提交于
The EFI is initialized with a reference count of 2. One for the EFI to ensure the item makes it to the AIL and one for the subsequently created EFD to release the EFI once the EFD is committed. Log recovery uses the EFI in a similar manner, but implements a hack to remove both references in one call once the EFD is handled. Update log recovery to use EFI reference counting in a manner consistent with the log. When an EFI is encountered during recovery, an EFI item is allocated and inserted to the AIL directly. Since the EFI reference is typically dropped when the EFI is unpinned and this is analogous with AIL insertion, drop the EFI reference at this point. When a corresponding EFD is encountered in the log, this indicates that the extents were freed, no processing is required and the EFI can be dropped. Update xlog_recover_efd_pass2() to simply drop the EFD reference at this point rather than open code the AIL removal and EFI free. Remaining EFIs (i.e., with no corresponding EFD) are processed in xlog_recover_finish(). An EFD transaction is allocated and the extents are freed, which transfers ownership of the EFI reference to the EFD item in the log. Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Brian Foster 提交于
Log recovery attempts to free extents with leftover EFIs in the AIL after initial processing. If the extent free fails (e.g., due to unrelated fs corruption), the transaction is cancelled, though it might not be dirtied at the time. If this is the case, the EFD does not abort and thus does not release the EFI. This can lead to hangs as the EFI pins the AIL. Update xlog_recover_process_efi() to log the EFD in the transaction before xfs_free_extent() errors are handled to ensure the transaction is dirty, aborts the EFD and releases the EFI on error. Since this is a requirement for EFD processing (and consistent with xfs_bmap_finish()), update the EFD logging helper to do the extent free and unconditionally log the EFD. This encodes the required EFD logging behavior into the helper and reduces the likelihood of errors down the road. [dchinner: re-add xfs_alloc.h to xfs_log_recover.c to fix build failure.] Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Brian Foster 提交于
Freeing an extent in XFS involves logging an EFI (extent free intention), freeing the actual extent, and logging an EFD (extent free done). The EFI object is created with a reference count of 2: one for the current transaction and one for the subsequently created EFD. Under normal circumstances, the first reference is dropped when the EFI is unpinned and the second reference is dropped when the EFD is committed to the on-disk log. In event of errors or filesystem shutdown, there are various potential cleanup scenarios depending on the state of the EFI/EFD. The cleanup scenarios are confusing and racy, as demonstrated by the following test sequence: # mount $dev $mnt # fsstress -d $mnt -n 99999 -p 16 -z -f fallocate=1 \ -f punch=1 -f creat=1 -f unlink=1 & # sleep 5 # killall -9 fsstress; wait # godown -f $mnt # umount ... in which the final umount can hang due to the AIL being pinned indefinitely by one or more EFI items. This can occur due to several conditions. For example, if the shutdown occurs after the EFI is committed to the on-disk log and the EFD committed to the CIL, but before the EFD committed to the log, the EFD iop_committed() abort handler does not drop its reference to the EFI. Alternatively, manual error injection in the xfs_bmap_finish() codepath shows that if an error occurs after the EFI transaction is committed but before the EFD is constructed and logged, the EFI is never released from the AIL. Update the EFI/EFD item handling code to use a more straightforward and reliable approach to error handling. If an error occurs after the EFI transaction is committed and before the EFD is constructed, release the EFI explicitly from xfs_bmap_finish(). If the EFI transaction is cancelled, release the EFI in the unlock handler. Once the EFD is constructed, it is responsible for releasing the EFI under any circumstances (including whether the EFI item aborts due to log I/O error). Update the EFD item handlers to release the EFI if the transaction is cancelled or aborts due to log I/O error. Finally, update xfs_bmap_finish() to log at least one EFD extent to the transaction before xfs_free_extent() errors are handled to ensure the transaction is dirty and EFD item error handling is triggered. Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Brian Foster 提交于
Some callers need to make error handling decisions based on whether the current transaction successfully committed or not. Rename xfs_trans_roll(), add a new parameter and provide a wrapper to preserve existing callers. Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Brian Foster 提交于
Release of the EFI either occurs based on the reference count or the extent count. The extent count used is either the count tracked in the EFI or EFD, depending on the particular situation. In either case, the count is initialized to the final value and thus always matches the current efi_next_extent value once the EFI is completely constructed. For example, the EFI extent count is increased as the extents are logged in xfs_bmap_finish() and the full free list is always completely processed. Therefore, the count is guaranteed to be complete once the EFI transaction is committed. The EFD uses the efd_nextents counter to release the EFI. This counter is initialized to the count of the EFI when the EFD is created. Thus the EFD, as currently used, has no concept of partial EFI release based on extent count. Given that the EFI extent count is always released in whole, use of the extent count for reference counting is unnecessary. Remove this level of the API and release the EFI based on the core reference count. The efi_next_extent counter remains because it is still used to track the slot to log the next extent to free. Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
- 24 6月, 2015 2 次提交
-
-
由 Jan Kara 提交于
Currently XFS calls file_remove_privs() without holding i_mutex. This is wrong because that function can end up messing with file permissions and file capabilities stored in xattrs for which we need i_mutex held. Fix the problem by grabbing iolock exclusively when we will need to change anything in permissions / xattrs. Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Jan Kara 提交于
file_remove_suid() is a misnomer since it removes also file capabilities stored in xattrs and sets S_NOSEC flag. Also should_remove_suid() tells something else than whether file_remove_suid() call is necessary which leads to bugs. Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 23 6月, 2015 1 次提交
-
-
由 Brian Foster 提交于
The xfs_attr3_root_inactive() call from xfs_attr_inactive() assumes that attribute blocks exist to invalidate. It is possible to have an attribute fork without extents, however. Consider the case where the attribute fork is created towards the beginning of xfs_attr_set() but some part of the subsequent attribute set fails. If an inode in such a state hits xfs_attr_inactive(), it eventually calls xfs_dabuf_map() and possibly xfs_bmapi_read(). The former emits a filesystem corruption warning, returns an error that bubbles back up to xfs_attr_inactive(), and leads to destruction of the in-core attribute fork without an on-disk reset. If the inode happens to make it back through xfs_inactive() in this state (e.g., via a concurrent bulkstat that cycles the inode from the reclaim state and releases it), i_afp might not exist when xfs_bmapi_read() is called and causes a NULL dereference panic. A '-p 2' fsstress run to ENOSPC on a relatively small fs (1GB) reproduces these problems. The behavior is a regression caused by: 6dfe5a04 xfs: xfs_attr_inactive leaves inconsistent attr fork state behind ... which removed logic that avoided the attribute extent truncate when no extents exist. Restore this logic to ensure the attribute fork is destroyed and reset correctly if it exists without any allocated extents. cc: stable@vger.kernel.org # 3.12 to 4.0.x Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
- 22 6月, 2015 10 次提交
-
-
由 Dave Chinner 提交于
We no longer calculate the minimum freelist size from the on-disk AGF, so we don't need the macros used for this. That means the nested macros can be cleaned up, and turn this into an actual function so the logic is clear and concise. This will make it much easier to add support for the rmap btree when the time comes. This also gets rid of the XFS_AG_MAXLEVELS macro used by these freelist macros as it is simply a wrapper around a single variable. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Dave Chinner 提交于
The error handling is currently an inconsistent mess as every error condition handles return values and releasing buffers individually. Clean this up by using gotos and a sane error label stack. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Dave Chinner 提交于
The longest extent length checks in xfs_alloc_fix_freelist() are now essentially identical. Factor them out into a helper function, so we know they are checking exactly the same thing before and after we lock the AGF. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Dave Chinner 提交于
At the moment, xfs_alloc_fix_freelist() uses a mix of per-ag based access and agf buffer based access to freelist and space usage information. However, once the AGF buffer is locked inside this function, it is guaranteed that both the in-memory and on-disk values are identical. xfs_alloc_fix_freelist() doesn't modify the values in the structures directly, so it is a read-only user of the infomration, and hence can use the per-ag structure exclusively for determining what it should do. This opens up an avenue for cleaning up a lot of duplicated logic whose only difference is the structure it gets the data from, and in doing so removes a lot of needless byte swapping overhead when fixing up the free list. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Christoph Hellwig 提交于
Just use char pointers directly instead of the confusing typedef to a pointer type. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NBrian Foster <bfoster@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Christoph Hellwig 提交于
Compared to char pointers this saves us a lot of casting effort. Also add another local variable to make the code easier to read. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NBrian Foster <bfoster@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Christoph Hellwig 提交于
This avoids all kinds of unessecary casts in an envrionment like Linux where we can assume that pointer arithmetics are support on void pointers. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NBrian Foster <bfoster@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Christoph Hellwig 提交于
We can simply use a void pointer to pass a long return addresses in the debugging helpers. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NBrian Foster <bfoster@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Christoph Hellwig 提交于
Replace uses of __psint_t with the proper uintptr_t and ptrdiff_t types, and remove the defintions of __psint_t and __psunsigned_t. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NBrian Foster <bfoster@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Eric Sandeen 提交于
If we create a CRC filesystem, mount it, and create a symlink with a path long enough that it can't live in the inode, we get a very strange result upon remount: # ls -l mnt total 4 lrwxrwxrwx. 1 root root 929 Jun 15 16:58 link -> XSLM XSLM is the V5 symlink block header magic (which happens to be followed by a NUL, so the string looks terminated). xfs_readlink_bmap() advanced cur_chunk by the size of the header for CRC filesystems, but never actually used that pointer; it kept reading from bp->b_addr, which is the start of the block, rather than the start of the symlink data after the header. Looks like this problem goes back to v3.10. Fixing this gets us reading the proper link target, again. Cc: stable@vger.kernel.org Signed-off-by: NEric Sandeen <sandeen@redhat.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
- 04 6月, 2015 13 次提交
-
-
由 Christoph Hellwig 提交于
Instead of the confusing flags argument pass a boolean flag to indicate if we want to release or regrant a log reservation. Also ensure that xfs_log_done always drop the reference on the log ticket, to both simplify the code and make the logic in xfs_trans_roll easier to understand. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Christoph Hellwig 提交于
The flags argument to xfs_trans_commit is not useful for most callers, as a commit of a transaction without a permanent log reservation must pass 0 here, and all callers for a transaction with a permanent log reservation except for xfs_trans_roll must pass XFS_TRANS_RELEASE_LOG_RES. So remove the flags argument from the public xfs_trans_commit interfaces, and introduce low-level __xfs_trans_commit variant just for xfs_trans_roll that regrants a log reservation instead of releasing it. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Christoph Hellwig 提交于
xfs_trans_cancel takes two flags arguments: XFS_TRANS_RELEASE_LOG_RES and XFS_TRANS_ABORT. Both of them are a direct product of the transaction state, and can be deducted: - any dirty transaction needs XFS_TRANS_ABORT to be properly canceled, and XFS_TRANS_ABORT is a noop for a transaction that is not dirty. - any transaction with a permanent log reservation needs XFS_TRANS_RELEASE_LOG_RES to be properly canceled, and passing XFS_TRANS_RELEASE_LOG_RES for a transaction without a permanent log reservation is invalid. So just remove the flags argument and do the right thing. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Christoph Hellwig 提交于
The flags value always was 0 or XFS_TRANS_ABORT. Switch to a bool parameter to allow further cleanups. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Christoph Hellwig 提交于
We have three remaining callers of xfs_trans_dup: - xfs_itruncate_extents which open codes xfs_trans_roll - xfs_bmap_finish doesn't have an xfs_inode argument and thus leaves attaching them to it's callers, but otherwise is identical to xfs_trans_roll - xfs_dir_ialloc looks at the log reservations in the old xfs_trans structure instead of the log reservation parameters, but otherwise is identical to xfs_trans_roll. By allowing a NULL xfs_inode argument to xfs_trans_roll we can switch these three remaining users over to xfs_trans_roll and mark xfs_trans_dup static. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Brian Foster 提交于
The inode allocator enables random sparse inode chunk allocations in DEBUG mode to facilitate testing. Sparse inode allocations are not always possible, however, depending on the fs geometry. For example, there is no possibility for a sparse inode allocation on filesystems where the block size is large enough to fit one or more inode chunks within a single block. Fix up the DEBUG mode sparse inode allocation logic to trigger random sparse allocations only when the geometry of the fs allows it. Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Brian Foster 提交于
The kbuild test robot reports the following compilation failure with a 32-bit kernel configuration: fs/built-in.o: In function `xfs_ifree_cluster': >> xfs_inode.c:(.text+0x17ac84): undefined reference to `__umoddi3' This is due to the use of the modulus operator on a 64-bit variable in the ASSERT() added as part of the following commit: xfs: skip unallocated regions of inode chunks in xfs_ifree_cluster() This ASSERT() simply checks that the offset of the inode in a sparse cluster is appropriately aligned. Since the maximum inode record offset is 63 (for a 64 inode record) and the calculated offset here should be something less than that, just use a 32-bit variable to store the offset and call the do_mod() helper. Reported-by: Nkbuild test robot <fengguang.wu@intel.com> Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Dave Chinner 提交于
Add initial DAX support to XFS. To do this we need a new mount option to turn DAX on filesystem, and we need to propagate this into the inode flags whenever an inode is instantiated so that the per-inode checks throughout the code Do The Right Thing. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Dave Chinner 提交于
DAX does not do buffered IO (can't buffer direct access!) and hence all read/write IO is vectored through the direct IO path. Hence we need to add the DAX IO path callouts to the direct IO infrastructure. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Dave Chinner 提交于
When we truncate a DAX file, we need to call through the DAX page truncation path rather than through block_truncate_page() so that mappings and block zeroing are all handled correctly. Otherwise, truncate does not need to change. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Dave Chinner 提交于
Add initial support for DAX block zeroing operations to XFS. DAX cannot use buffered IO through the page cache for zeroing, nor do we need to issue IO for uncached block zeroing. In both cases, we can simply call out to the dax block zeroing function. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Dave Chinner 提交于
Add the initial support for DAX file operations to XFS. This includes the necessary block allocation and mmap page fault hooks for DAX to function. Note that there are changes to the splice interfaces to ensure that for DAX splice avoids direct page cache manipulations and instead takes the DAX IO paths for read/write operations. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
由 Dave Chinner 提交于
Lock ordering for the new mmap lock needs to be: mmap_sem sb_start_pagefault i_mmap_lock page lock <fault processsing> Right now xfs_vm_page_mkwrite gets this the wrong way around, While technically it cannot deadlock due to the current freeze ordering, it's still a landmine that might explode if we change anything in future. Hence we need to nest the locks correctly. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NJan Kara <jack@suse.cz> Reviewed-by: NBrian Foster <bfoster@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
- 02 6月, 2015 2 次提交
-
-
由 Tejun Heo 提交于
With the planned cgroup writeback support, backing-dev related declarations will be more widely used across block and cgroup; unfortunately, including backing-dev.h from include/linux/blkdev.h makes cyclic include dependency quite likely. This patch separates out backing-dev-defs.h which only has the essential definitions and updates blkdev.h to include it. c files which need access to more backing-dev details now include backing-dev.h directly. This takes backing-dev.h off the common include dependency chain making it a lot easier to use it across block and cgroup. v2: fs/fat build failure fixed. Signed-off-by: NTejun Heo <tj@kernel.org> Reviewed-by: NJan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Greg Thelen 提交于
When modifying PG_Dirty on cached file pages, update the new MEM_CGROUP_STAT_DIRTY counter. This is done in the same places where global NR_FILE_DIRTY is managed. The new memcg stat is visible in the per memcg memory.stat cgroupfs file. The most recent past attempt at this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632 The new accounting supports future efforts to add per cgroup dirty page throttling and writeback. It also helps an administrator break down a container's memory usage and provides evidence to understand memcg oom kills (the new dirty count is included in memcg oom kill messages). The ability to move page accounting between memcg (memory.move_charge_at_immigrate) makes this accounting more complicated than the global counter. The existing mem_cgroup_{begin,end}_page_stat() lock is used to serialize move accounting with stat updates. Typical update operation: memcg = mem_cgroup_begin_page_stat(page) if (TestSetPageDirty()) { [...] mem_cgroup_update_page_stat(memcg) } mem_cgroup_end_page_stat(memcg) Summary of mem_cgroup_end_page_stat() overhead: - Without CONFIG_MEMCG it's a no-op - With CONFIG_MEMCG and no inter memcg task movement, it's just rcu_read_lock() - With CONFIG_MEMCG and inter memcg task movement, it's rcu_read_lock() + spin_lock_irqsave() A memcg parameter is added to several routines because their callers now grab mem_cgroup_begin_page_stat() which returns the memcg later needed by for mem_cgroup_update_page_stat(). Because mem_cgroup_begin_page_stat() may disable interrupts, some adjustments are needed: - move __mark_inode_dirty() from __set_page_dirty() to its caller. __mark_inode_dirty() locking does not want interrupts disabled. - use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in __delete_from_page_cache(), replace_page_cache_page(), invalidate_complete_page2(), and __remove_mapping(). text data bss dec hex filename 8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before 8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after +192 text bytes 8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before 8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after +773 text bytes Performance tests run on v4.0-rc1-36-g4f671fe2. Lower is better for all metrics, they're all wall clock or cycle counts. The read and write fault benchmarks just measure fault time, they do not include I/O time. * CONFIG_MEMCG not set: baseline patched kbuild 1m25.030000(+-0.088% 3 samples) 1m25.426667(+-0.120% 3 samples) dd write 100 MiB 0.859211561 +-15.10% 0.874162885 +-15.03% dd write 200 MiB 1.670653105 +-17.87% 1.669384764 +-11.99% dd write 1000 MiB 8.434691190 +-14.15% 8.474733215 +-14.77% read fault cycles 254.0(+-0.000% 10 samples) 253.0(+-0.000% 10 samples) write fault cycles 2021.2(+-3.070% 10 samples) 1984.5(+-1.036% 10 samples) * CONFIG_MEMCG=y root_memcg: baseline patched kbuild 1m25.716667(+-0.105% 3 samples) 1m25.686667(+-0.153% 3 samples) dd write 100 MiB 0.855650830 +-14.90% 0.887557919 +-14.90% dd write 200 MiB 1.688322953 +-12.72% 1.667682724 +-13.33% dd write 1000 MiB 8.418601605 +-14.30% 8.673532299 +-15.00% read fault cycles 266.0(+-0.000% 10 samples) 266.0(+-0.000% 10 samples) write fault cycles 2051.7(+-1.349% 10 samples) 2049.6(+-1.686% 10 samples) * CONFIG_MEMCG=y non-root_memcg: baseline patched kbuild 1m26.120000(+-0.273% 3 samples) 1m25.763333(+-0.127% 3 samples) dd write 100 MiB 0.861723964 +-15.25% 0.818129350 +-14.82% dd write 200 MiB 1.669887569 +-13.30% 1.698645885 +-13.27% dd write 1000 MiB 8.383191730 +-14.65% 8.351742280 +-14.52% read fault cycles 265.7(+-0.172% 10 samples) 267.0(+-0.000% 10 samples) write fault cycles 2070.6(+-1.512% 10 samples) 2084.4(+-2.148% 10 samples) As expected anon page faults are not affected by this patch. tj: Updated to apply on top of the recent cancel_dirty_page() changes. Signed-off-by: NSha Zhengju <handai.szj@gmail.com> Signed-off-by: NGreg Thelen <gthelen@google.com> Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@fb.com>
-