- 12 May, 2022 6 commits
-
-
Committed by Dave Chinner
Clean up the final leaf/node states in xfs_attr_set_iter() to further simplify the high level state machine and to set the completion state correctly. As we are adding a separate state for node format removal, we need to ensure that node formats are collapsed back to shortform or empty correctly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Committed by Dave Chinner
We may not have a remote value for the old xattr we have to remove, so skip over the remote value removal states and go straight to the xattr name removal in the leaf/node block. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Committed by Dave Chinner
We can skip the REPLACE state when LARP is enabled, but that means the XFS_DAS_FLIP_LFLAG state is now poorly named: it indicates something that has been done rather than what the state is going to do. Rename it to "REMOVE_OLD" to indicate that we are now going to perform removal of the old attr. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Committed by Dave Chinner
When we set a new xattr, we have three exit paths:

1. nothing else to do
2. allocate and set the remote xattr value
3. perform the rest of a replace operation

Currently we push both 2 and 3 into the same state, regardless of whether we just set a remote attribute or not. Once we've set the remote xattr, we have two exit states:

1. nothing else to do
2. perform the rest of a replace operation

Hence we can split the remote xattr allocation and setting into their own states and factor it out of xfs_attr_set_iter() to further clean up the state machine and the implementation of the state machine.

Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Committed by Dave Chinner
We re-enter the XFS_DAS_FOUND_LBLK state when we have to allocate multiple extents for a remote xattr. We currently have a flag called XFS_DAC_LEAF_ADDNAME_INIT to avoid running the remote attr hole finding code more than once. However, for the node format tree, we have a separate state for this, so we never re-enter the state machine at XFS_DAS_FOUND_NBLK and it does not need a special flag to skip over the remote attr hole finding code. Convert the leaf block code to use the same state machine as the node blocks and kill the XFS_DAC_LEAF_ADDNAME_INIT flag. This further points out that this "ALLOC" state is only traversed if we have remote xattrs or we are doing a rename operation. Rename both the leaf and node alloc states to _ALLOC_RMT to indicate they are iterating to do allocation of remote xattr blocks. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Committed by Dave Chinner
We currently use XFS_DAS_UNINIT for several steps in the attr_set state machine. We use it for setting shortform xattrs, converting from shortform to leaf, leaf add, leaf-to-node and node add. All of these things are essentially known before we start the state machine iterating, so we really should separate them out:

XFS_DAS_SF_ADD:
- tries to do a shortform add
- on success -> done
- on ENOSPC converts to leaf, -> XFS_DAS_LEAF_ADD
- on error, dies

XFS_DAS_LEAF_ADD:
- tries to do leaf add
- on success:
  - inline attr -> done
  - remote xattr || REPLACE -> XFS_DAS_FOUND_LBLK
- on ENOSPC converts to node, -> XFS_DAS_NODE_ADD
- on error, dies

XFS_DAS_NODE_ADD:
- tries to do node add
- on success:
  - inline attr -> done
  - remote xattr || REPLACE -> XFS_DAS_FOUND_NBLK
- on error, dies

This makes it easier to understand how the state machine starts up and sets us up on the path to further state machine simplifications.

This also converts the DAS state tracepoints to use strings rather than numbers, as converting between enums and numbers requires manual counting rather than just reading the name.

This also introduces an XFS_DAS_DONE state so that we can trace successful operation completions easily.

Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
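
The startup transitions listed in this commit message can be sketched as a small transition function. This is only an illustration under stated assumptions: the `DAS_*` enum, `struct attr_op` and `das_next()` are hypothetical stand-ins invented for the example, not the kernel's `xfs_attr_set_iter()` or its data structures, and the result of each add attempt is supplied as an input rather than computed.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical mirror of the states named in the commit message. */
enum das_state {
	DAS_SF_ADD,
	DAS_LEAF_ADD,
	DAS_NODE_ADD,
	DAS_FOUND_LBLK,
	DAS_FOUND_NBLK,
	DAS_DONE,
};

/* Illustrative inputs: the outcome of each add attempt is given up front. */
struct attr_op {
	int  sf_result;    /* 0, -ENOSPC, or another negative errno */
	int  leaf_result;  /* 0, -ENOSPC, or another negative errno */
	int  node_result;  /* 0 or a negative errno */
	bool remote;       /* value lives in remote blocks */
	bool replace;      /* this is a replace operation */
};

/*
 * Compute the state that follows @cur. Returns 0 and fills in @next on a
 * valid transition, or a negative errno when the operation "dies".
 */
static int das_next(enum das_state cur, const struct attr_op *op,
		    enum das_state *next)
{
	bool more_work = op->remote || op->replace;

	switch (cur) {
	case DAS_SF_ADD:
		if (op->sf_result == 0)
			*next = DAS_DONE;
		else if (op->sf_result == -ENOSPC)
			*next = DAS_LEAF_ADD;   /* converted to leaf format */
		else
			return op->sf_result;
		return 0;
	case DAS_LEAF_ADD:
		if (op->leaf_result == 0)
			*next = more_work ? DAS_FOUND_LBLK : DAS_DONE;
		else if (op->leaf_result == -ENOSPC)
			*next = DAS_NODE_ADD;   /* converted to node format */
		else
			return op->leaf_result;
		return 0;
	case DAS_NODE_ADD:
		if (op->node_result != 0)
			return op->node_result;
		*next = more_work ? DAS_FOUND_NBLK : DAS_DONE;
		return 0;
	default:
		return -EINVAL;                 /* not a startup state */
	}
}
```

Because each starting state has its own case, the choice of entry point can be made before iteration begins, which is the simplification the message describes.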
-
- 11 May, 2022 2 commits
-
-
Committed by Dave Chinner
Logged attribute intents only have set and remove types; there is no separate intent type for a replace operation. We should have a separate type for a replace operation, as it needs to perform operations that neither SET nor REMOVE can perform. Add this type to the intent items and rearrange the deferred operation setup to reflect the different operations we are performing. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Committed by Allison Henderson
This patch adds a helper function xfs_attr_leaf_addname. While this does help to break down xfs_attr_set_iter, it does also hoist out some of the state management. This patch has been moved to the end of the clean up series for further discussion. Suggested-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
- 04 May, 2022 1 commit
-
-
Committed by Dave Chinner
When we log modifications based on intents, we add both intent and intent done items to the modification being made. These get written to the log to ensure that the operation is re-run if the intent done is not found in the log.

However, for operations that complete wholly within a single checkpoint, the change in the checkpoint is atomic and will never need replay. In this case, we don't need to actually write the intent and intent done items to the journal because log recovery will never need to manually restart this modification.

Log recovery currently handles intent/intent done matching by inserting the intent into the AIL, then removing it when a matching intent done item is found. Hence for all the intent-based operations that complete within a checkpoint, we spend all that time parsing the intent/intent done items just to cancel them and do nothing with them.

Hence it follows that the only time we actually need intents in the log is when the modification crosses checkpoint boundaries in the log and so may only be partially complete in the journal. Hence if we commit an intent done item to the CIL and the intent item is in the same checkpoint, we don't actually have to write them to the journal because log recovery will always cancel the intents.

We've never really worried about the overhead of logging intents unnecessarily like this because the intents we log are generally very much smaller than the change being made. e.g. freeing an extent involves modifying at least two freespace btree blocks and the AGF, so the EFI/EFD overhead is only a small increase in space and processing time compared to the overall cost of freeing an extent.

However, delayed attributes change this cost equation dramatically, especially for inline attributes. In the case of adding an inline attribute, we only log the inode core and attribute fork at present. With delayed attributes, we now log the attr intent which includes the name and value, the inode core and attr fork, and finally the attr intent done item. We increase the number of items we log from 1 to 3, and the number of log vectors (regions) goes up from 3 to 7. Hence we triple the number of objects that the CIL has to process, and more than double the number of log vectors that need to be written to the journal.

At scale, this means delayed attributes cause a non-pipelined CIL to become CPU bound processing all the extra items, resulting in a > 40% performance degradation on 16-way file+xattr create workloads. Pipelining the CIL (as per 5.15) reduces the performance degradation to 20%, but now the limitation is the rate at which the log items can be written to the iclogs and iclogs be dispatched for IO and completed.

Even log IO completion is slowed down by these intents, because it now has to process 3x the number of items in the checkpoint. Processing completed intents is especially inefficient here, because we first insert the intent into the AIL, then remove it from the AIL when the intent done is processed. IOWs, we are also doing expensive operations in log IO completion we could completely avoid if we didn't log completed intent/intent done pairs.

Enter log item whiteouts.

When an intent done is committed, we can check to see if the associated intent is in the same checkpoint as we are currently committing the intent done to. If so, we can mark the intent log item with a whiteout and immediately free the intent done item rather than committing it to the CIL. We can basically skip the entire formatting and CIL insertion steps for the intent done item.

However, we cannot remove the intent item from the CIL at this point because the unlocked per-cpu CIL item lists do not permit removal without holding the CIL context lock exclusively. Transaction commit only holds the context lock shared, hence the best we can do is mark the intent item with a whiteout so that the CIL push can release it rather than writing it to the log.

This means we never write the intent to the log if the intent done has also been committed to the same checkpoint, but we'll always write the intent if the intent done has not been committed or has been committed to a different checkpoint. This will result in correct log recovery behaviour in all cases, without the overhead of logging unnecessary intents.

This intent whiteout concept is generic: we can apply it to all intent/intent done pairs that have a direct 1:1 relationship. The way deferred ops iterate and relog intents means that all intents currently have a 1:1 relationship with their done intent, and hence we can apply this cancellation to all existing intent/intent done implementations.

For delayed attributes with a 16-way 64kB xattr create workload, whiteouts reduce the amount of journalled metadata from ~2.5GB/s down to ~600MB/s and improve the creation rate from 9000/s to 14000/s.

Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
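
The whiteout rule described above can be modelled in a few lines. The sketch below is a simplified model, not the kernel implementation: `struct log_item`, `commit_intent_done()` and `push_checkpoint()` are hypothetical names made up for the example, checkpoints are represented by a plain sequence number, and locking is omitted entirely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified log item: just enough state for the whiteout rule. */
struct log_item {
	unsigned long ckpt_seq;  /* checkpoint the item was committed into */
	bool whiteout;           /* set when the item must not reach the journal */
};

/*
 * Commit an intent done for @intent into checkpoint @cur_seq.
 * Returns true if the done item still has to be committed to the CIL,
 * false if it was elided because the intent sits in the same checkpoint.
 */
static bool commit_intent_done(struct log_item *intent, unsigned long cur_seq)
{
	if (intent->ckpt_seq == cur_seq) {
		/* Same checkpoint: recovery would cancel the pair anyway. */
		intent->whiteout = true;
		return false;
	}
	/* Intent may already be on disk: the done item must be logged. */
	return true;
}

/*
 * Model of a checkpoint push: whiteout items are released instead of
 * being written. Returns the number of items that would be written.
 */
static size_t push_checkpoint(const struct log_item *items, size_t nr)
{
	size_t written = 0;

	for (size_t i = 0; i < nr; i++) {
		if (!items[i].whiteout)
			written++;
	}
	return written;
}
```

In this model the intent is only marked at commit time and dropped at push time, matching the constraint in the message that the item cannot be removed from the list while the context lock is held shared.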
-
- 29 April, 2022 3 commits
-
-
Committed by Darrick J. Wong
Currently, the code that performs CoW remapping after a write has this odd behavior where it walks /backwards/ through the data fork to remap extents in reverse order. Earlier, we rewrote the reflink remap function to use deferred bmap log items instead of trying to cram as much into the first transaction that we could. Now do the same for the CoW remap code. There doesn't seem to be any performance impact; we're just making better use of code that we added for the benefit of reflink. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Committed by Darrick J. Wong
Move the tracepoint that computes the size of the transaction used to compute the minimum log size into xfs_log_get_max_trans_res so that we only have to compute this stuff once. Leave xfs_log_get_max_trans_res as a non-static function so that xfs_db can call it to report the results of the userspace computation of the same value to diagnose mkfs/kernel misinteractions. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Committed by Darrick J. Wong
Every time someone changes the transaction reservation sizes, they introduce potential compatibility problems if the changes affect the minimum log size that we validate at mount time. If the minimum log size gets larger (which should be avoided because doing so presents a serious risk of log livelock), filesystems created with old mkfs will not mount on a newer kernel; if the minimum size shrinks, filesystems created with newer mkfs will not mount on older kernels. Therefore, enable the creation of a shadow log reservation structure where we can "undo" the effects of tweaks when computing minimum log sizes. These shadow reservations should never be used in practice, but they insulate us from perturbations in minimum log size. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
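
One way to picture the shadow reservation idea is a copy of the live reservation table in which a later tweak is reverted before the maximum is taken. The sketch below is a loose illustration only: `struct resv_set`, its three fields, and both helpers are hypothetical, and the real kernel structures and the set of tweaks that get undone are not shown in this commit message.

```c
/* Hypothetical reservation table with a few per-operation sizes in bytes. */
struct resv_set {
	unsigned int write;
	unsigned int itruncate;
	unsigned int attrset;
};

/* Largest reservation in the table; it drives the minimum log size. */
static unsigned int resv_max(const struct resv_set *r)
{
	unsigned int max = r->write;

	if (r->itruncate > max)
		max = r->itruncate;
	if (r->attrset > max)
		max = r->attrset;
	return max;
}

/*
 * Compute the reservation used for minimum log size validation.
 * The live table is copied into a shadow and the (assumed) tweak to the
 * write reservation is undone there, so the live table stays untouched
 * and the minimum log size does not move when the live value changes.
 */
static unsigned int minlog_resv(const struct resv_set *live,
				unsigned int old_write)
{
	struct resv_set shadow = *live;

	shadow.write = old_write;
	return resv_max(&shadow);
}
```

The shadow copy is only ever consulted for the mount-time check; transactions in this model would keep using the live table.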
-
- 28 April, 2022 1 commit
-
-
Committed by Darrick J. Wong
Record the buffer ops in the xfs_buf tracepoints so that we can monitor the alleged type of the buffer. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
-
- 21 April, 2022 2 commits
-
-
Committed by Dave Chinner
5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned fields to be unsigned. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
Committed by Dave Chinner
5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned fields to be unsigned. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
-
- 11 April, 2022 2 commits
-
-
Committed by Chandan Babu R
A future commit will introduce a 64-bit on-disk data extent counter and a 32-bit on-disk attr extent counter. This commit promotes xfs_extnum_t and xfs_aextnum_t to 64 and 32 bits respectively in order to correctly handle in-core versions of these quantities. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
-
Committed by Chandan Babu R
xfs_extnum_t is the type to use to declare variables which have values obtained from xfs_dinode->di_[a]nextents. This commit replaces basic types (e.g. uint32_t) with xfs_extnum_t for such variables. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
-
- 20 March, 2022 1 commit
-
-
Committed by Dave Chinner
Log items belong to the log, not the xfs_mount. Convert the mount pointer in the log item to an xlog pointer in preparation for upcoming log centric changes to the log items. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-
- 15 March, 2022 1 commit
-
-
Committed by Darrick J. Wong
Various directory functions do not modify their @name parameter, so mark it const to make that clear. This will enable us to mark the global xfs_name_dotdot variable as const to prevent mischief. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
-
- 20 October, 2021 1 commit
-
-
Committed by Darrick J. Wong
Split out the btree level information into a separate struct and put it at the end of the cursor structure as a VLA. Files with huge data forks (and in the future, the realtime rmap btree) will require the ability to support many more levels than a per-AG btree cursor, which means that we're going to create per-btree type cursor caches to conserve memory for the more common case. Note that a subsequent patch actually introduces dynamic cursor heights. This one merely rearranges the structure to prepare for that. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
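
The layout change described here, per-level state moved to a trailing array whose length is chosen at allocation time, can be illustrated with a C99 flexible array member. The struct and function names below are hypothetical and the fields are placeholders; this is a userspace sketch of the layout technique, not the kernel's cursor definition.

```c
#include <stdlib.h>

/* Hypothetical per-level state, split out into its own struct. */
struct cur_level {
	void           *bp;   /* buffer for this level (placeholder) */
	unsigned short  ptr;  /* record index within the block */
};

/* Hypothetical cursor with the level array at the very end. */
struct btree_cur {
	unsigned int     maxlevels;  /* number of entries in levels[] */
	unsigned int     nlevels;    /* levels currently in use */
	struct cur_level levels[];   /* flexible array member, sized at alloc */
};

/* Bytes needed for a cursor that can track @maxlevels levels. */
static size_t cur_sizeof(unsigned int maxlevels)
{
	return sizeof(struct btree_cur) +
	       (size_t)maxlevels * sizeof(struct cur_level);
}

/* Allocate a zeroed cursor tall enough for @maxlevels levels. */
static struct btree_cur *cur_alloc(unsigned int maxlevels)
{
	struct btree_cur *cur = calloc(1, cur_sizeof(maxlevels));

	if (!cur)
		return NULL;
	cur->maxlevels = maxlevels;
	return cur;
}
```

With this layout a short per-AG cursor and a tall data fork cursor share one type but occupy different amounts of memory, which is what makes per-btree-type caches worthwhile.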
-
- 20 August, 2021 16 commits
-
-
Committed by Dave Chinner
Stop directly referencing b_bn in code outside the buffer cache, as b_bn is supposed to be used only as an internal cache index. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-
Committed by Dave Chinner
The remaining mount flags kept in m_flags are actually runtime state flags. These change dynamically, so they really should be updated atomically so we don't potentially lose an update due to racing modifications. Convert these remaining flags to be stored in m_opstate and use atomic bitops to set and clear the flags. This also adds a couple of simple wrappers for common state checks: read only and shutdown. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
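
The pattern of atomic state bits plus small wrappers can be sketched in userspace with C11 atomics. This is an analogy under stated assumptions: the kernel uses its own bitops rather than `<stdatomic.h>`, and `struct demo_mount`, the bit numbers, and every function name below are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit numbers for runtime state. */
enum {
	DEMO_OPSTATE_SHUTDOWN = 0,
	DEMO_OPSTATE_READONLY = 1,
};

/* Hypothetical mount structure holding only the state word. */
struct demo_mount {
	atomic_ulong opstate;
};

/* Atomically set one state bit; concurrent setters cannot lose updates. */
static void demo_set_opstate(struct demo_mount *mp, int bit)
{
	atomic_fetch_or(&mp->opstate, 1UL << bit);
}

/* Atomically clear one state bit. */
static void demo_clear_opstate(struct demo_mount *mp, int bit)
{
	atomic_fetch_and(&mp->opstate, ~(1UL << bit));
}

/* Read one state bit. */
static bool demo_test_opstate(struct demo_mount *mp, int bit)
{
	return (atomic_load(&mp->opstate) >> bit) & 1UL;
}

/* Simple wrappers for the two common checks named in the commit message. */
static bool demo_is_readonly(struct demo_mount *mp)
{
	return demo_test_opstate(mp, DEMO_OPSTATE_READONLY);
}

static bool demo_is_shutdown(struct demo_mount *mp)
{
	return demo_test_opstate(mp, DEMO_OPSTATE_SHUTDOWN);
}
```

A plain `flags |= X` is a separate load, modify and store, so two racing writers can overwrite each other; the read-modify-write operations above are single atomic steps, which is the lost update problem the message refers to.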
-
Committed by Darrick J. Wong
Because there are a lot of tracepoints that express numeric data with an associated unit and tag, document what they are to help everyone else keep these things straight. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
-
Committed by Darrick J. Wong
Always print inode generation in hexadecimal and preceded with the unit "gen". Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
-
Committed by Darrick J. Wong
For the remaining xfs_buf tracepoints, convert all the tags to xfs_daddr_t units and retag them 'daddrcount' to match everything else. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
-
Committed by Darrick J. Wong
Emit whichfork values as text strings in the ftrace output. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
-
Committed by Darrick J. Wong
Whenever we record i_disk_size (i.e. the ondisk file size), use the "disize" tag and hexadecimal format consistently. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
-
Committed by Darrick J. Wong
Some of our tracepoints have a field known as "count". That name doesn't describe any units, which makes the fields not very useful. Rename the fields to capture units and ensure the format is hexadecimal when we're referring to blocks, extents, or IO operations.

"fsbcount" are in units of fs blocks
"bytecount" are in units of bytes

Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
-
Committed by Darrick J. Wong
Some of our tracepoints have a field known as "len". That name doesn't describe any units, which makes the fields not very useful. Rename the fields to capture units and ensure the format is hexadecimal.

"fsbcount" are in units of fs blocks
"bbcount" are in units of 512b blocks
"ireccount" are in units of inodes

Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
-
Committed by Darrick J. Wong
Some of our tracepoints describe fields as "offset". That name doesn't describe any units, which makes the fields not very useful. Rename the fields to capture units and ensure the format is hexadecimal.

"fileoff" means file offset, in units of fs blocks
"pos" means file offset, in bytes
"forkoff" means inode fork offset, in bytes

The one remaining "offset" value is for iclogs, since that's the byte offset of the end of where we've written into the current iclog.

Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
-
Committed by Darrick J. Wong
Some of our tracepoints describe fields as "blkno", "block", or "bno". Those names don't describe any units, which makes the fields not very useful. Rename the fields to capture units and ensure the format is hexadecimal.

"startblock" is the startblock field from the bmap structure, which is a segmented fsblock on the data device, or an rfsblock on the realtime device.
"fileoff" is a file offset, in units of filesystem blocks
"daddr" is a raw device offset, in 512b blocks

Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
-
Committed by Darrick J. Wong
Always print disk addr (i.e. 512 byte block) numbers in hexadecimal and preceded with the unit "daddr". Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
-
Committed by Darrick J. Wong
Always print rmap owner number in hexadecimal and preceded with the unit "owner". Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
-
Committed by Darrick J. Wong
Always print allocation group block numbers in hexadecimal and preceded with the unit "agbno". Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
-
Committed by Darrick J. Wong
Always print allocation group numbers in hexadecimal and preceded with the unit "agno". Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
-
Committed by Darrick J. Wong
Always print inode numbers in hexadecimal and preceded with the unit "ino" or "agino", as appropriate. Fix one tracepoint that used "ino %u" for an inode btree block count to reduce confusion. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
-
- 19 August, 2021 2 commits
-
-
Committed by Darrick J. Wong
The query_range functions are supposed to call a caller-supplied function on each record found in the dataset. These functions don't own the memory storing the record, so don't let them change the record. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
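
The ownership rule in this message maps naturally onto a `const` pointer in the callback signature. The sketch below is a generic illustration: `struct demo_rec`, `demo_query_range()` and the callback type are hypothetical, and a flat array stands in for the btree the real functions walk.

```c
#include <stddef.h>

/* Hypothetical record type standing in for a btree record. */
struct demo_rec {
	unsigned int startblock;
	unsigned int blockcount;
};

/* The callback sees the record through a const pointer: read only. */
typedef int (*demo_query_range_fn)(const struct demo_rec *rec, void *priv);

/*
 * Call @fn on every record whose startblock lies in [low, high].
 * Stops early and returns the callback's value if it is nonzero.
 */
static int demo_query_range(const struct demo_rec *recs, size_t nr,
			    unsigned int low, unsigned int high,
			    demo_query_range_fn fn, void *priv)
{
	for (size_t i = 0; i < nr; i++) {
		int error;

		if (recs[i].startblock < low || recs[i].startblock > high)
			continue;
		error = fn(&recs[i], priv);
		if (error)
			return error;
	}
	return 0;
}

/* Example callback: accumulates block counts into *priv. */
static int demo_sum_blocks(const struct demo_rec *rec, void *priv)
{
	unsigned long *total = priv;

	*total += rec->blockcount;
	/* rec->blockcount = 0; would not compile: rec points to const */
	return 0;
}
```

Mutable results go through the private pointer, so the compiler rejects any attempt by a callback to write to the record itself.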
-
Committed by Darrick J. Wong
Add a tracepoint for fs shutdowns so we can capture that in ftrace output. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
- 17 August, 2021 1 commit
-
-
Committed by Dave Chinner
We don't need an iclog state field to tell us the log has been shut down. We can just check xlog_is_shutdown() instead. This avoids the need to have shutdown overwrite the current iclog state while it is being actively used by the log code, and so avoids having to ensure that every iclog state check handles XLOG_STATE_IOERROR appropriately. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-
- 10 August, 2021 1 commit
-
-
Committed by Allison Henderson
This is a quick patch to add new xfs_attr_*_return tracepoints. We use these to track whenever a new state is set or -EAGAIN is returned. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-