- 07 July 2020, 16 commits
-
-
Submitted by Dave Chinner
This was used to track if the item had logged fields being flushed to disk. We log everything in the inode these days, so this logic is no longer needed. Remove it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
Submitted by Dave Chinner
In tracking down a problem in this patchset, I discovered we are reclaiming dirty stale inodes. This wasn't discovered until inodes were always attached to the cluster buffer and then the rcu callback that freed inodes was assert failing because the inode still had an active pointer to the cluster buffer after it had been reclaimed. Debugging the issue indicated that this was a pre-existing issue resulting from the way the inodes are handled in xfs_inactive_ifree. When we free a cluster buffer from xfs_ifree_cluster, all the inodes in cache are marked XFS_ISTALE. Those that are clean have nothing else done to them and so eventually get cleaned up by background reclaim. i.e. it is assumed we'll never dirty/relog an inode marked XFS_ISTALE. On journal commit, dirty stale inodes are handled by both the buffer and inode log items: they are run through xfs_istale_done() and removed from the AIL (buffer log item commit), or the inode log item simply unpins them because the buffer log item will clean them. What happens to any specific inode is entirely dependent on which log item wins the commit race, but the result is the same - stale inodes are clean, not attached to the cluster buffer, and not in the AIL. Hence inode reclaim can just free these inodes without further care. However, if the stale inode is relogged, it gets dirtied again and relogged into the CIL. Most of the time this isn't an issue, because relogging simply changes the inode's location in the current checkpoint. Problems arise, however, when the CIL checkpoints between two transactions in the xfs_inactive_ifree() deferops processing. This results in the XFS_ISTALE inode being redirtied and inserted into the CIL without any of the other stale cluster buffer infrastructure being in place. Hence on journal commit, it simply gets unpinned, so it remains dirty in memory. Everything in inode writeback avoids XFS_ISTALE inodes so it can't be written back, and it is not tracked in the AIL so there's not even a trigger to attempt to clean the inode. Hence the inode just sits dirty in memory until inode reclaim comes along, sees that it is XFS_ISTALE, and goes to reclaim it. This reclaiming of a dirty inode caused use-after-free, list corruptions and other nasty issues later in this patchset. Hence this patch addresses a violation of the "never log XFS_ISTALE inodes" rule caused by the deferops processing rolling a transaction and relogging a stale inode in xfs_inactive_ifree. It also adds a bunch of asserts to catch this problem in debug kernels so that we don't reintroduce this problem in the future. Reproducer for this issue was generic/558 on a v4 filesystem. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
Submitted by Yafang Shao
Remove current_pid(), current_test_flags() and current_clear_flags_nested(), because they are useless. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
Submitted by Dave Chinner
The page fault-around path, ->map_pages, is implemented in XFS via filemap_map_pages(). This function checks that pages found in page cache lookups have not raced with truncate-based invalidation by checking that page->mapping is correct and page->index is within EOF. However, we've known for a long time that this is not sufficient to protect against races with invalidations done by operations that do not change EOF, e.g. hole punching and other fallocate() based direct extent manipulations. The way we protect against these races is to wrap the page fault operations in an XFS_MMAPLOCK_SHARED lock so they serialise against fallocate and truncate before calling into the filemap function that processes the fault. Do the same for XFS's ->map_pages implementation to close this potential data corruption issue. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
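A minimal sketch of the resulting handler, based on the ->map_pages signature of that era (which returned void); treat the details as illustrative rather than the exact upstream diff:

	/* fs/xfs/xfs_file.c (sketch) */
	static void
	xfs_filemap_map_pages(
		struct vm_fault		*vmf,
		pgoff_t			start_pgoff,
		pgoff_t			end_pgoff)
	{
		struct inode		*inode = file_inode(vmf->vma->vm_file);

		/* Serialise against truncate, hole punch and other fallocate ops. */
		xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
		filemap_map_pages(vmf, start_pgoff, end_pgoff);
		xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
	}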
-
Submitted by Darrick J. Wong
Move the double-inode locking helpers to xfs_inode.c since they're not specific to reflink. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Submitted by Darrick J. Wong
Refactor the two functions that we use to lock and unlock two inodes to block userspace from initiating IO against a file, whether via system calls or mmap activity. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Submitted by Darrick J. Wong
Fix the return value of xfs_reflink_remap_prep so that its return value conventions match the rest of xfs. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Submitted by Darrick J. Wong
If the source and destination map are identical, we can skip the remap step to save some time. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Submitted by Darrick J. Wong
When logging quota block count updates during a reflink operation, we only log the /delta/ of the block count changes to the dquot. Since we now know ahead of time the extent type of both dmap and smap (and that they have the same length), we know that we only need to reserve quota blocks for dmap's blockcount if we're mapping it into a hole. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Submitted by Darrick J. Wong
Now that we've reworked xfs_reflink_remap_extent to remap only one extent per transaction, we actually know if the extent being removed is an allocated mapping. This means that we now know ahead of time if we're going to be touching the data fork. Since we only need blocks for a bmbt split if we're going to update the data fork, we only need to get quota reservation if we know we're going to touch the data fork. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Submitted by Darrick J. Wong
The existing reflink remapping loop has some structural problems that need addressing: The biggest problem is that we create one transaction for each extent in the source file without accounting for the number of mappings there are for the same range in the destination file. In other words, we don't know the number of remap operations that will be necessary and we therefore cannot guess the block reservation required. On highly fragmented filesystems (e.g. ones with active dedupe) we guess wrong, run out of block reservation, and fail. The second problem is that we don't actually use the bmap intents to their full potential -- instead of calling bunmapi directly and having to deal with its backwards operation, we could call the deferred ops xfs_bmap_unmap_extent and xfs_refcount_decrease_extent. This makes the frontend loop much simpler. Solve all of these problems by refactoring the remapping loops so that we only perform one remapping operation per transaction, and each operation only tries to remap a single extent from source to dest. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reported-by: Edwin Török <edwin@etorok.net> Tested-by: Edwin Török <edwin@etorok.net>
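A schematic of the reworked loop; the names follow the XFS code, but this is a simplified illustration rather than the actual function (quota reservation, refcount updates, inode locking and error unwinding are omitted):

	/* Sketch: remap one extent per transaction using deferred ops. */
	while (len > 0) {
		struct xfs_trans	*tp;
		struct xfs_bmbt_irec	smap, dmap;

		/* ... look up one source mapping (smap) and one destination
		 *     mapping (dmap) covering the same file range ... */

		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks,
					0, 0, &tp);
		if (error)
			break;

		/* Defer the unmap of the destination extent ... */
		xfs_bmap_unmap_extent(tp, ip, &dmap);
		/* ... and the map-in of the source extent, then commit. */
		xfs_bmap_map_extent(tp, ip, &smap);

		error = xfs_trans_commit(tp);
		if (error)
			break;

		len -= smap.br_blockcount;
	}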
-
Submitted by Darrick J. Wong
The name of this predicate is a little misleading -- it decides if the extent mapping is allocated and written. Change the name to be more direct, as we're going to add a new predicate in the next patch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Submitted by Darrick J. Wong
Quota reservations are supposed to account for the blocks that might be allocated due to a bmap btree split. Reflink doesn't do this, so fix this to make the quota accounting more accurate before we start rearranging things. Fixes: 862bb360 ("xfs: reflink extents from one file to another") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Submitted by Darrick J. Wong
The data fork scrubber calls filemap_write_and_wait to flush dirty pages and delalloc reservations out to disk prior to checking the data fork's extent mappings. Unfortunately, this means that scrub can consume the EIO/ENOSPC errors that would otherwise have stayed around in the address space until (we hope) the writer application calls fsync to persist data and collect errors. The end result is that programs that wrote to a file might never see the error code and proceed as if nothing were wrong. xfs_scrub is not in a position to notify file writers about the writeback failure, and it's only here to check metadata, not file contents. Therefore, if writeback fails, we should stuff the error code back into the address space so that an fsync by the writer application can pick that up. Fixes: 99d9d8d0 ("xfs: scrub inode block mappings") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
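A minimal sketch of the idea (the helper name here is hypothetical; the real change lives in the data fork scrubber): push any writeback error back into the address space rather than swallowing it.

	/* Sketch: flush data before scrubbing, but don't swallow the error. */
	static int xchk_flush_data(struct xfs_inode *ip)
	{
		struct address_space	*mapping = VFS_I(ip)->i_mapping;
		int			error;

		error = filemap_write_and_wait(mapping);
		if (error)
			/* Re-arm the error so a later fsync() still sees it. */
			mapping_set_error(mapping, error);

		return error;
	}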
-
Submitted by Brian Foster
The rmapbt extent swap algorithm remaps individual extents between the source inode and the target to trigger reverse mapping metadata updates. If either inode straddles a format or other bmap allocation boundary, the individual unmap and map cycles can trigger repeated bmap block allocations and frees as the extent count bounces back and forth across the boundary. While net block usage is bound across the swap operation, this behavior can prematurely exhaust the transaction block reservation because it continuously drains as the transaction rolls. Each allocation accounts against the reservation and each free returns to global free space on transaction roll. The previous workaround to this problem attempted to detect this boundary condition and provide surplus block reservation to accommodate it. This is insufficient because more remaps can occur than implied by the extent counts; if start offset boundaries are not aligned between the two inodes, for example. To address this problem more generically and dynamically, add a transaction accounting mode that returns freed blocks to the transaction reservation instead of the superblock counters on transaction roll, and use it when the rmapbt based algorithm is active. This allows the chain of remap transactions to preserve the block reservation based on its own frees and prevent premature exhaustion regardless of the remap pattern. Note that this is only safe for superblocks with lazy sb accounting, but the latter is required for v5 supers and the rmap feature depends on v5. Fixes: b3fed434 ("xfs: account format bouncing into rmapbt swapext tx reservation") Root-caused-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
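Conceptually, the new accounting mode (the upstream flag is XFS_TRANS_RES_FDBLKS) looks roughly like this inside the transaction's free-block accounting; a simplified sketch, not the actual diff:

	/* Sketch: free-block accounting on a transaction with the new mode. */
	case XFS_TRANS_SB_FDBLOCKS:
		if (delta > 0 && (tp->t_flags & XFS_TRANS_RES_FDBLKS)) {
			int64_t	res_delta;

			/*
			 * Give freed blocks back to this transaction's own
			 * block reservation instead of the global free-space
			 * counters, so a long chain of rolled transactions
			 * cannot drain its reservation.  Cap the top-up so
			 * t_blk_res cannot overflow.
			 */
			res_delta = min_t(int64_t, delta,
					  UINT_MAX - tp->t_blk_res);
			tp->t_blk_res += res_delta;
			delta -= res_delta;
		}
		break;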
-
Submitted by Keyur Patel
./xfs/libxfs/xfs_inode_buf.c:56: unnecssary ==> unnecessary ./xfs/libxfs/xfs_inode_buf.c:59: behavour ==> behaviour ./xfs/libxfs/xfs_inode_buf.c:206: unitialized ==> uninitialized Signed-off-by: Keyur Patel <iamkeyur96@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
- 05 July 2020, 1 commit
-
-
Submitted by Jens Axboe
When switching to TWA_SIGNAL for task_work notifications, we also made any signal-based condition in io_cqring_wait() return -ERESTARTSYS. This breaks applications that rely on using signals to abort someone waiting for events. Check if we have a signal pending because of queued task_work, and repeat the signal check once we've run the task_work. This provides a reliable way of telling the two apart. Additionally, only use TWA_SIGNAL if we are using an eventfd. If not, we don't have the dependency situation described in the original commit, and we can get by with just using TWA_RESUME like we previously did. Fixes: ce593a6c ("io_uring: use signal based task_work running") Cc: stable@vger.kernel.org # v5.7 Reported-by: Andres Freund <andres@anarazel.de> Tested-by: Andres Freund <andres@anarazel.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
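The notification choice described above, sketched against the task_work API of that era; the function and field names follow io_uring but are illustrative rather than the exact diff:

	/* Sketch: choose the task_work notification mode for a request. */
	static int io_req_task_work_add(struct io_kiocb *req, struct callback_head *cb)
	{
		struct io_ring_ctx *ctx = req->ctx;
		struct task_struct *tsk = req->task;
		int notify = TWA_RESUME;

		/*
		 * Only use signal-based notification when an eventfd is
		 * registered, since only then can the task be blocked waiting
		 * on an event that the queued task_work itself must produce.
		 * Plain TWA_RESUME avoids spurious -ERESTARTSYS returns from
		 * io_cqring_wait().
		 */
		if (ctx->cq_ev_fd)
			notify = TWA_SIGNAL;

		return task_work_add(tsk, cb, notify);
	}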
-
- 04 July 2020, 1 commit
-
-
Submitted by Matthew Wilcox (Oracle)
This error path returned directly instead of calling sysctl_head_finish(). Fixes: ef9d965b ("sysctl: reject gigantic reads/write to sysctl files") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
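The shape of the fix is the usual goto-unwind pattern; a hedged sketch of the error path (the surrounding context is approximate):

	/* Sketch: route the size check through the common cleanup path. */
	error = -ENOMEM;
	if (count > KMALLOC_MAX_SIZE)
		goto out;	/* previously: return -ENOMEM, skipping sysctl_head_finish() */

	/* ... normal read/write handling elided ... */
	out:
	sysctl_head_finish(head);
	return error;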
-
- 03 July 2020, 5 commits
-
-
Submitted by Bob Peterson
Before this patch, some gfs2 code locked the freeze glock with the LM_FLAG_NOEXP (Do not freeze) flag, and some did not. We never want to freeze the freeze glock, so this patch makes it consistently use LM_FLAG_NOEXP always. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
-
Submitted by Bob Peterson
Before this patch, the freeze code in gfs2 specified GL_NOCACHE in several places. That's wrong because we always want to know the state of whether the file system is frozen. There was also a problem with freeze/thaw transitioning the glock from frozen (EX) to thawed (SH) because gfs2 will normally grant glocks in EX to processes that request it in SH mode, unless GL_EXACT is specified. Therefore, the freeze/thaw code, which tried to reacquire the glock in SH mode, would get the glock in EX mode and miss the transition from EX to SH. That made it think the thaw had completed normally, but since the glock was still cached in EX, other nodes could not freeze again. This patch removes the GL_NOCACHE flag to allow the freeze glock to be cached. It also adds the GL_EXACT flag so the glock is fully transitioned from EX to SH, thereby allowing future freeze operations. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
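A sketch of how the freeze glock ends up being requested after this change (flags as described above; the exact call sites differ):

	/* Sketch: request the freeze glock cached (no GL_NOCACHE) and in
	 * exactly the shared state, so the EX -> SH transition really happens. */
	error = gfs2_glock_nq_init(sdp->sd_freeze_gl, LM_ST_SHARED,
				   LM_FLAG_NOEXP | GL_EXACT,
				   &sdp->sd_freeze_gh);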
-
Submitted by Bob Peterson
Before this patch, only read-write mounts would grab the freeze glock in read-only mode, as part of gfs2_make_fs_rw. So the freeze glock was never initialized. That meant requests to freeze, which request the glock in EX, were granted without any state transition. That meant you could mount a gfs2 file system, which is currently frozen on a different cluster node, in read-only mode. This patch makes read-only mounts lock the freeze glock in SH mode, which will block for file systems that are frozen on another node. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
-
Submitted by Bob Peterson
Before this patch, function freeze_go_sync, called when promoting the freeze glock, was testing for the SDF_JOURNAL_LIVE superblock flag. That's only set for read-write mounts. Read-only mounts don't use a journal, so the bit is never set, so the freeze never happened. This patch removes the check for SDF_JOURNAL_LIVE for freeze requests but still checks it when deciding whether to flush a journal. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
-
Submitted by Bob Peterson
In several places, we used the GIF_ORDERED inode flag to determine if an inode was on the ordered writes list. However, since we always held the sd_ordered_lock spin_lock during the manipulation, we can just as easily check list_empty(&ip->i_ordered) instead. This allows us to keep more than one ordered writes list to make journal writing improvements. This patch eliminates GIF_ORDERED in favor of checking list_empty. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
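A minimal sketch of the new check, assuming the existing sd_ordered_lock/i_ordered fields; the function name here is illustrative, not the real helper:

	/* Sketch: rely on list membership instead of a separate inode flag. */
	static void ordered_add_inode(struct gfs2_sbd *sdp, struct gfs2_inode *ip)
	{
		spin_lock(&sdp->sd_ordered_lock);
		if (list_empty(&ip->i_ordered))		/* replaces testing/setting GIF_ORDERED */
			list_add(&ip->i_ordered, &sdp->sd_log_ordered);
		spin_unlock(&sdp->sd_ordered_lock);
	}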
-
- 02 July 2020, 8 commits
-
-
Submitted by Ronnie Sahlberg
The wait_event_...() macros evaluate to a long, so we should not assign the result to an int, as this may truncate the value. Reported-by: Marshall Midden <marshallmidden@gmail.com> Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
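An illustration of the point, using cifs-style names for the waitqueue and condition (the exact call site differs):

	/*
	 * wait_event_*() macros evaluate to a long: 0 on timeout, the
	 * remaining jiffies when the condition became true, or -ERESTARTSYS
	 * for the interruptible variants.  Receive that in a long so a large
	 * remaining-jiffies value is not truncated.
	 */
	long rc;

	rc = wait_event_interruptible_timeout(server->response_q,
			midQ->mid_state != MID_REQUEST_SUBMITTED, timeout);
	if (rc < 0)
		return (int)rc;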
-
Submitted by Zhang Xiaoxu
When running xfstest generic/035, we found that the target file was deleted if the rename returned -EACCES. In cifs_rename2, we unlink the positive target dentry if the rename failed with -EACCES or -EEXIST, even if the target dentry was positive before the rename, so the existing file was deleted. We should only delete a target file that was created during the rename itself. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>
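Conceptually, the fix is to remember whether the target existed before attempting the server-side rename and only clean up a file the rename itself created; a hedged sketch with simplified, partly hypothetical context (attempt_rename and the surrounding variables are made up for illustration):

	/* Sketch (simplified): inside the cifs rename path. */
	bool target_existed = d_really_is_positive(target_dentry);

	rc = attempt_rename(xid, source_dir, source_dentry,
			    target_dir, target_dentry);	/* hypothetical helper */

	if ((rc == -EACCES || rc == -EEXIST) && !target_existed)
		/* Only unlink a target that this rename itself created. */
		cifs_unlink(target_dir, target_dentry);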
-
Submitted by Paul Aurich
The flag from the primary tcon needs to be copied into the volume info so that cifs_get_tcon will try to enable extensions on the per-user tcon. At that point, since posix extensions must have already been enabled on the superblock, don't try to needlessly adjust the mount flags. Fixes: ce558b0e ("smb3: Add posix create context for smb3.11 posix mounts") Fixes: b326614e ("smb3: allow "posix" mount option to enable new SMB311 protocol extensions") Signed-off-by: Paul Aurich <paul@darkrain42.org> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>
-
Submitted by Paul Aurich
Fixes: ca567eb2 ("SMB3: Allow persistent handle timeout to be configurable on mount") Signed-off-by: Paul Aurich <paul@darkrain42.org> CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>
-
Submitted by Paul Aurich
Fixes: 3e7a02d4 ("smb3: allow disabling requesting leases") Signed-off-by: Paul Aurich <paul@darkrain42.org> CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>
-
Submitted by Paul Aurich
Without this: - persistent handles will only be enabled for per-user tcons if the server advertises the 'Continuous Availability' capability; - resilient handles would never be enabled for per-user tcons. Signed-off-by: Paul Aurich <paul@darkrain42.org> CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>
-
Submitted by Paul Aurich
Ensure multiuser SMB3 mounts use encryption for all users' tcons if the mount options are configured to require encryption. Without this, only the primary tcon and IPC tcons are guaranteed to be encrypted. Per-user tcons would only be encrypted if the server was configured to require encryption. Signed-off-by: Paul Aurich <paul@darkrain42.org> CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>
-
Submitted by Paul Aurich
This is useful for distinguishing SMB sessions on a multiuser mount. Signed-off-by: Paul Aurich <paul@darkrain42.org> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>
-
- 01 July 2020, 1 commit
-
-
Submitted by Jens Axboe
Since 5.7, we've been using task_work to trigger async running of requests in the context of the original task. This generally works great, but there's a case where if the task is currently blocked in the kernel waiting on a condition to become true, it won't process task_work. Even though the task is woken, it just checks whatever condition it's waiting on, and goes back to sleep if it's still false. This is a problem if that very condition only becomes true when that task_work is run. An example of that is the task registering an eventfd with io_uring, and it's now blocked waiting on an eventfd read. That read could depend on a completion event, and that completion event won't get triggered until task_work has been run. Use the TWA_SIGNAL notification for task_work, so that we ensure that the task always runs the work when queued. Cc: stable@vger.kernel.org # v5.7 Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 30 June 2020, 6 commits
-
-
Submitted by Andreas Gruenbacher
In flush_delete_work, instead of flushing each individual pending delayed work item, cancel and re-queue them for immediate execution. The waiting isn't needed here because we're already waiting for all queued work items to complete in gfs2_flush_delete_work. This makes the code more efficient, but more importantly, it avoids sleeping during a rhashtable walk, inside rcu_read_lock(). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
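A minimal sketch of the reworked helper (the surrounding rhashtable walk is omitted): requeue a pending item with zero delay instead of flushing it.

	static void flush_delete_work(struct gfs2_glock *gl)
	{
		/* Requeue pending delete work for immediate execution rather
		 * than sleeping in flush_delayed_work() under rcu_read_lock(). */
		if (cancel_delayed_work(&gl->gl_delete))
			queue_delayed_work(gfs2_delete_workqueue, &gl->gl_delete, 0);
	}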
-
Submitted by Bob Peterson
Log flush operations (gfs2_log_flush()) can target a specific transaction. But if the function encounters errors (e.g. io errors) and withdraws, the transaction was only freed if it was queued to one of the ail lists. If the withdraw occurred before the transaction was queued to the ail1 list, function ail_drain never freed it. The result was: BUG gfs2_trans: Objects remaining in gfs2_trans on __kmem_cache_shutdown() This patch makes log_flush() add the targeted transaction to the ail1 list so that function ail_drain() will find and free it properly. Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
-
Submitted by Andreas Gruenbacher
Callers expect gfs2_inode_lookup to return an inode pointer or ERR_PTR(error). Commit b66648ad caused it to return NULL instead of ERR_PTR(-ESTALE) in some cases. Fix that. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: b66648ad ("gfs2: Move inode generation number check into gfs2_inode_lookup") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
-
Submitted by J. Bruce Fields
I don't understand this code well, but I'm seeing a warning about a still-referenced inode on unmount, and every other similar filesystem does a dput() here. Fixes: e8a79fb1 ("nfsd: add nfsd/clients directory") Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
Submitted by J. Bruce Fields
We don't drop the reference on the nfsdfs filesystem with mntput(nn->nfsd_mnt) until nfsd_exit_net(), but that won't be called until the nfsd module's unloaded, and we can't unload the module as long as there's a reference on nfsdfs. So this prevents module unloading. Fixes: 2c830dd7 ("nfsd: persist nfsd filesystem across mounts") Reported-and-Tested-by: Luo Xiaogang <lxgrxd@163.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
Submitted by Mel Gorman
This reverts commit e9c15bad ("fs: Do not check if there is a fsnotify watcher on pseudo inodes"). The commit intended to eliminate fsnotify-related overhead for pseudo inodes but it is broken in concept. inotify can receive events of pipe files under /proc/X/fd, and chromium relies on close and open events for sandboxing. Maxim Levitsky reported the following: Chromium starts as a white rectangle, shows a few white rectangles that resemble its notifications and then crashes. The stdout output from chromium: [mlevitsk@starship ~]$chromium-freeworld mesa: for the --simplifycfg-sink-common option: may only occur zero or one times! mesa: for the --global-isel-abort option: may only occur zero or one times! [3379:3379:0628/135151.440930:ERROR:browser_switcher_service.cc(238)] XXX Init() ../../sandbox/linux/seccomp-bpf-helpers/sigsys_handlers.cc:**CRASHING**:seccomp-bpf failure in syscall 0072 Received signal 11 SEGV_MAPERR 0000004a9048 Crashes are not universal but even if chromium does not crash, it certainly does not work properly. While filtering just modify and access might be safe, the benefit is not worth the risk, hence the revert. Reported-by: Maxim Levitsky <mlevitsk@redhat.com> Fixes: e9c15bad ("fs: Do not check if there is a fsnotify watcher on pseudo inodes") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 29 June 2020, 2 commits
-
-
Submitted by Sungjong Seo
generic_file_fsync(), which exfat used, could not guarantee the consistency of a file because it flushed only the dirty data pages, not the dirty metadata. Instead, use exfat_file_fsync() for files and directories so that both the metadata and the data pages of a file are committed. Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
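A sketch of what such an fsync implementation looks like (the block-device flush call reflects the API of that kernel era; treat the details as illustrative):

	int exfat_file_fsync(struct file *filp, loff_t start, loff_t end, int datasync)
	{
		struct inode *inode = filp->f_mapping->host;
		int err;

		/* Write back the data pages and the inode itself. */
		err = __generic_file_fsync(filp, start, end, datasync);
		if (err)
			return err;

		/* Also flush dirty metadata buffers (FAT, directory entries). */
		err = sync_blockdev(inode->i_sb->s_bdev);
		if (err)
			return err;

		return blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL);
	}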
-
Submitted by Namjae Jeon
Move setting VOL_DIRTY to just before exfat_remove_entries() so that VOL_DIRTY is not left set needlessly when the operation fails with -ENOTEMPTY. Fixes: 5f2aa075 ("exfat: add inode operations") Cc: stable@vger.kernel.org # v5.7 Reported-by: Tetsuhiro Kohada <kohada.t2@gmail.com> Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
-