- 23 2月, 2021 1 次提交
-
-
由 Andreas Gruenbacher 提交于
In the log, revokes are stored as a revoke descriptor (struct gfs2_log_descriptor), followed by zero or more additional revoke blocks (struct gfs2_meta_header). On filesystems with a blocksize of 4k, the revoke descriptor contains up to 503 revokes, and the metadata blocks contain up to 509 revokes each. We've so far been reserving space for revokes in transactions in block granularity, so a lot more space than necessary was being allocated and then released again. This patch switches to assigning revokes to transactions individually instead. Initially, space for the revoke descriptor is reserved and handed out to transactions. When more revokes than that are reserved, additional revoke blocks are added. When the log is flushed, the space for the additional revoke blocks is released, but we keep the space for the revoke descriptor block allocated. Transactions may still reserve more revokes than they will actually need in the end, but now we won't overshoot the target as much, and by only returning the space for excess revokes at log flush time, we further reduce the amount of contention between processes. Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
- 04 2月, 2021 3 次提交
-
-
由 Andreas Gruenbacher 提交于
Keep the current value of the updated log tail in the super block as sb_log_flush_tail instead of computing it on the fly. This avoids unnecessary sd_ail_lock taking and cleans up the code. Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
由 Andreas Gruenbacher 提交于
This counter and the associated wait queue are only used so that gfs2_make_fs_ro can efficiently wait for all pending log space allocations to fail after setting the filesystem to read-only. This comes at the cost of waking up that wait queue very frequently. Instead, when gfs2_log_reserve fails because the filesystem has become read-only, Wake up sd_log_waitq. In gfs2_make_fs_ro, set the file system read-only and then wait until all the log space has been released. Give up and report the problem after a while. With that, sd_reserving_log and sd_reserving_log_wait can be removed. Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
由 Andreas Gruenbacher 提交于
Replace the TR_ALLOCED flag by its inverse, TR_ONSTACK: that way, the flag only needs to be set in the exceptional case of on-stack transactions. Split off __gfs2_trans_begin from gfs2_trans_begin and use it to replace the open-coded version in gfs2_ail_empty_gl. Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
- 01 12月, 2020 1 次提交
-
-
由 Andreas Gruenbacher 提交于
Since commit a0e3cc65 ("gfs2: Turn gl_delete into a delayed work"), we're cancelling any pending delete work of an iopen glock before attaching a new inode to that glock in gfs2_create_inode. This means that delete_work_func can no longer be queued or running when attaching the iopen glock to the new inode, and we can revert commit a4923865 ("GFS2: Prevent delete work from occurring on glocks used for create"), which tried to achieve the same but in a racy way. Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
- 25 11月, 2020 1 次提交
-
-
由 Alexander Aring 提交于
This patch introduce a new globs attribute to define the subclass of the glock lockref spinlock. This avoid the following lockdep warning, which occurs when we lock an inode lock while an iopen lock is held: ============================================ WARNING: possible recursive locking detected 5.10.0-rc3+ #4990 Not tainted -------------------------------------------- kworker/0:1/12 is trying to acquire lock: ffff9067d45672d8 (&gl->gl_lockref.lock){+.+.}-{3:3}, at: lockref_get+0x9/0x20 but task is already holding lock: ffff9067da308588 (&gl->gl_lockref.lock){+.+.}-{3:3}, at: delete_work_func+0x164/0x260 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&gl->gl_lockref.lock); lock(&gl->gl_lockref.lock); *** DEADLOCK *** May be due to missing lock nesting notation 3 locks held by kworker/0:1/12: #0: ffff9067c1bfdd38 ((wq_completion)delete_workqueue){+.+.}-{0:0}, at: process_one_work+0x1b7/0x540 #1: ffffac594006be70 ((work_completion)(&(&gl->gl_delete)->work)){+.+.}-{0:0}, at: process_one_work+0x1b7/0x540 #2: ffff9067da308588 (&gl->gl_lockref.lock){+.+.}-{3:3}, at: delete_work_func+0x164/0x260 stack backtrace: CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.10.0-rc3+ #4990 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Workqueue: delete_workqueue delete_work_func Call Trace: dump_stack+0x8b/0xb0 __lock_acquire.cold+0x19e/0x2e3 lock_acquire+0x150/0x410 ? lockref_get+0x9/0x20 _raw_spin_lock+0x27/0x40 ? lockref_get+0x9/0x20 lockref_get+0x9/0x20 delete_work_func+0x188/0x260 process_one_work+0x237/0x540 worker_thread+0x4d/0x3b0 ? process_one_work+0x540/0x540 kthread+0x127/0x140 ? __kthread_bind_mask+0x60/0x60 ret_from_fork+0x22/0x30 Suggested-by: NAndreas Gruenbacher <agruenba@redhat.com> Signed-off-by: NAlexander Aring <aahringo@redhat.com> Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
- 23 10月, 2020 1 次提交
-
-
由 Abhi Das 提交于
We need to lookup the master statfs inode and the local statfs inodes earlier in the mount process (in init_journal) so journal recovery can use them when it attempts to recover the statfs info. We lookup all the local statfs inodes and store them in a linked list to allow a node to recover statfs info for other nodes in the cluster. Signed-off-by: NAbhi Das <adas@redhat.com> Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
- 21 10月, 2020 2 次提交
-
-
由 Abhi Das 提交于
And read these in __get_log_header() from the log header. Also make gfs2_statfs_change_out() non-static so it can be used outside of super.c Signed-off-by: NAbhi Das <adas@redhat.com> Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
由 Bob Peterson 提交于
The gfs2_glock structure has a gl_vm member, introduced in commit 7005c3e4 ("GFS2: Use range based functions for rgrp sync/invalidation"), which stores the location of resource groups within their address space. This structure is in a union with iopen glock specific fields. It was introduced because at unmount time, the resource group objects were destroyed before flushing out any pending resource group glock work, and flushing out such work could require flushing / truncating the address space. Since commit b3422cac ("gfs2: Rework how rgrp buffer_heads are managed"), any pending resource group glock work is flushed out before destroying the resource group objects. So the resource group objects will now always exist in rgrp_go_sync and rgrp_go_inval, and we now simply compute the gl_vm values where needed instead of caching them. This also eliminates the union. Signed-off-by: NBob Peterson <rpeterso@redhat.com> Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
- 15 10月, 2020 2 次提交
-
-
由 Bob Peterson 提交于
Before this patch, glock.c maintained a flag, GLF_QUEUED, which indicated when a glock had a holder queued. It was only checked for inode glocks, although set and cleared by all glocks, and it was only used to determine whether the glock should be held for the minimum hold time before releasing. The problem is that the flag is not accurate at all. If a process holds the glock, the flag is set. When they dequeue the glock, it only cleared the flag in cases when the state actually changed. So if the state doesn't change, the flag may still be set, even when nothing is queued. This happens to iopen glocks often: the get held in SH, then the file is closed, but the glock remains in SH mode. We don't need a special flag to indicate this: we can simply tell whether the glock has any items queued to the holders queue. It's a waste of cpu time to maintain it. This patch eliminates the flag in favor of simply checking list_empty on the glock holders. Signed-off-by: NBob Peterson <rpeterso@redhat.com> Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
由 Jamie Iles 提交于
syzkaller found the following splat with CONFIG_DEBUG_KOBJECT_RELEASE=y: Read of size 1 at addr ffff000028e896b8 by task kworker/1:2/228 CPU: 1 PID: 228 Comm: kworker/1:2 Tainted: G S 5.9.0-rc8+ #101 Hardware name: linux,dummy-virt (DT) Workqueue: events kobject_delayed_cleanup Call trace: dump_backtrace+0x0/0x4d8 show_stack+0x34/0x48 dump_stack+0x174/0x1f8 print_address_description.constprop.0+0x5c/0x550 kasan_report+0x13c/0x1c0 __asan_report_load1_noabort+0x34/0x60 memcmp+0xd0/0xd8 gfs2_uevent+0xc4/0x188 kobject_uevent_env+0x54c/0x1240 kobject_uevent+0x2c/0x40 __kobject_del+0x190/0x1d8 kobject_delayed_cleanup+0x2bc/0x3b8 process_one_work+0x96c/0x18c0 worker_thread+0x3f0/0xc30 kthread+0x390/0x498 ret_from_fork+0x10/0x18 Allocated by task 1110: kasan_save_stack+0x28/0x58 __kasan_kmalloc.isra.0+0xc8/0xe8 kasan_kmalloc+0x10/0x20 kmem_cache_alloc_trace+0x1d8/0x2f0 alloc_super+0x64/0x8c0 sget_fc+0x110/0x620 get_tree_bdev+0x190/0x648 gfs2_get_tree+0x50/0x228 vfs_get_tree+0x84/0x2e8 path_mount+0x1134/0x1da8 do_mount+0x124/0x138 __arm64_sys_mount+0x164/0x238 el0_svc_common.constprop.0+0x15c/0x598 do_el0_svc+0x60/0x150 el0_svc+0x34/0xb0 el0_sync_handler+0xc8/0x5b4 el0_sync+0x15c/0x180 Freed by task 228: kasan_save_stack+0x28/0x58 kasan_set_track+0x28/0x40 kasan_set_free_info+0x24/0x48 __kasan_slab_free+0x118/0x190 kasan_slab_free+0x14/0x20 slab_free_freelist_hook+0x6c/0x210 kfree+0x13c/0x460 Use the same pattern as f2fs + ext4 where the kobject destruction must complete before allowing the FS itself to be freed. This means that we need an explicit free_sbd in the callers. Cc: Bob Peterson <rpeterso@redhat.com> Cc: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: NJamie Iles <jamie@nuviainc.com> [Also go to fail_free when init_names fails.] Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
- 03 7月, 2020 1 次提交
-
-
由 Bob Peterson 提交于
In several places, we used the GIF_ORDERED inode flag to determine if an inode was on the ordered writes list. However, since we always held the sd_ordered_lock spin_lock during the manipulation, we can just as easily check list_empty(&ip->i_ordered) instead. This allows us to keep more than one ordered writes list to make journal writing improvements. This patch eliminates GIF_ORDERED in favor of checking list_empty. Signed-off-by: NBob Peterson <rpeterso@redhat.com>
-
- 06 6月, 2020 3 次提交
-
-
由 Andreas Gruenbacher 提交于
In delete_work_func, if the iopen glock still has an inode attached, limit the inode lookup to that specific generation number: in the likely case that the inode was deleted on the node on which the inode's link count dropped to zero, we can skip verifying the on-disk block type and reading in the inode. The same applies if another node that had the inode open managed to delete the inode before us. Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
由 Andreas Gruenbacher 提交于
When there's contention on the iopen glock, it means that the link count of the corresponding inode has dropped to zero on a remote node which is now trying to delete the inode. In that case, try to evict the inode so that the iopen glock will be released, which will allow the remote node to do its job. When the inode is still open locally, the inode's reference count won't drop to zero and so we'll keep holding the inode and its iopen glock. The remote node will time out its request to grab the iopen glock, and when the inode is finally closed locally, we'll try to delete it ourself. Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
由 Andreas Gruenbacher 提交于
This requires flushing delayed work items in gfs2_make_fs_ro (which is called before unmounting a filesystem). When inodes are deleted and then recreated, pending gl_delete work items would have no effect because the inode generations will have changed, so we can cancel any pending gl_delete works before reusing iopen glocks. Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
- 28 3月, 2020 1 次提交
-
-
由 Bob Peterson 提交于
Before this patch, multiple users called gfs2_qa_alloc which allocated a qadata structure to the inode, if quotas are turned on. Later, in file close or evict, the structure was deleted with gfs2_qa_delete. But there can be several competing processes who need access to the structure. There were races between file close (release) and the others. Thus, a release could delete the structure out from under a process that relied upon its existence. For example, chown. This patch changes the management of the qadata structures to be a get/put scheme. Function gfs2_qa_alloc has been changed to gfs2_qa_get and if the structure is allocated, the count essentially starts out at 1. Function gfs2_qa_delete has been renamed to gfs2_qa_put, and the last guy to decrement the count to 0 frees the memory. Signed-off-by: NBob Peterson <rpeterso@redhat.com>
-
- 27 2月, 2020 2 次提交
-
-
由 Bob Peterson 提交于
Before this patch, function do_xmote would try to sync out the glock dirty data by calling the appropriate glops function XXX_go_sync() but it did not check for a good return code. If the sync was not possible due to an io error or whatever, do_xmote would continue on and call go_inval and release the glock to other cluster nodes. When those nodes go to replay the journal, they may already be holding glocks for the journal records that should have been synced, but were not due to the ignored error. This patch introduces proper error code checking to the go_sync family of glops functions. Signed-off-by: NBob Peterson <rpeterso@redhat.com> Reviewed-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
由 Bob Peterson 提交于
When a node withdraws from a file system, it often leaves its journal in an incomplete state. This is especially true when the withdraw is caused by io errors writing to the journal. Before this patch, a withdraw would try to write a "shutdown" record to the journal, tell dlm it's done with the file system, and none of the other nodes know about the problem. Later, when the problem is fixed and the withdrawn node is rebooted, it would then discover that its own journal was incomplete, and replay it. However, replaying it at this point is almost guaranteed to introduce corruption because the other nodes are likely to have used affected resource groups that appeared in the journal since the time of the withdraw. Replaying the journal later will overwrite any changes made, and not through any fault of dlm, which was instructed during the withdraw to release those resources. This patch makes file system withdraws seen by the entire cluster. Withdrawing nodes dequeue their journal glock to allow recovery. The remaining nodes check all the journals to see if they are clean or in need of replay. They try to replay dirty journals, but only the journals of withdrawn nodes will be "not busy" and therefore available for replay. Until the journal replay is complete, no i/o related glocks may be given out, to ensure that the replay does not cause the aforementioned corruption: We cannot allow any journal replay to overwrite blocks associated with a glock once it is held. The "live" glock which is now used to signal when a withdraw occurs. When a withdraw occurs, the node signals its withdraw by dequeueing the "live" glock and trying to enqueue it in EX mode, thus forcing the other nodes to all see a demote request, by way of a "1CB" (one callback) try lock. The "live" glock is not granted in EX; the callback is only just used to indicate a withdraw has occurred. Note that all nodes in the cluster must wait for the recovering node to finish replaying the withdrawing node's journal before continuing. To this end, it checks that the journals are clean multiple times in a retry loop. Also note that the withdraw function may be called from a wide variety of situations, and therefore, we need to take extra precautions to make sure pointers are valid before using them in many circumstances. We also need to take care when glocks decide to withdraw, since the withdraw code now uses glocks. Also, before this patch, if a process encountered an error and decided to withdraw, if another process was already withdrawing, the second withdraw would be silently ignored, which set it free to unlock its glocks. That's correct behavior if the original withdrawer encounters further errors down the road. But if secondary waiters don't wait for the journal replay, unlocking glocks will allow other nodes to use them, despite the fact that the journal containing those blocks is being replayed. The replay needs to finish before our glocks are released to other nodes. IOW, secondary withdraws need to wait for the first withdraw to finish. For example, if an rgrp glock is unlocked by a process that didn't wait for the first withdraw, a journal replay could introduce file system corruption by replaying a rgrp block that has already been granted to a different cluster node. Signed-off-by: NBob Peterson <rpeterso@redhat.com>
-
- 21 2月, 2020 1 次提交
-
-
由 Bob Peterson 提交于
We need to allow some glocks to be enqueued, dequeued, promoted, and demoted when we're withdrawn. For example, to maintain metadata integrity, we should disallow the use of inode and rgrp glocks when withdrawn. Other glocks, like iopen or the transaction glocks may be safely used because none of their metadata goes through the journal. So in general, we should disallow all glocks with an address space, and allow all the others. One exception is: we need to allow our active journal to be demoted so others may recover it. Allowing glocks after withdraw gives us the ability to take appropriate action (in a following patch) to have our journal properly replayed by another node rather than just abandoning the current transactions and pretending nothing bad happened, leaving the other nodes free to modify the blocks we had in our journal, which may result in file system corruption. Signed-off-by: NBob Peterson <rpeterso@redhat.com>
-
- 10 2月, 2020 3 次提交
-
-
由 Bob Peterson 提交于
Before this patch, gfs2 kept track of journal io errors in two places sd_log_error and the SDF_AIL1_IO_ERROR flag in sd_flags. This patch consolidates the two into sd_log_error so that it reflects the first error encountered writing to the journal. In future patches, we will take advantage of this by checking this value rather than having to check both when reacting to io errors. In addition, this fixes a tight loop in unmount: If buffers get on the ail1 list and an io error occurs elsewhere, the ail1 list would never be cleared because they were always busy. So unmount would hang, waiting for the ail1 list to empty. Signed-off-by: NBob Peterson <rpeterso@redhat.com> Reviewed-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
由 Bob Peterson 提交于
Before this patch, the rgrp code had a serious problem related to how it managed buffer_heads for resource groups. The problem caused file system corruption, especially in cases of journal replay. When an rgrp glock was demoted to transfer ownership to a different cluster node, do_xmote() first calls rgrp_go_sync and then rgrp_go_inval, as expected. When it calls rgrp_go_sync, that called gfs2_rgrp_brelse() that dropped the buffer_head reference count. In most cases, the reference count went to zero, which is right. However, there were other places where the buffers are handled differently. After rgrp_go_sync, do_xmote called rgrp_go_inval which called gfs2_rgrp_brelse a second time, then rgrp_go_inval's call to truncate_inode_pages_range would get rid of the pages in memory, but only if the reference count drops to 0. Unfortunately, gfs2_rgrp_brelse was setting bi->bi_bh = NULL. So when rgrp_go_sync called gfs2_rgrp_brelse, it lost the pointer to the buffer_heads in cases where the reference count was still 1. Therefore, when rgrp_go_inval called gfs2_rgrp_brelse a second time, it failed the check for "if (bi->bi_bh)" and thus failed to call brelse a second time. Because of that, the reference count on those buffers sometimes failed to drop from 1 to 0. And that caused function truncate_inode_pages_range to keep the pages in page cache rather than freeing them. The next time the rgrp glock was acquired, the metadata read of the rgrp buffers re-used the pages in memory, which were now wrong because they were likely modified by the other node who acquired the glock in EX (which is why we demoted the glock). This re-use of the page cache caused corruption because changes made by the other nodes were never seen, so the bitmaps were inaccurate. For some reason, the problem became most apparent when journal replay forced the replay of rgrps in memory, which caused newer rgrp data to be overwritten by the older in-core pages. A big part of the problem was that the rgrp buffer were released in multiple places: The go_unlock function would release them when the glock was released rather than when the glock is demoted, which is clearly wrong because our intent was to cache them until the glock is demoted from SH or EX. This patch attempts to clean up the mess and make one consistent and centralized mechanism for managing the rgrp buffer_heads by implementing several changes: 1. It eliminates the call to gfs2_rgrp_brelse() from rgrp_go_sync. We don't want to release the buffers or zero the pointers when syncing for the reasons stated above. It only makes sense to release them when the glock is actually invalidated (go_inval). And when we do, then we set the bh pointers to NULL. 2. The go_unlock function (which was only used for rgrps) is eliminated, as we've talked about doing many times before. The go_unlock function was called too early in the glock dq process, and should not happen until the glock is invalidated. 3. It also eliminates the call to rgrp_brelse in gfs2_clear_rgrpd. That will now happen automatically when the rgrp glocks are demoted, and shouldn't happen any sooner or later than that. Instead, function gfs2_clear_rgrpd has been modified to demote the rgrp glocks, and therefore, free those pages, before the remaining glocks are culled by gfs2_gl_hash_clear. This prevents the gl_object from hanging around when the glocks are culled. Signed-off-by: NBob Peterson <rpeterso@redhat.com> Reviewed-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
由 Bob Peterson 提交于
File system withdraws can be delayed when inconsistencies are discovered when we cannot withdraw immediately, for example, when critical spin_locks are held. But delaying the withdraw can cause gfs2 to ignore the error and keep running for a short period of time. For example, an rgrp glock may be dequeued and demoted while there are still buffers that haven't been properly revoked, due to io errors writing to the journal. This patch introduces a new concept of a pending withdraw, which means an inconsistency has been discovered and we need to withdraw at the earliest possible opportunity. In these cases, we aren't quite withdrawn yet, but we still need to not dequeue glocks and other critical things. If we dequeue the glocks and the withdraw results in our journal being replayed, the replay could overwrite data that's been modified by a different node that acquired the glock in the meantime. Signed-off-by: NBob Peterson <rpeterso@redhat.com> Reviewed-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
- 28 1月, 2020 1 次提交
-
-
由 Bob Peterson 提交于
This reverts commit e955537e. Before patch e955537e, tr_num_revoke tracked the number of revokes added to the transaction, and tr_num_revoke_rm tracked how many revokes were removed. But since revokes are queued off the sdp (superblock) pointer, some transactions could remove more revokes than they added. (e.g. revokes added by a different process). Commit e955537e eliminated transaction variable tr_num_revoke_rm, but in order to do so, it changed the accounting to always use tr_num_revoke for its math. Since you can remove more revokes than you add, tr_num_revoke could now become a negative value. This negative value broke the assert in function gfs2_trans_end: if (gfs2_assert_withdraw(sdp, (nbuf <=3D tr->tr_blocks) && (tr->tr_num_revoke <=3D tr->tr_revokes))) One way to fix this is to simply remove the tr_num_revoke clause from the assert and allow the value to become negative. Andreas didn't like that idea, so instead, we decided to revert e955537e. Signed-off-by: NBob Peterson <rpeterso@redhat.com> Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
- 20 1月, 2020 2 次提交
-
-
由 Andreas Gruenbacher 提交于
The dlm lockspace is set up to have lock value blocks of GDLM_LVB_SIZE bytes, and dlm is the only lock manager we support, so there is no point in claiming that the lock value block could have any other size. Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
由 Andreas Gruenbacher 提交于
Rename sd_log_commited_revoke to sd_log_committed_revoke. Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
- 08 1月, 2020 1 次提交
-
-
由 Bob Peterson 提交于
Every caller of function gfs2_struct2blk specified sizeof(u64). This patch eliminates the unnecessary parameter and replaces the size calculation with a new superblock variable that is computed to be the maximum number of block pointers we can fit inside a log descriptor, as is done for pointers per dinode and indirect block. Signed-off-by: NBob Peterson <rpeterso@redhat.com> Reviewed-by: NAndrew Price <anprice@redhat.com> Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
- 19 9月, 2019 1 次提交
-
-
由 Andrew Price 提交于
Convert gfs2 and gfs2meta to fs_context. Removes the duplicated vfs code from gfs2_mount and instead uses the new vfs_get_block_super() before switching the ->root to the appropriate dentry. The mount option parsing has been converted to the new API and error reporting for invalid options has been made more precise at the same time. All of the mount/remount code has been moved into ops_fstype.c Signed-off-by: NAndrew Price <anprice@redhat.com> Signed-off-by: NDavid Howells <dhowells@redhat.com> cc: cluster-devel@redhat.com Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 05 9月, 2019 1 次提交
-
-
由 Bob Peterson 提交于
Because s_vfs_rename_mutex is not cluster-wide, multiple nodes can reverse the roles of which directories are "old" and which are "new" for the purposes of rename. This can cause deadlocks where two nodes end up waiting for each other. There can be several layers of directory dependencies across many nodes. This patch fixes the problem by acquiring all gfs2_rename's inode glocks asychronously and waiting for all glocks to be acquired. That way all inodes are locked regardless of the order. The timeout value for multiple asynchronous glocks is calculated to be the total of the individual wait times for each glock times two. Since gfs2_exchange is very similar to gfs2_rename, both functions are patched in the same way. A new async glock wait queue, sd_async_glock_wait, keeps a list of waiters for these events. If gfs2's holder_wake function detects an async holder, it wakes up any waiters for the event. The waiter only tests whether any of its requests are still pending. Since the glocks are sent to dlm asychronously, the wait function needs to check to see which glocks, if any, were granted. If a glock is granted by dlm (and therefore held), its minimum hold time is checked and adjusted as necessary, as other glock grants do. If the event times out, all glocks held thus far must be dequeued to resolve any existing deadlocks. Then, if there are any outstanding locking requests, we need to loop around and wait for dlm to respond to those requests too. After we release all requests, we return -ESTALE to the caller (vfs rename) which loops around and retries the request. Node1 Node2 --------- --------- 1. Enqueue A Enqueue B 2. Enqueue B Enqueue A 3. A granted 6. B granted 7. Wait for B 8. Wait for A 9. A times out (since Node 1 holds A) 10. Dequeue B (since it was granted) 11. Wait for all requests from DLM 12. B Granted (since Node2 released it in step 10) 13. Rename 14. Dequeue A 15. DLM Grants A 16. Dequeue A (due to the timeout and since we no longer have B held for our task). 17. Dequeue B 18. Return -ESTALE to vfs 19. VFS retries the operation, goto step 1. This release-all-locks / acquire-all-locks may slow rename / exchange down as both nodes struggle in the same way and do the same thing. However, this will only happen when there is contention for the same inodes, which ought to be rare. Signed-off-by: NBob Peterson <rpeterso@redhat.com> Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
- 28 6月, 2019 3 次提交
-
-
由 Bob Peterson 提交于
Before this patch, if a glock error was encountered, the glock with the problem was dumped. But sometimes you may have lots of file systems mounted, and that doesn't tell you which file system it was for. This patch adds a new boolean parameter fsid to the dump_glock family of functions. For non-error cases, such as dumping the glocks debugfs file, the fsid is not dumped in order to keep lock dumps and glocktop as clean as possible. For all error cases, such as GLOCK_BUG_ON, the file system id is now printed. This will make it easier to debug. Signed-off-by: NBob Peterson <rpeterso@redhat.com> Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
由 Bob Peterson 提交于
Before this patch, the superblock flag indicating when a file system is withdrawn was called SDF_SHUTDOWN. This patch simply renames it to the more obvious SDF_WITHDRAWN. Signed-off-by: NBob Peterson <rpeterso@redhat.com> Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
由 Bob Peterson 提交于
For its journal processing, gfs2 kept track of the number of buffers added and removed on a per-transaction basis. These values are used to calculate space needed in the journal. But while these calculations make sense for the number of buffers, they make no sense for revokes. Revokes are managed in their own list, linked from the superblock. So it's entirely unnecessary to keep separate per-transaction counts for revokes added and removed. A single count will do the same job. Therefore, this patch combines the transaction revokes into a single count. Signed-off-by: NBob Peterson <rpeterso@redhat.com> Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
- 06 6月, 2019 1 次提交
-
-
由 Bob Peterson 提交于
Commit 73118ca8 introduced a glock reference counting bug in gfs2_trans_remove_revoke. Given that, replacing gl_revokes with a GLF flag is no longer useful, so revert that commit. Signed-off-by: NBob Peterson <rpeterso@redhat.com> Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
- 05 6月, 2019 1 次提交
-
-
由 Thomas Gleixner 提交于
Based on 1 normalized pattern(s): this copyrighted material is made available to anyone wishing to use modify copy or redistribute it subject to the terms and conditions of the gnu general public license version 2 extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 44 file(s). Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Reviewed-by: NAllison Randal <allison@lohutok.net> Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190531081038.653000175@linutronix.deSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 08 5月, 2019 4 次提交
-
-
由 Abhi Das 提交于
As part of the freeze operation, gfs2_freeze_func() is left blocking on a request to hold the sd_freeze_gl in SH. This glock is held in EX by the gfs2_freeze() code. A subsequent call to gfs2_unfreeze() releases the EXclusively held sd_freeze_gl, which allows gfs2_freeze_func() to acquire it in SH and resume its operation. gfs2_unfreeze(), however, doesn't wait for gfs2_freeze_func() to complete. If a umount is issued right after unfreeze, it could result in an inconsistent filesystem because some journal data (statfs update) isn't written out. Refer to commit 24972557 for a more detailed explanation of how freeze/unfreeze work. This patch causes gfs2_unfreeze() to wait for gfs2_freeze_func() to complete before returning to the user. Signed-off-by: NAbhi Das <adas@redhat.com> Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
由 Andreas Gruenbacher 提交于
Rename sd_log_le_revoke to sd_log_revokes and sd_log_le_ordered to sd_log_ordered: not sure what le stands for here, but it doesn't add clarity, and if it stands for list entry, it's actually confusing as those are both list heads but not list entries. Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
由 Bob Peterson 提交于
The gl_revokes value determines how many outstanding revokes a glock has on the superblock revokes list; this is used to avoid unnecessary log flushes. However, gl_revokes is only ever tested for being zero, and it's only decremented in revoke_lo_after_commit, which removes all revokes from the list, so we know that the gl_revoke values of all the glocks on the list will reach zero. Therefore, we can replace gl_revokes with a bit flag. This saves an atomic counter in struct gfs2_glock. Signed-off-by: NBob Peterson <rpeterso@redhat.com> Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
由 Bob Peterson 提交于
This patch fixes regressions in 588bff95. Due to that patch, function clean_journal was setting the value of sd_log_flush_head, but that's only valid if it is replaying the node's own journal. If it's replaying another node's journal, that's completely wrong and will lead to multiple problems. This patch tries to clean up the mess by passing the value of the logical journal block number into gfs2_write_log_header so the function can treat non-owned journals generically. For the local journal, the journal extent map is used for best performance. For other nodes from other journals, new function gfs2_lblk_to_dblk is called to figure it out using gfs2_iomap_get. This patch also tries to establish more consistency when passing journal block parameters by changing several unsigned int types to a consistent u32. Fixes: 588bff95 ("GFS2: Reduce code redundancy writing log headers") Signed-off-by: NBob Peterson <rpeterso@redhat.com> Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
- 23 1月, 2019 1 次提交
-
-
由 Greg Kroah-Hartman 提交于
When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. There is no need to save the dentries for the debugfs files, so drop those variables to save a bit of space and make the code simpler. Cc: Bob Peterson <rpeterso@redhat.com> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: cluster-devel@redhat.com Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
- 12 12月, 2018 2 次提交
-
-
由 Bob Peterson 提交于
This patch is based on an idea from Steve Whitehouse. The idea is to dump the number of pages for inodes in the glock dumps. The additional locking required me to drop const from quite a few places. Signed-off-by: NBob Peterson <rpeterso@redhat.com> Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
-
由 Bob Peterson 提交于
Field bd_ops was set but never used, so I removed it, and all code supporting it. Signed-off-by: NBob Peterson <rpeterso@redhat.com> Acked-by: NSteven Whitehouse <swhiteho@redhat.com> Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
-