- 26 4月, 2023 9 次提交
-
-
由 Dave Chinner 提交于
mainline inclusion from mainline-v5.18-rc2 commit dc04db2a category: bugfix bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=dc04db2aa7c9307e740d6d0e173085301c173b1a -------------------------------- To catch the obvious graph cycle problem and hence potential endless looping. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDave Chinner <david@fromorbit.com> Signed-off-by: NGuo Xuenan <guoxuenan@huawei.com> Reviewed-by: NYang Erkun <yangerkun@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Dave Chinner 提交于
mainline inclusion from mainline-v5.14-rc4 commit 04fcad80 category: bugfix bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=04fcad80cd068731a779fb442f78234732683755 -------------------------------- Introduce a helper function xfs_buf_daddr() to extract the disk address of the buffer from the struct xfs_buf. This will replace direct accesses to bp->b_bn and bp->b_maps[0].bm_bn, as well as the XFS_BUF_ADDR() macro. This patch introduces the helper function and replaces all uses of XFS_BUF_ADDR() as this is just a simple sed replacement. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NGuo Xuenan <guoxuenan@huawei.com> Reviewed-by: NYang Erkun <yangerkun@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Darrick J. Wong 提交于
mainline inclusion from mainline-v5.10-rc5 commit 3945ae03 category: bugfix bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3945ae03d822aa47584dd502ac024ae1e1eb9e2d -------------------------------- A couple of the superblock validation checks apply only to the kernel, so move them to xfs_fc_fill_super before we add the needsrepair "feature", which will prevent the kernel (but not xfsprogs) from mounting the filesystem. This also reduces the diff between kernel and userspace libxfs. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NEric Sandeen <sandeen@redhat.com> Signed-off-by: NGuo Xuenan <guoxuenan@huawei.com> Reviewed-by: NYang Erkun <yangerkun@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Dave Chinner 提交于
mainline inclusion from mainline-v5.19-rc2 commit 7cf2b0f9 category: bugfix bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7cf2b0f9611b9971d663e1fc3206eeda3b902922 -------------------------------- Currently inodegc work can sit queued on the per-cpu queue until the workqueue is either flushed of the queue reaches a depth that triggers work queuing (and later throttling). This means that we could queue work that waits for a long time for some other event to trigger flushing. Hence instead of just queueing work at a specific depth, use a delayed work that queues the work at a bound time. We can still schedule the work immediately at a given depth, but we no long need to worry about leaving a number of items on the list that won't get processed until external events prevail. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NGuo Xuenan <guoxuenan@huawei.com> Reviewed-by: NYang Erkun <yangerkun@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Dave Chinner 提交于
mainline inclusion from mainline-v5.19-rc2 commit 5e672cd6 category: bugfix bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5e672cd69f0a534a445df4372141fd0d1d00901d -------------------------------- The current blocking mechanism for pushing the inodegc queue out to disk can result in systems becoming unusable when there is a long running inodegc operation. This is because the statfs() implementation currently issues a blocking flush of the inodegc queue and a significant number of common system utilities will call statfs() to discover something about the underlying filesystem. This can result in userspace operations getting stuck on inodegc progress, and when trying to remove a heavily reflinked file on slow storage with a full journal, this can result in delays measuring in hours. Avoid this problem by adding "push" function that expedites the flushing of the inodegc queue, but doesn't wait for it to complete. Convert xfs_fs_statfs() and xfs_qm_scall_getquota() to use this mechanism so they don't block but still ensure that queued operations are expedited. Fixes: ab23a776 ("xfs: per-cpu deferred inode inactivation queues") Reported-by: NChris Dunlop <chris@onthe.net.au> Signed-off-by: NDave Chinner <dchinner@redhat.com> [djwong: fix _getquota_next to use _inodegc_push too] Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NGuo Xuenan <guoxuenan@huawei.com> Reviewed-by: NYang Erkun <yangerkun@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Brian Foster 提交于
mainline inclusion from mainline-v5.16-rc5 commit 6191cf3a category: bugfix bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6191cf3ad59fda5901160633fef8e41b064a5246 -------------------------------- The xfs_inodegc_stop() helper performs a high level flush of pending work on the percpu queues and then runs a cancel_work_sync() on each of the percpu work tasks to ensure all work has completed before returning. While cancel_work_sync() waits for wq tasks to complete, it does not guarantee work tasks have started. This means that the _stop() helper can queue and instantly cancel a wq task without having completed the associated work. This can be observed by tracepoint inspection of a simple "rm -f <file>; fsfreeze -f <mnt>" test: xfs_destroy_inode: ... ino 0x83 ... xfs_inode_set_need_inactive: ... ino 0x83 ... xfs_inodegc_stop: ... ... xfs_inodegc_start: ... xfs_inodegc_worker: ... xfs_inode_inactivating: ... ino 0x83 ... The first few lines show that the inode is removed and need inactive state set, but the inactivation work has not completed before the inodegc mechanism stops. The inactivation doesn't actually occur until the fs is unfrozen and the gc mechanism starts back up. Note that this test requires fsfreeze to reproduce because xfs_freeze indirectly invokes xfs_fs_statfs(), which calls xfs_inodegc_flush(). When this occurs, the workqueue try_to_grab_pending() logic first tries to steal the pending bit, which does not succeed because the bit has been set by queue_work_on(). Subsequently, it checks for association of a pool workqueue from the work item under the pool lock. This association is set at the point a work item is queued and cleared when dequeued for processing. If the association exists, the work item is removed from the queue and cancel_work_sync() returns true. If the pwq association is cleared, the remove attempt assumes the task is busy and retries (eventually returning false to the caller after waiting for the work task to complete). To avoid this race, we can flush each work item explicitly before cancel. However, since the _queue_all() already schedules each underlying work item, the workqueue level helpers are sufficient to achieve the same ordering effect. E.g., the inodegc enabled flag prevents scheduling any further work in the _stop() case. Use the drain_workqueue() helper in this particular case to make the intent a bit more self explanatory. Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NGuo Xuenan <guoxuenan@huawei.com> Reviewed-by: NYang Erkun <yangerkun@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Dave Chinner 提交于
mainline inclusion from mainline-v5.17-rc6 commit 919edbad category: bugfix bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=919edbadebe17a67193533f531c2920c03e40fa4 -------------------------------- Jan Kara reported a performance regression in dbench that he bisected down to commit bad77c37 ("xfs: CIL checkpoint flushes caches unconditionally"). Whilst developing the journal flush/fua optimisations this cache was part of, it appeared to made a significant difference to performance. However, now that this patchset has settled and all the correctness issues fixed, there does not appear to be any significant performance benefit to asynchronous cache flushes. In fact, the opposite is true on some storage types and workloads, where additional cache flushes that can occur from fsync heavy workloads have measurable and significant impact on overall throughput. Local dbench testing shows little difference on dbench runs with sync vs async cache flushes on either fast or slow SSD storage, and no difference in streaming concurrent async transaction workloads like fs-mark. Fast NVME storage. From `dbench -t 30`, CIL scale: clients async sync BW Latency BW Latency 1 935.18 0.855 915.64 0.903 8 2404.51 6.873 2341.77 6.511 16 3003.42 6.460 2931.57 6.529 32 3697.23 7.939 3596.28 7.894 128 7237.43 15.495 7217.74 11.588 512 5079.24 90.587 5167.08 95.822 fsmark, 32 threads, create w/ 64 byte xattr w/32k logbsize create chown unlink async 1m41s 1m16s 2m03s sync 1m40s 1m19s 1m54s Slower SATA SSD storage: From `dbench -t 30`, CIL scale: clients async sync BW Latency BW Latency 1 78.59 15.792 83.78 10.729 8 367.88 92.067 404.63 59.943 16 564.51 72.524 602.71 76.089 32 831.66 105.984 870.26 110.482 128 1659.76 102.969 1624.73 91.356 512 2135.91 223.054 2603.07 161.160 fsmark, 16 threads, create w/32k logbsize create unlink async 5m06s 4m15s sync 5m00s 4m22s And on Jan's test machine: 5.18-rc8-vanilla 5.18-rc8-patched Amean 1 71.22 ( 0.00%) 64.94 * 8.81%* Amean 2 93.03 ( 0.00%) 84.80 * 8.85%* Amean 4 150.54 ( 0.00%) 137.51 * 8.66%* Amean 8 252.53 ( 0.00%) 242.24 * 4.08%* Amean 16 454.13 ( 0.00%) 439.08 * 3.31%* Amean 32 835.24 ( 0.00%) 829.74 * 0.66%* Amean 64 1740.59 ( 0.00%) 1686.73 * 3.09%* Performance and cache flush behaviour is restored to pre-regression levels. As such, we can now consider the async cache flush mechanism an unnecessary exercise in premature optimisation and hence we can now remove it and the infrastructure it requires completely. Fixes: bad77c37 ("xfs: CIL checkpoint flushes caches unconditionally") Reported-and-tested-by: NJan Kara <jack@suse.cz> Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NGuo Xuenan <guoxuenan@huawei.com> Reviewed-by: NYang Erkun <yangerkun@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Dave Chinner 提交于
mainline inclusion from mainline-v5.14-rc1 commit 9d110014 category: bugfix bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9d110014205cb1129fa570d8de83d486fa199354 -------------------------------- From the department of "generic/482 keeps on giving", we bring you another tail update race condition: iclog: S1 C1 +-----------------------+-----------------------+ S2 EOIC Two checkpoints in a single iclog. One is complete, the other just contains the start record and overruns into a new iclog. Timeline: Before S1: Cache flush, log tail = X At S1: Metadata stable, write start record and checkpoint At C1: Write commit record, set NEED_FUA Single iclog checkpoint, so no need for NEED_FLUSH Log tail still = X, so no need for NEED_FLUSH After C1, Before S2: Cache flush, log tail = X At S2: Metadata stable, write start record and checkpoint After S2: Log tail moves to X+1 At EOIC: End of iclog, more journal data to write Releases iclog Not a commit iclog, so no need for NEED_FLUSH Writes log tail X+1 into iclog. At this point, the iclog has tail X+1 and NEED_FUA set. There has been no cache flush for the metadata between X and X+1, and the iclog writes the new tail permanently to the log. THis is sufficient to violate on disk metadata/journal ordering. We have two options here. The first is to detect this case in some manner and ensure that the partial checkpoint write sets NEED_FLUSH when the iclog is already marked NEED_FUA and the log tail changes. This seems somewhat fragile and quite complex to get right, and it doesn't actually make it obvious what underlying problem it is actually addressing from reading the code. The second option seems much cleaner to me, because it is derived directly from the requirements of the C1 commit record in the iclog. That is, when we write this commit record to the iclog, we've guaranteed that the metadata/data ordering is correct for tail update purposes. Hence if we only write the log tail into the iclog for the *first* commit record rather than the log tail at the last release, we guarantee that the log tail does not move past where the the first commit record in the log expects it to be. IOWs, taking the first option means that replay of C1 becomes dependent on future operations doing the right thing, not just the C1 checkpoint itself doing the right thing. This makes log recovery almost impossible to reason about because now we have to take into account what might or might not have happened in the future when looking at checkpoints in the log rather than just having to reconstruct the past... Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NGuo Xuenan <guoxuenan@huawei.com> Reviewed-by: NYang Erkun <yangerkun@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Dave Chinner 提交于
mainline inclusion from mainline-v5.14-rc1 commit b2ae3a9e category: bugfix bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b2ae3a9ef91152931b99620c431cf3805daa1429 -------------------------------- Because I cannot tell if the NEED_FLUSH flag is being set correctly by the log force and CIL push machinery without it. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NGuo Xuenan <guoxuenan@huawei.com> Reviewed-by: NYang Erkun <yangerkun@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
- 19 4月, 2023 19 次提交
-
-
由 Dave Chinner 提交于
mainline inclusion from mainline-v5.12-rc4 commit 5825bea0 category: bugfix bugzilla: 188483, https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5825bea05265d2938c4c20a1c0f8b7d7ab59523d -------------------------------- - 21.92% __xfs_trans_commit - 21.62% xfs_log_commit_cil - 11.69% xfs_trans_unreserve_and_mod_sb - 11.58% __percpu_counter_compare - 11.45% __percpu_counter_sum - 10.29% _raw_spin_lock_irqsave - 10.28% do_raw_spin_lock __pv_queued_spin_lock_slowpath We debated just getting rid of it last time this came up and there was no real objection to removing it. Now it's the biggest scalability limitation for debug kernels even on smallish machines, so let's just get rid of it. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NLong Li <leo.lilong@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Darrick J. Wong 提交于
mainline inclusion from mainline-v6.1-rc1 commit 03a7485c category: bugfix bugzilla: 188220, https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=03a7485cd701e1c08baadcf39d9592d83715e224 -------------------------------- Starting in 6.1, CONFIG_FORTIFY_SOURCE checks the length parameter of memcpy. Since we're already fixing problems with BUI item copying, we should fix it everything else. An extra difficulty here is that the ef[id]_extents arrays are declared as single-element arrays. This is not the convention for flex arrays in the modern kernel, and it causes all manner of problems with static checking tools, since they often cannot tell the difference between a single element array and a flex array. So for starters, change those array[1] declarations to array[] declarations to signal that they are proper flex arrays and adjust all the "size-1" expressions to fit the new declaration style. Next, refactor the xfs_efi_copy_format function to handle the copying of the head and the flex array members separately. While we're at it, fix a minor validation deficiency in the recovery function. Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NKees Cook <keescook@chromium.org> Reviewed-by: NAllison Henderson <allison.henderson@oracle.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> conflicts: fs/xfs/xfs_extfree_item.c fs/xfs/xfs_super.c Signed-off-by: NLong Li <leo.lilong@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Darrick J. Wong 提交于
mainline inclusion from mainline-v6.1-rc1 commit b45ca961 category: bugfix bugzilla: 188220, https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b45ca961e94673df83ab1900802afe82776966e6 -------------------------------- Starting in 6.1, CONFIG_FORTIFY_SOURCE checks the length parameter of memcpy. Since we're already fixing problems with BUI item copying, we should fix it everything else. Refactor the xfs_rui_copy_format function to handle the copying of the head and the flex array members separately. While we're at it, fix a minor validation deficiency in the recovery function. Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NAllison Henderson <allison.henderson@oracle.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NLong Li <leo.lilong@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Darrick J. Wong 提交于
mainline inclusion from mainline-v6.1-rc1 commit a38935c0 category: bugfix bugzilla: 188220, https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a38935c03c7914a6ab22eefb750b259868ed5a4b -------------------------------- Starting in 6.1, CONFIG_FORTIFY_SOURCE checks the length parameter of memcpy. Since we're already fixing problems with BUI item copying, we should fix it everything else. Refactor the xfs_cui_copy_format function to handle the copying of the head and the flex array members separately. While we're at it, fix a minor validation deficiency in the recovery function. Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NAllison Henderson <allison.henderson@oracle.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> conflicts: fs/xfs/xfs_ondisk.h Signed-off-by: NLong Li <leo.lilong@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Darrick J. Wong 提交于
mainline inclusion from mainline-v6.1-rc1 commit a38ebce1 category: bugfix bugzilla: 188220, https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a38ebce1da271f480e47c3def4f810c6106b74a1 -------------------------------- Starting in 6.1, CONFIG_FORTIFY_SOURCE checks the length parameter of memcpy. Unfortunately, it doesn't handle flex arrays correctly: ------------[ cut here ]------------ memcpy: detected field-spanning write (size 48) of single field "dst_bui_fmt" at fs/xfs/xfs_bmap_item.c:628 (size 16) Fix this by refactoring the xfs_bui_copy_format function to handle the copying of the head and the flex array members separately. While we're at it, fix a minor validation deficiency in the recovery function. Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NAllison Henderson <allison.henderson@oracle.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> conflicts: fs/xfs/xfs_ondisk.h Signed-off-by: NLong Li <leo.lilong@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Chandan Babu R 提交于
mainline inclusion from mainline-v5.12-rc4 commit b6785e27 category: bugfix bugzilla: 188220, https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b6785e279d53ca5c4fa6be1146e85000870d73ef -------------------------------- The incore data fork of an inode stores the bmap btree root node as 'struct xfs_btree_block'. However, the ondisk version of the inode stores the bmap btree root node as a 'struct xfs_bmdr_block'. xfs_bmap_add_attrfork_btree() checks if the btree root node fits inside the data fork of the inode. However, it incorrectly uses 'struct xfs_btree_block' to compute the size of the bmap btree root node. Since size of 'struct xfs_btree_block' is larger than that of 'struct xfs_bmdr_block', xfs_bmap_add_attrfork_btree() could end up unnecessarily demoting the current root node as the child of newly allocated root node. This commit optimizes space usage by modifying xfs_bmap_add_attrfork_btree() to use 'struct xfs_bmdr_block' to check if the bmap btree root node fits inside the data fork of the inode. Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NChandan Babu R <chandanrlinux@gmail.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Conflicts: fs/xfs/libxfs/xfs_bmap.c Signed-off-by: NLong Li <leo.lilong@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Donald Douwsma 提交于
mainline inclusion from mainline-v6.3-rc1 commit 167ce4cb category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=167ce4cbfa370114fee61ad5b58e401d95e2d5cd -------------------------------- xfs will not allow combining other panic masks with XFS_PTAG_VERIFIER_ERROR. # sysctl fs.xfs.panic_mask=511 sysctl: setting key "fs.xfs.panic_mask": Invalid argument fs.xfs.panic_mask = 511 Update to the maximum value that can be set to allow the full range of masks. Do this using a mask of possible values to prevent this happening again as suggested by Darrick. Fixes: d519da41 ("xfs: Introduce XFS_PTAG_VERIFIER_ERROR panic mask") Signed-off-by: NDonald Douwsma <ddouwsma@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NDave Chinner <dchinner@redhat.com> Conflicts: fs/xfs/xfs_error.h Signed-off-by: Nyangerkun <yangerkun@huaweicloud.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Dave Chinner 提交于
mainline inclusion from mainline-v6.3-rc1 commit c85007e2 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c85007e2e3942da1f9361e4b5a9388ea3a8dcc5b -------------------------------- When we split a BMBT due to record insertion, we offload it to a worker thread because we can be deep in the stack when we try to allocate a new block for the BMBT. Allocation can use several kilobytes of stack (full memory reclaim, swap and/or IO path can end up on the stack during allocation) and we can already be several kilobytes deep in the stack when we need to split the BMBT. A recent workload demonstrated a deadlock in this BMBT split offload. It requires several things to happen at once: 1. two inodes need a BMBT split at the same time, one must be unwritten extent conversion from IO completion, the other must be from extent allocation. 2. there must be a no available xfs_alloc_wq worker threads available in the worker pool. 3. There must be sustained severe memory shortages such that new kworker threads cannot be allocated to the xfs_alloc_wq pool for both threads that need split work to be run 4. The split work from the unwritten extent conversion must run first. 5. when the BMBT block allocation runs from the split work, it must loop over all AGs and not be able to either trylock an AGF successfully, or each AGF is is able to lock has no space available for a single block allocation. 6. The BMBT allocation must then attempt to lock the AGF that the second task queued to the rescuer thread already has locked before it finds an AGF it can allocate from. At this point, we have an ABBA deadlock between tasks queued on the xfs_alloc_wq rescuer thread and a locked AGF. i.e. The queued task holding the AGF lock can't be run by the rescuer thread until the task the rescuer thread is runing gets the AGF lock.... This is a highly improbably series of events, but there it is. There's a couple of ways to fix this, but the easiest way to ensure that we only punt tasks with a locked AGF that holds enough space for the BMBT block allocations to the worker thread. This works for unwritten extent conversion in IO completion (which doesn't have a locked AGF and space reservations) because we have tight control over the IO completion stack. It is typically only 6 functions deep when xfs_btree_split() is called because we've already offloaded the IO completion work to a worker thread and hence we don't need to worry about stack overruns here. The other place we can be called for a BMBT split without a preceeding allocation is __xfs_bunmapi() when punching out the center of an existing extent. We don't remove extents in the IO path, so these operations don't tend to be called with a lot of stack consumed. Hence we don't really need to ship the split off to a worker thread in these cases, either. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: Nyangerkun <yangerkun@huaweicloud.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Wengang Wang 提交于
mainline inclusion from mainline-v6.2-rc4 commit 601a27ea category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=601a27ea09a317d0fe2895df7d875381fb393041 -------------------------------- In xfs_extent_busy_update_extent() case 6 and 7, whenever bno is modified on extent busy, the relavent length has to be modified accordingly. Signed-off-by: NWengang Wang <wen.gang.wang@oracle.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: Nyangerkun <yangerkun@huaweicloud.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Hironori Shiina 提交于
mainline inclusion from mainline-v6.2-rc4 commit 817644fa category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=817644fa4525258992f17fecf4f1d6cdd2e1b731 -------------------------------- The root inode number should be set to `breq->startino` for getting stat information of the root when XFS_BULK_IREQ_SPECIAL_ROOT is used. Otherwise, the inode search is started from 1 (XFS_BULK_IREQ_SPECIAL_ROOT) and the inode with the lowest number in a filesystem is returned. Fixes: bf3cb394 ("xfs: allow single bulkstat of special inodes") Signed-off-by: NHironori Shiina <shiina.hironori@fujitsu.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: Nyangerkun <yangerkun@huaweicloud.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Darrick J. Wong 提交于
mainline inclusion from mainline-v6.2-rc1 commit ddfdd530 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ddfdd530e43fcb3f7a0a69966e5f6c33497b4ae3 -------------------------------- While investigating test failures in xfs/17[1-3] in alwayscow mode, I noticed through code inspection that xfs_bmap_alloc_userdata isn't setting XFS_ALLOC_USERDATA when allocating extents for a file's CoW fork. COW staging extents should be flagged as USERDATA, since user data are persisted to these blocks before being remapped into a file. This mis-classification has a few impacts on the behavior of the system. First, the filestreams allocator is supposed to keep allocating from a chosen AG until it runs out of space in that AG. However, it only does that for USERDATA allocations, which means that COW allocations aren't tied to the filestreams AG. Fortunately, few people use filestreams, so nobody's noticed. A more serious problem is that xfs_alloc_ag_vextent_small looks for a buffer to invalidate *if* the USERDATA flag is set and the AG is so full that the allocation had to come from the AGFL because the cntbt is empty. The consequences of not invalidating the buffer are severe -- if the AIL incorrectly checkpoints a buffer that is now being used to store user data, that action will clobber the user's written data. Fix filestreams and yet another data corruption vector by flagging COW allocations as USERDATA. Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: Nyangerkun <yangerkun@huaweicloud.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Darrick J. Wong 提交于
mainline inclusion from mainline-v6.2-rc1 commit 4c6dbfd2 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4c6dbfd2756bd83a0085ed804e2bb7be9cc16bc5 -------------------------------- I've been running near-continuous integration testing of online fsck, and I've noticed that once a day, one of the ARM VMs will fail the test with out of order records in the data fork. xfs/804 races fsstress with online scrub (aka scan but do not change anything), so I think this might be a bug in the core xfs code. This also only seems to trigger if one runs the test for more than ~6 minutes via TIME_FACTOR=13 or something. https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/tree/tests/xfs/804?h=djwong-wtf I added a debugging patch to the kernel to check the data fork extents after taking the ILOCK, before dropping ILOCK, and before and after each bmapping operation. So far I've narrowed it down to the delalloc code inserting a record in the wrong place in the iext tree: xfs_bmap_add_extent_hole_delay, near line 2691: case 0: /* * New allocation is not contiguous with another * delayed allocation. * Insert a new entry. */ oldlen = newlen = 0; xfs_iunlock_check_datafork(ip); <-- ok here xfs_iext_insert(ip, icur, new, state); xfs_iunlock_check_datafork(ip); <-- bad here break; } I recorded the state of the data fork mappings and iext cursor state when a corrupt data fork is detected immediately after the xfs_bmap_add_extent_hole_delay call in xfs_bmapi_reserve_delalloc: ino 0x140bb3 func xfs_bmapi_reserve_delalloc line 4164 data fork: ino 0x140bb3 nr 0x0 nr_real 0x0 offset 0xb9 blockcount 0x1f startblock 0x935de2 state 1 ino 0x140bb3 nr 0x1 nr_real 0x1 offset 0xe6 blockcount 0xa startblock 0xffffffffe0007 state 0 ino 0x140bb3 nr 0x2 nr_real 0x1 offset 0xd8 blockcount 0xe startblock 0x935e01 state 0 Here we see that a delalloc extent was inserted into the wrong position in the iext leaf, same as all the other times. The extra trace data I collected are as follows: ino 0x140bb3 fork 0 oldoff 0xe6 oldlen 0x4 oldprealloc 0x6 isize 0xe6000 ino 0x140bb3 oldgotoff 0xea oldgotstart 0xfffffffffffffffe oldgotcount 0x0 oldgotstate 0 ino 0x140bb3 crapgotoff 0x0 crapgotstart 0x0 crapgotcount 0x0 crapgotstate 0 ino 0x140bb3 freshgotoff 0xd8 freshgotstart 0x935e01 freshgotcount 0xe freshgotstate 0 ino 0x140bb3 nowgotoff 0xe6 nowgotstart 0xffffffffe0007 nowgotcount 0xa nowgotstate 0 ino 0x140bb3 oldicurpos 1 oldleafnr 2 oldleaf 0xfffffc00f0609a00 ino 0x140bb3 crapicurpos 2 crapleafnr 2 crapleaf 0xfffffc00f0609a00 ino 0x140bb3 freshicurpos 1 freshleafnr 2 freshleaf 0xfffffc00f0609a00 ino 0x140bb3 newicurpos 1 newleafnr 3 newleaf 0xfffffc00f0609a00 The first line shows that xfs_bmapi_reserve_delalloc was called with whichfork=XFS_DATA_FORK, off=0xe6, len=0x4, prealloc=6. The second line ("oldgot") shows the contents of @got at the beginning of the call, which are the results of the first iext lookup in xfs_buffered_write_iomap_begin. Line 3 ("crapgot") is the result of duplicating the cursor at the start of the body of xfs_bmapi_reserve_delalloc and performing a fresh lookup at @off. Line 4 ("freshgot") is the result of a new xfs_iext_get_extent right before the call to xfs_bmap_add_extent_hole_delay. Totally garbage. Line 5 ("nowgot") is contents of @got after the xfs_bmap_add_extent_hole_delay call. Line 6 is the contents of @icur at the beginning fo the call. Lines 7-9 are the contents of the iext cursors at the point where the block mappings were sampled. I think @oldgot is a HOLESTARTBLOCK extent because the first lookup didn't find anything, so we filled in imap with "fake hole until the end". At the time of the first lookup, I suspect that there's only one 32-block unwritten extent in the mapping (hence oldicurpos==1) but by the time we get to recording crapgot, crapicurpos==2. Dave then added: Ok, that's much simpler to reason about, and implies the smoke is coming from xfs_buffered_write_iomap_begin() or xfs_bmapi_reserve_delalloc(). I suspect the former - it does a lot of stuff with the ILOCK_EXCL held..... .... including calling xfs_qm_dqattach_locked(). xfs_buffered_write_iomap_begin ILOCK_EXCL look up icur xfs_qm_dqattach_locked xfs_qm_dqattach_one xfs_qm_dqget_inode dquot cache miss xfs_iunlock(ip, XFS_ILOCK_EXCL); error = xfs_qm_dqread(mp, id, type, can_alloc, &dqp); xfs_ilock(ip, XFS_ILOCK_EXCL); .... xfs_bmapi_reserve_delalloc(icur) Yup, that's what is letting the magic smoke out - xfs_qm_dqattach_locked() can cycle the ILOCK. If that happens, we can pass a stale icur to xfs_bmapi_reserve_delalloc() and it all goes downhill from there. Back to Darrick now: So. Fix this by moving the dqattach_locked call up before we take the ILOCK, like all the other callers in that file. Fixes: a526c85c ("xfs: move xfs_file_iomap_begin_delay around") # goes further back than this Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NDave Chinner <dchinner@redhat.com> Conflicts: fs/xfs/xfs_iomap.c Signed-off-by: Nyangerkun <yangerkun@huaweicloud.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Darrick J. Wong 提交于
mainline inclusion from mainline-v6.2-rc1 commit 032e1603 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=032e160305f6872e590c77f11896fb28365c6d6c -------------------------------- Every now and then I see fstests failures on aarch64 (64k pages) that trigger on the following sequence: mkfs.xfs $dev mount $dev $mnt touch $mnt/a umount $mnt xfs_db -c 'path /a' -c 'print' $dev 99% of the time this succeeds, but every now and then xfs_db cannot find /a and fails. This turns out to be a race involving udev/blkid, the page cache for the block device, and the xfs_db process. udev is triggered whenever anyone closes a block device or unmounts it. The default udev rules invoke blkid to read the fs super and create symlinks to the bdev under /dev/disk. For this, it uses buffered reads through the page cache. xfs_db also uses buffered reads to examine metadata. There is no coordination between xfs_db and udev, which means that they can run concurrently. Note there is no coordination between the kernel and blkid either. On a system with 64k pages, the page cache can cache the superblock and the root inode (and hence the root dir) with the same 64k page. If udev spawns blkid after the mkfs and the system is busy enough that it is still running when xfs_db starts up, they'll both read from the same page in the pagecache. The unmount writes updated inode metadata to disk directly. The XFS buffer cache does not use the bdev pagecache, nor does it invalidate the pagecache on umount. If the above scenario occurs, the pagecache no longer reflects what's on disk, xfs_db reads the stale metadata, and fails to find /a. Most of the time this succeeds because closing a bdev invalidates the page cache, but when processes race, everyone loses. Fix the problem by invalidating the bdev pagecache after flushing the bdev, so that xfs_db will see up to date metadata. Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NGao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Conflicts: fs/xfs/xfs_buf.c Signed-off-by: Nyangerkun <yangerkun@huaweicloud.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Eric Sandeen 提交于
mainline inclusion from mainline-v6.0-rc1 commit 70b589a3 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=70b589a37e1aba892c1e5d41957b0042f9eb031b -------------------------------- We got a report that "renameat2() with flags=RENAME_WHITEOUT doesn't apply an SELinux label on xfs" as it does on other filesystems (for example, ext4 and tmpfs.) While I'm not quite sure how labels may interact w/ whiteout files, leaving them as unlabeled seems inconsistent at best. Now that xfs_init_security is not static, rename it to xfs_inode_init_security per dchinner's suggestion. Signed-off-by: NEric Sandeen <sandeen@redhat.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Conflicts: fs/xfs/xfs_inode.c fs/xfs/xfs_iops.h Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Darrick J. Wong 提交于
mainline inclusion from mainline-v5.19-rc1 commit 910bbdf2 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=910bbdf2f4d7df46781bc9b723048f5ebed3d0d7 -------------------------------- While we're messing around with how recovery allocates and frees the buffer cancellation table, convert the allocation to use kmalloc_array instead of the old kmem_alloc APIs, and make it handle a null return, even though that's not likely. Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com> Conflicts: fs/xfs/libxfs/xfs_log_recover.h Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Darrick J. Wong 提交于
mainline inclusion from mainline-v5.19-rc1 commit 8db074bd category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8db074bd84df5ccc88bff3f8f900f66f4b8349fa -------------------------------- If log recovery fails, we free the memory used by the buffer cancellation buckets, but we don't actually traverse each bucket list to free the individual xfs_buf_cancel objects. This leads to a memory leak, as reported by kmemleak in xfs/051: unreferenced object 0xffff888103629560 (size 32): comm "mount", pid 687045, jiffies 4296935916 (age 10.752s) hex dump (first 32 bytes): 08 d3 0a 01 00 00 00 00 08 00 00 00 01 00 00 00 ................ d0 f5 0b 92 81 88 ff ff 80 64 64 25 81 88 ff ff .........dd%.... backtrace: [<ffffffffa0317c83>] kmem_alloc+0x73/0x140 [xfs] [<ffffffffa03234a9>] xlog_recover_buf_commit_pass1+0x139/0x200 [xfs] [<ffffffffa032dc27>] xlog_recover_commit_trans+0x307/0x350 [xfs] [<ffffffffa032df15>] xlog_recovery_process_trans+0xa5/0xe0 [xfs] [<ffffffffa032e12d>] xlog_recover_process_data+0x8d/0x140 [xfs] [<ffffffffa032e49d>] xlog_do_recovery_pass+0x19d/0x740 [xfs] [<ffffffffa032f22d>] xlog_do_log_recovery+0x6d/0x150 [xfs] [<ffffffffa032f343>] xlog_do_recover+0x33/0x1d0 [xfs] [<ffffffffa032faba>] xlog_recover+0xda/0x190 [xfs] [<ffffffffa03194bc>] xfs_log_mount+0x14c/0x360 [xfs] [<ffffffffa030bfed>] xfs_mountfs+0x50d/0xa60 [xfs] [<ffffffffa03124b5>] xfs_fs_fill_super+0x6a5/0x950 [xfs] [<ffffffff812b92a5>] get_tree_bdev+0x175/0x280 [<ffffffff812b7c3a>] vfs_get_tree+0x1a/0x80 [<ffffffff812e366f>] path_mount+0x6ff/0xaa0 [<ffffffff812e3b13>] __x64_sys_mount+0x103/0x140 Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Darrick J. Wong 提交于
mainline inclusion from mainline-v5.19-rc1 commit 27232349 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2723234923b3294dbcf6019c288c87465e927ed4 -------------------------------- Move the code that allocates and frees the buffer cancellation tables used by log recovery into the file that actually uses the tables. This is a precursor to some cleanups and a memory leak fix. Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com> Conflicts: fs/xfs/libxfs/xfs_log_recover.h fs/xfs/xfs_log_priv.h Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Darrick J. Wong 提交于
mainline inclusion from mainline-v5.14-rc1 commit 10be350b category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=10be350b8c6c426b82d4df937f25b37eabdc3d67 -------------------------------- It's currently unlikely that we will ever end up with more than 4 billion inodes waiting for reclamation, but the fs object code uses long int for object counts and we're certainly capable of generating that many. Instead of truncating the internal counters, widen them and report the object counts correctly. Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NChandan Babu R <chandanrlinux@gmail.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: Nyangerkun <yangerkun@huaweicloud.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Darrick J. Wong 提交于
mainline inclusion from mainline-v5.13-rc1 commit 8e9800f9 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id8e9800f9f2b89e1efe2a5993361fae4d618a6c26 -------------------------------- While running generic/050 with an external log, I observed this warning in dmesg: Trying to write to read-only block-device sda4 (partno 4) WARNING: CPU: 2 PID: 215677 at block/blk-core.c:704 submit_bio_checks+0x256/0x510 Call Trace: submit_bio_noacct+0x2c/0x430 _xfs_buf_ioapply+0x283/0x3c0 [xfs] __xfs_buf_submit+0x6a/0x210 [xfs] xfs_buf_delwri_submit_buffers+0xf8/0x270 [xfs] xfsaild+0x2db/0xc50 [xfs] kthread+0x14b/0x170 I think this happened because we tried to cover the log after a readonly mount, and the AIL tried to write the primary superblock to the data device. The test marks the data device readonly, but it doesn't do the same to the external log device. Therefore, XFS thinks that the log is writable, even though AIL writes whine to dmesg because the data device is read only. Fix this by amending xfs_log_writable to prevent writes when the AIL can't possible write anything into the filesystem. Note: As for the external log or the rt devices being readonly-- xfs_blkdev_get will complain about that if we aren't doing a norecovery mount. Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NBrian Foster <bfoster@redhat.com> Signed-off-by: Nyangerkun <yangerkun@huaweicloud.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
- 12 4月, 2023 12 次提交
-
-
由 Darrick J. Wong 提交于
mainline inclusion from mainline-v5.19-rc5 commit c78c2d09 category: bugfix bugzilla: 187164, https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c78c2d0903183a41beb90c56a923e30f90fa91b9 -------------------------------- I observed the following evidence of a memory leak while running xfs/399 from the xfs fsck test suite (edited for brevity): XFS (sde): Metadata corruption detected at xfs_attr_shortform_verify_struct.part.0+0x7b/0xb0 [xfs], inode 0x1172 attr fork XFS: Assertion failed: ip->i_af.if_u1.if_data == NULL, file: fs/xfs/libxfs/xfs_inode_fork.c, line: 315 ------------[ cut here ]------------ WARNING: CPU: 2 PID: 91635 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs] CPU: 2 PID: 91635 Comm: xfs_scrub Tainted: G W 5.19.0-rc7-xfsx #rc7 6e6475eb29fd9dda3181f81b7ca7ff961d277a40 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014 RIP: 0010:assfail+0x46/0x4a [xfs] Call Trace: <TASK> xfs_ifork_zap_attr+0x7c/0xb0 xfs_iformat_attr_fork+0x86/0x110 xfs_inode_from_disk+0x41d/0x480 xfs_iget+0x389/0xd70 xfs_bulkstat_one_int+0x5b/0x540 xfs_bulkstat_iwalk+0x1e/0x30 xfs_iwalk_ag_recs+0xd1/0x160 xfs_iwalk_run_callbacks+0xb9/0x180 xfs_iwalk_ag+0x1d8/0x2e0 xfs_iwalk+0x141/0x220 xfs_bulkstat+0x105/0x180 xfs_ioc_bulkstat.constprop.0.isra.0+0xc5/0x130 xfs_file_ioctl+0xa5f/0xef0 __x64_sys_ioctl+0x82/0xa0 do_syscall_64+0x2b/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 This newly-added assertion checks that there aren't any incore data structures hanging off the incore fork when we're trying to reset its contents. From the call trace, it is evident that iget was trying to construct an incore inode from the ondisk inode, but the attr fork verifier failed and we were trying to undo all the memory allocations that we had done earlier. The three assertions in xfs_ifork_zap_attr check that the caller has already called xfs_idestroy_fork, which clearly has not been done here. As the zap function then zeroes the pointers, we've effectively leaked the memory. The shortest change would have been to insert an extra call to xfs_idestroy_fork, but it makes more sense to bundle the _idestroy_fork call into _zap_attr, since all other callsites call _idestroy_fork immediately prior to calling _zap_attr. IOWs, it eliminates one way to fail. Note: This change only applies cleanly to 2ed5b09b, since we just reworked the attr fork lifetime. However, I think this memory leak has existed since 0f45a1b2, since the chain xfs_iformat_attr_fork -> xfs_iformat_local -> xfs_init_local_fork will allocate ifp->if_u1.if_data, but if xfs_ifork_verify_local_attr fails, xfs_iformat_attr_fork will free i_afp without freeing any of the stuff hanging off i_afp. The solution for older kernels I think is to add the missing call to xfs_idestroy_fork just prior to calling kmem_cache_free. Found by fuzzing a.sfattr.hdr.totsize = lastbit in xfs/399. Fixes: 2ed5b09b ("xfs: make inode attribute forks a permanent part of struct xfs_inode") Probably-Fixes: 0f45a1b2 ("xfs: improve local fork verification") Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NDave Chinner <dchinner@redhat.com> conflicts: fs/xfs/libxfs/xfs_attr_leaf.c Signed-off-by: NLong Li <leo.lilong@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Dan Carpenter 提交于
mainline inclusion from mainline-v5.19-rc5 commit 3f52e016 category: bugfix bugzilla: 187164, https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3f52e016af600982989b5dee958d313c52483c92 -------------------------------- These NULL check are no long needed after commit 2ed5b09b ("xfs: make inode attribute forks a permanent part of struct xfs_inode"). Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NLong Li <leo.lilong@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Darrick J. Wong 提交于
mainline inclusion from mainline-v5.19-rc5 commit c01147d9 category: bugfix bugzilla: 187164, https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c01147d929899f02a0a8b15e406d12784768ca72 -------------------------------- Replace the shouty macros here with typechecked helper functions. Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NDave Chinner <dchinner@redhat.com> conflicts: fs/xfs/libxfs/xfs_attr_leaf.c fs/xfs/libxfs/xfs_bmap.c fs/xfs/libxfs/xfs_bmap_btree.c fs/xfs/libxfs/xfs_dir2.c fs/xfs/libxfs/xfs_inode_fork.h fs/xfs/scrub/symlink.c fs/xfs/xfs_itable.c fs/xfs/xfs_symlink.c fs/xfs/xfs_trace.h Signed-off-by: NLong Li <leo.lilong@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Darrick J. Wong 提交于
mainline inclusion from mainline-v5.19-rc5 commit 932b42c6 category: bugfix bugzilla: 187164, https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=932b42c66cb5d0ca9800b128415b4ad6b1952b3e -------------------------------- Replace this shouty macro with a real C function that has a more descriptive name. Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NDave Chinner <dchinner@redhat.com> conflicts: fs/xfs/libxfs/xfs_attr.h fs/xfs/libxfs/xfs_inode_fork.c fs/xfs/scrub/btree.c fs/xfs/xfs_inode.c Signed-off-by: NLong Li <leo.lilong@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Darrick J. Wong 提交于
mainline inclusion from mainline-v5.19-rc5 commit e45d7cb2 category: bugfix bugzilla: 187164, https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e45d7cb2356e6b59fe64da28324025cc6fcd3fbd -------------------------------- Modify xfs_ifork_ptr to return a NULL pointer if the caller asks for the attribute fork but i_forkoff is zero. This eliminates the ambiguity between i_forkoff and i_af.if_present, which should make it easier to understand the lifetime of attr forks. While we're at it, remove the if_present checks around calls to xfs_idestroy_fork and xfs_ifork_zap_attr since they can both handle attr forks that have already been torn down. Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NDave Chinner <dchinner@redhat.com> conflicts: fs/xfs/libxfs/xfs_attr.h fs/xfs/libxfs/xfs_inode_fork.c fs/xfs/libxfs/xfs_inode_fork.h fs/xfs/xfs_icache.c fs/xfs/xfs_inode.c Signed-off-by: NLong Li <leo.lilong@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Darrick J. Wong 提交于
mainline inclusion from mainline-v5.19-rc5 commit 2ed5b09b category: bugfix bugzilla: 187164, https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2ed5b09b3e8fc274ae8fecd6ab7c5106a364bed1 -------------------------------- Syzkaller reported a UAF bug a while back: ================================================================== BUG: KASAN: use-after-free in xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127 Read of size 4 at addr ffff88802cec919c by task syz-executor262/2958 CPU: 2 PID: 2958 Comm: syz-executor262 Not tainted 5.15.0-0.30.3-20220406_1406 #3 Hardware name: Red Hat KVM, BIOS 1.13.0-2.module+el8.3.0+7860+a7792d29 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x82/0xa9 lib/dump_stack.c:106 print_address_description.constprop.9+0x21/0x2d5 mm/kasan/report.c:256 __kasan_report mm/kasan/report.c:442 [inline] kasan_report.cold.14+0x7f/0x11b mm/kasan/report.c:459 xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127 xfs_attr_get+0x378/0x4c2 fs/xfs/libxfs/xfs_attr.c:159 xfs_xattr_get+0xe3/0x150 fs/xfs/xfs_xattr.c:36 __vfs_getxattr+0xdf/0x13d fs/xattr.c:399 cap_inode_need_killpriv+0x41/0x5d security/commoncap.c:300 security_inode_need_killpriv+0x4c/0x97 security/security.c:1408 dentry_needs_remove_privs.part.28+0x21/0x63 fs/inode.c:1912 dentry_needs_remove_privs+0x80/0x9e fs/inode.c:1908 do_truncate+0xc3/0x1e0 fs/open.c:56 handle_truncate fs/namei.c:3084 [inline] do_open fs/namei.c:3432 [inline] path_openat+0x30ab/0x396d fs/namei.c:3561 do_filp_open+0x1c4/0x290 fs/namei.c:3588 do_sys_openat2+0x60d/0x98c fs/open.c:1212 do_sys_open+0xcf/0x13c fs/open.c:1228 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0x0 RIP: 0033:0x7f7ef4bb753d Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1b 79 2c 00 f7 d8 64 89 01 48 RSP: 002b:00007f7ef52c2ed8 EFLAGS: 00000246 ORIG_RAX: 0000000000000055 RAX: ffffffffffffffda RBX: 0000000000404148 RCX: 00007f7ef4bb753d RDX: 00007f7ef4bb753d RSI: 0000000000000000 RDI: 0000000020004fc0 RBP: 0000000000404140 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0030656c69662f2e R13: 00007ffd794db37f R14: 00007ffd794db470 R15: 00007f7ef52c2fc0 </TASK> Allocated by task 2953: kasan_save_stack+0x19/0x38 mm/kasan/common.c:38 kasan_set_track mm/kasan/common.c:46 [inline] set_alloc_info mm/kasan/common.c:434 [inline] __kasan_slab_alloc+0x68/0x7c mm/kasan/common.c:467 kasan_slab_alloc include/linux/kasan.h:254 [inline] slab_post_alloc_hook mm/slab.h:519 [inline] slab_alloc_node mm/slub.c:3213 [inline] slab_alloc mm/slub.c:3221 [inline] kmem_cache_alloc+0x11b/0x3eb mm/slub.c:3226 kmem_cache_zalloc include/linux/slab.h:711 [inline] xfs_ifork_alloc+0x25/0xa2 fs/xfs/libxfs/xfs_inode_fork.c:287 xfs_bmap_add_attrfork+0x3f2/0x9b1 fs/xfs/libxfs/xfs_bmap.c:1098 xfs_attr_set+0xe38/0x12a7 fs/xfs/libxfs/xfs_attr.c:746 xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59 __vfs_setxattr+0x11b/0x177 fs/xattr.c:180 __vfs_setxattr_noperm+0x128/0x5e0 fs/xattr.c:214 __vfs_setxattr_locked+0x1d4/0x258 fs/xattr.c:275 vfs_setxattr+0x154/0x33d fs/xattr.c:301 setxattr+0x216/0x29f fs/xattr.c:575 __do_sys_fsetxattr fs/xattr.c:632 [inline] __se_sys_fsetxattr fs/xattr.c:621 [inline] __x64_sys_fsetxattr+0x243/0x2fe fs/xattr.c:621 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0x0 Freed by task 2949: kasan_save_stack+0x19/0x38 mm/kasan/common.c:38 kasan_set_track+0x1c/0x21 mm/kasan/common.c:46 kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:360 ____kasan_slab_free mm/kasan/common.c:366 [inline] ____kasan_slab_free mm/kasan/common.c:328 [inline] __kasan_slab_free+0xe2/0x10e mm/kasan/common.c:374 kasan_slab_free include/linux/kasan.h:230 [inline] slab_free_hook mm/slub.c:1700 [inline] slab_free_freelist_hook mm/slub.c:1726 [inline] slab_free mm/slub.c:3492 [inline] kmem_cache_free+0xdc/0x3ce mm/slub.c:3508 xfs_attr_fork_remove+0x8d/0x132 fs/xfs/libxfs/xfs_attr_leaf.c:773 xfs_attr_sf_removename+0x5dd/0x6cb fs/xfs/libxfs/xfs_attr_leaf.c:822 xfs_attr_remove_iter+0x68c/0x805 fs/xfs/libxfs/xfs_attr.c:1413 xfs_attr_remove_args+0xb1/0x10d fs/xfs/libxfs/xfs_attr.c:684 xfs_attr_set+0xf1e/0x12a7 fs/xfs/libxfs/xfs_attr.c:802 xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59 __vfs_removexattr+0x106/0x16a fs/xattr.c:468 cap_inode_killpriv+0x24/0x47 security/commoncap.c:324 security_inode_killpriv+0x54/0xa1 security/security.c:1414 setattr_prepare+0x1a6/0x897 fs/attr.c:146 xfs_vn_change_ok+0x111/0x15e fs/xfs/xfs_iops.c:682 xfs_vn_setattr_size+0x5f/0x15a fs/xfs/xfs_iops.c:1065 xfs_vn_setattr+0x125/0x2ad fs/xfs/xfs_iops.c:1093 notify_change+0xae5/0x10a1 fs/attr.c:410 do_truncate+0x134/0x1e0 fs/open.c:64 handle_truncate fs/namei.c:3084 [inline] do_open fs/namei.c:3432 [inline] path_openat+0x30ab/0x396d fs/namei.c:3561 do_filp_open+0x1c4/0x290 fs/namei.c:3588 do_sys_openat2+0x60d/0x98c fs/open.c:1212 do_sys_open+0xcf/0x13c fs/open.c:1228 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0x0 The buggy address belongs to the object at ffff88802cec9188 which belongs to the cache xfs_ifork of size 40 The buggy address is located 20 bytes inside of 40-byte region [ffff88802cec9188, ffff88802cec91b0) The buggy address belongs to the page: page:00000000c3af36a1 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2cec9 flags: 0xfffffc0000200(slab|node=0|zone=1|lastcpupid=0x1fffff) raw: 000fffffc0000200 ffffea00009d2580 0000000600000006 ffff88801a9ffc80 raw: 0000000000000000 0000000080490049 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88802cec9080: fb fb fb fc fc fa fb fb fb fb fc fc fb fb fb fb ffff88802cec9100: fb fc fc fb fb fb fb fb fc fc fb fb fb fb fb fc >ffff88802cec9180: fc fa fb fb fb fb fc fc fa fb fb fb fb fc fc fb ^ ffff88802cec9200: fb fb fb fb fc fc fb fb fb fb fb fc fc fb fb fb ffff88802cec9280: fb fb fc fc fa fb fb fb fb fc fc fa fb fb fb fb ================================================================== The root cause of this bug is the unlocked access to xfs_inode.i_afp from the getxattr code paths while trying to determine which ILOCK mode to use to stabilize the xattr data. Unfortunately, the VFS does not acquire i_rwsem when vfs_getxattr (or listxattr) call into the filesystem, which means that getxattr can race with a removexattr that's tearing down the attr fork and crash: xfs_attr_set: xfs_attr_get: xfs_attr_fork_remove: xfs_ilock_attr_map_shared: xfs_idestroy_fork(ip->i_afp); kmem_cache_free(xfs_ifork_cache, ip->i_afp); if (ip->i_afp && ip->i_afp = NULL; xfs_need_iread_extents(ip->i_afp)) <KABOOM> ip->i_forkoff = 0; Regrettably, the VFS is much more lax about i_rwsem and getxattr than is immediately obvious -- not only does it not guarantee that we hold i_rwsem, it actually doesn't guarantee that we *don't* hold it either. The getxattr system call won't acquire the lock before calling XFS, but the file capabilities code calls getxattr with and without i_rwsem held to determine if the "security.capabilities" xattr is set on the file. Fixing the VFS locking requires a treewide investigation into every code path that could touch an xattr and what i_rwsem state it expects or sets up. That could take years or even prove impossible; fortunately, we can fix this UAF problem inside XFS. An earlier version of this patch used smp_wmb in xfs_attr_fork_remove to ensure that i_forkoff is always zeroed before i_afp is set to null and changed the read paths to use smp_rmb before accessing i_forkoff and i_afp, which avoided these UAF problems. However, the patch author was too busy dealing with other problems in the meantime, and by the time he came back to this issue, the situation had changed a bit. On a modern system with selinux, each inode will always have at least one xattr for the selinux label, so it doesn't make much sense to keep incurring the extra pointer dereference. Furthermore, Allison's upcoming parent pointer patchset will also cause nearly every inode in the filesystem to have extended attributes. Therefore, make the inode attribute fork structure part of struct xfs_inode, at a cost of 40 more bytes. This patch adds a clunky if_present field where necessary to maintain the existing logic of xattr fork null pointer testing in the existing codebase. The next patch switches the logic over to XFS_IFORK_Q and it all goes away. Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NDave Chinner <dchinner@redhat.com> conflicts: fs/xfs/libxfs/xfs_attr.c fs/xfs/libxfs/xfs_attr.h fs/xfs/libxfs/xfs_attr_leaf.c fs/xfs/libxfs/xfs_bmap.c fs/xfs/libxfs/xfs_inode_buf.c fs/xfs/libxfs/xfs_inode_fork.c fs/xfs/libxfs/xfs_inode_fork.h fs/xfs/xfs_attr_inactive.c fs/xfs/xfs_attr_list.c fs/xfs/xfs_icache.c fs/xfs/xfs_inode.c fs/xfs/xfs_inode.h fs/xfs/xfs_inode_item.c fs/xfs/xfs_itable.c Signed-off-by: NLong Li <leo.lilong@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Darrick J. Wong 提交于
mainline inclusion from mainline-v5.19-rc5 commit 732436ef category: bugfix bugzilla: 187164, https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=732436ef916b4f338d672ea56accfdb11e8d0732 -------------------------------- We're about to make this logic do a bit more, so convert the macro to a static inline function for better typechecking and fewer shouty macros. No functional changes here. Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NDave Chinner <dchinner@redhat.com> conflicts: fs/xfs/libxfs/xfs_bmap.c fs/xfs/libxfs/xfs_bmap_btree.c fs/xfs/libxfs/xfs_inode_fork.c fs/xfs/libxfs/xfs_inode_fork.h fs/xfs/scrub/bmap.c fs/xfs/scrub/symlink.c fs/xfs/xfs_inode.c fs/xfs/xfs_ioctl.c fs/xfs/xfs_qm.c fs/xfs/xfs_reflink.c Signed-off-by: NLong Li <leo.lilong@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Brian Foster 提交于
mainline inclusion from mainline-v5.11-rc4 commit 06058bc4 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=06058bc40534530e617e5623775c53bb24f032cb -------------------------------- Freed extents are marked busy from the point the freeing transaction commits until the associated CIL context is checkpointed to the log. This prevents reuse and overwrite of recently freed blocks before the changes are committed to disk, which can lead to corruption after a crash. The exception to this rule is that metadata allocation is allowed to reuse busy extents because metadata changes are also logged. As of commit 97d3ac75 ("xfs: exact busy extent tracking"), XFS has allowed modification or complete invalidation of outstanding busy extents for metadata allocations. This implementation assumes that use of the associated extent is imminent, which is not always the case. For example, the trimmed extent might not satisfy the minimum length of the allocation request, or the allocation algorithm might be involved in a search for the optimal result based on locality. generic/019 reproduces a corruption caused by this scenario. First, a metadata block (usually a bmbt or symlink block) is freed from an inode. A subsequent bmbt split on an unrelated inode attempts a near mode allocation request that invalidates the busy block during the search, but does not ultimately allocate it. Due to the busy state invalidation, the block is no longer considered busy to subsequent allocation. A direct I/O write request immediately allocates the block and writes to it. Finally, the filesystem crashes while in a state where the initial metadata block free had not committed to the on-disk log. After recovery, the original metadata block is in its original location as expected, but has been corrupted by the aforementioned dio. This demonstrates that it is fundamentally unsafe to modify busy extent state for extents that are not guaranteed to be allocated. This applies to pretty much all of the code paths that currently trim busy extents for one reason or another. Therefore to address this problem, drop the reuse mechanism from the busy extent trim path. This code already knows how to return partial non-busy ranges of the targeted free extent and higher level code tracks the busy state of the allocation attempt. If a block allocation fails where one or more candidate extents is busy, we force the log and retry the allocation. Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NChandan Babu R <chandanrlinux@gmail.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NLong Li <leo.lilong@huawei.com> Reviewed-by: NYang Erkun <yangerkun@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Zheng Yongjun 提交于
mainline inclusion from mainline-v5.10-rc5 commit 1189686e category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1189686e5440041057f8cc21a7c1d13bb6642cb9 -------------------------------- Replace a comma between expression statements by a semicolon. Signed-off-by: NZheng Yongjun <zhengyongjun3@huawei.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NEric Sandeen <sandeen@redhat.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NLong Li <leo.lilong@huawei.com> Reviewed-by: NYang Erkun <yangerkun@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Dave Chinner 提交于
mainline inclusion from mainline-v5.17-rc6 commit 941fbdfd category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=941fbdfd6dd0f1d7961c28123b5460912f678cb5 -------------------------------- xfs_ail_push_all_sync() has a loop like this: while max_ail_lsn { prepare_to_wait(ail_empty) target = max_ail_lsn wake_up(ail_task); schedule() } Which is designed to sleep until the AIL is emptied. When xfs_ail_update_finish() moves the tail of the log, it does: if (list_empty(&ailp->ail_head)) wake_up_all(&ailp->ail_empty); So it will only wake up the sync push waiter when the AIL goes empty. If, by the time the push waiter has woken, the AIL has more in it, it will reset the target, wake the push task and go back to sleep. The problem here is that if the AIL is having items added to it when xfs_ail_push_all_sync() is called, then they may get inserted into the AIL at a LSN higher than the target LSN. At this point, xfsaild_push() will see that the target is X, the item LSNs are (X+N) and skip over them, hence never pushing the out. The result of this the AIL will not get emptied by the AIL push thread, hence xfs_ail_finish_update() will never see the AIL being empty even if it moves the tail. Hence xfs_ail_push_all_sync() never gets woken and hence cannot update the push target to capture the items beyond the current target on the LSN. This is a TOCTOU type of issue so the way to avoid it is to not use the push target at all for sync pushes. We know that a sync push is being requested by the fact the ail_empty wait queue is active, hence the xfsaild can just set the target to max_ail_lsn on every push that we see the wait queue active. Hence we no longer will leave items on the AIL that are beyond the LSN sampled at the start of a sync push. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChandan Babu R <chandan.babu@oracle.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NLong Li <leo.lilong@huawei.com> Reviewed-by: NYang Erkun <yangerkun@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Dave Chinner 提交于
mainline inclusion from mainline-v5.17-rc6 commit dbd0f529 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=dbd0f5299302f8506637592e2373891a748c6990 -------------------------------- AIL flushing can get stuck here: [316649.005769] INFO: task xfsaild/pmem1:324525 blocked for more than 123 seconds. [316649.007807] Not tainted 5.17.0-rc6-dgc+ #975 [316649.009186] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [316649.011720] task:xfsaild/pmem1 state:D stack:14544 pid:324525 ppid: 2 flags:0x00004000 [316649.014112] Call Trace: [316649.014841] <TASK> [316649.015492] __schedule+0x30d/0x9e0 [316649.017745] schedule+0x55/0xd0 [316649.018681] io_schedule+0x4b/0x80 [316649.019683] xfs_buf_wait_unpin+0x9e/0xf0 [316649.021850] __xfs_buf_submit+0x14a/0x230 [316649.023033] xfs_buf_delwri_submit_buffers+0x107/0x280 [316649.024511] xfs_buf_delwri_submit_nowait+0x10/0x20 [316649.025931] xfsaild+0x27e/0x9d0 [316649.028283] kthread+0xf6/0x120 [316649.030602] ret_from_fork+0x1f/0x30 in the situation where flushing gets preempted between the unpin check and the buffer trylock under nowait conditions: blk_start_plug(&plug); list_for_each_entry_safe(bp, n, buffer_list, b_list) { if (!wait_list) { if (xfs_buf_ispinned(bp)) { pinned++; continue; } Here >>>>>> if (!xfs_buf_trylock(bp)) continue; This means submission is stuck until something else triggers a log force to unpin the buffer. To get onto the delwri list to begin with, the buffer pin state has already been checked, and hence it's relatively rare we get a race between flushing and encountering a pinned buffer in delwri submission to begin with. Further, to increase the pin count the buffer has to be locked, so the only way we can hit this race without failing the trylock is to be preempted between the pincount check seeing zero and the trylock being run. Hence to avoid this problem, just invert the order of trylock vs pin check. We shouldn't hit that many pinned buffers here, so optimising away the trylock for pinned buffers should not matter for performance at all. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChandan Babu R <chandan.babu@oracle.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NLong Li <leo.lilong@huawei.com> Reviewed-by: NYang Erkun <yangerkun@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
由 Dave Chinner 提交于
mainline inclusion from mainline-v5.17-rc6 commit a9a4bc8c category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a9a4bc8c76d747aa40b30e2dfc176c781f353a08 -------------------------------- After 963 iterations of generic/530, it deadlocked during recovery on a pinned inode cluster buffer like so: XFS (pmem1): Starting recovery (logdev: internal) INFO: task kworker/8:0:306037 blocked for more than 122 seconds. Not tainted 5.17.0-rc6-dgc+ #975 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/8:0 state:D stack:13024 pid:306037 ppid: 2 flags:0x00004000 Workqueue: xfs-inodegc/pmem1 xfs_inodegc_worker Call Trace: <TASK> __schedule+0x30d/0x9e0 schedule+0x55/0xd0 schedule_timeout+0x114/0x160 __down+0x99/0xf0 down+0x5e/0x70 xfs_buf_lock+0x36/0xf0 xfs_buf_find+0x418/0x850 xfs_buf_get_map+0x47/0x380 xfs_buf_read_map+0x54/0x240 xfs_trans_read_buf_map+0x1bd/0x490 xfs_imap_to_bp+0x4f/0x70 xfs_iunlink_map_ino+0x66/0xd0 xfs_iunlink_map_prev.constprop.0+0x148/0x2f0 xfs_iunlink_remove_inode+0xf2/0x1d0 xfs_inactive_ifree+0x1a3/0x900 xfs_inode_unlink+0xcc/0x210 xfs_inodegc_worker+0x1ac/0x2f0 process_one_work+0x1ac/0x390 worker_thread+0x56/0x3c0 kthread+0xf6/0x120 ret_from_fork+0x1f/0x30 </TASK> task:mount state:D stack:13248 pid:324509 ppid:324233 flags:0x00004000 Call Trace: <TASK> __schedule+0x30d/0x9e0 schedule+0x55/0xd0 schedule_timeout+0x114/0x160 __down+0x99/0xf0 down+0x5e/0x70 xfs_buf_lock+0x36/0xf0 xfs_buf_find+0x418/0x850 xfs_buf_get_map+0x47/0x380 xfs_buf_read_map+0x54/0x240 xfs_trans_read_buf_map+0x1bd/0x490 xfs_imap_to_bp+0x4f/0x70 xfs_iget+0x300/0xb40 xlog_recover_process_one_iunlink+0x4c/0x170 xlog_recover_process_iunlinks.isra.0+0xee/0x130 xlog_recover_finish+0x57/0x110 xfs_log_mount_finish+0xfc/0x1e0 xfs_mountfs+0x540/0x910 xfs_fs_fill_super+0x495/0x850 get_tree_bdev+0x171/0x270 xfs_fs_get_tree+0x15/0x20 vfs_get_tree+0x24/0xc0 path_mount+0x304/0xba0 __x64_sys_mount+0x108/0x140 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> task:xfsaild/pmem1 state:D stack:14544 pid:324525 ppid: 2 flags:0x00004000 Call Trace: <TASK> __schedule+0x30d/0x9e0 schedule+0x55/0xd0 io_schedule+0x4b/0x80 xfs_buf_wait_unpin+0x9e/0xf0 __xfs_buf_submit+0x14a/0x230 xfs_buf_delwri_submit_buffers+0x107/0x280 xfs_buf_delwri_submit_nowait+0x10/0x20 xfsaild+0x27e/0x9d0 kthread+0xf6/0x120 ret_from_fork+0x1f/0x30 We have the mount process waiting on an inode cluster buffer read, inodegc doing unlink waiting on the same inode cluster buffer, and the AIL push thread blocked in writeback waiting for the inode cluster buffer to become unpinned. What has happened here is that the AIL push thread has raced with the inodegc process modifying, committing and pinning the inode cluster buffer here in xfs_buf_delwri_submit_buffers() here: blk_start_plug(&plug); list_for_each_entry_safe(bp, n, buffer_list, b_list) { if (!wait_list) { if (xfs_buf_ispinned(bp)) { pinned++; continue; } Here >>>>>> if (!xfs_buf_trylock(bp)) continue; Basically, the AIL has found the buffer wasn't pinned and got the lock without blocking, but then the buffer was pinned. This implies the processing here was pre-empted between the pin check and the lock, because the pin count can only be increased while holding the buffer locked. Hence when it has gone to submit the IO, it has blocked waiting for the buffer to be unpinned. With all executing threads now waiting on the buffer to be unpinned, we normally get out of situations like this via the background log worker issuing a log force which will unpinned stuck buffers like this. But at this point in recovery, we haven't started the log worker. In fact, the first thing we do after processing intents and unlinked inodes is *start the log worker*. IOWs, we start it too late to have it break deadlocks like this. Avoid this and any other similar deadlock vectors in intent and unlinked inode recovery by starting the log worker before we recover intents and unlinked inodes. This part of recovery runs as though the filesystem is fully active, so we really should have the same infrastructure running as we normally do at runtime. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NChandan Babu R <chandan.babu@oracle.com> Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NLong Li <leo.lilong@huawei.com> Reviewed-by: NYang Erkun <yangerkun@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-