- 13 7月, 2022 2 次提交
-
-
由 Darrick J. Wong 提交于
Replace the shouty macros here with typechecked helper functions. Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NDave Chinner <dchinner@redhat.com>
-
由 Darrick J. Wong 提交于
Replace this shouty macro with a real C function that has a more descriptive name. Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NDave Chinner <dchinner@redhat.com>
-
- 10 7月, 2022 5 次提交
-
-
由 Darrick J. Wong 提交于
Modify xfs_ifork_ptr to return a NULL pointer if the caller asks for the attribute fork but i_forkoff is zero. This eliminates the ambiguity between i_forkoff and i_af.if_present, which should make it easier to understand the lifetime of attr forks. While we're at it, remove the if_present checks around calls to xfs_idestroy_fork and xfs_ifork_zap_attr since they can both handle attr forks that have already been torn down. Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NDave Chinner <dchinner@redhat.com>
-
由 Darrick J. Wong 提交于
Syzkaller reported a UAF bug a while back: ================================================================== BUG: KASAN: use-after-free in xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127 Read of size 4 at addr ffff88802cec919c by task syz-executor262/2958 CPU: 2 PID: 2958 Comm: syz-executor262 Not tainted 5.15.0-0.30.3-20220406_1406 #3 Hardware name: Red Hat KVM, BIOS 1.13.0-2.module+el8.3.0+7860+a7792d29 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x82/0xa9 lib/dump_stack.c:106 print_address_description.constprop.9+0x21/0x2d5 mm/kasan/report.c:256 __kasan_report mm/kasan/report.c:442 [inline] kasan_report.cold.14+0x7f/0x11b mm/kasan/report.c:459 xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127 xfs_attr_get+0x378/0x4c2 fs/xfs/libxfs/xfs_attr.c:159 xfs_xattr_get+0xe3/0x150 fs/xfs/xfs_xattr.c:36 __vfs_getxattr+0xdf/0x13d fs/xattr.c:399 cap_inode_need_killpriv+0x41/0x5d security/commoncap.c:300 security_inode_need_killpriv+0x4c/0x97 security/security.c:1408 dentry_needs_remove_privs.part.28+0x21/0x63 fs/inode.c:1912 dentry_needs_remove_privs+0x80/0x9e fs/inode.c:1908 do_truncate+0xc3/0x1e0 fs/open.c:56 handle_truncate fs/namei.c:3084 [inline] do_open fs/namei.c:3432 [inline] path_openat+0x30ab/0x396d fs/namei.c:3561 do_filp_open+0x1c4/0x290 fs/namei.c:3588 do_sys_openat2+0x60d/0x98c fs/open.c:1212 do_sys_open+0xcf/0x13c fs/open.c:1228 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0x0 RIP: 0033:0x7f7ef4bb753d Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1b 79 2c 00 f7 d8 64 89 01 48 RSP: 002b:00007f7ef52c2ed8 EFLAGS: 00000246 ORIG_RAX: 0000000000000055 RAX: ffffffffffffffda RBX: 0000000000404148 RCX: 00007f7ef4bb753d RDX: 00007f7ef4bb753d RSI: 0000000000000000 RDI: 0000000020004fc0 RBP: 0000000000404140 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0030656c69662f2e R13: 00007ffd794db37f R14: 00007ffd794db470 R15: 00007f7ef52c2fc0 </TASK> Allocated by task 2953: kasan_save_stack+0x19/0x38 mm/kasan/common.c:38 kasan_set_track mm/kasan/common.c:46 [inline] set_alloc_info mm/kasan/common.c:434 [inline] __kasan_slab_alloc+0x68/0x7c mm/kasan/common.c:467 kasan_slab_alloc include/linux/kasan.h:254 [inline] slab_post_alloc_hook mm/slab.h:519 [inline] slab_alloc_node mm/slub.c:3213 [inline] slab_alloc mm/slub.c:3221 [inline] kmem_cache_alloc+0x11b/0x3eb mm/slub.c:3226 kmem_cache_zalloc include/linux/slab.h:711 [inline] xfs_ifork_alloc+0x25/0xa2 fs/xfs/libxfs/xfs_inode_fork.c:287 xfs_bmap_add_attrfork+0x3f2/0x9b1 fs/xfs/libxfs/xfs_bmap.c:1098 xfs_attr_set+0xe38/0x12a7 fs/xfs/libxfs/xfs_attr.c:746 xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59 __vfs_setxattr+0x11b/0x177 fs/xattr.c:180 __vfs_setxattr_noperm+0x128/0x5e0 fs/xattr.c:214 __vfs_setxattr_locked+0x1d4/0x258 fs/xattr.c:275 vfs_setxattr+0x154/0x33d fs/xattr.c:301 setxattr+0x216/0x29f fs/xattr.c:575 __do_sys_fsetxattr fs/xattr.c:632 [inline] __se_sys_fsetxattr fs/xattr.c:621 [inline] __x64_sys_fsetxattr+0x243/0x2fe fs/xattr.c:621 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0x0 Freed by task 2949: kasan_save_stack+0x19/0x38 mm/kasan/common.c:38 kasan_set_track+0x1c/0x21 mm/kasan/common.c:46 kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:360 ____kasan_slab_free mm/kasan/common.c:366 [inline] ____kasan_slab_free mm/kasan/common.c:328 [inline] __kasan_slab_free+0xe2/0x10e mm/kasan/common.c:374 kasan_slab_free include/linux/kasan.h:230 [inline] slab_free_hook mm/slub.c:1700 [inline] slab_free_freelist_hook mm/slub.c:1726 [inline] slab_free mm/slub.c:3492 [inline] kmem_cache_free+0xdc/0x3ce mm/slub.c:3508 xfs_attr_fork_remove+0x8d/0x132 fs/xfs/libxfs/xfs_attr_leaf.c:773 xfs_attr_sf_removename+0x5dd/0x6cb fs/xfs/libxfs/xfs_attr_leaf.c:822 xfs_attr_remove_iter+0x68c/0x805 fs/xfs/libxfs/xfs_attr.c:1413 xfs_attr_remove_args+0xb1/0x10d fs/xfs/libxfs/xfs_attr.c:684 xfs_attr_set+0xf1e/0x12a7 fs/xfs/libxfs/xfs_attr.c:802 xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59 __vfs_removexattr+0x106/0x16a fs/xattr.c:468 cap_inode_killpriv+0x24/0x47 security/commoncap.c:324 security_inode_killpriv+0x54/0xa1 security/security.c:1414 setattr_prepare+0x1a6/0x897 fs/attr.c:146 xfs_vn_change_ok+0x111/0x15e fs/xfs/xfs_iops.c:682 xfs_vn_setattr_size+0x5f/0x15a fs/xfs/xfs_iops.c:1065 xfs_vn_setattr+0x125/0x2ad fs/xfs/xfs_iops.c:1093 notify_change+0xae5/0x10a1 fs/attr.c:410 do_truncate+0x134/0x1e0 fs/open.c:64 handle_truncate fs/namei.c:3084 [inline] do_open fs/namei.c:3432 [inline] path_openat+0x30ab/0x396d fs/namei.c:3561 do_filp_open+0x1c4/0x290 fs/namei.c:3588 do_sys_openat2+0x60d/0x98c fs/open.c:1212 do_sys_open+0xcf/0x13c fs/open.c:1228 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0x0 The buggy address belongs to the object at ffff88802cec9188 which belongs to the cache xfs_ifork of size 40 The buggy address is located 20 bytes inside of 40-byte region [ffff88802cec9188, ffff88802cec91b0) The buggy address belongs to the page: page:00000000c3af36a1 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2cec9 flags: 0xfffffc0000200(slab|node=0|zone=1|lastcpupid=0x1fffff) raw: 000fffffc0000200 ffffea00009d2580 0000000600000006 ffff88801a9ffc80 raw: 0000000000000000 0000000080490049 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88802cec9080: fb fb fb fc fc fa fb fb fb fb fc fc fb fb fb fb ffff88802cec9100: fb fc fc fb fb fb fb fb fc fc fb fb fb fb fb fc >ffff88802cec9180: fc fa fb fb fb fb fc fc fa fb fb fb fb fc fc fb ^ ffff88802cec9200: fb fb fb fb fc fc fb fb fb fb fb fc fc fb fb fb ffff88802cec9280: fb fb fc fc fa fb fb fb fb fc fc fa fb fb fb fb ================================================================== The root cause of this bug is the unlocked access to xfs_inode.i_afp from the getxattr code paths while trying to determine which ILOCK mode to use to stabilize the xattr data. Unfortunately, the VFS does not acquire i_rwsem when vfs_getxattr (or listxattr) call into the filesystem, which means that getxattr can race with a removexattr that's tearing down the attr fork and crash: xfs_attr_set: xfs_attr_get: xfs_attr_fork_remove: xfs_ilock_attr_map_shared: xfs_idestroy_fork(ip->i_afp); kmem_cache_free(xfs_ifork_cache, ip->i_afp); if (ip->i_afp && ip->i_afp = NULL; xfs_need_iread_extents(ip->i_afp)) <KABOOM> ip->i_forkoff = 0; Regrettably, the VFS is much more lax about i_rwsem and getxattr than is immediately obvious -- not only does it not guarantee that we hold i_rwsem, it actually doesn't guarantee that we *don't* hold it either. The getxattr system call won't acquire the lock before calling XFS, but the file capabilities code calls getxattr with and without i_rwsem held to determine if the "security.capabilities" xattr is set on the file. Fixing the VFS locking requires a treewide investigation into every code path that could touch an xattr and what i_rwsem state it expects or sets up. That could take years or even prove impossible; fortunately, we can fix this UAF problem inside XFS. An earlier version of this patch used smp_wmb in xfs_attr_fork_remove to ensure that i_forkoff is always zeroed before i_afp is set to null and changed the read paths to use smp_rmb before accessing i_forkoff and i_afp, which avoided these UAF problems. However, the patch author was too busy dealing with other problems in the meantime, and by the time he came back to this issue, the situation had changed a bit. On a modern system with selinux, each inode will always have at least one xattr for the selinux label, so it doesn't make much sense to keep incurring the extra pointer dereference. Furthermore, Allison's upcoming parent pointer patchset will also cause nearly every inode in the filesystem to have extended attributes. Therefore, make the inode attribute fork structure part of struct xfs_inode, at a cost of 40 more bytes. This patch adds a clunky if_present field where necessary to maintain the existing logic of xattr fork null pointer testing in the existing codebase. The next patch switches the logic over to XFS_IFORK_Q and it all goes away. Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NDave Chinner <dchinner@redhat.com>
-
由 Darrick J. Wong 提交于
We're about to make this logic do a bit more, so convert the macro to a static inline function for better typechecking and fewer shouty macros. No functional changes here. Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NDave Chinner <dchinner@redhat.com>
-
由 Andrey Strachuk 提交于
At line 1561, variable "state" is being compared with NULL every loop iteration. ------------------------------------------------------------------- 1561 for (i = 0; state != NULL && i < state->path.active; i++) { 1562 xfs_trans_brelse(args->trans, state->path.blk[i].bp); 1563 state->path.blk[i].bp = NULL; 1564 } ------------------------------------------------------------------- However, it cannot be NULL. ---------------------------------------- 1546 state = xfs_da_state_alloc(args); ---------------------------------------- xfs_da_state_alloc calls kmem_cache_zalloc. kmem_cache_zalloc is called with __GFP_NOFAIL flag and, therefore, it cannot return NULL. -------------------------------------------------------------------------- struct xfs_da_state * xfs_da_state_alloc( struct xfs_da_args *args) { struct xfs_da_state *state; state = kmem_cache_zalloc(xfs_da_state_cache, GFP_NOFS | __GFP_NOFAIL); state->args = args; state->mp = args->dp->i_mount; return state; } -------------------------------------------------------------------------- Found by Linux Verification Center (linuxtesting.org) with SVACE. Signed-off-by: NAndrey Strachuk <strochuk@ispras.ru> Fixes: 4d0cdd2b ("xfs: clean up xfs_attr_node_hasname") Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Eric Sandeen 提交于
We got a report that "renameat2() with flags=RENAME_WHITEOUT doesn't apply an SELinux label on xfs" as it does on other filesystems (for example, ext4 and tmpfs.) While I'm not quite sure how labels may interact w/ whiteout files, leaving them as unlabeled seems inconsistent at best. Now that xfs_init_security is not static, rename it to xfs_inode_init_security per dchinner's suggestion. Signed-off-by: NEric Sandeen <sandeen@redhat.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
-
- 07 7月, 2022 23 次提交
-
-
由 Dave Chinner 提交于
We check if an ag contains the log in many places, so make this a first class XFS helper by lifting it to fs/xfs/libxfs/xfs_ag.h and renaming it xfs_ag_contains_log(). The convert all the places that check if the AG contains the log to use this helper. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Dave Chinner 提交于
Many of the places that call xfs_ag_block_count() have a perag available. These places can just read pag->block_count directly instead of calculating the AG block count from first principles. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Dave Chinner 提交于
There is a lot of overhead in functions like xfs_verify_agino() that repeatedly calculate the geometry limits of an AG. These can be pre-calculated as they are static and the verification context has a per-ag context it can quickly reference. In the case of xfs_verify_agino(), we now always have a perag context handy, so we can store the minimum and maximum agino values in the AG in the perag. This means we don't have to calculate it on every call and it can be inlined in callers if we move it to xfs_ag.h. xfs_verify_agino_or_null() gets the same perag treatment. xfs_agino_range() is moved to xfs_ag.c as it's not really a type function, and it's use is largely restricted as the first and last aginos can be grabbed straight from the perag in most cases. Note that we leave the original xfs_verify_agino in place in xfs_types.c as a static function as other callers in that file do not have per-ag contexts so still need to go the long way. It's been renamed to xfs_verify_agno_agino() to indicate it takes both an agno and an agino to differentiate it from new function. $ size --totals fs/xfs/built-in.a text data bss dec hex filename before 1482185 329588 572 1812345 1ba779 (TOTALS) after 1481937 329588 572 1812097 1ba681 (TOTALS) Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Dave Chinner 提交于
There is a lot of overhead in functions like xfs_verify_agbno() that repeatedly calculate the geometry limits of an AG. These can be pre-calculated as they are static and the verification context has a per-ag context it can quickly reference. In the case of xfs_verify_agbno(), we now always have a perag context handy, so we can store the AG length and the minimum valid block in the AG in the perag. This means we don't have to calculate it on every call and it can be inlined in callers if we move it to xfs_ag.h. Move xfs_ag_block_count() to xfs_ag.c because it's really a per-ag function and not an XFS type function. We need a little bit of rework that is specific to xfs_initialise_perag() to allow growfs to calculate the new perag sizes before we've updated the primary superblock during the grow (chicken/egg situation). Note that we leave the original xfs_verify_agbno in place in xfs_types.c as a static function as other callers in that file do not have per-ag contexts so still need to go the long way. It's been renamed to xfs_verify_agno_agbno() to indicate it takes both an agno and an agbno to differentiate it from new function. Future commits will make similar changes for other per-ag geometry validation functions. Further: $ size --totals fs/xfs/built-in.a text data bss dec hex filename before 1483006 329588 572 1813166 1baaae (TOTALS) after 1482185 329588 572 1812345 1ba779 (TOTALS) This rework reduces the binary size by ~820 bytes, indicating that much less work is being done to bounds check the agbno values against on per-ag geometry information. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Dave Chinner 提交于
We have the perag in most places we call xfs_alloc_read_agfl, so pass the perag instead of a mount/agno pair. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Dave Chinner 提交于
It's available in all callers, so pass it in so that the perag can be passed further down the stack. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Dave Chinner 提交于
It's available in all callers, so pass it in so that the perag can be passed further down the stack. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Dave Chinner 提交于
We have the perag in most places we call xfs_read_agf, so pass the perag instead of a mount/agno pair. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Dave Chinner 提交于
We have the perag in most palces we call xfs_read_agi, so pass the perag instead of a mount/agno pair. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Dave Chinner 提交于
xfs_alloc_read_agf() initialises the perag if it hasn't been done yet, so it makes sense to pass it the perag rather than pull a reference from the buffer. This allows callers to be per-ag centric rather than passing mount/agno pairs everywhere. Whilst modifying the xfs_reflink_find_shared() function definition, declare it static and remove the extern declaration as it is an internal function only these days. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Dave Chinner 提交于
Trivial wrapper around xfs_alloc_read_agf(), can be easily replaced by passing a NULL agfbp to xfs_alloc_read_agf(). Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Dave Chinner 提交于
xfs_ialloc_read_agi() initialises the perag if it hasn't been done yet, so it makes sense to pass it the perag rather than pull a reference from the buffer. This allows callers to be per-ag centric rather than passing mount/agno pairs everywhere. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Dave Chinner 提交于
This is just a basic wrapper around xfs_ialloc_read_agi(), which can be entirely handled by xfs_ialloc_read_agi() by passing a NULL agibpp.... Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Dave Chinner 提交于
Because the perag must exist for these operations, look it up as part of the common shrink operations and pass it instead of the mount/agno pair. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Dave Chinner 提交于
When xlog_sync() rounds off the tail the iclog that is being flushed, it manually subtracts that space from the grant heads. This space is actually reserved by the transaction ticket that covers the xlog_sync() call from xlog_write(), but we don't plumb the ticket down far enough for it to account for the space consumed in the current log ticket. The grant heads are hot, so we really should be accounting this to the ticket is we can, rather than adding thousands of extra grant head updates every CIL commit. Interestingly, this actually indicates a potential log space overrun can occur when we force the log. By the time that xfs_log_force() pushes out an active iclog and consumes the roundoff space, the reservation for that roundoff space has been returned to the grant heads and is no longer covered by a reservation. In theory the roundoff added to log force on an already full log could push the write head past the tail. In practice, the CIL commit that writes to the log and needs the iclog pushed will have reserved space for roundoff, so when it releases the ticket there will still be physical space for the roundoff to be committed to the log, even though it is no longer reserved. This roundoff won't be enough space to allow a transaction to be woken if the log is full, so overruns should not actually occur in practice. That said, it indicates that we should not release the CIL context log ticket until after we've released the commit iclog. It also means that xlog_sync() still needs the direct grant head manipulation if we don't provide it with a ticket. Log forces are rare when we are in fast paths running 1.5 million transactions/s that make the grant heads hot, so let's optimise the hot case and pass CIL log tickets down to the xlog_sync() code. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Dave Chinner 提交于
Because now it hurts when the CIL fills up. - 37.20% __xfs_trans_commit - 35.84% xfs_log_commit_cil - 19.34% _raw_spin_lock - do_raw_spin_lock 19.01% __pv_queued_spin_lock_slowpath - 4.20% xfs_log_ticket_ungrant 0.90% xfs_log_space_wake Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Dave Chinner 提交于
Adding a list_sort() call to the CIL push work while the xc_ctx_lock is held exclusively has resulted in fairly long lock hold times and that stops all front end transaction commits from making progress. We can move the sorting out of the xc_ctx_lock if we can transfer the ordering information to the log vectors as they are detached from the log items and then we can sort the log vectors. With these changes, we can move the list_sort() call to just before we call xlog_write() when we aren't holding any locks at all. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Dave Chinner 提交于
Because the next change is going to require sorting log vectors, and that requires arbitrary rearrangement of the list which cannot be done easily with a single linked list. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Dave Chinner 提交于
So that we can remove the cil_lock which is a global serialisation point. We've already got ordering sorted, so all we need to do is treat the CIL list like the busy extent list and reconstruct it before the push starts. This is what we're trying to avoid: - 75.35% 1.83% [kernel] [k] xfs_log_commit_cil - 46.35% xfs_log_commit_cil - 41.54% _raw_spin_lock - 67.30% do_raw_spin_lock 66.96% __pv_queued_spin_lock_slowpath Which happens on a 32p system when running a 32-way 'rm -rf' workload. After this patch: - 20.90% 3.23% [kernel] [k] xfs_log_commit_cil - 17.67% xfs_log_commit_cil - 6.51% xfs_log_ticket_ungrant 1.40% xfs_log_space_wake 2.32% memcpy_erms - 2.18% xfs_buf_item_committing - 2.12% xfs_buf_item_release - 1.03% xfs_buf_unlock 0.96% up 0.72% xfs_buf_rele 1.33% xfs_inode_item_format 1.19% down_read 0.91% up_read 0.76% xfs_buf_item_format - 0.68% kmem_alloc_large - 0.67% kmem_alloc 0.64% __kmalloc 0.50% xfs_buf_item_size It kinda looks like the workload is running out of log space all the time. But all the spinlock contention is gone and the transaction commit rate has gone from 800k/s to 1.3M/s so the amount of real work being done has gone up a *lot*. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Dave Chinner 提交于
Before we split the ordered CIL up into per cpu lists, we need a mechanism to track the order of the items in the CIL. We need to do this because there are rules around the order in which related items must physically appear in the log even inside a single checkpoint transaction. An example of this is intents - an intent must appear in the log before it's intent done record so that log recovery can cancel the intent correctly. If we have these two records misordered in the CIL, then they will not be recovered correctly by journal replay. We also will not be able to move items to the tail of the CIL list when they are relogged, hence the log items will need some mechanism to allow the correct log item order to be recreated before we write log items to the hournal. Hence we need to have a mechanism for recording global order of transactions in the log items so that we can recover that order from un-ordered per-cpu lists. Do this with a simple monotonic increasing commit counter in the CIL context. Each log item in the transaction gets stamped with the current commit order ID before it is added to the CIL. If the item is already in the CIL, leave it where it is instead of moving it to the tail of the list and instead sort the list before we start the push work. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Dave Chinner 提交于
To get them out from under the CIL lock. This is an unordered list, so we can simply punt it to per-cpu lists during transaction commits and reaggregate it back into a single list during the CIL push work. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Dave Chinner 提交于
To get it out from under the cil spinlock. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Dave Chinner 提交于
Now that we have the CIL percpu structures in place, implement the space used counter as a per-cpu counter. We have to be really careful now about ensuring that the checks and updates run without arbitrary delays, which means they need to run with pre-emption disabled. We do this by careful placement of the get_cpu_ptr/put_cpu_ptr calls to access the per-cpu structures for that CPU. We need to be able to reliably detect that the CIL has reached the hard limit threshold so we can take extra reservations for the iclog headers when the space used overruns the original reservation. hence we factor out xlog_cil_over_hard_limit() from xlog_cil_push_background(). The global CIL space used is an atomic variable that is backed by per-cpu aggregation to minimise the number of atomic updates we do to the global state in the fast path. While we are under the soft limit, we aggregate only when the per-cpu aggregation is over the proportion of the soft limit assigned to that CPU. This means that all CPUs can use all but one byte of their aggregation threshold and we will not go over the soft limit. Hence once we detect that we've gone over both a per-cpu aggregation threshold and the soft limit, we know that we have only exceeded the soft limit by one per-cpu aggregation threshold. Even if all CPUs hit this at the same time, we can't be over the hard limit, so we can run an aggregation back into the atomic counter at this point and still be under the hard limit. At this point, we will be over the soft limit and hence we'll aggregate into the global atomic used space directly rather than the per-cpu counters, hence providing accurate detection of hard limit excursion for accounting and reservation purposes. Hence we get the best of both worlds - lockless, scalable per-cpu fast path plus accurate, atomic detection of hard limit excursion. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
-
- 02 7月, 2022 5 次提交
-
-
由 Dave Chinner 提交于
The CIL push lock is highly contended on larger machines, becoming a hard bottleneck that about 700,000 transaction commits/s on >16p machines. To address this, start moving the CIL tracking infrastructure to utilise per-CPU structures. We need to track the space used, the amount of log reservation space reserved to write the CIL, the log items in the CIL and the busy extents that need to be completed by the CIL commit. This requires a couple of per-cpu counters, an unordered per-cpu list and a globally ordered per-cpu list. Create a per-cpu structure to hold these and all the management interfaces needed, as well as the hooks to handle hotplug CPUs. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Dave Chinner 提交于
For every iclog that a CIL push will use up, we need to ensure we have space reserved for the iclog header in each iclog. It is extremely difficult to do this accurately with a per-cpu counter without expensive summing of the counter in every commit. However, we know what the maximum CIL size is going to be because of the hard space limit we have, and hence we know exactly how many iclogs we are going to need to write out the CIL. We are constrained by the requirement that small transactions only have reservation space for a single iclog header built into them. At commit time we don't know how much of the current transaction reservation is made up of iclog header reservations as calculated by xfs_log_calc_unit_res() when the ticket was reserved. As larger reservations have multiple header spaces reserved, we can steal more than one iclog header reservation at a time, but we only steal the exact number needed for the given log vector size delta. As a result, we don't know exactly when we are going to steal iclog header reservations, nor do we know exactly how many we are going to need for a given CIL. To make things simple, start by calculating the worst case number of iclog headers a full CIL push will require. Record this into an atomic variable in the CIL. Then add a byte counter to the log ticket that records exactly how much iclog header space has been reserved in this ticket by xfs_log_calc_unit_res(). This tells us exactly how much space we can steal from the ticket at transaction commit time. Now, at transaction commit time, we can check if the CIL has a full iclog header reservation and, if not, steal the entire reservation the current ticket holds for iclog headers. This minimises the number of times we need to do atomic operations in the fast path, but still guarantees we get all the reservations we need. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Dave Chinner 提交于
The xc_cil_lock is the most highly contended lock in XFS now. To start the process of getting rid of it, lift the initial reservation of the CIL log space out from under the xc_cil_lock. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Dave Chinner 提交于
In the next patches we are going to make the CIL list itself per-cpu, and so we cannot use list_empty() to check is the list is empty. Replace the list_empty() checks with a flag in the CIL to indicate we have committed at least one transaction to the CIL and hence the CIL is not empty. We need this flag to be an atomic so that we can clear it without holding any locks in the commit fast path, but we also need to be careful to avoid atomic operations in the fast path. Hence we use the fact that test_bit() is not an atomic op to first check if the flag is set and then run the atomic test_and_clear_bit() operation to clear it and steal the initial unit reservation for the CIL context checkpoint. When we are switching to a new context in a push, we place the setting of the XLOG_CIL_EMPTY flag under the xc_push_lock. THis allows all the other places that need to check whether the CIL is empty to use test_bit() and still be serialised correctly with the CIL context swaps that set the bit. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
-
由 Darrick J. Wong 提交于
KASAN reported the following use after free bug when running generic/475: XFS (dm-0): Mounting V5 Filesystem XFS (dm-0): Starting recovery (logdev: internal) XFS (dm-0): Ending recovery (logdev: internal) Buffer I/O error on dev dm-0, logical block 20639616, async page read Buffer I/O error on dev dm-0, logical block 20639617, async page read XFS (dm-0): log I/O error -5 XFS (dm-0): Filesystem has been shut down due to log error (0x2). XFS (dm-0): Unmounting Filesystem XFS (dm-0): Please unmount the filesystem and rectify the problem(s). ================================================================== BUG: KASAN: use-after-free in do_raw_spin_lock+0x246/0x270 Read of size 4 at addr ffff888109dd84c4 by task 3:1H/136 CPU: 3 PID: 136 Comm: 3:1H Not tainted 5.19.0-rc4-xfsx #rc4 8e53ab5ad0fddeb31cee5e7063ff9c361915a9c4 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014 Workqueue: xfs-log/dm-0 xlog_ioend_work [xfs] Call Trace: <TASK> dump_stack_lvl+0x34/0x44 print_report.cold+0x2b8/0x661 ? do_raw_spin_lock+0x246/0x270 kasan_report+0xab/0x120 ? do_raw_spin_lock+0x246/0x270 do_raw_spin_lock+0x246/0x270 ? rwlock_bug.part.0+0x90/0x90 xlog_force_shutdown+0xf6/0x370 [xfs 4ad76ae0d6add7e8183a553e624c31e9ed567318] xlog_ioend_work+0x100/0x190 [xfs 4ad76ae0d6add7e8183a553e624c31e9ed567318] process_one_work+0x672/0x1040 worker_thread+0x59b/0xec0 ? __kthread_parkme+0xc6/0x1f0 ? process_one_work+0x1040/0x1040 ? process_one_work+0x1040/0x1040 kthread+0x29e/0x340 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 </TASK> Allocated by task 154099: kasan_save_stack+0x1e/0x40 __kasan_kmalloc+0x81/0xa0 kmem_alloc+0x8d/0x2e0 [xfs] xlog_cil_init+0x1f/0x540 [xfs] xlog_alloc_log+0xd1e/0x1260 [xfs] xfs_log_mount+0xba/0x640 [xfs] xfs_mountfs+0xf2b/0x1d00 [xfs] xfs_fs_fill_super+0x10af/0x1910 [xfs] get_tree_bdev+0x383/0x670 vfs_get_tree+0x7d/0x240 path_mount+0xdb7/0x1890 __x64_sys_mount+0x1fa/0x270 do_syscall_64+0x2b/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 Freed by task 154151: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x30 kasan_set_free_info+0x20/0x30 ____kasan_slab_free+0x110/0x190 slab_free_freelist_hook+0xab/0x180 kfree+0xbc/0x310 xlog_dealloc_log+0x1b/0x2b0 [xfs] xfs_unmountfs+0x119/0x200 [xfs] xfs_fs_put_super+0x6e/0x2e0 [xfs] generic_shutdown_super+0x12b/0x3a0 kill_block_super+0x95/0xd0 deactivate_locked_super+0x80/0x130 cleanup_mnt+0x329/0x4d0 task_work_run+0xc5/0x160 exit_to_user_mode_prepare+0xd4/0xe0 syscall_exit_to_user_mode+0x1d/0x40 entry_SYSCALL_64_after_hwframe+0x46/0xb0 This appears to be a race between the unmount process, which frees the CIL and waits for in-flight iclog IO; and the iclog IO completion. When generic/475 runs, it starts fsstress in the background, waits a few seconds, and substitutes a dm-error device to simulate a disk falling out of a machine. If the fsstress encounters EIO on a pure data write, it will exit but the filesystem will still be online. The next thing the test does is unmount the filesystem, which tries to clean the log, free the CIL, and wait for iclog IO completion. If an iclog was being written when the dm-error switch occurred, it can race with log unmounting as follows: Thread 1 Thread 2 xfs_log_unmount xfs_log_clean xfs_log_quiesce xlog_ioend_work <observe error> xlog_force_shutdown test_and_set_bit(XLOG_IOERROR) xfs_log_force <log is shut down, nop> xfs_log_umount_write <log is shut down, nop> xlog_dealloc_log xlog_cil_destroy <wait for iclogs> spin_lock(&log->l_cilp->xc_push_lock) <KABOOM> Therefore, free the CIL after waiting for the iclogs to complete. I /think/ this race has existed for quite a few years now, though I don't remember the ~2014 era logging code well enough to know if it was a real threat then or if the actual race was exposed only more recently. Fixes: ac983517 ("xfs: don't sleep in xlog_cil_force_lsn on shutdown") Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NDave Chinner <dchinner@redhat.com>
-
- 01 7月, 2022 4 次提交
-
-
由 Amir Goldstein 提交于
A regression has been reported by Nicolas Boichat, found while using the copy_file_range syscall to copy a tracefs file. Before commit 5dae222a ("vfs: allow copy_file_range to copy across devices") the kernel would return -EXDEV to userspace when trying to copy a file across different filesystems. After this commit, the syscall doesn't fail anymore and instead returns zero (zero bytes copied), as this file's content is generated on-the-fly and thus reports a size of zero. Another regression has been reported by He Zhe - the assertion of WARN_ON_ONCE(ret == -EOPNOTSUPP) can be triggered from userspace when copying from a sysfs file whose read operation may return -EOPNOTSUPP. Since we do not have test coverage for copy_file_range() between any two types of filesystems, the best way to avoid these sort of issues in the future is for the kernel to be more picky about filesystems that are allowed to do copy_file_range(). This patch restores some cross-filesystem copy restrictions that existed prior to commit 5dae222a ("vfs: allow copy_file_range to copy across devices"), namely, cross-sb copy is not allowed for filesystems that do not implement ->copy_file_range(). Filesystems that do implement ->copy_file_range() have full control of the result - if this method returns an error, the error is returned to the user. Before this change this was only true for fs that did not implement the ->remap_file_range() operation (i.e. nfsv3). Filesystems that do not implement ->copy_file_range() still fall-back to the generic_copy_file_range() implementation when the copy is within the same sb. This helps the kernel can maintain a more consistent story about which filesystems support copy_file_range(). nfsd and ksmbd servers are modified to fall-back to the generic_copy_file_range() implementation in case vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV, which preserves behavior of server-side-copy. fall-back to generic_copy_file_range() is not implemented for the smb operation FSCTL_DUPLICATE_EXTENTS_TO_FILE, which is arguably a correct change of behavior. Fixes: 5dae222a ("vfs: allow copy_file_range to copy across devices") Link: https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drinkcat@chromium.org/ Link: https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx+BnvW=ZmpsG47CyHFJwnw7zSX6Q@mail.gmail.com/ Link: https://lore.kernel.org/linux-fsdevel/20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid/ Link: https://lore.kernel.org/linux-fsdevel/20210630161320.29006-1-lhenriques@suse.de/Reported-by: NNicolas Boichat <drinkcat@chromium.org> Reported-by: Nkernel test robot <oliver.sang@intel.com> Signed-off-by: NLuis Henriques <lhenriques@suse.de> Fixes: 64bf5ff5 ("vfs: no fallback for ->copy_file_range") Link: https://lore.kernel.org/linux-fsdevel/20f17f64-88cb-4e80-07c1-85cb96c83619@windriver.com/Reported-by: NHe Zhe <zhe.he@windriver.com> Tested-by: NNamjae Jeon <linkinjeon@kernel.org> Tested-by: NLuis Henriques <lhenriques@suse.de> Signed-off-by: NAmir Goldstein <amir73il@gmail.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Scott Mayhew 提交于
This was missed in c3ed2227 ("NFSv4: Fix free of uninitialized nfs4_label on referral lookup.") and causes a panic when mounting with '-o trunkdiscovery': PID: 1604 TASK: ffff93dac3520000 CPU: 3 COMMAND: "mount.nfs" #0 [ffffb79140f738f8] machine_kexec at ffffffffaec64bee #1 [ffffb79140f73950] __crash_kexec at ffffffffaeda67fd #2 [ffffb79140f73a18] crash_kexec at ffffffffaeda76ed #3 [ffffb79140f73a30] oops_end at ffffffffaec2658d #4 [ffffb79140f73a50] general_protection at ffffffffaf60111e [exception RIP: nfs_fattr_init+0x5] RIP: ffffffffc0c18265 RSP: ffffb79140f73b08 RFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff93dac304a800 RCX: 0000000000000000 RDX: ffffb79140f73bb0 RSI: ffff93dadc8cbb40 RDI: d03ee11cfaf6bd50 RBP: ffffb79140f73be8 R8: ffffffffc0691560 R9: 0000000000000006 R10: ffff93db3ffd3df8 R11: 0000000000000000 R12: ffff93dac4040000 R13: ffff93dac2848e00 R14: ffffb79140f73b60 R15: ffffb79140f73b30 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #5 [ffffb79140f73b08] _nfs41_proc_get_locations at ffffffffc0c73d53 [nfsv4] #6 [ffffb79140f73bf0] nfs4_proc_get_locations at ffffffffc0c83e90 [nfsv4] #7 [ffffb79140f73c60] nfs4_discover_trunking at ffffffffc0c83fb7 [nfsv4] #8 [ffffb79140f73cd8] nfs_probe_fsinfo at ffffffffc0c0f95f [nfs] #9 [ffffb79140f73da0] nfs_probe_server at ffffffffc0c1026a [nfs] RIP: 00007f6254fce26e RSP: 00007ffc69496ac8 RFLAGS: 00000246 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f6254fce26e RDX: 00005600220a82a0 RSI: 00005600220a64d0 RDI: 00005600220a6520 RBP: 00007ffc69496c50 R8: 00005600220a8710 R9: 003035322e323231 R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffc69496c50 R13: 00005600220a8440 R14: 0000000000000010 R15: 0000560020650ef9 ORIG_RAX: 00000000000000a5 CS: 0033 SS: 002b Fixes: c3ed2227 ("NFSv4: Fix free of uninitialized nfs4_label on referral lookup.") Signed-off-by: NScott Mayhew <smayhew@redhat.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 NeilBrown 提交于
Commit f49169c9 ("NFSD: Remove svc_serv_ops::svo_module") removed calls to module_put_and_kthread_exit() from threads that acted as SUNRPC servers and had a related svc_serv_ops structure. This was correct. It ALSO removed the module_put_and_kthread_exit() call from nfs4_run_state_manager() which is NOT a SUNRPC service. Consequently every time the NFSv4 state manager runs the module count increments and won't be decremented. So the nfsv4 module cannot be unloaded. So restore the module_put_and_kthread_exit() call. Fixes: f49169c9 ("NFSD: Remove svc_serv_ops::svo_module") Signed-off-by: NNeilBrown <neilb@suse.de> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Dylan Yudaken 提交于
io_import_iovec uses the s pointer, but this was changed immediately after the iovec was re-imported and so it was imported into the wrong place. Change the ordering. Fixes: 2be2eb02 ("io_uring: ensure reads re-import for selected buffers") Signed-off-by: NDylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220630132006.2825668-1-dylany@fb.com [axboe: ensure we don't half-import as well] Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 30 6月, 2022 1 次提交
-
-
由 Pavel Begunkov 提交于
We waste a u64 SQE field for flags even though we don't need as many bits and it can be used for something more useful later. Store io_uring specific send/recv flags in sqe->ioprio instead of ->addr2. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Fixes: 0455d4cc ("io_uring: add POLL_FIRST support for send/sendmsg and recv/recvmsg") [axboe: change comment in io_uring.h as well] Signed-off-by: NJens Axboe <axboe@kernel.dk>
-