1. 13 Jul, 2022 2 commits
  2. 10 Jul, 2022 5 commits
    • D
      xfs: use XFS_IFORK_Q to determine the presence of an xattr fork · e45d7cb2
      Committed by Darrick J. Wong
      Modify xfs_ifork_ptr to return a NULL pointer if the caller asks for the
      attribute fork but i_forkoff is zero.  This eliminates the ambiguity
      between i_forkoff and i_af.if_present, which should make it easier to
      understand the lifetime of attr forks.
      
      While we're at it, remove the if_present checks around calls to
      xfs_idestroy_fork and xfs_ifork_zap_attr since they can both handle attr
      forks that have already been torn down.
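As a rough illustration of the lookup rule described above, here is a minimal stand-alone sketch. The types and names (`demo_inode`, `demo_ifork`, `demo_ifork_ptr`, `DEMO_ATTR_FORK`) are hypothetical stand-ins rather than the kernel's definitions. It assumes the attr fork is embedded in the inode and that a nonzero fork offset is the only marker of its presence.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the XFS structures. */
enum demo_whichfork { DEMO_DATA_FORK, DEMO_ATTR_FORK };

struct demo_ifork {
	int	nextents;
};

struct demo_inode {
	struct demo_ifork	df;		/* data fork, always present */
	struct demo_ifork	af;		/* attr fork, embedded in the inode */
	unsigned char		forkoff;	/* 0 means "no attr fork" */
};

/* Return the requested fork, or NULL for an absent attr fork. */
static inline struct demo_ifork *
demo_ifork_ptr(struct demo_inode *ip, enum demo_whichfork whichfork)
{
	switch (whichfork) {
	case DEMO_DATA_FORK:
		return &ip->df;
	case DEMO_ATTR_FORK:
		if (!ip->forkoff)
			return NULL;
		return &ip->af;
	default:
		return NULL;
	}
}
```

With a single source of truth for "is there an attr fork", callers only need to test the returned pointer.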
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • D
      xfs: make inode attribute forks a permanent part of struct xfs_inode · 2ed5b09b
      Committed by Darrick J. Wong
      Syzkaller reported a UAF bug a while back:
      
      ==================================================================
      BUG: KASAN: use-after-free in xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127
      Read of size 4 at addr ffff88802cec919c by task syz-executor262/2958
      
      CPU: 2 PID: 2958 Comm: syz-executor262 Not tainted
      5.15.0-0.30.3-20220406_1406 #3
      Hardware name: Red Hat KVM, BIOS 1.13.0-2.module+el8.3.0+7860+a7792d29
      04/01/2014
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0x82/0xa9 lib/dump_stack.c:106
       print_address_description.constprop.9+0x21/0x2d5 mm/kasan/report.c:256
       __kasan_report mm/kasan/report.c:442 [inline]
       kasan_report.cold.14+0x7f/0x11b mm/kasan/report.c:459
       xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127
       xfs_attr_get+0x378/0x4c2 fs/xfs/libxfs/xfs_attr.c:159
       xfs_xattr_get+0xe3/0x150 fs/xfs/xfs_xattr.c:36
       __vfs_getxattr+0xdf/0x13d fs/xattr.c:399
       cap_inode_need_killpriv+0x41/0x5d security/commoncap.c:300
       security_inode_need_killpriv+0x4c/0x97 security/security.c:1408
       dentry_needs_remove_privs.part.28+0x21/0x63 fs/inode.c:1912
       dentry_needs_remove_privs+0x80/0x9e fs/inode.c:1908
       do_truncate+0xc3/0x1e0 fs/open.c:56
       handle_truncate fs/namei.c:3084 [inline]
       do_open fs/namei.c:3432 [inline]
       path_openat+0x30ab/0x396d fs/namei.c:3561
       do_filp_open+0x1c4/0x290 fs/namei.c:3588
       do_sys_openat2+0x60d/0x98c fs/open.c:1212
       do_sys_open+0xcf/0x13c fs/open.c:1228
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0x0
      RIP: 0033:0x7f7ef4bb753d
      Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48
      89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73
      01 c3 48 8b 0d 1b 79 2c 00 f7 d8 64 89 01 48
      RSP: 002b:00007f7ef52c2ed8 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
      RAX: ffffffffffffffda RBX: 0000000000404148 RCX: 00007f7ef4bb753d
      RDX: 00007f7ef4bb753d RSI: 0000000000000000 RDI: 0000000020004fc0
      RBP: 0000000000404140 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 0030656c69662f2e
      R13: 00007ffd794db37f R14: 00007ffd794db470 R15: 00007f7ef52c2fc0
       </TASK>
      
      Allocated by task 2953:
       kasan_save_stack+0x19/0x38 mm/kasan/common.c:38
       kasan_set_track mm/kasan/common.c:46 [inline]
       set_alloc_info mm/kasan/common.c:434 [inline]
       __kasan_slab_alloc+0x68/0x7c mm/kasan/common.c:467
       kasan_slab_alloc include/linux/kasan.h:254 [inline]
       slab_post_alloc_hook mm/slab.h:519 [inline]
       slab_alloc_node mm/slub.c:3213 [inline]
       slab_alloc mm/slub.c:3221 [inline]
       kmem_cache_alloc+0x11b/0x3eb mm/slub.c:3226
       kmem_cache_zalloc include/linux/slab.h:711 [inline]
       xfs_ifork_alloc+0x25/0xa2 fs/xfs/libxfs/xfs_inode_fork.c:287
       xfs_bmap_add_attrfork+0x3f2/0x9b1 fs/xfs/libxfs/xfs_bmap.c:1098
       xfs_attr_set+0xe38/0x12a7 fs/xfs/libxfs/xfs_attr.c:746
       xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59
       __vfs_setxattr+0x11b/0x177 fs/xattr.c:180
       __vfs_setxattr_noperm+0x128/0x5e0 fs/xattr.c:214
       __vfs_setxattr_locked+0x1d4/0x258 fs/xattr.c:275
       vfs_setxattr+0x154/0x33d fs/xattr.c:301
       setxattr+0x216/0x29f fs/xattr.c:575
       __do_sys_fsetxattr fs/xattr.c:632 [inline]
       __se_sys_fsetxattr fs/xattr.c:621 [inline]
       __x64_sys_fsetxattr+0x243/0x2fe fs/xattr.c:621
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0x0
      
      Freed by task 2949:
       kasan_save_stack+0x19/0x38 mm/kasan/common.c:38
       kasan_set_track+0x1c/0x21 mm/kasan/common.c:46
       kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:360
       ____kasan_slab_free mm/kasan/common.c:366 [inline]
       ____kasan_slab_free mm/kasan/common.c:328 [inline]
       __kasan_slab_free+0xe2/0x10e mm/kasan/common.c:374
       kasan_slab_free include/linux/kasan.h:230 [inline]
       slab_free_hook mm/slub.c:1700 [inline]
       slab_free_freelist_hook mm/slub.c:1726 [inline]
       slab_free mm/slub.c:3492 [inline]
       kmem_cache_free+0xdc/0x3ce mm/slub.c:3508
       xfs_attr_fork_remove+0x8d/0x132 fs/xfs/libxfs/xfs_attr_leaf.c:773
       xfs_attr_sf_removename+0x5dd/0x6cb fs/xfs/libxfs/xfs_attr_leaf.c:822
       xfs_attr_remove_iter+0x68c/0x805 fs/xfs/libxfs/xfs_attr.c:1413
       xfs_attr_remove_args+0xb1/0x10d fs/xfs/libxfs/xfs_attr.c:684
       xfs_attr_set+0xf1e/0x12a7 fs/xfs/libxfs/xfs_attr.c:802
       xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59
       __vfs_removexattr+0x106/0x16a fs/xattr.c:468
       cap_inode_killpriv+0x24/0x47 security/commoncap.c:324
       security_inode_killpriv+0x54/0xa1 security/security.c:1414
       setattr_prepare+0x1a6/0x897 fs/attr.c:146
       xfs_vn_change_ok+0x111/0x15e fs/xfs/xfs_iops.c:682
       xfs_vn_setattr_size+0x5f/0x15a fs/xfs/xfs_iops.c:1065
       xfs_vn_setattr+0x125/0x2ad fs/xfs/xfs_iops.c:1093
       notify_change+0xae5/0x10a1 fs/attr.c:410
       do_truncate+0x134/0x1e0 fs/open.c:64
       handle_truncate fs/namei.c:3084 [inline]
       do_open fs/namei.c:3432 [inline]
       path_openat+0x30ab/0x396d fs/namei.c:3561
       do_filp_open+0x1c4/0x290 fs/namei.c:3588
       do_sys_openat2+0x60d/0x98c fs/open.c:1212
       do_sys_open+0xcf/0x13c fs/open.c:1228
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0x0
      
      The buggy address belongs to the object at ffff88802cec9188
       which belongs to the cache xfs_ifork of size 40
      The buggy address is located 20 bytes inside of
       40-byte region [ffff88802cec9188, ffff88802cec91b0)
      The buggy address belongs to the page:
      page:00000000c3af36a1 refcount:1 mapcount:0 mapping:0000000000000000
      index:0x0 pfn:0x2cec9
      flags: 0xfffffc0000200(slab|node=0|zone=1|lastcpupid=0x1fffff)
      raw: 000fffffc0000200 ffffea00009d2580 0000000600000006 ffff88801a9ffc80
      raw: 0000000000000000 0000000080490049 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff88802cec9080: fb fb fb fc fc fa fb fb fb fb fc fc fb fb fb fb
       ffff88802cec9100: fb fc fc fb fb fb fb fb fc fc fb fb fb fb fb fc
      >ffff88802cec9180: fc fa fb fb fb fb fc fc fa fb fb fb fb fc fc fb
                                  ^
       ffff88802cec9200: fb fb fb fb fc fc fb fb fb fb fb fc fc fb fb fb
       ffff88802cec9280: fb fb fc fc fa fb fb fb fb fc fc fa fb fb fb fb
      ==================================================================
      
      The root cause of this bug is the unlocked access to xfs_inode.i_afp
      from the getxattr code paths while trying to determine which ILOCK mode
      to use to stabilize the xattr data.  Unfortunately, the VFS does not
      acquire i_rwsem when vfs_getxattr (or listxattr) call into the
      filesystem, which means that getxattr can race with a removexattr that's
      tearing down the attr fork and crash:
      
      xfs_attr_set:                          xfs_attr_get:
      xfs_attr_fork_remove:                  xfs_ilock_attr_map_shared:
      
      xfs_idestroy_fork(ip->i_afp);
      kmem_cache_free(xfs_ifork_cache, ip->i_afp);
      
                                             if (ip->i_afp &&
      
      ip->i_afp = NULL;
      
                                                 xfs_need_iread_extents(ip->i_afp))
                                             <KABOOM>
      
      ip->i_forkoff = 0;
      
      Regrettably, the VFS is much more lax about i_rwsem and getxattr than
      is immediately obvious -- not only does it not guarantee that we hold
      i_rwsem, it actually doesn't guarantee that we *don't* hold it either.
      The getxattr system call won't acquire the lock before calling XFS, but
      the file capabilities code calls getxattr with and without i_rwsem held
      to determine if the "security.capabilities" xattr is set on the file.
      
      Fixing the VFS locking requires a treewide investigation into every code
      path that could touch an xattr and what i_rwsem state it expects or sets
      up.  That could take years or even prove impossible; fortunately, we
      can fix this UAF problem inside XFS.
      
      An earlier version of this patch used smp_wmb in xfs_attr_fork_remove to
      ensure that i_forkoff is always zeroed before i_afp is set to null and
      changed the read paths to use smp_rmb before accessing i_forkoff and
      i_afp, which avoided these UAF problems.  However, the patch author was
      too busy dealing with other problems in the meantime, and by the time he
      came back to this issue, the situation had changed a bit.
      
      On a modern system with selinux, each inode will always have at least
      one xattr for the selinux label, so it doesn't make much sense to keep
      incurring the extra pointer dereference.  Furthermore, Allison's
      upcoming parent pointer patchset will also cause nearly every inode in
      the filesystem to have extended attributes.  Therefore, make the inode
      attribute fork structure part of struct xfs_inode, at a cost of 40 more
      bytes.
      
      This patch adds a clunky if_present field where necessary to maintain
      the existing logic of xattr fork null pointer testing in the existing
      codebase.  The next patch switches the logic over to XFS_IFORK_Q and it
      all goes away.
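The idea of embedding the attr fork can be sketched as follows. This is a simplified, single-threaded model with hypothetical names (`demo_inode`, `demo_attr_fork_init`, `demo_attr_fork_zap`, `demo_has_attr_fork`); it does not model ILOCK or any of the real fork contents. The point it tries to show is that setup and teardown happen in place, so a reader holding a live inode can never follow a pointer to freed fork memory.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; not the kernel's definitions. */
struct demo_ifork {
	bool	if_present;	/* transitional "fork exists" flag */
	int	if_nextents;
};

struct demo_inode {
	struct demo_ifork	i_df;	/* data fork */
	struct demo_ifork	i_af;	/* attr fork lives inside the inode */
};

/* Set up the attr fork in place; nothing is allocated. */
void demo_attr_fork_init(struct demo_inode *ip)
{
	memset(&ip->i_af, 0, sizeof(ip->i_af));
	ip->i_af.if_present = true;
}

/* Tear the attr fork down in place; nothing is freed. */
void demo_attr_fork_zap(struct demo_inode *ip)
{
	memset(&ip->i_af, 0, sizeof(ip->i_af));
}

/* Reads memory that stays valid for as long as the inode itself does. */
bool demo_has_attr_fork(const struct demo_inode *ip)
{
	return ip->i_af.if_present;
}
```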
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • D
      xfs: convert XFS_IFORK_PTR to a static inline helper · 732436ef
      Committed by Darrick J. Wong
      We're about to make this logic do a bit more, so convert the macro to a
      static inline function for better typechecking and fewer shouty macros.
      No functional changes here.
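To make the macro-versus-function point concrete, here is a small sketch with hypothetical, simplified types. `DEMO_IFORK_PTR` and `demo_ifork_ptr` are illustrative names only, and the two-way fork selection is an assumption made to keep the example short.

```c
/* Hypothetical, simplified types; not the kernel's definitions. */
struct demo_ifork { int if_nextents; };
struct demo_inode {
	struct demo_ifork	i_df;
	struct demo_ifork	*i_afp;
};

#define DEMO_DATA_FORK	0
#define DEMO_ATTR_FORK	1

/* Macro form: accepts any expression that happens to expand cleanly. */
#define DEMO_IFORK_PTR(ip, w) \
	((w) == DEMO_DATA_FORK ? &(ip)->i_df : (ip)->i_afp)

/* Function form: the compiler checks both argument types. */
static inline struct demo_ifork *
demo_ifork_ptr(struct demo_inode *ip, int whichfork)
{
	if (whichfork == DEMO_DATA_FORK)
		return &ip->i_df;
	return ip->i_afp;
}
```

Both forms return the same pointers; the function form also gives a natural place to add extra logic later.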
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • A
      xfs: removed useless condition in function xfs_attr_node_get · 0f38063d
      Committed by Andrey Strachuk
      At line 1561, variable "state" is being compared
      with NULL every loop iteration.
      
      -------------------------------------------------------------------
      1561	for (i = 0; state != NULL && i < state->path.active; i++) {
      1562		xfs_trans_brelse(args->trans, state->path.blk[i].bp);
      1563		state->path.blk[i].bp = NULL;
      1564	}
      -------------------------------------------------------------------
      
      However, it cannot be NULL.
      
      ----------------------------------------
      1546	state = xfs_da_state_alloc(args);
      ----------------------------------------
      
      xfs_da_state_alloc calls kmem_cache_zalloc. kmem_cache_zalloc is
      called with __GFP_NOFAIL flag and, therefore, it cannot return NULL.
      
      --------------------------------------------------------------------------
      	struct xfs_da_state *
      	xfs_da_state_alloc(
      	struct xfs_da_args	*args)
      	{
      		struct xfs_da_state	*state;
      
      		state = kmem_cache_zalloc(xfs_da_state_cache, GFP_NOFS | __GFP_NOFAIL);
      		state->args = args;
      		state->mp = args->dp->i_mount;
      		return state;
      	}
      --------------------------------------------------------------------------
      
      Found by Linux Verification Center (linuxtesting.org) with SVACE.
      Signed-off-by: Andrey Strachuk <strochuk@ispras.ru>

      Fixes: 4d0cdd2b ("xfs: clean up xfs_attr_node_hasname")
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • E
      xfs: add selinux labels to whiteout inodes · 70b589a3
      Committed by Eric Sandeen
      We got a report that "renameat2() with flags=RENAME_WHITEOUT doesn't
      apply an SELinux label on xfs" as it does on other filesystems
      (for example, ext4 and tmpfs.)  While I'm not quite sure how labels
      may interact w/ whiteout files, leaving them as unlabeled seems
      inconsistent at best. Now that xfs_init_security is not static,
      rename it to xfs_inode_init_security per dchinner's suggestion.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  3. 07 Jul, 2022 23 commits
    • D
      xfs: make is_log_ag() a first class helper · 36029dee
      Committed by Dave Chinner
      We check if an ag contains the log in many places, so make this
      a first class XFS helper by lifting it to fs/xfs/libxfs/xfs_ag.h and
      renaming it xfs_ag_contains_log(). Then convert all the places that
      check if the AG contains the log to use this helper.
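A helper of this kind might look roughly like the sketch below. The structure and names (`demo_mount`, `demo_fsb_to_agno`, `demo_ag_contains_log`) are hypothetical, and the sketch assumes that a filesystem block number carries the AG number above its low `agblklog` bits and that a log start of zero means the log is external.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified mount geometry; not the kernel's struct xfs_mount. */
struct demo_mount {
	uint64_t	logstart;	/* first fs block of the log, 0 if external */
	uint8_t		agblklog;	/* log2 of the AG size in blocks, rounded up */
};

/* Assumed encoding: the AG number sits above the low agblklog bits. */
static inline uint32_t
demo_fsb_to_agno(const struct demo_mount *mp, uint64_t fsbno)
{
	return (uint32_t)(fsbno >> mp->agblklog);
}

/* True only for an internal log whose first block falls in this AG. */
static inline bool
demo_ag_contains_log(const struct demo_mount *mp, uint32_t agno)
{
	return mp->logstart > 0 &&
	       agno == demo_fsb_to_agno(mp, mp->logstart);
}
```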
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • D
      xfs: replace xfs_ag_block_count() with perag accesses · 3829c9a1
      Committed by Dave Chinner
      Many of the places that call xfs_ag_block_count() have a perag
      available. These places can just read pag->block_count directly
      instead of calculating the AG block count from first principles.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • D
      xfs: Pre-calculate per-AG agino geometry · 2d6ca832
      Committed by Dave Chinner
      There is a lot of overhead in functions like xfs_verify_agino() that
      repeatedly calculate the geometry limits of an AG. These can be
      pre-calculated as they are static and the verification context has
      a per-ag context it can quickly reference.
      
      In the case of xfs_verify_agino(), we now always have a perag
      context handy, so we can store the minimum and maximum agino values
      in the AG in the perag. This means we don't have to calculate
      it on every call and it can be inlined in callers if we move it
      to xfs_ag.h.
      
      xfs_verify_agino_or_null() gets the same perag treatment.
      
      xfs_agino_range() is moved to xfs_ag.c as it's not really a type
      function, and its use is largely restricted as the first and last
      aginos can be grabbed straight from the perag in most cases.
      
      Note that we leave the original xfs_verify_agino in place in
      xfs_types.c as a static function as other callers in that file do
      not have per-ag contexts so still need to go the long way. It's been
      renamed to xfs_verify_agno_agino() to indicate it takes both an agno
      and an agino to differentiate it from the new function.
      
      $ size --totals fs/xfs/built-in.a
      	   text    data     bss     dec     hex filename
      before	1482185	 329588	    572	1812345	 1ba779	(TOTALS)
      after	1481937	 329588	    572	1812097	 1ba681	(TOTALS)
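The cached-limits approach described above can be sketched as a pair of inline range checks. The structure and field names here (`demo_perag`, `agino_min`, `agino_max`, `DEMO_NULLAGINO`) are assumptions chosen for illustration; the sketch presumes the limits were computed once when the per-AG structure was set up.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified per-AG structure; field names are assumptions. */
struct demo_perag {
	uint32_t	agino_min;	/* first valid AG inode number */
	uint32_t	agino_max;	/* last valid AG inode number */
};

#define DEMO_NULLAGINO	((uint32_t)-1)

/* Range check against limits computed once at perag setup. */
static inline bool
demo_verify_agino(const struct demo_perag *pag, uint32_t agino)
{
	return agino >= pag->agino_min && agino <= pag->agino_max;
}

/* Same check, but the "null" inode number is also acceptable. */
static inline bool
demo_verify_agino_or_null(const struct demo_perag *pag, uint32_t agino)
{
	return agino == DEMO_NULLAGINO || demo_verify_agino(pag, agino);
}
```

Because each check is two comparisons against cached fields, it is cheap enough to inline into callers.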
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • D
      xfs: Pre-calculate per-AG agbno geometry · 0800169e
      Committed by Dave Chinner
      There is a lot of overhead in functions like xfs_verify_agbno() that
      repeatedly calculate the geometry limits of an AG. These can be
      pre-calculated as they are static and the verification context has
      a per-ag context it can quickly reference.
      
      In the case of xfs_verify_agbno(), we now always have a perag
      context handy, so we can store the AG length and the minimum valid
      block in the AG in the perag. This means we don't have to calculate
      it on every call and it can be inlined in callers if we move it
      to xfs_ag.h.
      
      Move xfs_ag_block_count() to xfs_ag.c because it's really a
      per-ag function and not an XFS type function. We need a little
      bit of rework that is specific to xfs_initialise_perag() to allow
      growfs to calculate the new perag sizes before we've updated the
      primary superblock during the grow (chicken/egg situation).
      
      Note that we leave the original xfs_verify_agbno in place in
      xfs_types.c as a static function as other callers in that file do
      not have per-ag contexts so still need to go the long way. It's been
      renamed to xfs_verify_agno_agbno() to indicate it takes both an agno
      and an agbno to differentiate it from the new function.
      
      Future commits will make similar changes for other per-ag geometry
      validation functions.
      
      Further:
      
      $ size --totals fs/xfs/built-in.a
      	   text    data     bss     dec     hex filename
      before	1483006	 329588	    572	1813166	 1baaae	(TOTALS)
      after	1482185	 329588	    572	1812345	 1ba779	(TOTALS)
      
      This rework reduces the binary size by ~820 bytes, indicating
      that much less work is being done to bounds check the agbno values
      against per-ag geometry information.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • D
      xfs: pass perag to xfs_alloc_read_agfl · cec7bb7d
      Committed by Dave Chinner
      We have the perag in most places we call xfs_alloc_read_agfl, so
      pass the perag instead of a mount/agno pair.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • D
      xfs: pass perag to xfs_alloc_put_freelist · 8c392eb2
      Committed by Dave Chinner
      It's available in all callers, so pass it in so that the perag can
      be passed further down the stack.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • D
      xfs: pass perag to xfs_alloc_get_freelist · 49f0d84e
      Committed by Dave Chinner
      It's available in all callers, so pass it in so that the perag can
      be passed further down the stack.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • D
      xfs: pass perag to xfs_read_agf · fa044ae7
      Committed by Dave Chinner
      We have the perag in most places we call xfs_read_agf, so pass the
      perag instead of a mount/agno pair.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • D
      xfs: pass perag to xfs_read_agi · 61021deb
      Committed by Dave Chinner
      We have the perag in most places we call xfs_read_agi, so pass the
      perag instead of a mount/agno pair.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • D
      xfs: pass perag to xfs_alloc_read_agf() · 08d3e84f
      Committed by Dave Chinner
      xfs_alloc_read_agf() initialises the perag if it hasn't been done
      yet, so it makes sense to pass it the perag rather than pull a
      reference from the buffer. This allows callers to be per-ag centric
      rather than passing mount/agno pairs everywhere.
      
      Whilst modifying the xfs_reflink_find_shared() function definition,
      declare it static and remove the extern declaration as it is an
      internal function only these days.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • D
      xfs: kill xfs_alloc_pagf_init() · 76b47e52
      Committed by Dave Chinner
      Trivial wrapper around xfs_alloc_read_agf(), can be easily replaced
      by passing a NULL agfbp to xfs_alloc_read_agf().
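The pattern of folding an "init only" wrapper into the main function via an optional output pointer can be sketched like this. Everything here (`demo_perag`, `demo_buf`, `demo_alloc_read_agf`, the `pagf_init` flag as modelled) is a hypothetical simplification; no real I/O or buffer management is represented.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified types; not the kernel's definitions. */
struct demo_buf { int dummy; };

struct demo_perag {
	bool		pagf_init;	/* in-memory AGF state initialised? */
	struct demo_buf	agf_buf;	/* stands in for the on-disk AGF buffer */
};

/*
 * Read the AGF and initialise the perag on first use.  Callers that only
 * want the initialisation pass NULL for agfbpp and get no buffer back.
 */
int demo_alloc_read_agf(struct demo_perag *pag, struct demo_buf **agfbpp)
{
	if (!pag->pagf_init)
		pag->pagf_init = true;
	if (agfbpp)
		*agfbpp = &pag->agf_buf;
	return 0;
}
```

A former wrapper call then becomes `demo_alloc_read_agf(pag, NULL)` in this model.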
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • D
      xfs: pass perag to xfs_ialloc_read_agi() · 99b13c7f
      Committed by Dave Chinner
      xfs_ialloc_read_agi() initialises the perag if it hasn't been done
      yet, so it makes sense to pass it the perag rather than pull a
      reference from the buffer. This allows callers to be per-ag centric
      rather than passing mount/agno pairs everywhere.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • D
      xfs: kill xfs_ialloc_pagi_init() · a95fee40
      Committed by Dave Chinner
      This is just a basic wrapper around xfs_ialloc_read_agi(), which can
      be entirely handled by xfs_ialloc_read_agi() by passing a NULL
      agibpp....
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • D
      xfs: make last AG grow/shrink perag centric · c6aee248
      Committed by Dave Chinner
      Because the perag must exist for these operations, look it up as
      part of the common shrink operations and pass it instead of the
      mount/agno pair.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • D
      xfs: xlog_sync() manually adjusts grant head space · d9f68777
      Committed by Dave Chinner
      When xlog_sync() rounds off the tail of the iclog that is being
      flushed, it manually subtracts that space from the grant heads. This
      space is actually reserved by the transaction ticket that covers
      the xlog_sync() call from xlog_write(), but we don't plumb the
      ticket down far enough for it to account for the space consumed in
      the current log ticket.
      
      The grant heads are hot, so we really should be accounting this to
      the ticket if we can, rather than adding thousands of extra grant
      head updates every CIL commit.
      
      Interestingly, this actually indicates a potential log space overrun
      can occur when we force the log. By the time that xfs_log_force()
      pushes out an active iclog and consumes the roundoff space, the
      reservation for that roundoff space has been returned to the grant
      heads and is no longer covered by a reservation. In theory the
      roundoff added to log force on an already full log could push the
      write head past the tail. In practice, the CIL commit that writes to
      the log and needs the iclog pushed will have reserved space for
      roundoff, so when it releases the ticket there will still be
      physical space for the roundoff to be committed to the log, even
      though it is no longer reserved. This roundoff won't be enough space
      to allow a transaction to be woken if the log is full, so overruns
      should not actually occur in practice.
      
      That said, it indicates that we should not release the CIL context
      log ticket until after we've released the commit iclog. It also
      means that xlog_sync() still needs the direct grant head
      manipulation if we don't provide it with a ticket. Log forces are
      rare when we are in fast paths running 1.5 million transactions/s
      that make the grant heads hot, so let's optimise the hot case and
      pass CIL log tickets down to the xlog_sync() code.
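The two accounting paths described above can be sketched as follows. The types and names (`demo_ticket`, `demo_log`, `demo_account_roundoff`) are hypothetical, and modelling each grant head as a plain "bytes consumed" counter is an assumption made for the sake of a small example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified types; not the kernel's definitions. */
struct demo_ticket {
	int64_t	curr_res;	/* bytes still reserved by this ticket */
};

struct demo_log {
	int64_t	reserve_head;	/* bytes consumed at the reserve grant head */
	int64_t	write_head;	/* bytes consumed at the write grant head */
};

/*
 * Account for the padding added when an iclog is rounded up.  With a
 * ticket the cost comes out of its reservation; without one, fall back
 * to moving the shared grant heads directly.
 */
void demo_account_roundoff(struct demo_log *log, struct demo_ticket *ticket,
			   int64_t roundoff)
{
	if (ticket) {
		ticket->curr_res -= roundoff;
		return;
	}
	log->reserve_head += roundoff;
	log->write_head += roundoff;
}
```

In this model the ticket path touches only ticket-local state, which is why it suits the hot commit path, while the fallback path is reserved for rarer callers such as log forces.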
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • D
      xfs: avoid cil push lock if possible · 1ccb0745
      Committed by Dave Chinner
      Because now it hurts when the CIL fills up.
      
        - 37.20% __xfs_trans_commit
            - 35.84% xfs_log_commit_cil
               - 19.34% _raw_spin_lock
                  - do_raw_spin_lock
                       19.01% __pv_queued_spin_lock_slowpath
               - 4.20% xfs_log_ticket_ungrant
                    0.90% xfs_log_space_wake
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • D
      xfs: move CIL ordering to the logvec chain · 4eb56069
      Committed by Dave Chinner
      Adding a list_sort() call to the CIL push work while the xc_ctx_lock
      is held exclusively has resulted in fairly long lock hold times and
      that stops all front end transaction commits from making progress.
      
      We can move the sorting out of the xc_ctx_lock if we can transfer
      the ordering information to the log vectors as they are detached
      from the log items and then we can sort the log vectors.  With these
      changes, we can move the list_sort() call to just before we call
      xlog_write() when we aren't holding any locks at all.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • D
      xfs: convert log vector chain to use list heads · 16924853
      Committed by Dave Chinner
      Because the next change is going to require sorting log vectors, and
      that requires arbitrary rearrangement of the list which cannot be
      done easily with a single linked list.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • D
      xfs: convert CIL to unordered per cpu lists · c0fb4765
      Committed by Dave Chinner
      So that we can remove the cil_lock which is a global serialisation
      point. We've already got ordering sorted, so all we need to do is
      treat the CIL list like the busy extent list and reconstruct it
      before the push starts.
      
      This is what we're trying to avoid:
      
       -   75.35%     1.83%  [kernel]            [k] xfs_log_commit_cil
          - 46.35% xfs_log_commit_cil
             - 41.54% _raw_spin_lock
                - 67.30% do_raw_spin_lock
                     66.96% __pv_queued_spin_lock_slowpath
      
      Which happens on a 32p system when running a 32-way 'rm -rf'
      workload. After this patch:
      
      -   20.90%     3.23%  [kernel]               [k] xfs_log_commit_cil
         - 17.67% xfs_log_commit_cil
            - 6.51% xfs_log_ticket_ungrant
                 1.40% xfs_log_space_wake
              2.32% memcpy_erms
            - 2.18% xfs_buf_item_committing
               - 2.12% xfs_buf_item_release
                  - 1.03% xfs_buf_unlock
                       0.96% up
                    0.72% xfs_buf_rele
              1.33% xfs_inode_item_format
              1.19% down_read
              0.91% up_read
              0.76% xfs_buf_item_format
            - 0.68% kmem_alloc_large
               - 0.67% kmem_alloc
                    0.64% __kmalloc
              0.50% xfs_buf_item_size
      
      It kinda looks like the workload is running out of log space all
      the time. But all the spinlock contention is gone and the
      transaction commit rate has gone from 800k/s to 1.3M/s so the amount
      of real work being done has gone up a *lot*.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • D
      xfs: Add order IDs to log items in CIL · 016a2338
      Committed by Dave Chinner
      Before we split the ordered CIL up into per cpu lists, we need a
      mechanism to track the order of the items in the CIL. We need to do
      this because there are rules around the order in which related items
      must physically appear in the log even inside a single checkpoint
      transaction.
      
      An example of this is intents - an intent must appear in the log
      before its intent done record so that log recovery can cancel the
      intent correctly. If we have these two records misordered in the
      CIL, then they will not be recovered correctly by journal replay.
      
      We also will not be able to move items to the tail of
      the CIL list when they are relogged, hence the log items will need
      some mechanism to allow the correct log item order to be recreated
      before we write log items to the journal.
      
      Hence we need to have a mechanism for recording global order of
      transactions in the log items so that we can recover that order
      from un-ordered per-cpu lists.
      
      Do this with a simple monotonic increasing commit counter in the CIL
      context. Each log item in the transaction gets stamped with the
      current commit order ID before it is added to the CIL. If the item
      is already in the CIL, leave it where it is instead of moving it to
      the tail of the list and instead sort the list before we start the
      push work.
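The stamping and sorting scheme above can be sketched in user space as shown below. The names (`demo_log_item`, `demo_cil_ctx`, `demo_commit`, `demo_sort_for_push`) are hypothetical, the counter is a plain integer rather than an atomic, and `qsort()` over an array stands in for sorting a kernel list. It is a single-threaded model under those assumptions.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified types; not the kernel's definitions. */
struct demo_log_item {
	uint32_t	order_id;	/* commit order stamp */
	int		in_cil;		/* already tracked by the CIL? */
};

struct demo_cil_ctx {
	uint32_t	order_id;	/* monotonically increasing commit counter */
};

/* Stamp every item of one committing transaction with the same new ID. */
void demo_commit(struct demo_cil_ctx *ctx, struct demo_log_item **items, size_t n)
{
	uint32_t	id = ++ctx->order_id;
	size_t		i;

	for (i = 0; i < n; i++) {
		items[i]->order_id = id;
		items[i]->in_cil = 1;	/* a relogged item stays where it is */
	}
}

/* Compare two item pointers by their commit order stamp. */
static int demo_order_cmp(const void *a, const void *b)
{
	const struct demo_log_item *const *ia = a;
	const struct demo_log_item *const *ib = b;

	return ((*ia)->order_id > (*ib)->order_id) -
	       ((*ia)->order_id < (*ib)->order_id);
}

/* Rebuild commit order from an unordered array before the push. */
void demo_sort_for_push(struct demo_log_item **items, size_t n)
{
	qsort(items, n, sizeof(*items), demo_order_cmp);
}
```

A relogged item simply receives a newer stamp, so sorting at push time places it where a move to the tail would have put it.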
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • D
      xfs: convert CIL busy extents to per-cpu · df7a4a21
      Committed by Dave Chinner
      To get them out from under the CIL lock.
      
      This is an unordered list, so we can simply punt it to per-cpu lists
      during transaction commits and reaggregate it back into a single
      list during the CIL push work.
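The punt-and-reaggregate idea can be sketched with plain arrays and singly linked lists. All names here (`demo_busy_extent`, `demo_cil`, `demo_busy_insert`, `demo_busy_aggregate`, `DEMO_NCPUS`) are hypothetical, the CPU index is passed in explicitly, and the sketch is single-threaded, so it shows only the data movement and none of the real per-CPU or preemption handling.

```c
#include <stddef.h>

#define DEMO_NCPUS	4

/* Hypothetical, simplified busy extent record. */
struct demo_busy_extent {
	struct demo_busy_extent	*next;
	unsigned int		bno;
	unsigned int		len;
};

struct demo_cil {
	struct demo_busy_extent	*pcp_busy[DEMO_NCPUS];	/* one unordered list per CPU */
};

/* Commit side: push onto the given CPU's list; no shared list is touched. */
void demo_busy_insert(struct demo_cil *cil, int cpu, struct demo_busy_extent *be)
{
	be->next = cil->pcp_busy[cpu];
	cil->pcp_busy[cpu] = be;
}

/* Push side: drain every per-CPU list into one list; order is irrelevant. */
struct demo_busy_extent *demo_busy_aggregate(struct demo_cil *cil)
{
	struct demo_busy_extent	*all = NULL;
	int			cpu;

	for (cpu = 0; cpu < DEMO_NCPUS; cpu++) {
		while (cil->pcp_busy[cpu]) {
			struct demo_busy_extent *be = cil->pcp_busy[cpu];

			cil->pcp_busy[cpu] = be->next;
			be->next = all;
			all = be;
		}
	}
	return all;
}
```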
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • D
      xfs: track CIL ticket reservation in percpu structure · 1dd2a2c1
      Committed by Dave Chinner
      To get it out from under the cil spinlock.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • D
      xfs: implement percpu cil space used calculation · 7c8ade21
      Committed by Dave Chinner
      Now that we have the CIL percpu structures in place, implement the
      space used counter as a per-cpu counter.
      
      We have to be really careful now about ensuring that the checks and
      updates run without arbitrary delays, which means they need to run
      with pre-emption disabled. We do this by careful placement of
      the get_cpu_ptr/put_cpu_ptr calls to access the per-cpu structures
      for that CPU.
      
      We need to be able to reliably detect that the CIL has reached
      the hard limit threshold so we can take extra reservations for the
      iclog headers when the space used overruns the original reservation.
      Hence we factor out xlog_cil_over_hard_limit() from
      xlog_cil_push_background().
      
      The global CIL space used is an atomic variable that is backed by
      per-cpu aggregation to minimise the number of atomic updates we do
      to the global state in the fast path. While we are under the soft
      limit, we aggregate only when the per-cpu aggregation is over the
      proportion of the soft limit assigned to that CPU. This means that
      all CPUs can use all but one byte of their aggregation threshold
      and we will not go over the soft limit.
      
      Hence once we detect that we've gone over both a per-cpu aggregation
      threshold and the soft limit, we know that we have only
      exceeded the soft limit by one per-cpu aggregation threshold. Even
      if all CPUs hit this at the same time, we can't be over the hard
      limit, so we can run an aggregation back into the atomic counter
      at this point and still be under the hard limit.
      
      At this point, we will be over the soft limit and hence we'll
      aggregate into the global atomic used space directly rather than the
      per-cpu counters, hence providing accurate detection of hard limit
      excursion for accounting and reservation purposes.
      
      Hence we get the best of both worlds - lockless, scalable per-cpu
      fast path plus accurate, atomic detection of hard limit excursion.
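
      The accounting scheme above can be modelled in a few lines of
      userspace C. This is a sketch under stated assumptions, not the
      kernel code: the limits, the CPU count and all demo_* names are
      made up, the per-cpu counters are a plain array indexed by a CPU
      number passed in by the caller, and the caller is assumed to be
      the only one touching its own slot (the role pre-emption disabling
      plays in the kernel).

```c
#include <assert.h>
#include <stdatomic.h>

#define DEMO_NCPUS		4
#define DEMO_SOFT_LIMIT		4000	/* hypothetical soft limit, in bytes */
#define DEMO_PCP_THRESHOLD	(DEMO_SOFT_LIMIT / DEMO_NCPUS)

static atomic_int demo_global_used;		/* global space used */
static int demo_pcp_used[DEMO_NCPUS];		/* per-cpu aggregation counters */

/* Fold every per-cpu counter back into the global atomic counter. */
static void demo_fold_all(void)
{
	for (int cpu = 0; cpu < DEMO_NCPUS; cpu++) {
		atomic_fetch_add(&demo_global_used, demo_pcp_used[cpu]);
		demo_pcp_used[cpu] = 0;
	}
}

/* Account len bytes on the given CPU; returns the global counter afterwards. */
static int demo_account(int cpu, int len)
{
	assert(cpu >= 0 && cpu < DEMO_NCPUS);
	assert(len >= 0);

	/* Fast path: under the soft limit and under this CPU's share of it. */
	if (atomic_load(&demo_global_used) < DEMO_SOFT_LIMIT &&
	    demo_pcp_used[cpu] + len < DEMO_PCP_THRESHOLD) {
		demo_pcp_used[cpu] += len;
		return atomic_load(&demo_global_used);
	}

	/* Slow path: push the local count plus this delta into the global. */
	atomic_fetch_add(&demo_global_used, demo_pcp_used[cpu] + len);
	demo_pcp_used[cpu] = 0;

	/* Once over the soft limit, aggregate everything so the global is exact. */
	if (atomic_load(&demo_global_used) >= DEMO_SOFT_LIMIT)
		demo_fold_all();
	return atomic_load(&demo_global_used);
}
```

      With these numbers each CPU can hold up to 999 bytes locally
      without the global counter moving, which is the "all but one byte
      of their aggregation threshold" case described above.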
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      7c8ade21
  4. 02 Jul, 2022 5 commits
    • D
      xfs: introduce per-cpu CIL tracking structure · af1c2146
      Dave Chinner committed
      The CIL push lock is highly contended on larger machines, becoming a
      hard bottleneck at about 700,000 transaction commits/s on >16p
      machines. To address this, start moving the CIL tracking
      infrastructure to utilise per-CPU structures.
      
      We need to track the space used, the amount of log reservation space
      reserved to write the CIL, the log items in the CIL and the busy
      extents that need to be completed by the CIL commit.  This requires
      a couple of per-cpu counters, an unordered per-cpu list and a
      globally ordered per-cpu list.
      
      Create a per-cpu structure to hold these and all the management
      interfaces needed, as well as the hooks to handle hotplug CPUs.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      
      af1c2146
    • D
      xfs: rework per-iclog header CIL reservation · 31151cc3
      Dave Chinner committed
      For every iclog that a CIL push will use up, we need to ensure we
      have space reserved for the iclog header in each iclog. It is
      extremely difficult to do this accurately with a per-cpu counter
      without expensive summing of the counter in every commit. However,
      we know what the maximum CIL size is going to be because of the
      hard space limit we have, and hence we know exactly how many iclogs
      we are going to need to write out the CIL.
      
      We are constrained by the requirement that small transactions only
      have reservation space for a single iclog header built into them.
      At commit time we don't know how much of the current transaction
      reservation is made up of iclog header reservations as calculated by
      xfs_log_calc_unit_res() when the ticket was reserved. As larger
      reservations have multiple header spaces reserved, we can steal
      more than one iclog header reservation at a time, but we only steal
      the exact number needed for the given log vector size delta.
      
      As a result, we don't know exactly when we are going to steal iclog
      header reservations, nor do we know exactly how many we are going to
      need for a given CIL.
      
      To make things simple, start by calculating the worst case number of
      iclog headers a full CIL push will require. Record this into an
      atomic variable in the CIL. Then add a byte counter to the log
      ticket that records exactly how much iclog header space has been
      reserved in this ticket by xfs_log_calc_unit_res(). This tells us
      exactly how much space we can steal from the ticket at transaction
      commit time.
      
      Now, at transaction commit time, we can check if the CIL has a full
      iclog header reservation and, if not, steal the entire reservation
      the current ticket holds for iclog headers. This minimises the
      number of times we need to do atomic operations in the fast path,
      but still guarantees we get all the reservations we need.
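
      A rough userspace model of the stealing scheme, for illustration
      only. The structures, field names and the worst-case formula
      (hard limit divided by the payload of one iclog, rounded up) are
      assumptions of this sketch rather than the kernel's definitions,
      and the non-atomic ctx_reserved field means it is only meaningful
      single-threaded.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical CIL state: header bytes still wanted, bytes already collected. */
struct demo_cil {
	atomic_long iclog_hdr_bytes_needed;
	long ctx_reserved;
};

/* Hypothetical log ticket: total reservation and the header part of it. */
struct demo_ticket {
	long curr_res;
	long iclog_hdr_bytes;
};

/* Record the worst-case iclog header space a full CIL push could need. */
static void demo_cil_init(struct demo_cil *cil, long hard_limit,
			  long iclog_size, long hdr_size)
{
	long payload = iclog_size - hdr_size;
	long nr_hdrs;

	assert(payload > 0);
	nr_hdrs = (hard_limit + payload - 1) / payload;	/* round up */
	atomic_init(&cil->iclog_hdr_bytes_needed, nr_hdrs * hdr_size);
	cil->ctx_reserved = 0;
}

/*
 * At commit time: if the CIL still lacks header space, take the whole
 * header reservation of this ticket in one atomic update. Returns the
 * number of bytes stolen.
 */
static long demo_steal_iclog_hdrs(struct demo_cil *cil, struct demo_ticket *tic)
{
	long steal;

	if (atomic_load(&cil->iclog_hdr_bytes_needed) <= 0)
		return 0;	/* fully reserved: no atomic update needed */

	steal = tic->iclog_hdr_bytes;
	atomic_fetch_sub(&cil->iclog_hdr_bytes_needed, steal);
	tic->iclog_hdr_bytes = 0;
	tic->curr_res -= steal;
	cil->ctx_reserved += steal;
	return steal;
}
```

      Taking the entire per-ticket amount can overshoot the target, as
      the counter is allowed to go negative here; that trades a little
      surplus reservation for fewer atomic operations in the fast path.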
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      31151cc3
    • D
      xfs: lift init CIL reservation out of xc_cil_lock · 12380d23
      Dave Chinner committed
      The xc_cil_lock is the most highly contended lock in XFS now. To
      start the process of getting rid of it, lift the initial reservation
      of the CIL log space out from under the xc_cil_lock.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      12380d23
    • D
      xfs: use the CIL space used counter for emptiness checks · 88591e7f
      Dave Chinner committed
      In the next patches we are going to make the CIL list itself
      per-cpu, and so we cannot use list_empty() to check if the list is
      empty. Replace the list_empty() checks with a flag in the CIL to
      indicate we have committed at least one transaction to the CIL and
      hence the CIL is not empty.
      
      We need this flag to be an atomic so that we can clear it without
      holding any locks in the commit fast path, but we also need to be
      careful to avoid atomic operations in the fast path. Hence we use
      the fact that test_bit() is not an atomic op to first check if the
      flag is set and then run the atomic test_and_clear_bit() operation
      to clear it and steal the initial unit reservation for the CIL
      context checkpoint.
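
      The same "cheap read first, atomic clear second" pattern can be
      shown with C11 atomics in userspace. This is only an analogue:
      the kernel uses test_bit() and test_and_clear_bit(), whereas the
      sketch below uses a relaxed load followed by atomic_fetch_and(),
      and the DEMO_CIL_EMPTY flag and demo_* functions are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_CIL_EMPTY	0x1u

/* Starts out empty, as a freshly switched-in context would. */
static atomic_uint demo_cil_flags = DEMO_CIL_EMPTY;

/*
 * Called from the commit fast path. Returns true for exactly one caller
 * per context: the one that flips the CIL from empty to non-empty and
 * therefore gets to steal the initial unit reservation.
 */
static bool demo_claim_first_commit(void)
{
	unsigned int old;

	/* Plain read: almost every commit sees the bit clear and stops here. */
	if (!(atomic_load_explicit(&demo_cil_flags, memory_order_relaxed) &
	      DEMO_CIL_EMPTY))
		return false;

	/* Atomic read-modify-write only when the bit looked set. */
	old = atomic_fetch_and(&demo_cil_flags, ~DEMO_CIL_EMPTY);
	return (old & DEMO_CIL_EMPTY) != 0;
}

/* Called when a push switches to a new, empty context. */
static void demo_switch_context(void)
{
	atomic_fetch_or(&demo_cil_flags, DEMO_CIL_EMPTY);
}
```

      If two committers race past the plain read, only one of them sees
      the bit still set in the value returned by the atomic operation,
      so the reservation is stolen once.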
      
      When we are switching to a new context in a push, we place the
      setting of the XLOG_CIL_EMPTY flag under the xc_push_lock. This
      allows all the other places that need to check whether the CIL is
      empty to use test_bit() and still be serialised correctly with the
      CIL context swaps that set the bit.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      88591e7f
    • D
      xfs: prevent a UAF when log IO errors race with unmount · 7561cea5
      Darrick J. Wong committed
      KASAN reported the following use after free bug when running
      generic/475:
      
       XFS (dm-0): Mounting V5 Filesystem
       XFS (dm-0): Starting recovery (logdev: internal)
       XFS (dm-0): Ending recovery (logdev: internal)
       Buffer I/O error on dev dm-0, logical block 20639616, async page read
       Buffer I/O error on dev dm-0, logical block 20639617, async page read
       XFS (dm-0): log I/O error -5
       XFS (dm-0): Filesystem has been shut down due to log error (0x2).
       XFS (dm-0): Unmounting Filesystem
       XFS (dm-0): Please unmount the filesystem and rectify the problem(s).
       ==================================================================
       BUG: KASAN: use-after-free in do_raw_spin_lock+0x246/0x270
       Read of size 4 at addr ffff888109dd84c4 by task 3:1H/136
      
       CPU: 3 PID: 136 Comm: 3:1H Not tainted 5.19.0-rc4-xfsx #rc4 8e53ab5ad0fddeb31cee5e7063ff9c361915a9c4
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
       Workqueue: xfs-log/dm-0 xlog_ioend_work [xfs]
       Call Trace:
        <TASK>
        dump_stack_lvl+0x34/0x44
        print_report.cold+0x2b8/0x661
        ? do_raw_spin_lock+0x246/0x270
        kasan_report+0xab/0x120
        ? do_raw_spin_lock+0x246/0x270
        do_raw_spin_lock+0x246/0x270
        ? rwlock_bug.part.0+0x90/0x90
        xlog_force_shutdown+0xf6/0x370 [xfs 4ad76ae0d6add7e8183a553e624c31e9ed567318]
        xlog_ioend_work+0x100/0x190 [xfs 4ad76ae0d6add7e8183a553e624c31e9ed567318]
        process_one_work+0x672/0x1040
        worker_thread+0x59b/0xec0
        ? __kthread_parkme+0xc6/0x1f0
        ? process_one_work+0x1040/0x1040
        ? process_one_work+0x1040/0x1040
        kthread+0x29e/0x340
        ? kthread_complete_and_exit+0x20/0x20
        ret_from_fork+0x1f/0x30
        </TASK>
      
       Allocated by task 154099:
        kasan_save_stack+0x1e/0x40
        __kasan_kmalloc+0x81/0xa0
        kmem_alloc+0x8d/0x2e0 [xfs]
        xlog_cil_init+0x1f/0x540 [xfs]
        xlog_alloc_log+0xd1e/0x1260 [xfs]
        xfs_log_mount+0xba/0x640 [xfs]
        xfs_mountfs+0xf2b/0x1d00 [xfs]
        xfs_fs_fill_super+0x10af/0x1910 [xfs]
        get_tree_bdev+0x383/0x670
        vfs_get_tree+0x7d/0x240
        path_mount+0xdb7/0x1890
        __x64_sys_mount+0x1fa/0x270
        do_syscall_64+0x2b/0x80
        entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
       Freed by task 154151:
        kasan_save_stack+0x1e/0x40
        kasan_set_track+0x21/0x30
        kasan_set_free_info+0x20/0x30
        ____kasan_slab_free+0x110/0x190
        slab_free_freelist_hook+0xab/0x180
        kfree+0xbc/0x310
        xlog_dealloc_log+0x1b/0x2b0 [xfs]
        xfs_unmountfs+0x119/0x200 [xfs]
        xfs_fs_put_super+0x6e/0x2e0 [xfs]
        generic_shutdown_super+0x12b/0x3a0
        kill_block_super+0x95/0xd0
        deactivate_locked_super+0x80/0x130
        cleanup_mnt+0x329/0x4d0
        task_work_run+0xc5/0x160
        exit_to_user_mode_prepare+0xd4/0xe0
        syscall_exit_to_user_mode+0x1d/0x40
        entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      This appears to be a race between the unmount process, which frees the
      CIL and waits for in-flight iclog IO; and the iclog IO completion.  When
      generic/475 runs, it starts fsstress in the background, waits a few
      seconds, and substitutes a dm-error device to simulate a disk falling
      out of a machine.  If the fsstress encounters EIO on a pure data write,
      it will exit but the filesystem will still be online.
      
      The next thing the test does is unmount the filesystem, which tries to
      clean the log, free the CIL, and wait for iclog IO completion.  If an
      iclog was being written when the dm-error switch occurred, it can race
      with log unmounting as follows:
      
      Thread 1				Thread 2
      
      					xfs_log_unmount
      					xfs_log_clean
      					xfs_log_quiesce
      xlog_ioend_work
      <observe error>
      xlog_force_shutdown
      test_and_set_bit(XLOG_IOERROR)
      					xfs_log_force
      					<log is shut down, nop>
      					xfs_log_umount_write
      					<log is shut down, nop>
      					xlog_dealloc_log
      					xlog_cil_destroy
      					<wait for iclogs>
      spin_lock(&log->l_cilp->xc_push_lock)
      <KABOOM>
      
      Therefore, free the CIL after waiting for the iclogs to complete.  I
      /think/ this race has existed for quite a few years now, though I don't
      remember the ~2014 era logging code well enough to know if it was a real
      threat then or if the actual race was exposed only more recently.
      
      Fixes: ac983517 ("xfs: don't sleep in xlog_cil_force_lsn on shutdown")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      7561cea5
  5. 29 Jun, 2022 3 commits
    • D
      xfs: dont treat rt extents beyond EOF as eofblocks to be cleared · 8944c6fb
      Darrick J. Wong committed
      On a system with a realtime volume and a 28k realtime extent,
      generic/491 fails because the test opens a file on a frozen filesystem
      and closing it causes xfs_release -> xfs_can_free_eofblocks to
      mistakenly think that the blocks of the realtime extent beyond EOF
      are posteof blocks to be freed.  Realtime extents cannot be partially
      unmapped, so this is pointless.  Worse yet, this triggers posteof
      cleanup, which stalls on a transaction allocation, which is why the test
      fails.
      
      Teach the predicate to account for realtime extents properly.
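
      One way to picture the adjustment is to round the end-of-file
      block up to a realtime extent boundary before looking for blocks
      to free. The helper below is a userspace sketch with hypothetical
      names and parameters, not the kernel predicate; it assumes the
      realtime extent size is given in filesystem blocks (a 28k extent
      on 4k blocks is 7 blocks).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round x up to the next multiple of step. */
static uint64_t demo_roundup_u64(uint64_t x, uint32_t step)
{
	assert(step > 0);
	return ((x + step - 1) / step) * step;
}

/*
 * Return the first file block that may be treated as a post-EOF block.
 * For realtime files the answer is pushed out to the end of the
 * realtime extent containing EOF, because that extent cannot be
 * partially unmapped.
 */
static uint64_t demo_first_posteof_block(uint64_t isize_bytes,
					 uint32_t blocksize,
					 bool is_realtime,
					 uint32_t rextsize_blocks)
{
	uint64_t end_fsb;

	assert(blocksize > 0);
	end_fsb = (isize_bytes + blocksize - 1) / blocksize;
	if (is_realtime && rextsize_blocks > 1)
		end_fsb = demo_roundup_u64(end_fsb, rextsize_blocks);
	return end_fsb;
}
```

      For a 5000-byte realtime file with these sizes, blocks 2 through 6
      belong to the extent that holds EOF, so nothing before block 7 is
      a candidate for freeing.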
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      8944c6fb
    • D
      xfs: don't hold xattr leaf buffers across transaction rolls · e53bcffa
      Darrick J. Wong committed
      Now that we've established (again!) that empty xattr leaf buffers are
      ok, we no longer need to bhold them to transactions when we're creating
      new leaf blocks.  Get rid of the entire mechanism, which should simplify
      the xattr code quite a bit.
      
      The original justification for using bhold here was to prevent the AIL
      from trying to write the empty leaf block into the fs during the brief
      time that we release the buffer lock.  The reason for /that/ was to
      prevent recovery from tripping over the empty ondisk block.
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      e53bcffa
    • D
      xfs: empty xattr leaf header blocks are not corruption · 7be3bd88
      Darrick J. Wong committed
      TLDR: Revert commit 51e6104f ("xfs: detect empty attr leaf blocks in
      xfs_attr3_leaf_verify") because it was wrong.
      
      Every now and then we get a corruption report from the kernel or
      xfs_repair about empty leaf blocks in the extended attribute structure.
      We've long thought that these shouldn't be possible, but prior to 5.18
      one would shake loose in the recoveryloop fstests about once a month.
      
      A new addition to the xattr leaf block verifier in 5.19-rc1 makes this
      happen every 7 minutes on my testing cloud.  I added a ton of logging to
      detect any time we set the header count on an xattr leaf block to zero.
      This produced the following dmesg output on generic/388:
      
      XFS (sda4): ino 0x21fcbaf leaf 0x129bf78 hdcount==0!
      Call Trace:
       <TASK>
       dump_stack_lvl+0x34/0x44
       xfs_attr3_leaf_create+0x187/0x230
       xfs_attr_shortform_to_leaf+0xd1/0x2f0
       xfs_attr_set_iter+0x73e/0xa90
       xfs_xattri_finish_update+0x45/0x80
       xfs_attr_finish_item+0x1b/0xd0
       xfs_defer_finish_noroll+0x19c/0x770
       __xfs_trans_commit+0x153/0x3e0
       xfs_attr_set+0x36b/0x740
       xfs_xattr_set+0x89/0xd0
       __vfs_setxattr+0x67/0x80
       __vfs_setxattr_noperm+0x6e/0x120
       vfs_setxattr+0x97/0x180
       setxattr+0x88/0xa0
       path_setxattr+0xc3/0xe0
       __x64_sys_setxattr+0x27/0x30
       do_syscall_64+0x35/0x80
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      So now we know that someone is creating empty xattr leaf blocks as part
      of converting a sf xattr structure into a leaf xattr structure.  The
      conversion routine logs any existing sf attributes in the same
      transaction that creates the leaf block, so we know this is a setxattr
      to a file that has no attributes at all.
      
      Next, g/388 calls the shutdown ioctl and cycles the mount to trigger log
      recovery.  I also augmented buffer item recovery to call ->verify_struct
      on any attr leaf blocks and complain if it finds a failure:
      
      XFS (sda4): Unmounting Filesystem
      XFS (sda4): Mounting V5 Filesystem
      XFS (sda4): Starting recovery (logdev: internal)
      XFS (sda4): xattr leaf daddr 0x129bf78 hdrcount == 0!
      Call Trace:
       <TASK>
       dump_stack_lvl+0x34/0x44
       xfs_attr3_leaf_verify+0x3b8/0x420
       xlog_recover_buf_commit_pass2+0x60a/0x6c0
       xlog_recover_items_pass2+0x4e/0xc0
       xlog_recover_commit_trans+0x33c/0x350
       xlog_recovery_process_trans+0xa5/0xe0
       xlog_recover_process_data+0x8d/0x140
       xlog_do_recovery_pass+0x19b/0x720
       xlog_do_log_recovery+0x62/0xc0
       xlog_do_recover+0x33/0x1d0
       xlog_recover+0xda/0x190
       xfs_log_mount+0x14c/0x360
       xfs_mountfs+0x517/0xa60
       xfs_fs_fill_super+0x6bc/0x950
       get_tree_bdev+0x175/0x280
       vfs_get_tree+0x1a/0x80
       path_mount+0x6f5/0xaa0
       __x64_sys_mount+0x103/0x140
       do_syscall_64+0x35/0x80
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      RIP: 0033:0x7fc61e241eae
      
      And a moment later, the _delwri_submit of the recovered buffers trips
      the same verifier and recovery fails:
      
      XFS (sda4): Metadata corruption detected at xfs_attr3_leaf_verify+0x393/0x420 [xfs], xfs_attr3_leaf block 0x129bf78
      XFS (sda4): Unmount and run xfs_repair
      XFS (sda4): First 128 bytes of corrupted metadata buffer:
      00000000: 00 00 00 00 00 00 00 00 3b ee 00 00 00 00 00 00  ........;.......
      00000010: 00 00 00 00 01 29 bf 78 00 00 00 00 00 00 00 00  .....).x........
      00000020: a5 1b d0 02 b2 9a 49 df 8e 9c fb 8d f8 31 3e 9d  ......I......1>.
      00000030: 00 00 00 00 02 1f cb af 00 00 00 00 10 00 00 00  ................
      00000040: 00 50 0f b0 00 00 00 00 00 00 00 00 00 00 00 00  .P..............
      00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      XFS (sda4): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply+0x37f/0x3b0 [xfs] (fs/xfs/xfs_buf.c:1518).  Shutting down filesystem.
      XFS (sda4): Please unmount the filesystem and rectify the problem(s)
      XFS (sda4): log mount/recovery failed: error -117
      XFS (sda4): log mount failed
      
      I think I see what's going on here -- setxattr is racing with something
      that shuts down the filesystem:
      
      Thread 1				Thread 2
      --------				--------
      xfs_attr_sf_addname
      xfs_attr_shortform_to_leaf
      <create empty leaf>
      xfs_trans_bhold(leaf)
      xattri_dela_state = XFS_DAS_LEAF_ADD
      <roll transaction>
      					<flush log>
      					<shut down filesystem>
      xfs_trans_bhold_release(leaf)
      <discover fs is dead, bail>
      
      Thread 3
      --------
      <cycle mount, start recovery>
      xlog_recover_buf_commit_pass2
      xlog_recover_do_reg_buffer
      <replay empty leaf buffer from recovered buf item>
      xfs_buf_delwri_queue(leaf)
      xfs_buf_delwri_submit
      _xfs_buf_ioapply(leaf)
      xfs_attr3_leaf_write_verify
      <trip over empty leaf buffer>
      <fail recovery>
      
      As you can see, the bhold keeps the leaf buffer locked and thus prevents
      the *AIL* from tripping over the ichdr.count==0 check in the write
      verifier.  Unfortunately, it doesn't prevent the log from getting
      flushed to disk, which sets up log recovery to fail.
      
      So.  It's clear that the kernel has always had the ability to persist
      attr leaf blocks with ichdr.count==0, which means that it's part of the
      ondisk format now.
      
      Unfortunately, this check has been added and removed multiple times
      throughout history.  It first appeared in[1] kernel 3.10 as part of the
      early V5 format patches.  The check was later discovered to break log
      recovery and hence disabled[2] during log recovery in kernel 4.10.
      Simultaneously, the check was added[3] to xfs_repair 4.9.0 to try to
      weed out the empty leaf blocks.  This was still not correct because log
      recovery would recover an empty attr leaf block successfully only for
      regular xattr operations to trip over the empty block during regular
      operation.  Therefore, the check was removed
      entirely[4] in kernel 5.7 but removal of the xfs_repair check was
      forgotten.  The continued complaints from xfs_repair led us to
      mistakenly re-add[5] the verifier check for kernel 5.19.  Remove it
      once again.
      
      [1] 517c2220 ("xfs: add CRCs to attr leaf blocks")
      [2] 2e1d2337 ("xfs: ignore leaf attr ichdr.count in verifier
                         during log replay")
      [3] f7140161 ("xfs_repair: junk leaf attribute if count == 0")
      [4] f28cef9e ("xfs: don't fail verifier on empty attr3 leaf
                         block")
      [5] 51e6104f ("xfs: detect empty attr leaf blocks in
                         xfs_attr3_leaf_verify")
      
      Looking at the rest of the xattr code, it seems that files with empty
      leaf blocks behave as expected -- listxattr reports no attributes;
      getxattr on any xattr returns nothing as expected; removexattr does
      nothing; and setxattr can add attributes just fine.
      
      Original-bug: 517c2220 ("xfs: add CRCs to attr leaf blocks")
      Still-not-fixed-by: 2e1d2337 ("xfs: ignore leaf attr ichdr.count in verifier during log replay")
      Removed-in: f28cef9e ("xfs: don't fail verifier on empty attr3 leaf block")
      Fixes: 51e6104f ("xfs: detect empty attr leaf blocks in xfs_attr3_leaf_verify")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      7be3bd88
  6. 27 Jun, 2022 2 commits
    • D
      xfs: clean up the end of xfs_attri_item_recover · f94e08b6
      Darrick J. Wong committed
      The end of this function could use some cleanup -- the EAGAIN
      conditionals make it harder to figure out what's going on with the
      disposal of xattri_leaf_bp, and the dual error/ret variables aren't
      needed.  Turn the EAGAIN case into a separate block documenting all the
      subtleties of recovering in the middle of an xattr update chain, which
      makes the rest of the prologue much simpler.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      f94e08b6
    • D
      xfs: always free xattri_leaf_bp when cancelling a deferred op · b822ea17
      Darrick J. Wong committed
      While running the following fstest with logged xattrs DISabled, I
      noticed the following:
      
      # FSSTRESS_AVOID="-z -f unlink=1 -f rmdir=1 -f creat=2 -f mkdir=2 -f
      getfattr=3 -f listfattr=3 -f attr_remove=4 -f removefattr=4 -f
      setfattr=20 -f attr_set=60" ./check generic/475
      
      INFO: task u9:1:40 blocked for more than 61 seconds.
            Tainted: G           O      5.19.0-rc2-djwx #rc2
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      task:u9:1            state:D stack:12872 pid:   40 ppid:     2 flags:0x00004000
      Workqueue: xfs-cil/dm-0 xlog_cil_push_work [xfs]
      Call Trace:
       <TASK>
       __schedule+0x2db/0x1110
       schedule+0x58/0xc0
       schedule_timeout+0x115/0x160
       __down_common+0x126/0x210
       down+0x54/0x70
       xfs_buf_lock+0x2d/0xe0 [xfs 0532c1cb1d67dd81d15cb79ac6e415c8dec58f73]
       xfs_buf_item_unpin+0x227/0x3a0 [xfs 0532c1cb1d67dd81d15cb79ac6e415c8dec58f73]
       xfs_trans_committed_bulk+0x18e/0x320 [xfs 0532c1cb1d67dd81d15cb79ac6e415c8dec58f73]
       xlog_cil_committed+0x2ea/0x360 [xfs 0532c1cb1d67dd81d15cb79ac6e415c8dec58f73]
       xlog_cil_push_work+0x60f/0x690 [xfs 0532c1cb1d67dd81d15cb79ac6e415c8dec58f73]
       process_one_work+0x1df/0x3c0
       worker_thread+0x53/0x3b0
       kthread+0xea/0x110
       ret_from_fork+0x1f/0x30
       </TASK>
      
      This appears to be the result of shortform_to_leaf creating a new leaf
      buffer as part of adding an xattr to a file.  The new leaf buffer is
      held and attached to the xfs_attr_intent structure, but then the
      filesystem shuts down.  Instead of the usual path (which adds the attr
      to the held leaf buffer which releases the hold), we instead cancel the
      entire deferred operation.
      
      Unfortunately, xfs_attr_cancel_item doesn't release any attached leaf
      buffers, so we leak the locked buffer.  The CIL cannot do anything
      about that, and hangs.  Fix this by teaching it to release leaf buffers,
      and make XFS a little more careful about not leaving a dangling
      reference.
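
      The shape of that fix can be sketched in userspace as follows.
      The demo_buf and demo_attr_intent types and every function name
      are hypothetical stand-ins, and the "buffer" is just a pair of
      flags; the point is only that the cancel path releases whatever
      leaf buffer is attached and clears the pointer afterwards.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical buffer: locked and held while attached to an intent. */
struct demo_buf {
	int locked;
	int held;
};

/* Hypothetical deferred xattr operation with an optional leaf buffer. */
struct demo_attr_intent {
	struct demo_buf *leaf_bp;
};

/* Drop the hold and the lock, as releasing a held buffer would. */
static void demo_buf_relse(struct demo_buf *bp)
{
	assert(bp->locked);
	bp->held = 0;
	bp->locked = 0;
}

/* Release the attached leaf buffer, if any, and forget the pointer. */
static void demo_attr_release_leaf(struct demo_attr_intent *attr)
{
	if (!attr->leaf_bp)
		return;
	demo_buf_relse(attr->leaf_bp);
	attr->leaf_bp = NULL;	/* no dangling reference left behind */
}

/* Cancel path: must not leak the locked buffer. */
static void demo_attr_cancel_item(struct demo_attr_intent *attr)
{
	demo_attr_release_leaf(attr);
	/* A real cancel path would also free the intent itself. */
}
```

      Clearing the pointer makes a second cancel or a later release a
      harmless no-op instead of a double release.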
      
      The prologue of xfs_attri_item_recover is (in this author's opinion) a
      little hard to figure out, so I'll clean that up in the next patch.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      b822ea17