1. 13 Jul, 2022 2 commits
  2. 10 Jul, 2022 1 commit
    • D
      xfs: make inode attribute forks a permanent part of struct xfs_inode · 2ed5b09b
      Darrick J. Wong committed
      Syzkaller reported a UAF bug a while back:
      
      ==================================================================
      BUG: KASAN: use-after-free in xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127
      Read of size 4 at addr ffff88802cec919c by task syz-executor262/2958
      
      CPU: 2 PID: 2958 Comm: syz-executor262 Not tainted
      5.15.0-0.30.3-20220406_1406 #3
      Hardware name: Red Hat KVM, BIOS 1.13.0-2.module+el8.3.0+7860+a7792d29
      04/01/2014
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0x82/0xa9 lib/dump_stack.c:106
       print_address_description.constprop.9+0x21/0x2d5 mm/kasan/report.c:256
       __kasan_report mm/kasan/report.c:442 [inline]
       kasan_report.cold.14+0x7f/0x11b mm/kasan/report.c:459
       xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127
       xfs_attr_get+0x378/0x4c2 fs/xfs/libxfs/xfs_attr.c:159
       xfs_xattr_get+0xe3/0x150 fs/xfs/xfs_xattr.c:36
       __vfs_getxattr+0xdf/0x13d fs/xattr.c:399
       cap_inode_need_killpriv+0x41/0x5d security/commoncap.c:300
       security_inode_need_killpriv+0x4c/0x97 security/security.c:1408
       dentry_needs_remove_privs.part.28+0x21/0x63 fs/inode.c:1912
       dentry_needs_remove_privs+0x80/0x9e fs/inode.c:1908
       do_truncate+0xc3/0x1e0 fs/open.c:56
       handle_truncate fs/namei.c:3084 [inline]
       do_open fs/namei.c:3432 [inline]
       path_openat+0x30ab/0x396d fs/namei.c:3561
       do_filp_open+0x1c4/0x290 fs/namei.c:3588
       do_sys_openat2+0x60d/0x98c fs/open.c:1212
       do_sys_open+0xcf/0x13c fs/open.c:1228
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0x0
      RIP: 0033:0x7f7ef4bb753d
      Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48
      89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73
      01 c3 48 8b 0d 1b 79 2c 00 f7 d8 64 89 01 48
      RSP: 002b:00007f7ef52c2ed8 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
      RAX: ffffffffffffffda RBX: 0000000000404148 RCX: 00007f7ef4bb753d
      RDX: 00007f7ef4bb753d RSI: 0000000000000000 RDI: 0000000020004fc0
      RBP: 0000000000404140 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 0030656c69662f2e
      R13: 00007ffd794db37f R14: 00007ffd794db470 R15: 00007f7ef52c2fc0
       </TASK>
      
      Allocated by task 2953:
       kasan_save_stack+0x19/0x38 mm/kasan/common.c:38
       kasan_set_track mm/kasan/common.c:46 [inline]
       set_alloc_info mm/kasan/common.c:434 [inline]
       __kasan_slab_alloc+0x68/0x7c mm/kasan/common.c:467
       kasan_slab_alloc include/linux/kasan.h:254 [inline]
       slab_post_alloc_hook mm/slab.h:519 [inline]
       slab_alloc_node mm/slub.c:3213 [inline]
       slab_alloc mm/slub.c:3221 [inline]
       kmem_cache_alloc+0x11b/0x3eb mm/slub.c:3226
       kmem_cache_zalloc include/linux/slab.h:711 [inline]
       xfs_ifork_alloc+0x25/0xa2 fs/xfs/libxfs/xfs_inode_fork.c:287
       xfs_bmap_add_attrfork+0x3f2/0x9b1 fs/xfs/libxfs/xfs_bmap.c:1098
       xfs_attr_set+0xe38/0x12a7 fs/xfs/libxfs/xfs_attr.c:746
       xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59
       __vfs_setxattr+0x11b/0x177 fs/xattr.c:180
       __vfs_setxattr_noperm+0x128/0x5e0 fs/xattr.c:214
       __vfs_setxattr_locked+0x1d4/0x258 fs/xattr.c:275
       vfs_setxattr+0x154/0x33d fs/xattr.c:301
       setxattr+0x216/0x29f fs/xattr.c:575
       __do_sys_fsetxattr fs/xattr.c:632 [inline]
       __se_sys_fsetxattr fs/xattr.c:621 [inline]
       __x64_sys_fsetxattr+0x243/0x2fe fs/xattr.c:621
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0x0
      
      Freed by task 2949:
       kasan_save_stack+0x19/0x38 mm/kasan/common.c:38
       kasan_set_track+0x1c/0x21 mm/kasan/common.c:46
       kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:360
       ____kasan_slab_free mm/kasan/common.c:366 [inline]
       ____kasan_slab_free mm/kasan/common.c:328 [inline]
       __kasan_slab_free+0xe2/0x10e mm/kasan/common.c:374
       kasan_slab_free include/linux/kasan.h:230 [inline]
       slab_free_hook mm/slub.c:1700 [inline]
       slab_free_freelist_hook mm/slub.c:1726 [inline]
       slab_free mm/slub.c:3492 [inline]
       kmem_cache_free+0xdc/0x3ce mm/slub.c:3508
       xfs_attr_fork_remove+0x8d/0x132 fs/xfs/libxfs/xfs_attr_leaf.c:773
       xfs_attr_sf_removename+0x5dd/0x6cb fs/xfs/libxfs/xfs_attr_leaf.c:822
       xfs_attr_remove_iter+0x68c/0x805 fs/xfs/libxfs/xfs_attr.c:1413
       xfs_attr_remove_args+0xb1/0x10d fs/xfs/libxfs/xfs_attr.c:684
       xfs_attr_set+0xf1e/0x12a7 fs/xfs/libxfs/xfs_attr.c:802
       xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59
       __vfs_removexattr+0x106/0x16a fs/xattr.c:468
       cap_inode_killpriv+0x24/0x47 security/commoncap.c:324
       security_inode_killpriv+0x54/0xa1 security/security.c:1414
       setattr_prepare+0x1a6/0x897 fs/attr.c:146
       xfs_vn_change_ok+0x111/0x15e fs/xfs/xfs_iops.c:682
       xfs_vn_setattr_size+0x5f/0x15a fs/xfs/xfs_iops.c:1065
       xfs_vn_setattr+0x125/0x2ad fs/xfs/xfs_iops.c:1093
       notify_change+0xae5/0x10a1 fs/attr.c:410
       do_truncate+0x134/0x1e0 fs/open.c:64
       handle_truncate fs/namei.c:3084 [inline]
       do_open fs/namei.c:3432 [inline]
       path_openat+0x30ab/0x396d fs/namei.c:3561
       do_filp_open+0x1c4/0x290 fs/namei.c:3588
       do_sys_openat2+0x60d/0x98c fs/open.c:1212
       do_sys_open+0xcf/0x13c fs/open.c:1228
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0x0
      
      The buggy address belongs to the object at ffff88802cec9188
       which belongs to the cache xfs_ifork of size 40
      The buggy address is located 20 bytes inside of
       40-byte region [ffff88802cec9188, ffff88802cec91b0)
      The buggy address belongs to the page:
      page:00000000c3af36a1 refcount:1 mapcount:0 mapping:0000000000000000
      index:0x0 pfn:0x2cec9
      flags: 0xfffffc0000200(slab|node=0|zone=1|lastcpupid=0x1fffff)
      raw: 000fffffc0000200 ffffea00009d2580 0000000600000006 ffff88801a9ffc80
      raw: 0000000000000000 0000000080490049 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff88802cec9080: fb fb fb fc fc fa fb fb fb fb fc fc fb fb fb fb
       ffff88802cec9100: fb fc fc fb fb fb fb fb fc fc fb fb fb fb fb fc
      >ffff88802cec9180: fc fa fb fb fb fb fc fc fa fb fb fb fb fc fc fb
                                  ^
       ffff88802cec9200: fb fb fb fb fc fc fb fb fb fb fb fc fc fb fb fb
       ffff88802cec9280: fb fb fc fc fa fb fb fb fb fc fc fa fb fb fb fb
      ==================================================================
      
      The root cause of this bug is the unlocked access to xfs_inode.i_afp
      from the getxattr code paths while trying to determine which ILOCK mode
      to use to stabilize the xattr data.  Unfortunately, the VFS does not
      acquire i_rwsem when vfs_getxattr (or listxattr) call into the
      filesystem, which means that getxattr can race with a removexattr that's
      tearing down the attr fork and crash:
      
      xfs_attr_set:                          xfs_attr_get:
      xfs_attr_fork_remove:                  xfs_ilock_attr_map_shared:
      
      xfs_idestroy_fork(ip->i_afp);
      kmem_cache_free(xfs_ifork_cache, ip->i_afp);
      
                                             if (ip->i_afp &&
      
      ip->i_afp = NULL;
      
                                                 xfs_need_iread_extents(ip->i_afp))
                                             <KABOOM>
      
      ip->i_forkoff = 0;
      
      Regrettably, the VFS is much more lax about i_rwsem and getxattr than
      is immediately obvious -- not only does it not guarantee that we hold
      i_rwsem, it actually doesn't guarantee that we *don't* hold it either.
      The getxattr system call won't acquire the lock before calling XFS, but
      the file capabilities code calls getxattr with and without i_rwsem held
      to determine if the "security.capabilities" xattr is set on the file.
      
      Fixing the VFS locking requires a treewide investigation into every code
      path that could touch an xattr and what i_rwsem state it expects or sets
      up.  That could take years or even prove impossible; fortunately, we
      can fix this UAF problem inside XFS.
      
      An earlier version of this patch used smp_wmb in xfs_attr_fork_remove to
      ensure that i_forkoff is always zeroed before i_afp is set to null and
      changed the read paths to use smp_rmb before accessing i_forkoff and
      i_afp, which avoided these UAF problems.  However, the patch author was
      too busy dealing with other problems in the meantime, and by the time he
      came back to this issue, the situation had changed a bit.
      
      On a modern system with selinux, each inode will always have at least
      one xattr for the selinux label, so it doesn't make much sense to keep
      incurring the extra pointer dereference.  Furthermore, Allison's
      upcoming parent pointer patchset will also cause nearly every inode in
      the filesystem to have extended attributes.  Therefore, make the inode
      attribute fork structure part of struct xfs_inode, at a cost of 40 more
      bytes.
      
      This patch adds a clunky if_present field where necessary to maintain
      the existing logic of xattr fork null pointer testing in the existing
      codebase.  The next patch switches the logic over to XFS_IFORK_Q and it
      all goes away.
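As a rough, single-threaded illustration of the lifetime change described above, the sketch below embeds the attr fork in the inode and tracks its existence with a flag instead of a separately allocated pointer. The `demo_` names and the field layout are hypothetical simplifications and are not the kernel's definitions. The sketch does not model locking or concurrency. It only shows that removing the fork no longer frees memory that a racing reader might still dereference.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct xfs_ifork. */
struct demo_ifork {
	bool		if_present;	/* transitional "fork exists" flag */
	unsigned int	if_nextents;	/* number of mapped extents */
};

/* Hypothetical stand-in for struct xfs_inode with the fork embedded. */
struct demo_inode {
	struct demo_ifork	i_af;		/* attr fork lives inside the inode */
	unsigned char		i_forkoff;	/* 0 means no attr fork */
};

static void demo_attr_fork_add(struct demo_inode *ip, unsigned char forkoff)
{
	assert(forkoff != 0);
	ip->i_af.if_present = true;
	ip->i_af.if_nextents = 0;
	ip->i_forkoff = forkoff;
}

static void demo_attr_fork_remove(struct demo_inode *ip)
{
	/* Nothing is freed: the fork's memory lives as long as the inode. */
	ip->i_af.if_present = false;
	ip->i_af.if_nextents = 0;
	ip->i_forkoff = 0;
}

static bool demo_inode_has_attr_fork(struct demo_inode *ip)
{
	/* A racing reader could see a stale flag, but never freed memory. */
	return ip->i_af.if_present;
}
```

With the pointer-based layout, the equivalent of `demo_attr_fork_remove()` had to free the fork, which is what opened the use-after-free window shown in the KASAN report.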
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      2ed5b09b
  3. 04 May, 2022 1 commit
  4. 11 Apr, 2022 1 commit
  5. 30 Mar, 2022 1 commit
    • D
      xfs: aborting inodes on shutdown may need buffer lock · d2d7c047
      Dave Chinner committed
      Most buffer io list operations are run with the bp->b_lock held, but
      xfs_iflush_abort() can be called without the buffer lock being held
      resulting in inodes being removed from the buffer list while other
      list operations are occurring. This causes problems with corrupted
      bp->b_io_list inode lists during filesystem shutdown, leading to
      traversals that never end, double removals from the AIL, etc.
      
      Fix this by passing the buffer to xfs_iflush_abort() if we have
      it locked. If the inode is attached to the buffer, we're going to
      have to remove it from the buffer list and we'd have to get the
      buffer off the inode log item to do that anyway.
      
      If we don't have a buffer passed in (e.g. from xfs_reclaim_inode())
      then we can determine if the inode has a log item and if it is
      attached to a buffer before we do anything else. If it does have an
      attached buffer, we can lock it safely (because the inode has a
      reference to it) and then perform the inode abort.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      d2d7c047
  6. 20 Mar, 2022 1 commit
    • D
      xfs: xfs_is_shutdown vs xlog_is_shutdown cage fight · 01728b44
      Dave Chinner committed
      I've been chasing a recent resurgence in generic/388 recovery
      failure and/or corruption events. The events have largely been
      uninitialised inode chunks being tripped over in log recovery
      such as:
      
       XFS (pmem1): User initiated shutdown received.
       pmem1: writeback error on inode 12621949, offset 1019904, sector 12968096
       XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/xfs/xfs_fsops.c:500).  Shutting down filesystem.
       XFS (pmem1): Please unmount the filesystem and rectify the problem(s)
       XFS (pmem1): Unmounting Filesystem
       XFS (pmem1): Mounting V5 Filesystem
       XFS (pmem1): Starting recovery (logdev: internal)
       XFS (pmem1): bad inode magic/vsn daddr 8723584 #0 (magic=1818)
       XFS (pmem1): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x851c80 xfs_inode_buf_verify
       XFS (pmem1): Unmount and run xfs_repair
       XFS (pmem1): First 128 bytes of corrupted metadata buffer:
       00000000: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000010: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000020: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000030: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000040: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000050: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000060: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000070: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       XFS (pmem1): metadata I/O error in "xlog_recover_items_pass2+0x52/0xc0" at daddr 0x851c80 len 32 error 117
       XFS (pmem1): log mount/recovery failed: error -117
       XFS (pmem1): log mount failed
      
      There have been isolated random other issues, too - xfs_repair fails
      because it finds some corruption in symlink blocks, rmap
      inconsistencies, etc - but they are nowhere near as common as the
      uninitialised inode chunk failure.
      
      The problem has clearly happened at runtime before recovery has run;
      I can see the ICREATE log item in the log shortly before the
      actively recovered range of the log. This means the ICREATE was
      definitely created and written to the log, but for some reason the
      tail of the log has been moved past the ordered buffer log item that
      tracks INODE_ALLOC buffers and, supposedly, prevents the tail of the
      log moving past the ICREATE log item before the inode chunk buffer
      is written to disk.
      
      Tracing the fsstress processes that are running when the filesystem
      shut down immediately pin-pointed the problem:
      
      user shutdown marks xfs_mount as shutdown
      
               godown-213341 [008]  6398.022871: console:              [ 6397.915392] XFS (pmem1): User initiated shutdown received.
      .....
      
      aild tries to push ordered inode cluster buffer
      
        xfsaild/pmem1-213314 [001]  6398.022974: xfs_buf_trylock:      dev 259:1 daddr 0x851c80 bbcount 0x20 hold 16 pincount 0 lock 0 flags DONE|INODES|PAGES caller xfs_inode_item_push+0x8e
        xfsaild/pmem1-213314 [001]  6398.022976: xfs_ilock_nowait:     dev 259:1 ino 0x851c80 flags ILOCK_SHARED caller xfs_iflush_cluster+0xae
      
      xfs_iflush_cluster() checks xfs_is_shutdown(), returns true,
      calls xfs_iflush_abort() to kill writeback of the inode.
      Inode is removed from AIL, drops cluster buffer reference.
      
        xfsaild/pmem1-213314 [001]  6398.022977: xfs_ail_delete:       dev 259:1 lip 0xffff88880247ed80 old lsn 7/20344 new lsn 7/21000 type XFS_LI_INODE flags IN_AIL
        xfsaild/pmem1-213314 [001]  6398.022978: xfs_buf_rele:         dev 259:1 daddr 0x851c80 bbcount 0x20 hold 17 pincount 0 lock 0 flags DONE|INODES|PAGES caller xfs_iflush_abort+0xd7
      
      .....
      
      All inodes on cluster buffer are aborted, then the cluster buffer
      itself is aborted and removed from the AIL *without writeback*:
      
      xfsaild/pmem1-213314 [001]  6398.023011: xfs_buf_error_relse:  dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_ioend_fail+0x33
         xfsaild/pmem1-213314 [001]  6398.023012: xfs_ail_delete:       dev 259:1 lip 0xffff8888053efde8 old lsn 7/20344 new lsn 7/20344 type XFS_LI_BUF flags IN_AIL
      
      The inode buffer was at 7/20344 when it was removed from the AIL.
      
         xfsaild/pmem1-213314 [001]  6398.023012: xfs_buf_item_relse:   dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_item_done+0x31
         xfsaild/pmem1-213314 [001]  6398.023012: xfs_buf_rele:         dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_item_relse+0x39
      
      .....
      
      Userspace is still running, doing stuff. An fsstress process runs
      syncfs() or sync() and we end up in sync_fs_one_sb() which issues
      a log force. This pushes on the CIL:
      
              fsstress-213322 [001]  6398.024430: xfs_fs_sync_fs:       dev 259:1 m_features 0x20000000019ff6e9 opstate (clean|shutdown|inodegc|blockgc) s_flags 0x70810000 caller sync_fs_one_sb+0x26
              fsstress-213322 [001]  6398.024430: xfs_log_force:        dev 259:1 lsn 0x0 caller xfs_fs_sync_fs+0x82
              fsstress-213322 [001]  6398.024430: xfs_log_force:        dev 259:1 lsn 0x5f caller xfs_log_force+0x7c
                 <...>-194402 [001]  6398.024467: kmem_alloc:           size 176 flags 0x14 caller xlog_cil_push_work+0x9f
      
      And the CIL fills up iclogs with pending changes. This picks up
      the current tail from the AIL:
      
                 <...>-194402 [001]  6398.024497: xlog_iclog_get_space: dev 259:1 state XLOG_STATE_ACTIVE refcnt 1 offset 0 lsn 0x0 flags  caller xlog_write+0x149
                 <...>-194402 [001]  6398.024498: xlog_iclog_switch:    dev 259:1 state XLOG_STATE_ACTIVE refcnt 1 offset 0 lsn 0x700005408 flags  caller xlog_state_get_iclog_space+0x37e
                 <...>-194402 [001]  6398.024521: xlog_iclog_release:   dev 259:1 state XLOG_STATE_WANT_SYNC refcnt 1 offset 32256 lsn 0x700005408 flags  caller xlog_write+0x5f9
                 <...>-194402 [001]  6398.024522: xfs_log_assign_tail_lsn: dev 259:1 new tail lsn 7/21000, old lsn 7/20344, last sync 7/21448
      
      And it moves the tail of the log to 7/21000 from 7/20344. This
      *moves the tail of the log beyond the ICREATE transaction* that was
      at 7/20344 and pinned by the inode cluster buffer that was cancelled
      above.
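The tail movement can be reproduced with a small model, offered here only as a sketch. It assumes an LSN packs the cycle number into the upper 32 bits and the block number into the lower 32 bits (consistent with `0x700005408` in the trace), and that the log tail is the lowest LSN still tracked in the AIL, falling back to the last synced LSN when nothing lower is tracked. The `demo_` helpers are hypothetical and are not kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t demo_lsn_t;

/* Pack a cycle/block pair into one comparable 64-bit value. */
static demo_lsn_t demo_make_lsn(uint32_t cycle, uint32_t block)
{
	return ((demo_lsn_t)cycle << 32) | block;
}

/*
 * Model of tail selection: the lowest LSN among the tracked items, or
 * last_sync when no tracked item is lower than it.
 */
static demo_lsn_t demo_tail_lsn(const demo_lsn_t *ail, size_t nr,
				demo_lsn_t last_sync)
{
	demo_lsn_t tail = last_sync;

	for (size_t i = 0; i < nr; i++) {
		if (ail[i] < tail)
			tail = ail[i];
	}
	return tail;
}
```

With items at 7/20344 and 7/21000 and a last sync of 7/21448, the model yields a tail of 7/20344. Dropping the 7/20344 item without writing it back moves the tail to 7/21000, which matches the `xfs_log_assign_tail_lsn` trace line.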
      
      ....
      
               godown-213341 [008]  6398.027005: xfs_force_shutdown:   dev 259:1 tag logerror flags log_io|force_umount file fs/xfs/xfs_fsops.c line_num 500
                godown-213341 [008]  6398.027022: console:              [ 6397.915406] pmem1: writeback error on inode 12621949, offset 1019904, sector 12968096
                godown-213341 [008]  6398.030551: console:              [ 6397.919546] XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/
      
      And finally the log itself is now shutdown, stopping all further
      writes to the log. But this is too late to prevent the corruption
      that moving the tail of the log forwards after we start cancelling
      writeback causes.
      
      The fundamental problem here is that we are using the wrong shutdown
      checks for log items. We've long conflated mount shutdown with log
      shutdown state, and I started separating that recently with the
      atomic shutdown state changes in commit b36d4651 ("xfs: make
      forced shutdown processing atomic"). The changes in that commit
      series are directly responsible for being able to diagnose this
      issue because it clearly separated mount shutdown from log shutdown.
      
      Essentially, once we start cancelling writeback of log items and
      removing them from the AIL because the filesystem is shut down, we
      *cannot* update the journal because we may have cancelled the items
      that pin the tail of the log. That moves the tail of the log
      forwards without having written the metadata back, hence we have
      corrupt in-memory state and writing to the journal propagates that
      to the on-disk state.
      
      What commit b36d4651 makes clear is that log item state needs to
      change relative to log shutdown, not mount shutdown. IOWs, anything
      that aborts metadata writeback needs to check log shutdown state
      because log items directly affect log consistency. Having them check
      mount shutdown state introduces the above race condition where we
      cancel metadata writeback before the log shuts down.
      
      To fix this, this patch works through all log items and converts
      shutdown checks to use xlog_is_shutdown() rather than
      xfs_is_shutdown(), so that we don't start aborting metadata
      writeback before we shut off journal writes.
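The difference between the two checks can be sketched with a toy model in which the mount and the log each carry their own shutdown flag. All names below are hypothetical stand-ins. The real `xfs_is_shutdown()` and `xlog_is_shutdown()` operate on kernel structures that are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the log and mount structures. */
struct demo_log {
	bool	shutdown;
};

struct demo_mount {
	bool		shutdown;
	struct demo_log	*log;
};

/* Old behaviour: log items keyed their abort decision off the mount. */
static bool demo_abort_writeback_old(const struct demo_mount *mp)
{
	return mp->shutdown;
}

/* New behaviour: abort only once journal writes have been shut off. */
static bool demo_abort_writeback_new(const struct demo_mount *mp)
{
	assert(mp->log != NULL);
	return mp->log->shutdown;
}
```

In the window where the mount is marked shut down but the log is still live, the old check aborts writeback while the journal can still move its tail. The new check keeps writing metadata back until the log itself has stopped.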
      
      AFAICT, this race condition is a zero day IO error handling bug in
      XFS that dates back to the introduction of XLOG_IO_ERROR,
      XLOG_STATE_IOERROR and XFS_FORCED_SHUTDOWN back in January 1997.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      01728b44
  7. 23 Oct, 2021 2 commits
  8. 20 Aug, 2021 1 commit
    • D
      xfs: replace xfs_sb_version checks with feature flag checks · 38c26bfd
      Dave Chinner committed
      Convert the xfs_sb_version_hasfoo() to checks against
      mp->m_features. Checks of the superblock itself during disk
      operations (e.g. in the read/write verifiers and the to/from disk
      formatters) are not converted - they operate purely on the
      superblock state. Everything else should use the mount features.
      
      Large parts of this conversion were done with sed with commands like
      this:
      
      for f in `git grep -l xfs_sb_version_has fs/xfs/*.c`; do
      	sed -i -e 's/xfs_sb_version_has\(.*\)(&\(.*\)->m_sb)/xfs_has_\1(\2)/' $f
      done
      
      With manual cleanups for things like "xfs_has_extflgbit" and other
      little inconsistencies in naming.
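To make the shape of the conversion concrete, here is a minimal sketch. It assumes the feature state is decoded from the superblock once at mount time into a bitmask, which callers then test directly. The `demo_` names, the flag values and the superblock field are hypothetical and do not reproduce the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical on-disk feature bit and in-memory feature flag. */
#define DEMO_SB_FEAT_REFLINK	(1u << 2)
#define DEMO_FEAT_REFLINK	(1ULL << 0)

struct demo_sb {
	uint32_t	sb_features;	/* on-disk feature bits */
};

struct demo_mount {
	struct demo_sb	m_sb;
	uint64_t	m_features;	/* decoded once at mount time */
};

/* Old style: callers pass &mp->m_sb and decode the superblock each time. */
static bool demo_sb_version_hasreflink(const struct demo_sb *sbp)
{
	return (sbp->sb_features & DEMO_SB_FEAT_REFLINK) != 0;
}

/* Mount-time translation from superblock bits to mount features. */
static void demo_sb_to_features(struct demo_mount *mp)
{
	mp->m_features = 0;
	if (demo_sb_version_hasreflink(&mp->m_sb))
		mp->m_features |= DEMO_FEAT_REFLINK;
}

/* New style: callers pass the mount and test a single bit. */
static bool demo_has_reflink(const struct demo_mount *mp)
{
	return (mp->m_features & DEMO_FEAT_REFLINK) != 0;
}
```

A call site written as `demo_sb_version_hasreflink(&mp->m_sb)` becomes `demo_has_reflink(mp)`, which is the rewrite the sed expression performs mechanically.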
      
      The result is a lot less typing to check features and an XFS binary
      size reduced by a bit over 3kB:
      
      $ size -t fs/xfs/built-in.a
      	text	   data	    bss	    dec	    hex	filename
      before	1130866  311352     484 1442702  16038e (TOTALS)
      after	1127727  311352     484 1439563  15f74b (TOTALS)
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      38c26bfd
  9. 22 Jun, 2021 2 commits
    • D
      xfs: xfs_log_force_lsn isn't passed a LSN · 5f9b4b0d
      Dave Chinner committed
      In doing an investigation into AIL push stalls, I was looking at the
      log force code to see if an async CIL push could be done instead.
      This led me to xfs_log_force_lsn() and looking at how it works.
      
      xfs_log_force_lsn() is only called from inode synchronisation
      contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn
      value as the LSN to sync the log to. This gets passed to
      xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the
      journal, and then used by xfs_log_force_lsn() to flush the iclogs to
      the journal.
      
      The problem is that ip->i_itemp->ili_last_lsn does not store a
      log sequence number. What it stores is passed to it from the
      ->iop_committing method, which is called by xfs_log_commit_cil().
      The value this passes to the iop_committing method is the CIL
      context sequence number that the item was committed to.
      
      As it turns out, xlog_cil_force_lsn() converts the sequence to an
      actual commit LSN for the related context and returns that to
      xfs_log_force_lsn(). xfs_log_force_lsn() overwrites its "lsn"
      variable that contained a sequence with an actual LSN and then uses
      that to sync the iclogs.
      
      This caused me some confusion for a while, even though I originally
      wrote all this code a decade ago. ->iop_committing is only used by
      a couple of log item types, and only inode items use the sequence
      number it is passed.
      
      Let's clean up the API, CIL structures and inode log item to call it
      a sequence number, and make it clear that the high level code is
      using CIL sequence numbers and not on-disk LSNs for integrity
      synchronisation purposes.
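One way to picture the distinction is to give sequence numbers and LSNs separate types, so that passing one where the other is expected fails to compile. The sketch below is illustrative only. The wrapper structs, the context table and `demo_cil_force_seq()` are hypothetical and do not reproduce the kernel API or its data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Distinct wrapper types: the compiler rejects mixing them up. */
typedef struct { uint64_t val; } demo_lsn_t;	/* journal position */
typedef struct { uint64_t val; } demo_csn_t;	/* CIL context sequence */

/* Hypothetical record of a CIL context and where it was committed. */
struct demo_cil_ctx {
	demo_csn_t	sequence;
	demo_lsn_t	commit_lsn;
};

/*
 * Translate a CIL sequence number into the commit LSN of its context.
 * Returns an LSN of 0 when the sequence is unknown.
 */
static demo_lsn_t demo_cil_force_seq(const struct demo_cil_ctx *ctxs,
				     size_t nr, demo_csn_t seq)
{
	demo_lsn_t none = { 0 };

	for (size_t i = 0; i < nr; i++) {
		if (ctxs[i].sequence.val == seq.val)
			return ctxs[i].commit_lsn;
	}
	return none;
}
```

The caller holds a sequence number until this translation step and only afterwards has a real LSN with which to sync the iclogs.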
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      5f9b4b0d
    • D
      xfs: Fix CIL throttle hang when CIL space used going backwards · 19f4e7cc
      Dave Chinner committed
      A hang with tasks stuck on the CIL hard throttle was reported and
      largely diagnosed by Donald Buczek, who discovered that it was a
      result of the CIL context space usage decrementing in committed
      transactions once the hard throttle limit had been hit and processes
      were already blocked.  This resulted in the CIL push not waking up
      those waiters because the CIL context was no longer over the hard
      throttle limit.
      
      The surprising aspect of this was the CIL space usage going
      backwards regularly enough to trigger this situation. Assumptions
      had been made in design that the relogging process would only
      increase the size of the objects in the CIL, and so that space would
      only increase.
      
      This change and commit message fixes the issue and documents the
      result of an audit of the triggers that can cause the CIL space to
      go backwards, how large the backwards steps tend to be, the
      frequency in which they occur, and what the impact on the CIL
      accounting code is.
      
      Even though the CIL ctx->space_used can go backwards, it will only
      do so if the log item is already logged to the CIL and contains a
      space reservation for its entire logged state. This is tracked by
      the shadow buffer state on the log item. If the item is not
      previously logged in the CIL it has no shadow buffer nor log vector,
      and hence the entire size of the logged item copied to the log
      vector is accounted to the CIL space usage. i.e.  it will always go
      up in this case.
      
      If the item has a log vector (i.e. already in the CIL) and the size
      decreases, then the existing log vector will be overwritten and the
      space usage will go down. This is the only condition where the space
      usage reduces, and it can only occur when an item is already tracked
      in the CIL. Hence we are safe from CIL space usage underruns as a
      result of log items decreasing in size when they are relogged.
      
      Typically this reduction in CIL usage occurs from metadata blocks
      being freed, such as when a btree block merge occurs or a directory
      entry/xattr entry is removed and the da-tree is reduced in size.
      This generally results in a reduction in size of around a single
      block in the CIL, but also tends to increase the number of log
      vectors because the parent and sibling nodes in the tree need to be
      updated when a btree block is removed. If a multi-level merge
      occurs, then we see reduction in size of 2+ blocks, but again the
      log vector count goes up.
      
      The other vector is inode fork size changes, which only log the
      current size of the fork and ignore the previously logged size when
      the fork is relogged. Hence if we are removing items from the inode
      fork (dir/xattr removal in shortform, extent record removal in
      extent form, etc) the relogged size of the inode fork can decrease.
      
      No other log items can decrease in size either because they are a
      fixed size (e.g. dquots) or they cannot be relogged (e.g. relogging
      an intent actually creates a new intent log item and doesn't relog
      the old item at all.) Hence the only two vectors for CIL context
      size reduction are relogging inode forks and marking buffers active
      in the CIL as stale.
      
      Long story short: the majority of the code does the right thing and
      handles the reduction in log item size correctly, and only the CIL
      hard throttle implementation is problematic and needs fixing. This
      patch makes that fix, and adds comments to the log item code paths
      that can shrink items when they are relogged, as a clear reminder
      that this can and does happen frequently.
      
      The throttle fix is based upon the change Donald proposed, though it
      goes further to ensure that once the throttle is activated, it
      captures all tasks until the CIL push issues a wakeup, regardless of
      whether the CIL space used has gone back under the throttle
      threshold.
      
      This ensures that we prevent tasks reducing the CIL slightly under
      the throttle threshold and then making more changes that push it
      well over the throttle limit. This is achieved by checking if the
      throttle wait queue is already active as a condition of throttling.
      Hence once we start throttling, we continue to apply the throttle
      until the CIL context push wakes everything on the wait queue.
      
      We can use waitqueue_active() for the waitqueue manipulations and
      checks as they are all done under the ctx->xc_push_lock. Hence the
      waitqueue has external serialisation and we can safely peek inside
      the wait queue without holding the internal waitqueue locks.
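The sticky throttle condition can be sketched as follows. This is a simplified single-threaded model in which a waiter count stands in for `waitqueue_active()`, and the caller is assumed to hold the push lock. The structure, the function names and the `>=` comparison are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the CIL accounting relevant to throttling. */
struct demo_cil {
	int	space_used;	/* may go down when items are relogged */
	int	hard_limit;	/* hard throttle threshold */
	int	waiters;	/* nonzero models waitqueue_active() */
};

/* Throttle if over the limit OR if earlier tasks are already waiting. */
static bool demo_cil_should_throttle(const struct demo_cil *cil)
{
	return cil->space_used >= cil->hard_limit || cil->waiters > 0;
}

/* Account a commit; returns true if the committing task must wait. */
static bool demo_cil_commit(struct demo_cil *cil, int delta)
{
	cil->space_used += delta;
	if (demo_cil_should_throttle(cil)) {
		cil->waiters++;
		return true;
	}
	return false;
}

/* The push starts a fresh context and wakes every waiter. */
static void demo_cil_push_wakeup(struct demo_cil *cil)
{
	cil->space_used = 0;
	cil->waiters = 0;
}
```

Because the waiter check is part of the condition, a commit that shrinks `space_used` back under the limit is still captured, and the throttle only releases when the push issues its wakeup.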
      
      Many thanks to Donald for his diagnostic and analysis work to
      isolate the cause of this hang.
      Reported-and-tested-by: Donald Buczek <buczek@molgen.mpg.de>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      19f4e7cc
  10. 23 Apr, 2021 2 commits
  11. 08 Apr, 2021 11 commits
  12. 16 Sep, 2020 3 commits
  13. 07 Sep, 2020 1 commit
    • D
      xfs: xfs_iflock is no longer a completion · 718ecc50
      Dave Chinner committed
      With the recent rework of the inode cluster flushing, we no longer
      ever wait on the inode flush "lock". It was never a lock in the
      first place, just a completion to allow callers to wait for inode IO
      to complete. We now never wait for flush completion as all inode
      flushing is non-blocking. Hence we can get rid of all the iflock
      infrastructure and instead just set and check a state flag.
      
      Rename the XFS_IFLOCK flag to XFS_IFLUSHING, convert all the
      xfs_iflock_nowait() test-and-set operations on that flag, and
      replace all the xfs_ifunlock() calls to clear operations.
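A state flag with test-and-set semantics can be sketched in portable C. Note that this sketch swaps in C11 atomics so that it is self-contained. It makes no claim about how the kernel serialises its inode flags, and every `demo_` name is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_IFLUSHING	(1u << 0)	/* hypothetical "flush in progress" bit */

struct demo_inode {
	atomic_uint	i_flags;
};

/* Test-and-set: returns true only for the caller that set the flag. */
static bool demo_iflush_start(struct demo_inode *ip)
{
	unsigned int old = atomic_fetch_or(&ip->i_flags, DEMO_IFLUSHING);

	return (old & DEMO_IFLUSHING) == 0;
}

/* Clear the flag once the flush has completed or been aborted. */
static void demo_iflush_done(struct demo_inode *ip)
{
	atomic_fetch_and(&ip->i_flags, ~DEMO_IFLUSHING);
}

/* Nobody ever waits on the flag; callers only check it. */
static bool demo_is_flushing(struct demo_inode *ip)
{
	return (atomic_load(&ip->i_flags) & DEMO_IFLUSHING) != 0;
}
```

Unlike a completion, there is nothing to sleep on here: a caller that loses the test-and-set simply skips the inode and moves on.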
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      718ecc50
  14. 05 Aug, 2020 1 commit
  15. 29 Jul, 2020 1 commit
  16. 07 Jul, 2020 9 commits
    • D
      xfs: factor xfs_iflush_done · a69a1dc2
      Dave Chinner committed
      xfs_iflush_done() does 3 distinct operations to the inodes attached
      to the buffer. Separate these operations out into functions so that
      it is easier to modify these operations independently in future.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      a69a1dc2
    • D
      xfs: rework xfs_iflush_cluster() dirty inode iteration · 5717ea4d
      Dave Chinner authored
      Now that we have all the dirty inodes attached to the cluster
      buffer, we don't actually have to do radix tree lookups to find
      them. Sure, the radix tree is efficient, but walking a linked list
      of just the dirty inodes attached to the buffer is much better.
      
      We are also no longer dependent on having a locked inode passed into
      the function to determine where to start the lookup. This means we
      can drop it from the function call and treat all inodes the same.
      
      We also make xfs_iflush_cluster() skip inodes marked with
      XFS_IRECLAIM. This way we avoid races with inodes that reclaim is
      actively referencing or that are being re-initialised by inode lookup. If
      they are actually dirty, they'll get written by a future cluster
      flush....
      
      We also add a shutdown check after obtaining the flush lock so that
      we catch inodes that are dirty in memory and may have inconsistent
      state due to the shutdown in progress. We abort these inodes
      directly and so they remove themselves directly from the buffer list
      and the AIL rather than having to wait for the buffer to be failed
      and callbacks run to be processed correctly.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      5717ea4d
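
The list walk described in the commit above can be sketched as follows. This is a simplified userspace illustration under stated assumptions: `demo_inode`, `demo_cluster_buf` and the `DEMO_*` flags are hypothetical, and all locking and the shutdown check are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits standing in for XFS_IRECLAIM and XFS_IFLUSHING. */
#define DEMO_IRECLAIM  (1u << 0)
#define DEMO_IFLUSHING (1u << 1)

struct demo_inode {
	unsigned int flags;
	bool dirty;
	struct demo_inode *next;	/* next inode attached to the same buffer */
};

struct demo_cluster_buf {
	struct demo_inode *dirty_list;	/* inodes attached when they were dirtied */
};

/*
 * Walk only the inodes attached to the buffer, with no tree lookups and
 * no special "starting" inode. Inodes under reclaim or already being
 * flushed are skipped. Returns the number of inodes newly marked as
 * flushing.
 */
static int demo_flush_cluster(struct demo_cluster_buf *bp)
{
	int flushed = 0;

	assert(bp != NULL);
	for (struct demo_inode *ip = bp->dirty_list; ip != NULL; ip = ip->next) {
		if (ip->flags & (DEMO_IRECLAIM | DEMO_IFLUSHING))
			continue;	/* left for reclaim or a later cluster flush */
		if (!ip->dirty)
			continue;
		ip->flags |= DEMO_IFLUSHING;
		flushed++;
	}
	return flushed;
}
```

Every inode on the list is treated the same way, so the function needs only the buffer as an argument.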
    • D
      xfs: xfs_iflush() is no longer necessary · 90c60e16
      Dave Chinner authored
      Now we have a cached buffer on inode log items, we don't need
      to do buffer lookups when flushing inodes anymore - all we need
      to do is lock the buffer and we are ready to go.
      
      This largely gets rid of the need for xfs_iflush(), which is
      essentially just a mechanism to look up the buffer and flush the
      inode to it. Instead, we can just call xfs_iflush_cluster() with a
      few modifications to ensure it also flushes the inode we already
      hold locked.
      
      This allows the AIL inode item pushing to be almost entirely
      non-blocking in XFS - we won't block unless memory allocation
      for the cluster inode lookup blocks or the block device queues are
      full.
      
      Writeback during inode reclaim becomes a little more complex because
      we now have to lock the buffer ourselves, but otherwise this change
      is largely a functional no-op that removes a whole lot of code.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      90c60e16
    • D
      xfs: attach inodes to the cluster buffer when dirtied · 48d55e2a
      Dave Chinner authored
      Rather than attach inodes to the cluster buffer just when we are
      doing IO, attach the inodes to the cluster buffer when they are
      dirtied. This means the buffer always carries a list of dirty inodes
      that reference it, and we can use that list to make more fundamental
      changes to inode writeback that aren't otherwise possible.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      48d55e2a
    • D
      xfs: pin inode backing buffer to the inode log item · 298f7bec
      Dave Chinner authored
      When we dirty an inode, we are going to have to write it to disk at
      some point in the near future. This requires the inode cluster
      backing buffer to be present in memory. Unfortunately, under severe
      memory pressure we can reclaim the inode backing buffer while the
      inode is dirty in memory, resulting in stalling the AIL pushing
      because it has to do a read-modify-write cycle on the cluster
      buffer.
      
      When we have no memory available, the read of the cluster buffer
      blocks the AIL pushing process, and this causes all sorts of issues
      for memory reclaim as it requires inode writeback to make forwards
      progress. Allocating a cluster buffer causes more memory pressure,
      and results in more cluster buffers to be reclaimed, resulting in
      more RMW cycles to be done in the AIL context and everything then
      backs up on AIL progress. Only the synchronous inode cluster
      writeback in the inode reclaim code provides some level of
      forwards progress guarantees that prevent OOM-killer rampages in
      this situation.
      
      Fix this by pinning the inode backing buffer to the inode log item
      when the inode is first dirtied (i.e. in xfs_trans_log_inode()).
      This may mean the first modification of an inode that has been held
      in cache for a long time may block on a cluster buffer read, but
      we can do that in transaction context and block safely until the
      buffer has been allocated and read.
      
      Once we have the cluster buffer, the inode log item takes a
      reference to it, pinning it in memory, and attaches it to the log
      item for future reference. This means we can always grab the cluster
      buffer from the inode log item when we need it.
      
      When the inode is finally cleaned and removed from the AIL, we can
      drop the reference the inode log item holds on the cluster buffer.
      Once all inodes on the cluster buffer are clean, the cluster buffer
      will be unpinned and it will be available for memory reclaim to
      reclaim again.
      
      This avoids the issues with needing to do RMW cycles in the AIL
      pushing context, and hence allows complete non-blocking inode
      flushing to be performed by the AIL pushing context.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      298f7bec
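
The pinning lifecycle described in the commit above amounts to reference counting. Here is a minimal userspace sketch, assuming a plain integer count with no concurrency; `demo_buf`, `demo_log_item` and the helper names are hypothetical and do not mirror the kernel's buffer API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cluster buffer: reclaimable only with no references. */
struct demo_buf {
	int refcount;
};

/* Hypothetical inode log item: caches the buffer while the inode is dirty. */
struct demo_log_item {
	struct demo_buf *bp;	/* NULL while the inode is clean */
};

/* First dirtying takes a reference; later dirtying reuses the cached buffer. */
static void demo_item_dirty(struct demo_log_item *lip, struct demo_buf *bp)
{
	if (lip->bp != NULL)
		return;		/* already pinned by an earlier modification */
	bp->refcount++;
	lip->bp = bp;
}

/* Called once the inode is clean and off the AIL: drop the reference. */
static void demo_item_clean(struct demo_log_item *lip)
{
	if (lip->bp == NULL)
		return;
	assert(lip->bp->refcount > 0);
	lip->bp->refcount--;
	lip->bp = NULL;
}

/* Memory reclaim may free the buffer only when no dirty inode pins it. */
static bool demo_buf_reclaimable(const struct demo_buf *bp)
{
	return bp->refcount == 0;
}
```

Because the reference is taken when the inode is first dirtied, the flush path can always reach the buffer through the log item and never has to read it back in.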
    • D
      xfs: move xfs_clear_li_failed out of xfs_ail_delete_one() · e98084b8
      Dave Chinner authored
      xfs_ail_delete_one() is called directly from dquot and inode IO
      completion, as well as from the generic xfs_trans_ail_delete()
      function. Inodes are about to have their own failure handling, and
      dquots will in future, too. Pull the clearing of the LI_FAILED flag
      up into the callers so we can customise the code appropriately.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      e98084b8
    • D
      xfs: unwind log item error flagging · 3536b61e
      Dave Chinner authored
      When a buffer IO error occurs, we want to mark all
      the log items attached to the buffer as failed. Open code
      the error handling loop so that we can modify the flagging for the
      different types of objects directly and independently of each other.
      
      This also allows us to remove the ->iop_error method from the log
      item operations.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      3536b61e
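
An open-coded error loop of the kind described in the commit above might look like the sketch below. The item types, the `DEMO_LI_FAILED` flag and the function name are hypothetical; the point is only that a per-type `switch` replaces a per-item method call, so each branch can later diverge independently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the LI_FAILED state flag. */
#define DEMO_LI_FAILED (1u << 0)

enum demo_item_type { DEMO_LI_INODE, DEMO_LI_DQUOT };

struct demo_log_item {
	enum demo_item_type type;
	unsigned int flags;
	struct demo_log_item *next;	/* next item attached to the buffer */
};

/*
 * Mark every log item attached to a failed buffer. Returns the number
 * of items flagged. Each case is written out separately so that
 * type-specific handling can be added without touching the others.
 */
static int demo_buf_io_error(struct demo_log_item *head)
{
	int failed = 0;

	for (struct demo_log_item *lip = head; lip != NULL; lip = lip->next) {
		switch (lip->type) {
		case DEMO_LI_INODE:
			/* inode-specific failure handling would go here */
			lip->flags |= DEMO_LI_FAILED;
			failed++;
			break;
		case DEMO_LI_DQUOT:
			/* dquot-specific failure handling would go here */
			lip->flags |= DEMO_LI_FAILED;
			failed++;
			break;
		default:
			assert(0 && "unknown log item type");
		}
	}
	return failed;
}
```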
    • D
      xfs: make inode IO completion buffer centric · aac855ab
      Dave Chinner authored
      Having different io completion callbacks for different inode states
      makes things complex. We can detect if the inode is stale via the
      XFS_ISTALE flag in IO completion, so we don't need a special
      callback just for this.
      
      This means inodes only have a single iodone callback, and inode IO
      completion is entirely buffer centric at this point. Hence we no
      longer need to use a log item callback at all as we can just call
      xfs_iflush_done() directly from the buffer completions and walk the
      buffer log item list to complete all the inodes under IO.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      aac855ab
    • D
      xfs: add an inode item lock · 1319ebef
      Dave Chinner authored
      The inode log item is kind of special in that it can be aggregating
      new changes in memory at the same time that existing changes are
      being written back to disk. This means there are fields in the log
      item that are accessed concurrently from contexts that don't share
      any locking at all.
      
      e.g. ili_last_fields is updated under the ILOCK_EXCL and flush lock
      at flush time and under the flush lock at IO completion time, and
      is read under the ILOCK_EXCL when the inode is
      logged.  Hence there is no actual serialisation between reading the
      field during logging of the inode in transactions vs clearing the
      field in IO completion.
      
      We currently get away with this by the fact that we are only
      clearing fields in IO completion, and nothing bad happens if we
      accidentally log more of the inode than we actually modify. Worst
      case is we consume a tiny bit more memory and log bandwidth.
      
      However, if we want to do more complex state manipulations on the
      log item that require updates at all three of these potential
      locations, we need to have some mechanism of serialising those
      operations. To do this, introduce a spinlock into the log item to
      serialise internal state.
      
      This could be done via the xfs_inode i_flags_lock, but this then
      leads to potential lock inversion issues where inode flag updates
      need to occur inside locks that best nest inside the inode log item
      locks (e.g. marking inodes stale during inode cluster freeing).
      Using a separate spinlock avoids these sorts of problems and
      simplifies future code.
      
      This does not touch the use of ili_fields in the item formatting
      code - that is entirely protected by the ILOCK_EXCL at this point in
      time, so it remains untouched.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      1319ebef
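
The idea of a dedicated lock inside the log item, serialising the logging, flush and IO completion paths, can be sketched in userspace C. This is an illustration under assumptions, not the kernel implementation: a C11 `atomic_flag` spin loop stands in for a kernel spinlock, and `demo_inode_item` with its `fields` and `last_fields` members is only loosely modelled on the ili_fields and ili_last_fields names mentioned above.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical log item with its own lock for internal state. */
struct demo_inode_item {
	atomic_flag lock;		/* stand-in for the item spinlock */
	unsigned int fields;		/* dirtied since the last flush */
	unsigned int last_fields;	/* currently being written back */
};

#define DEMO_ITEM_INIT { ATOMIC_FLAG_INIT, 0, 0 }

static void demo_item_lock(struct demo_inode_item *iip)
{
	while (atomic_flag_test_and_set_explicit(&iip->lock, memory_order_acquire))
		;	/* spin until the holder releases the lock */
}

static void demo_item_unlock(struct demo_inode_item *iip)
{
	atomic_flag_clear_explicit(&iip->lock, memory_order_release);
}

/* Logging path: record newly dirtied fields. */
static void demo_item_log(struct demo_inode_item *iip, unsigned int fields)
{
	demo_item_lock(iip);
	iip->fields |= fields;
	demo_item_unlock(iip);
}

/* Flush path: hand the dirty fields over to writeback. */
static void demo_item_flush(struct demo_inode_item *iip)
{
	demo_item_lock(iip);
	iip->last_fields = iip->fields;
	iip->fields = 0;
	demo_item_unlock(iip);
}

/* IO completion path: writeback finished, nothing is in flight. */
static void demo_item_io_done(struct demo_inode_item *iip)
{
	demo_item_lock(iip);
	iip->last_fields = 0;
	demo_item_unlock(iip);
}

/* Consistent snapshot of both fields, taken under the same lock. */
static unsigned int demo_item_all_fields(struct demo_inode_item *iip)
{
	unsigned int all;

	demo_item_lock(iip);
	all = iip->fields | iip->last_fields;
	demo_item_unlock(iip);
	return all;
}
```

Since all three paths take the same item lock, a reader can no longer observe `last_fields` halfway through a flush or completion update, which is the serialisation the commit message says was previously missing.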