1. 04 Jan, 2023 1 commit
    • xfs: Fix deadlock on xfs_inodegc_worker · 4da11251
      Committed by Wu Guanghao
      While testing the deletion of a large number of files under low
      memory, we found a deadlock.
      
      [ 1240.279183] -> #1 (fs_reclaim){+.+.}-{0:0}:
      [ 1240.280450]        lock_acquire+0x197/0x460
      [ 1240.281548]        fs_reclaim_acquire.part.0+0x20/0x30
      [ 1240.282625]        kmem_cache_alloc+0x2b/0x940
      [ 1240.283816]        xfs_trans_alloc+0x8a/0x8b0
      [ 1240.284757]        xfs_inactive_ifree+0xe4/0x4e0
      [ 1240.285935]        xfs_inactive+0x4e9/0x8a0
      [ 1240.286836]        xfs_inodegc_worker+0x160/0x5e0
      [ 1240.287969]        process_one_work+0xa19/0x16b0
      [ 1240.289030]        worker_thread+0x9e/0x1050
      [ 1240.290131]        kthread+0x34f/0x460
      [ 1240.290999]        ret_from_fork+0x22/0x30
      [ 1240.291905]
      [ 1240.291905] -> #0 ((work_completion)(&gc->work)){+.+.}-{0:0}:
      [ 1240.293569]        check_prev_add+0x160/0x2490
      [ 1240.294473]        __lock_acquire+0x2c4d/0x5160
      [ 1240.295544]        lock_acquire+0x197/0x460
      [ 1240.296403]        __flush_work+0x6bc/0xa20
      [ 1240.297522]        xfs_inode_mark_reclaimable+0x6f0/0xdc0
      [ 1240.298649]        destroy_inode+0xc6/0x1b0
      [ 1240.299677]        dispose_list+0xe1/0x1d0
      [ 1240.300567]        prune_icache_sb+0xec/0x150
      [ 1240.301794]        super_cache_scan+0x2c9/0x480
      [ 1240.302776]        do_shrink_slab+0x3f0/0xaa0
      [ 1240.303671]        shrink_slab+0x170/0x660
      [ 1240.304601]        shrink_node+0x7f7/0x1df0
      [ 1240.305515]        balance_pgdat+0x766/0xf50
      [ 1240.306657]        kswapd+0x5bd/0xd20
      [ 1240.307551]        kthread+0x34f/0x460
      [ 1240.308346]        ret_from_fork+0x22/0x30
      [ 1240.309247]
      [ 1240.309247] other info that might help us debug this:
      [ 1240.309247]
      [ 1240.310944]  Possible unsafe locking scenario:
      [ 1240.310944]
      [ 1240.312379]        CPU0                    CPU1
      [ 1240.313363]        ----                    ----
      [ 1240.314433]   lock(fs_reclaim);
      [ 1240.315107]                                lock((work_completion)(&gc->work));
      [ 1240.316828]                                lock(fs_reclaim);
      [ 1240.318088]   lock((work_completion)(&gc->work));
      [ 1240.319203]
      [ 1240.319203]  *** DEADLOCK ***
      ...
      [ 2438.431081] Workqueue: xfs-inodegc/sda xfs_inodegc_worker
      [ 2438.432089] Call Trace:
      [ 2438.432562]  __schedule+0xa94/0x1d20
      [ 2438.435787]  schedule+0xbf/0x270
      [ 2438.436397]  schedule_timeout+0x6f8/0x8b0
      [ 2438.445126]  wait_for_completion+0x163/0x260
      [ 2438.448610]  __flush_work+0x4c4/0xa40
      [ 2438.455011]  xfs_inode_mark_reclaimable+0x6ef/0xda0
      [ 2438.456695]  destroy_inode+0xc6/0x1b0
      [ 2438.457375]  dispose_list+0xe1/0x1d0
      [ 2438.458834]  prune_icache_sb+0xe8/0x150
      [ 2438.461181]  super_cache_scan+0x2b3/0x470
      [ 2438.461950]  do_shrink_slab+0x3cf/0xa50
      [ 2438.462687]  shrink_slab+0x17d/0x660
      [ 2438.466392]  shrink_node+0x87e/0x1d40
      [ 2438.467894]  do_try_to_free_pages+0x364/0x1300
      [ 2438.471188]  try_to_free_pages+0x26c/0x5b0
      [ 2438.473567]  __alloc_pages_slowpath.constprop.136+0x7aa/0x2100
      [ 2438.482577]  __alloc_pages+0x5db/0x710
      [ 2438.485231]  alloc_pages+0x100/0x200
      [ 2438.485923]  allocate_slab+0x2c0/0x380
      [ 2438.486623]  ___slab_alloc+0x41f/0x690
      [ 2438.490254]  __slab_alloc+0x54/0x70
      [ 2438.491692]  kmem_cache_alloc+0x23e/0x270
      [ 2438.492437]  xfs_trans_alloc+0x88/0x880
      [ 2438.493168]  xfs_inactive_ifree+0xe2/0x4e0
      [ 2438.496419]  xfs_inactive+0x4eb/0x8b0
      [ 2438.497123]  xfs_inodegc_worker+0x16b/0x5e0
      [ 2438.497918]  process_one_work+0xbf7/0x1a20
      [ 2438.500316]  worker_thread+0x8c/0x1060
      [ 2438.504938]  ret_from_fork+0x22/0x30
      
      When memory is low, a memory allocation in xfs_inodegc_worker can
      trigger memory reclaim, and reclaim may then call flush_work() to wait
      for that same work item to complete. This causes a deadlock.
      
      So use memalloc_nofs_save() to avoid triggering memory reclamation in
      xfs_inodegc_worker.
      Signed-off-by: Wu Guanghao <wuguanghao3@huawei.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
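
The fix above relies on a scoped allocation context: between memalloc_nofs_save() and memalloc_nofs_restore(), allocations made by the task must not recurse into filesystem reclaim. The following is a userspace model of that save/restore pattern, not kernel code. Every `model_` name is a hypothetical stand-in introduced for illustration, and the single flag bit only imitates the per-task state the kernel keeps.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only: these names are hypothetical, not kernel APIs. */
#define MODEL_PF_MEMALLOC_NOFS 0x1u

static unsigned int model_task_flags;   /* stands in for the per-task flags */

/* Set the NOFS bit and return the previous state of that bit. */
static unsigned int model_nofs_save(void)
{
    unsigned int old = model_task_flags & MODEL_PF_MEMALLOC_NOFS;

    model_task_flags |= MODEL_PF_MEMALLOC_NOFS;
    return old;
}

/* Restore the NOFS bit to the value model_nofs_save() returned. */
static void model_nofs_restore(unsigned int old)
{
    model_task_flags = (model_task_flags & ~MODEL_PF_MEMALLOC_NOFS) | old;
}

/* An allocation may enter filesystem reclaim only outside a NOFS scope. */
static bool model_alloc_may_enter_fs_reclaim(void)
{
    return !(model_task_flags & MODEL_PF_MEMALLOC_NOFS);
}

/* Shape of the worker after the fix: its body runs inside a NOFS scope. */
static bool model_inodegc_worker(void)
{
    unsigned int nofs_flag = model_nofs_save();
    bool could_reclaim = model_alloc_may_enter_fs_reclaim();

    model_nofs_restore(nofs_flag);
    return could_reclaim;
}
```

Because the restore step puts back the saved bit instead of clearing it unconditionally, the pattern nests safely when a caller is already inside a NOFS scope.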
  2. 22 Nov, 2022 1 commit
    • xfs: fix incorrect i_nlink caused by inode racing · 28b4b059
      Committed by Long Li
      The following error occurred during the fsstress test:
      
      XFS: Assertion failed: VFS_I(ip)->i_nlink >= 2, file: fs/xfs/xfs_inode.c, line: 2452
      
      The problem is that an inode race can cause an incorrect i_nlink to be
      written to disk and later read back into memory. Consider the following
      call graphs: for inodes that are marked as both XFS_IFLUSHING and
      XFS_IRECLAIMABLE, i_nlink is reset to 1 and then restored to its original
      value in xfs_reinit_inode(). If the flush runs inside that window, the
      i_nlink of a directory on disk may be set to 1.
      
        xfsaild
            xfs_inode_item_push
                xfs_iflush_cluster
                    xfs_iflush
                        xfs_inode_to_disk
      
        xfs_iget
            xfs_iget_cache_hit
                xfs_iget_recycle
                    xfs_reinit_inode
                        inode_init_always
      
      xfs_reinit_inode() needs to hold the ILOCK_EXCL as it is changing internal
      inode state and can race with other RCU protected inode lookups. On the
      read side, xfs_iflush_cluster() grabs the ILOCK_SHARED while under rcu +
      ip->i_flags_lock, and so xfs_iflush/xfs_inode_to_disk() are protected from
      racing inode updates (during transactions) by that lock.
      
      Fixes: ff7bebeb ("xfs: refactor the inode recycling code") # goes further back than this
      Signed-off-by: Long Li <leo.lilong@huawei.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  3. 12 Oct, 2022 1 commit
  4. 21 Jul, 2022 1 commit
    • xfs: don't leak memory when attr fork loading fails · c78c2d09
      Committed by Darrick J. Wong
      I observed the following evidence of a memory leak while running xfs/399
      from the xfs fsck test suite (edited for brevity):
      
      XFS (sde): Metadata corruption detected at xfs_attr_shortform_verify_struct.part.0+0x7b/0xb0 [xfs], inode 0x1172 attr fork
      XFS: Assertion failed: ip->i_af.if_u1.if_data == NULL, file: fs/xfs/libxfs/xfs_inode_fork.c, line: 315
      ------------[ cut here ]------------
      WARNING: CPU: 2 PID: 91635 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs]
      CPU: 2 PID: 91635 Comm: xfs_scrub Tainted: G        W         5.19.0-rc7-xfsx #rc7 6e6475eb29fd9dda3181f81b7ca7ff961d277a40
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
      RIP: 0010:assfail+0x46/0x4a [xfs]
      Call Trace:
       <TASK>
       xfs_ifork_zap_attr+0x7c/0xb0
       xfs_iformat_attr_fork+0x86/0x110
       xfs_inode_from_disk+0x41d/0x480
       xfs_iget+0x389/0xd70
       xfs_bulkstat_one_int+0x5b/0x540
       xfs_bulkstat_iwalk+0x1e/0x30
       xfs_iwalk_ag_recs+0xd1/0x160
       xfs_iwalk_run_callbacks+0xb9/0x180
       xfs_iwalk_ag+0x1d8/0x2e0
       xfs_iwalk+0x141/0x220
       xfs_bulkstat+0x105/0x180
       xfs_ioc_bulkstat.constprop.0.isra.0+0xc5/0x130
       xfs_file_ioctl+0xa5f/0xef0
       __x64_sys_ioctl+0x82/0xa0
       do_syscall_64+0x2b/0x80
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      This newly-added assertion checks that there aren't any incore data
      structures hanging off the incore fork when we're trying to reset its
      contents.  From the call trace, it is evident that iget was trying to
      construct an incore inode from the ondisk inode, but the attr fork
      verifier failed and we were trying to undo all the memory allocations
      that we had done earlier.
      
      The three assertions in xfs_ifork_zap_attr check that the caller has
      already called xfs_idestroy_fork, which clearly has not been done here.
      As the zap function then zeroes the pointers, we've effectively leaked
      the memory.
      
      The shortest change would have been to insert an extra call to
      xfs_idestroy_fork, but it makes more sense to bundle the _idestroy_fork
      call into _zap_attr, since all other callsites call _idestroy_fork
      immediately prior to calling _zap_attr.  IOWs, it eliminates one way to
      fail.
      
      Note: This change only applies cleanly to 2ed5b09b, since we just
      reworked the attr fork lifetime.  However, I think this memory leak has
      existed since 0f45a1b2, since the chain xfs_iformat_attr_fork ->
      xfs_iformat_local -> xfs_init_local_fork will allocate
      ifp->if_u1.if_data, but if xfs_ifork_verify_local_attr fails,
      xfs_iformat_attr_fork will free i_afp without freeing any of the stuff
      hanging off i_afp.  The solution for older kernels I think is to add the
      missing call to xfs_idestroy_fork just prior to calling kmem_cache_free.
      
      Found by fuzzing a.sfattr.hdr.totsize = lastbit in xfs/399.
      
      Fixes: 2ed5b09b ("xfs: make inode attribute forks a permanent part of struct xfs_inode")
      Probably-Fixes: 0f45a1b2 ("xfs: improve local fork verification")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
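
The leak described above comes from zeroing pointers before the memory behind them has been freed. Here is a minimal userspace sketch of the chosen remedy, bundling the destroy step into the zap step. The structure and the `model_` functions are hypothetical simplifications, not the XFS fork code, and the live-buffer counter exists only so the behaviour can be checked.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for an incore fork; not the XFS structure. */
struct model_fork {
    char   *data;    /* heap buffer hanging off the fork */
    size_t  bytes;
};

static int model_live_buffers;   /* buffers allocated and not yet freed */

/* Load local-format contents into the fork, allocating a buffer for them. */
static void model_fork_load(struct model_fork *f, const char *src, size_t n)
{
    f->data = malloc(n);
    assert(f->data != NULL);
    memcpy(f->data, src, n);
    f->bytes = n;
    model_live_buffers++;
}

/* Free whatever hangs off the fork; safe to call on an already-empty fork. */
static void model_fork_destroy(struct model_fork *f)
{
    if (f->data) {
        free(f->data);
        model_live_buffers--;
    }
    f->data = NULL;
    f->bytes = 0;
}

/* Zap bundles the destroy call, so no caller can zero the pointers first. */
static void model_fork_zap(struct model_fork *f)
{
    model_fork_destroy(f);
    memset(f, 0, sizeof(*f));
}
```

With the destroy call inside the zap function, an error path that calls only the zap function can no longer lose track of the buffer, and calling it twice is harmless.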
  5. 14 Jul, 2022 2 commits
    • xfs: add log item precommit operation · fad743d7
      Committed by Dave Chinner
      For inodes that are dirty, we have an attached cluster buffer that
      we want to use to track the dirty inode through the AIL.
      Unfortunately, locking the cluster buffer and adding it to the
      transaction when the inode is first logged in a transaction leads to
      buffer lock ordering inversions.
      
      The specific problem is ordering against the AGI buffer. When
      modifying unlinked lists, the buffer lock order is AGI -> inode
      cluster buffer as the AGI buffer lock serialises all access to the
      unlinked lists. Unfortunately, functionality like xfs_droplink()
      logs the inode before calling xfs_iunlink(), as do various directory
      manipulation functions. The inode can be logged way down in the
      stack as far as the bmapi routines and hence, without a major
      rewrite of lots of APIs there's no way we can avoid the inode being
      logged by something until after the AGI has been logged.
      
      As we are going to be using ordered buffers for inode AIL tracking,
      there isn't a need to actually lock that buffer against modification
      as all the modifications are captured by logging the inode item
      itself. Hence we don't actually need to join the cluster buffer into
      the transaction until just before it is committed. This means we do
      not perturb any of the existing buffer lock orders in transactions,
      and the inode cluster buffer is always locked last in a transaction
      that doesn't otherwise touch inode cluster buffers.
      
      We do this by introducing a precommit log item method.  This commit
      just introduces the mechanism; the inode item implementation is in
      followup commits.
      
      The precommit items need to be sorted into consistent order as we
      may be locking multiple items here. Hence if we have two dirty
      inodes in cluster buffers A and B, and some other transaction has
      two separate dirty inodes in the same cluster buffers, locking them
      in different orders opens us up to ABBA deadlocks. Hence we sort the
      items on the transaction based on the presence of a sort log item
      method.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
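
The ABBA argument above boils down to this: if every transaction sorts its items by the same key before locking, two transactions touching the same pair of cluster buffers always lock them in the same order. A minimal sketch of that ordering step follows. The item structure, the key, and the `model_` names are assumptions made for illustration and do not reflect the actual log item interface.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical log item: only the sort key matters for this sketch. */
struct model_log_item {
    uint64_t sort_key;   /* e.g. an identifier of the buffer to be locked */
};

/* qsort comparator: ascending by key, avoiding overflow-prone subtraction. */
static int model_item_cmp(const void *a, const void *b)
{
    const struct model_log_item *ia = a;
    const struct model_log_item *ib = b;

    if (ia->sort_key < ib->sort_key)
        return -1;
    return ia->sort_key > ib->sort_key;
}

/* Put the items into the single global order every transaction locks in. */
static void model_sort_precommit(struct model_log_item *items, size_t n)
{
    qsort(items, n, sizeof(items[0]), model_item_cmp);
}
```

Two transactions that dirtied the same buffers in opposite orders end up with identical lock sequences after sorting, which is what rules out the ABBA case.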
    • xfs: double link the unlinked inode list · 2fd26cc0
      Committed by Dave Chinner
      Now that we have forwards traversal via the incore inode in place, we
      need to add back pointers to the incore inode to entirely replace
      the back reference cache. We use the same lookup semantics and
      constraints as for the forwards pointer lookups during unlinks, and
      so we can look up any inode in the unlinked list directly and update
      the list pointers, forwards or backwards, at any time.
      
      The only wrinkle in converting the unlinked list manipulations to
      use in-core previous pointers is that log recovery doesn't have the
      incore inode state built up so it can't just read in an inode and
      release it to finish off the unlink. Hence we need to modify the
      traversal in recovery to read one inode ahead before we
      release the inode at the head of the list. This populates the
      next->prev relationship sufficient to be able to replay the unlinked
      list and hence greatly simplify the runtime code.
      
      This recovery algorithm also requires that we actually remove inodes
      from the unlinked list one at a time as background inode
      inactivation will result in unlinked list removal racing with the
      building of the in-memory unlinked list state. We could serialise
      this by holding the AGI buffer lock when constructing the in memory
      state, but all that does is lockstep background processing with list
      building. It is much simpler to flush the inodegc immediately after
      releasing the inode so that it is unlinked immediately and there are
      no races present at all.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
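
The benefit of the back pointers described above is that any inode can be removed from the list without walking it from the head. The sketch below shows this with a plain in-memory doubly linked list. The structures and `model_` functions are hypothetical; real unlinked lists are keyed by inode number and also involve on-disk updates, which this model leaves out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical incore inode carrying both list pointers; not struct xfs_inode. */
struct model_inode {
    struct model_inode *next;
    struct model_inode *prev;
};

struct model_bucket {
    struct model_inode *head;
};

/* Insert at the head of the bucket (a simplification chosen for this model). */
static void model_unlinked_insert(struct model_bucket *b, struct model_inode *ip)
{
    ip->prev = NULL;
    ip->next = b->head;
    if (b->head)
        b->head->prev = ip;
    b->head = ip;
}

/* Remove any inode directly: the back pointer makes a list walk unnecessary. */
static void model_unlinked_remove(struct model_bucket *b, struct model_inode *ip)
{
    if (ip->prev)
        ip->prev->next = ip->next;
    else
        b->head = ip->next;
    if (ip->next)
        ip->next->prev = ip->prev;
    ip->next = NULL;
    ip->prev = NULL;
}
```

Removal touches only the inode and its two neighbours, so its cost does not depend on the list length.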
  6. 10 Jul, 2022 3 commits
    • xfs: use XFS_IFORK_Q to determine the presence of an xattr fork · e45d7cb2
      Committed by Darrick J. Wong
      Modify xfs_ifork_ptr to return a NULL pointer if the caller asks for the
      attribute fork but i_forkoff is zero.  This eliminates the ambiguity
      between i_forkoff and i_af.if_present, which should make it easier to
      understand the lifetime of attr forks.
      
      While we're at it, remove the if_present checks around calls to
      xfs_idestroy_fork and xfs_ifork_zap_attr since they can both handle attr
      forks that have already been torn down.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
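
The rule described above can be sketched as an accessor that hands out the embedded attr fork only when the fork offset is nonzero. This is a simplified userspace model under stated assumptions: the structure layout, the enum, and the `model_` names are hypothetical and only echo the commit text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model; names only loosely echo the commit text. */
enum model_whichfork { MODEL_DATA_FORK, MODEL_ATTR_FORK };

struct model_fork {
    int nextents;
};

struct model_inode {
    struct model_fork df;       /* data fork, always present */
    struct model_fork af;       /* attr fork, embedded in the inode */
    unsigned char     forkoff;  /* zero means "no attr fork" */
};

/* Return the requested fork, or NULL for an attr fork that does not exist. */
static struct model_fork *model_ifork_ptr(struct model_inode *ip,
                                          enum model_whichfork which)
{
    switch (which) {
    case MODEL_DATA_FORK:
        return &ip->df;
    case MODEL_ATTR_FORK:
        return ip->forkoff ? &ip->af : NULL;
    }
    return NULL;
}
```

Callers then need a single NULL check instead of consulting a separate presence flag alongside the offset.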
    • xfs: make inode attribute forks a permanent part of struct xfs_inode · 2ed5b09b
      Committed by Darrick J. Wong
      Syzkaller reported a UAF bug a while back:
      
      ==================================================================
      BUG: KASAN: use-after-free in xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127
      Read of size 4 at addr ffff88802cec919c by task syz-executor262/2958
      
      CPU: 2 PID: 2958 Comm: syz-executor262 Not tainted
      5.15.0-0.30.3-20220406_1406 #3
      Hardware name: Red Hat KVM, BIOS 1.13.0-2.module+el8.3.0+7860+a7792d29
      04/01/2014
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0x82/0xa9 lib/dump_stack.c:106
       print_address_description.constprop.9+0x21/0x2d5 mm/kasan/report.c:256
       __kasan_report mm/kasan/report.c:442 [inline]
       kasan_report.cold.14+0x7f/0x11b mm/kasan/report.c:459
       xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127
       xfs_attr_get+0x378/0x4c2 fs/xfs/libxfs/xfs_attr.c:159
       xfs_xattr_get+0xe3/0x150 fs/xfs/xfs_xattr.c:36
       __vfs_getxattr+0xdf/0x13d fs/xattr.c:399
       cap_inode_need_killpriv+0x41/0x5d security/commoncap.c:300
       security_inode_need_killpriv+0x4c/0x97 security/security.c:1408
       dentry_needs_remove_privs.part.28+0x21/0x63 fs/inode.c:1912
       dentry_needs_remove_privs+0x80/0x9e fs/inode.c:1908
       do_truncate+0xc3/0x1e0 fs/open.c:56
       handle_truncate fs/namei.c:3084 [inline]
       do_open fs/namei.c:3432 [inline]
       path_openat+0x30ab/0x396d fs/namei.c:3561
       do_filp_open+0x1c4/0x290 fs/namei.c:3588
       do_sys_openat2+0x60d/0x98c fs/open.c:1212
       do_sys_open+0xcf/0x13c fs/open.c:1228
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0x0
      RIP: 0033:0x7f7ef4bb753d
      Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48
      89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73
      01 c3 48 8b 0d 1b 79 2c 00 f7 d8 64 89 01 48
      RSP: 002b:00007f7ef52c2ed8 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
      RAX: ffffffffffffffda RBX: 0000000000404148 RCX: 00007f7ef4bb753d
      RDX: 00007f7ef4bb753d RSI: 0000000000000000 RDI: 0000000020004fc0
      RBP: 0000000000404140 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 0030656c69662f2e
      R13: 00007ffd794db37f R14: 00007ffd794db470 R15: 00007f7ef52c2fc0
       </TASK>
      
      Allocated by task 2953:
       kasan_save_stack+0x19/0x38 mm/kasan/common.c:38
       kasan_set_track mm/kasan/common.c:46 [inline]
       set_alloc_info mm/kasan/common.c:434 [inline]
       __kasan_slab_alloc+0x68/0x7c mm/kasan/common.c:467
       kasan_slab_alloc include/linux/kasan.h:254 [inline]
       slab_post_alloc_hook mm/slab.h:519 [inline]
       slab_alloc_node mm/slub.c:3213 [inline]
       slab_alloc mm/slub.c:3221 [inline]
       kmem_cache_alloc+0x11b/0x3eb mm/slub.c:3226
       kmem_cache_zalloc include/linux/slab.h:711 [inline]
       xfs_ifork_alloc+0x25/0xa2 fs/xfs/libxfs/xfs_inode_fork.c:287
       xfs_bmap_add_attrfork+0x3f2/0x9b1 fs/xfs/libxfs/xfs_bmap.c:1098
       xfs_attr_set+0xe38/0x12a7 fs/xfs/libxfs/xfs_attr.c:746
       xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59
       __vfs_setxattr+0x11b/0x177 fs/xattr.c:180
       __vfs_setxattr_noperm+0x128/0x5e0 fs/xattr.c:214
       __vfs_setxattr_locked+0x1d4/0x258 fs/xattr.c:275
       vfs_setxattr+0x154/0x33d fs/xattr.c:301
       setxattr+0x216/0x29f fs/xattr.c:575
       __do_sys_fsetxattr fs/xattr.c:632 [inline]
       __se_sys_fsetxattr fs/xattr.c:621 [inline]
       __x64_sys_fsetxattr+0x243/0x2fe fs/xattr.c:621
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0x0
      
      Freed by task 2949:
       kasan_save_stack+0x19/0x38 mm/kasan/common.c:38
       kasan_set_track+0x1c/0x21 mm/kasan/common.c:46
       kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:360
       ____kasan_slab_free mm/kasan/common.c:366 [inline]
       ____kasan_slab_free mm/kasan/common.c:328 [inline]
       __kasan_slab_free+0xe2/0x10e mm/kasan/common.c:374
       kasan_slab_free include/linux/kasan.h:230 [inline]
       slab_free_hook mm/slub.c:1700 [inline]
       slab_free_freelist_hook mm/slub.c:1726 [inline]
       slab_free mm/slub.c:3492 [inline]
       kmem_cache_free+0xdc/0x3ce mm/slub.c:3508
       xfs_attr_fork_remove+0x8d/0x132 fs/xfs/libxfs/xfs_attr_leaf.c:773
       xfs_attr_sf_removename+0x5dd/0x6cb fs/xfs/libxfs/xfs_attr_leaf.c:822
       xfs_attr_remove_iter+0x68c/0x805 fs/xfs/libxfs/xfs_attr.c:1413
       xfs_attr_remove_args+0xb1/0x10d fs/xfs/libxfs/xfs_attr.c:684
       xfs_attr_set+0xf1e/0x12a7 fs/xfs/libxfs/xfs_attr.c:802
       xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59
       __vfs_removexattr+0x106/0x16a fs/xattr.c:468
       cap_inode_killpriv+0x24/0x47 security/commoncap.c:324
       security_inode_killpriv+0x54/0xa1 security/security.c:1414
       setattr_prepare+0x1a6/0x897 fs/attr.c:146
       xfs_vn_change_ok+0x111/0x15e fs/xfs/xfs_iops.c:682
       xfs_vn_setattr_size+0x5f/0x15a fs/xfs/xfs_iops.c:1065
       xfs_vn_setattr+0x125/0x2ad fs/xfs/xfs_iops.c:1093
       notify_change+0xae5/0x10a1 fs/attr.c:410
       do_truncate+0x134/0x1e0 fs/open.c:64
       handle_truncate fs/namei.c:3084 [inline]
       do_open fs/namei.c:3432 [inline]
       path_openat+0x30ab/0x396d fs/namei.c:3561
       do_filp_open+0x1c4/0x290 fs/namei.c:3588
       do_sys_openat2+0x60d/0x98c fs/open.c:1212
       do_sys_open+0xcf/0x13c fs/open.c:1228
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0x0
      
      The buggy address belongs to the object at ffff88802cec9188
       which belongs to the cache xfs_ifork of size 40
      The buggy address is located 20 bytes inside of
       40-byte region [ffff88802cec9188, ffff88802cec91b0)
      The buggy address belongs to the page:
      page:00000000c3af36a1 refcount:1 mapcount:0 mapping:0000000000000000
      index:0x0 pfn:0x2cec9
      flags: 0xfffffc0000200(slab|node=0|zone=1|lastcpupid=0x1fffff)
      raw: 000fffffc0000200 ffffea00009d2580 0000000600000006 ffff88801a9ffc80
      raw: 0000000000000000 0000000080490049 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff88802cec9080: fb fb fb fc fc fa fb fb fb fb fc fc fb fb fb fb
       ffff88802cec9100: fb fc fc fb fb fb fb fb fc fc fb fb fb fb fb fc
      >ffff88802cec9180: fc fa fb fb fb fb fc fc fa fb fb fb fb fc fc fb
                                  ^
       ffff88802cec9200: fb fb fb fb fc fc fb fb fb fb fb fc fc fb fb fb
       ffff88802cec9280: fb fb fc fc fa fb fb fb fb fc fc fa fb fb fb fb
      ==================================================================
      
      The root cause of this bug is the unlocked access to xfs_inode.i_afp
      from the getxattr code paths while trying to determine which ILOCK mode
      to use to stabilize the xattr data.  Unfortunately, the VFS does not
      acquire i_rwsem when vfs_getxattr (or listxattr) calls into the
      filesystem, which means that getxattr can race with a removexattr that's
      tearing down the attr fork and crash:
      
      xfs_attr_set:                          xfs_attr_get:
      xfs_attr_fork_remove:                  xfs_ilock_attr_map_shared:
      
      xfs_idestroy_fork(ip->i_afp);
      kmem_cache_free(xfs_ifork_cache, ip->i_afp);
      
                                             if (ip->i_afp &&
      
      ip->i_afp = NULL;
      
                                                 xfs_need_iread_extents(ip->i_afp))
                                             <KABOOM>
      
      ip->i_forkoff = 0;
      
      Regrettably, the VFS is much more lax about i_rwsem and getxattr than
      is immediately obvious -- not only does it not guarantee that we hold
      i_rwsem, it actually doesn't guarantee that we *don't* hold it either.
      The getxattr system call won't acquire the lock before calling XFS, but
      the file capabilities code calls getxattr with and without i_rwsem held
      to determine if the "security.capabilities" xattr is set on the file.
      
      Fixing the VFS locking requires a treewide investigation into every code
      path that could touch an xattr and what i_rwsem state it expects or sets
      up.  That could take years or even prove impossible; fortunately, we
      can fix this UAF problem inside XFS.
      
      An earlier version of this patch used smp_wmb in xfs_attr_fork_remove to
      ensure that i_forkoff is always zeroed before i_afp is set to null and
      changed the read paths to use smp_rmb before accessing i_forkoff and
      i_afp, which avoided these UAF problems.  However, the patch author was
      too busy dealing with other problems in the meantime, and by the time he
      came back to this issue, the situation had changed a bit.
      
      On a modern system with selinux, each inode will always have at least
      one xattr for the selinux label, so it doesn't make much sense to keep
      incurring the extra pointer dereference.  Furthermore, Allison's
      upcoming parent pointer patchset will also cause nearly every inode in
      the filesystem to have extended attributes.  Therefore, make the inode
      attribute fork structure part of struct xfs_inode, at a cost of 40 more
      bytes.
      
      This patch adds a clunky if_present field where necessary to maintain
      the existing logic of xattr fork null pointer testing in the existing
      codebase.  The next patch switches the logic over to XFS_IFORK_Q and it
      all goes away.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: convert XFS_IFORK_PTR to a static inline helper · 732436ef
      Committed by Darrick J. Wong
      We're about to make this logic do a bit more, so convert the macro to a
      static inline function for better typechecking and fewer shouty macros.
      No functional changes here.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
  7. 04 Jul, 2022 1 commit
    • mm: shrinkers: provide shrinkers with names · e33c267a
      Committed by Roman Gushchin
      Currently shrinkers are anonymous objects.  For debugging purposes they
      can be identified by count/scan function names, but that is not always
      useful: e.g. for superblock shrinkers it's nice to have at least an
      idea of which superblock the shrinker belongs to.
      
      This commit adds names to shrinkers.  The register_shrinker() and
      prealloc_shrinker() functions are extended to take a format and arguments
      from which the name is built.
      
      In some cases it's not possible to determine a good name at the time when
      a shrinker is allocated.  For such cases shrinker_debugfs_rename() is
      provided.
      
      The expected format is:
          <subsystem>-<shrinker_type>[:<instance>]-<id>
      For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair.
      
      After this change the shrinker debugfs directory looks like:
        $ cd /sys/kernel/debug/shrinker/
        $ ls
          dquota-cache-16     sb-devpts-28     sb-proc-47       sb-tmpfs-42
          mm-shadow-18        sb-devtmpfs-5    sb-proc-48       sb-tmpfs-43
          mm-zspool:zram0-34  sb-hugetlbfs-17  sb-pstore-31     sb-tmpfs-44
          rcu-kfree-0         sb-hugetlbfs-33  sb-rootfs-2      sb-tmpfs-49
          sb-aio-20           sb-iomem-12      sb-securityfs-6  sb-tracefs-13
          sb-anon_inodefs-15  sb-mqueue-21     sb-selinuxfs-22  sb-xfs:vda1-36
          sb-bdev-3           sb-nsfs-4        sb-sockfs-8      sb-zsmalloc-19
          sb-bpf-32           sb-pipefs-14     sb-sysfs-26      thp-deferred_split-10
          sb-btrfs:vda2-24    sb-proc-25       sb-tmpfs-1       thp-zero-9
          sb-cgroup2-30       sb-proc-39       sb-tmpfs-27      xfs-buf:vda1-37
          sb-configfs-23      sb-proc-41       sb-tmpfs-29      xfs-inodegc:vda1-38
          sb-dax-11           sb-proc-45       sb-tmpfs-35
          sb-debugfs-7        sb-proc-46       sb-tmpfs-40
      
      [roman.gushchin@linux.dev: fix build warnings]
        Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle
        Reported-by: kernel test robot <lkp@intel.com>
      Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev
      Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
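
The naming format given above, `<subsystem>-<shrinker_type>[:<instance>]-<id>`, can be illustrated with a small formatting helper. This is a userspace sketch only: `model_shrinker_name` is a hypothetical function, not part of the shrinker API, and in the commit the name is built from a caller-supplied format and arguments.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not a kernel function: builds
 * <subsystem>-<shrinker_type>[:<instance>]-<id> into buf.
 * Returns the snprintf() result (the length written, or the length
 * that would have been written if buf were large enough).
 */
static int model_shrinker_name(char *buf, size_t len, const char *subsystem,
                               const char *type, const char *instance, int id)
{
    if (instance && instance[0] != '\0')
        return snprintf(buf, len, "%s-%s:%s-%d", subsystem, type, instance, id);
    return snprintf(buf, len, "%s-%s-%d", subsystem, type, id);
}
```

Called with ("xfs", "inodegc", "vda1", 38) it produces the same string as the `xfs-inodegc:vda1-38` entry in the listing, and with a NULL instance it produces the shorter form seen in `sb-tmpfs-42`.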
  8. 24 Jun, 2022 2 commits
    • xfs: introduce xfs_inodegc_push() · 5e672cd6
      Committed by Dave Chinner
      The current blocking mechanism for pushing the inodegc queue out to
      disk can result in systems becoming unusable when there is a long
      running inodegc operation. This is because the statfs()
      implementation currently issues a blocking flush of the inodegc
      queue and a significant number of common system utilities will call
      statfs() to discover something about the underlying filesystem.
      
      This can result in userspace operations getting stuck on inodegc
      progress, and when trying to remove a heavily reflinked file on slow
      storage with a full journal, this can result in delays measured in
      hours.
      
      Avoid this problem by adding a "push" function that expedites the
      flushing of the inodegc queue, but doesn't wait for it to complete.
      
      Convert xfs_fs_statfs() and xfs_qm_scall_getquota() to use this
      mechanism so they don't block but still ensure that queued
      operations are expedited.
      
      Fixes: ab23a776 ("xfs: per-cpu deferred inode inactivation queues")
      Reported-by: Chris Dunlop <chris@onthe.net.au>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      [djwong: fix _getquota_next to use _inodegc_push too]
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: bound maximum wait time for inodegc work · 7cf2b0f9
      Committed by Dave Chinner
      Currently inodegc work can sit queued on the per-cpu queue until
      the workqueue is either flushed or the queue reaches a depth that
      triggers work queuing (and later throttling). This means that we
      could queue work that waits for a long time for some other event to
      trigger flushing.
      
      Hence instead of just queueing work at a specific depth, use a
      delayed work that queues the work at a bounded time. We can still
      schedule the work immediately at a given depth, but we no longer need
      to worry about leaving a number of items on the list that won't get
      processed until external events prevail.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  9. 12 Apr, 2022 1 commit
    • xfs: use a separate frextents counter for rt extent reservations · 2229276c
      Committed by Darrick J. Wong
      As mentioned in the previous commit, the kernel misuses sb_frextents in
      the incore mount to reflect both incore reservations made by running
      transactions as well as the actual count of free rt extents on disk.
      This results in the superblock being written to the log with an
      underestimate of the number of rt extents that are marked free in the
      rtbitmap.
      
      Teaching XFS to recompute frextents after log recovery avoids
      operational problems in the current mount, but it doesn't solve the
      problem of us writing undercounted frextents which are then recovered by
      an older kernel that doesn't have that fix.
      
      Create an incore percpu counter to mirror the ondisk frextents.  This
      new counter will track transaction reservations and the only time we
      will touch the incore super counter (i.e. the one that gets logged) is
      when those transactions commit updates to the rt bitmap.  This is in
      contrast to the lazysbcount counters (e.g. fdblocks), where we know that
      log recovery will always fix any incorrect counter that we log.
      As a bonus, we only take m_sb_lock at transaction commit time.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
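
The split described above keeps two numbers: a counter that absorbs transaction reservations, and the superblock value that changes only when a commit really updates the rt bitmap. The following userspace model shows how the two move. It is a sketch under stated assumptions: the structure, the `model_` functions, and the plain integers (instead of a percpu counter and locking) are all hypothetical simplifications.

```c
#include <assert.h>

/* Hypothetical model; plain integers stand in for the real counters. */
struct model_mount {
    long long incore_frextents;  /* free rt extents minus live reservations */
    long long sb_frextents;      /* value that would be written to the log */
};

/* Reserve extents for a transaction: only the incore counter moves. */
static int model_reserve(struct model_mount *mp, long long n)
{
    if (mp->incore_frextents < n)
        return -1;               /* not enough free extents to reserve */
    mp->incore_frextents -= n;
    return 0;
}

/* Commit: 'used' of the 'reserved' extents were allocated in the bitmap. */
static void model_commit(struct model_mount *mp, long long reserved,
                         long long used)
{
    assert(used >= 0 && used <= reserved);
    mp->incore_frextents += reserved - used;  /* return the unused part */
    mp->sb_frextents -= used;                 /* logged value tracks the bitmap */
}

/* Cancel: the whole reservation goes back; the logged value is untouched. */
static void model_cancel(struct model_mount *mp, long long reserved)
{
    mp->incore_frextents += reserved;
}
```

While a reservation is outstanding, the logged value still matches the bitmap, so a superblock written at that moment does not undercount free extents.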
  10. 30 Mar, 2022 1 commit
    • D
      xfs: aborting inodes on shutdown may need buffer lock · d2d7c047
      Dave Chinner committed
      Most buffer io list operations are run with the bp->b_lock held, but
      xfs_iflush_abort() can be called without the buffer lock being held
      resulting in inodes being removed from the buffer list while other
      list operations are occurring. This causes problems with corrupted
      bp->b_io_list inode lists during filesystem shutdown, leading to
      traversals that never end, double removals from the AIL, etc.
      
      Fix this by passing the buffer to xfs_iflush_abort() if we have
      it locked. If the inode is attached to the buffer, we're going to
      have to remove it from the buffer list and we'd have to get the
      buffer off the inode log item to do that anyway.
      
      If we don't have a buffer passed in (e.g. from xfs_reclaim_inode())
      then we can determine if the inode has a log item and if it is
      attached to a buffer before we do anything else. If it does have an
      attached buffer, we can lock it safely (because the inode has a
      reference to it) and then perform the inode abort.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      d2d7c047
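      The two calling conventions described above can be sketched in Python
      with an ordinary lock (a model, not kernel code). `Buffer`, `Inode`,
      `attach` and the `locked_buf` parameter are hypothetical names; only
      the locking order is meant to match the message.

```python
import threading


class Buffer:
    def __init__(self):
        self.lock = threading.Lock()  # stands in for the buffer lock
        self.inodes = []              # stands in for the buffer's inode list


class Inode:
    def __init__(self):
        self.buf = None  # buffer the inode's log item is attached to, if any


def attach(inode: Inode, buf: Buffer) -> None:
    with buf.lock:
        buf.inodes.append(inode)
        inode.buf = buf


def _detach(inode: Inode, buf: Buffer) -> None:
    # Caller must hold buf.lock. Re-check the attachment under the lock.
    if inode.buf is buf:
        buf.inodes.remove(inode)
        inode.buf = None


def iflush_abort(inode: Inode, locked_buf: Buffer = None) -> None:
    """Abort writeback of an inode, never touching the list unlocked."""
    if locked_buf is not None:
        # Caller already holds the buffer lock and passed the buffer in.
        _detach(inode, locked_buf)
        return
    buf = inode.buf
    if buf is None:
        return  # not attached to a buffer, nothing to remove
    with buf.lock:  # lock the attached buffer before modifying its list
        _detach(inode, buf)
```

      Either way, the list is only modified while the buffer lock is held,
      which is the invariant the fix restores.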
  11. 23 Mar, 2022 1 commit
  12. 20 Mar, 2022 1 commit
    • D
      xfs: xfs_is_shutdown vs xlog_is_shutdown cage fight · 01728b44
      Dave Chinner committed
      I've been chasing a recent resurgence in generic/388 recovery
      failure and/or corruption events. The events have largely been
      uninitialised inode chunks being tripped over in log recovery
      such as:
      
       XFS (pmem1): User initiated shutdown received.
       pmem1: writeback error on inode 12621949, offset 1019904, sector 12968096
       XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/xfs/xfs_fsops.c:500).  Shutting down filesystem.
       XFS (pmem1): Please unmount the filesystem and rectify the problem(s)
       XFS (pmem1): Unmounting Filesystem
       XFS (pmem1): Mounting V5 Filesystem
       XFS (pmem1): Starting recovery (logdev: internal)
       XFS (pmem1): bad inode magic/vsn daddr 8723584 #0 (magic=1818)
       XFS (pmem1): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x851c80 xfs_inode_buf_verify
       XFS (pmem1): Unmount and run xfs_repair
       XFS (pmem1): First 128 bytes of corrupted metadata buffer:
       00000000: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000010: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000020: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000030: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000040: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000050: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000060: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000070: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       XFS (pmem1): metadata I/O error in "xlog_recover_items_pass2+0x52/0xc0" at daddr 0x851c80 len 32 error 117
       XFS (pmem1): log mount/recovery failed: error -117
       XFS (pmem1): log mount failed
      
      There have been isolated random other issues, too - xfs_repair fails
      because it finds some corruption in symlink blocks, rmap
      inconsistencies, etc - but they are nowhere near as common as the
      uninitialised inode chunk failure.
      
      The problem has clearly happened at runtime before recovery has run;
      I can see the ICREATE log item in the log shortly before the
      actively recovered range of the log. This means the ICREATE was
      definitely created and written to the log, but for some reason the
      tail of the log has been moved past the ordered buffer log item that
      tracks INODE_ALLOC buffers and, supposedly, prevents the tail of the
      log moving past the ICREATE log item before the inode chunk buffer
      is written to disk.
      
      Tracing the fsstress processes that are running when the filesystem
      shut down immediately pin-pointed the problem:
      
      user shutdown marks xfs_mount as shutdown
      
               godown-213341 [008]  6398.022871: console:              [ 6397.915392] XFS (pmem1): User initiated shutdown received.
      .....
      
      aild tries to push ordered inode cluster buffer
      
        xfsaild/pmem1-213314 [001]  6398.022974: xfs_buf_trylock:      dev 259:1 daddr 0x851c80 bbcount 0x20 hold 16 pincount 0 lock 0 flags DONE|INODES|PAGES caller xfs_inode_item_push+0x8e
        xfsaild/pmem1-213314 [001]  6398.022976: xfs_ilock_nowait:     dev 259:1 ino 0x851c80 flags ILOCK_SHARED caller xfs_iflush_cluster+0xae
      
      xfs_iflush_cluster() checks xfs_is_shutdown(), returns true,
      calls xfs_iflush_abort() to kill writeback of the inode.
      Inode is removed from AIL, drops cluster buffer reference.
      
        xfsaild/pmem1-213314 [001]  6398.022977: xfs_ail_delete:       dev 259:1 lip 0xffff88880247ed80 old lsn 7/20344 new lsn 7/21000 type XFS_LI_INODE flags IN_AIL
        xfsaild/pmem1-213314 [001]  6398.022978: xfs_buf_rele:         dev 259:1 daddr 0x851c80 bbcount 0x20 hold 17 pincount 0 lock 0 flags DONE|INODES|PAGES caller xfs_iflush_abort+0xd7
      
      .....
      
      All inodes on cluster buffer are aborted, then the cluster buffer
      itself is aborted and removed from the AIL *without writeback*:
      
      xfsaild/pmem1-213314 [001]  6398.023011: xfs_buf_error_relse:  dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_ioend_fail+0x33
         xfsaild/pmem1-213314 [001]  6398.023012: xfs_ail_delete:       dev 259:1 lip 0xffff8888053efde8 old lsn 7/20344 new lsn 7/20344 type XFS_LI_BUF flags IN_AIL
      
      The inode buffer was at 7/20344 when it was removed from the AIL.
      
         xfsaild/pmem1-213314 [001]  6398.023012: xfs_buf_item_relse:   dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_item_done+0x31
         xfsaild/pmem1-213314 [001]  6398.023012: xfs_buf_rele:         dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_item_relse+0x39
      
      .....
      
      Userspace is still running, doing stuff. An fsstress process runs
      syncfs() or sync() and we end up in sync_fs_one_sb() which issues
      a log force. This pushes on the CIL:
      
              fsstress-213322 [001]  6398.024430: xfs_fs_sync_fs:       dev 259:1 m_features 0x20000000019ff6e9 opstate (clean|shutdown|inodegc|blockgc) s_flags 0x70810000 caller sync_fs_one_sb+0x26
              fsstress-213322 [001]  6398.024430: xfs_log_force:        dev 259:1 lsn 0x0 caller xfs_fs_sync_fs+0x82
              fsstress-213322 [001]  6398.024430: xfs_log_force:        dev 259:1 lsn 0x5f caller xfs_log_force+0x7c
                 <...>-194402 [001]  6398.024467: kmem_alloc:           size 176 flags 0x14 caller xlog_cil_push_work+0x9f
      
      And the CIL fills up iclogs with pending changes. This picks up
      the current tail from the AIL:
      
                 <...>-194402 [001]  6398.024497: xlog_iclog_get_space: dev 259:1 state XLOG_STATE_ACTIVE refcnt 1 offset 0 lsn 0x0 flags  caller xlog_write+0x149
                 <...>-194402 [001]  6398.024498: xlog_iclog_switch:    dev 259:1 state XLOG_STATE_ACTIVE refcnt 1 offset 0 lsn 0x700005408 flags  caller xlog_state_get_iclog_space+0x37e
                 <...>-194402 [001]  6398.024521: xlog_iclog_release:   dev 259:1 state XLOG_STATE_WANT_SYNC refcnt 1 offset 32256 lsn 0x700005408 flags  caller xlog_write+0x5f9
                 <...>-194402 [001]  6398.024522: xfs_log_assign_tail_lsn: dev 259:1 new tail lsn 7/21000, old lsn 7/20344, last sync 7/21448
      
      And it moves the tail of the log to 7/21000 from 7/20344. This
      *moves the tail of the log beyond the ICREATE transaction* that was
      at 7/20344 and pinned by the inode cluster buffer that was cancelled
      above.
      
      ....
      
               godown-213341 [008]  6398.027005: xfs_force_shutdown:   dev 259:1 tag logerror flags log_io|force_umount file fs/xfs/xfs_fsops.c line_num 500
                godown-213341 [008]  6398.027022: console:              [ 6397.915406] pmem1: writeback error on inode 12621949, offset 1019904, sector 12968096
                godown-213341 [008]  6398.030551: console:              [ 6397.919546] XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/
      
      And finally the log itself is now shutdown, stopping all further
      writes to the log. But this is too late to prevent the corruption
      that moving the tail of the log forwards after we start cancelling
      writeback causes.
      
      The fundamental problem here is that we are using the wrong shutdown
      checks for log items. We've long conflated mount shutdown with log
      shutdown state, and I started separating that recently with the
      atomic shutdown state changes in commit b36d4651 ("xfs: make
      forced shutdown processing atomic"). The changes in that commit
      series are directly responsible for being able to diagnose this
      issue because it clearly separated mount shutdown from log shutdown.
      
      Essentially, once we start cancelling writeback of log items and
      removing them from the AIL because the filesystem is shut down, we
      *cannot* update the journal because we may have cancelled the items
      that pin the tail of the log. That moves the tail of the log
      forwards without having written the metadata back, hence we have
      corrupt in-memory state, and writing to the journal propagates that
      to the on-disk state.
      
      What commit b36d4651 makes clear is that log item state needs to
      change relative to log shutdown, not mount shutdown. IOWs, anything
      that aborts metadata writeback needs to check log shutdown state
      because log items directly affect log consistency. Having them check
      mount shutdown state introduces the above race condition where we
      cancel metadata writeback before the log shuts down.
      
      To fix this, this patch works through all log items and converts
      shutdown checks to use xlog_is_shutdown() rather than
      xfs_is_shutdown(), so that we don't start aborting metadata
      writeback before we shut off journal writes.
      
      AFAICT, this race condition is a zero day IO error handling bug in
      XFS that dates back to the introduction of XLOG_IO_ERROR,
      XLOG_STATE_IOERROR and XFS_FORCED_SHUTDOWN back in January 1997.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      01728b44
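      The race window can be reproduced in a toy Python model (not kernel
      code). The `Fs` class, `push_item` and `log_tail` are hypothetical;
      the two LSN block numbers are the ones from the trace above, and the
      AIL is reduced to a dictionary keyed by LSN.

```python
from dataclasses import dataclass, field


@dataclass
class Fs:
    mount_shutdown: bool = False  # what xfs_is_shutdown() reports in this model
    log_shutdown: bool = False    # what xlog_is_shutdown() reports in this model
    ail: dict = field(default_factory=dict)  # lsn -> item description


def log_tail(fs: Fs):
    """The tail is pinned by the oldest item still in the AIL."""
    return min(fs.ail) if fs.ail else None


def push_item(fs: Fs, lsn: int, use_log_check: bool) -> str:
    """Push one AIL item, aborting writeback if the chosen check says shutdown."""
    shutdown = fs.log_shutdown if use_log_check else fs.mount_shutdown
    if shutdown:
        fs.ail.pop(lsn, None)  # removed from the AIL without writeback
        return "aborted"
    return "writeback"         # item stays in the AIL until IO completes


def make_fs() -> Fs:
    # Mount is already marked shut down, the log is not (the race window).
    return Fs(mount_shutdown=True,
              ail={20344: "inode cluster buffer", 21000: "other item"})
```

      With the mount-level check the buffer at 20344 is aborted and the tail
      jumps to 21000 while the log can still be written. With the log-level
      check the item stays put and the tail remains at 20344.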
  13. 20 Jan, 2022 1 commit
    • B
      xfs: flush inodegc workqueue tasks before cancel · 6191cf3a
      Brian Foster committed
      The xfs_inodegc_stop() helper performs a high level flush of pending
      work on the percpu queues and then runs a cancel_work_sync() on each
      of the percpu work tasks to ensure all work has completed before
      returning.  While cancel_work_sync() waits for wq tasks to complete,
      it does not guarantee work tasks have started. This means that the
      _stop() helper can queue and instantly cancel a wq task without
      having completed the associated work. This can be observed by
      tracepoint inspection of a simple "rm -f <file>; fsfreeze -f <mnt>"
      test:
      
      	xfs_destroy_inode: ... ino 0x83 ...
      	xfs_inode_set_need_inactive: ... ino 0x83 ...
      	xfs_inodegc_stop: ...
      	...
      	xfs_inodegc_start: ...
      	xfs_inodegc_worker: ...
      	xfs_inode_inactivating: ... ino 0x83 ...
      
      The first few lines show that the inode is removed and the need-inactive
      state is set, but the inactivation work has not completed before the
      inodegc mechanism stops. The inactivation doesn't actually occur
      until the fs is unfrozen and the gc mechanism starts back up. Note
      that this test requires fsfreeze to reproduce because xfs_freeze
      indirectly invokes xfs_fs_statfs(), which calls xfs_inodegc_flush().
      
      When this occurs, the workqueue try_to_grab_pending() logic first
      tries to steal the pending bit, which does not succeed because the
      bit has been set by queue_work_on(). Subsequently, it checks for
      association of a pool workqueue from the work item under the pool
      lock. This association is set at the point a work item is queued and
      cleared when dequeued for processing. If the association exists, the
      work item is removed from the queue and cancel_work_sync() returns
      true. If the pwq association is cleared, the remove attempt assumes
      the task is busy and retries (eventually returning false to the
      caller after waiting for the work task to complete).
      
      To avoid this race, we can flush each work item explicitly before
      cancel. However, since the _queue_all() already schedules each
      underlying work item, the workqueue level helpers are sufficient to
      achieve the same ordering effect. E.g., the inodegc enabled flag
      prevents scheduling any further work in the _stop() case. Use the
      drain_workqueue() helper in this particular case to make the intent
      a bit more self explanatory.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      6191cf3a
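      The ordering problem can be shown with a single-threaded Python model
      (not the kernel workqueue API). The functions below borrow the kernel
      names only as labels; their behaviour is reduced to what the message
      describes: a cancel of a pending item dequeues it without running it,
      while a flush runs it first. The model flushes each item explicitly,
      which the message names as an equivalent ordering to the
      drain_workqueue() call the patch uses.

```python
class Work:
    def __init__(self, fn):
        self.fn = fn
        self.pending = False  # queued but not yet started


def queue_work(work: Work) -> bool:
    """Queue the item; returns False if it was already pending."""
    if work.pending:
        return False
    work.pending = True
    return True


def flush_work(work: Work) -> None:
    """Model: run the item now if it is pending, then return."""
    if work.pending:
        work.pending = False
        work.fn()


def cancel_work_sync(work: Work) -> bool:
    """Model: a pending item is removed from the queue WITHOUT being run.
    Returns True if an item was pending."""
    was_pending = work.pending
    work.pending = False
    return was_pending
```

      Queue followed directly by cancel leaves the inactivation undone;
      queue, flush, cancel completes it before the mechanism stops.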
  14. 22 Dec, 2021 1 commit
  15. 18 Dec, 2021 1 commit
  16. 25 Nov, 2021 1 commit
  17. 23 Oct, 2021 1 commit
  18. 25 Aug, 2021 1 commit
  19. 20 Aug, 2021 4 commits
  20. 19 Aug, 2021 1 commit
  21. 10 Aug, 2021 7 commits
    • D
      xfs: throttle inode inactivation queuing on memory reclaim · 40b1de00
      Darrick J. Wong committed
      Now that we defer inode inactivation, we've decoupled the process of
      unlinking or closing an inode from the process of inactivating it.  In
      theory this should lead to better throughput since we now inactivate the
      queued inodes in batches instead of one at a time.
      
      Unfortunately, one of the primary risks with this decoupling is the loss
      of rate control feedback between the frontend and background threads.
      In other words, a rm -rf /* thread can run the system out of memory if
      it can queue inodes for inactivation and jump to a new CPU faster than
      the background threads can actually clear the deferred work.  The
      workers can get scheduled off the CPU if they have to do IO, etc.
      
      To solve this problem, we configure a shrinker so that it will activate
      the /second/ time the shrinkers are called.  The custom shrinker will
      queue all percpu deferred inactivation workers immediately and set a
      flag to force frontend callers who are releasing a vfs inode to wait for
      the inactivation workers.
      
      On my test VM with 560M of RAM and a 2TB filesystem, this seems to solve
      most of the OOMing problem when deleting 10 million inodes.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      40b1de00
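      A rough Python model of the feedback loop described above (not kernel
      code). All names are hypothetical. How the kernel arranges for the
      shrinker to act only on the second pass is not stated in the message,
      so a plain call counter stands in for it here.

```python
class InodegcShrinkerModel:
    def __init__(self, ncpus: int):
        self.ncpus = ncpus
        self.calls = 0
        self.queued_cpus = set()  # CPUs whose worker was queued immediately
        self.throttle = False     # when set, frontend callers must wait

    def scan(self) -> bool:
        """Called on each shrinker pass; acts from the second pass on."""
        self.calls += 1
        if self.calls < 2:
            return False
        self.queued_cpus = set(range(self.ncpus))
        self.throttle = True
        return True

    def release_inode(self) -> str:
        """What a frontend thread releasing a vfs inode does in this model."""
        return "wait-for-workers" if self.throttle else "queue-and-return"
```

      The flag is what restores rate control: once memory pressure
      persists, the thread generating the work is made to wait for the
      workers instead of racing ahead of them.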
    • D
      xfs: use background worker pool when transactions can't get free space · e8d04c2a
      Darrick J. Wong committed
      In xfs_trans_alloc, if the block reservation call returns ENOSPC, we
      call xfs_blockgc_free_space with a NULL icwalk structure to try to free
      space.  Each frontend thread that encounters this situation starts its
      own walk of the inode cache to see if it can find anything, which is
      wasteful since we don't have any additional selection criteria.  For
      this one common case, create a function that reschedules all pending
      background work immediately and flushes the workqueue so that the scan
      can run in parallel.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      e8d04c2a
    • D
      xfs: don't run speculative preallocation gc when fs is frozen · 6f649091
      Darrick J. Wong committed
      Now that we have the infrastructure to switch background workers on and
      off at will, fix the block gc worker code so that we don't actually run
      the worker when the filesystem is frozen, same as we do for deferred
      inactivation.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      6f649091
    • D
      xfs: inactivate inodes any time we try to free speculative preallocations · 2eb66502
      Darrick J. Wong committed
      Other parts of XFS have learned to call xfs_blockgc_free_{space,quota}
      to try to free speculative preallocations when space is tight.  This
      means that file writes, transaction reservation failures, quota limit
      enforcement, and the EOFBLOCKS ioctl all call this function to free
      space when things are tight.
      
      Since inode inactivation is now a background task, this means that the
      filesystem can be hanging on to unlinked but not yet freed space.  Add
      this to the list of things that xfs_blockgc_free_* makes writer threads
      scan for when they cannot reserve space.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      2eb66502
    • D
      xfs: queue inactivation immediately when free realtime extents are tight · 65f03d86
      Darrick J. Wong committed
      Now that we have made the inactivation of unlinked inodes a background
      task to increase the throughput of file deletions, we need to be a
      little more careful about how long of a delay we can tolerate.
      
      Similar to the patch doing this for free space on the data device, if
      the file being inactivated is a realtime file and the realtime volume is
      running low on free extents, we want to run the worker ASAP so that the
      realtime allocator can make better decisions.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      65f03d86
    • D
      xfs: queue inactivation immediately when quota is nearing enforcement · 108523b8
      Darrick J. Wong committed
      Now that we have made the inactivation of unlinked inodes a background
      task to increase the throughput of file deletions, we need to be a
      little more careful about how long of a delay we can tolerate.
      
      Specifically, if the dquots attached to the inode being inactivated are
      nearing any kind of enforcement boundary, we want to queue that
      inactivation work immediately so that users don't get EDQUOT/ENOSPC
      errors even after they deleted a bunch of files to stay within quota.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      108523b8
    • D
      xfs: queue inactivation immediately when free space is tight · 7d6f07d2
      Darrick J. Wong committed
      Now that we have made the inactivation of unlinked inodes a background
      task to increase the throughput of file deletions, we need to be a
      little more careful about how long of a delay we can tolerate.
      
      On a mostly empty filesystem, the risk of the allocator making poor
      decisions due to fragmentation of the free space on account of a lengthy
      delay in background updates is minimal because there's plenty of space.
      However, if free space is tight, we want to deallocate unlinked inodes
      as quickly as possible to avoid fallocate ENOSPC and to give the
      allocator the best shot at optimal allocations for new writes.
      
      Therefore, queue the percpu worker immediately if the filesystem is more
      than 95% full.  This follows the same principle that XFS becomes less
      aggressive about speculative allocations and lazy cleanup (and more
      precise about accounting) when nearing full.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      7d6f07d2
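      The 95% rule reduces to a one-line comparison. The Python sketch below
      is only the arithmetic; the function name and the block-count
      parameters are hypothetical, and the kernel may well express the same
      test differently (for example against a precomputed threshold).

```python
def should_queue_immediately(free_blocks: int, total_blocks: int,
                             full_pct: int = 95) -> bool:
    """True when the filesystem is MORE than full_pct percent full."""
    if total_blocks <= 0:
        raise ValueError("total_blocks must be positive")
    used_blocks = total_blocks - free_blocks
    # Integer form of used/total > full_pct/100, avoiding float rounding.
    return used_blocks * 100 > total_blocks * full_pct
```

      On a mostly empty filesystem this returns False and the inactivation
      stays deferred; once usage passes the threshold, every queued inode
      triggers the worker right away.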
  22. 07 Aug, 2021 4 commits
    • D
      xfs: per-cpu deferred inode inactivation queues · ab23a776
      Dave Chinner committed
      Move inode inactivation to background work contexts so that it no
      longer runs in the context that releases the final reference to an
      inode. This will allow process work that ends up blocking on
      inactivation to continue doing work while the filesystem processes
      the inactivation in the background.
      
      A typical demonstration of this is unlinking an inode with lots of
      extents. The extents are removed during inactivation, so this blocks
      the process that unlinked the inode from the directory structure. By
      moving the inactivation to the background process, the userspace
      application can keep working (e.g. unlinking the next inode in the
      directory) while the inactivation work on the previous inode is
      done by a different CPU.
      
      The implementation of the queue is relatively simple. We use a
      per-cpu lockless linked list (llist) to queue inodes for
      inactivation without requiring serialisation mechanisms, and a work
      item to allow the queue to be processed by a CPU bound worker
      thread. We also keep a count of the queue depth so that we can
      trigger work after a number of deferred inactivations have been
      queued.
      
      The use of a bound workqueue with a single work depth allows the
      workqueue to run one work item per CPU. We queue the work item on
      the CPU we are currently running on, and so this essentially gives
      us affine per-cpu worker threads for the per-cpu queues. This
      maintains the effective CPU affinity that occurs within XFS at the
      AG level due to all objects in a directory being local to an AG.
      Hence inactivation work tends to run on the same CPU that last
      accessed all the objects that inactivation accesses and this
      maintains hot CPU caches for unlink workloads.
      
      A depth of 32 inodes was chosen to match the number of inodes in an
      inode cluster buffer. This hopefully allows sequential
      allocation/unlink behaviours to defer inactivation of all the
      inodes in a single cluster buffer at a time, further helping
      maintain hot CPU and buffer cache accesses while running
      inactivations.
      
      A hard per-cpu queue throttle of 256 inodes has been set to avoid
      runaway queuing when inodes that take a long time to inactivate are
      being processed, for example when unlinking inodes with large
      numbers of extents that can take a lot of processing to free.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      [djwong: tweak comments and tracepoints, convert opflags to state bits]
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      ab23a776
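      The queueing thresholds in this message can be sketched as a Python
      model (not kernel code). The depths 32 and 256 come from the message;
      the class, the return strings, and the choice of `>=` for both
      comparisons are assumptions, and a plain list stands in for the
      lockless llist.

```python
from collections import defaultdict

SCHEDULE_DEPTH = 32   # queue depth at which the worker is triggered
THROTTLE_DEPTH = 256  # hard per-cpu limit on queued inodes


class InodegcQueues:
    def __init__(self):
        self.queues = defaultdict(list)  # cpu id -> queued inode numbers

    def queue(self, cpu: int, ino: int) -> str:
        """Queue an inode on the current CPU and say what the caller does next."""
        q = self.queues[cpu]
        q.append(ino)
        depth = len(q)
        if depth >= THROTTLE_DEPTH:
            return "throttle"  # caller waits for the worker to catch up
        if depth >= SCHEDULE_DEPTH:
            return "schedule"  # kick the per-cpu worker, caller continues
        return "defer"         # just leave it on the list

    def run_worker(self, cpu: int) -> list:
        """The per-cpu worker takes the whole list in one go."""
        batch, self.queues[cpu] = self.queues[cpu], []
        return batch
```

      Because each CPU has its own list and its own counter, queueing on one
      CPU never affects the thresholds seen on another.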
    • D
      xfs: detach dquots from inode if we don't need to inactivate it · 62af7d54
      Darrick J. Wong committed
      If we don't need to inactivate an inode, we can detach the dquots and
      move on to reclamation.  This isn't strictly required here; it's a
      preparation patch for deferred inactivation per reviewer request[1] to
      move the creation of xfs_inode_needs_inactivation into a separate
      change.  Eventually this !need_inactive chunk will turn into the code
      path for inodes that skip xfs_inactive and go straight to memory
      reclaim.
      
      [1] https://lore.kernel.org/linux-xfs/20210609012838.GW2945738@locust/T/#mca6d958521cb88bbc1bfe1a30767203328d410b5
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      62af7d54
    • D
      xfs: move xfs_inactive call to xfs_inode_mark_reclaimable · c6c2066d
      Darrick J. Wong committed
      Move the xfs_inactive call and all the other debugging checks and stats
      updates into xfs_inode_mark_reclaimable because most of those are
      implementation details about the inode cache.  This is preparation for
      deferred inactivation that is coming up.  We also move it around
      xfs_icache.c in preparation for deferred inactivation.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      c6c2066d
    • C
      xfs: remove xfs_dqrele_all_inodes · 777eb1fa
      Christoph Hellwig committed
      xfs_dqrele_all_inodes is unused now, remove it and all supporting code.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      777eb1fa
  23. 22 Jun, 2021 2 commits