1. 14 July 2022, 1 commit
    • xfs: double link the unlinked inode list · 2fd26cc0
      Committed by Dave Chinner
      Now that we have forwards traversal via the incore inode in place, we
      need to add back pointers to the incore inode to entirely replace
      the back reference cache. We use the same lookup semantics and
      constraints as for the forwards pointer lookups during unlinks, and
      so we can look up any inode in the unlinked list directly and update
      the list pointers, forwards or backwards, at any time.
      
      The only wrinkle in converting the unlinked list manipulations to
      use in-core previous pointers is that log recovery doesn't have the
      incore inode state built up so it can't just read in an inode and
      release it to finish off the unlink. Hence we need to modify the
      traversal in recovery to read one inode ahead before we
      release the inode at the head of the list. This populates the
      next->prev relationship sufficiently to be able to replay the unlinked
      list, and hence greatly simplifies the runtime code.
      
      This recovery algorithm also requires that we actually remove inodes
      from the unlinked list one at a time, as background inode
      inactivation will result in unlinked list removal racing with the
      building of the in-memory unlinked list state. We could serialise
      this by holding the AGI buffer lock while constructing the in-memory
      state, but all that does is lockstep background processing with list
      building. It is much simpler to flush the inodegc queue immediately
      after releasing the inode so that it is unlinked immediately and
      there are no races at all. (A minimal sketch of the double-linked
      list manipulation follows this entry.)
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
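
      Below is a minimal user-space sketch of the list manipulation this
      commit describes: the on-disk format still records only a forward
      "next unlinked" pointer, but the in-core inode carries a back
      pointer as well, so an inode can be spliced out of the middle of the
      list without walking it from the AGI bucket head. The type,
      structure and field names are illustrative, not the kernel's.

      #include <assert.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <stdlib.h>

      typedef uint32_t agino_t;
      #define NULLAGINO ((agino_t)-1)
      #define MAXINO    16

      struct incore_inode {
          agino_t agino;          /* inode number within the AG */
          agino_t next_unlinked;  /* mirrors the on-disk forward pointer */
          agino_t prev_unlinked;  /* in-core only back pointer */
      };

      static struct incore_inode *itable[MAXINO]; /* stand-in inode cache */
      static agino_t bucket_head = NULLAGINO;     /* stand-in AGI bucket */

      static void unlinked_insert(struct incore_inode *ip)
      {
          /* New entries are pushed at the bucket head, as in the AGI. */
          ip->next_unlinked = bucket_head;
          ip->prev_unlinked = NULLAGINO;
          if (bucket_head != NULLAGINO)
              itable[bucket_head]->prev_unlinked = ip->agino;
          bucket_head = ip->agino;
      }

      static void unlinked_remove(struct incore_inode *ip)
      {
          /* Both pointers are in core, so splice out from anywhere. */
          if (ip->prev_unlinked != NULLAGINO)
              itable[ip->prev_unlinked]->next_unlinked = ip->next_unlinked;
          else
              bucket_head = ip->next_unlinked;
          if (ip->next_unlinked != NULLAGINO)
              itable[ip->next_unlinked]->prev_unlinked = ip->prev_unlinked;
          ip->next_unlinked = ip->prev_unlinked = NULLAGINO;
      }

      int main(void)
      {
          for (agino_t i = 0; i < 3; i++) {
              itable[i] = calloc(1, sizeof(*itable[i]));
              itable[i]->agino = i;
              unlinked_insert(itable[i]);
          }
          unlinked_remove(itable[1]);     /* middle removal, no list walk */
          assert(itable[2]->next_unlinked == 0);
          assert(itable[0]->prev_unlinked == 2);
          printf("bucket head is now agino %u\n", (unsigned)bucket_head);
          return 0;
      }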
  2. 24 June 2022, 2 commits
    • xfs: introduce xfs_inodegc_push() · 5e672cd6
      Committed by Dave Chinner
      The current blocking mechanism for pushing the inodegc queue out to
      disk can result in systems becoming unusable when there is a long
      running inodegc operation. This is because the statfs()
      implementation currently issues a blocking flush of the inodegc
      queue and a significant number of common system utilities will call
      statfs() to discover something about the underlying filesystem.
      
      This can result in userspace operations getting stuck on inodegc
      progress, and when trying to remove a heavily reflinked file on slow
      storage with a full journal, this can result in delays measuring in
      hours.
      
      Avoid this problem by adding a "push" function that expedites the
      flushing of the inodegc queue, but doesn't wait for it to complete.
      
      Convert xfs_fs_statfs() and xfs_qm_scall_getquota() to use this
      mechanism so they don't block but still ensure that queued
      operations are expedited.
      
      Fixes: ab23a776 ("xfs: per-cpu deferred inode inactivation queues")
      Reported-by: Chris Dunlop <chris@onthe.net.au>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      [djwong: fix _getquota_next to use _inodegc_push too]
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
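
      The distinction the commit draws between "push" and "flush" can be
      modelled in user space as below, with a pthread worker standing in
      for the inodegc workqueue: push wakes the worker and returns at
      once, flush wakes it and waits for the queue to drain. All names
      and the queue mechanics are illustrative, not the kernel code.

      #include <pthread.h>
      #include <stdio.h>
      #include <unistd.h>

      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_cond_t kick = PTHREAD_COND_INITIALIZER;    /* wake worker */
      static pthread_cond_t drained = PTHREAD_COND_INITIALIZER; /* queue empty */
      static int queued;  /* deferred inactivations */
      static int stop;

      static void *gc_worker(void *arg)
      {
          (void)arg;
          pthread_mutex_lock(&lock);
          while (!stop) {
              while (queued == 0 && !stop)
                  pthread_cond_wait(&kick, &lock);
              while (queued > 0) {
                  pthread_mutex_unlock(&lock);
                  usleep(5000);             /* simulate slow inactivation */
                  pthread_mutex_lock(&lock);
                  queued--;
              }
              pthread_cond_broadcast(&drained);
          }
          pthread_mutex_unlock(&lock);
          return NULL;
      }

      static void gc_push(void)   /* expedite, do not wait (statfs path) */
      {
          pthread_mutex_lock(&lock);
          pthread_cond_signal(&kick);
          pthread_mutex_unlock(&lock);
      }

      static void gc_flush(void)  /* expedite and wait for completion */
      {
          pthread_mutex_lock(&lock);
          pthread_cond_signal(&kick);
          while (queued > 0)
              pthread_cond_wait(&drained, &lock);
          pthread_mutex_unlock(&lock);
      }

      int main(void)
      {
          pthread_t t;

          pthread_create(&t, NULL, gc_worker, NULL);
          pthread_mutex_lock(&lock);
          queued = 50;                      /* a backlog of deferred work */
          pthread_mutex_unlock(&lock);

          gc_push();                        /* returns immediately */
          puts("push returned without waiting");
          gc_flush();                       /* blocks until the queue drains */
          puts("flush saw an empty queue");

          pthread_mutex_lock(&lock);
          stop = 1;
          pthread_cond_signal(&kick);
          pthread_mutex_unlock(&lock);
          pthread_join(t, NULL);
          return 0;
      }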
    • xfs: bound maximum wait time for inodegc work · 7cf2b0f9
      Committed by Dave Chinner
      Currently inodegc work can sit queued on the per-cpu queue until
      the workqueue is either flushed or the queue reaches a depth that
      triggers work queuing (and later throttling). This means that we
      could queue work that waits for a long time for some other event to
      trigger flushing.
      
      Hence instead of just queueing work at a specific depth, use a
      delayed work that queues the work at a bound time. We can still
      schedule the work immediately at a given depth, but we no longer
      need to worry about leaving a number of items on the list that
      won't get processed until external events prevail.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
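
      A sketch of the queueing rule: every queued item gets an upper
      bound on how long it may wait, and reaching the batch depth pulls
      the scheduled run time forward to "now". The structure, names and
      numbers below are illustrative, not the kernel's.

      #include <stdio.h>
      #include <time.h>

      #define GC_BATCH        32   /* depth that triggers immediate work */
      #define GC_MAX_DELAY_MS 100  /* upper bound on how long an item waits */

      struct gc_queue {
          int       items;      /* deferred inodes on this queue */
          long long run_at_ms;  /* when the worker is scheduled to run */
      };

      static long long now_ms(void)
      {
          struct timespec ts;

          clock_gettime(CLOCK_MONOTONIC, &ts);
          return ts.tv_sec * 1000LL + ts.tv_nsec / 1000000;
      }

      /*
       * Queue one item: the first item arms a bound timer, so nothing can
       * wait forever; hitting the batch depth expedites the work to "now".
       */
      static void gc_queue_item(struct gc_queue *q)
      {
          long long deadline = now_ms() + GC_MAX_DELAY_MS;

          q->items++;
          if (q->items == 1 || deadline < q->run_at_ms)
              q->run_at_ms = deadline;
          if (q->items >= GC_BATCH)
              q->run_at_ms = now_ms();
      }

      int main(void)
      {
          struct gc_queue q = { 0 };

          gc_queue_item(&q);
          printf("1 item: worker runs within %lld ms\n",
                 q.run_at_ms - now_ms());
          while (q.items < GC_BATCH)
              gc_queue_item(&q);
          printf("%d items: worker runs within %lld ms\n",
                 q.items, q.run_at_ms - now_ms());
          return 0;
      }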
  3. 12 April 2022, 1 commit
    • xfs: use a separate frextents counter for rt extent reservations · 2229276c
      Committed by Darrick J. Wong
      As mentioned in the previous commit, the kernel misuses sb_frextents in
      the incore mount to reflect both the incore reservations made by running
      transactions and the actual count of free rt extents on disk.
      This results in the superblock being written to the log with an
      underestimate of the number of rt extents that are marked free in the
      rtbitmap.
      
      Teaching XFS to recompute frextents after log recovery avoids
      operational problems in the current mount, but it doesn't solve the
      problem of us writing undercounted frextents which are then recovered by
      an older kernel that doesn't have that fix.
      
      Create an incore percpu counter to mirror the ondisk frextents.  This
      new counter will track transaction reservations and the only time we
      will touch the incore super counter (i.e. the one that gets logged) is
      when those transactions commit updates to the rt bitmap.  This is in
      contrast to the lazysbcount counters (e.g. fdblocks), where we know that
      log recovery will always fix any incorrect counter that we log.
      As a bonus, we only take m_sb_lock at transaction commit time.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
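
      A user-space sketch of the counter split described above:
      transaction reservations come out of an in-core counter, and the
      superblock copy that gets logged only moves when a committing
      transaction has actually changed the rtbitmap. The kernel uses a
      real percpu counter; everything below is an illustrative model.

      #include <stdint.h>
      #include <stdio.h>

      struct mount {
          int64_t incore_frextents; /* reservations come out of this */
          int64_t sb_frextents;     /* the copy that gets logged */
      };

      /* Transaction reservation: only the in-core counter moves. */
      static int reserve_rtextents(struct mount *mp, int64_t len)
      {
          if (mp->incore_frextents < len)
              return -1;            /* ENOSPC */
          mp->incore_frextents -= len;
          return 0;
      }

      /*
       * Transaction commit: extents actually allocated from the rtbitmap
       * are folded into the logged copy; unused reservation goes back to
       * the in-core counter.
       */
      static void commit_rtextents(struct mount *mp, int64_t reserved,
                                   int64_t used)
      {
          mp->sb_frextents -= used;
          mp->incore_frextents += reserved - used;
      }

      int main(void)
      {
          struct mount mp = { .incore_frextents = 100, .sb_frextents = 100 };

          if (reserve_rtextents(&mp, 10))   /* running tx reserves 10 */
              return 1;
          commit_rtextents(&mp, 10, 4);     /* only 4 extents really used */
          printf("incore %lld, logged %lld\n",
                 (long long)mp.incore_frextents, (long long)mp.sb_frextents);
          return 0;                         /* both counters read 96 */
      }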
  4. 30 March 2022, 1 commit
    • xfs: aborting inodes on shutdown may need buffer lock · d2d7c047
      Committed by Dave Chinner
      Most buffer io list operations are run with the bp->b_lock held, but
      xfs_iflush_abort() can be called without the buffer lock being held
      resulting in inodes being removed from the buffer list while other
      list operations are occurring. This causes problems with corrupted
      bp->b_io_list inode lists during filesystem shutdown, leading to
      traversals that never end, double removals from the AIL, etc.
      
      Fix this by passing the buffer to xfs_iflush_abort() if we have
      it locked. If the inode is attached to the buffer, we're going to
      have to remove it from the buffer list and we'd have to get the
      buffer off the inode log item to do that anyway.
      
      If we don't have a buffer passed in (e.g. from xfs_reclaim_inode())
      then we can determine if the inode has a log item and if it is
      attached to a buffer before we do anything else. If it does have an
      attached buffer, we can lock it safely (because the inode has a
      reference to it) and then perform the inode abort.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
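
      A sketch of the calling convention, with a pthread mutex standing
      in for bp->b_lock: a caller that already holds the buffer locked
      passes it in, while a caller that doesn't (the reclaim case) looks
      the buffer up from the log item and locks it before touching the
      buffer's inode list. All structure and function names here are
      illustrative.

      #include <pthread.h>
      #include <stddef.h>
      #include <stdio.h>

      struct cluster_buf {
          pthread_mutex_t lock;        /* stands in for bp->b_lock */
          int             inodes_attached;
      };

      struct inode_item {
          struct cluster_buf *buf;     /* NULL if not attached to a buffer */
      };

      /* The buffer lock is held here: safe to edit the buffer's list. */
      static void iflush_abort_locked(struct inode_item *lip)
      {
          if (lip->buf)
              lip->buf->inodes_attached--;
          lip->buf = NULL;
      }

      /* bp != NULL means the caller already holds the buffer locked. */
      static void iflush_abort(struct inode_item *lip, struct cluster_buf *bp)
      {
          if (bp) {
              iflush_abort_locked(lip);
              return;
          }
          /* Reclaim-style caller: find and lock the attached buffer first. */
          bp = lip->buf;
          if (!bp)
              return;
          pthread_mutex_lock(&bp->lock);
          iflush_abort_locked(lip);
          pthread_mutex_unlock(&bp->lock);
      }

      int main(void)
      {
          struct cluster_buf bp = { PTHREAD_MUTEX_INITIALIZER, 1 };
          struct inode_item lip = { .buf = &bp };

          iflush_abort(&lip, NULL);        /* takes the buffer lock itself */
          printf("inodes still attached: %d\n", bp.inodes_attached);
          return 0;
      }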
  5. 23 March 2022, 1 commit
  6. 20 March 2022, 1 commit
    • xfs: xfs_is_shutdown vs xlog_is_shutdown cage fight · 01728b44
      Committed by Dave Chinner
      I've been chasing a recent resurgence in generic/388 recovery
      failure and/or corruption events. The events have largely been
      uninitialised inode chunks being tripped over in log recovery
      such as:
      
       XFS (pmem1): User initiated shutdown received.
       pmem1: writeback error on inode 12621949, offset 1019904, sector 12968096
       XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/xfs/xfs_fsops.c:500).  Shutting down filesystem.
       XFS (pmem1): Please unmount the filesystem and rectify the problem(s)
       XFS (pmem1): Unmounting Filesystem
       XFS (pmem1): Mounting V5 Filesystem
       XFS (pmem1): Starting recovery (logdev: internal)
       XFS (pmem1): bad inode magic/vsn daddr 8723584 #0 (magic=1818)
       XFS (pmem1): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x851c80 xfs_inode_buf_verify
       XFS (pmem1): Unmount and run xfs_repair
       XFS (pmem1): First 128 bytes of corrupted metadata buffer:
       00000000: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000010: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000020: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000030: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000040: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000050: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000060: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000070: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       XFS (pmem1): metadata I/O error in "xlog_recover_items_pass2+0x52/0xc0" at daddr 0x851c80 len 32 error 117
       XFS (pmem1): log mount/recovery failed: error -117
       XFS (pmem1): log mount failed
      
      There have been isolated random other issues, too - xfs_repair fails
      because it finds some corruption in symlink blocks, rmap
      inconsistencies, etc - but they are nowhere near as common as the
      uninitialised inode chunk failure.
      
      The problem has clearly happened at runtime before recovery has run;
      I can see the ICREATE log item in the log shortly before the
      actively recovered range of the log. This means the ICREATE was
      definitely created and written to the log, but for some reason the
      tail of the log has been moved past the ordered buffer log item that
      tracks INODE_ALLOC buffers and, supposedly, prevents the tail of the
      log moving past the ICREATE log item before the inode chunk buffer
      is written to disk.
      
      Tracing the fsstress processes that are running when the filesystem
      shut down immediately pinpointed the problem:
      
      user shutdown marks xfs_mount as shutdown
      
               godown-213341 [008]  6398.022871: console:              [ 6397.915392] XFS (pmem1): User initiated shutdown received.
      .....
      
      aild tries to push ordered inode cluster buffer
      
        xfsaild/pmem1-213314 [001]  6398.022974: xfs_buf_trylock:      dev 259:1 daddr 0x851c80 bbcount 0x20 hold 16 pincount 0 lock 0 flags DONE|INODES|PAGES caller xfs_inode_item_push+0x8e
        xfsaild/pmem1-213314 [001]  6398.022976: xfs_ilock_nowait:     dev 259:1 ino 0x851c80 flags ILOCK_SHARED caller xfs_iflush_cluster+0xae
      
      xfs_iflush_cluster() checks xfs_is_shutdown(), returns true,
      calls xfs_iflush_abort() to kill writeback of the inode.
      Inode is removed from AIL, drops cluster buffer reference.
      
        xfsaild/pmem1-213314 [001]  6398.022977: xfs_ail_delete:       dev 259:1 lip 0xffff88880247ed80 old lsn 7/20344 new lsn 7/21000 type XFS_LI_INODE flags IN_AIL
        xfsaild/pmem1-213314 [001]  6398.022978: xfs_buf_rele:         dev 259:1 daddr 0x851c80 bbcount 0x20 hold 17 pincount 0 lock 0 flags DONE|INODES|PAGES caller xfs_iflush_abort+0xd7
      
      .....
      
      All inodes on cluster buffer are aborted, then the cluster buffer
      itself is aborted and removed from the AIL *without writeback*:
      
      xfsaild/pmem1-213314 [001]  6398.023011: xfs_buf_error_relse:  dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_ioend_fail+0x33
         xfsaild/pmem1-213314 [001]  6398.023012: xfs_ail_delete:       dev 259:1 lip 0xffff8888053efde8 old lsn 7/20344 new lsn 7/20344 type XFS_LI_BUF flags IN_AIL
      
      The inode buffer was at 7/20344 when it was removed from the AIL.
      
         xfsaild/pmem1-213314 [001]  6398.023012: xfs_buf_item_relse:   dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_item_done+0x31
         xfsaild/pmem1-213314 [001]  6398.023012: xfs_buf_rele:         dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_item_relse+0x39
      
      .....
      
      Userspace is still running, doing stuff. An fsstress process runs
      syncfs() or sync() and we end up in sync_fs_one_sb(), which issues
      a log force. This pushes on the CIL:
      
              fsstress-213322 [001]  6398.024430: xfs_fs_sync_fs:       dev 259:1 m_features 0x20000000019ff6e9 opstate (clean|shutdown|inodegc|blockgc) s_flags 0x70810000 caller sync_fs_one_sb+0x26
              fsstress-213322 [001]  6398.024430: xfs_log_force:        dev 259:1 lsn 0x0 caller xfs_fs_sync_fs+0x82
              fsstress-213322 [001]  6398.024430: xfs_log_force:        dev 259:1 lsn 0x5f caller xfs_log_force+0x7c
                 <...>-194402 [001]  6398.024467: kmem_alloc:           size 176 flags 0x14 caller xlog_cil_push_work+0x9f
      
      And the CIL fills up iclogs with pending changes. This picks up
      the current tail from the AIL:
      
                 <...>-194402 [001]  6398.024497: xlog_iclog_get_space: dev 259:1 state XLOG_STATE_ACTIVE refcnt 1 offset 0 lsn 0x0 flags  caller xlog_write+0x149
                 <...>-194402 [001]  6398.024498: xlog_iclog_switch:    dev 259:1 state XLOG_STATE_ACTIVE refcnt 1 offset 0 lsn 0x700005408 flags  caller xlog_state_get_iclog_space+0x37e
                 <...>-194402 [001]  6398.024521: xlog_iclog_release:   dev 259:1 state XLOG_STATE_WANT_SYNC refcnt 1 offset 32256 lsn 0x700005408 flags  caller xlog_write+0x5f9
                 <...>-194402 [001]  6398.024522: xfs_log_assign_tail_lsn: dev 259:1 new tail lsn 7/21000, old lsn 7/20344, last sync 7/21448
      
      And it moves the tail of the log to 7/21000 from 7/20344. This
      *moves the tail of the log beyond the ICREATE transaction* that was
      at 7/20344 and pinned by the inode cluster buffer that was cancelled
      above.
      
      ....
      
               godown-213341 [008]  6398.027005: xfs_force_shutdown:   dev 259:1 tag logerror flags log_io|force_umount file fs/xfs/xfs_fsops.c line_num 500
                godown-213341 [008]  6398.027022: console:              [ 6397.915406] pmem1: writeback error on inode 12621949, offset 1019904, sector 12968096
                godown-213341 [008]  6398.030551: console:              [ 6397.919546] XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/
      
      And finally the log itself is now shutdown, stopping all further
      writes to the log. But this is too late to prevent the corruption
      that moving the tail of the log forwards after we start cancelling
      writeback causes.
      
      The fundamental problem here is that we are using the wrong shutdown
      checks for log items. We've long conflated mount shutdown with log
      shutdown state, and I started separating that recently with the
      atomic shutdown state changes in commit b36d4651 ("xfs: make
      forced shutdown processing atomic"). The changes in that commit
      series are directly responsible for being able to diagnose this
      issue because it clearly separated mount shutdown from log shutdown.
      
      Essentially, once we start cancelling writeback of log items and
      removing them from the AIL because the filesystem is shut down, we
      *cannot* update the journal because we may have cancelled the items
      that pin the tail of the log. That moves the tail of the log
      forwards without having written the metadata back, hence we have
      corrupt in memory state and writing to the journal propagates that
      to the on-disk state.
      
      What commit b36d4651 makes clear is that log item state needs to
      change relative to log shutdown, not mount shutdown. IOWs, anything
      that aborts metadata writeback needs to check log shutdown state
      because log items directly affect log consistency. Having them check
      mount shutdown state introduces the above race condition where we
      cancel metadata writeback before the log shuts down.
      
      To fix this, the patch works through all log items and converts
      shutdown checks to use xlog_is_shutdown() rather than
      xfs_is_shutdown(), so that we don't start aborting metadata
      writeback before we shut off journal writes.
      
      AFAICT, this race condition is a zero day IO error handling bug in
      XFS that dates back to the introduction of XLOG_IO_ERROR,
      XLOG_STATE_IOERROR and XFS_FORCED_SHUTDOWN back in January 1997.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
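
      A toy model of the check being changed: the mount is marked shut
      down before the log is, so code that cancels metadata writeback
      must key off the log state rather than the mount state. The helper
      names mirror the commit title; the structures and the push logic
      are illustrative only.

      #include <stdbool.h>
      #include <stdio.h>

      struct xlog  { bool shutdown; };
      struct mount { bool shutdown; struct xlog *log; };

      static bool xfs_is_shutdown(struct mount *mp)  { return mp->shutdown; }
      static bool xlog_is_shutdown(struct xlog *log) { return log->shutdown; }

      /* Decide what to do with a dirty log item during an AIL push. */
      static const char *push_item(struct mount *mp)
      {
          if (xlog_is_shutdown(mp->log))
              return "abort writeback: the journal can no longer move the tail";
          if (xfs_is_shutdown(mp))
              return "do not cancel yet: the log is live and this item may pin its tail";
          return "write the item back";
      }

      int main(void)
      {
          struct xlog log = { .shutdown = false };
          struct mount mp = { .shutdown = true, .log = &log }; /* user shutdown first */

          puts(push_item(&mp));   /* mount is down but the log is not */
          log.shutdown = true;    /* now the journal has shut down too */
          puts(push_item(&mp));
          return 0;
      }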
  7. 20 January 2022, 1 commit
    • xfs: flush inodegc workqueue tasks before cancel · 6191cf3a
      Committed by Brian Foster
      The xfs_inodegc_stop() helper performs a high level flush of pending
      work on the percpu queues and then runs a cancel_work_sync() on each
      of the percpu work tasks to ensure all work has completed before
      returning.  While cancel_work_sync() waits for wq tasks to complete,
      it does not guarantee work tasks have started. This means that the
      _stop() helper can queue and instantly cancel a wq task without
      having completed the associated work. This can be observed by
      tracepoint inspection of a simple "rm -f <file>; fsfreeze -f <mnt>"
      test:
      
      	xfs_destroy_inode: ... ino 0x83 ...
      	xfs_inode_set_need_inactive: ... ino 0x83 ...
      	xfs_inodegc_stop: ...
      	...
      	xfs_inodegc_start: ...
      	xfs_inodegc_worker: ...
      	xfs_inode_inactivating: ... ino 0x83 ...
      
      The first few lines show that the inode is removed and need inactive
      state set, but the inactivation work has not completed before the
      inodegc mechanism stops. The inactivation doesn't actually occur
      until the fs is unfrozen and the gc mechanism starts back up. Note
      that this test requires fsfreeze to reproduce because xfs_freeze
      indirectly invokes xfs_fs_statfs(), which calls xfs_inodegc_flush().
      
      When this occurs, the workqueue try_to_grab_pending() logic first
      tries to steal the pending bit, which does not succeed because the
      bit has been set by queue_work_on(). Subsequently, it checks for
      association of a pool workqueue from the work item under the pool
      lock. This association is set at the point a work item is queued and
      cleared when dequeued for processing. If the association exists, the
      work item is removed from the queue and cancel_work_sync() returns
      true. If the pwq association is cleared, the remove attempt assumes
      the task is busy and retries (eventually returning false to the
      caller after waiting for the work task to complete).
      
      To avoid this race, we can flush each work item explicitly before
      cancel. However, since the _queue_all() already schedules each
      underlying work item, the workqueue-level helpers are sufficient to
      achieve the same ordering effect. E.g., the inodegc enabled flag
      prevents scheduling any further work in the _stop() case. Use the
      drain_workqueue() helper in this particular case to make the intent
      a bit more self-explanatory.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
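
      The difference that matters here can be modelled in a few lines of
      user-space C: a cancel may drop a work item that was queued but
      never started, while a drain runs everything already queued before
      returning. The kernel fix uses drain_workqueue(); the code below
      is only an illustrative stand-in.

      #include <stdbool.h>
      #include <stdio.h>

      struct work_item {
          bool queued;   /* on the workqueue, not yet started */
          bool ran;      /* the work function actually executed */
      };

      static void run_work(struct work_item *w)
      {
          w->queued = false;
          w->ran = true;          /* the deferred inactivation happens here */
      }

      /* cancel_work_sync()-like: a queued-but-unstarted item is dropped. */
      static void cancel_work(struct work_item *w)
      {
          w->queued = false;
      }

      /* drain_workqueue()-like: queued items execute before returning. */
      static void drain_queue(struct work_item *w)
      {
          if (w->queued)
              run_work(w);
      }

      int main(void)
      {
          struct work_item a = { .queued = true }, b = { .queued = true };

          cancel_work(&a);        /* the race: the work vanishes unrun */
          drain_queue(&b);        /* the fix: queued work completes first */
          printf("cancelled item ran: %d, drained item ran: %d\n",
                 a.ran, b.ran);
          return 0;
      }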
  8. 22 December 2021, 1 commit
  9. 18 December 2021, 1 commit
  10. 25 November 2021, 1 commit
  11. 23 October 2021, 1 commit
  12. 25 August 2021, 1 commit
  13. 20 August 2021, 4 commits
  14. 19 August 2021, 1 commit
  15. 10 August 2021, 7 commits
    • xfs: throttle inode inactivation queuing on memory reclaim · 40b1de00
      Committed by Darrick J. Wong
      Now that we defer inode inactivation, we've decoupled the process of
      unlinking or closing an inode from the process of inactivating it.  In
      theory this should lead to better throughput since we now inactivate the
      queued inodes in batches instead of one at a time.
      
      Unfortunately, one of the primary risks with this decoupling is the loss
      of rate control feedback between the frontend and background threads.
      In other words, a rm -rf /* thread can run the system out of memory if
      it can queue inodes for inactivation and jump to a new CPU faster than
      the background threads can actually clear the deferred work.  The
      workers can get scheduled off the CPU if they have to do IO, etc.
      
      To solve this problem, we configure a shrinker so that it will activate
      the /second/ time the shrinkers are called.  The custom shrinker will
      queue all percpu deferred inactivation workers immediately and set a
      flag to force frontend callers who are releasing a vfs inode to wait for
      the inactivation workers.
      
      On my test VM with 560M of RAM and a 2TB filesystem, this seems to solve
      most of the OOMing problem when deleting 10 million inodes.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
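
      A user-space model of the throttle: the first shrinker invocation
      is a no-op, the second one kicks every per-cpu worker and arms a
      flag that makes frontend inode releases wait for the workers. All
      names below are illustrative; this is not the kernel's shrinker API.

      #include <stdbool.h>
      #include <stdio.h>

      struct gc_state {
          int  shrinker_calls;  /* how often memory reclaim has poked us */
          bool want_throttle;   /* frontend must wait for the workers */
          int  queued;          /* inodes awaiting inactivation */
      };

      /* Called from memory reclaim (the shrinker scan path in the kernel). */
      static void gc_shrinker_scan(struct gc_state *gc)
      {
          if (++gc->shrinker_calls < 2)
              return;                     /* activate on the second call */
          gc->want_throttle = true;
          printf("kick all per-cpu workers (%d items queued)\n", gc->queued);
      }

      /* Called when a frontend thread releases the last inode reference. */
      static void gc_queue_inode(struct gc_state *gc)
      {
          gc->queued++;
          if (gc->want_throttle)
              printf("frontend waits for the workers to make progress\n");
      }

      int main(void)
      {
          struct gc_state gc = { 0 };

          gc_queue_inode(&gc);    /* normal, fully deferred queueing */
          gc_shrinker_scan(&gc);  /* first reclaim pass: no action */
          gc_shrinker_scan(&gc);  /* second pass: memory is really tight */
          gc_queue_inode(&gc);    /* the rm -rf thread is now rate limited */
          return 0;
      }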
    • xfs: use background worker pool when transactions can't get free space · e8d04c2a
      Committed by Darrick J. Wong
      In xfs_trans_alloc, if the block reservation call returns ENOSPC, we
      call xfs_blockgc_free_space with a NULL icwalk structure to try to free
      space.  Each frontend thread that encounters this situation starts its
      own walk of the inode cache to see if it can find anything, which is
      wasteful since we don't have any additional selection criteria.  For
      this one common case, create a function that reschedules all pending
      background work immediately and flushes the workqueue so that the scan
      can run in parallel.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: don't run speculative preallocation gc when fs is frozen · 6f649091
      Committed by Darrick J. Wong
      Now that we have the infrastructure to switch background workers on and
      off at will, fix the block gc worker code so that we don't actually run
      the worker when the filesystem is frozen, same as we do for deferred
      inactivation.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: inactivate inodes any time we try to free speculative preallocations · 2eb66502
      Committed by Darrick J. Wong
      Other parts of XFS have learned to call xfs_blockgc_free_{space,quota}
      to try to free speculative preallocations when space is tight.  This
      means that file writes, transaction reservation failures, quota limit
      enforcement, and the EOFBLOCKS ioctl all call this function to free
      space when things are tight.
      
      Since inode inactivation is now a background task, this means that the
      filesystem can be hanging on to unlinked but not yet freed space.  Add
      this to the list of things that xfs_blockgc_free_* makes writer threads
      scan for when they cannot reserve space.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: queue inactivation immediately when free realtime extents are tight · 65f03d86
      Committed by Darrick J. Wong
      Now that we have made the inactivation of unlinked inodes a background
      task to increase the throughput of file deletions, we need to be a
      little more careful about how long of a delay we can tolerate.
      
      Similar to the patch doing this for free space on the data device, if
      the file being inactivated is a realtime file and the realtime volume is
      running low on free extents, we want to run the worker ASAP so that the
      realtime allocator can make better decisions.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: queue inactivation immediately when quota is nearing enforcement · 108523b8
      Committed by Darrick J. Wong
      Now that we have made the inactivation of unlinked inodes a background
      task to increase the throughput of file deletions, we need to be a
      little more careful about how long of a delay we can tolerate.
      
      Specifically, if the dquots attached to the inode being inactivated are
      nearing any kind of enforcement boundary, we want to queue that
      inactivation work immediately so that users don't get EDQUOT/ENOSPC
      errors even after they deleted a bunch of files to stay within quota.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: queue inactivation immediately when free space is tight · 7d6f07d2
      Committed by Darrick J. Wong
      Now that we have made the inactivation of unlinked inodes a background
      task to increase the throughput of file deletions, we need to be a
      little more careful about how long of a delay we can tolerate.
      
      On a mostly empty filesystem, the risk of the allocator making poor
      decisions due to fragmentation of the free space on account of a lengthy
      delay in background updates is minimal because there's plenty of space.
      However, if free space is tight, we want to deallocate unlinked inodes
      as quickly as possible to avoid fallocate ENOSPC and to give the
      allocator the best shot at optimal allocations for new writes.
      
      Therefore, queue the percpu worker immediately if the filesystem is more
      than 95% full.  This follows the same principle that XFS becomes less
      aggressive about speculative allocations and lazy cleanup (and more
      precise about accounting) when nearing full.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
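
      The policy itself is a one-line check. A sketch with illustrative
      structure and helper names (the 95% figure is from the commit text):

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      struct geometry {
          uint64_t dblocks;   /* size of the data volume in blocks */
          uint64_t fdblocks;  /* free data blocks */
      };

      /* True if more than 95% of the data volume is allocated. */
      static bool want_immediate_inactivation(const struct geometry *g)
      {
          return g->fdblocks < g->dblocks / 20;
      }

      int main(void)
      {
          struct geometry g = { .dblocks = 1000000, .fdblocks = 40000 };

          printf("inactivation delay: %s\n",
                 want_immediate_inactivation(&g) ? "0 (run the worker now)"
                                                 : "default batching");
          return 0;
      }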
  16. 07 August 2021, 4 commits
    • xfs: per-cpu deferred inode inactivation queues · ab23a776
      Committed by Dave Chinner
      Move inode inactivation to background work contexts so that it no
      longer runs in the context that releases the final reference to an
      inode. This will allow process work that ends up blocking on
      inactivation to continue doing work while the filesystem processes
      the inactivation in the background.
      
      A typical demonstration of this is unlinking an inode with lots of
      extents. The extents are removed during inactivation, so this blocks
      the process that unlinked the inode from the directory structure. By
      moving the inactivation to the background process, the userspace
      application can keep working (e.g. unlinking the next inode in the
      directory) while the inactivation work on the previous inode is
      done by a different CPU.
      
      The implementation of the queue is relatively simple. We use a
      per-cpu lockless linked list (llist) to queue inodes for
      inactivation without requiring serialisation mechanisms, and a work
      item to allow the queue to be processed by a CPU bound worker
      thread. We also keep a count of the queue depth so that we can
      trigger work after a number of deferred inactivations have been
      queued.
      
      The use of a bound workqueue with a single work depth allows the
      workqueue to run one work item per CPU. We queue the work item on
      the CPU we are currently running on, and so this essentially gives
      us affine per-cpu worker threads for the per-cpu queues. This
      maintains the effective CPU affinity that occurs within XFS at the
      AG level due to all objects in a directory being local to an AG.
      Hence inactivation work tends to run on the same CPU that last
      accessed all the objects that inactivation accesses and this
      maintains hot CPU caches for unlink workloads.
      
      A depth of 32 inodes was chosen to match the number of inodes in an
      inode cluster buffer. This hopefully allows sequential
      allocation/unlink behaviours to defer inactivation of all the
      inodes in a single cluster buffer at a time, further helping
      maintain hot CPU and buffer cache accesses while running
      inactivations.
      
      A hard per-cpu queue throttle of 256 inodes has been set to avoid
      runaway queuing when inodes that take a long time to inactivate are
      being processed - for example, when unlinking inodes with large
      numbers of extents that can take a lot of processing to free.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      [djwong: tweak comments and tracepoints, convert opflags to state bits]
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
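
      A user-space model of the queueing policy described above, with
      sched_getcpu() standing in for per-cpu data and the 32/256
      thresholds taken from the commit text; all other names are
      illustrative.

      #define _GNU_SOURCE
      #include <sched.h>
      #include <stdio.h>

      #define NR_CPUS     64
      #define GC_BATCH    32    /* one inode cluster buffer's worth */
      #define GC_THROTTLE 256   /* hard per-cpu limit from the commit text */

      struct gc_pcpu {
          int items;            /* queue depth for this CPU */
      };

      static struct gc_pcpu queues[NR_CPUS];

      /* Called when the last reference to an unlinked inode is dropped. */
      static void gc_queue_inode(void)
      {
          int cpu = sched_getcpu();
          struct gc_pcpu *gc = &queues[(cpu < 0 ? 0 : cpu) % NR_CPUS];

          gc->items++;
          if (gc->items == GC_BATCH)
              printf("depth %d: schedule this CPU's worker\n", gc->items);
          if (gc->items == GC_THROTTLE)
              printf("depth %d: throttle the unlinking task\n", gc->items);
      }

      int main(void)
      {
          for (int i = 0; i < 300; i++)
              gc_queue_inode();
          return 0;
      }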
    • xfs: detach dquots from inode if we don't need to inactivate it · 62af7d54
      Committed by Darrick J. Wong
      If we don't need to inactivate an inode, we can detach the dquots and
      move on to reclamation.  This isn't strictly required here; it's a
      preparation patch for deferred inactivation per reviewer request[1] to
      move the creation of xfs_inode_needs_inactivation into a separate
      change.  Eventually this !need_inactive chunk will turn into the code
      path for inodes that skip xfs_inactive and go straight to memory
      reclaim.
      
      [1] https://lore.kernel.org/linux-xfs/20210609012838.GW2945738@locust/T/#mca6d958521cb88bbc1bfe1a30767203328d410b5
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: move xfs_inactive call to xfs_inode_mark_reclaimable · c6c2066d
      Committed by Darrick J. Wong
      Move the xfs_inactive call and all the other debugging checks and stats
      updates into xfs_inode_mark_reclaimable because most of that is
      implementation detail about the inode cache.  This is preparation for
      the deferred inactivation that is coming up.  We also move it around
      within xfs_icache.c for the same reason.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: remove xfs_dqrele_all_inodes · 777eb1fa
      Committed by Christoph Hellwig
      xfs_dqrele_all_inodes is unused now, remove it and all supporting code.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  17. 22 June 2021, 3 commits
  18. 09 June 2021, 4 commits
    • xfs: rename struct xfs_eofblocks to xfs_icwalk · b26b2bf1
      Committed by Darrick J. Wong
      The xfs_eofblocks structure is no longer well-named -- nowadays it
      provides optional filtering criteria to any walk of the incore inode
      cache.  Only one of the cache walk goals has anything to do with
      clearing of speculative post-EOF preallocations, so change the name to
      be more appropriate.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: selectively keep sick inodes in memory · 9492750a
      Committed by Darrick J. Wong
      It's important that the filesystem retain its memory of sick inodes for
      a little while after problems are found so that reports can be collected
      about what was wrong.  Don't let inode reclamation free sick inodes
      unless we're unmounting or the fs already went down.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
    • xfs: change the prefix of XFS_EOF_FLAGS_* to XFS_ICWALK_FLAG_ · 2d53f66b
      Committed by Darrick J. Wong
      In preparation for renaming struct xfs_eofblocks to struct xfs_icwalk,
      change the prefix of the existing XFS_EOF_FLAGS_* flags to
      XFS_ICWALK_FLAG_ and convert all the existing users.  This adds a degree
      of interface separation between the ioctl definitions and the incore
      parameters.  Since FLAGS_UNION is only used in xfs_icache.c, move it
      there as a private flag.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
    • xfs: only reset incore inode health state flags when reclaiming an inode · 255794c7
      Committed by Darrick J. Wong
      While running some fuzz tests on inode metadata, I noticed that the
      filesystem health report (as provided by xfs_spaceman) failed to report
      the file corruption even when spaceman was run immediately after running
      xfs_scrub to detect the corruption.  That isn't the intended behavior;
      one ought to be able to run scrub to detect errors in the ondisk
      metadata and be able to access those reports for some time after the
      scrub.
      
      After running the same sequence through an instrumented kernel, I
      discovered the reason why -- scrub igets the file, scans it, marks it
      sick, and ireleases the inode.  When the VFS lets go of the incore
      inode, it moves to RECLAIMABLE state.  If spaceman igets the incore
      inode before it moves to RECLAIM state, iget reinitializes the VFS
      state, clears the sick and checked masks, and hands back the inode.  At
      this point, the caller has the exact same incore inode, but with all the
      health state erased.
      
      In other words, we're erasing the incore inode's health state flags when
      we've decided NOT to sever the link between the incore inode and the
      ondisk inode.  This is wrong, so we need to remove the lines that zero
      the fields from xfs_iget_cache_hit.
      
      As a precaution, we add the same lines into xfs_reclaim_inode just after
      we sever the link between incore and ondisk inode.  Strictly speaking
      this isn't necessary because once an inode has gone through reclaim it
      must go through xfs_inode_alloc (which also zeroes the state) and
      xfs_iget is careful to check for mismatches between the inode it pulls
      out of the radix tree and the one it wants.
      
      Fixes: 6772c1f1 ("xfs: track metadata health status")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
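
      A sketch of the rule this commit establishes: the sick/checked
      state sticks to the in-core inode across iget/irele cycles and is
      only cleared when reclaim severs the in-core object from the
      on-disk inode. Field and function names are illustrative.

      #include <stdint.h>
      #include <stdio.h>

      struct icache_inode {
          uint32_t sick;        /* corruption observed by scrub */
          uint32_t checked;     /* metadata types scrub has examined */
          int      reclaimable; /* the VFS has let go of the inode */
      };

      /* Cache hit on a reclaimable inode: revive it, keep the health state. */
      static void iget_cache_hit(struct icache_inode *ip)
      {
          ip->reclaimable = 0;
          /* ip->sick and ip->checked are deliberately left untouched. */
      }

      /* Reclaim severs the incore/ondisk link, so the memory of sickness goes. */
      static void reclaim_inode(struct icache_inode *ip)
      {
          ip->sick = 0;
          ip->checked = 0;
      }

      int main(void)
      {
          struct icache_inode ino = { .sick = 1, .checked = 1, .reclaimable = 1 };

          iget_cache_hit(&ino);   /* spaceman re-grabs the inode after scrub */
          printf("sick after cache hit: %u\n", ino.sick);   /* still 1 */
          reclaim_inode(&ino);
          printf("sick after reclaim:   %u\n", ino.sick);   /* now 0 */
          return 0;
      }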
  19. 04 June 2021, 4 commits