1. 29 7月, 2020 3 次提交
    • D
      xfs: clear XFS_DQ_FREEING if we can't lock the dquot buffer to flush · c97738a9
      Darrick J. Wong 提交于
      In commit 8d3d7e2b, we changed xfs_qm_dqpurge to bail out if we
      can't lock the dquot buf to flush the dquot.  This prevents the AIL from
      blocking on the dquot, but it also forgets to clear the FREEING flag on
      its way out.  A subsequent purge attempt will see the FREEING flag is
      set and bail out, which leads to dqpurge_all failing to purge all the
      dquots.
      
      (copy-pasting from Dave Chinner's identical patch)
      
      This was found by inspection after having xfs/305 hang 1 in ~50
      iterations in a quotaoff operation:
      
      [ 8872.301115] xfs_quota       D13888 92262  91813 0x00004002
      [ 8872.302538] Call Trace:
      [ 8872.303193]  __schedule+0x2d2/0x780
      [ 8872.304108]  ? do_raw_spin_unlock+0x57/0xd0
      [ 8872.305198]  schedule+0x6e/0xe0
      [ 8872.306021]  schedule_timeout+0x14d/0x300
      [ 8872.307060]  ? __next_timer_interrupt+0xe0/0xe0
      [ 8872.308231]  ? xfs_qm_dqusage_adjust+0x200/0x200
      [ 8872.309422]  schedule_timeout_uninterruptible+0x2a/0x30
      [ 8872.310759]  xfs_qm_dquot_walk.isra.0+0x15a/0x1b0
      [ 8872.311971]  xfs_qm_dqpurge_all+0x7f/0x90
      [ 8872.313022]  xfs_qm_scall_quotaoff+0x18d/0x2b0
      [ 8872.314163]  xfs_quota_disable+0x3a/0x60
      [ 8872.315179]  kernel_quotactl+0x7e2/0x8d0
      [ 8872.316196]  ? __do_sys_newstat+0x51/0x80
      [ 8872.317238]  __x64_sys_quotactl+0x1e/0x30
      [ 8872.318266]  do_syscall_64+0x46/0x90
      [ 8872.319193]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [ 8872.320490] RIP: 0033:0x7f46b5490f2a
      [ 8872.321414] Code: Bad RIP value.
      
      Returning -EAGAIN from xfs_qm_dqpurge() without clearing the
      XFS_DQ_FREEING flag means the xfs_qm_dqpurge_all() code can never
      free the dquot, and we loop forever waiting for the XFS_DQ_FREEING
      flag to go away on the dquot that leaked it via -EAGAIN.
      
      Fixes: 8d3d7e2b ("xfs: trylock underlying buffer on dquot flush")
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NAllison Collins <allison.henderson@oracle.com>
      Reviewed-by: NChandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      c97738a9
    • B
      xfs: fix inode allocation block res calculation precedence · b2a88647
      Brian Foster 提交于
      The block reservation calculation for inode allocation is supposed
      to consist of the blocks required for the inode chunk plus
      (maxlevels-1) of the inode btree multiplied by the number of inode
      btrees in the fs (2 when finobt is enabled, 1 otherwise).
      
      Instead, the macro returns (ialloc_blocks + 2) due to a precedence
      error in the calculation logic. This leads to block reservation
      overruns via generic/531 on small block filesystems with finobt
      enabled. Add braces to fix the calculation and reserve the
      appropriate number of blocks.
      
      Fixes: 9d43b180 ("xfs: update inode allocation/free transaction reservations for finobt")
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      b2a88647
    • B
      xfs: drain the buf delwri queue before xfsaild idles · f376b45e
      Brian Foster 提交于
      xfsaild is racy with respect to transaction abort and shutdown in
      that the task can idle or exit with an empty AIL but buffers still
      on the delwri queue. This was partly addressed by cancelling the
      delwri queue before the task exits to prevent memory leaks, but it's
      also possible for xfsaild to empty and idle with buffers on the
      delwri queue. For example, a transaction that pins a buffer that
      also happens to sit on the AIL delwri queue will explicitly remove
      the associated log item from the AIL if the transaction aborts. The
      side effect of this is an unmount hang in xfs_wait_buftarg() as the
      associated buffers remain held by the delwri queue indefinitely.
      This is reproduced on repeated runs of generic/531 with an fs format
      (-mrmapbt=1 -bsize=1k) that happens to also reproduce transaction
      aborts.
      
      Update xfsaild to not idle until both the AIL and associated delwri
      queue are empty and update the push code to continue delwri queue
      submission attempts even when the AIL is empty. This allows the AIL
      to eventually release aborted buffers stranded on the delwri queue
      when they are unlocked by the associated transaction. This should
      have no significant effect on normal runtime behavior because the
      xfsaild currently idles only when the AIL is empty and in practice
      the AIL is rarely empty with a populated delwri queue. The items
      must be AIL resident to land in the queue in the first place and
      generally aren't removed until writeback completes.
      
      Note that the pre-existing delwri queue cancel logic in the exit
      path is retained because task stop is external, could technically
      come at any point, and xfsaild is still responsible to release its
      buffer references before it exits.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      f376b45e
  2. 18 7月, 2020 1 次提交
    • E
      xfs: preserve inode versioning across remounts · 4750a171
      Eric Sandeen 提交于
      The MS_I_VERSION mount flag is exposed via the VFS, as documented
      in the mount manpages etc; see the iversion and noiversion mount
      options in mount(8).
      
      As a result, mount -o remount looks for this option in /proc/mounts
      and will only send the I_VERSION flag back in during remount it it
      is present.  Since it's not there, a remount will /remove/ the
      I_VERSION flag at the vfs level, and iversion functionality is lost.
      
      xfs v5 superblocks intend to always have i_version enabled; it is
      set as a default at mount time, but is lost during remount for the
      reasons above.
      
      The generic fix would be to expose this documented option in
      /proc/mounts, but since that was rejected, fix it up again in the
      xfs remount path instead, so that at least xfs won't suffer from
      this misbehavior.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      4750a171
  3. 14 7月, 2020 3 次提交
  4. 10 7月, 2020 1 次提交
    • W
      xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim · c3f2375b
      Waiman Long 提交于
      Depending on the workloads, the following circular locking dependency
      warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo
      lock) may show up:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.0.0-rc1+ #60 Tainted: G        W
      ------------------------------------------------------
      fsfreeze/4346 is trying to acquire lock:
      0000000026f1d784 (fs_reclaim){+.+.}, at:
      fs_reclaim_acquire.part.19+0x5/0x30
      
      but task is already holding lock:
      0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
      
      which lock already depends on the new lock.
        :
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(sb_internal);
                                     lock(fs_reclaim);
                                     lock(sb_internal);
        lock(fs_reclaim);
      
       *** DEADLOCK ***
      
      4 locks held by fsfreeze/4346:
       #0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650
       #1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290
       #2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650
       #3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
      
      stack backtrace:
      Call Trace:
       dump_stack+0xe0/0x19a
       print_circular_bug.isra.10.cold.34+0x2f4/0x435
       check_prev_add.constprop.19+0xca1/0x15f0
       validate_chain.isra.14+0x11af/0x3b50
       __lock_acquire+0x728/0x1200
       lock_acquire+0x269/0x5a0
       fs_reclaim_acquire.part.19+0x29/0x30
       fs_reclaim_acquire+0x19/0x20
       kmem_cache_alloc+0x3e/0x3f0
       kmem_zone_alloc+0x79/0x150
       xfs_trans_alloc+0xfa/0x9d0
       xfs_sync_sb+0x86/0x170
       xfs_log_sbcount+0x10f/0x140
       xfs_quiesce_attr+0x134/0x270
       xfs_fs_freeze+0x4a/0x70
       freeze_super+0x1af/0x290
       do_vfs_ioctl+0xedc/0x16c0
       ksys_ioctl+0x41/0x80
       __x64_sys_ioctl+0x73/0xa9
       do_syscall_64+0x18f/0xd23
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      This is a false positive as all the dirty pages are flushed out before
      the filesystem can be frozen.
      
      One way to avoid this splat is to add GFP_NOFS to the affected allocation
      calls by using the memalloc_nofs_save()/memalloc_nofs_restore() pair.
      This shouldn't matter unless the system is really running out of memory.
      In that particular case, the filesystem freeze operation may fail while
      it was succeeding previously.
      
      Without this patch, the command sequence below will show that the lock
      dependency chain sb_internal -> fs_reclaim exists.
      
       # fsfreeze -f /home
       # fsfreeze --unfreeze /home
       # grep -i fs_reclaim -C 3 /proc/lockdep_chains | grep -C 5 sb_internal
      
      After applying the patch, such sb_internal -> fs_reclaim lock dependency
      chain can no longer be found. Because of that, the locking dependency
      warning will not be shown.
      Suggested-by: NDave Chinner <david@fromorbit.com>
      Signed-off-by: NWaiman Long <longman@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      c3f2375b
  5. 07 7月, 2020 32 次提交
    • D
      xfs: rtbitmap scrubber should check inode size · 2fb94e36
      Darrick J. Wong 提交于
      Make sure the rtbitmap is large enough to store the entire bitmap.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NAllison Collins <allison.henderson@oracle.com>
      2fb94e36
    • D
      xfs: rtbitmap scrubber should verify written extents · f866560b
      Darrick J. Wong 提交于
      Ensure that the realtime bitmap file is backed entirely by written
      extents.  No holes, no unwritten blocks, etc.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NAllison Collins <allison.henderson@oracle.com>
      f866560b
    • D
      xfs: remove xfs_inobp_check() · e2705b03
      Dave Chinner 提交于
      This debug code is called on every xfs_iflush() call, which then
      checks every inode in the buffer for non-zero unlinked list field.
      Hence it checks every inode in the cluster buffer every time a
      single inode on that cluster it flushed. This is resulting in:
      
      -   38.91%     5.33%  [kernel]  [k] xfs_iflush
         - 17.70% xfs_iflush
            - 9.93% xfs_inobp_check
                 4.36% xfs_buf_offset
      
      10% of the CPU time spent flushing inodes is repeatedly checking
      unlinked fields in the buffer. We don't need to do this.
      
      The other place we call xfs_inobp_check() is
      xfs_iunlink_update_dinode(), and this is after we've done this
      assert for the agino we are about to write into that inode:
      
      	ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));
      
      which means we've already checked that the agino we are about to
      write is not 0 on debug kernels. The inode buffer verifiers do
      everything else we need, so let's just remove this debug code.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      e2705b03
    • D
      xfs: factor xfs_iflush_done · a69a1dc2
      Dave Chinner 提交于
      xfs_iflush_done() does 3 distinct operations to the inodes attached
      to the buffer. Separate these operations out into functions so that
      it is easier to modify these operations independently in future.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      a69a1dc2
    • D
      xfs: rework xfs_iflush_cluster() dirty inode iteration · 5717ea4d
      Dave Chinner 提交于
      Now that we have all the dirty inodes attached to the cluster
      buffer, we don't actually have to do radix tree lookups to find
      them. Sure, the radix tree is efficient, but walking a linked list
      of just the dirty inodes attached to the buffer is much better.
      
      We are also no longer dependent on having a locked inode passed into
      the function to determine where to start the lookup. This means we
      can drop it from the function call and treat all inodes the same.
      
      We also make xfs_iflush_cluster skip inodes marked with
      XFS_IRECLAIM. This we avoid races with inodes that reclaim is
      actively referencing or are being re-initialised by inode lookup. If
      they are actually dirty, they'll get written by a future cluster
      flush....
      
      We also add a shutdown check after obtaining the flush lock so that
      we catch inodes that are dirty in memory and may have inconsistent
      state due to the shutdown in progress. We abort these inodes
      directly and so they remove themselves directly from the buffer list
      and the AIL rather than having to wait for the buffer to be failed
      and callbacks run to be processed correctly.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      5717ea4d
    • D
      xfs: rename xfs_iflush_int() · e6187b34
      Dave Chinner 提交于
      with xfs_iflush() gone, we can rename xfs_iflush_int() back to
      xfs_iflush(). Also move it up above xfs_iflush_cluster() so we don't
      need the forward definition any more.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      e6187b34
    • D
      xfs: xfs_iflush() is no longer necessary · 90c60e16
      Dave Chinner 提交于
      Now we have a cached buffer on inode log items, we don't need
      to do buffer lookups when flushing inodes anymore - all we need
      to do is lock the buffer and we are ready to go.
      
      This largely gets rid of the need for xfs_iflush(), which is
      essentially just a mechanism to look up the buffer and flush the
      inode to it. Instead, we can just call xfs_iflush_cluster() with a
      few modifications to ensure it also flushes the inode we already
      hold locked.
      
      This allows the AIL inode item pushing to be almost entirely
      non-blocking in XFS - we won't block unless memory allocation
      for the cluster inode lookup blocks or the block device queues are
      full.
      
      Writeback during inode reclaim becomes a little more complex because
      we now have to lock the buffer ourselves, but otherwise this change
      is largely a functional no-op that removes a whole lot of code.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      90c60e16
    • D
      xfs: attach inodes to the cluster buffer when dirtied · 48d55e2a
      Dave Chinner 提交于
      Rather than attach inodes to the cluster buffer just when we are
      doing IO, attach the inodes to the cluster buffer when they are
      dirtied. The means the buffer always carries a list of dirty inodes
      that reference it, and we can use that list to make more fundamental
      changes to inode writeback that aren't otherwise possible.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      48d55e2a
    • D
      xfs: rework stale inodes in xfs_ifree_cluster · 71e3e356
      Dave Chinner 提交于
      Once we have inodes pinning the cluster buffer and attached whenever
      they are dirty, we no longer have a guarantee that the items are
      flush locked when we lock the cluster buffer. Hence we cannot just
      walk the buffer log item list and modify the attached inodes.
      
      If the inode is not flush locked, we have to ILOCK it first and then
      flush lock it to do all the prerequisite checks needed to avoid
      races with other code. This is already handled by
      xfs_ifree_get_one_inode(), so rework the inode iteration loop and
      function to update all inodes in cache whether they are attached to
      the buffer or not.
      
      Note: we also remove the copying of the log item lsn to the
      ili_flush_lsn as xfs_iflush_done() now uses the XFS_ISTALE flag to
      trigger aborts and so flush lsn matching is not needed in IO
      completion for processing freed inodes.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      71e3e356
    • D
      xfs: clean up inode reclaim comments · 02511a5a
      Dave Chinner 提交于
      Inode reclaim is quite different now to the way described in various
      comments, so update all the comments explaining what it does and how
      it works.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      02511a5a
    • D
      xfs: remove SYNC_WAIT from xfs_reclaim_inodes() · 4d0bab3a
      Dave Chinner 提交于
      Clean up xfs_reclaim_inodes() callers. Most callers want blocking
      behaviour, so just make the existing SYNC_WAIT behaviour the
      default.
      
      For the xfs_reclaim_worker(), just call xfs_reclaim_inodes_ag()
      directly because we just want optimistic clean inode reclaim to be
      done in the background.
      
      For xfs_quiesce_attr() we can just remove the inode reclaim calls as
      they are a historic relic that was required to flush dirty inodes
      that contained unlogged changes. We now log all changes to the
      inodes, so the sync AIL push from xfs_log_quiesce() called by
      xfs_quiesce_attr() will do all the required inode writeback for
      freeze.
      
      Seeing as we now want to loop until all reclaimable inodes have been
      reclaimed, make xfs_reclaim_inodes() loop on the XFS_ICI_RECLAIM_TAG
      tag rather than having xfs_reclaim_inodes_ag() tell it that inodes
      were skipped. This is much more reliable and will always loop until
      all reclaimable inodes are reclaimed.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      4d0bab3a
    • D
      xfs: remove SYNC_TRYLOCK from inode reclaim · 50718b8d
      Dave Chinner 提交于
      All background reclaim is SYNC_TRYLOCK already, and even blocking
      reclaim (SYNC_WAIT) can use trylock mechanisms as
      xfs_reclaim_inodes_ag() will keep cycling until there are no more
      reclaimable inodes. Hence we can kill SYNC_TRYLOCK from inode
      reclaim and make everything unconditionally non-blocking.
      
      We remove all the optimistic "avoid blocking on locks" checks done
      in xfs_reclaim_inode_grab() as nothing blocks on locks anymore.
      Further, checking XFS_IFLOCK optimistically can result in detecting
      inodes in the process of being cleaned (i.e. between being removed
      from the AIL and having the flush lock dropped), so for
      xfs_reclaim_inodes() to reliably reclaim all inodes we need to drop
      these checks anyway.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      50718b8d
    • D
      xfs: don't block inode reclaim on the ILOCK · 9552e14d
      Dave Chinner 提交于
      When we attempt to reclaim an inode, the first thing we do is take
      the inode lock. This is blocking right now, so if the inode being
      accessed by something else (e.g. being flushed to the cluster
      buffer) we will block here.
      
      Change this to a trylock so that we do not block inode reclaim
      unnecessarily here.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      9552e14d
    • D
      xfs: allow multiple reclaimers per AG · 0e8e2c63
      Dave Chinner 提交于
      Inode reclaim will still throttle direct reclaim on the per-ag
      reclaim locks. This is no longer necessary as reclaim can run
      non-blocking now. Hence we can remove these locks so that we don't
      arbitrarily block reclaimers just because there are more direct
      reclaimers than there are AGs.
      
      This can result in multiple reclaimers working on the same range of
      an AG, but this doesn't cause any apparent issues. Optimising the
      spread of concurrent reclaimers for best efficiency can be done in a
      future patchset.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      0e8e2c63
    • D
      xfs: remove IO submission from xfs_reclaim_inode() · 617825fe
      Dave Chinner 提交于
      We no longer need to issue IO from shrinker based inode reclaim to
      prevent spurious OOM killer invocation. This leaves only the global
      filesystem management operations such as unmount needing to
      writeback dirty inodes and reclaim them.
      
      Instead of using the reclaim pass to write dirty inodes before
      reclaiming them, use the AIL to push all the dirty inodes before we
      try to reclaim them. This allows us to remove all the conditional
      SYNC_WAIT locking and the writeback code from xfs_reclaim_inode()
      and greatly simplify the checks we need to do to reclaim an inode.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      617825fe
    • D
      xfs: make inode reclaim almost non-blocking · 993f951f
      Dave Chinner 提交于
      Now that dirty inode writeback doesn't cause read-modify-write
      cycles on the inode cluster buffer under memory pressure, the need
      to throttle memory reclaim to the rate at which we can clean dirty
      inodes goes away. That is due to the fact that we no longer thrash
      inode cluster buffers under memory pressure to clean dirty inodes.
      
      This means inode writeback no longer stalls on memory allocation
      or read IO, and hence can be done asynchronously without generating
      memory pressure. As a result, blocking inode writeback in reclaim is
      no longer necessary to prevent reclaim priority windup as cleaning
      dirty inodes is no longer dependent on having memory reserves
      available for the filesystem to make progress reclaiming inodes.
      
      Hence we can convert inode reclaim to be non-blocking for shrinker
      callouts, both for direct reclaim and kswapd.
      
      On a vanilla kernel, running a 16-way fsmark create workload on a
      4 node/16p/16GB RAM machine, I can reliably pin 14.75GB of RAM via
      userspace mlock(). The OOM killer gets invoked at 15GB of
      pinned RAM.
      
      Without the inode cluster pinning, this non-blocking reclaim patch
      triggers premature OOM killer invocation with the same memory
      pinning, sometimes with as much as 45% of RAM being free.  It's
      trivially easy to trigger the OOM killer when reclaim does not
      block.
      
      With pinning inode clusters in RAM and then adding this patch, I can
      reliably pin 14.5GB of RAM and still have the fsmark workload run to
      completion. The OOM killer gets invoked 14.75GB of pinned RAM, which
      is only a small amount of memory less than the vanilla kernel. It is
      much more reliable than just with async reclaim alone.
      
      simoops shows that allocation stalls go away when async reclaim is
      used. Vanilla kernel:
      
      Run time: 1924 seconds
      Read latency (p50: 3,305,472) (p95: 3,723,264) (p99: 4,001,792)
      Write latency (p50: 184,064) (p95: 553,984) (p99: 807,936)
      Allocation latency (p50: 2,641,920) (p95: 3,911,680) (p99: 4,464,640)
      work rate = 13.45/sec (avg 13.44/sec) (p50: 13.46) (p95: 13.58) (p99: 13.70)
      alloc stall rate = 3.80/sec (avg: 2.59) (p50: 2.54) (p95: 2.96) (p99: 3.02)
      
      With inode cluster pinning and async reclaim:
      
      Run time: 1924 seconds
      Read latency (p50: 3,305,472) (p95: 3,715,072) (p99: 3,977,216)
      Write latency (p50: 187,648) (p95: 553,984) (p99: 789,504)
      Allocation latency (p50: 2,748,416) (p95: 3,919,872) (p99: 4,448,256)
      work rate = 13.28/sec (avg 13.32/sec) (p50: 13.26) (p95: 13.34) (p99: 13.34)
      alloc stall rate = 0.02/sec (avg: 0.02) (p50: 0.01) (p95: 0.03) (p99: 0.03)
      
      Latencies don't really change much, nor does the work rate. However,
      allocation almost never stalls with these changes, whilst the
      vanilla kernel is sometimes reporting 20 stalls/s over a 60s sample
      period. This difference is due to inode reclaim being largely
      non-blocking now.
      
      IOWs, once we have pinned inode cluster buffers, we can make inode
      reclaim non-blocking without a major risk of premature and/or
      spurious OOM killer invocation, and without any changes to memory
      reclaim infrastructure.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      993f951f
    • D
      xfs: pin inode backing buffer to the inode log item · 298f7bec
      Dave Chinner 提交于
      When we dirty an inode, we are going to have to write it disk at
      some point in the near future. This requires the inode cluster
      backing buffer to be present in memory. Unfortunately, under severe
      memory pressure we can reclaim the inode backing buffer while the
      inode is dirty in memory, resulting in stalling the AIL pushing
      because it has to do a read-modify-write cycle on the cluster
      buffer.
      
      When we have no memory available, the read of the cluster buffer
      blocks the AIL pushing process, and this causes all sorts of issues
      for memory reclaim as it requires inode writeback to make forwards
      progress. Allocating a cluster buffer causes more memory pressure,
      and results in more cluster buffers to be reclaimed, resulting in
      more RMW cycles to be done in the AIL context and everything then
      backs up on AIL progress. Only the synchronous inode cluster
      writeback in the the inode reclaim code provides some level of
      forwards progress guarantees that prevent OOM-killer rampages in
      this situation.
      
      Fix this by pinning the inode backing buffer to the inode log item
      when the inode is first dirtied (i.e. in xfs_trans_log_inode()).
      This may mean the first modification of an inode that has been held
      in cache for a long time may block on a cluster buffer read, but
      we can do that in transaction context and block safely until the
      buffer has been allocated and read.
      
      Once we have the cluster buffer, the inode log item takes a
      reference to it, pinning it in memory, and attaches it to the log
      item for future reference. This means we can always grab the cluster
      buffer from the inode log item when we need it.
      
      When the inode is finally cleaned and removed from the AIL, we can
      drop the reference the inode log item holds on the cluster buffer.
      Once all inodes on the cluster buffer are clean, the cluster buffer
      will be unpinned and it will be available for memory reclaim to
      reclaim again.
      
      This avoids the issues with needing to do RMW cycles in the AIL
      pushing context, and hence allows complete non-blocking inode
      flushing to be performed by the AIL pushing context.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      298f7bec
    • D
      xfs: move xfs_clear_li_failed out of xfs_ail_delete_one() · e98084b8
      Dave Chinner 提交于
      xfs_ail_delete_one() is called directly from dquot and inode IO
      completion, as well as from the generic xfs_trans_ail_delete()
      function. Inodes are about to have their own failure handling, and
      dquots will in future, too. Pull the clearing of the LI_FAILED flag
      up into the callers so we can customise the code appropriately.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      e98084b8
    • D
      xfs: unwind log item error flagging · 3536b61e
      Dave Chinner 提交于
      When an buffer IO error occurs, we want to mark all
      the log items attached to the buffer as failed. Open code
      the error handling loop so that we can modify the flagging for the
      different types of objects directly and independently of each other.
      
      This also allows us to remove the ->iop_error method from the log
      item operations.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      3536b61e
    • D
      xfs: handle buffer log item IO errors directly · 428947e9
      Dave Chinner 提交于
      Currently when a buffer with attached log items has an IO error
      it called ->iop_error for each attched log item. These all call
      xfs_set_li_failed() to handle the error, but we are about to change
      the way log items manage buffers. hence we first need to remove the
      per-item dependency on buffer handling done by xfs_set_li_failed().
      
      We already have specific buffer type IO completion routines, so move
      the log item error handling out of the generic error handling and
      into the log item specific functions so we can implement per-type
      error handling easily.
      
      This requires a more complex return value from the error handling
      code so that we can take the correct action the failure handling
      requires.  This results in some repeated boilerplate in the
      functions, but that can be cleaned up later once all the changes
      cascade through this code.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      428947e9
    • D
      xfs: get rid of log item callbacks · 2ef3f7f5
      Dave Chinner 提交于
      They are not used anymore, so remove them from the log item and the
      buffer iodone attachment interfaces.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      2ef3f7f5
    • D
      xfs: clean up the buffer iodone callback functions · fec671cd
      Dave Chinner 提交于
      Now that we've sorted inode and dquot buffers, we can apply the same
      cleanups to dirty buffers with buffer log items. They only have one
      callback, too, so we don't need the log item callback. Collapse the
      iodone functions and remove all the now unnecessary infrastructure
      around callback processing.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      fec671cd
    • D
      xfs: use direct calls for dquot IO completion · 6f5de180
      Dave Chinner 提交于
      Similar to inodes, we can call the dquot IO completion functions
      directly from the buffer completion code, removing another user of
      log item callbacks for IO completion processing.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      6f5de180
    • D
      xfs: make inode IO completion buffer centric · aac855ab
      Dave Chinner 提交于
      Having different io completion callbacks for different inode states
      makes things complex. We can detect if the inode is stale via the
      XFS_ISTALE flag in IO completion, so we don't need a special
      callback just for this.
      
      This means inodes only have a single iodone callback, and inode IO
      completion is entirely buffer centric at this point. Hence we no
      longer need to use a log item callback at all as we can just call
      xfs_iflush_done() directly from the buffer completions and walk the
      buffer log item list to complete the all inodes under IO.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      aac855ab
    • D
      xfs: clean up whacky buffer log item list reinit · a7e134ef
      Dave Chinner 提交于
      When we've emptied the buffer log item list, it does a list_del_init
      on itself to reset it's pointers to itself. This is unnecessary as
      the list is already empty at this point - it was a left-over
      fragment from the list_head conversion of the buffer log item list.
      Remove them.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      a7e134ef
    • D
      xfs: call xfs_buf_iodone directly · b01d1461
      Dave Chinner 提交于
      All unmarked dirty buffers should be in the AIL and have log items
      attached to them. Hence when they are written, we will run a
      callback to remove the item from the AIL if appropriate. Now that
      we've handled inode and dquot buffers, all remaining calls are to
      xfs_buf_iodone() and so we can hard code this rather than use an
      indirect call.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      b01d1461
    • D
      xfs: mark log recovery buffers for completion · 9fe5c77c
      Dave Chinner 提交于
      Log recovery has it's own buffer write completion handler for
      buffers that it directly recovers. Convert these to direct calls by
      flagging these buffers as being log recovery buffers. The flag will
      get cleared by the log recovery IO completion routine, so it will
      never leak out of log recovery.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      9fe5c77c
    • D
      xfs: mark dquot buffers in cache · 0c7e5afb
      Dave Chinner 提交于
      dquot buffers always have write IO callbacks, so by marking them
      directly we can avoid needing to attach ->b_iodone functions to
      them. This avoids an indirect call, and makes future modifications
      much simpler.
      
      This is largely a rearrangement of the code at this point - no IO
      completion functionality changes at this point, just how the
      code is run is modified.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      0c7e5afb
    • D
      xfs: mark inode buffers in cache · f593bf14
      Dave Chinner 提交于
      Inode buffers always have write IO callbacks, so by marking them
      directly we can avoid needing to attach ->b_iodone functions to
      them. This avoids an indirect call, and makes future modifications
      much simpler.
      
      While this is largely a refactor of existing functionality, we
      broaden the scope of the flag to beyond where inodes are explicitly
      attached because future changes need to know what type of log items
      are attached to the buffer. Adding this buffer flag may invoke the
      inode iodone callback in cases where it wouldn't have been
      previously, but this is not a functional change because the callback
      is identical to the normal buffer write iodone callback when inodes
      are not attached.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      f593bf14
    • D
      xfs: add an inode item lock · 1319ebef
      Dave Chinner 提交于
      The inode log item is kind of special in that it can be aggregating
      new changes in memory at the same time time existing changes are
      being written back to disk. This means there are fields in the log
      item that are accessed concurrently from contexts that don't share
      any locking at all.
      
      e.g. updating ili_last_fields occurs at flush time under the
      ILOCK_EXCL and flush lock at flush time, under the flush lock at IO
      completion time, and is read under the ILOCK_EXCL when the inode is
      logged.  Hence there is no actual serialisation between reading the
      field during logging of the inode in transactions vs clearing the
      field in IO completion.
      
      We currently get away with this by the fact that we are only
      clearing fields in IO completion, and nothing bad happens if we
      accidentally log more of the inode than we actually modify. Worst
      case is we consume a tiny bit more memory and log bandwidth.
      
      However, if we want to do more complex state manipulations on the
      log item that requires updates at all three of these potential
      locations, we need to have some mechanism of serialising those
      operations. To do this, introduce a spinlock into the log item to
      serialise internal state.
      
      This could be done via the xfs_inode i_flags_lock, but this then
      leads to potential lock inversion issues where inode flag updates
      need to occur inside locks that best nest inside the inode log item
      locks (e.g. marking inodes stale during inode cluster freeing).
      Using a separate spinlock avoids these sorts of problems and
      simplifies future code.
      
      This does not touch the use of ili_fields in the item formatting
      code - that is entirely protected by the ILOCK_EXCL at this point in
      time, so it remains untouched.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      1319ebef
    • D
      xfs: remove logged flag from inode log item · 1dfde687
      Dave Chinner 提交于
      This was used to track if the item had logged fields being flushed
      to disk. We log everything in the inode these days, so this logic is
      no longer needed. Remove it.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      1dfde687
    • D
      xfs: Don't allow logging of XFS_ISTALE inodes · 96355d5a
      Dave Chinner 提交于
      In tracking down a problem in this patchset, I discovered we are
      reclaiming dirty stale inodes. This wasn't discovered until inodes
      were always attached to the cluster buffer and then the rcu callback
      that freed inodes was assert failing because the inode still had an
      active pointer to the cluster buffer after it had been reclaimed.
      
      Debugging the issue indicated that this was a pre-existing issue
      resulting from the way the inodes are handled in xfs_inactive_ifree.
      When we free a cluster buffer from xfs_ifree_cluster, all the inodes
      in cache are marked XFS_ISTALE. Those that are clean have nothing
      else done to them and so eventually get cleaned up by background
      reclaim. i.e. it is assumed we'll never dirty/relog an inode marked
      XFS_ISTALE.
      
      On journal commit dirty stale inodes as are handled by both
      buffer and inode log items to run though xfs_istale_done() and
      removed from the AIL (buffer log item commit) or the log item will
      simply unpin it because the buffer log item will clean it. What happens
      to any specific inode is entirely dependent on which log item wins
      the commit race, but the result is the same - stale inodes are
      clean, not attached to the cluster buffer, and not in the AIL. Hence
      inode reclaim can just free these inodes without further care.
      
      However, if the stale inode is relogged, it gets dirtied again and
      relogged into the CIL. Most of the time this isn't an issue, because
      relogging simply changes the inode's location in the current
      checkpoint. Problems arise, however, when the CIL checkpoints
      between two transactions in the xfs_inactive_ifree() deferops
      processing. This results in the XFS_ISTALE inode being redirtied
      and inserted into the CIL without any of the other stale cluster
      buffer infrastructure being in place.
      
      Hence on journal commit, it simply gets unpinned, so it remains
      dirty in memory. Everything in inode writeback avoids XFS_ISTALE
      inodes so it can't be written back, and it is not tracked in the AIL
      so there's not even a trigger to attempt to clean the inode. Hence
      the inode just sits dirty in memory until inode reclaim comes along,
      sees that it is XFS_ISTALE, and goes to reclaim it. This reclaiming
      of a dirty inode caused use after free, list corruptions and other
      nasty issues later in this patchset.
      
      Hence this patch addresses a violation of the "never log XFS_ISTALE
      inodes" caused by the deferops processing rolling a transaction
      and relogging a stale inode in xfs_inactive_free. It also adds a
      bunch of asserts to catch this problem in debug kernels so that
      we don't reintroduce this problem in future.
      
      Reproducer for this issue was generic/558 on a v4 filesystem.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      96355d5a