1. 29 Jul, 2020 18 commits
  2. 18 Jul, 2020 1 commit
    • E
      xfs: preserve inode versioning across remounts · 4750a171
      Eric Sandeen committed
      The MS_I_VERSION mount flag is exposed via the VFS, as documented
      in the mount manpages etc; see the iversion and noiversion mount
      options in mount(8).
      
      As a result, mount -o remount looks for this option in /proc/mounts
      and will only send the I_VERSION flag back in during remount if it
      is present.  Since it is not listed there, a remount will /remove/ the
      I_VERSION flag at the vfs level, and iversion functionality is lost.
      
      xfs v5 superblocks intend to always have i_version enabled; it is
      set as a default at mount time, but is lost during remount for the
      reasons above.
      
      The generic fix would be to expose this documented option in
      /proc/mounts, but since that was rejected, fix it up again in the
      xfs remount path instead, so that at least xfs won't suffer from
      this misbehavior.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
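The remount behaviour described in this commit can be modelled in a few lines. This is a userspace sketch only: the helper names and the `apply_fix` switch are hypothetical, and the flag value is used purely for illustration rather than as a statement about kernel headers.

```python
# Illustrative model only; helper names are hypothetical and the flag
# value is a stand-in, not taken from the kernel sources.
SB_I_VERSION = 1 << 23


def mount_flags(is_v5):
    """At mount time, v5 filesystems get i_version enabled by default."""
    return SB_I_VERSION if is_v5 else 0


def remount_flags(flags_from_userspace, is_v5, apply_fix):
    """Return the superblock flags in effect after a remount.

    mount(8) rebuilds its flag set from /proc/mounts. The iversion option
    is not listed there, so flags_from_userspace arrives without
    SB_I_VERSION even though the filesystem had it set.
    """
    flags = flags_from_userspace
    if apply_fix and is_v5:
        flags |= SB_I_VERSION  # the filesystem re-asserts its own default
    return flags


initial = mount_flags(is_v5=True)
lost = remount_flags(0, is_v5=True, apply_fix=False)
kept = remount_flags(0, is_v5=True, apply_fix=True)
```

Without the fix-up the flag is silently dropped on remount; with it, a v5 filesystem keeps the flag no matter what userspace passes back in.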
  3. 14 Jul, 2020 3 commits
  4. 10 Jul, 2020 1 commit
    • W
      xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim · c3f2375b
      Waiman Long committed
      Depending on the workloads, the following circular locking dependency
      warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo
      lock) may show up:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.0.0-rc1+ #60 Tainted: G        W
      ------------------------------------------------------
      fsfreeze/4346 is trying to acquire lock:
      0000000026f1d784 (fs_reclaim){+.+.}, at:
      fs_reclaim_acquire.part.19+0x5/0x30
      
      but task is already holding lock:
      0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
      
      which lock already depends on the new lock.
        :
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(sb_internal);
                                     lock(fs_reclaim);
                                     lock(sb_internal);
        lock(fs_reclaim);
      
       *** DEADLOCK ***
      
      4 locks held by fsfreeze/4346:
       #0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650
       #1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290
       #2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650
       #3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
      
      stack backtrace:
      Call Trace:
       dump_stack+0xe0/0x19a
       print_circular_bug.isra.10.cold.34+0x2f4/0x435
       check_prev_add.constprop.19+0xca1/0x15f0
       validate_chain.isra.14+0x11af/0x3b50
       __lock_acquire+0x728/0x1200
       lock_acquire+0x269/0x5a0
       fs_reclaim_acquire.part.19+0x29/0x30
       fs_reclaim_acquire+0x19/0x20
       kmem_cache_alloc+0x3e/0x3f0
       kmem_zone_alloc+0x79/0x150
       xfs_trans_alloc+0xfa/0x9d0
       xfs_sync_sb+0x86/0x170
       xfs_log_sbcount+0x10f/0x140
       xfs_quiesce_attr+0x134/0x270
       xfs_fs_freeze+0x4a/0x70
       freeze_super+0x1af/0x290
       do_vfs_ioctl+0xedc/0x16c0
       ksys_ioctl+0x41/0x80
       __x64_sys_ioctl+0x73/0xa9
       do_syscall_64+0x18f/0xd23
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      This is a false positive as all the dirty pages are flushed out before
      the filesystem can be frozen.
      
      One way to avoid this splat is to add GFP_NOFS to the affected allocation
      calls by using the memalloc_nofs_save()/memalloc_nofs_restore() pair.
      This shouldn't matter unless the system is really running out of memory.
      In that particular case, the filesystem freeze operation may fail while
      it was succeeding previously.
      
      Without this patch, the command sequence below will show that the lock
      dependency chain sb_internal -> fs_reclaim exists.
      
       # fsfreeze -f /home
       # fsfreeze --unfreeze /home
       # grep -i fs_reclaim -C 3 /proc/lockdep_chains | grep -C 5 sb_internal
      
      After applying the patch, such sb_internal -> fs_reclaim lock dependency
      chain can no longer be found. Because of that, the locking dependency
      warning will not be shown.
      Suggested-by: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
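The circular dependency in the splat above can be illustrated with a toy lock-order graph. This is a simplified sketch of the general idea behind lock dependency tracking, not lockdep's actual implementation; the `LockGraph` class and its methods are hypothetical.

```python
from collections import defaultdict


class LockGraph:
    """Toy lock-order tracker: an edge A -> B means B was taken while holding A."""

    def __init__(self):
        self.edges = defaultdict(set)

    def _reachable(self, src, dst):
        # Depth-first search over the recorded dependencies.
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.edges[node])
        return False

    def record(self, held, acquired):
        """Record the dependency; return True if it closes a cycle."""
        cycle = self._reachable(acquired, held)
        self.edges[held].add(acquired)
        return cycle


graph = LockGraph()
# Reclaim context starting a transaction: fs_reclaim -> sb_internal.
first = graph.record("fs_reclaim", "sb_internal")
# Freeze path allocating memory while holding sb_internal.
second = graph.record("sb_internal", "fs_reclaim")
```

Here `second` reports a cycle. If the freeze path allocates under a no-filesystem-reclaim scope, the `sb_internal -> fs_reclaim` edge is never recorded, so no cycle can be reported.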
  5. 07 Jul, 2020 17 commits
    • D
      xfs: rtbitmap scrubber should check inode size · 2fb94e36
      Darrick J. Wong committed
      Make sure the rtbitmap is large enough to store the entire bitmap.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Allison Collins <allison.henderson@oracle.com>
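A rough sketch of the size check this commit describes, assuming the bitmap stores one bit per realtime extent and the file is sized in whole filesystem blocks. Both assumptions and all function names here are illustrative and not taken from the patch.

```python
def min_rtbitmap_bytes(rt_extents, block_size):
    """Smallest bitmap file size, in bytes, that can cover rt_extents.

    Assumes one bit per realtime extent and whole-block allocation.
    """
    bits_per_block = block_size * 8
    blocks = -(-rt_extents // bits_per_block)  # ceiling division
    return blocks * block_size


def rtbitmap_size_ok(inode_size, rt_extents, block_size):
    """True if the bitmap inode is large enough to store the entire bitmap."""
    return inode_size >= min_rtbitmap_bytes(rt_extents, block_size)
```

A scrubber following this idea would flag the bitmap inode as corrupt whenever `rtbitmap_size_ok()` returns false.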
    • D
      xfs: rtbitmap scrubber should verify written extents · f866560b
      Darrick J. Wong committed
      Ensure that the realtime bitmap file is backed entirely by written
      extents.  No holes, no unwritten blocks, etc.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Allison Collins <allison.henderson@oracle.com>
    • D
      xfs: remove xfs_inobp_check() · e2705b03
      Dave Chinner committed
      This debug code is called on every xfs_iflush() call, which then
      checks every inode in the buffer for a non-zero unlinked list field.
      Hence it checks every inode in the cluster buffer every time a
      single inode on that cluster is flushed. This is resulting in:
      
      -   38.91%     5.33%  [kernel]  [k] xfs_iflush
         - 17.70% xfs_iflush
            - 9.93% xfs_inobp_check
                 4.36% xfs_buf_offset
      
      10% of the CPU time spent flushing inodes is repeatedly checking
      unlinked fields in the buffer. We don't need to do this.
      
      The other place we call xfs_inobp_check() is
      xfs_iunlink_update_dinode(), and this is after we've done this
      assert for the agino we are about to write into that inode:
      
      	ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));
      
      which means we've already checked that the agino we are about to
      write is not 0 on debug kernels. The inode buffer verifiers do
      everything else we need, so let's just remove this debug code.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • D
      xfs: factor xfs_iflush_done · a69a1dc2
      Dave Chinner committed
      xfs_iflush_done() does 3 distinct operations to the inodes attached
      to the buffer. Separate these operations out into functions so that
      it is easier to modify these operations independently in future.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • D
      xfs: rework xfs_iflush_cluster() dirty inode iteration · 5717ea4d
      Dave Chinner committed
      Now that we have all the dirty inodes attached to the cluster
      buffer, we don't actually have to do radix tree lookups to find
      them. Sure, the radix tree is efficient, but walking a linked list
      of just the dirty inodes attached to the buffer is much better.
      
      We are also no longer dependent on having a locked inode passed into
      the function to determine where to start the lookup. This means we
      can drop it from the function call and treat all inodes the same.
      
      We also make xfs_iflush_cluster() skip inodes marked with
      XFS_IRECLAIM. This way we avoid races with inodes that reclaim is
      actively referencing or that are being re-initialised by inode
      lookup. If they are actually dirty, they'll get written by a future
      cluster flush.
      
      We also add a shutdown check after obtaining the flush lock so that
      we catch inodes that are dirty in memory and may have inconsistent
      state due to the shutdown in progress. We abort these inodes
      directly and so they remove themselves directly from the buffer list
      and the AIL rather than having to wait for the buffer to be failed
      and callbacks run to be processed correctly.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
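The iteration described in this commit can be sketched as a walk over the buffer's list of dirty inodes. This is a userspace model with hypothetical types and field names; the real function also handles locking, which is left out here.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Inode:
    ino: int
    reclaiming: bool = False  # stands in for the XFS_IRECLAIM flag
    aborted: bool = False


@dataclass(eq=False)
class ClusterBuffer:
    dirty_inodes: list = field(default_factory=list)


def flush_cluster(buf, shutting_down=False):
    """Walk only the dirty inodes attached to the buffer; no tree lookups."""
    flushed = []
    for ip in list(buf.dirty_inodes):
        if ip.reclaiming:
            # Leave it for a later cluster flush if it is still dirty.
            continue
        if shutting_down:
            # Abort directly: the inode removes itself from the buffer list.
            ip.aborted = True
            buf.dirty_inodes.remove(ip)
            continue
        flushed.append(ip.ino)
    return flushed
```

No starting inode is passed in: every inode on the list is treated the same way, which mirrors the point made in the second paragraph of the message.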
    • D
      xfs: rename xfs_iflush_int() · e6187b34
      Dave Chinner committed
      With xfs_iflush() gone, we can rename xfs_iflush_int() back to
      xfs_iflush(). Also move it up above xfs_iflush_cluster() so we don't
      need the forward definition any more.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • D
      xfs: xfs_iflush() is no longer necessary · 90c60e16
      Dave Chinner committed
      Now that we have a cached buffer on inode log items, we don't need
      to do buffer lookups when flushing inodes anymore - all we need
      to do is lock the buffer and we are ready to go.
      
      This largely gets rid of the need for xfs_iflush(), which is
      essentially just a mechanism to look up the buffer and flush the
      inode to it. Instead, we can just call xfs_iflush_cluster() with a
      few modifications to ensure it also flushes the inode we already
      hold locked.
      
      This allows the AIL inode item pushing to be almost entirely
      non-blocking in XFS - we won't block unless memory allocation
      for the cluster inode lookup blocks or the block device queues are
      full.
      
      Writeback during inode reclaim becomes a little more complex because
      we now have to lock the buffer ourselves, but otherwise this change
      is largely a functional no-op that removes a whole lot of code.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • D
      xfs: attach inodes to the cluster buffer when dirtied · 48d55e2a
      Dave Chinner committed
      Rather than attach inodes to the cluster buffer just when we are
      doing IO, attach the inodes to the cluster buffer when they are
      dirtied. This means the buffer always carries a list of dirty inodes
      that reference it, and we can use that list to make more fundamental
      changes to inode writeback that aren't otherwise possible.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • D
      xfs: rework stale inodes in xfs_ifree_cluster · 71e3e356
      Dave Chinner committed
      Once we have inodes pinning the cluster buffer and attached whenever
      they are dirty, we no longer have a guarantee that the items are
      flush locked when we lock the cluster buffer. Hence we cannot just
      walk the buffer log item list and modify the attached inodes.
      
      If the inode is not flush locked, we have to ILOCK it first and then
      flush lock it to do all the prerequisite checks needed to avoid
      races with other code. This is already handled by
      xfs_ifree_get_one_inode(), so rework the inode iteration loop and
      function to update all inodes in cache whether they are attached to
      the buffer or not.
      
      Note: we also remove the copying of the log item lsn to the
      ili_flush_lsn as xfs_iflush_done() now uses the XFS_ISTALE flag to
      trigger aborts and so flush lsn matching is not needed in IO
      completion for processing freed inodes.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • D
      xfs: clean up inode reclaim comments · 02511a5a
      Dave Chinner committed
      Inode reclaim is quite different now to the way described in various
      comments, so update all the comments explaining what it does and how
      it works.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • D
      xfs: remove SYNC_WAIT from xfs_reclaim_inodes() · 4d0bab3a
      Dave Chinner committed
      Clean up xfs_reclaim_inodes() callers. Most callers want blocking
      behaviour, so just make the existing SYNC_WAIT behaviour the
      default.
      
      For the xfs_reclaim_worker(), just call xfs_reclaim_inodes_ag()
      directly because we just want optimistic clean inode reclaim to be
      done in the background.
      
      For xfs_quiesce_attr() we can just remove the inode reclaim calls as
      they are a historic relic that was required to flush dirty inodes
      that contained unlogged changes. We now log all changes to the
      inodes, so the sync AIL push from xfs_log_quiesce() called by
      xfs_quiesce_attr() will do all the required inode writeback for
      freeze.
      
      Seeing as we now want to loop until all reclaimable inodes have been
      reclaimed, make xfs_reclaim_inodes() loop on the XFS_ICI_RECLAIM_TAG
      tag rather than having xfs_reclaim_inodes_ag() tell it that inodes
      were skipped. This is much more reliable and will always loop until
      all reclaimable inodes are reclaimed.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
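The "loop on the tag" approach from the last paragraph of this message can be sketched as follows. This is a toy model: a set stands in for the inodes carrying XFS_ICI_RECLAIM_TAG, and the `settle` callback is a hypothetical hook representing other work (for example, writeback) completing between passes.

```python
def reclaim_pass(tagged, busy):
    """One non-blocking pass: reclaim every tagged inode that is not busy."""
    for ino in list(tagged):
        if ino in busy:
            continue  # skip without blocking; a later pass retries it
        tagged.discard(ino)


def reclaim_all(tagged, busy, settle):
    """Loop until no tagged inodes remain; return the number of passes."""
    passes = 0
    while tagged:  # the tag itself decides when we are done
        reclaim_pass(tagged, busy)
        settle(busy)
        passes += 1
    return passes
```

The loop condition checks the tag state directly instead of relying on the per-pass helper to report that it skipped something, which is the reliability point the message makes.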
    • D
      xfs: remove SYNC_TRYLOCK from inode reclaim · 50718b8d
      Dave Chinner committed
      All background reclaim is SYNC_TRYLOCK already, and even blocking
      reclaim (SYNC_WAIT) can use trylock mechanisms as
      xfs_reclaim_inodes_ag() will keep cycling until there are no more
      reclaimable inodes. Hence we can kill SYNC_TRYLOCK from inode
      reclaim and make everything unconditionally non-blocking.
      
      We remove all the optimistic "avoid blocking on locks" checks done
      in xfs_reclaim_inode_grab() as nothing blocks on locks anymore.
      Further, checking XFS_IFLOCK optimistically can result in detecting
      inodes in the process of being cleaned (i.e. between being removed
      from the AIL and having the flush lock dropped), so for
      xfs_reclaim_inodes() to reliably reclaim all inodes we need to drop
      these checks anyway.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • D
      xfs: don't block inode reclaim on the ILOCK · 9552e14d
      Dave Chinner committed
      When we attempt to reclaim an inode, the first thing we do is take
      the inode lock. This is blocking right now, so if the inode is being
      accessed by something else (e.g. being flushed to the cluster
      buffer) we will block here.
      
      Change this to a trylock so that we do not block inode reclaim
      unnecessarily here.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
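The blocking-to-trylock change can be shown with an ordinary userspace mutex. This is only an analogy using Python's `threading.Lock`; the function name is hypothetical and the kernel's inode lock has different semantics.

```python
import threading


def try_reclaim(inode_lock):
    """Attempt reclaim without blocking; return False if the lock is busy."""
    if not inode_lock.acquire(blocking=False):
        return False  # someone else holds the lock: skip this inode for now
    try:
        # Reclaim work would happen here, under the lock.
        return True
    finally:
        inode_lock.release()


lock = threading.Lock()
uncontended = try_reclaim(lock)

lock.acquire()  # simulate another user, e.g. a flush in progress
contended = try_reclaim(lock)
lock.release()
```

A skipped inode is not lost: it stays marked for reclaim and a later pass can pick it up.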
    • D
      xfs: allow multiple reclaimers per AG · 0e8e2c63
      Dave Chinner committed
      Inode reclaim will still throttle direct reclaim on the per-ag
      reclaim locks. This is no longer necessary as reclaim can run
      non-blocking now. Hence we can remove these locks so that we don't
      arbitrarily block reclaimers just because there are more direct
      reclaimers than there are AGs.
      
      This can result in multiple reclaimers working on the same range of
      an AG, but this doesn't cause any apparent issues. Optimising the
      spread of concurrent reclaimers for best efficiency can be done in a
      future patchset.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • D
      xfs: remove IO submission from xfs_reclaim_inode() · 617825fe
      Dave Chinner committed
      We no longer need to issue IO from shrinker based inode reclaim to
      prevent spurious OOM killer invocation. This leaves only the global
      filesystem management operations such as unmount needing to
      write back dirty inodes and reclaim them.
      
      Instead of using the reclaim pass to write dirty inodes before
      reclaiming them, use the AIL to push all the dirty inodes before we
      try to reclaim them. This allows us to remove all the conditional
      SYNC_WAIT locking and the writeback code from xfs_reclaim_inode()
      and greatly simplify the checks we need to do to reclaim an inode.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • D
      xfs: make inode reclaim almost non-blocking · 993f951f
      Dave Chinner committed
      Now that dirty inode writeback doesn't cause read-modify-write
      cycles on the inode cluster buffer under memory pressure, the need
      to throttle memory reclaim to the rate at which we can clean dirty
      inodes goes away. That is due to the fact that we no longer thrash
      inode cluster buffers under memory pressure to clean dirty inodes.
      
      This means inode writeback no longer stalls on memory allocation
      or read IO, and hence can be done asynchronously without generating
      memory pressure. As a result, blocking inode writeback in reclaim is
      no longer necessary to prevent reclaim priority windup as cleaning
      dirty inodes is no longer dependent on having memory reserves
      available for the filesystem to make progress reclaiming inodes.
      
      Hence we can convert inode reclaim to be non-blocking for shrinker
      callouts, both for direct reclaim and kswapd.
      
      On a vanilla kernel, running a 16-way fsmark create workload on a
      4 node/16p/16GB RAM machine, I can reliably pin 14.75GB of RAM via
      userspace mlock(). The OOM killer gets invoked at 15GB of
      pinned RAM.
      
      Without the inode cluster pinning, this non-blocking reclaim patch
      triggers premature OOM killer invocation with the same memory
      pinning, sometimes with as much as 45% of RAM being free.  It's
      trivially easy to trigger the OOM killer when reclaim does not
      block.
      
      With pinning inode clusters in RAM and then adding this patch, I can
      reliably pin 14.5GB of RAM and still have the fsmark workload run to
      completion. The OOM killer gets invoked at 14.75GB of pinned RAM,
      which is only a small amount of memory less than the vanilla kernel.
      It is much more reliable than just with async reclaim alone.
      
      simoops shows that allocation stalls go away when async reclaim is
      used. Vanilla kernel:
      
      Run time: 1924 seconds
      Read latency (p50: 3,305,472) (p95: 3,723,264) (p99: 4,001,792)
      Write latency (p50: 184,064) (p95: 553,984) (p99: 807,936)
      Allocation latency (p50: 2,641,920) (p95: 3,911,680) (p99: 4,464,640)
      work rate = 13.45/sec (avg 13.44/sec) (p50: 13.46) (p95: 13.58) (p99: 13.70)
      alloc stall rate = 3.80/sec (avg: 2.59) (p50: 2.54) (p95: 2.96) (p99: 3.02)
      
      With inode cluster pinning and async reclaim:
      
      Run time: 1924 seconds
      Read latency (p50: 3,305,472) (p95: 3,715,072) (p99: 3,977,216)
      Write latency (p50: 187,648) (p95: 553,984) (p99: 789,504)
      Allocation latency (p50: 2,748,416) (p95: 3,919,872) (p99: 4,448,256)
      work rate = 13.28/sec (avg 13.32/sec) (p50: 13.26) (p95: 13.34) (p99: 13.34)
      alloc stall rate = 0.02/sec (avg: 0.02) (p50: 0.01) (p95: 0.03) (p99: 0.03)
      
      Latencies don't really change much, nor does the work rate. However,
      allocation almost never stalls with these changes, whilst the
      vanilla kernel is sometimes reporting 20 stalls/s over a 60s sample
      period. This difference is due to inode reclaim being largely
      non-blocking now.
      
      IOWs, once we have pinned inode cluster buffers, we can make inode
      reclaim non-blocking without a major risk of premature and/or
      spurious OOM killer invocation, and without any changes to memory
      reclaim infrastructure.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • D
      xfs: pin inode backing buffer to the inode log item · 298f7bec
      Dave Chinner committed
      When we dirty an inode, we are going to have to write it to disk at
      some point in the near future. This requires the inode cluster
      backing buffer to be present in memory. Unfortunately, under severe
      memory pressure we can reclaim the inode backing buffer while the
      inode is dirty in memory, resulting in stalling the AIL pushing
      because it has to do a read-modify-write cycle on the cluster
      buffer.
      
      When we have no memory available, the read of the cluster buffer
      blocks the AIL pushing process, and this causes all sorts of issues
      for memory reclaim as it requires inode writeback to make forwards
      progress. Allocating a cluster buffer causes more memory pressure,
      and results in more cluster buffers being reclaimed, resulting in
      more RMW cycles to be done in the AIL context and everything then
      backs up on AIL progress. Only the synchronous inode cluster
      writeback in the inode reclaim code provides some level of
      forwards progress guarantees that prevent OOM-killer rampages in
      this situation.
      
      Fix this by pinning the inode backing buffer to the inode log item
      when the inode is first dirtied (i.e. in xfs_trans_log_inode()).
      This may mean the first modification of an inode that has been held
      in cache for a long time may block on a cluster buffer read, but
      we can do that in transaction context and block safely until the
      buffer has been allocated and read.
      
      Once we have the cluster buffer, the inode log item takes a
      reference to it, pinning it in memory, and attaches it to the log
      item for future reference. This means we can always grab the cluster
      buffer from the inode log item when we need it.
      
      When the inode is finally cleaned and removed from the AIL, we can
      drop the reference the inode log item holds on the cluster buffer.
      Once all inodes on the cluster buffer are clean, the cluster buffer
      will be unpinned and it will be available for memory reclaim to
      reclaim again.
      
      This avoids the issues with needing to do RMW cycles in the AIL
      pushing context, and hence allows complete non-blocking inode
      flushing to be performed by the AIL pushing context.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
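The pinning scheme in this message is essentially reference counting, which the following userspace model tries to capture. The class and method names are hypothetical; the real code takes the reference in transaction context and drops it when the inode leaves the AIL.

```python
class PinnedBuffer:
    """Stand-in for an inode cluster buffer with a simple reference count."""

    def __init__(self):
        self.refs = 0

    @property
    def reclaimable(self):
        # Memory reclaim may only free the buffer once nothing pins it.
        return self.refs == 0


class InodeLogItem:
    """Stand-in for the inode log item that caches the backing buffer."""

    def __init__(self):
        self.buf = None

    def dirty(self, buf):
        # Only the first dirtying takes a reference and attaches the buffer.
        if self.buf is None:
            buf.refs += 1
            self.buf = buf

    def clean(self):
        # Once cleaned, the log item drops its reference on the buffer.
        if self.buf is not None:
            self.buf.refs -= 1
            self.buf = None
```

With two dirty inodes on one buffer, the buffer stays pinned until both have been cleaned, so the flush path can always find the buffer on the log item without a read-modify-write cycle.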