1. 10 Feb, 2022 1 commit
  2. 31 Jan, 2022 1 commit
  3. 22 Dec, 2021 1 commit
    • D
      xfs: only run COW extent recovery when there are no live extents · 7993f1a4
      Darrick J. Wong committed
      As part of multiple customer escalations due to file data corruption
      after copy on write operations, I wrote some fstests that use fsstress
      to hammer on COW to shake things loose.  Regrettably, I caught some
      filesystem shutdowns due to incorrect rmap operations with the following
      loop:
      
      mount <filesystem>				# (0)
      fsstress <run only readonly ops> &		# (1)
      while true; do
      	fsstress <run all ops>
      	mount -o remount,ro			# (2)
      	fsstress <run only readonly ops>
      	mount -o remount,rw			# (3)
      done
      
      When (2) happens, notice that (1) is still running.  xfs_remount_ro will
      call xfs_blockgc_stop to walk the inode cache to free all the COW
      extents, but the blockgc mechanism races with (1)'s reader threads to
      take IOLOCKs and loses, which means that it doesn't clean them all out.
      Call such a file (A).
      
      When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
      walks the ondisk refcount btree and frees any COW extent that it finds.
      This function does not check the inode cache, which means that incore
      COW forks of inode (A) are now inconsistent with the ondisk metadata.  If
      one of those former COW extents is allocated and mapped into another
      file (B) and someone triggers a COW to the stale reservation in (A), A's
      dirty data will be written into (B) and once that's done, those blocks
      will be transferred to (A)'s data fork without bumping the refcount.
      
      The results are catastrophic -- file (B) and the refcount btree are now
      corrupt.  In the first patch, we fixed the race condition in (2) so that
      (A) will always flush the COW fork.  In this second patch, we move the
      _recover_cow call to the initial mount call in (0) for safety.
      
      As mentioned previously, xfs_reflink_recover_cow walks the refcount
      btree looking for COW staging extents, and frees them.  This was
      intended to be run at mount time (when we know there are no live inodes)
      to clean up any leftover staging extents that may have been left behind
      during an unclean shutdown.  As a time "optimization" for readonly
      mounts, we deferred this to the ro->rw transition, not realizing that
      any failure to clean all COW forks during a rw->ro transition would
      result in catastrophic corruption.
      
      Therefore, remove this optimization and only run the recovery routine
      when we're guaranteed not to have any COW staging extents anywhere,
      which means we always run this at mount time.  While we're at it, move
      the callsite to xfs_log_mount_finish because any refcount btree
      expansion (however unlikely given that we're removing records from the
      right side of the index) must be fed by a per-AG reservation, which
      doesn't exist in its current location.
      
      Fixes: 174edb0e ("xfs: store in-progress CoW allocations in the refcount btree")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
  4. 08 Dec, 2021 1 commit
    • D
      xfs: remove all COW fork extents when remounting readonly · 089558bc
      Darrick J. Wong committed
      As part of multiple customer escalations due to file data corruption
      after copy on write operations, I wrote some fstests that use fsstress
      to hammer on COW to shake things loose.  Regrettably, I caught some
      filesystem shutdowns due to incorrect rmap operations with the following
      loop:
      
      mount <filesystem>				# (0)
      fsstress <run only readonly ops> &		# (1)
      while true; do
      	fsstress <run all ops>
      	mount -o remount,ro			# (2)
      	fsstress <run only readonly ops>
      	mount -o remount,rw			# (3)
      done
      
      When (2) happens, notice that (1) is still running.  xfs_remount_ro will
      call xfs_blockgc_stop to walk the inode cache to free all the COW
      extents, but the blockgc mechanism races with (1)'s reader threads to
      take IOLOCKs and loses, which means that it doesn't clean them all out.
      Call such a file (A).
      
      When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
      walks the ondisk refcount btree and frees any COW extent that it finds.
      This function does not check the inode cache, which means that incore
      COW forks of inode (A) are now inconsistent with the ondisk metadata.  If
      one of those former COW extents is allocated and mapped into another
      file (B) and someone triggers a COW to the stale reservation in (A), A's
      dirty data will be written into (B) and once that's done, those blocks
      will be transferred to (A)'s data fork without bumping the refcount.
      
      The results are catastrophic -- file (B) and the refcount btree are now
      corrupt.  Solve this race by forcing the xfs_blockgc_free_space to run
      synchronously, which causes xfs_icwalk to return to inodes that were
      skipped because the blockgc code couldn't take the IOLOCK.  This is safe
      to do here because the VFS has already prohibited new writer threads.
      
      Fixes: 10ddf64e ("xfs: remove leftover CoW reservations when remounting ro")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
  5. 05 Dec, 2021 3 commits
  6. 31 Oct, 2021 1 commit
  7. 23 Oct, 2021 3 commits
  8. 20 Oct, 2021 3 commits
  9. 27 Aug, 2021 2 commits
  10. 20 Aug, 2021 7 commits
    • D
      xfs: introduce xfs_sb_is_v5 helper · d6837c1a
      Dave Chinner committed
      Rather than open coding XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5
      checks everywhere, add a simple wrapper to encapsulate this and make
      the code easier to read.
      
      This allows us to remove the xfs_sb_version_has_v3inode() wrapper
      which is only used in xfs_format.h now and is just a version number
      check.
      
      There are a couple of places where we should be checking the mount
      feature bits rather than the superblock version (e.g. remount), so
      those are converted to use xfs_has_crc(mp) instead.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
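A minimal sketch of the kind of wrapper the commit above describes: one named predicate in place of an open-coded version comparison. The `demo_` struct, macros, and constant values are hypothetical stand-ins chosen for illustration; the real definitions live in xfs_format.h and are not reproduced here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the superblock: only the version field is modelled. */
struct demo_sb {
	uint16_t sb_versionnum;
};

/* Assumed values, for illustration only. */
#define DEMO_SB_VERSION_5		5
#define DEMO_SB_VERSION_NUMBITS		0x000f
#define DEMO_SB_VERSION_NUM(sbp) \
	((sbp)->sb_versionnum & DEMO_SB_VERSION_NUMBITS)

/* The wrapper: callers ask "is this v5?" instead of comparing version numbers. */
static inline bool demo_sb_is_v5(const struct demo_sb *sbp)
{
	return DEMO_SB_VERSION_NUM(sbp) == DEMO_SB_VERSION_5;
}
```

Call sites then read as `if (demo_sb_is_v5(sbp))`, which is the readability gain the commit message is after.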
    • D
      xfs: convert xfs_sb_version_has checks to use mount features · ebd9027d
      Dave Chinner committed
      This is a conversion of the remaining xfs_sb_version_has..(sbp)
      checks to use xfs_has_..(mp) feature checks.
      
      This was largely done with a vim replacement macro that did:
      
      :0,$s/xfs_sb_version_has\(.*\)&\(.*\)->m_sb/xfs_has_\1\2/g<CR>
      
      A couple of other variants were also used, and the rest touched up
      by hand.
      
      $ size -t fs/xfs/built-in.a
      	   text    data     bss     dec     hex filename
      before	1127533  311352     484 1439369  15f689 (TOTALS)
      after	1125360  311352     484 1437196  15ee0c (TOTALS)
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • D
      xfs: replace XFS_FORCED_SHUTDOWN with xfs_is_shutdown · 75c8c50f
      Dave Chinner committed
      Remove the shouty macro and instead use the inline function that
      matches other state/feature check wrapper naming. This conversion
      was done with sed.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • D
      xfs: convert remaining mount flags to state flags · 2e973b2c
      Dave Chinner committed
      The remaining mount flags kept in m_flags are actually runtime state
      flags. These change dynamically, so they really should be updated
      atomically so we don't potentially lose an update due to racing
      modifications.
      
      Convert these remaining flags to be stored in m_opstate and use
      atomic bitops to set and clear the flags. This also adds a couple of
      simple wrappers for common state checks - read only and shutdown.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
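The idea of keeping runtime state in one word and updating it with atomic bit operations can be sketched in userspace with C11 atomics. This is an illustration under stated assumptions, not the kernel code: the `demo_` names and bit numbers are hypothetical, and `atomic_fetch_or`/`atomic_fetch_and` stand in for the kernel's atomic bitops.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit numbers; the real names and values are not given in the text. */
enum demo_opstate_bit {
	DEMO_OPSTATE_READONLY = 0,
	DEMO_OPSTATE_SHUTDOWN = 1,
};

/* Hypothetical stand-in for the mount structure. */
struct demo_mount {
	atomic_ulong m_opstate;
};

/* Atomic read-modify-write: a concurrent update to another bit is never lost. */
static inline void demo_set_state(struct demo_mount *mp, enum demo_opstate_bit bit)
{
	atomic_fetch_or(&mp->m_opstate, 1UL << bit);
}

static inline void demo_clear_state(struct demo_mount *mp, enum demo_opstate_bit bit)
{
	atomic_fetch_and(&mp->m_opstate, ~(1UL << bit));
}

static inline bool demo_test_state(struct demo_mount *mp, enum demo_opstate_bit bit)
{
	return (atomic_load(&mp->m_opstate) >> bit) & 1UL;
}

/* Simple wrappers for the two common checks named in the commit message. */
static inline bool demo_is_readonly(struct demo_mount *mp)
{
	return demo_test_state(mp, DEMO_OPSTATE_READONLY);
}

static inline bool demo_is_shutdown(struct demo_mount *mp)
{
	return demo_test_state(mp, DEMO_OPSTATE_SHUTDOWN);
}
```

With a plain `flags |= bit` two racing writers can each read the old word and one update disappears; the atomic read-modify-write avoids that without taking a lock.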
    • D
      xfs: convert mount flags to features · 0560f31a
      Dave Chinner committed
      Replace m_flags feature checks with xfs_has_<feature>() calls and
      rework the setup code to set flags in m_features.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • D
      xfs: replace xfs_sb_version checks with feature flag checks · 38c26bfd
      Dave Chinner committed
      Convert the xfs_sb_version_hasfoo() to checks against
      mp->m_features. Checks of the superblock itself during disk
      operations (e.g. in the read/write verifiers and the to/from disk
      formatters) are not converted - they operate purely on the
      superblock state. Everything else should use the mount features.
      
      Large parts of this conversion were done with sed with commands like
      this:
      
      for f in `git grep -l xfs_sb_version_has fs/xfs/*.c`; do
      	sed -i -e 's/xfs_sb_version_has\(.*\)(&\(.*\)->m_sb)/xfs_has_\1(\2)/' $f
      done
      
      With manual cleanups for things like "xfs_has_extflgbit" and other
      little inconsistencies in naming.
      
      The result is a lot less typing to check features and an XFS binary
      size reduced by a bit over 3kB:
      
      $ size -t fs/xfs/built-in.a
      	   text    data     bss     dec     hex filename
      before	1130866  311352     484 1442702  16038e (TOTALS)
      after	1127727  311352     484 1439563  15f74b (TOTALS)
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • D
      xfs: rework attr2 feature and mount options · e23b55d5
      Dave Chinner committed
      The attr2 feature is somewhat unique in that it has both a superblock
      feature bit to enable it and mount options to enable and disable it.
      
      Back when it was first introduced in 2005, attr2 was disabled unless
      either the attr2 superblock feature bit was set, or the attr2 mount
      option was set. If the superblock feature bit was not set but the
      mount option was set, then when the first attr2 format inode fork
      was created, it would set the superblock feature bit. This is as it
      should be - the superblock feature bit indicated the presence of the
      attr2 on disk format.
      
      The noattr2 mount option, however, did not affect the superblock
      feature bit. If noattr2 was specified, the on-disk superblock
      feature bit was ignored and the code always just created attr1
      format inode forks.  If neither the attr2 nor the noattr2 mount
      option was specified, then the behaviour was determined by the
      superblock feature bit.
      
      This was all pretty sane.
      
      Fast forward 3 years, and we are dealing with fallout from the
      botched sb_features2 addition and having to deal with feature
      mismatches between the sb_features2 and sb_bad_features2 fields. The
      attr2 feature bit was one of these flags. The reconciliation was
      done well after mount option parsing and, unfortunately, the feature
      reconciliation had a bug where it ignored the noattr2 mount option.
      
      For reasons lost to the mists of time, it was decided that resolving
      this issue in commit 7c12f296 ("[XFS] Fix up noattr2 so that it
      will properly update the versionnum and features2 fields.") required
      noattr2 to clear the superblock attr2 feature bit.  This greatly
      complicated the attr2 behaviour and broke rules about feature bits
      needing to be set when those specific features are present in the
      filesystem.
      
      By complicated, I mean that it introduced problems due to feature
      bit interactions with log recovery. All of the superblock feature
      bit checks are done prior to log recovery, but if we crash after
      removing a feature bit, then on the next mount we see the feature
      bit in the unrecovered superblock, only to have it go away after the
      log has been replayed.  This means our mount time feature processing
      could be all wrong.
      
      Hence you can mount with noattr2, crash shortly afterwards, and
      mount again without attr2 or noattr2 and still have attr2 enabled
      because the second mount sees attr2 still enabled in the superblock
      before recovery runs and removes the feature bit. It's just a mess.
      
      Further, this is all legacy code as the v5 format requires attr2 to
      be enabled at all times and it cannot be disabled.  i.e. the noattr2
      mount option returns an error when used on v5 format filesystems.
      
      To straighten this all out, this patch reverts the attr2/noattr2
      mount option behaviour back to the original behaviour. There is no
      reason for disabling attr2 these days, so we will only do this when
      the noattr2 mount option is set. This will not remove the superblock
      feature bit. The superblock bit will provide the default behaviour
      and only track whether attr2 is present on disk or not. The attr2
      mount option will enable the creation of attr2 format inode forks,
      and if the superblock feature bit is not set it will be added when
      the first attr2 inode fork is created.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
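The restored attr2/noattr2 rules from the commit above can be written as a small decision function. This is a sketch of the rules as stated in the message, not the kernel implementation: every `demo_` name is hypothetical, and the `-1` return is a placeholder for whatever error the real mount path reports.

```c
#include <stdbool.h>

/* Mount option state as parsed from the command line (hypothetical enum). */
enum demo_attr2_opt {
	DEMO_OPT_UNSET,		/* neither attr2 nor noattr2 given */
	DEMO_OPT_ATTR2,
	DEMO_OPT_NOATTR2,
};

/*
 * Decide whether new inode forks use the attr2 format.
 * Returns 0 and stores the decision in *use_attr2, or -1 (placeholder
 * error value) when noattr2 is requested on a v5 filesystem.
 * The superblock feature bit is only read here, never cleared.
 */
static int demo_resolve_attr2(bool is_v5, bool sb_has_attr2,
			      enum demo_attr2_opt opt, bool *use_attr2)
{
	if (is_v5) {
		/* v5 requires attr2 at all times; noattr2 is an error. */
		if (opt == DEMO_OPT_NOATTR2)
			return -1;
		*use_attr2 = true;
		return 0;
	}

	switch (opt) {
	case DEMO_OPT_NOATTR2:
		*use_attr2 = false;	/* create attr1 forks, leave the sb bit alone */
		break;
	case DEMO_OPT_ATTR2:
		*use_attr2 = true;	/* sb bit gets set later, on first attr2 fork */
		break;
	default:
		*use_attr2 = sb_has_attr2;	/* superblock provides the default */
		break;
	}
	return 0;
}
```

Because the function never writes the superblock bit, the decision no longer depends on whether log recovery has replayed a bit removal, which is the interaction the message identifies as the source of the mess.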
  11. 17 Aug, 2021 3 commits
    • D
      xfs: move the CIL workqueue to the CIL · 33c0dd78
      Dave Chinner committed
      We only use the CIL workqueue in the CIL, so it makes no sense to
      hang it off the xfs_mount and have to walk multiple pointers back up
      to the mount when we have the CIL structures right there.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • D
      xfs: CIL work is serialised, not pipelined · 39823d0f
      Dave Chinner committed
      Because we use a single work structure attached to the CIL rather
      than the CIL context, we can only queue a single work item at a
      time. This results in the CIL being single threaded and limits
      performance when it becomes CPU bound.
      
      The design of the CIL is that it is pipelined and multiple commits
      can be running concurrently, but the way the work is currently
      implemented means that it is not pipelining as it was intended. The
      critical work to switch the CIL context can take a few milliseconds
      to run, but the rest of the CIL context flush can take hundreds of
      milliseconds to complete. The context switching is the serialisation
      point of the CIL; once the context has been switched, the rest of the
      context push can run asynchronously with all other context pushes.
      
      Hence we can move the work to the CIL context so that we can run
      multiple CIL pushes at the same time and spread the majority of
      the work out over multiple CPUs. We can keep the per-cpu CIL commit
      state on the CIL rather than the context, because the context is
      pinned to the CIL until the switch is done and we aggregate and
      drain the per-cpu state held on the CIL during the context switch.
      
      However, because we no longer serialise the CIL work, we can have
      effectively unlimited CIL pushes in progress. We don't want to do
      this - not only does it create contention on the iclogs and the
      state machine locks, we can run the log right out of space with
      outstanding pushes. Instead, limit the concurrency to 4 work
      items being processed at a time. This is enough
      concurrency to remove the CIL from being a CPU bound bottleneck but
      not enough to create new contention points or unbound concurrency
      issues.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • D
      xfs: convert log flags to an operational state field · e1d06e5f
      Dave Chinner committed
      log->l_flags doesn't actually contain "flags" as such, it contains
      operational state information that can change at runtime. For the
      shutdown state, this at least should be an atomic bit because
      it is read without holding locks in many places and so using atomic
      bitops for the state field modifications makes sense.
      
      This allows us to use things like test_and_set_bit() on state
      changes (e.g. setting XLOG_TAIL_WARN) to avoid races in setting the
      state when we aren't holding locks.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
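To illustrate why a test-and-set on the state word avoids races without a lock, here is a userspace sketch using C11 atomics. It is an assumption-laden model, not the kernel code: `demo_test_and_set_bit` only imitates the return convention of the kernel's `test_and_set_bit()` (the previous bit value), and the bit number and the warning counter are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_LOG_TAIL_WARN_BIT	0	/* hypothetical bit assignment */

/* Hypothetical stand-in for the log structure. */
struct demo_log {
	atomic_ulong l_opstate;
};

/* Atomically set bit nr and report whether it was already set. */
static bool demo_test_and_set_bit(struct demo_log *log, unsigned int nr)
{
	unsigned long mask = 1UL << nr;

	return (atomic_fetch_or(&log->l_opstate, mask) & mask) != 0;
}

static int demo_warn_count;	/* touched only by the single caller that wins */

/* Emit the warning once, no matter how many callers race to this point. */
static void demo_tail_warn(struct demo_log *log)
{
	if (!demo_test_and_set_bit(log, DEMO_LOG_TAIL_WARN_BIT))
		demo_warn_count++;
}
```

Exactly one caller observes the bit as previously clear, so the check and the state change cannot be separated by another thread.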
  12. 10 Aug, 2021 3 commits
  13. 07 Aug, 2021 5 commits
    • D
      xfs: per-cpu deferred inode inactivation queues · ab23a776
      Dave Chinner committed
      Move inode inactivation to background work contexts so that it no
      longer runs in the context that releases the final reference to an
      inode. This will allow a process that would otherwise block on
      inactivation to continue doing work while the filesystem processes
      the inactivation in the background.
      
      A typical demonstration of this is unlinking an inode with lots of
      extents. The extents are removed during inactivation, so this blocks
      the process that unlinked the inode from the directory structure. By
      moving the inactivation to the background process, the userspace
      application can keep working (e.g. unlinking the next inode in the
      directory) while the inactivation work on the previous inode is
      done by a different CPU.
      
      The implementation of the queue is relatively simple. We use a
      per-cpu lockless linked list (llist) to queue inodes for
      inactivation without requiring serialisation mechanisms, and a work
      item to allow the queue to be processed by a CPU bound worker
      thread. We also keep a count of the queue depth so that we can
      trigger work after a number of deferred inactivations have been
      queued.
      
      The use of a bound workqueue with a single work depth allows the
      workqueue to run one work item per CPU. We queue the work item on
      the CPU we are currently running on, and so this essentially gives
      us affine per-cpu worker threads for the per-cpu queues. This
      maintains the effective CPU affinity that occurs within XFS at the
      AG level due to all objects in a directory being local to an AG.
      Hence inactivation work tends to run on the same CPU that last
      accessed all the objects that inactivation accesses and this
      maintains hot CPU caches for unlink workloads.
      
      A depth of 32 inodes was chosen to match the number of inodes in an
      inode cluster buffer. This hopefully allows sequential
      allocation/unlink behaviours to defer inactivation of all the
      inodes in a single cluster buffer at a time, further helping
      maintain hot CPU and buffer cache accesses while running
      inactivations.
      
      A hard per-cpu queue throttle of 256 inodes has been set to avoid
      runaway queuing when inodes that take a long time to inactivate are
      being processed. For example, when unlinking inodes with large
      numbers of extents that can take a lot of processing to free.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      [djwong: tweak comments and tracepoints, convert opflags to state bits]
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
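The queueing scheme described above (a lockless list plus a depth counter with two thresholds) can be sketched for a single queue in userspace C11. This is a simplified model, not the kernel implementation: the `demo_` types are hypothetical, the per-cpu placement and the worker itself are omitted, and what a caller does on `DEMO_THROTTLE` (presumably wait for the worker) is an assumption. Only the depths 32 and 256 come from the commit message.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_WORK_DEPTH		32	/* from the text: trigger work at this depth */
#define DEMO_THROTTLE_DEPTH	256	/* from the text: hard queue throttle */

struct demo_node {
	struct demo_node *next;
};

/* One queue; the real design keeps one of these per CPU. */
struct demo_queue {
	_Atomic(struct demo_node *) first;
	atomic_uint depth;
};

enum demo_action { DEMO_QUEUED, DEMO_KICK_WORKER, DEMO_THROTTLE };

/* Lock-free push onto the list head, then classify the new queue depth. */
static enum demo_action demo_queue_add(struct demo_queue *q, struct demo_node *n)
{
	struct demo_node *head = atomic_load(&q->first);
	unsigned int depth;

	do {
		n->next = head;	/* head is refreshed by a failed compare-exchange */
	} while (!atomic_compare_exchange_weak(&q->first, &head, n));

	depth = atomic_fetch_add(&q->depth, 1) + 1;
	if (depth >= DEMO_THROTTLE_DEPTH)
		return DEMO_THROTTLE;
	if (depth >= DEMO_WORK_DEPTH)
		return DEMO_KICK_WORKER;
	return DEMO_QUEUED;
}

/*
 * Detach the whole list in one atomic exchange, as a worker would.
 * Resetting the counter separately is a simplification of this sketch.
 */
static struct demo_node *demo_queue_del_all(struct demo_queue *q)
{
	atomic_store(&q->depth, 0);
	return atomic_exchange(&q->first, (struct demo_node *)NULL);
}
```

Producers never take a lock to enqueue, and the worker takes the entire batch in one step, so entries queued while it runs simply land on a fresh list.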
    • D
      xfs: move xfs_inactive call to xfs_inode_mark_reclaimable · c6c2066d
      Darrick J. Wong committed
      Move the xfs_inactive call and all the other debugging checks and stats
      updates into xfs_inode_mark_reclaimable because most of that is
      implementation details about the inode cache.  This is preparation for
      deferred inactivation that is coming up.  We also move it around
      xfs_icache.c in preparation for deferred inactivation.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • D
      xfs: introduce all-mounts list for cpu hotplug notifications · 0ed17f01
      Dave Chinner committed
      The inode inactivation and CIL tracking percpu structures are
      per-xfs_mount structures. That means when we get a CPU dead
      notification, we need to then iterate all the per-cpu structure
      instances to process them. Rather than keeping linked lists of
      per-cpu structures in each subsystem, add a list of all xfs_mounts
      that the generic xfs_cpu_dead() function will iterate and call into
      each subsystem appropriately.
      
      This allows us to handle both per-mount and global XFS percpu state
      from xfs_cpu_dead(), and avoids the need to link subsystem
      structures that can be easily found from the xfs_mount into their
      own global lists.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      [djwong: expand some comments about mount list setup ordering rules]
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • D
      xfs: introduce CPU hotplug infrastructure · f1653c2e
      Dave Chinner committed
      We need to move to per-cpu state for both deferred inode
      inactivation and CIL tracking, but to do that we
      need to handle CPUs being removed from the system by the hot-plug
      code. Introduce generic XFS infrastructure to handle CPU hotplug
      events that is set up at module init time and torn down at module
      exit time.
      
      Initially, we only need CPU dead notifications, so we only set
      up a callback for these notifications. The infrastructure can be
      updated in future for other CPU hotplug state machine notifications
      easily if ever needed.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      [djwong: rearrange some macros, fix function prototypes]
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • C
      xfs: remove the active vs running quota differentiation · 149e53af
      Christoph Hellwig committed
      These only made a difference when quotaoff supported disabling quota
      accounting on a mounted file system, so we can switch everyone to use
      a single set of flags and helpers now. Note that the *QUOTA_ON naming
      for the helpers is kept as it was the much more commonly used one.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  14. 13 Jul, 2021 1 commit
  15. 22 Jun, 2021 1 commit
  16. 04 Jun, 2021 1 commit
    • D
      xfs: refactor per-AG inode tagging functions · c076ae7a
      Darrick J. Wong committed
      In preparation for adding another incore inode tree tag, refactor the
      code that sets and clears tags from the per-AG inode tree and the tree
      of per-AG structures, and remove the open-coded versions used by the
      blockgc code.
      
      Note: For reclaim, we now rely on the radix tree tags instead of the
      reclaimable inode count more heavily than we used to.  The conversion
      should be fine, but the logic isn't 100% identical.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
  17. 02 Jun, 2021 1 commit
  18. 08 Apr, 2021 1 commit
  19. 26 Mar, 2021 1 commit