1. 20 Aug, 2021 (1 commit)
    • xfs: rework attr2 feature and mount options · e23b55d5
      Committed by Dave Chinner
      The attr2 feature is somewhat unique in that it has both a superblock
      feature bit to enable it and mount options to enable and disable it.
      
      Back when it was first introduced in 2005, attr2 was disabled unless
      either the attr2 superblock feature bit was set, or the attr2 mount
      option was set. If the superblock feature bit was not set but the
      mount option was set, then when the first attr2 format inode fork
      was created, it would set the superblock feature bit. This is as it
      should be - the superblock feature bit indicated the presence of the
      attr2 on-disk format.
      
      The noattr2 mount option, however, did not affect the superblock
      feature bit. If noattr2 was specified, the on-disk superblock
      feature bit was ignored and the code always just created attr1
      format inode forks.  If neither the attr2 nor the noattr2 mount
      option was specified, then the behaviour was determined by the
      superblock feature bit.
      
      This was all pretty sane.
      
      Fast forward 3 years, and we are dealing with fallout from the
      botched sb_features2 addition and having to deal with feature
      mismatches between the sb_features2 and sb_bad_features2 fields. The
      attr2 feature bit was one of these flags. The reconciliation was
      done well after mount option parsing and, unfortunately, the feature
      reconciliation had a bug where it ignored the noattr2 mount option.
      
      For reasons lost to the mists of time, it was decided that resolving
      this issue in commit 7c12f296 ("[XFS] Fix up noattr2 so that it
      will properly update the versionnum and features2 fields.") required
      noattr2 to clear the superblock attr2 feature bit.  This greatly
      complicated the attr2 behaviour and broke rules about feature bits
      needing to be set when those specific features are present in the
      filesystem.
      
      By complicated, I mean that it introduced problems due to feature
      bit interactions with log recovery. All of the superblock feature
      bit checks are done prior to log recovery, but if we crash after
      removing a feature bit, then on the next mount we see the feature
      bit in the unrecovered superblock, only to have it go away after the
      log has been replayed.  This means our mount time feature processing
      could be all wrong.
      
      Hence you can mount with noattr2, crash shortly afterwards, and
      mount again without attr2 or noattr2 and still have attr2 enabled
      because the second mount sees attr2 still enabled in the superblock
      before recovery runs and removes the feature bit. It's just a mess.
      
      Further, this is all legacy code as the v5 format requires attr2 to
      be enabled at all times and it cannot be disabled.  i.e. the noattr2
      mount option returns an error when used on v5 format filesystems.
      
      To straighten this all out, this patch reverts the attr2/noattr2
      mount option behaviour back to the original behaviour. There is no
      reason to disable attr2 these days, so we will only do so when
      the noattr2 mount option is set. This will not remove the superblock
      feature bit. The superblock bit will provide the default behaviour
      and only track whether attr2 is present on disk or not. The attr2
      mount option will enable the creation of attr2 format inode forks,
      and if the superblock feature bit is not set it will be added when
      the first attr2 inode fork is created.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      e23b55d5
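
      A minimal sketch of the restored decision order, in C; the struct
      and helper names below are illustrative, not the actual XFS
      identifiers:

      #include <stdbool.h>

      /* Illustrative stand-ins for the real mount/superblock state. */
      struct example_mount {
              bool opt_attr2;         /* "attr2" mount option */
              bool opt_noattr2;       /* "noattr2" mount option */
              bool sb_attr2_bit;      /* on-disk attr2 feature bit */
      };

      /* noattr2 wins; then the attr2 option; then the on-disk bit. */
      static bool attr2_forks_enabled(const struct example_mount *mp)
      {
              if (mp->opt_noattr2)
                      return false;
              if (mp->opt_attr2)
                      return true;
              return mp->sb_attr2_bit;
      }

      /* The feature bit is only ever set (when the first attr2 fork is
       * created), never cleared, so recovery never sees it disappear. */
      static void first_attr2_fork_created(struct example_mount *mp)
      {
              if (!mp->sb_attr2_bit)
                      mp->sb_attr2_bit = true;
      }
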
  2. 10 Aug, 2021 (5 commits)
    • xfs: allow setting and clearing of log incompat feature flags · 908ce71e
      Committed by Darrick J. Wong
      Log incompat feature flags in the superblock exist for one purpose: to
      protect the contents of a dirty log from replay on a kernel that isn't
      prepared to handle those dirty contents.  This means that they can be
      cleared if (a) we know the log is clean and (b) we know that there
      aren't any other threads in the system that might be setting or relying
      upon a log incompat flag.
      
      Therefore, clear the log incompat flags when we've finished recovering
      the log, when we're unmounting cleanly, remounting read-only, or
      freezing; and provide a function so that subsequent patches can start
      using this.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
      908ce71e
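
      A minimal sketch of that rule; the struct and field names are
      invented for illustration, not the real XFS state:

      #include <stdbool.h>

      /* Invented state flags standing in for the real mount/log state. */
      struct example_mount {
              bool log_is_clean;              /* (a) nothing left to replay */
              bool log_flag_users_active;     /* (b) threads may rely on a flag */
              unsigned int sb_log_incompat;   /* on-disk log incompat flag word */
      };

      /* Clear the flags only when both safety conditions above hold;
       * the superblock then has to be written back before reuse. */
      static void example_clear_log_incompat(struct example_mount *mp)
      {
              if (!mp->log_is_clean || mp->log_flag_users_active)
                      return;
              mp->sb_log_incompat = 0;
      }
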
    • xfs: throttle inode inactivation queuing on memory reclaim · 40b1de00
      Committed by Darrick J. Wong
      Now that we defer inode inactivation, we've decoupled the process of
      unlinking or closing an inode from the process of inactivating it.  In
      theory this should lead to better throughput since we now inactivate the
      queued inodes in batches instead of one at a time.
      
      Unfortunately, one of the primary risks with this decoupling is the loss
      of rate control feedback between the frontend and background threads.
      In other words, a rm -rf /* thread can run the system out of memory if
      it can queue inodes for inactivation and jump to a new CPU faster than
      the background threads can actually clear the deferred work.  The
      workers can get scheduled off the CPU if they have to do IO, etc.
      
      To solve this problem, we configure a shrinker so that it will activate
      the /second/ time the shrinkers are called.  The custom shrinker will
      queue all percpu deferred inactivation workers immediately and set a
      flag to force frontend callers who are releasing a vfs inode to wait for
      the inactivation workers.
      
      On my test VM with 560M of RAM and a 2TB filesystem, this seems to solve
      most of the OOMing problem when deleting 10 million inodes.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      40b1de00
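
      Roughly, the shrinker hook has the shape sketched below, using the
      kernel shrinker API; the structure and callback names are invented
      here, and the real patch differs in detail (token counts, flag
      handling, SHRINK_STOP):

      #include <linux/cpumask.h>
      #include <linux/kernel.h>
      #include <linux/percpu.h>
      #include <linux/shrinker.h>
      #include <linux/workqueue.h>

      /* Invented structure standing in for the real inodegc state. */
      struct example_inodegc {
              struct shrinker shrink;
              struct work_struct __percpu *workers;
              bool throttle_frontend;         /* make releasing callers wait */
      };

      /* Report a token count so that, under real pressure, the shrinker
       * core comes back and calls ->scan_objects (the "second" pass). */
      static unsigned long example_count(struct shrinker *s,
                                         struct shrink_control *sc)
      {
              return 1;
      }

      /* Kick every percpu inactivation worker right away and flag the
       * frontend so inode-releasing threads wait for the workers. */
      static unsigned long example_scan(struct shrinker *s,
                                        struct shrink_control *sc)
      {
              struct example_inodegc *gc =
                      container_of(s, struct example_inodegc, shrink);
              int cpu;

              gc->throttle_frontend = true;
              for_each_online_cpu(cpu)
                      schedule_work_on(cpu, per_cpu_ptr(gc->workers, cpu));
              return 0;
      }
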
    • xfs: don't run speculative preallocation gc when fs is frozen · 6f649091
      Committed by Darrick J. Wong
      Now that we have the infrastructure to switch background workers on and
      off at will, fix the block gc worker code so that we don't actually run
      the worker when the filesystem is frozen, same as we do for deferred
      inactivation.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      6f649091
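
      A minimal sketch of the implied guard, assuming a hypothetical
      queueing helper; the real code tracks its own freeze state rather
      than necessarily reading s_writers directly:

      #include <linux/fs.h>
      #include <linux/jiffies.h>
      #include <linux/workqueue.h>

      /* Only (re)queue background blockgc work while writes are allowed;
       * illustrative helper and delay, not the actual XFS function. */
      static void example_blockgc_queue(struct super_block *sb,
                                        struct delayed_work *work)
      {
              if (sb->s_writers.frozen == SB_UNFROZEN)
                      schedule_delayed_work(work, 5 * HZ);
      }
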
    • xfs: queue inactivation immediately when free realtime extents are tight · 65f03d86
      Committed by Darrick J. Wong
      Now that we have made the inactivation of unlinked inodes a background
      task to increase the throughput of file deletions, we need to be a
      little more careful about how long of a delay we can tolerate.
      
      Similar to the patch doing this for free space on the data device, if
      the file being inactivated is a realtime file and the realtime volume is
      running low on free extents, we want to run the worker ASAP so that the
      realtime allocator can make better decisions.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      65f03d86
    • xfs: queue inactivation immediately when free space is tight · 7d6f07d2
      Committed by Darrick J. Wong
      Now that we have made the inactivation of unlinked inodes a background
      task to increase the throughput of file deletions, we need to be a
      little more careful about how long of a delay we can tolerate.
      
      On a mostly empty filesystem, the risk of the allocator making poor
      decisions due to fragmentation of the free space on account of a lengthy
      delay in background updates is minimal because there's plenty of space.
      However, if free space is tight, we want to deallocate unlinked inodes
      as quickly as possible to avoid fallocate ENOSPC and to give the
      allocator the best shot at optimal allocations for new writes.
      
      Therefore, queue the percpu worker immediately if the filesystem is more
      than 95% full.  This follows the same principle that XFS becomes less
      aggressive about speculative allocations and lazy cleanup (and more
      precise about accounting) when nearing full.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      7d6f07d2
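
      The threshold check in the last two commits reduces to simple
      arithmetic; a sketch with invented names (the real code also
      covers the per-AG and realtime cases separately):

      #include <stdbool.h>

      /* "More than 95% full" is the same test as "free space below 5%
       * of total"; names invented for illustration. */
      static bool example_inactivation_wants_no_delay(unsigned long long total_blocks,
                                                      unsigned long long free_blocks)
      {
              /* used/total > 95%  <=>  free < total/20 */
              return free_blocks < total_blocks / 20;
      }
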
  3. 07 Aug, 2021 (2 commits)
    • xfs: per-cpu deferred inode inactivation queues · ab23a776
      Committed by Dave Chinner
      Move inode inactivation to background work contexts so that it no
      longer runs in the context that releases the final reference to an
      inode. This will allow process work that ends up blocking on
      inactivation to continue doing work while the filesystem processes
      the inactivation in the background.
      
      A typical demonstration of this is unlinking an inode with lots of
      extents. The extents are removed during inactivation, so this blocks
      the process that unlinked the inode from the directory structure. By
      moving the inactivation to the background process, the userspace
      application can keep working (e.g. unlinking the next inode in the
      directory) while the inactivation work on the previous inode is
      done by a different CPU.
      
      The implementation of the queue is relatively simple. We use a
      per-cpu lockless linked list (llist) to queue inodes for
      inactivation without requiring serialisation mechanisms, and a work
      item to allow the queue to be processed by a CPU bound worker
      thread. We also keep a count of the queue depth so that we can
      trigger work after a number of deferred inactivations have been
      queued.
      
      The use of a bound workqueue with a single work depth allows the
      workqueue to run one work item per CPU. We queue the work item on
      the CPU we are currently running on, and so this essentially gives
      us affine per-cpu worker threads for the per-cpu queues. This
      maintains the effective CPU affinity that occurs within XFS at the
      AG level due to all objects in a directory being local to an AG.
      Hence inactivation work tends to run on the same CPU that last
      accessed all the objects that inactivation accesses and this
      maintains hot CPU caches for unlink workloads.
      
      A depth of 32 inodes was chosen to match the number of inodes in an
      inode cluster buffer. This hopefully allows sequential
      allocation/unlink behaviours to defer inactivation of all the
      inodes in a single cluster buffer at a time, further helping
      maintain hot CPU and buffer cache accesses while running
      inactivations.
      
      A hard per-cpu queue throttle of 256 inodes has been set to avoid
      runaway queueing when inodes that take a long time to inactivate are
      being processed. For example, when unlinking inodes with large
      numbers of extents that can take a lot of processing to free.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      [djwong: tweak comments and tracepoints, convert opflags to state bits]
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      ab23a776
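
      A condensed sketch of the queue shape using the kernel's
      lockless-list and workqueue APIs; the names, the depth accounting,
      and the worker side are all simplified relative to the real patch:

      #include <linux/llist.h>
      #include <linux/percpu.h>
      #include <linux/smp.h>
      #include <linux/workqueue.h>

      /* Condensed percpu queue; the real structure and its locking
       * and flushing details are more involved than shown here. */
      struct example_gc_queue {
              struct llist_head list;         /* lockless percpu inode list */
              struct work_struct work;        /* run by a CPU-bound worker */
              unsigned int items;             /* queue depth for throttling */
      };

      #define EXAMPLE_GC_BATCH        32      /* one inode cluster buffer */
      #define EXAMPLE_GC_MAX          256     /* hard percpu throttle */

      /* Queue an inode for background inactivation on the current CPU;
       * returns true when the caller should throttle and wait. */
      static bool example_queue_inactivation(struct example_gc_queue __percpu *pcp,
                                             struct llist_node *inode_node,
                                             struct workqueue_struct *wq)
      {
              struct example_gc_queue *gc = get_cpu_ptr(pcp); /* no preempt */
              bool throttle;

              llist_add(inode_node, &gc->list);       /* serialisation-free */
              gc->items++;
              if (gc->items % EXAMPLE_GC_BATCH == 0)  /* kick worker per batch */
                      queue_work_on(smp_processor_id(), wq, &gc->work);
              throttle = gc->items >= EXAMPLE_GC_MAX;
              put_cpu_ptr(pcp);
              return throttle;
      }
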
    • xfs: remove the active vs running quota differentiation · 149e53af
      Committed by Christoph Hellwig
      These only made a difference when quotaoff supported disabling quota
      accounting on a mounted file system, so we can switch everyone to use
      a single set of flags and helpers now. Note that the *QUOTA_ON naming
      for the helpers is kept as it was the much more commonly used one.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      149e53af
  4. 22 Jun, 2021 (1 commit)
    • xfs: fix log intent recovery ENOSPC shutdowns when inactivating inodes · 81ed9475
      Committed by Darrick J. Wong
      During regular operation, the xfs_inactive operations create
      transactions with zero block reservation because in general we're
      freeing space, not asking for more.  The per-AG space reservations
      created at mount time enable us to handle expansions of the refcount
      btree without needing to reserve blocks to the transaction.
      
      Unfortunately, log recovery doesn't create the per-AG space reservations
      when intent items are being recovered.  This isn't an issue for intent
      item recovery itself because they explicitly request blocks, but any
      inode inactivation that can happen during log recovery uses the same
      xfs_inactive paths as regular runtime.  If a refcount btree expansion
      happens, the transaction will fail due to blk_res_used > blk_res, and we
      shut down the filesystem unnecessarily.
      
      Fix this problem by making per-AG reservations temporarily so that we
      can handle the inactivations, and releasing them at the end.  This
      brings the recovery environment closer to the runtime environment.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      81ed9475
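
      The recovery flow this describes, as a sketch with invented helper
      names standing in for the real reservation and intent-replay
      routines:

      /* Illustrative prototypes standing in for the real XFS routines. */
      struct example_mount;
      int  example_reserve_perag_blocks(struct example_mount *mp);
      void example_unreserve_perag_blocks(struct example_mount *mp);
      int  example_replay_intents(struct example_mount *mp);

      /* Take per-AG reservations before replaying intents, so inode
       * inactivation triggered during recovery can expand the refcount
       * btree safely, then release them at the end. */
      static int example_recover_intents(struct example_mount *mp)
      {
              int error;

              error = example_reserve_perag_blocks(mp);
              if (error)
                      return error;
              error = example_replay_intents(mp);
              example_unreserve_perag_blocks(mp);
              return error;
      }
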
  5. 02 Jun, 2021 (3 commits)
  6. 29 Apr, 2021 (1 commit)
    • xfs: set aside allocation btree blocks from block reservation · fd43cf60
      Committed by Brian Foster
      The blocks used for allocation btrees (bnobt and cntbt) are
      technically considered free space. This is because as free space is
      used, allocbt blocks are removed and naturally become available for
      traditional allocation. However, this means that a significant
      portion of free space may consist of in-use btree blocks if free
      space is severely fragmented.
      
      On large filesystems with large perag reservations, this can lead to
      a rare but nasty condition where a significant amount of physical
      free space is available, but the majority of actual usable blocks
      consist of in-use allocbt blocks. We have a record of a (~12TB, 32
      AG) filesystem with multiple AGs in a state with ~2.5GB or so free
      blocks tracked across ~300 total allocbt blocks, but effectively at
      100% full because the free space is entirely consumed by
      refcountbt perag reservation.
      
      Such a large perag reservation is by design on large filesystems.
      The problem is that because the free space is so fragmented, this AG
      contributes the 300 or so allocbt blocks to the global counters as
      free space. If this pattern repeats across enough AGs, the
      filesystem lands in a state where global block reservation can
      outrun physical block availability. For example, a streaming
      buffered write on the affected filesystem continues to allow delayed
      allocation beyond the point where writeback starts to fail due to
      physical block allocation failures. The expected behavior is for the
      delalloc block reservation to fail gracefully with -ENOSPC before
      physical block allocation failure is a possibility.
      
      To address this problem, set aside in-use allocbt blocks at
      reservation time and thus ensure they cannot be reserved until truly
      available for physical allocation. This allows alloc btree metadata
      to continue to reside in free space, but dynamically adjusts
      reservation availability based on internal state. Note that the
      logic requires that the allocbt counter is fully populated at
      reservation time before it is fully effective. We currently rely on
      the mount time AGF scan in the perag reservation initialization code
      for this dependency on filesystems where it's most important (i.e.
      with active perag reservations).
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      fd43cf60
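
      A sketch of the set-aside arithmetic with invented names: allocbt
      blocks stay in the on-disk free-space count, but are deducted when
      deciding whether a reservation can be granted:

      #include <stdbool.h>

      /* Parameter names are illustrative, not the real XFS counters. */
      static bool example_can_reserve(unsigned long long fdblocks,
                                      unsigned long long allocbt_blocks,
                                      unsigned long long request)
      {
              if (fdblocks <= allocbt_blocks)
                      return false;
              return request <= fdblocks - allocbt_blocks;
      }
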
  7. 08 Apr, 2021 (1 commit)
  8. 15 Mar, 2021 (1 commit)
    • xfs: force log and push AIL to clear pinned inodes when aborting mount · d336f7eb
      Committed by Darrick J. Wong
      If we allocate quota inodes in the process of mounting a filesystem but
      then decide to abort the mount, it's possible that the quota inodes are
      sitting around pinned by the log.  Now that inode reclaim relies on the
      AIL to flush inodes, we have to force the log and push the AIL in
      between releasing the quota inodes and kicking off reclaim to tear down
      all the incore inodes.  Do this by extracting the bits we need from the
      unmount path and reusing them.  As an added bonus, failed writes during
      a failed mount will not retry forever now.
      
      This was originally found during a fuzz test of metadata directories
      (xfs/1546), but the actual symptom was that reclaim hung up on the quota
      inodes.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      d336f7eb
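
      The teardown ordering, sketched with invented helpers that mirror
      the operations the commit names (log force, AIL push, inode
      reclaim):

      /* Illustrative declarations; the real calls are XFS-internal. */
      struct example_mount;
      void example_log_force_sync(struct example_mount *mp);
      void example_ail_push_all_sync(struct example_mount *mp);
      void example_reclaim_all_inodes(struct example_mount *mp);

      /* Mount-failure teardown: unpin everything the log still holds
       * (quota inodes included) before reclaim tears down incore
       * inodes, since reclaim now relies on the AIL to flush them. */
      static void example_abort_mount_teardown(struct example_mount *mp)
      {
              example_log_force_sync(mp);
              example_ail_push_all_sync(mp);
              example_reclaim_all_inodes(mp);
      }
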
  9. 04 Feb, 2021 (2 commits)
  10. 23 Jan, 2021 (4 commits)
  11. 19 Nov, 2020 (1 commit)
  12. 16 Sep, 2020 (1 commit)
  13. 07 Sep, 2020 (2 commits)
  14. 07 Jul, 2020 (2 commits)
    • xfs: remove SYNC_WAIT from xfs_reclaim_inodes() · 4d0bab3a
      Committed by Dave Chinner
      Clean up xfs_reclaim_inodes() callers. Most callers want blocking
      behaviour, so just make the existing SYNC_WAIT behaviour the
      default.
      
      For the xfs_reclaim_worker(), just call xfs_reclaim_inodes_ag()
      directly because we just want optimistic clean inode reclaim to be
      done in the background.
      
      For xfs_quiesce_attr() we can just remove the inode reclaim calls as
      they are a historic relic that was required to flush dirty inodes
      that contained unlogged changes. We now log all changes to the
      inodes, so the sync AIL push from xfs_log_quiesce() called by
      xfs_quiesce_attr() will do all the required inode writeback for
      freeze.
      
      Seeing as we now want to loop until all reclaimable inodes have been
      reclaimed, make xfs_reclaim_inodes() loop on the XFS_ICI_RECLAIM_TAG
      tag rather than having xfs_reclaim_inodes_ag() tell it that inodes
      were skipped. This is much more reliable and will always loop until
      all reclaimable inodes are reclaimed.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      4d0bab3a
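
      A sketch of the resulting loop, with invented stand-ins for the
      per-AG reclaim machinery:

      #include <stdbool.h>

      struct example_mount;
      bool example_reclaim_tag_set(struct example_mount *mp);   /* XFS_ICI_RECLAIM_TAG */
      void example_reclaim_inodes_ag(struct example_mount *mp);

      /* Blocking reclaim: loop on the reclaim tag itself rather than on
       * a "skipped" return value, so we run until nothing is left. */
      static void example_reclaim_inodes(struct example_mount *mp)
      {
              while (example_reclaim_tag_set(mp))
                      example_reclaim_inodes_ag(mp);
      }
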
    • xfs: allow multiple reclaimers per AG · 0e8e2c63
      Committed by Dave Chinner
      Inode reclaim will still throttle direct reclaim on the per-ag
      reclaim locks. This is no longer necessary as reclaim can run
      non-blocking now. Hence we can remove these locks so that we don't
      arbitrarily block reclaimers just because there are more direct
      reclaimers than there are AGs.
      
      This can result in multiple reclaimers working on the same range of
      an AG, but this doesn't cause any apparent issues. Optimising the
      spread of concurrent reclaimers for best efficiency can be done in a
      future patchset.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      0e8e2c63
  15. 27 May, 2020 (1 commit)
    • xfs: reduce free inode accounting overhead · f18c9a90
      Committed by Dave Chinner
      Shaokun Zhang reported that XFS was using substantial CPU time in
      percpu_counter_sum(), called from xfs_mod_ifree(), when running a
      single threaded benchmark on a high CPU count (128p) machine. The
      issue is that the filesystem is empty when the benchmark runs, so
      inode allocation is running with a very low inode free count.
      
      With the percpu counter batching, this means comparisons when the
      counter is less than 128 * 256 = 32768 use the slow path of adding
      up all the counters across the CPUs, and this is expensive on high
      CPU count machines.
      
      The summing in xfs_mod_ifree() is only used to fire an assert if an
      underrun occurs. The error is ignored by the higher level code.
      Hence this is really just debug code and we don't need to run it
      on production kernels, nor do we need such debug checks to return
      error values just to trigger an assert.
      
      Finally, xfs_mod_icount/xfs_mod_ifree are only called from
      xfs_trans_unreserve_and_mod_sb(), so get rid of them and just
      directly call the percpu_counter_add/percpu_counter_compare
      functions. The compare functions are now run only on debug builds as
      they are internal to ASSERT() checks and so only compiled in when
      ASSERTs are active (CONFIG_XFS_DEBUG=y or CONFIG_XFS_WARN=y).
      Reported-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      f18c9a90
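
      A sketch of the resulting pattern; EXAMPLE_ASSERT is an invented
      stand-in for XFS's ASSERT(), which only compiles to real code on
      debug builds (CONFIG_XFS_DEBUG / CONFIG_XFS_WARN):

      #include <linux/bug.h>
      #include <linux/percpu_counter.h>

      #ifdef DEBUG
      #define EXAMPLE_ASSERT(x)       WARN_ON(!(x))
      #else
      #define EXAMPLE_ASSERT(x)       ((void)0)
      #endif

      /* Add to the counter directly; the expensive cross-CPU sum hidden
       * inside percpu_counter_compare() now only runs in the assert. */
      static void example_mod_ifree(struct percpu_counter *ifree, long delta)
      {
              percpu_counter_add(ifree, delta);
              EXAMPLE_ASSERT(percpu_counter_compare(ifree, 0) >= 0);
      }
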
  16. 05 May, 2020 (1 commit)
  17. 12 Mar, 2020 (1 commit)
  18. 19 Dec, 2019 (2 commits)
    • xfs: don't commit sunit/swidth updates to disk if that would cause repair failures · 13eaec4b
      Committed by Darrick J. Wong
      Alex Lyakas reported[1] that mounting an xfs filesystem with new sunit
      and swidth values could cause xfs_repair to fail loudly.  The problem
      here is that repair calculates where mkfs should have allocated the
      root inode, based on the superblock geometry.  The allocation decisions
      depend on sunit, which means that we really can't go updating sunit if
      it would lead to a subsequent repair failure on an otherwise correct
      filesystem.
      
      Port from xfs_repair some code that computes the location of the root
      inode and teach mount to skip the ondisk update if it would cause
      problems for repair.  Along the way we'll update the documentation,
      provide a function for computing the minimum AGFL size instead of
      open-coding it, and cut down some indenting in the mount code.
      
      Note that we allow the mount to proceed (and new allocations will
      reflect this new geometry) because we've never screened this kind of
      thing before.  We'll have to wait for a new future incompat feature to
      enforce correct behavior, alas.
      
      Note that the geometry reporting always uses the superblock values, not
      the incore ones, so that is what xfs_info and xfs_growfs will report.
      
      [1] https://lore.kernel.org/linux-xfs/20191125130744.GA44777@bfoster/T/#m00f9594b511e076e2fcdd489d78bc30216d72a7d
      Reported-by: Alex Lyakas <alex@zadara.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      13eaec4b
    • xfs: split the sunit parameter update into two parts · 4f5b1b3a
      Committed by Darrick J. Wong
      If the administrator provided a sunit= mount option, we need to validate
      the raw parameter, convert the mount option units (512b blocks) into the
      internal unit (fs blocks), and then validate that the (now cooked)
      parameter doesn't screw anything up on disk.  The incore inode geometry
      computation can depend on the new sunit option, but a subsequent patch
      will make validating the cooked value depend on the computed inode
      geometry, so break the sunit update into two steps.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      4f5b1b3a
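
      The unit conversion in the first step, sketched with an invented
      helper (XFS does this with its own block-conversion macros):

      /* The raw sunit mount option is in 512-byte basic blocks; convert
       * it to filesystem blocks before the later validation against the
       * computed inode geometry. Illustrative helper, not the real one. */
      static unsigned int example_sunit_to_fsblocks(unsigned int sunit_bb,
                                                    unsigned int blocksize)
      {
              return (sunit_bb << 9) / blocksize;  /* 512b units -> fs blocks */
      }
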
  19. 14 Nov, 2019 (1 commit)
  20. 06 Nov, 2019 (1 commit)
  21. 30 Oct, 2019 (4 commits)
  22. 06 Sep, 2019 (1 commit)
  23. 27 Aug, 2019 (1 commit)