1. 10 Aug 2021, 3 commits
    • xfs: don't run speculative preallocation gc when fs is frozen · 6f649091
      Darrick J. Wong committed
      Now that we have the infrastructure to switch background workers on and
      off at will, fix the block gc worker code so that we don't actually run
      the worker when the filesystem is frozen, same as we do for deferred
      inactivation.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
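
      As a minimal sketch of the check described above (hypothetical
      names, not the actual fs/xfs/xfs_icache.c code), the worker is only
      (re)queued while the superblock is unfrozen, so a frozen filesystem
      quiesces background block gc:

      #include <linux/fs.h>
      #include <linux/workqueue.h>

      struct xfs_mount_sketch {
              struct super_block      *m_super;
              struct delayed_work     m_blockgc_work;
      };

      static void blockgc_queue_sketch(struct xfs_mount_sketch *mp)
      {
              /*
               * A frozen filesystem must not run transactions, so only
               * schedule the background block gc worker while unfrozen.
               */
              if (mp->m_super->s_writers.frozen == SB_UNFROZEN)
                      queue_delayed_work(system_unbound_wq,
                                         &mp->m_blockgc_work, 5 * HZ);
      }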
    • xfs: queue inactivation immediately when free realtime extents are tight · 65f03d86
      Darrick J. Wong committed
      Now that we have made the inactivation of unlinked inodes a background
      task to increase the throughput of file deletions, we need to be a
      little more careful about how long a delay we can tolerate.
      
      Similar to the patch doing this for free space on the data device, if
      the file being inactivated is a realtime file and the realtime volume is
      running low on free extents, we want to run the worker ASAP so that the
      realtime allocator can make better decisions.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
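
      A minimal sketch of that policy, with hypothetical names and a 5%
      threshold standing in for whatever cutoff the real code uses:

      #include <linux/jiffies.h>
      #include <linux/types.h>

      struct xfs_mount_sketch {
              uint64_t        m_sb_rextents;  /* total rt extents */
              uint64_t        m_sb_frextents; /* free rt extents */
      };

      /* Pick the queueing delay for an inode about to be inactivated. */
      static unsigned long inodegc_delay_sketch(struct xfs_mount_sketch *mp,
                                                bool is_rt_file)
      {
              /* Realtime volume running low on free extents: run ASAP. */
              if (is_rt_file && mp->m_sb_frextents < mp->m_sb_rextents / 20)
                      return 0;
              return msecs_to_jiffies(100);   /* normal background delay */
      }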
    • xfs: queue inactivation immediately when free space is tight · 7d6f07d2
      Darrick J. Wong committed
      Now that we have made the inactivation of unlinked inodes a background
      task to increase the throughput of file deletions, we need to be a
      little more careful about how long a delay we can tolerate.
      
      On a mostly empty filesystem, the risk of the allocator making poor
      decisions due to fragmentation of the free space on account of a lengthy
      delay in background updates is minimal because there's plenty of space.
      However, if free space is tight, we want to deallocate unlinked inodes
      as quickly as possible to avoid fallocate ENOSPC and to give the
      allocator the best shot at optimal allocations for new writes.
      
      Therefore, queue the percpu worker immediately if the filesystem is more
      than 95% full.  This follows the same principle that XFS becomes less
      aggressive about speculative allocations and lazy cleanup (and more
      precise about accounting) when nearing full.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
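
      A minimal sketch of the trigger, assuming a percpu free-block
      counter like the kernel's m_fdblocks (names are illustrative):

      #include <linux/percpu_counter.h>

      struct xfs_mount_sketch {
              uint64_t                m_sb_dblocks;   /* fs size in blocks */
              struct percpu_counter   m_fdblocks;     /* free data blocks */
      };

      /* True when less than 5% of the data device remains free. */
      static bool inodegc_want_queue_now_sketch(struct xfs_mount_sketch *mp)
      {
              return percpu_counter_compare(&mp->m_fdblocks,
                                            mp->m_sb_dblocks / 20) < 0;
      }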
  2. 07 Aug 2021, 2 commits
    • xfs: per-cpu deferred inode inactivation queues · ab23a776
      Dave Chinner committed
      Move inode inactivation to background work contexts so that it no
      longer runs in the context that releases the final reference to an
      inode. This allows work that would otherwise block on inactivation
      to continue while the filesystem processes the inactivation in the
      background.
      
      A typical demonstration of this is unlinking an inode with lots of
      extents. The extents are removed during inactivation, so this blocks
      the process that unlinked the inode from the directory structure. By
      moving the inactivation to the background process, the userspace
      application can keep working (e.g. unlinking the next inode in the
      directory) while the inactivation work on the previous inode is
      done by a different CPU.
      
      The implementation of the queue is relatively simple. We use a
      per-cpu lockless linked list (llist) to queue inodes for
      inactivation without requiring serialisation mechanisms, and a work
      item to allow the queue to be processed by a CPU bound worker
      thread. We also keep a count of the queue depth so that we can
      trigger work after a number of deferred inactivations have been
      queued.
      
      The use of a bound workqueue with a single work depth allows the
      workqueue to run one work item per CPU. We queue the work item on
      the CPU we are currently running on, and so this essentially gives
      us affine per-cpu worker threads for the per-cpu queues. This
      maintains the effective CPU affinity that occurs within XFS at the
      AG level due to all objects in a directory being local to an AG.
      Hence inactivation work tends to run on the same CPU that last
      accessed all the objects that inactivation accesses and this
      maintains hot CPU caches for unlink workloads.
      
      A depth of 32 inodes was chosen to match the number of inodes in an
      inode cluster buffer. This hopefully allows sequential
      allocation/unlink behaviour to defer inactivation of all the
      inodes in a single cluster buffer at a time, further helping
      maintain hot CPU and buffer cache accesses while running
      inactivations.
      
      A hard per-cpu queue throttle of 256 inodes has been set to avoid
      runaway queuing when inodes that take a long time to inactivate are
      being processed. For example, when unlinking inodes with large
      numbers of extents that can take a lot of processing to free.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      [djwong: tweak comments and tracepoints, convert opflags to state bits]
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
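
      A minimal sketch of the queueing side, with hypothetical names; the
      batch size of 32 and the 256-inode throttle follow the description
      above:

      #include <linux/llist.h>
      #include <linux/percpu.h>
      #include <linux/smp.h>
      #include <linux/workqueue.h>

      #define INODEGC_BATCH   32      /* inodes per cluster buffer */
      #define INODEGC_MAX     256     /* hard per-cpu backlog throttle */

      struct inodegc_pcpu_sketch {
              struct llist_head       list;   /* lockless per-cpu queue */
              struct work_struct      work;   /* CPU-bound worker */
              unsigned int            items;  /* current queue depth */
      };

      struct inode_sketch {
              struct llist_node       gclist;
      };

      static void inodegc_queue_sketch(struct inodegc_pcpu_sketch __percpu *pcpu,
                                       struct workqueue_struct *wq,
                                       struct inode_sketch *ip)
      {
              struct inodegc_pcpu_sketch *gc = get_cpu_ptr(pcpu);

              /* llist_add() is lockless; no serialisation is needed. */
              llist_add(&ip->gclist, &gc->list);

              /*
               * Kick the worker on this CPU once a cluster buffer's worth
               * of inodes has queued; callers would be throttled once
               * items passes INODEGC_MAX.
               */
              if (++gc->items >= INODEGC_BATCH)
                      queue_work_on(smp_processor_id(), wq, &gc->work);
              put_cpu_ptr(pcpu);
      }

      static void inodegc_worker_sketch(struct work_struct *work)
      {
              struct inodegc_pcpu_sketch *gc =
                      container_of(work, struct inodegc_pcpu_sketch, work);
              struct llist_node *node = llist_del_all(&gc->list);

              gc->items = 0;
              /* ... walk 'node' and inactivate each queued inode ... */
      }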
    • xfs: introduce all-mounts list for cpu hotplug notifications · 0ed17f01
      Dave Chinner committed
      The inode inactivation and CIL tracking percpu structures are
      per-xfs_mount structures. That means when we get a CPU dead
      notification, we then need to iterate all the per-cpu structure
      instances to process them. Rather than keeping linked lists of
      per-cpu structures in each subsystem, add a list of all xfs_mounts
      that the generic xfs_cpu_dead() function will iterate and call into
      each subsystem appropriately.
      
      This allows us to handle both per-mount and global XFS percpu state
      from xfs_cpu_dead(), and avoids the need to link subsystem
      structures that can be easily found from the xfs_mount into their
      own global lists.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      [djwong: expand some comments about mount list setup ordering rules]
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
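
      A minimal sketch of the all-mounts list (illustrative names;
      registration of the hotplug callback is elided):

      #include <linux/list.h>
      #include <linux/spinlock.h>

      static LIST_HEAD(mount_list_sketch);
      static DEFINE_SPINLOCK(mount_list_lock_sketch);

      struct xfs_mount_sketch {
              struct list_head        m_mount_list;
      };

      /* Added early in mount setup, before per-cpu state can be queued. */
      static void mount_list_add_sketch(struct xfs_mount_sketch *mp)
      {
              spin_lock(&mount_list_lock_sketch);
              list_add_tail(&mp->m_mount_list, &mount_list_sketch);
              spin_unlock(&mount_list_lock_sketch);
      }

      /*
       * CPUHP "dead" callback: walk every mount and drain the per-cpu
       * inodegc and CIL state belonging to the departed CPU.
       */
      static int cpu_dead_sketch(unsigned int cpu)
      {
              struct xfs_mount_sketch *mp;

              spin_lock(&mount_list_lock_sketch);
              list_for_each_entry(mp, &mount_list_sketch, m_mount_list) {
                      /* ... per-mount teardown of per-cpu state for 'cpu' ... */
              }
              spin_unlock(&mount_list_lock_sketch);
              return 0;
      }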
  3. 02 Jun 2021, 2 commits
    • xfs: move perag structure and setup to libxfs/xfs_ag.[ch] · 07b6403a
      Dave Chinner committed
      Move the xfs_perag infrastructure to the libxfs files that contain
      all the per AG infrastructure. This helps set up for passing perags
      around all the code instead of bare agnos with minimal extra
      includes for existing files.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: prepare for moving perag definitions and support to libxfs · 61aa005a
      Dave Chinner committed
      The perag structures really need to be defined with the rest of the
      AG support infrastructure. The struct xfs_perag definition and its
      init/teardown code were placed in xfs_mount.[ch] because the
      structure differs between kernel and userspace; mainly, userspace
      doesn't have a lot of the internal state that the kernel keeps for
      caches, discard and other such mechanisms.
      
      However, it makes more sense to move this to libxfs than to keep
      this separation because we are now moving to use struct perags
      everywhere in the code instead of passing raw agnumber_t values
      about. Hence we should really move the support infrastructure to
      libxfs/xfs_ag.[ch].
      
      To do this without breaking userspace, first we need to rearrange
      the structures and code so that all the kernel specific code is
      located together. This makes it simple for userspace to ifdef out
      all the parts it does not need, minimising the code differences
      between kernel and userspace. The next commit will do the move...
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
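
      A minimal sketch of the layout this rearrangement aims for (field
      names are illustrative, not the real struct xfs_perag):

      #ifdef __KERNEL__
      #include <linux/spinlock.h>
      #endif

      struct xfs_perag_sketch {
              /* Fields shared between kernel and userspace (xfsprogs). */
              void            *pag_mount;
              uint32_t        pag_agno;

      #ifdef __KERNEL__
              /*
               * Kernel-only state (inode cache, discard, reclaim, ...)
               * grouped together so userspace drops it with one ifdef.
               */
              spinlock_t      pag_ici_lock;
              /* ... */
      #endif
      };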
  4. 29 Apr 2021, 1 commit
  5. 26 Mar 2021, 1 commit
  6. 04 Feb 2021, 2 commits
  7. 23 Jan 2021, 1 commit
  8. 16 Sep 2020, 1 commit
  9. 07 Jul 2020, 1 commit
  10. 30 May 2020, 2 commits
  11. 27 May 2020, 3 commits
    • xfs: remove the m_active_trans counter · b41b46c2
      Dave Chinner committed
      It's a global atomic counter, and we are hitting it at a rate of
      half a million transactions a second, so it's bouncing the counter
      cacheline all over the place on large machines. We don't actually
      need it anymore - it used to be required because the VFS freeze code
      could not track/prevent filesystem transactions that were running,
      but that problem no longer exists.
      
      Hence to remove the counter, we simply have to ensure that nothing
      calls xfs_sync_sb() while we are trying to quiesce the filesystem.
      That only happens if the log worker is still running when we call
      xfs_quiesce_attr(). The log worker is cancelled at the end of
      xfs_quiesce_attr() by calling xfs_log_quiesce(), so just call it
      early here and then we can remove the counter altogether.
      
      Concurrent create, 50 million inodes, identical 16p/16GB virtual
      machines on different physical hosts. Machine A has twice the CPU
      cores per socket of machine B:
      
      		unpatched	patched
      machine A:	3m16s		2m00s
      machine B:	4m04s		4m05s
      
      Create rates:
      		unpatched	patched
      machine A:	282k+/-31k	468k+/-21k
      machine B:	231k+/-8k	233k+/-11k
      
      Concurrent rm of same 50 million inodes:
      
      		unpatched	patched
      machine A:	6m42s		2m33s
      machine B:	4m47s		4m47s
      
      The transaction rate on the fast machine went from just under
      300k/sec to 700k/sec, which indicates just how much of a bottleneck
      this atomic counter was.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
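
      A minimal sketch of the reordering described above, with stubbed
      helpers standing in for the real xfs_log_quiesce()/xfs_sync_sb():

      struct xfs_mount_sketch { /* ... */ };

      /* Cancel the log worker; no further xfs_sync_sb() calls happen. */
      static void log_quiesce_sketch(struct xfs_mount_sketch *mp) { }

      /* Flush remaining dirty state and write the superblock. */
      static void flush_and_sync_sketch(struct xfs_mount_sketch *mp) { }

      static void quiesce_attr_sketch(struct xfs_mount_sketch *mp)
      {
              /*
               * Quiescing the log first means nothing can race with the
               * superblock writes below, so the m_active_trans atomic is
               * no longer needed to fence running transactions.
               */
              log_quiesce_sketch(mp);
              flush_and_sync_sketch(mp);
      }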
    • xfs: separate read-only variables in struct xfs_mount · b0dff466
      Dave Chinner committed
      We are seeing massive CPU usage from xfs_agino_range() on one
      machine; instruction-level profiles look similar to another machine
      running the same workload, yet one machine is consuming 10x as much
      CPU as the other and going much slower. The only real difference
      between the two machines is the core count per socket. Both are
      running identical 16p/16GB virtual machine configurations.
      
      Machine A:
      
        25.83%  [k] xfs_agino_range
        12.68%  [k] __xfs_dir3_data_check
         6.95%  [k] xfs_verify_ino
         6.78%  [k] xfs_dir2_data_entry_tag_p
         3.56%  [k] xfs_buf_find
         2.31%  [k] xfs_verify_dir_ino
         2.02%  [k] xfs_dabuf_map.constprop.0
         1.65%  [k] xfs_ag_block_count
      
      And takes around 13 minutes to remove 50 million inodes.
      
      Machine B:
      
        13.90%  [k] __pv_queued_spin_lock_slowpath
         3.76%  [k] do_raw_spin_lock
         2.83%  [k] xfs_dir3_leaf_check_int
         2.75%  [k] xfs_agino_range
         2.51%  [k] __raw_callee_save___pv_queued_spin_unlock
         2.18%  [k] __xfs_dir3_data_check
         2.02%  [k] xfs_log_commit_cil
      
      And takes around 5m30s to remove 50 million inodes.
      
      The suspect is cacheline contention on m_sectbb_log, which is used
      in one of the macros in xfs_agino_range. This is a read-only
      variable, but it shares a cacheline with m_active_trans, a global
      atomic that gets bounced all around the machine.
      
      The workload is trying to run hundreds of thousands of transactions
      per second and hence cacheline contention will be occurring on this
      atomic counter. Hence xfs_agino_range() is likely just an
      innocent bystander as the cache coherency protocol fights over the
      cacheline between CPU cores and sockets.
      
      On machine A, this rearrangement of the struct xfs_mount
      results in the profile changing to:
      
         9.77%  [kernel]  [k] xfs_agino_range
         6.27%  [kernel]  [k] __xfs_dir3_data_check
         5.31%  [kernel]  [k] __pv_queued_spin_lock_slowpath
         4.54%  [kernel]  [k] xfs_buf_find
         3.79%  [kernel]  [k] do_raw_spin_lock
         3.39%  [kernel]  [k] xfs_verify_ino
         2.73%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
      
      Vastly less CPU usage in xfs_agino_range(), but still 3x that of
      machine B, and it still runs substantially slower than it should.
      
      Current rm -rf of 50 million files:
      
      		vanilla		patched
      machine A	13m20s		6m42s
      machine B	5m30s		5m02s
      
      It's an improvement, indicating that separation and further
      optimisation of read-only global filesystem data is worthwhile, but
      it clearly isn't the underlying issue causing this specific
      performance degradation.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
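
      A minimal sketch of the separation (illustrative fields only):

      #include <linux/cache.h>
      #include <linux/types.h>

      struct xfs_mount_sketch {
              /* Read-mostly geometry: set at mount time, never rewritten. */
              uint8_t         m_sectbb_log;
              uint8_t         m_blkbb_log;
              /* ... more read-only constants ... */

              /*
               * Frequently-written counters start on their own cacheline
               * so they cannot drag the read-only data above into the
               * cache coherency traffic.
               */
              atomic_t        m_active_trans ____cacheline_aligned_in_smp;
      };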
    • xfs: reduce free inode accounting overhead · f18c9a90
      Dave Chinner committed
      Shaokun Zhang reported that XFS was using substantial CPU time in
      percpu_counter_sum(), called from xfs_mod_ifree(), when running a
      single threaded benchmark on a high CPU count (128p) machine. The
      issue is that the filesystem is empty when the benchmark runs, so
      inode allocation is running with a very low inode free count.
      
      With the percpu counter batching, this means comparisons when the
      counter is less than 128 * 256 = 32768 use the slow path of adding
      up all the counters across the CPUs, and this is expensive on high
      CPU count machines.
      
      The summing in xfs_mod_ifree() is only used to fire an assert if an
      underrun occurs. The error is ignored by the higher level code.
      Hence this is really just debug code and we don't need to run it
      on production kernels, nor do we need such debug checks to return
      error values just to trigger an assert.
      
      Finally, xfs_mod_icount/xfs_mod_ifree are only called from
      xfs_trans_unreserve_and_mod_sb(), so get rid of them and just
      directly call the percpu_counter_add/percpu_counter_compare
      functions. The compare functions are now run only on debug builds as
      they are internal to ASSERT() checks and so only compiled in when
      ASSERTs are active (CONFIG_XFS_DEBUG=y or CONFIG_XFS_WARN=y).
      Reported-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
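
      A minimal sketch of the resulting pattern, with a stand-in for
      XFS's ASSERT() (which compiles away on production builds):

      #include <linux/bug.h>
      #include <linux/percpu_counter.h>

      #define ASSERT_SKETCH(expr)     WARN_ON(!(expr))

      static void unreserve_ifree_sketch(struct percpu_counter *ifree,
                                         int64_t delta)
      {
              percpu_counter_add(ifree, delta);
              /*
               * The expensive cross-CPU sum now happens only inside the
               * debug check, not on every counter modification.
               */
              ASSERT_SKETCH(percpu_counter_compare(ifree, 0) >= 0);
      }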
  12. 07 May 2020, 1 commit
  13. 05 May 2020, 2 commits
  14. 17 Apr 2020, 1 commit
    • xfs: move inode flush to the sync workqueue · f0f7a674
      Darrick J. Wong committed
      Move the inode dirty data flushing to a workqueue so that multiple
      threads can take advantage of a single thread's flushing work.  The
      ratelimiting technique used in bdd4ee4 was not successful, because
      threads that skipped the inode flush scan due to ratelimiting would
      ENOSPC early, which caused occasional (but noticeable) changes in
      behavior and sporadic fstest regressions.
      
      Therefore, make all the writer threads wait on a single inode flush,
      which eliminates both the stampeding hordes of flushers and the small
      window in which a write could fail with ENOSPC because it lost the
      ratelimit race even after another thread freed space.
      
      Fixes: c6425702 ("xfs: ratelimit inode flush on buffered write ENOSPC")
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
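
      A minimal sketch of the coalescing pattern (names approximate the
      commit and should be treated as illustrative):

      #include <linux/workqueue.h>

      struct xfs_mount_sketch {
              struct workqueue_struct *m_sync_workqueue;
              struct work_struct      m_flush_inodes_work;
      };

      static void flush_inodes_sketch(struct xfs_mount_sketch *mp)
      {
              /*
               * queue_work() is a no-op while the work item is already
               * pending, so concurrent writers coalesce onto a single
               * flush; flush_work() then makes every caller wait until
               * that one flush completes.
               */
              queue_work(mp->m_sync_workqueue, &mp->m_flush_inodes_work);
              flush_work(&mp->m_flush_inodes_work);
      }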
  15. 31 Mar 2020, 1 commit
    • xfs: ratelimit inode flush on buffered write ENOSPC · c6425702
      Darrick J. Wong committed
      A customer reported rcu stalls and softlockup warnings on a computer
      with many CPU cores and many many more IO threads trying to write to a
      filesystem that is totally out of space.  Subsequent analysis pointed to
      the many many IO threads calling xfs_flush_inodes -> sync_inodes_sb,
      which causes a lot of wb_writeback_work to be queued.  The writeback
      worker spends so much time trying to wake the many many threads waiting
      for writeback completion that it trips the softlockup detector, and (in
      this case) the system automatically reboots.
      
      In addition, they complain that the lengthy xfs_flush_inodes scan traps
      all of those threads in uninterruptible sleep, which hampers their
      ability to kill the program or do anything else to escape the situation.
      
      If there's thousands of threads trying to write to files on a full
      filesystem, each of those threads will start separate copies of the
      inode flush scan.  This is kind of pointless since we only need one
      scan, so rate limit the inode flush.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
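
      A minimal sketch of the gate (a global ratelimit for brevity; the
      real state would live per-mount):

      #include <linux/fs.h>
      #include <linux/ratelimit.h>
      #include <linux/writeback.h>

      /* Allow roughly one flush scan per five-second window. */
      static DEFINE_RATELIMIT_STATE(flush_rl_sketch, 5 * HZ, 1);

      static void flush_inodes_sketch(struct super_block *sb)
      {
              /*
               * If a scan started recently, skip this one; the thread
               * that got through the gate is flushing for everyone.
               */
              if (!__ratelimit(&flush_rl_sketch))
                      return;
              sync_inodes_sb(sb);
      }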
  16. 14 Nov 2019, 3 commits
  17. 11 Nov 2019, 2 commits
  18. 06 Nov 2019, 2 commits
  19. 30 Oct 2019, 7 commits
  20. 27 Aug 2019, 1 commit
    • xfs: add kmem allocation trace points · 0ad95687
      Dave Chinner committed
      When trying to correlate XFS kernel allocations to memory reclaim
      behaviour, it is useful to know what allocations XFS is actually
      attempting. This information is not directly available from
      tracepoints in the generic memory allocation and reclaim
      tracepoints, so these new trace points provide a high level
      indication of what the XFS memory demand actually is.
      
      There is no per-filesystem context in this code, so we just trace
      the type of allocation, the size and the allocation constraints.
      The kmem code also doesn't include much of the common XFS headers,
      so there are a few definitions that need to be added to the trace
      headers and a couple of types that need to be made common to avoid
      needing to include the whole world in the kmem code.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
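
      A minimal sketch of what such a tracepoint definition looks like
      (illustrative only; the real definitions live in fs/xfs/xfs_trace.h
      and decode XFS-specific allocation flags):

      #include <linux/tracepoint.h>

      TRACE_EVENT(kmem_alloc_sketch,
              TP_PROTO(size_t size, int flags, unsigned long caller_ip),
              TP_ARGS(size, flags, caller_ip),
              TP_STRUCT__entry(
                      __field(size_t, size)
                      __field(int, flags)
                      __field(unsigned long, caller_ip)
              ),
              TP_fast_assign(
                      __entry->size = size;
                      __entry->flags = flags;
                      __entry->caller_ip = caller_ip;
              ),
              TP_printk("size %zu flags 0x%x caller %pS",
                        __entry->size, __entry->flags,
                        (void *)__entry->caller_ip)
      );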
  21. 29 Jun 2019, 1 commit