1. 26 Apr, 2023 1 commit
  2. 04 Apr, 2023 2 commits
  3. 19 Jan, 2023 1 commit
  4. 07 Jan, 2022 9 commits
  5. 27 Dec, 2021 1 commit
  6. 16 Sep, 2020 1 commit
  7. 07 Jul, 2020 1 commit
  8. 30 May, 2020 2 commits
  9. 27 May, 2020 3 commits
    • xfs: remove the m_active_trans counter · b41b46c2
      Committed by Dave Chinner
      It's a global atomic counter, and we are hitting it at a rate of
      half a million transactions a second, so it's bouncing the counter
      cacheline all over the place on large machines. We don't actually
      need it anymore - it used to be required because the VFS freeze code
      could not track/prevent filesystem transactions that were running,
      but that problem no longer exists.
      
      Hence to remove the counter, we simply have to ensure that nothing
      calls xfs_sync_sb() while we are trying to quiesce the filesystem.
      That only happens if the log worker is still running when we call
      xfs_quiesce_attr(). The log worker is cancelled at the end of
      xfs_quiesce_attr() by calling xfs_log_quiesce(), so just call it
      early here and then we can remove the counter altogether.
      
      Concurrent create, 50 million inodes, identical 16p/16GB virtual
      machines on different physical hosts. Machine A has twice the CPU
      cores per socket of machine B:
      
      		unpatched	patched
      machine A:	3m16s		2m00s
      machine B:	4m04s		4m05s
      
      Create rates:
      		unpatched	patched
      machine A:	282k+/-31k	468k+/-21k
      machine B:	231k+/-8k	233k+/-11k
      
      Concurrent rm of same 50 million inodes:
      
      		unpatched	patched
      machine A:	6m42s		2m33s
      machine B:	4m47s		4m47s
      
      The transaction rate on the fast machine went from just under
      300k/sec to 700k/sec, which indicates just how much of a bottleneck
      this atomic counter was.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: separate read-only variables in struct xfs_mount · b0dff466
      Committed by Dave Chinner
      Seeing massive cpu usage from xfs_agino_range() on one machine;
      instruction level profiles look similar to another machine running
      the same workload, only one machine is consuming 10x as much CPU as
      the other and going much slower. The only real difference between
      the two machines is core count per socket. Both are running
      identical 16p/16GB virtual machine configurations.
      
      Machine A:
      
        25.83%  [k] xfs_agino_range
        12.68%  [k] __xfs_dir3_data_check
         6.95%  [k] xfs_verify_ino
         6.78%  [k] xfs_dir2_data_entry_tag_p
         3.56%  [k] xfs_buf_find
         2.31%  [k] xfs_verify_dir_ino
         2.02%  [k] xfs_dabuf_map.constprop.0
         1.65%  [k] xfs_ag_block_count
      
      And takes around 13 minutes to remove 50 million inodes.
      
      Machine B:
      
        13.90%  [k] __pv_queued_spin_lock_slowpath
         3.76%  [k] do_raw_spin_lock
         2.83%  [k] xfs_dir3_leaf_check_int
         2.75%  [k] xfs_agino_range
         2.51%  [k] __raw_callee_save___pv_queued_spin_unlock
         2.18%  [k] __xfs_dir3_data_check
         2.02%  [k] xfs_log_commit_cil
      
      And takes around 5m30s to remove 50 million inodes.
      
      Suspect is cacheline contention on m_sectbb_log which is used in one
      of the macros in xfs_agino_range. This is a read-only variable but
      shares a cacheline with m_active_trans which is a global atomic that
      gets bounced all around the machine.
      
      The workload is trying to run hundreds of thousands of transactions
      per second and hence cacheline contention will be occurring on this
      atomic counter. Hence xfs_agino_range() is likely just an
      innocent bystander as the cache coherency protocol fights over the
      cacheline between CPU cores and sockets.
      
      On machine A, this rearrangement of the struct xfs_mount
      results in the profile changing to:
      
         9.77%  [kernel]  [k] xfs_agino_range
         6.27%  [kernel]  [k] __xfs_dir3_data_check
         5.31%  [kernel]  [k] __pv_queued_spin_lock_slowpath
         4.54%  [kernel]  [k] xfs_buf_find
         3.79%  [kernel]  [k] do_raw_spin_lock
         3.39%  [kernel]  [k] xfs_verify_ino
         2.73%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
      
      Vastly less CPU usage in xfs_agino_range(), but still 3x the amount
      of machine B and still runs substantially slower than it should.
      
      Current rm -rf of 50 million files:
      
      		vanilla		patched
      machine A	13m20s		6m42s
      machine B	5m30s		5m02s
      
      It's an improvement, hence indicating that separation and further
      optimisation of read-only global filesystem data is worthwhile, but
      it clearly isn't the underlying issue causing this specific
      performance degradation.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: reduce free inode accounting overhead · f18c9a90
      Committed by Dave Chinner
      Shaokun Zhang reported that XFS was using substantial CPU time in
      percpu_counter_sum() when running a single threaded benchmark on
      a high CPU count (128p) machine from xfs_mod_ifree(). The issue
      is that the filesystem is empty when the benchmark runs, so inode
      allocation is running with a very low inode free count.
      
      With the percpu counter batching, this means comparisons when the
      counter is less than 128 * 256 = 32768 use the slow path of adding
      up all the counters across the CPUs, and this is expensive on high
      CPU count machines.
      
      The summing in xfs_mod_ifree() is only used to fire an assert if an
      underrun occurs. The error is ignored by the higher level code.
      Hence this is really just debug code and we don't need to run it
      on production kernels, nor do we need such debug checks to return
      error values just to trigger an assert.
      
      Finally, xfs_mod_icount/xfs_mod_ifree are only called from
      xfs_trans_unreserve_and_mod_sb(), so get rid of them and just
      directly call the percpu_counter_add/percpu_counter_compare
      functions. The compare functions are now run only on debug builds as
      they are internal to ASSERT() checks and so only compiled in when
      ASSERTs are active (CONFIG_XFS_DEBUG=y or CONFIG_XFS_WARN=y).
      Reported-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  10. 07 May, 2020 1 commit
  11. 05 May, 2020 2 commits
  12. 17 Apr, 2020 1 commit
    • xfs: move inode flush to the sync workqueue · f0f7a674
      Committed by Darrick J. Wong
      Move the inode dirty data flushing to a workqueue so that multiple
      threads can take advantage of a single thread's flushing work.  The
      ratelimiting technique used in bdd4ee4 was not successful, because
      threads that skipped the inode flush scan due to ratelimiting would
      ENOSPC early, which caused occasional (but noticeable) changes in
      behavior and sporadic fstest regressions.
      
      Therefore, make all the writer threads wait on a single inode flush,
      which eliminates both the stampeding hordes of flushers and the small
      window in which a write could fail with ENOSPC because it lost the
      ratelimit race even after another thread freed space.
      
      Fixes: c6425702 ("xfs: ratelimit inode flush on buffered write ENOSPC")
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
  13. 31 Mar, 2020 1 commit
    • xfs: ratelimit inode flush on buffered write ENOSPC · c6425702
      Committed by Darrick J. Wong
      A customer reported rcu stalls and softlockup warnings on a computer
      with many CPU cores and many many more IO threads trying to write to a
      filesystem that is totally out of space.  Subsequent analysis pointed to
      the many many IO threads calling xfs_flush_inodes -> sync_inodes_sb,
      which causes a lot of wb_writeback_work to be queued.  The writeback
      worker spends so much time trying to wake the many many threads waiting
      for writeback completion that it trips the softlockup detector, and (in
      this case) the system automatically reboots.
      
      In addition, they complain that the lengthy xfs_flush_inodes scan traps
      all of those threads in uninterruptible sleep, which hampers their
      ability to kill the program or do anything else to escape the situation.
      
      If there are thousands of threads trying to write to files on a full
      filesystem, each of those threads will start separate copies of the
      inode flush scan.  This is kind of pointless since we only need one
      scan, so rate limit the inode flush.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
  14. 14 Nov, 2019 3 commits
  15. 11 Nov, 2019 2 commits
  16. 06 Nov, 2019 2 commits
  17. 30 Oct, 2019 7 commits