1. 07 Jan 2022 (1 commit)
  2. 27 Dec 2021 (1 commit)
  3. 16 Sep 2020 (1 commit)
  4. 07 Jul 2020 (1 commit)
  5. 30 May 2020 (2 commits)
  6. 27 May 2020 (3 commits)
    • xfs: remove the m_active_trans counter · b41b46c2
      Authored by Dave Chinner
      It's a global atomic counter, and we are hitting it at a rate of
      half a million transactions a second, so it's bouncing the counter
      cacheline all over the place on large machines. We don't actually
      need it anymore - it used to be required because the VFS freeze code
      could not track/prevent filesystem transactions that were running,
      but that problem no longer exists.
      
      Hence to remove the counter, we simply have to ensure that nothing
      calls xfs_sync_sb() while we are trying to quiesce the filesystem.
      That only happens if the log worker is still running when we call
      xfs_quiesce_attr(). The log worker is cancelled at the end of
      xfs_quiesce_attr() by calling xfs_log_quiesce(), so just call it
      early here and then we can remove the counter altogether.
      
      Concurrent create, 50 million inodes, identical 16p/16GB virtual
      machines on different physical hosts. Machine A has twice the CPU
      cores per socket of machine B:
      
      		unpatched	patched
      machine A:	3m16s		2m00s
      machine B:	4m04s		4m05s
      
      Create rates:
      		unpatched	patched
      machine A:	282k+/-31k	468k+/-21k
      machine B:	231k+/-8k	233k+/-11k
      
      Concurrent rm of same 50 million inodes:
      
      		unpatched	patched
      machine A:	6m42s		2m33s
      machine B:	4m47s		4m47s
      
      The transaction rate on the fast machine went from just under
      300k/sec to 700k/sec, which indicates just how much of a bottleneck
      this atomic counter was.
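
      To see why one shared atomic hurts at these rates, the following
      user-space sketch (plain C with pthreads, not XFS code) contrasts a
      single shared atomic counter with per-thread, cacheline-aligned
      counters; timing the two phases with perf stat shows the shared
      version falling well behind on multi-socket machines:

          /* Sketch: shared atomic counter vs. per-thread counters. */
          #include <pthread.h>
          #include <stdatomic.h>
          #include <stdio.h>

          #define NTHREADS 16
          #define ITERS    10000000L

          static atomic_long shared_counter;            /* one hot cacheline */

          struct local { _Alignas(64) long count; };    /* one line per thread */
          static struct local local_counter[NTHREADS];

          static void *bump_shared(void *arg)
          {
              for (long i = 0; i < ITERS; i++)
                  atomic_fetch_add(&shared_counter, 1); /* cacheline bounces */
              return NULL;
          }

          static void *bump_local(void *arg)
          {
              struct local *p = &local_counter[(long)arg];
              for (long i = 0; i < ITERS; i++)
                  p->count++;                           /* stays core-local */
              return NULL;
          }

          int main(void)
          {
              pthread_t tids[NTHREADS];

              for (long i = 0; i < NTHREADS; i++)
                  pthread_create(&tids[i], NULL, bump_shared, (void *)i);
              for (int i = 0; i < NTHREADS; i++)
                  pthread_join(tids[i], NULL);

              for (long i = 0; i < NTHREADS; i++)
                  pthread_create(&tids[i], NULL, bump_local, (void *)i);
              for (int i = 0; i < NTHREADS; i++)
                  pthread_join(tids[i], NULL);

              printf("shared counter = %ld\n", atomic_load(&shared_counter));
              return 0;
          }
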
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: separate read-only variables in struct xfs_mount · b0dff466
      Authored by Dave Chinner
      We are seeing massive CPU usage from xfs_agino_range() on one machine;
      instruction-level profiles look similar to another machine running the
      same workload, yet one machine is consuming 10x as much CPU as the
      other and going much slower. The only real difference between the two
      machines is the core count per socket. Both are running identical
      16p/16GB virtual machine configurations.
      
      Machine A:
      
        25.83%  [k] xfs_agino_range
        12.68%  [k] __xfs_dir3_data_check
         6.95%  [k] xfs_verify_ino
         6.78%  [k] xfs_dir2_data_entry_tag_p
         3.56%  [k] xfs_buf_find
         2.31%  [k] xfs_verify_dir_ino
         2.02%  [k] xfs_dabuf_map.constprop.0
         1.65%  [k] xfs_ag_block_count
      
      And takes around 13 minutes to remove 50 million inodes.
      
      Machine B:
      
        13.90%  [k] __pv_queued_spin_lock_slowpath
         3.76%  [k] do_raw_spin_lock
         2.83%  [k] xfs_dir3_leaf_check_int
         2.75%  [k] xfs_agino_range
         2.51%  [k] __raw_callee_save___pv_queued_spin_unlock
         2.18%  [k] __xfs_dir3_data_check
         2.02%  [k] xfs_log_commit_cil
      
      And takes around 5m30s to remove 50 million inodes.
      
      The suspect is cacheline contention on m_sectbb_log, which is used in
      one of the macros in xfs_agino_range. This is a read-only variable but
      it shares a cacheline with m_active_trans, which is a global atomic
      that gets bounced all around the machine.
      
      The workload is trying to run hundreds of thousands of transactions
      per second and hence cacheline contention will be occurring on this
      atomic counter. Hence xfs_agino_range() is likely just an
      innocent bystander as the cache coherency protocol fights over the
      cacheline between CPU cores and sockets.
      
      On machine A, this rearrangement of the struct xfs_mount
      results in the profile changing to:
      
         9.77%  [kernel]  [k] xfs_agino_range
         6.27%  [kernel]  [k] __xfs_dir3_data_check
         5.31%  [kernel]  [k] __pv_queued_spin_lock_slowpath
         4.54%  [kernel]  [k] xfs_buf_find
         3.79%  [kernel]  [k] do_raw_spin_lock
         3.39%  [kernel]  [k] xfs_verify_ino
         2.73%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
      
      Vastly less CPU usage in xfs_agino_range(), but still 3x that of
      machine B, and it still runs substantially slower than it should.
      
      Current rm -rf of 50 million files:
      
      		vanilla		patched
      machine A	13m20s		6m42s
      machine B	5m30s		5m02s
      
      It's an improvement, indicating that separating and further
      optimising read-only global filesystem data is worthwhile, but it
      clearly isn't the underlying issue causing this specific
      performance degradation.
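
      The fix follows the usual false-sharing remedy: keep read-mostly
      fields together and start frequently-written fields on their own
      cacheline. A minimal sketch of the layout idea (field names are
      illustrative, not the actual struct xfs_mount):

          /* Sketch: read-mostly fields separated from hot written fields. */
          #include <stdatomic.h>
          #include <stddef.h>

          #define CACHELINE 64

          struct mount_like {
              /* Read-mostly after mount: safely shared by all CPUs. */
              unsigned int sectbb_log;
              unsigned int blkbb_log;
              unsigned int ag_count;

              /*
               * Frequently modified state starts on a new cacheline so that
               * writes here do not invalidate the read-only fields above.
               */
              _Alignas(CACHELINE) atomic_long active_trans;
              atomic_long nr_dirty;
          };

          _Static_assert(offsetof(struct mount_like, active_trans) % CACHELINE == 0,
                         "hot fields must start on a cacheline boundary");
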
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: reduce free inode accounting overhead · f18c9a90
      Authored by Dave Chinner
      Shaokun Zhang reported that XFS was using substantial CPU time in
      percpu_counter_sum(), called from xfs_mod_ifree(), when running a
      single-threaded benchmark on a high CPU count (128p) machine. The
      issue is that the filesystem is empty when the benchmark runs, so
      inode allocation is running with a very low free inode count.
      
      With the percpu counter batching, this means comparisons when the
      counter is less than 128 * 256 = 32768 use the slow path of adding
      up all the counters across the CPUs, and this is expensive on high
      CPU count machines.
      
      The summing in xfs_mod_ifree() is only used to fire an assert if an
      underrun occurs. The error is ignored by the higher level code.
      Hence this is really just debug code and we don't need to run it
      on production kernels, nor do we need such debug checks to return
      error values just to trigger an assert.
      
      Finally, xfs_mod_icount/xfs_mod_ifree are only called from
      xfs_trans_unreserve_and_mod_sb(), so get rid of them and just
      directly call the percpu_counter_add/percpu_counter_compare
      functions. The compare functions are now run only on debug builds as
      they are internal to ASSERT() checks and so only compiled in when
      ASSERTs are active (CONFIG_XFS_DEBUG=y or CONFIG_XFS_WARN=y).
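
      The resulting pattern in xfs_trans_unreserve_and_mod_sb() is roughly
      the following (a simplified sketch, not the verbatim change): the
      counter update is always cheap, and the expensive summing comparison
      lives only inside ASSERT(), so it compiles away on production kernels.

          /* Sketch: free inode accounting without debug-only summing. */
          if (ifreedelta) {
              percpu_counter_add(&mp->m_ifree, ifreedelta);
              /*
               * percpu_counter_compare() sums every CPU's counter when the
               * value is near zero; only pay that cost when ASSERT() is
               * compiled in (CONFIG_XFS_DEBUG=y or CONFIG_XFS_WARN=y).
               */
              ASSERT(percpu_counter_compare(&mp->m_ifree, 0) >= 0);
          }
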
      Reported-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  7. 07 May 2020 (1 commit)
  8. 05 May 2020 (2 commits)
  9. 17 Apr 2020 (1 commit)
    • xfs: move inode flush to the sync workqueue · f0f7a674
      Authored by Darrick J. Wong
      Move the inode dirty data flushing to a workqueue so that multiple
      threads can take advantage of a single thread's flushing work.  The
      ratelimiting technique used in bdd4ee4 was not successful, because
      threads that skipped the inode flush scan due to ratelimiting would
      ENOSPC early, which caused occasional (but noticeable) changes in
      behavior and sporadic fstest regressions.
      
      Therefore, make all the writer threads wait on a single inode flush,
      which eliminates both the stampeding hordes of flushers and the small
      window in which a write could fail with ENOSPC because it lost the
      ratelimit race even after another thread freed space.
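
      The mechanism is the standard "many waiters, one worker" workqueue
      pattern: every writer queues the same work item and then waits for it,
      so concurrent requests coalesce into a single scan. A minimal sketch
      with hypothetical names (not the actual XFS code):

          /* Sketch: coalesce many flush requests into one piece of work. */
          static void example_flush_worker(struct work_struct *work)
          {
              /* ... the single expensive inode flush scan runs here ... */
          }
          static DECLARE_WORK(example_flush_work, example_flush_worker);

          static void example_flush_and_wait(void)
          {
              /*
               * queue_work() does nothing if the work is already pending, so
               * N callers share one execution; flush_work() then makes each
               * caller wait until that execution has completed.
               */
              queue_work(system_long_wq, &example_flush_work);
              flush_work(&example_flush_work);
          }
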
      
      Fixes: c6425702 ("xfs: ratelimit inode flush on buffered write ENOSPC")
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
  10. 31 Mar 2020 (1 commit)
    • xfs: ratelimit inode flush on buffered write ENOSPC · c6425702
      Authored by Darrick J. Wong
      A customer reported rcu stalls and softlockup warnings on a computer
      with many CPU cores and many many more IO threads trying to write to a
      filesystem that is totally out of space.  Subsequent analysis pointed to
      the many many IO threads calling xfs_flush_inodes -> sync_inodes_sb,
      which causes a lot of wb_writeback_work to be queued.  The writeback
      worker spends so much time trying to wake the many many threads waiting
      for writeback completion that it trips the softlockup detector, and (in
      this case) the system automatically reboots.
      
      In addition, they complain that the lengthy xfs_flush_inodes scan traps
      all of those threads in uninterruptible sleep, which hampers their
      ability to kill the program or do anything else to escape the situation.
      
      If there are thousands of threads trying to write to files on a full
      filesystem, each of those threads will start separate copies of the
      inode flush scan.  This is kind of pointless since we only need one
      scan, so rate limit the inode flush.
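
      One way to express that rate limit is the kernel's generic ratelimit
      helper: only the caller that wins the check starts a flush scan, and
      everyone else skips the redundant work. A rough sketch with
      hypothetical names (not the actual patch):

          /* Sketch: let only an occasional ENOSPC writer start the scan. */
          static DEFINE_RATELIMIT_STATE(example_flush_rs, HZ, 1);

          static void example_on_enospc(void)
          {
              /*
               * With a burst of 1, __ratelimit() returns true at most once
               * per interval; the other callers simply skip the flush.
               */
              if (__ratelimit(&example_flush_rs))
                  example_flush_inodes();    /* hypothetical flush helper */
          }
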
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
  11. 14 Nov 2019 (3 commits)
  12. 11 Nov 2019 (2 commits)
  13. 06 Nov 2019 (2 commits)
  14. 30 Oct 2019 (7 commits)
  15. 27 Aug 2019 (1 commit)
    • xfs: add kmem allocation trace points · 0ad95687
      Authored by Dave Chinner
      When trying to correlate XFS kernel allocations to memory reclaim
      behaviour, it is useful to know what allocations XFS is actually
      attempting. This information is not directly available from
      tracepoints in the generic memory allocation and reclaim
      tracepoints, so these new trace points provide a high level
      indication of what the XFS memory demand actually is.
      
      There is no per-filesystem context in this code, so we just trace
      the type of allocation, the size and the allocation constraints.
      The kmem code also doesn't include much of the common XFS headers,
      so there are a few definitions that need to be added to the trace
      headers and a couple of types that need to be made common to avoid
      needing to include the whole world in the kmem code.
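
      For illustration, a trace point of this shape only needs the size and
      allocation flags; a minimal TRACE_EVENT sketch (names and fields are
      illustrative, not necessarily those added by the patch):

          /* Sketch: high-level allocation trace point. */
          TRACE_EVENT(example_kmem_alloc,
              TP_PROTO(ssize_t size, int flags, unsigned long caller_ip),
              TP_ARGS(size, flags, caller_ip),
              TP_STRUCT__entry(
                  __field(ssize_t, size)
                  __field(int, flags)
                  __field(unsigned long, caller_ip)
              ),
              TP_fast_assign(
                  __entry->size = size;
                  __entry->flags = flags;
                  __entry->caller_ip = caller_ip;
              ),
              TP_printk("size %zd flags 0x%x caller %pS",
                        __entry->size, __entry->flags,
                        (char *)__entry->caller_ip)
          );
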
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  16. 29 Jun 2019 (1 commit)
  17. 12 Jun 2019 (2 commits)
  18. 27 Apr 2019 (1 commit)
  19. 17 Apr 2019 (1 commit)
  20. 15 Apr 2019 (2 commits)
  21. 21 Feb 2019 (1 commit)
    • xfs: introduce an always_cow mode · 66ae56a5
      Authored by Christoph Hellwig
      Add a mode where XFS never overwrites existing blocks in place.  This
      is to aid debugging our COW code, and also puts infrastructure in place
      for things like possible future support for zoned block devices, which
      can't support overwrites.
      
      This mode is enabled globally by doing a:
      
          echo 1 > /sys/fs/xfs/debug/always_cow
      
      Note that the parameter is global to allow running all tests in xfstests
      easily in this mode, which would not easily be possible with a per-fs
      sysfs file.
      
      In always_cow mode persistent preallocations are disabled, fallocate
      will fail when called with a 0 mode (with or without
      FALLOC_FL_KEEP_SIZE), and unwritten extents are not created for
      zeroed space when called with FALLOC_FL_ZERO_RANGE or
      FALLOC_FL_UNSHARE_RANGE.
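
      Conceptually the knob is a single global debug flag consulted when
      deciding whether a write may overwrite in place; a sketch of the gate
      (hypothetical names, not the actual patch):

          /* Sketch: force copy-on-write when the global debug knob is set. */
          extern bool example_always_cow;    /* toggled via the sysfs file */

          static inline bool example_write_needs_cow(bool shared_extent)
          {
              /*
               * Shared extents always need COW; the knob extends that to
               * every write so no block is ever overwritten in place.
               */
              return shared_extent || example_always_cow;
          }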
      
      There are a few interesting xfstests failures when run in always_cow
      mode:
      
       - generic/392 fails because the bytes used in the file used to test
         hole punch recovery are less after the log replay.  This is
         because the blocks written and then punched out are only freed
         with a delay due to the logging mechanism.
       - xfs/170 will fail as the already fragile file streams mechanism
         doesn't seem to interact well with the COW allocator
       - xfs/180 xfs/182 xfs/192 xfs/198 xfs/204 and xfs/208 will claim
         the file system is badly fragmented, but there is not much we
         can do to avoid that when always writing out of place
       - xfs/205 fails because overwriting a file in always_cow mode
         will require new space allocation, so the assumptions in the
         test no longer hold.
       - xfs/326 fails to modify the file at all in always_cow mode after
         injecting the refcount error, leading to an unexpected md5sum
         after the remount, but that again is expected
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  22. 15 Feb 2019 (1 commit)
  23. 12 Feb 2019 (1 commit)
    • xfs: cache unlinked pointers in an rhashtable · 9b247179
      Authored by Darrick J. Wong
      Use a rhashtable to cache the unlinked list incore.  This should speed
      up unlinked processing considerably when there are a lot of inodes on
      the unlinked list because iunlink_remove no longer has to traverse an
      entire bucket list to find which inode points to the one being removed.
      
      The incore list structure records "X.next_unlinked = Y" relations, with
      the rhashtable using Y to index the records.  This makes finding the
      inode X that points to an inode Y very quick.  If our cache fails to find
      anything we can always fall back on the old method.
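
      The cache itself is an ordinary rhashtable keyed by the pointed-to
      inode number; a rough sketch of the backref record and lookup
      (simplified field and helper names):

          /* Sketch: entry recording "prev.next_unlinked = agino". */
          struct iunlink_backref {
              struct rhash_head  node;
              xfs_agino_t        agino;       /* key: Y */
              xfs_agino_t        prev_agino;  /* value: X */
          };

          static const struct rhashtable_params iunlink_params = {
              .key_len             = sizeof(xfs_agino_t),
              .key_offset          = offsetof(struct iunlink_backref, agino),
              .head_offset         = offsetof(struct iunlink_backref, node),
              .automatic_shrinking = true,
          };

          /* Find X given Y without walking the on-disk bucket list. */
          static struct iunlink_backref *
          iunlink_lookup_backref(struct rhashtable *tbl, xfs_agino_t agino)
          {
              return rhashtable_lookup_fast(tbl, &agino, iunlink_params);
          }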
      
      FWIW this drastically reduces the amount of time it takes to remove
      inodes from the unlinked list.  I wrote a program to open a lot of
      O_TMPFILE files and then close them in the same order, which takes
      a very long time if we have to traverse the unlinked lists.  With the
      patch, I see:
      
      + /d/t/tmpfile/tmpfile
      Opened 193531 files in 6.33s.
      Closed 193531 files in 5.86s
      
      real    0m12.192s
      user    0m0.064s
      sys     0m11.619s
      + cd /
      + umount /mnt
      
      real    0m0.050s
      user    0m0.004s
      sys     0m0.030s
      
      And without the patch:
      
      + /d/t/tmpfile/tmpfile
      Opened 193588 files in 6.35s.
      Closed 193588 files in 751.61s
      
      real    12m38.853s
      user    0m0.084s
      sys     12m34.470s
      + cd /
      + umount /mnt
      
      real    0m0.086s
      user    0m0.000s
      sys     0m0.060s
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
  24. 13 Dec 2018 (1 commit)
    • xfs: cache minimum realtime summary level · 355e3532
      Authored by Omar Sandoval
      The realtime summary is a two-dimensional array on disk, effectively:
      
      u32 rsum[log2(number of realtime extents) + 1][number of blocks in the bitmap]
      
      rsum[log][bbno] is the number of extents of size 2**log which start in
      bitmap block bbno.
      
      xfs_rtallocate_extent_near() uses xfs_rtany_summary() to check whether
      rsum[log][bbno] != 0 for any log level. However, the summary array is
      stored in row-major order (i.e., like an array in C), so all of these
      entries are not adjacent, but rather spread across the entire summary
      file. In the worst case (a full bitmap block), xfs_rtany_summary() has
      to check every level.
      
      This means that on a moderately-used realtime device, an allocation will
      waste a lot of time finding, reading, and releasing buffers for the
      realtime summary. In particular, one of our storage services (which runs
      on servers with 8 very slow CPUs and fifteen 8 TB XFS realtime filesystems)
      spends almost 5% of its CPU cycles in xfs_rtbuf_get() and
      xfs_trans_brelse() called from xfs_rtany_summary().
      
      One solution would be to also store the summary with the dimensions
      swapped. However, this would require a disk format change to a very old
      component of XFS.
      
      Instead, we can cache the minimum size which contains any extents. We do
      so lazily; rather than guaranteeing that the cache contains the precise
      minimum, it always contains a loose lower bound which we tighten when we
      read or update a summary block. This only uses a few kilobytes of memory
      and is already serialized via the realtime bitmap and summary inode
      locks, so the cost is minimal. With this change, the same workload only
      spends 0.2% of its CPU cycles in the realtime allocator.
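
      The cache is just one small value per bitmap block: a lower bound on
      the smallest log level that may contain any extents, tightened
      whenever a summary counter is read or updated. A sketch of the idea
      (illustrative names and sizes, not the actual code):

          /*
           * Sketch: rsum_cache[bbno] is a lower bound on the smallest level
           * with rsum[level][bbno] != 0, so searches can start there.
           */
          #include <stdint.h>

          #define NBITMAP_BLOCKS 1024    /* hypothetical bitmap size */
          static uint8_t rsum_cache[NBITMAP_BLOCKS];

          static void rsum_cache_observe(unsigned int bbno, unsigned int level,
                                         uint32_t count)
          {
              if (count == 0 && level == rsum_cache[bbno])
                  rsum_cache[bbno]++;          /* tighten the bound lazily */
              else if (count != 0 && level < rsum_cache[bbno])
                  rsum_cache[bbno] = level;    /* extents appeared below it */
          }

          static unsigned int rsum_search_start(unsigned int bbno)
          {
              return rsum_cache[bbno];    /* skip levels known to be empty */
          }
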
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>