1. 07 7月, 2020 3 次提交
    • D
      xfs: rework xfs_iflush_cluster() dirty inode iteration · 5717ea4d
      Dave Chinner 提交于
      Now that we have all the dirty inodes attached to the cluster
      buffer, we don't actually have to do radix tree lookups to find
      them. Sure, the radix tree is efficient, but walking a linked list
      of just the dirty inodes attached to the buffer is much better.
      
      We are also no longer dependent on having a locked inode passed into
      the function to determine where to start the lookup. This means we
      can drop it from the function call and treat all inodes the same.
      
      We also make xfs_iflush_cluster skip inodes marked with
      XFS_IRECLAIM. This we avoid races with inodes that reclaim is
      actively referencing or are being re-initialised by inode lookup. If
      they are actually dirty, they'll get written by a future cluster
      flush....
      
      We also add a shutdown check after obtaining the flush lock so that
      we catch inodes that are dirty in memory and may have inconsistent
      state due to the shutdown in progress. We abort these inodes
      directly and so they remove themselves directly from the buffer list
      and the AIL rather than having to wait for the buffer to be failed
      and callbacks run to be processed correctly.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      5717ea4d
    • D
      xfs: xfs_iflush() is no longer necessary · 90c60e16
      Dave Chinner 提交于
      Now we have a cached buffer on inode log items, we don't need
      to do buffer lookups when flushing inodes anymore - all we need
      to do is lock the buffer and we are ready to go.
      
      This largely gets rid of the need for xfs_iflush(), which is
      essentially just a mechanism to look up the buffer and flush the
      inode to it. Instead, we can just call xfs_iflush_cluster() with a
      few modifications to ensure it also flushes the inode we already
      hold locked.
      
      This allows the AIL inode item pushing to be almost entirely
      non-blocking in XFS - we won't block unless memory allocation
      for the cluster inode lookup blocks or the block device queues are
      full.
      
      Writeback during inode reclaim becomes a little more complex because
      we now have to lock the buffer ourselves, but otherwise this change
      is largely a functional no-op that removes a whole lot of code.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      90c60e16
    • D
      xfs: move helpers that lock and unlock two inodes against userspace IO · e2aaee9c
      Darrick J. Wong 提交于
      Move the double-inode locking helpers to xfs_inode.c since they're not
      specific to reflink.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      e2aaee9c
  2. 30 5月, 2020 1 次提交
  3. 20 5月, 2020 3 次提交
  4. 13 5月, 2020 1 次提交
  5. 05 5月, 2020 1 次提交
  6. 06 4月, 2020 1 次提交
  7. 14 11月, 2019 1 次提交
  8. 11 11月, 2019 1 次提交
  9. 28 10月, 2019 1 次提交
  10. 23 4月, 2019 1 次提交
    • D
      xfs: widen inode delalloc block counter to 64-bits · 394aafdc
      Darrick J. Wong 提交于
      Widen the incore inode's i_delayed_blks counter to be a 64-bit integer.
      This is necessary to fix an integer overflow problem that can be
      reproduced easily now that we use the counter to track blocks that are
      assigned to the inode in memory but not on disk.  This includes actual
      delalloc reservations as well as real extents in the COW fork that
      are waiting to be remapped into the data fork.
      
      These 'delayed mapping' blocks can easily exceed 2^32 blocks if one
      creates a very large sparse file of size approximately 2^33 bytes with
      one byte written every 2^23 bytes, sets a very large COW extent size
      hint of 2^23 blocks, reflinks the first file into a second file, and
      then writes a single byte every 2^23 blocks in the original file.
      
      When this happens, we'll try to create approximately 1024 2^23 extent
      reservations in the COW fork, which will overflow the counter and cause
      problems.
      
      Note that on x64 we end up filling a 4-byte gap in the structure so this
      doesn't increase the incore size.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NAllison Collins <allison.henderson@oracle.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      394aafdc
  11. 17 4月, 2019 1 次提交
    • D
      xfs: implement per-inode writeback completion queues · cb357bf3
      Darrick J. Wong 提交于
      When scheduling writeback of dirty file data in the page cache, XFS uses
      IO completion workqueue items to ensure that filesystem metadata only
      updates after the write completes successfully.  This is essential for
      converting unwritten extents to real extents at the right time and
      performing COW remappings.
      
      Unfortunately, XFS queues each IO completion work item to an unbounded
      workqueue, which means that the kernel can spawn dozens of threads to
      try to handle the items quickly.  These threads need to take the ILOCK
      to update file metadata, which results in heavy ILOCK contention if a
      large number of the work items target a single file, which is
      inefficient.
      
      Worse yet, the writeback completion threads get stuck waiting for the
      ILOCK while holding transaction reservations, which can use up all
      available log reservation space.  When that happens, metadata updates to
      other parts of the filesystem grind to a halt, even if the filesystem
      could otherwise have handled it.
      
      Even worse, if one of the things grinding to a halt happens to be a
      thread in the middle of a defer-ops finish holding the same ILOCK and
      trying to obtain more log reservation having exhausted the permanent
      reservation, we now have an ABBA deadlock - writeback completion has a
      transaction reserved and wants the ILOCK, and someone else has the ILOCK
      and wants a transaction reservation.
      
      Therefore, we create a per-inode writeback io completion queue + work
      item.  When writeback finishes, it can add the ioend to the per-inode
      queue and let the single worker item process that queue.  This
      dramatically cuts down on the number of kworkers and ILOCK contention in
      the system, and seems to have eliminated an occasional deadlock I was
      seeing while running generic/476.
      
      Testing with a program that simulates a heavy random-write workload to a
      single file demonstrates that the number of kworkers drops from
      approximately 120 threads per file to 1, without dramatically changing
      write bandwidth or pagecache access latency.
      
      Note that we leave the xfs-conv workqueue's max_active alone because we
      still want to be able to run ioend processing for as many inodes as the
      system can handle.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      cb357bf3
  12. 15 4月, 2019 1 次提交
  13. 12 2月, 2019 1 次提交
    • D
      xfs: cache unlinked pointers in an rhashtable · 9b247179
      Darrick J. Wong 提交于
      Use a rhashtable to cache the unlinked list incore.  This should speed
      up unlinked processing considerably when there are a lot of inodes on
      the unlinked list because iunlink_remove no longer has to traverse an
      entire bucket list to find which inode points to the one being removed.
      
      The incore list structure records "X.next_unlinked = Y" relations, with
      the rhashtable using Y to index the records.  This makes finding the
      inode X that points to a inode Y very quick.  If our cache fails to find
      anything we can always fall back on the old method.
      
      FWIW this drastically reduces the amount of time it takes to remove
      inodes from the unlinked list.  I wrote a program to open a lot of
      O_TMPFILE files and then close them in the same order, which takes
      a very long time if we have to traverse the unlinked lists.  With the
      ptach, I see:
      
      + /d/t/tmpfile/tmpfile
      Opened 193531 files in 6.33s.
      Closed 193531 files in 5.86s
      
      real    0m12.192s
      user    0m0.064s
      sys     0m11.619s
      + cd /
      + umount /mnt
      
      real    0m0.050s
      user    0m0.004s
      sys     0m0.030s
      
      And without the patch:
      
      + /d/t/tmpfile/tmpfile
      Opened 193588 files in 6.35s.
      Closed 193588 files in 751.61s
      
      real    12m38.853s
      user    0m0.084s
      sys     12m34.470s
      + cd /
      + umount /mnt
      
      real    0m0.086s
      user    0m0.000s
      sys     0m0.060s
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      9b247179
  14. 03 8月, 2018 1 次提交
  15. 30 7月, 2018 2 次提交
  16. 27 7月, 2018 2 次提交
  17. 12 7月, 2018 1 次提交
  18. 07 6月, 2018 1 次提交
    • D
      xfs: convert to SPDX license tags · 0b61f8a4
      Dave Chinner 提交于
      Remove the verbose license text from XFS files and replace them
      with SPDX tags. This does not change the license of any of the code,
      merely refers to the common, up-to-date license files in LICENSES/
      
      This change was mostly scripted. fs/xfs/Makefile and
      fs/xfs/libxfs/xfs_fs.h were modified by hand, the rest were detected
      and modified by the following command:
      
      for f in `git grep -l "GNU General" fs/xfs/` ; do
      	echo $f
      	cat $f | awk -f hdr.awk > $f.new
      	mv -f $f.new $f
      done
      
      And the hdr.awk script that did the modification (including
      detecting the difference between GPL-2.0 and GPL-2.0+ licenses)
      is as follows:
      
      $ cat hdr.awk
      BEGIN {
      	hdr = 1.0
      	tag = "GPL-2.0"
      	str = ""
      }
      
      /^ \* This program is free software/ {
      	hdr = 2.0;
      	next
      }
      
      /any later version./ {
      	tag = "GPL-2.0+"
      	next
      }
      
      /^ \*\// {
      	if (hdr > 0.0) {
      		print "// SPDX-License-Identifier: " tag
      		print str
      		print $0
      		str=""
      		hdr = 0.0
      		next
      	}
      	print $0
      	next
      }
      
      /^ \* / {
      	if (hdr > 1.0)
      		next
      	if (hdr > 0.0) {
      		if (str != "")
      			str = str "\n"
      		str = str $0
      		next
      	}
      	print $0
      	next
      }
      
      /^ \*/ {
      	if (hdr > 0.0)
      		next
      	print $0
      	next
      }
      
      // {
      	if (hdr > 0.0) {
      		if (str != "")
      			str = str "\n"
      		str = str $0
      		next
      	}
      	print $0
      }
      
      END { }
      $
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      0b61f8a4
  19. 22 5月, 2018 1 次提交
    • D
      xfs: prepare xfs_break_layouts() for another layout type · 69eb5fa1
      Dan Williams 提交于
      When xfs is operating as the back-end of a pNFS block server, it
      prevents collisions between local and remote operations by requiring a
      lease to be held for remotely accessed blocks. Local filesystem
      operations break those leases before writing or mutating the extent map
      of the file.
      
      A similar mechanism is needed to prevent operations on pinned dax
      mappings, like device-DMA, from colliding with extent unmap operations.
      
      BREAK_WRITE and BREAK_UNMAP are introduced as two distinct levels of
      layout breaking.
      
      Layouts are broken in the BREAK_WRITE case to ensure that layout-holders
      do not collide with local writes. Additionally, layouts are broken in
      the BREAK_UNMAP case to make sure the layout-holder has a consistent
      view of the file's extent map. While BREAK_WRITE breaks can be satisfied
      be recalling FL_LAYOUT leases, BREAK_UNMAP breaks additionally require
      waiting for busy dax-pages to go idle while holding XFS_MMAPLOCK_EXCL.
      
      After this refactoring xfs_break_layouts() becomes the entry point for
      coordinating both types of breaks. Finally, xfs_break_leased_layouts()
      becomes just the BREAK_WRITE handler.
      
      Note that the unlock tracking is needed in a follow on change. That will
      coordinate retrying either break handler until both successfully test
      for a lease break while maintaining the lock state.
      
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Reported-by: NDave Chinner <david@fromorbit.com>
      Reported-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      69eb5fa1
  20. 16 5月, 2018 1 次提交
    • B
      xfs: factor out nodiscard helpers · 4e529339
      Brian Foster 提交于
      The changes to skip discards of speculative preallocation and
      unwritten extents introduced several new wrapper functions through
      the bunmapi -> extent free codepath to reduce churn in all of the
      associated callers. In several cases, these wrappers simply toggle a
      single flag to skip or not skip discards for the resulting blocks.
      
      The explicit _nodiscard() wrappers for such an isolated set of
      callers is a bit overkill. Kill off these wrappers and replace with
      the calls to the underlying functions in the contexts that need to
      control discard behavior. Retain the wrappers that preserve the
      original calling conventions to serve the original purpose of
      reducing code churn.
      
      This is a refactoring patch and does not change behavior.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      4e529339
  21. 10 5月, 2018 1 次提交
    • B
      xfs: skip online discard during eofblocks trims · 13b86fc3
      Brian Foster 提交于
      We've had reports of online discard operations being sent from XFS
      on write-only workloads. These discards occur as a result of
      eofblocks trims that can occur after a large file copy completes.
      
      These discards are slightly confusing for users who might be paying
      close attention to online discards (i.e., vdo) due to performance
      sensitivity. They also happen to be spurious because freed post-eof
      blocks by definition have not been written to during the current
      allocation cycle.
      
      Update xfs_free_eofblocks() to skip discards that are purely
      attributed to eofblocks trims. This cuts down the number of spurious
      discards that may occur on write-only workloads due to normal
      preallocation activity.
      
      Note that discards of post-eof extents can still occur from other
      codepaths that do not isolate handling of post-eof blocks from those
      within eof. For example, file unlinks and truncates may still cause
      discards for any file blocks affected by the operation.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      13b86fc3
  22. 10 4月, 2018 1 次提交
  23. 03 4月, 2018 1 次提交
  24. 16 3月, 2018 1 次提交
  25. 29 1月, 2018 1 次提交
  26. 09 1月, 2018 1 次提交
  27. 21 12月, 2017 1 次提交
  28. 09 12月, 2017 1 次提交
    • C
      xfs: remove "no-allocation" reservations for file creations · f59cf5c2
      Christoph Hellwig 提交于
      If we create a new file we will need an inode, and usually some metadata
      in the parent direction.  Aiming for everything to go well despite the
      lack of a reservation leads to dirty transactions cancelled under a heavy
      create/delete load.  This patch removes those nospace transactions, which
      will lead to slightly earlier ENOSPC on some workloads, but instead
      prevent file system shutdowns due to cancelling dirty transactions for
      others.
      
      A customer could observe assertations failures and shutdowns due to
      cancelation of dirty transactions during heavy NFS workloads as shown
      below:
      
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728125] XFS: Assertion failed: error != -ENOSPC, file: fs/xfs/xfs_inode.c, line: 1262
      
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728222] Call Trace:
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728246]  [<ffffffff81795daf>] dump_stack+0x63/0x81
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728262]  [<ffffffff810a1a5a>] warn_slowpath_common+0x8a/0xc0
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728264]  [<ffffffff810a1b8a>] warn_slowpath_null+0x1a/0x20
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728285]  [<ffffffffa01bf403>] asswarn+0x33/0x40 [xfs]
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728308]  [<ffffffffa01bb07e>] xfs_create+0x7be/0x7d0 [xfs]
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728329]  [<ffffffffa01b6ffb>] xfs_generic_create+0x1fb/0x2e0 [xfs]
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728348]  [<ffffffffa01b7114>] xfs_vn_mknod+0x14/0x20 [xfs]
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728366]  [<ffffffffa01b7153>] xfs_vn_create+0x13/0x20 [xfs]
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728380]  [<ffffffff81231de5>] vfs_create+0xd5/0x140
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728390]  [<ffffffffa045ddb9>] do_nfsd_create+0x499/0x610 [nfsd]
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728396]  [<ffffffffa0465fa5>] nfsd3_proc_create+0x135/0x210 [nfsd]
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728401]  [<ffffffffa04561e3>] nfsd_dispatch+0xc3/0x210 [nfsd]
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728416]  [<ffffffffa03bfa43>] svc_process_common+0x453/0x6f0 [sunrpc]
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728423]  [<ffffffffa03bfdf3>] svc_process+0x113/0x1f0 [sunrpc]
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728427]  [<ffffffffa0455bcf>] nfsd+0x10f/0x180 [nfsd]
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728432]  [<ffffffffa0455ac0>] ? nfsd_destroy+0x80/0x80 [nfsd]
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728438]  [<ffffffff810c0d58>] kthread+0xd8/0xf0
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728441]  [<ffffffff810c0c80>] ? kthread_create_on_node+0x1b0/0x1b0
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728451]  [<ffffffff8179d962>] ret_from_fork+0x42/0x70
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728453]  [<ffffffff810c0c80>] ? kthread_create_on_node+0x1b0/0x1b0
      2017-05-30 21:17:06 kernel: WARNING: [ 2670.728454] ---[ end trace f9822c842fec81d4 ]---
      
      2017-05-30 21:17:06 kernel: ALERT: [ 2670.728477] XFS (sdb): Internal error xfs_trans_cancel at line 983 of file fs/xfs/xfs_trans.c.  Caller xfs_create+0x4ee/0x7d0 [xfs]
      
      2017-05-30 21:17:06 kernel: ALERT: [ 2670.728684] XFS (sdb): Corruption of in-memory data detected. Shutting down filesystem
      2017-05-30 21:17:06 kernel: ALERT: [ 2670.728685] XFS (sdb): Please umount the filesystem and rectify the problem(s)
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      f59cf5c2
  29. 27 10月, 2017 1 次提交
  30. 03 7月, 2017 1 次提交
  31. 20 6月, 2017 1 次提交
    • D
      xfs: remove double-underscore integer types · c8ce540d
      Darrick J. Wong 提交于
      This is a purely mechanical patch that removes the private
      __{u,}int{8,16,32,64}_t typedefs in favor of using the system
      {u,}int{8,16,32,64}_t typedefs.  This is the sed script used to perform
      the transformation and fix the resulting whitespace and indentation
      errors:
      
      s/typedef\t__uint8_t/typedef __uint8_t\t/g
      s/typedef\t__uint/typedef __uint/g
      s/typedef\t__int\([0-9]*\)_t/typedef int\1_t\t/g
      s/__uint8_t\t/__uint8_t\t\t/g
      s/__uint/uint/g
      s/__int\([0-9]*\)_t\t/__int\1_t\t\t/g
      s/__int/int/g
      /^typedef.*int[0-9]*_t;$/d
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      c8ce540d
  32. 28 4月, 2017 1 次提交
  33. 30 11月, 2016 1 次提交
  34. 10 11月, 2016 1 次提交
    • B
      xfs: fix unbalanced inode reclaim flush locking · 98efe8af
      Brian Foster 提交于
      Filesystem shutdown testing on an older distro kernel has uncovered an
      imbalanced locking pattern for the inode flush lock in
      xfs_reclaim_inode(). Specifically, there is a double unlock sequence
      between the call to xfs_iflush_abort() and xfs_reclaim_inode() at the
      "reclaim:" label.
      
      This actually does not cause obvious problems on current kernels due to
      the current flush lock implementation. Older kernels use a counting
      based flush lock mechanism, however, which effectively breaks the lock
      indefinitely when an already unlocked flush lock is repeatedly unlocked.
      Though this only currently occurs on filesystem shutdown, it has
      reproduced the effect of elevating an fs shutdown to a system-wide crash
      or hang.
      
      As it turns out, the flush lock is not actually required for the reclaim
      logic in xfs_reclaim_inode() because by that time we have already cycled
      the flush lock once while holding ILOCK_EXCL. Therefore, remove the
      additional flush lock/unlock cycle around the 'reclaim:' label and
      update branches into this label to release the flush lock where
      appropriate. Add an assert to xfs_ifunlock() to help prevent future
      occurences of the same problem.
      Reported-by: NZorro Lang <zlang@redhat.com>
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      98efe8af