1. 13 4月, 2022 1 次提交
    • C
      xfs: Directory's data fork extent counter can never overflow · 83a21c18
      Chandan Babu R 提交于
      The maximum file size that can be represented by the data fork extent counter
      in the worst case occurs when all extents are 1 block in length and each block
      is 1KB in size.
      
      With XFS_MAX_EXTCNT_DATA_FORK_SMALL representing maximum extent count and with
      1KB sized blocks, a file can reach upto,
      (2^31) * 1KB = 2TB
      
      This is much larger than the theoretical maximum size of a directory
      i.e. XFS_DIR2_SPACE_SIZE * 3 = ~96GB.
      
      Since a directory's inode can never overflow its data fork extent counter,
      this commit removes all the overflow checks associated with
      it. xfs_dinode_verify() now performs a rough check to verify if a diretory's
      data fork is larger than 96GB.
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NChandan Babu R <chandan.babu@oracle.com>
      83a21c18
  2. 11 4月, 2022 5 次提交
  3. 23 10月, 2021 2 次提交
  4. 16 4月, 2021 3 次提交
  5. 08 4月, 2021 1 次提交
  6. 26 3月, 2021 1 次提交
    • D
      xfs: initialise attr fork on inode create · e6a688c3
      Dave Chinner 提交于
      When we allocate a new inode, we often need to add an attribute to
      the inode as part of the create. This can happen as a result of
      needing to add default ACLs or security labels before the inode is
      made visible to userspace.
      
      This is highly inefficient right now. We do the create transaction
      to allocate the inode, then we do an "add attr fork" transaction to
      modify the just created empty inode to set the inode fork offset to
      allow attributes to be stored, then we go and do the attribute
      creation.
      
      This means 3 transactions instead of 1 to allocate an inode, and
      this greatly increases the load on the CIL commit code, resulting in
      excessive contention on the CIL spin locks and performance
      degradation:
      
       18.99%  [kernel]                [k] __pv_queued_spin_lock_slowpath
        3.57%  [kernel]                [k] do_raw_spin_lock
        2.51%  [kernel]                [k] __raw_callee_save___pv_queued_spin_unlock
        2.48%  [kernel]                [k] memcpy
        2.34%  [kernel]                [k] xfs_log_commit_cil
      
      The typical profile resulting from running fsmark on a selinux enabled
      filesytem is adds this overhead to the create path:
      
        - 15.30% xfs_init_security
           - 15.23% security_inode_init_security
      	- 13.05% xfs_initxattrs
      	   - 12.94% xfs_attr_set
      	      - 6.75% xfs_bmap_add_attrfork
      		 - 5.51% xfs_trans_commit
      		    - 5.48% __xfs_trans_commit
      		       - 5.35% xfs_log_commit_cil
      			  - 3.86% _raw_spin_lock
      			     - do_raw_spin_lock
      				  __pv_queued_spin_lock_slowpath
      		 - 0.70% xfs_trans_alloc
      		      0.52% xfs_trans_reserve
      	      - 5.41% xfs_attr_set_args
      		 - 5.39% xfs_attr_set_shortform.constprop.0
      		    - 4.46% xfs_trans_commit
      		       - 4.46% __xfs_trans_commit
      			  - 4.33% xfs_log_commit_cil
      			     - 2.74% _raw_spin_lock
      				- do_raw_spin_lock
      				     __pv_queued_spin_lock_slowpath
      			       0.60% xfs_inode_item_format
      		      0.90% xfs_attr_try_sf_addname
      	- 1.99% selinux_inode_init_security
      	   - 1.02% security_sid_to_context_force
      	      - 1.00% security_sid_to_context_core
      		 - 0.92% sidtab_entry_to_string
      		    - 0.90% sidtab_sid2str_get
      			 0.59% sidtab_sid2str_put.part.0
      	   - 0.82% selinux_determine_inode_label
      	      - 0.77% security_transition_sid
      		   0.70% security_compute_sid.part.0
      
      And fsmark creation rate performance drops by ~25%. The key point to
      note here is that half the additional overhead comes from adding the
      attribute fork to the newly created inode. That's crazy, considering
      we can do this same thing at inode create time with a couple of
      lines of code and no extra overhead.
      
      So, if we know we are going to add an attribute immediately after
      creating the inode, let's just initialise the attribute fork inside
      the create transaction and chop that whole chunk of code out of
      the create fast path. This completely removes the performance
      drop caused by enabling SELinux, and the profile looks like:
      
           - 8.99% xfs_init_security
               - 9.00% security_inode_init_security
                  - 6.43% xfs_initxattrs
                     - 6.37% xfs_attr_set
                        - 5.45% xfs_attr_set_args
                           - 5.42% xfs_attr_set_shortform.constprop.0
                              - 4.51% xfs_trans_commit
                                 - 4.54% __xfs_trans_commit
                                    - 4.59% xfs_log_commit_cil
                                       - 2.67% _raw_spin_lock
                                          - 3.28% do_raw_spin_lock
                                               3.08% __pv_queued_spin_lock_slowpath
                                         0.66% xfs_inode_item_format
                              - 0.90% xfs_attr_try_sf_addname
                        - 0.60% xfs_trans_alloc
                  - 2.35% selinux_inode_init_security
                     - 1.25% security_sid_to_context_force
                        - 1.21% security_sid_to_context_core
                           - 1.19% sidtab_entry_to_string
                              - 1.20% sidtab_sid2str_get
                                 - 0.86% sidtab_sid2str_put.part.0
                                    - 0.62% _raw_spin_lock_irqsave
                                       - 0.77% do_raw_spin_lock
                                            __pv_queued_spin_lock_slowpath
                     - 0.84% selinux_determine_inode_label
                        - 0.83% security_transition_sid
                             0.86% security_compute_sid.part.0
      
      Which indicates the XFS overhead of creating the selinux xattr has
      been halved. This doesn't fix the CIL lock contention problem, just
      means it's not a limiting factor for this workload. Lock contention
      in the security subsystems is going to be an issue soon, though...
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      [djwong: fix compilation error when CONFIG_SECURITY=n]
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NGao Xiang <hsiangkao@redhat.com>
      e6a688c3
  7. 23 1月, 2021 8 次提交
  8. 20 5月, 2020 6 次提交
  9. 19 3月, 2020 1 次提交
  10. 11 11月, 2019 1 次提交
    • D
      xfs: refactor "does this fork map blocks" predicate · 2fe4f928
      Darrick J. Wong 提交于
      Replace the open-coded checks for whether or not an inode fork maps
      blocks with a macro that will implant the code for us.  This helps us
      declutter the bmap code a bit.
      
      Note that I had to use a macro instead of a static inline function
      because of C header dependency problems between xfs_inode.h and
      xfs_inode_fork.h.
      
      Conversion was performed with the following Coccinelle script:
      
      @@
      expression ip, w;
      @@
      
      - XFS_IFORK_FORMAT(ip, w) == XFS_DINODE_FMT_EXTENTS || XFS_IFORK_FORMAT(ip, w) == XFS_DINODE_FMT_BTREE
      + xfs_ifork_has_extents(ip, w)
      
      @@
      expression ip, w;
      @@
      
      - XFS_IFORK_FORMAT(ip, w) != XFS_DINODE_FMT_EXTENTS && XFS_IFORK_FORMAT(ip, w) != XFS_DINODE_FMT_BTREE
      + !xfs_ifork_has_extents(ip, w)
      
      @@
      expression ip, w;
      @@
      
      - XFS_IFORK_FORMAT(ip, w) == XFS_DINODE_FMT_BTREE || XFS_IFORK_FORMAT(ip, w) == XFS_DINODE_FMT_EXTENTS
      + xfs_ifork_has_extents(ip, w)
      
      @@
      expression ip, w;
      @@
      
      - XFS_IFORK_FORMAT(ip, w) != XFS_DINODE_FMT_BTREE && XFS_IFORK_FORMAT(ip, w) != XFS_DINODE_FMT_EXTENTS
      + !xfs_ifork_has_extents(ip, w)
      
      @@
      expression ip, w;
      @@
      
      - (xfs_ifork_has_extents(ip, w))
      + xfs_ifork_has_extents(ip, w)
      
      @@
      expression ip, w;
      @@
      
      - (!xfs_ifork_has_extents(ip, w))
      + !xfs_ifork_has_extents(ip, w)
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      2fe4f928
  11. 22 10月, 2019 1 次提交
    • D
      xfs: fix inode fork extent count overflow · 3f8a4f1d
      Dave Chinner 提交于
      [commit message is verbose for discussion purposes - will trim it
      down later. Some questions about implementation details at the end.]
      
      Zorro Lang recently ran a new test to stress single inode extent
      counts now that they are no longer limited by memory allocation.
      The test was simply:
      
      # xfs_io -f -c "falloc 0 40t" /mnt/scratch/big-file
      # ~/src/xfstests-dev/punch-alternating /mnt/scratch/big-file
      
      This test uncovered a problem where the hole punching operation
      appeared to finish with no error, but apparently only created 268M
      extents instead of the 10 billion it was supposed to.
      
      Further, trying to punch out extents that should have been present
      resulted in success, but no change in the extent count. It looked
      like a silent failure.
      
      While running the test and observing the behaviour in real time,
      I observed the extent coutn growing at ~2M extents/minute, and saw
      this after about an hour:
      
      # xfs_io -f -c "stat" /mnt/scratch/big-file |grep next ; \
      > sleep 60 ; \
      > xfs_io -f -c "stat" /mnt/scratch/big-file |grep next
      fsxattr.nextents = 127657993
      fsxattr.nextents = 129683339
      #
      
      And a few minutes later this:
      
      # xfs_io -f -c "stat" /mnt/scratch/big-file |grep next
      fsxattr.nextents = 4177861124
      #
      
      Ah, what? Where did that 4 billion extra extents suddenly come from?
      
      Stop the workload, unmount, mount:
      
      # xfs_io -f -c "stat" /mnt/scratch/big-file |grep next
      fsxattr.nextents = 166044375
      #
      
      And it's back at the expected number. i.e. the extent count is
      correct on disk, but it's screwed up in memory. I loaded up the
      extent list, and immediately:
      
      # xfs_io -f -c "stat" /mnt/scratch/big-file |grep next
      fsxattr.nextents = 4192576215
      #
      
      It's bad again. So, where does that number come from?
      xfs_fill_fsxattr():
      
                      if (ip->i_df.if_flags & XFS_IFEXTENTS)
                              fa->fsx_nextents = xfs_iext_count(&ip->i_df);
                      else
                              fa->fsx_nextents = ip->i_d.di_nextents;
      
      And that's the behaviour I just saw in a nutshell. The on disk count
      is correct, but once the tree is loaded into memory, it goes whacky.
      Clearly there's something wrong with xfs_iext_count():
      
      inline xfs_extnum_t xfs_iext_count(struct xfs_ifork *ifp)
      {
              return ifp->if_bytes / sizeof(struct xfs_iext_rec);
      }
      
      Simple enough, but 134M extents is 2**27, and that's right about
      where things went wrong. A struct xfs_iext_rec is 16 bytes in size,
      which means 2**27 * 2**4 = 2**31 and we're right on target for an
      integer overflow. And, sure enough:
      
      struct xfs_ifork {
              int                     if_bytes;       /* bytes in if_u1 */
      ....
      
      Once we get 2**27 extents in a file, we overflow if_bytes and the
      in-core extent count goes wrong. And when we reach 2**28 extents,
      if_bytes wraps back to zero and things really start to go wrong
      there. This is where the silent failure comes from - only the first
      2**28 extents can be looked up directly due to the overflow, all the
      extents above this index wrap back to somewhere in the first 2**28
      extents. Hence with a regular pattern, trying to punch a hole in the
      range that didn't have holes mapped to a hole in the first 2**28
      extents and so "succeeded" without changing anything. Hence "silent
      failure"...
      
      Fix this by converting if_bytes to a int64_t and converting all the
      index variables and size calculations to use int64_t types to avoid
      overflows in future. Signed integers are still used to enable easy
      detection of extent count underflows. This enables scalability of
      extent counts to the limits of the on-disk format - MAXEXTNUM
      (2**31) extents.
      
      Current testing is at over 500M extents and still going:
      
      fsxattr.nextents = 517310478
      Reported-by: NZorro Lang <zlang@redhat.com>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      3f8a4f1d
  12. 12 2月, 2019 1 次提交
  13. 01 8月, 2018 1 次提交
  14. 30 7月, 2018 2 次提交
  15. 07 6月, 2018 1 次提交
    • D
      xfs: convert to SPDX license tags · 0b61f8a4
      Dave Chinner 提交于
      Remove the verbose license text from XFS files and replace them
      with SPDX tags. This does not change the license of any of the code,
      merely refers to the common, up-to-date license files in LICENSES/
      
      This change was mostly scripted. fs/xfs/Makefile and
      fs/xfs/libxfs/xfs_fs.h were modified by hand, the rest were detected
      and modified by the following command:
      
      for f in `git grep -l "GNU General" fs/xfs/` ; do
      	echo $f
      	cat $f | awk -f hdr.awk > $f.new
      	mv -f $f.new $f
      done
      
      And the hdr.awk script that did the modification (including
      detecting the difference between GPL-2.0 and GPL-2.0+ licenses)
      is as follows:
      
      $ cat hdr.awk
      BEGIN {
      	hdr = 1.0
      	tag = "GPL-2.0"
      	str = ""
      }
      
      /^ \* This program is free software/ {
      	hdr = 2.0;
      	next
      }
      
      /any later version./ {
      	tag = "GPL-2.0+"
      	next
      }
      
      /^ \*\// {
      	if (hdr > 0.0) {
      		print "// SPDX-License-Identifier: " tag
      		print str
      		print $0
      		str=""
      		hdr = 0.0
      		next
      	}
      	print $0
      	next
      }
      
      /^ \* / {
      	if (hdr > 1.0)
      		next
      	if (hdr > 0.0) {
      		if (str != "")
      			str = str "\n"
      		str = str $0
      		next
      	}
      	print $0
      	next
      }
      
      /^ \*/ {
      	if (hdr > 0.0)
      		next
      	print $0
      	next
      }
      
      // {
      	if (hdr > 0.0) {
      		if (str != "")
      			str = str "\n"
      		str = str $0
      		next
      	}
      	print $0
      }
      
      END { }
      $
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      0b61f8a4
  16. 09 1月, 2018 1 次提交
  17. 07 11月, 2017 4 次提交
    • C
      xfs: remove the nr_extents argument to xfs_iext_remove · c38ccf59
      Christoph Hellwig 提交于
      We only have two places that remove 2 extents at the same time, so unroll
      the loop there.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      c38ccf59
    • C
      xfs: remove the nr_extents argument to xfs_iext_insert · 0254c2f2
      Christoph Hellwig 提交于
      We only have two places that insert 2 extents at the same time, so unroll
      the loop there.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      0254c2f2
    • C
      xfs: use a b+tree for the in-core extent list · 6bdcf26a
      Christoph Hellwig 提交于
      Replace the current linear list and the indirection array for the in-core
      extent list with a b+tree to avoid the need for larger memory allocations
      for the indirection array when lots of extents are present.  The current
      extent list implementations leads to heavy pressure on the memory
      allocator when modifying files with a high extent count, and can lead
      to high latencies because of that.
      
      The replacement is a b+tree with a few quirks.  The leaf nodes directly
      store the extent record in two u64 values.  The encoding is a little bit
      different from the existing in-core extent records so that the start
      offset and length which are required for lookups can be retreived with
      simple mask operations.  The inner nodes store a 64-bit key containing
      the start offset in the first half of the node, and the pointers to the
      next lower level in the second half.  In either case we walk the node
      from the beginninig to the end and do a linear search, as that is more
      efficient for the low number of cache lines touched during a search
      (2 for the inner nodes, 4 for the leaf nodes) than a binary search.
      We store termination markers (zero length for the leaf nodes, an
      otherwise impossible high bit for the inner nodes) to terminate the key
      list / records instead of storing a count to use the available cache
      lines as efficiently as possible.
      
      One quirk of the algorithm is that while we normally split a node half and
      half like usual btree implementations we just spill over entries added at
      the very end of the list to a new node on its own.  This means we get a
      100% fill grade for the common cases of bulk insertion when reading an
      inode into memory, and when only sequentially appending to a file.  The
      downside is a slightly higher chance of splits on the first random
      insertions.
      
      Both insert and removal manually recurse into the lower levels, but
      the bulk deletion of the whole tree is still implemented as a recursive
      function call, although one limited by the overall depth and with very
      little stack usage in every iteration.
      
      For the first few extents we dynamically grow the list from a single
      extent to the next powers of two until we have a first full leaf block
      and that building the actual tree.
      
      The code started out based on the generic lib/btree.c code from Joern
      Engel based on earlier work from Peter Zijlstra, but has since been
      rewritten beyond recognition.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      6bdcf26a
    • C
      xfs: remove support for inlining data/extents into the inode fork · 43518812
      Christoph Hellwig 提交于
      Supporting a small bit of data inside the inode fork blows up the fork size
      a lot, removing the 32 bytes of inline data halves the effective size of
      the inode fork (and it still has a lot of unused padding left), and the
      performance of a single kmalloc doesn't show up compared to the size to read
      an inode or create one.
      
      It also simplifies the fork management code a lot.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      43518812