1. 08 Apr, 2021 10 commits
  2. 26 Mar, 2021 3 commits
    • xfs: Rudimentary spelling fix · 0145225e
      Committed by Bhaskar Chowdhury
      s/sytemcall/syscall/

      Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
      Acked-by: Randy Dunlap <rdunlap@infradead.org>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: initialise attr fork on inode create · e6a688c3
      Committed by Dave Chinner
      When we allocate a new inode, we often need to add an attribute to
      the inode as part of the create. This can happen as a result of
      needing to add default ACLs or security labels before the inode is
      made visible to userspace.
      
      This is highly inefficient right now. We do the create transaction
      to allocate the inode, then we do an "add attr fork" transaction to
      modify the just created empty inode to set the inode fork offset to
      allow attributes to be stored, then we go and do the attribute
      creation.
      
      This means 3 transactions instead of 1 to allocate an inode, and
      this greatly increases the load on the CIL commit code, resulting in
      excessive contention on the CIL spin locks and performance
      degradation:
      
       18.99%  [kernel]                [k] __pv_queued_spin_lock_slowpath
        3.57%  [kernel]                [k] do_raw_spin_lock
        2.51%  [kernel]                [k] __raw_callee_save___pv_queued_spin_unlock
        2.48%  [kernel]                [k] memcpy
        2.34%  [kernel]                [k] xfs_log_commit_cil
      
      Running fsmark on a selinux enabled filesystem typically adds this
      overhead to the create path:
      
        - 15.30% xfs_init_security
           - 15.23% security_inode_init_security
      	- 13.05% xfs_initxattrs
      	   - 12.94% xfs_attr_set
      	      - 6.75% xfs_bmap_add_attrfork
      		 - 5.51% xfs_trans_commit
      		    - 5.48% __xfs_trans_commit
      		       - 5.35% xfs_log_commit_cil
      			  - 3.86% _raw_spin_lock
      			     - do_raw_spin_lock
      				  __pv_queued_spin_lock_slowpath
      		 - 0.70% xfs_trans_alloc
      		      0.52% xfs_trans_reserve
      	      - 5.41% xfs_attr_set_args
      		 - 5.39% xfs_attr_set_shortform.constprop.0
      		    - 4.46% xfs_trans_commit
      		       - 4.46% __xfs_trans_commit
      			  - 4.33% xfs_log_commit_cil
      			     - 2.74% _raw_spin_lock
      				- do_raw_spin_lock
      				     __pv_queued_spin_lock_slowpath
      			       0.60% xfs_inode_item_format
      		      0.90% xfs_attr_try_sf_addname
      	- 1.99% selinux_inode_init_security
      	   - 1.02% security_sid_to_context_force
      	      - 1.00% security_sid_to_context_core
      		 - 0.92% sidtab_entry_to_string
      		    - 0.90% sidtab_sid2str_get
      			 0.59% sidtab_sid2str_put.part.0
      	   - 0.82% selinux_determine_inode_label
      	      - 0.77% security_transition_sid
      		   0.70% security_compute_sid.part.0
      
      And fsmark creation rate performance drops by ~25%. The key point to
      note here is that half the additional overhead comes from adding the
      attribute fork to the newly created inode. That's crazy, considering
      we can do this same thing at inode create time with a couple of
      lines of code and no extra overhead.
      
      So, if we know we are going to add an attribute immediately after
      creating the inode, let's just initialise the attribute fork inside
      the create transaction and chop that whole chunk of code out of
      the create fast path. This completely removes the performance
      drop caused by enabling SELinux, and the profile looks like:
      
           - 8.99% xfs_init_security
               - 9.00% security_inode_init_security
                  - 6.43% xfs_initxattrs
                     - 6.37% xfs_attr_set
                        - 5.45% xfs_attr_set_args
                           - 5.42% xfs_attr_set_shortform.constprop.0
                              - 4.51% xfs_trans_commit
                                 - 4.54% __xfs_trans_commit
                                    - 4.59% xfs_log_commit_cil
                                       - 2.67% _raw_spin_lock
                                          - 3.28% do_raw_spin_lock
                                               3.08% __pv_queued_spin_lock_slowpath
                                         0.66% xfs_inode_item_format
                              - 0.90% xfs_attr_try_sf_addname
                        - 0.60% xfs_trans_alloc
                  - 2.35% selinux_inode_init_security
                     - 1.25% security_sid_to_context_force
                        - 1.21% security_sid_to_context_core
                           - 1.19% sidtab_entry_to_string
                              - 1.20% sidtab_sid2str_get
                                 - 0.86% sidtab_sid2str_put.part.0
                                    - 0.62% _raw_spin_lock_irqsave
                                       - 0.77% do_raw_spin_lock
                                            __pv_queued_spin_lock_slowpath
                     - 0.84% selinux_determine_inode_label
                        - 0.83% security_transition_sid
                             0.86% security_compute_sid.part.0
      
      Which indicates the XFS overhead of creating the selinux xattr has
      been halved. This doesn't fix the CIL lock contention problem, just
      means it's not a limiting factor for this workload. Lock contention
      in the security subsystems is going to be an issue soon, though...

      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      [djwong: fix compilation error when CONFIG_SECURITY=n]
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
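
The transaction accounting described in this commit can be sketched as a toy model. This is not kernel code: the function and parameter names are hypothetical, and the "after" count of two transactions is a reading of the second profile (which still shows a commit under xfs_attr_set), not something the commit message states outright.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model: count the transactions needed to create an inode that
 * immediately receives one attribute (e.g. a security label).
 * Both parameters are hypothetical names used only for illustration.
 */
static int toy_create_transaction_count(bool needs_attr,
					bool init_attr_fork_at_create)
{
	int count = 1;		/* the create transaction itself */

	if (needs_attr) {
		if (!init_attr_fork_at_create)
			count++;	/* separate "add attr fork" transaction */
		count++;		/* the attribute creation transaction */
	}

	/* Sanity check on the model: between one and three transactions. */
	assert(count >= 1 && count <= 3);
	return count;
}
```

In this model the old path costs three transactions for a labelled create, and initialising the attr fork inside the create transaction removes one of them.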
    • xfs: prevent metadata files from being inactivated · 383e32b0
      Committed by Darrick J. Wong
      Files containing metadata (quota records, rt bitmap and summary info)
      are fully managed by the filesystem, which means that all resource
      cleanup must be explicit, not automatic.  This means that they should
      never be automatically subjected to post-eof truncation, nor should they be
      freed automatically even if the link count drops to zero.
      
      In other words, xfs_inactive() should leave these files alone.  Add the
      necessary predicate functions to make this happen.  This adds a second
      layer of prevention for the kind of fs corruption that was fixed by
      commit f4c32e87.  If we ever decide to support removing metadata
      files, we should make all those metadata updates explicit.
      
      Rearrange the order of #includes to fix compiler errors, since
      xfs_mount.h is supposed to be included before xfs_inode.h.
      
      Followup-to: f4c32e87 ("xfs: fix realtime bitmap/summary file truncation when growing rt volume")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
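
A predicate of the kind this commit describes might look like the following minimal sketch. The structures, field names, and function names are simplified, hypothetical stand-ins, not the actual XFS definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the per-mount metadata inode numbers. */
struct toy_meta_mount {
	uint64_t quota_inos[3];	/* user, group and project quota inodes */
	uint64_t rbm_ino;	/* realtime bitmap inode */
	uint64_t rsum_ino;	/* realtime summary inode */
};

struct toy_meta_inode {
	const struct toy_meta_mount *mount;
	uint64_t ino;
	unsigned int nlink;
};

/* Is this inode one of the files fully managed by the filesystem? */
static bool toy_is_metadata_inode(const struct toy_meta_inode *ip)
{
	const struct toy_meta_mount *mp = ip->mount;

	assert(mp != NULL);
	if (ip->ino == mp->rbm_ino || ip->ino == mp->rsum_ino)
		return true;
	for (size_t i = 0; i < 3; i++)
		if (ip->ino == mp->quota_inos[i])
			return true;
	return false;
}

/*
 * Inactivation (post-eof truncation, freeing on nlink == 0) is skipped
 * for metadata files regardless of their link count.
 */
static bool toy_may_inactivate(const struct toy_meta_inode *ip)
{
	return !toy_is_metadata_inode(ip);
}
```

The point of the sketch is that the link count plays no part in the decision for metadata files: cleanup for them has to be requested explicitly.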
  3. 10 Mar, 2021 1 commit
  4. 04 Feb, 2021 2 commits
  5. 24 Jan, 2021 1 commit
  6. 23 Jan, 2021 4 commits
    • xfs: fix up non-directory creation in SGID directories · 01ea173e
      Committed by Christoph Hellwig
      XFS always inherits the SGID bit if it is set on the parent inode, while
      the generic inode_init_owner does not do this in a few cases where it
      could create a security problem; see commit 0fa3ecd8
      ("Fix up non-directory creation in SGID directories") for details.
      
      Switch XFS to use the generic helper for the normal path to fix this,
      keeping only the simple field inheritance open coded for the non-sgid
      case with the bsdgrpid mount option.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: Check for extent overflow when renaming dir entries · 02092a2f
      Committed by Chandan Babu R
      A rename operation is essentially a directory entry remove operation
      from the perspective of parent directory (i.e. src_dp) of rename's
      source. Hence the only place where we check for extent count overflow
      for src_dp is in xfs_bmap_del_extent_real(). xfs_bmap_del_extent_real()
      returns -ENOSPC when it detects a possible extent count overflow and in
      response, the higher layers of directory handling code do the following:
      1. Data/Free blocks: XFS lets these blocks linger until a future remove
         operation removes them.
      2. Dabtree blocks: XFS swaps the blocks with the last block in the Leaf
         space and unmaps the last block.
      
      For target_dp, there are two cases depending on whether the destination
      directory entry exists or not.
      
      When destination directory entry does not exist (i.e. target_ip ==
      NULL), extent count overflow check is performed only when transaction
      has a non-zero sized space reservation associated with it.  With a
      zero-sized space reservation, XFS allows a rename operation to continue
      only when the directory has sufficient free space in its data/leaf/free
      space blocks to hold the new entry.
      
      When destination directory entry exists (i.e. target_ip != NULL), all
      we need to do is change the inode number associated with the already
      existing entry. Hence there is no need to perform an extent count
      overflow check.

      Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: Check for extent overflow when adding dir entries · f5d92749
      Committed by Chandan Babu R
      Directory entry addition can cause the following:
      1. Data block can be added/removed.
         A new extent can cause extent count to increase by 1.
      2. Free disk block can be added/removed.
         Same behaviour as described above for Data block.
      3. Dabtree blocks.
         XFS_DA_NODE_MAXDEPTH blocks can be added. Each of these
         can be new extents. Hence extent count can increase by
         XFS_DA_NODE_MAXDEPTH.

      Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
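
The worst-case arithmetic in the list above can be written out as a short sketch. The depth value of 5 and the helper names are assumptions made for illustration; the real constant lives in the kernel headers, and the real check may also scale by directory block size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed value for illustration only; consult the kernel headers. */
#define TOY_DA_NODE_MAXDEPTH 5u

/* Worst-case number of new extents caused by adding one directory entry. */
static uint32_t toy_dir_add_extent_delta(void)
{
	return 1u			/* one data block extent */
	     + 1u			/* one free block extent */
	     + TOY_DA_NODE_MAXDEPTH;	/* one extent per dabtree level */
}

/*
 * Would adding 'delta' extents push the fork past 'max_extents'?
 * Written as a subtraction so the comparison cannot wrap around.
 */
static bool toy_extent_count_would_overflow(uint64_t cur_extents,
					    uint32_t delta,
					    uint64_t max_extents)
{
	assert(cur_extents <= max_extents);
	return delta > max_extents - cur_extents;
}
```

With the assumed depth the worst case is seven new extents, so a fork within six extents of its limit would be rejected up front in this model.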
    • xfs: fix an ABBA deadlock in xfs_rename · 6da1b4b1
      Committed by Darrick J. Wong
      When overlayfs is running on top of xfs and the user unlinks a file in
      the overlay, overlayfs will create a whiteout inode and ask xfs to
      "rename" the whiteout file atop the one being unlinked.  If the file
      being unlinked loses its one nlink, we then have to put the inode on the
      unlinked list.
      
      This requires us to grab the AGI buffer of the whiteout inode to take it
      off the unlinked list (which is where whiteouts are created) and to grab
      the AGI buffer of the file being deleted.  If the whiteout was created
      in a higher numbered AG than the file being deleted, we'll lock the AGIs
      in the wrong order and deadlock.
      
      Therefore, grab all the AGI locks we think we'll need ahead of time, and
      in order of increasing AG number per the locking rules.

      Reported-by: wenli xie <wlxie7296@gmail.com>
      Fixes: 93597ae8 ("xfs: Fix deadlock between AGI and AGF when target_ip exists in xfs_rename()")
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
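
The locking rule stated above (take every needed AGI lock up front, in increasing AG number) can be sketched as follows. This only computes the lock order; the helper names are hypothetical, and the actual buffer locking is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for AG numbers, ascending. */
static int toy_cmp_agno(const void *a, const void *b)
{
	uint32_t x = *(const uint32_t *)a;
	uint32_t y = *(const uint32_t *)b;

	return (x > y) - (x < y);
}

/*
 * Sort the AG numbers of all inodes involved in the rename and drop
 * duplicates in place. The caller would then lock the AGI of each
 * remaining entry in array order, so two tasks can never hold a pair
 * of AGIs in opposite orders. Returns the number of distinct AGs.
 */
static size_t toy_agi_lock_order(uint32_t *agnos, size_t n)
{
	size_t out = 0;

	assert(agnos != NULL || n == 0);
	if (n == 0)
		return 0;

	qsort(agnos, n, sizeof(*agnos), toy_cmp_agno);
	for (size_t i = 0; i < n; i++)
		if (out == 0 || agnos[out - 1] != agnos[i])
			agnos[out++] = agnos[i];
	return out;
}
```

For the overlayfs case in the message, the whiteout's AG and the unlinked file's AG would both go into the array before either AGI buffer is touched.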
  7. 13 Dec, 2020 4 commits
  8. 10 Dec, 2020 2 commits
  9. 22 Sep, 2020 1 commit
  10. 16 Sep, 2020 4 commits
  11. 07 Sep, 2020 1 commit
    • xfs: xfs_iflock is no longer a completion · 718ecc50
      Committed by Dave Chinner
      With the recent rework of the inode cluster flushing, we no longer
      ever wait on the inode flush "lock". It was never a lock in the
      first place, just a completion to allow callers to wait for inode IO
      to complete. We now never wait for flush completion as all inode
      flushing is non-blocking. Hence we can get rid of all the iflock
      infrastructure and instead just set and check a state flag.
      
      Rename the XFS_IFLOCK flag to XFS_IFLUSHING, convert all the
      xfs_iflock_nowait() calls to test-and-set operations on that flag,
      and replace all the xfs_ifunlock() calls with clear operations.

      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
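
The "set and check a state flag" pattern described above can be illustrated in userspace with C11 atomics. This is a sketch under assumptions: the bit position, structure, and helper names are hypothetical, and the kernel uses its own flag and locking primitives rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_IFLUSHING (1u << 0)	/* hypothetical bit position */

struct toy_flush_inode {
	atomic_uint flags;
};

/*
 * Test-and-set: returns true only for the caller that flipped the flag
 * from clear to set. Nobody ever sleeps waiting for it.
 */
static bool toy_set_flushing(struct toy_flush_inode *ip)
{
	unsigned int old = atomic_fetch_or(&ip->flags, TOY_IFLUSHING);

	return !(old & TOY_IFLUSHING);
}

/* Clear operation, used where an "unlock" call used to be. */
static void toy_clear_flushing(struct toy_flush_inode *ip)
{
	unsigned int old = atomic_fetch_and(&ip->flags, ~TOY_IFLUSHING);

	assert(old & TOY_IFLUSHING);	/* must have been flushing */
	(void)old;			/* quiet NDEBUG builds */
}

/* Plain state check. */
static bool toy_is_flushing(struct toy_flush_inode *ip)
{
	return (atomic_load(&ip->flags) & TOY_IFLUSHING) != 0;
}
```

A caller that loses the test-and-set simply skips the inode, which matches the non-blocking flushing the message describes.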
  12. 05 Aug, 2020 1 commit
  13. 14 Jul, 2020 1 commit
    • xfs: get rid of unnecessary xfs_perag_{get,put} pairs · 92a00544
      Committed by Gao Xiang
      In the course of some operations, we look up the perag from
      the mount multiple times to get or change perag information.
      These are often very short pieces of code, so while the
      lookup cost is generally low, the cost of the lookup is far
      higher than the cost of the operation we are doing on the
      perag.
      
      Since we changed buffers to hold references to the perag
      they are cached in, many modification contexts already hold
      active references to the perag that are held across these
      operations. This is especially true for any operation that
      is serialised by an allocation group header buffer.
      
      In these cases, we can just use the buffer's reference to
      the perag to avoid needing to do lookups to access the
      perag. This means that many operations don't need to do
      perag lookups at all to access the perag because they've
      already looked up objects that own persistent references
      and hence can use that reference instead.
      
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
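
The difference between a lookup pair and reusing the buffer's reference can be shown with a toy model that counts lookups. All structures and function names below are hypothetical simplifications, not the XFS perag API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_AG_COUNT 4u

struct toy_perag {
	int refcount;
	uint32_t free_inodes;
};

struct toy_pag_mount {
	struct toy_perag ags[TOY_AG_COUNT];
	unsigned int lookups;	/* how many lookups were performed */
};

/* An AG header buffer that holds a perag reference while it exists. */
struct toy_agbuf {
	struct toy_perag *pag;
};

static struct toy_perag *toy_perag_get(struct toy_pag_mount *mp, uint32_t agno)
{
	assert(agno < TOY_AG_COUNT);
	mp->lookups++;
	mp->ags[agno].refcount++;
	return &mp->ags[agno];
}

static void toy_perag_put(struct toy_perag *pag)
{
	assert(pag->refcount > 0);
	pag->refcount--;
}

/* Before: a get/put pair wrapped around a one-line update. */
static void toy_dec_free_with_lookup(struct toy_pag_mount *mp, uint32_t agno)
{
	struct toy_perag *pag = toy_perag_get(mp, agno);

	pag->free_inodes--;
	toy_perag_put(pag);
}

/* After: the locked header buffer already pins the perag. */
static void toy_dec_free_with_buf(struct toy_agbuf *agbp)
{
	agbp->pag->free_inodes--;
}
```

The second helper is only safe because the caller holds the header buffer, which is the serialisation the message relies on.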
  14. 07 Jul, 2020 5 commits
    • xfs: remove xfs_inobp_check() · e2705b03
      Committed by Dave Chinner
      This debug code is called on every xfs_iflush() call, which then
      checks every inode in the buffer for a non-zero unlinked list field.
      Hence it checks every inode in the cluster buffer every time a
      single inode on that cluster is flushed. This is resulting in:
      
      -   38.91%     5.33%  [kernel]  [k] xfs_iflush
         - 17.70% xfs_iflush
            - 9.93% xfs_inobp_check
                 4.36% xfs_buf_offset
      
      10% of the CPU time spent flushing inodes is repeatedly checking
      unlinked fields in the buffer. We don't need to do this.
      
      The other place we call xfs_inobp_check() is
      xfs_iunlink_update_dinode(), and this is after we've done this
      assert for the agino we are about to write into that inode:
      
      	ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));
      
      which means we've already checked that the agino we are about to
      write is not 0 on debug kernels. The inode buffer verifiers do
      everything else we need, so let's just remove this debug code.

      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: rework xfs_iflush_cluster() dirty inode iteration · 5717ea4d
      Committed by Dave Chinner
      Now that we have all the dirty inodes attached to the cluster
      buffer, we don't actually have to do radix tree lookups to find
      them. Sure, the radix tree is efficient, but walking a linked list
      of just the dirty inodes attached to the buffer is much better.
      
      We are also no longer dependent on having a locked inode passed into
      the function to determine where to start the lookup. This means we
      can drop it from the function call and treat all inodes the same.
      
      We also make xfs_iflush_cluster skip inodes marked with
      XFS_IRECLAIM. This way we avoid races with inodes that reclaim is
      actively referencing or are being re-initialised by inode lookup. If
      they are actually dirty, they'll get written by a future cluster
      flush....
      
      We also add a shutdown check after obtaining the flush lock so that
      we catch inodes that are dirty in memory and may have inconsistent
      state due to the shutdown in progress. We abort these inodes
      directly and so they remove themselves directly from the buffer list
      and the AIL rather than having to wait for the buffer to be failed
      and callbacks run to be processed correctly.

      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
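
The iteration described above (walk the buffer's list of dirty inodes, skip those under reclaim, abort on shutdown) might be sketched like this. The structure, flag value, and function name are hypothetical. In this toy the aborted inodes are only marked, whereas the commit says they remove themselves from the buffer list and the AIL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_IRECLAIM (1u << 0)	/* hypothetical flag value */

/* One dirty inode attached to a cluster buffer (simplified). */
struct toy_dirty_inode {
	unsigned int flags;
	bool dirty;
	bool flushed;
	bool aborted;
	struct toy_dirty_inode *next;
};

/*
 * Walk the list of dirty inodes hanging off the buffer. No radix tree
 * lookup and no "starting" inode are needed. Returns how many inodes
 * were flushed.
 */
static int toy_flush_cluster(struct toy_dirty_inode *head, bool shutting_down)
{
	int flushed = 0;

	for (struct toy_dirty_inode *ip = head; ip != NULL; ip = ip->next) {
		/* Only dirty inodes are ever attached to the buffer. */
		assert(ip->dirty);

		/* Reclaim owns this inode; a later flush will pick it up. */
		if (ip->flags & TOY_IRECLAIM)
			continue;

		/* State may be inconsistent during shutdown: abort, don't write. */
		if (shutting_down) {
			ip->aborted = true;
			continue;
		}

		ip->flushed = true;
		flushed++;
	}
	return flushed;
}
```

Because every list member is treated the same way, the function no longer needs a locked inode passed in to decide where to start.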
    • xfs: rename xfs_iflush_int() · e6187b34
      Committed by Dave Chinner
      With xfs_iflush() gone, we can rename xfs_iflush_int() back to
      xfs_iflush(). Also move it up above xfs_iflush_cluster() so we don't
      need the forward definition any more.

      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: xfs_iflush() is no longer necessary · 90c60e16
      Committed by Dave Chinner
      Now we have a cached buffer on inode log items, we don't need
      to do buffer lookups when flushing inodes anymore - all we need
      to do is lock the buffer and we are ready to go.
      
      This largely gets rid of the need for xfs_iflush(), which is
      essentially just a mechanism to look up the buffer and flush the
      inode to it. Instead, we can just call xfs_iflush_cluster() with a
      few modifications to ensure it also flushes the inode we already
      hold locked.
      
      This allows the AIL inode item pushing to be almost entirely
      non-blocking in XFS - we won't block unless memory allocation
      for the cluster inode lookup blocks or the block device queues are
      full.
      
      Writeback during inode reclaim becomes a little more complex because
      we now have to lock the buffer ourselves, but otherwise this change
      is largely a functional no-op that removes a whole lot of code.

      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: attach inodes to the cluster buffer when dirtied · 48d55e2a
      Committed by Dave Chinner
      Rather than attach inodes to the cluster buffer just when we are
      doing IO, attach the inodes to the cluster buffer when they are
      dirtied. This means the buffer always carries a list of dirty inodes
      that reference it, and we can use that list to make more fundamental
      changes to inode writeback that aren't otherwise possible.

      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>