1. 16 Sep 2020, 4 commits
  2. 27 Aug 2020, 2 commits
  3. 29 Jul 2020, 4 commits
  4. 27 May 2020, 1 commit
    • xfs: more lockdep whackamole with kmem_alloc* · 6dcde60e
      Committed by Darrick J. Wong
      Dave Airlie reported the following lockdep complaint:
      
      >  ======================================================
      >  WARNING: possible circular locking dependency detected
      >  5.7.0-0.rc5.20200515git1ae7efb3.1.fc33.x86_64 #1 Not tainted
      >  ------------------------------------------------------
      >  kswapd0/159 is trying to acquire lock:
      >  ffff9b38d01a4470 (&xfs_nondir_ilock_class){++++}-{3:3},
      >  at: xfs_ilock+0xde/0x2c0 [xfs]
      >
      >  but task is already holding lock:
      >  ffffffffbbb8bd00 (fs_reclaim){+.+.}-{0:0}, at:
      >  __fs_reclaim_acquire+0x5/0x30
      >
      >  which lock already depends on the new lock.
      >
      >
      >  the existing dependency chain (in reverse order) is:
      >
      >  -> #1 (fs_reclaim){+.+.}-{0:0}:
      >         fs_reclaim_acquire+0x34/0x40
      >         __kmalloc+0x4f/0x270
      >         kmem_alloc+0x93/0x1d0 [xfs]
      >         kmem_alloc_large+0x4c/0x130 [xfs]
      >         xfs_attr_copy_value+0x74/0xa0 [xfs]
      >         xfs_attr_get+0x9d/0xc0 [xfs]
      >         xfs_get_acl+0xb6/0x200 [xfs]
      >         get_acl+0x81/0x160
      >         posix_acl_xattr_get+0x3f/0xd0
      >         vfs_getxattr+0x148/0x170
      >         getxattr+0xa7/0x240
      >         path_getxattr+0x52/0x80
      >         do_syscall_64+0x5c/0xa0
      >         entry_SYSCALL_64_after_hwframe+0x49/0xb3
      >
      >  -> #0 (&xfs_nondir_ilock_class){++++}-{3:3}:
      >         __lock_acquire+0x1257/0x20d0
      >         lock_acquire+0xb0/0x310
      >         down_write_nested+0x49/0x120
      >         xfs_ilock+0xde/0x2c0 [xfs]
      >         xfs_reclaim_inode+0x3f/0x400 [xfs]
      >         xfs_reclaim_inodes_ag+0x20b/0x410 [xfs]
      >         xfs_reclaim_inodes_nr+0x31/0x40 [xfs]
      >         super_cache_scan+0x190/0x1e0
      >         do_shrink_slab+0x184/0x420
      >         shrink_slab+0x182/0x290
      >         shrink_node+0x174/0x680
      >         balance_pgdat+0x2d0/0x5f0
      >         kswapd+0x21f/0x510
      >         kthread+0x131/0x150
      >         ret_from_fork+0x3a/0x50
      >
      >  other info that might help us debug this:
      >
      >   Possible unsafe locking scenario:
      >
      >         CPU0                    CPU1
      >         ----                    ----
      >    lock(fs_reclaim);
      >                                 lock(&xfs_nondir_ilock_class);
      >                                 lock(fs_reclaim);
      >    lock(&xfs_nondir_ilock_class);
      >
      >   *** DEADLOCK ***
      >
      >  4 locks held by kswapd0/159:
      >   #0: ffffffffbbb8bd00 (fs_reclaim){+.+.}-{0:0}, at:
      >  __fs_reclaim_acquire+0x5/0x30
      >   #1: ffffffffbbb7cef8 (shrinker_rwsem){++++}-{3:3}, at:
      >  shrink_slab+0x115/0x290
      >   #2: ffff9b39f07a50e8
      >  (&type->s_umount_key#56){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0
      >   #3: ffff9b39f077f258
      >  (&pag->pag_ici_reclaim_lock){+.+.}-{3:3}, at:
      >  xfs_reclaim_inodes_ag+0x82/0x410 [xfs]
      
      This is a known false positive because inodes cannot simultaneously be
      getting reclaimed and the target of a getxattr operation, but lockdep
      doesn't know that.  We can (selectively) shut up lockdep by applying a
      stupid GFP_NOLOCKDEP bandaid, at least until either it gets smarter or
      we change inode reclaim not to require the ILOCK.
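      
      As a rough sketch of what that bandaid looks like (illustrative, not
      the verbatim patch; the KM_NOLOCKDEP flag value is assumed), the xfs
      kmem flag conversion gains a bit that maps to __GFP_NOLOCKDEP, so only
      these specific allocations are hidden from lockdep:
      
      #include <linux/gfp.h>
      
      #define KM_NOLOCKDEP    (1u << 4)       /* assumed flag value */
      
      static gfp_t
      kmem_flags_convert(unsigned int km_flags)
      {
              gfp_t lflags = GFP_KERNEL | __GFP_NOWARN;
      
              /* selectively silence the known-false-positive report */
              if (km_flags & KM_NOLOCKDEP)
                      lflags |= __GFP_NOLOCKDEP;
              return lflags;
      }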
      Reported-by: Dave Airlie <airlied@gmail.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Tested-by: Dave Airlie <airlied@gmail.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
  5. 20 May 2020, 4 commits
  6. 19 Mar 2020, 1 commit
  7. 12 Mar 2020, 1 commit
  8. 03 Mar 2020, 5 commits
  9. 10 Jan 2020, 1 commit
  10. 23 Nov 2019, 3 commits
  11. 16 Nov 2019, 1 commit
    • xfs: fix attr leaf header freemap.size underflow · 2a2b5932
      Committed by Brian Foster
      The leaf format xattr addition helper xfs_attr3_leaf_add_work()
      adjusts the block freemap in a couple places. The first update drops
      the size of the freemap that the caller had already selected to
      place the xattr name/value data. Before the function returns, it
      also checks whether the entries array has encroached on a freemap
      range by virtue of the new entry addition. This is necessary because
      the entries array grows from the start of the block (but end of the
      block header) towards the end of the block while the name/value data
      grows from the end of the block in the opposite direction. If the
      associated freemap is already empty, however, size is zero and the
      subtraction underflows the field and causes corruption.
      
      This is reproduced rarely by generic/070. The observed behavior is
      that a smaller sized freemap is aligned to the end of the entries
      list, several subsequent xattr additions land in larger freemaps and
      the entries list expands into the smaller freemap until it is fully
      consumed and then underflows. Note that it is not otherwise a
      corruption for the entries array to consume an empty freemap because
      the nameval list (i.e. the firstused pointer in the xattr header)
      starts beyond the end of the corrupted freemap.
      
      Update the freemap size modification to account for the fact that
      the freemap entry can be empty and thus stale.
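      
      Sketched from that description (the names follow the xfs leaf header
      structures, but this is illustrative rather than the literal patch),
      the adjustment clamps the subtraction so an already-empty entry stays
      at zero instead of wrapping the unsigned field:
      
              /* the entries array grew over freemap[i]; shrink the map from
               * the front, clamping so an empty (stale) entry cannot wrap */
              ichdr.freemap[i].base += sizeof(struct xfs_attr_leaf_entry);
              ichdr.freemap[i].size -= min_t(uint16_t, ichdr.freemap[i].size,
                                             sizeof(struct xfs_attr_leaf_entry));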
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  12. 11 Nov 2019, 4 commits
  13. 05 Nov 2019, 1 commit
  14. 30 Oct 2019, 1 commit
  15. 22 Oct 2019, 1 commit
    • xfs: fix inode fork extent count overflow · 3f8a4f1d
      Committed by Dave Chinner
      [commit message is verbose for discussion purposes - will trim it
      down later. Some questions about implementation details at the end.]
      
      Zorro Lang recently ran a new test to stress single inode extent
      counts now that they are no longer limited by memory allocation.
      The test was simply:
      
      # xfs_io -f -c "falloc 0 40t" /mnt/scratch/big-file
      # ~/src/xfstests-dev/punch-alternating /mnt/scratch/big-file
      
      This test uncovered a problem where the hole punching operation
      appeared to finish with no error, but apparently only created 268M
      extents instead of the 10 billion it was supposed to.
      
      Further, trying to punch out extents that should have been present
      resulted in success, but no change in the extent count. It looked
      like a silent failure.
      
      While running the test and observing the behaviour in real time,
      I observed the extent count growing at ~2M extents/minute, and saw
      this after about an hour:
      
      # xfs_io -f -c "stat" /mnt/scratch/big-file |grep next ; \
      > sleep 60 ; \
      > xfs_io -f -c "stat" /mnt/scratch/big-file |grep next
      fsxattr.nextents = 127657993
      fsxattr.nextents = 129683339
      #
      
      And a few minutes later this:
      
      # xfs_io -f -c "stat" /mnt/scratch/big-file |grep next
      fsxattr.nextents = 4177861124
      #
      
      Ah, what? Where did those 4 billion extra extents suddenly come from?
      
      Stop the workload, unmount, mount:
      
      # xfs_io -f -c "stat" /mnt/scratch/big-file |grep next
      fsxattr.nextents = 166044375
      #
      
      And it's back at the expected number. i.e. the extent count is
      correct on disk, but it's screwed up in memory. I loaded up the
      extent list, and immediately:
      
      # xfs_io -f -c "stat" /mnt/scratch/big-file |grep next
      fsxattr.nextents = 4192576215
      #
      
      It's bad again. So, where does that number come from?
      xfs_fill_fsxattr():
      
                      if (ip->i_df.if_flags & XFS_IFEXTENTS)
                              fa->fsx_nextents = xfs_iext_count(&ip->i_df);
                      else
                              fa->fsx_nextents = ip->i_d.di_nextents;
      
      And that's the behaviour I just saw in a nutshell. The on-disk count
      is correct, but once the tree is loaded into memory, it goes whacky.
      Clearly there's something wrong with xfs_iext_count():
      
      inline xfs_extnum_t xfs_iext_count(struct xfs_ifork *ifp)
      {
              return ifp->if_bytes / sizeof(struct xfs_iext_rec);
      }
      
      Simple enough, but 134M extents is 2**27, and that's right about
      where things went wrong. A struct xfs_iext_rec is 16 bytes in size,
      which means 2**27 * 2**4 = 2**31 and we're right on target for an
      integer overflow. And, sure enough:
      
      struct xfs_ifork {
              int                     if_bytes;       /* bytes in if_u1 */
      ....
      
      Once we get 2**27 extents in a file, we overflow if_bytes and the
      in-core extent count goes wrong. And when we reach 2**28 extents,
      if_bytes wraps back to zero and things really start to go wrong
      there. This is where the silent failure comes from - only the first
      2**28 extents can be looked up directly due to the overflow, all the
      extents above this index wrap back to somewhere in the first 2**28
      extents. Hence with a regular pattern, trying to punch a hole in a
      range that should not have had holes instead mapped to an existing
      hole within the first 2**28 extents, and so "succeeded" without
      changing anything. Hence "silent failure"...
      
      Fix this by converting if_bytes to an int64_t and converting all the
      index variables and size calculations to use int64_t types to avoid
      overflows in future. Signed integers are still used to enable easy
      detection of extent count underflows. This enables scalability of
      extent counts to the limits of the on-disk format - MAXEXTNUM
      (2**31) extents.
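      
      In terms of the two snippets above, the shape of the fix is
      (an abbreviated sketch; the real patch touches many more sites):
      
      struct xfs_ifork {
              int64_t                 if_bytes;       /* bytes in if_u1, was int */
      ....
      
      inline xfs_extnum_t xfs_iext_count(struct xfs_ifork *ifp)
      {
              /* 64-bit arithmetic now: 2**27 16-byte records no longer
               * wrap the byte count at 2**31 */
              return ifp->if_bytes / sizeof(struct xfs_iext_rec);
      }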
      
      Current testing is at over 500M extents and still going:
      
      fsxattr.nextents = 517310478
      Reported-by: Zorro Lang <zlang@redhat.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  16. 09 Oct 2019, 3 commits
    • xfs: move local to extent inode logging into bmap helper · aeea4b75
      Committed by Brian Foster
      The callers of xfs_bmap_local_to_extents_empty() log the inode
      external to the function, yet this function is where the on-disk
      format value is updated. Push the inode logging down into the
      function itself to help prevent future mistakes.
      
      Note that internal bmap callers track the inode logging flags
      independently and thus may log the inode core twice due to this
      change. This is harmless, so leave this code around for consistency
      with the other attr fork conversion functions.
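      
      A sketch of the resulting shape (the newly threaded transaction
      parameter and the logging call are the point here; the fork-release
      details and the exact format-switch helper are elided/assumed):
      
      void
      xfs_bmap_local_to_extents_empty(
              struct xfs_trans        *tp,            /* newly passed in */
              struct xfs_inode        *ip,
              int                     whichfork)
      {
              /* ... free the local fork data and zero if_bytes ... */
              XFS_IFORK_FMT_SET(ip, whichfork, XFS_DINODE_FMT_EXTENTS);
      
              /* log the on-disk format change right where it is made */
              xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
      }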
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: remove broken error handling on failed attr sf to leaf change · 603efebd
      Committed by Brian Foster
      xfs_attr_shortform_to_leaf() attempts to put the shortform fork back
      together after a failed attempt to convert from shortform to leaf
      format. While this code reallocates and copies back the shortform
      attr fork data, it never resets the inode format field back to local
      format. Further, now that the inode is properly logged after the
      initial switch from local format, any error that triggers the
      recovery code will eventually abort the transaction and shut down the
      fs. Therefore, remove the broken and unnecessary error handling
      code.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: log the inode on directory sf to block format change · 0b10d8a8
      Committed by Brian Foster
      When a directory changes from shortform (sf) to block format, the sf
      format is copied to a temporary buffer, the inode format is modified
      and the updated format filled with the dentries from the temporary
      buffer. If the inode format is modified and the attempt to grow the
      inode fails (due to an I/O error, for example), it is possible to
      return an error while leaving the directory in an inconsistent state
      and with an otherwise clean transaction. This results in corruption
      of the associated directory and leads to xfs_dabuf_map() errors as
      subsequent lookups cannot accurately determine the format of the
      directory. This problem is reproduced occasionally by generic/475.
      
      The fundamental problem is that xfs_dir2_sf_to_block() changes the
      on-disk inode format without logging the inode. The inode is
      eventually logged by the bmapi layer in the common case, but error
      checking introduces the possibility of failing the high level
      request before this happens.
      
      Update both of the dir2 and attr callers of
      xfs_bmap_local_to_extents_empty() to log the inode core as
      consistent with the bmap local to extent format change codepath.
      This ensures that any subsequent errors after the format has changed
      cause the transaction to abort.
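      
      A minimal sketch of the caller-side fix in the dir2 path (variable
      names are assumed; the attr fork caller gets the same treatment):
      
              /* in xfs_dir2_sf_to_block(), right after the format change:
               * dirty the inode core so any later error aborts the
               * transaction instead of leaving it clean */
              xfs_bmap_local_to_extents_empty(dp, XFS_DATA_FORK);
              xfs_trans_log_inode(tp, dp, XFS_ILOG_CORE);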
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  17. 31 Aug 2019, 3 commits
    • xfs: allocate xattr buffer on demand · ddbca70c
      Committed by Dave Chinner
      When doing file lookups and checking for permissions, we end up in
      xfs_get_acl() to see if there are any ACLs on the inode. This
      requires an xattr lookup, and to do that we have to supply a buffer
      large enough to hold a maximum-sized xattr.
      
      On workloads where we are accessing a wide range of cache-cold files
      under memory pressure (e.g. NFS fileservers) we end up spending a
      lot of time allocating the buffer. The buffer is 64k in length, so it
      is a contiguous multi-page allocation, and if that fails we
      fall back to vmalloc(). Hence the allocation here is /expensive/
      when we are looking up hundreds of thousands of files a second.
      
      Initial numbers from a bpf trace show average time in xfs_get_acl()
      is ~32us, with ~19us of that in the memory allocation. Note these
      are average times, so they are going to be affected by the worst
      case allocations more than the common fast case...
      
      To avoid this, we could just do a "null" lookup to see if the ACL
      xattr exists and then only do the allocation if it exists. This,
      however, optimises the path for the "no ACL present" case at the
      expense of the "acl present" case. i.e. we can halve the time in
      xfs_get_acl() for the no acl case (i.e down to ~10-15us), but that
      then increases the ACL case by 30% (i.e. up to 40-45us).
      
      To solve this and speed up both cases, drive the xattr buffer
      allocation into the attribute code once we know what the actual
      xattr length is. For the no-xattr case, we avoid the allocation
      completely, speeding up that case. For the common ACL case, we'll
      end up with a fast heap allocation (because it'll be smaller than a
      page), and only for the rarer "we have a remote xattr" will we have
      a multi-page allocation occur. Hence the common ACL case will be
      much faster, too.
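      
      A sketch of the approach (kmem_alloc_large() and the xfs_da_args
      fields are real xfs names, but the body is illustrative, not the
      literal patch); the value-copy helper this lives in is the one
      consolidated in the next entry below:
      
              /* inside the value-copy helper: allocate only once the actual
               * length is known, instead of a 64k worst case up front */
              if (!args->value) {
                      args->value = kmem_alloc_large(valuelen, KM_NOFS);
                      if (!args->value)
                              return -ENOMEM;
              }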
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: consolidate attribute value copying · 9df243a1
      Committed by Dave Chinner
      The same code is used to do the attribute value copying in three
      different places. Consolidate it into a single function in
      preparation for on-demand buffer allocation.
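      
      The consolidated helper plausibly looks like this (semantics inferred
      from what the three call sites must do; treat the body as a sketch):
      
      static int
      xfs_attr_copy_value(
              struct xfs_da_args      *args,
              unsigned char           *value,
              int                     valuelen)
      {
              /* caller's buffer is too small: report the size it needs */
              if (args->valuelen < valuelen) {
                      args->valuelen = valuelen;
                      return -ERANGE;
              }
              args->valuelen = valuelen;
              memcpy(args->value, value, valuelen);
              return 0;
      }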
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: move remote attr retrieval into xfs_attr3_leaf_getvalue · e3cc4554
      Committed by Dave Chinner
      We repeat exactly the same code to get the remote attribute value
      after both calls to xfs_attr3_leaf_getvalue() when it's a remote
      attr. Just do it in xfs_attr3_leaf_getvalue() so the callers don't
      have to care about it.
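      
      A sketch of the tail of xfs_attr3_leaf_getvalue() after the change
      (xfs_attr_rmtval_get() and args->rmtblkno are real xfs names; the
      surrounding logic is paraphrased):
      
              /* the local value has been copied above; remote attrs are
               * now fetched here too, so neither caller duplicates it */
              if (args->rmtblkno)
                      return xfs_attr_rmtval_get(args);
              return 0;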
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>