1. 12 2月, 2019 5 次提交
  2. 20 12月, 2018 5 次提交
  3. 13 12月, 2018 10 次提交
    • O
      xfs: cache minimum realtime summary level · 355e3532
      Omar Sandoval 提交于
      The realtime summary is a two-dimensional array on disk, effectively:
      
      u32 rsum[log2(number of realtime extents) + 1][number of blocks in the bitmap]
      
      rsum[log][bbno] is the number of extents of size 2**log which start in
      bitmap block bbno.
      
      xfs_rtallocate_extent_near() uses xfs_rtany_summary() to check whether
      rsum[log][bbno] != 0 for any log level. However, the summary array is
      stored in row-major order (i.e., like an array in C), so all of these
      entries are not adjacent, but rather spread across the entire summary
      file. In the worst case (a full bitmap block), xfs_rtany_summary() has
      to check every level.
      
      This means that on a moderately-used realtime device, an allocation will
      waste a lot of time finding, reading, and releasing buffers for the
      realtime summary. In particular, one of our storage services (which runs
      on servers with 8 very slow CPUs and 15 8 TB XFS realtime filesystems)
      spends almost 5% of its CPU cycles in xfs_rtbuf_get() and
      xfs_trans_brelse() called from xfs_rtany_summary().
      
      One solution would be to also store the summary with the dimensions
      swapped. However, this would require a disk format change to a very old
      component of XFS.
      
      Instead, we can cache the minimum size which contains any extents. We do
      so lazily; rather than guaranteeing that the cache contains the precise
      minimum, it always contains a loose lower bound which we tighten when we
      read or update a summary block. This only uses a few kilobytes of memory
      and is already serialized via the realtime bitmap and summary inode
      locks, so the cost is minimal. With this change, the same workload only
      spends 0.2% of its CPU cycles in the realtime allocator.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      355e3532
    • D
      xfs: precalculate cluster alignment in inodes and blocks · c1b4a321
      Darrick J. Wong 提交于
      Store the inode cluster alignment information in units of inodes and
      blocks in the mount data so that we don't have to keep recalculating
      them.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      c1b4a321
    • D
      xfs: precalculate inodes and blocks per inode cluster · 83dcdb44
      Darrick J. Wong 提交于
      Store the number of inodes and blocks per inode cluster in the mount
      data so that we don't have to keep recalculating them.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      83dcdb44
    • D
      xfs: add a block to inode count converter · 43004b2a
      Darrick J. Wong 提交于
      Add new helpers to convert units of fs blocks into inodes, and AG blocks
      into AG inodes, respectively.  Convert all the open-coded conversions
      and XFS_OFFBNO_TO_AGINO(, , 0) calls to use them, as appropriate.  The
      OFFBNO_TO_AGINO macro is retained for xfs_repair.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      43004b2a
    • D
      xfs: remove xfs_rmap_ag_owner and friends · 7280feda
      Darrick J. Wong 提交于
      Owner information for static fs metadata can be defined readonly at
      build time because it never changes across filesystems.  This enables us
      to reduce stack usage (particularly in scrub) because we can use the
      statically defined oinfo structures.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      7280feda
    • D
      xfs: const-ify xfs_owner_info arguments · 66e3237e
      Darrick J. Wong 提交于
      Only certain functions actually change the contents of an
      xfs_owner_info; the rest can accept a const struct pointer.  This will
      enable us to save stack space by hoisting static owner info types to
      be const global variables.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      66e3237e
    • D
      xfs: streamline defer op type handling · 02b100fb
      Darrick J. Wong 提交于
      There's no need to bundle a pointer to the defer op type into the defer
      op control structure.  Instead, store the defer op type enum, which
      enables us to shorten some of the lines.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      02b100fb
    • D
      xfs: idiotproof defer op type configuration · bc9f2b7c
      Darrick J. Wong 提交于
      Recently, we forgot to port a new defer op type to xfsprogs, which
      caused us some userspace pain.  Reorganize the way we make libxfs
      clients supply defer op type information so that all type information
      has to be provided at build time instead of risky runtime dynamic
      configuration.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      bc9f2b7c
    • D
      xfs: zero length symlinks are not valid · 43feeea8
      Dave Chinner 提交于
      A log recovery failure has been reproduced where a symlink inode has
      a zero length in extent form. It was caused by a shutdown during a
      combined fstress+fsmark workload.
      
      The underlying problem is the issue in xfs_inactive_symlink(): the
      inode is unlocked between the symlink inactivation/truncation and
      the inode being freed. This opens a window for the inode to be
      written to disk before it xfs_ifree() removes it from the unlinked
      list, marks it free in the inobt and zeros the mode.
      
      For shortform inodes, the fix is simple. xfs_ifree() clears the data
      fork state, so there's no need to do it in xfs_inactive_symlink().
      This means the shortform fork verifier will not see a zero length
      data fork as it mirrors the inode size through to xfs_ifree()), and
      hence if the inode gets written back and the fork verifiers are run
      they will still see a fork that matches the on-disk inode size.
      
      For extent form (remote) symlinks, it is a little more tricky. Here
      we explicitly set the inode size to zero, so the above race can lead
      to zero length symlinks on disk. Because the inode is unlinked at
      this point (i.e. on the unlinked list) and unreferenced, it can
      never be seen again by a user. Hence when we set the inode size to
      zeor, also change the type to S_IFREG. xfs_ifree() expects S_IFREG
      inodes to be of zero length, and so this avoids all the problems of
      zero length symlinks ever hitting the disk. It also avoids the
      problem of needing to handle zero length symlink inodes in log
      recovery to replay the extent free intents and the remaining
      deferops to free the extents the symlink used.
      
      Also add a couple of asserts to warn us if zero length symlinks end
      up in either the symlink create or inactivation paths.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      43feeea8
    • P
      xfs: libxfs: move xfs_perag_put late · fe5ed6c2
      Pan Bian 提交于
      The function xfs_alloc_get_freelist calls xfs_perag_put to drop the
      reference. However, pag->pagf_btreeblks is read and written after the
      put operation. This patch moves the put operation later.
      Signed-off-by: NPan Bian <bianpan2016@163.com>
      Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com>
      [darrick: minor changelog edits]
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      fe5ed6c2
  4. 05 12月, 2018 1 次提交
  5. 22 11月, 2018 1 次提交
    • D
      xfs: delalloc -> unwritten COW fork allocation can go wrong · 9230a0b6
      Dave Chinner 提交于
      Long saga. There have been days spent following this through dead end
      after dead end in multi-GB event traces. This morning, after writing
      a trace-cmd wrapper that enabled me to be more selective about XFS
      trace points, I discovered that I could get just enough essential
      tracepoints enabled that there was a 50:50 chance the fsx config
      would fail at ~115k ops. If it didn't fail at op 115547, I stopped
      fsx at op 115548 anyway.
      
      That gave me two traces - one where the problem manifested, and one
      where it didn't. After refining the traces to have the necessary
      information, I found that in the failing case there was a real
      extent in the COW fork compared to an unwritten extent in the
      working case.
      
      Walking back through the two traces to the point where the CWO fork
      extents actually diverged, I found that the bad case had an extra
      unwritten extent in it. This is likely because the bug it led me to
      had triggered multiple times in those 115k ops, leaving stray
      COW extents around. What I saw was a COW delalloc conversion to an
      unwritten extent (as they should always be through
      xfs_iomap_write_allocate()) resulted in a /written extent/:
      
      xfs_writepage:        dev 259:0 ino 0x83 pgoff 0x17000 size 0x79a00 offset 0 length 0
      xfs_iext_remove:      dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/2 offset 32 block 152 count 20 flag 1 caller xfs_bmap_add_extent_delay_real
      xfs_bmap_pre_update:  dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/1 offset 1 block 4503599627239429 count 31 flag 0 caller xfs_bmap_add_extent_delay_real
      xfs_bmap_post_update: dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/1 offset 1 block 121 count 51 flag 0 caller xfs_bmap_add_ex
      
      Basically, Cow fork before:
      
      	0 1            32          52
      	+H+DDDDDDDDDDDD+UUUUUUUUUUU+
      	   PREV		RIGHT
      
      COW delalloc conversion allocates:
      
      	  1	       32
      	  +uuuuuuuuuuuu+
      	  NEW
      
      And the result according to the xfs_bmap_post_update trace was:
      
      	0 1            32          52
      	+H+wwwwwwwwwwwwwwwwwwwwwwww+
      	   PREV
      
      Which is clearly wrong - it should be a merged unwritten extent,
      not an unwritten extent.
      
      That lead me to look at the LEFT_FILLING|RIGHT_FILLING|RIGHT_CONTIG
      case in xfs_bmap_add_extent_delay_real(), and sure enough, there's
      the bug.
      
      It takes the old delalloc extent (PREV) and adds the length of the
      RIGHT extent to it, takes the start block from NEW, removes the
      RIGHT extent and then updates PREV with the new extent.
      
      What it fails to do is update PREV.br_state. For delalloc, this is
      always XFS_EXT_NORM, while in this case we are converting the
      delayed allocation to unwritten, so it needs to be updated to
      XFS_EXT_UNWRITTEN. This LF|RF|RC case does not do this, and so
      the resultant extent is always written.
      
      And that's the bug I've been chasing for a week - a bmap btree bug,
      not a reflink/dedupe/copy_file_range bug, but a BMBT bug introduced
      with the recent in core extent tree scalability enhancements.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      9230a0b6
  6. 21 11月, 2018 1 次提交
    • D
      xfs: finobt AG reserves don't consider last AG can be a runt · c0876897
      Dave Chinner 提交于
      The last AG may be very small comapred to all other AGs, and hence
      AG reservations based on the superblock AG size may actually consume
      more space than the AG actually has. This results on assert failures
      like:
      
      XFS: Assertion failed: xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved + xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved <= pag->pagf_freeblks + pag->pagf_flcount, file: fs/xfs/libxfs/xfs_ag_resv.c, line: 319
      [   48.932891]  xfs_ag_resv_init+0x1bd/0x1d0
      [   48.933853]  xfs_fs_reserve_ag_blocks+0x37/0xb0
      [   48.934939]  xfs_mountfs+0x5b3/0x920
      [   48.935804]  xfs_fs_fill_super+0x462/0x640
      [   48.936784]  ? xfs_test_remount_options+0x60/0x60
      [   48.937908]  mount_bdev+0x178/0x1b0
      [   48.938751]  mount_fs+0x36/0x170
      [   48.939533]  vfs_kern_mount.part.43+0x54/0x130
      [   48.940596]  do_mount+0x20e/0xcb0
      [   48.941396]  ? memdup_user+0x3e/0x70
      [   48.942249]  ksys_mount+0xba/0xd0
      [   48.943046]  __x64_sys_mount+0x21/0x30
      [   48.943953]  do_syscall_64+0x54/0x170
      [   48.944835]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Hence we need to ensure the finobt per-ag space reservations take
      into account the size of the last AG rather than treat it like all
      the other full size AGs.
      
      Note that both refcountbt and rmapbt already take the size of the AG
      into account via reading the AGF length directly.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      c0876897
  7. 06 11月, 2018 1 次提交
    • D
      xfs: fix overflow in xfs_attr3_leaf_verify · 837514f7
      Dave Chinner 提交于
      generic/070 on 64k block size filesystems is failing with a verifier
      corruption on writeback or an attribute leaf block:
      
      [   94.973083] XFS (pmem0): Metadata corruption detected at xfs_attr3_leaf_verify+0x246/0x260, xfs_attr3_leaf block 0x811480
      [   94.975623] XFS (pmem0): Unmount and run xfs_repair
      [   94.976720] XFS (pmem0): First 128 bytes of corrupted metadata buffer:
      [   94.978270] 000000004b2e7b45: 00 00 00 00 00 00 00 00 3b ee 00 00 00 00 00 00  ........;.......
      [   94.980268] 000000006b1db90b: 00 00 00 00 00 81 14 80 00 00 00 00 00 00 00 00  ................
      [   94.982251] 00000000433f2407: 22 7b 5c 82 2d 5c 47 4c bb 31 1c 37 fa a9 ce d6  "{\.-\GL.1.7....
      [   94.984157] 0000000010dc7dfb: 00 00 00 00 00 81 04 8a 00 0a 18 e8 dd 94 01 00  ................
      [   94.986215] 00000000d5a19229: 00 a0 dc f4 fe 98 01 68 f0 d8 07 e0 00 00 00 00  .......h........
      [   94.988171] 00000000521df36c: 0c 2d 32 e2 fe 20 01 00 0c 2d 58 65 fe 0c 01 00  .-2.. ...-Xe....
      [   94.990162] 000000008477ae06: 0c 2d 5b 66 fe 8c 01 00 0c 2d 71 35 fe 7c 01 00  .-[f.....-q5.|..
      [   94.992139] 00000000a4a6bca6: 0c 2d 72 37 fc d4 01 00 0c 2d d8 b8 f0 90 01 00  .-r7.....-......
      [   94.994789] XFS (pmem0): xfs_do_force_shutdown(0x8) called from line 1453 of file fs/xfs/xfs_buf.c. Return address = ffffffff815365f3
      
      This is failing this check:
      
                      end = ichdr.freemap[i].base + ichdr.freemap[i].size;
                      if (end < ichdr.freemap[i].base)
      >>>>>                   return __this_address;
                      if (end > mp->m_attr_geo->blksize)
                              return __this_address;
      
      And from the buffer output above, the freemap array is:
      
      	freemap[0].base = 0x00a0
      	freemap[0].size = 0xdcf4	end = 0xdd94
      	freemap[1].base = 0xfe98
      	freemap[1].size = 0x0168	end = 0x10000
      	freemap[2].base = 0xf0d8
      	freemap[2].size = 0x07e0	end = 0xf8b8
      
      These all look valid - the block size is 0x10000 and so from the
      last check in the above verifier fragment we know that the end
      of freemap[1] is valid. The problem is that end is declared as:
      
      	uint16_t	end;
      
      And (uint16_t)0x10000 = 0. So we have a verifier bug here, not a
      corruption. Fix the verifier to use uint32_t types for the check and
      hence avoid the overflow.
      
      Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=201577Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      837514f7
  8. 18 10月, 2018 5 次提交
  9. 01 10月, 2018 1 次提交
    • D
      xfs: fix error handling in xfs_bmap_extents_to_btree · e55ec4dd
      Dave Chinner 提交于
      Commit 01239d77 ("xfs: fix a null pointer dereference in
      xfs_bmap_extents_to_btree") attempted to fix a null pointer
      dreference when a fuzzing corruption of some kind was found.
      This fix was flawed, resulting in assert failures like:
      
      XFS: Assertion failed: ifp->if_broot == NULL, file: fs/xfs/libxfs/xfs_bmap.c, line: 715
      .....
      Call Trace:
        xfs_bmap_extents_to_btree+0x6b9/0x7b0
        __xfs_bunmapi+0xae7/0xf00
        ? xfs_log_reserve+0x1c8/0x290
        xfs_reflink_remap_extent+0x20b/0x620
        xfs_reflink_remap_blocks+0x7e/0x290
        xfs_reflink_remap_range+0x311/0x530
        vfs_dedupe_file_range_one+0xd7/0xe0
        vfs_dedupe_file_range+0x15b/0x1a0
        do_vfs_ioctl+0x267/0x6c0
      
      The problem is that the error handling code now asserts that the
      inode fork is not in btree format before the error handling code
      undoes the modifications that put the fork back in extent format.
      Fix this by moving the assert back to after the xfs_iroot_realloc()
      call that returns the fork to extent format, and clean up the jump
      labels to be meaningful.
      
      Also, returning ENOSPC when xfs_btree_get_bufl() fails to
      instantiate the buffer that was allocated (the actual fix in the
      commit mentioned above) is incorrect. This is a fatal error - only
      an invalid block address or a filesystem shutdown can result in
      failing to get a buffer here.
      
      Hence change this to EFSCORRUPTED so that the higher layer knows
      this was a corruption related failure and should not treat it as an
      ENOSPC error.  This should result in a shutdown (via cancelling a
      dirty transaction) which is necessary as we do not attempt to clean
      up the (invalid) block that we have already allocated.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      e55ec4dd
  10. 29 9月, 2018 3 次提交
  11. 12 8月, 2018 1 次提交
  12. 08 8月, 2018 1 次提交
  13. 03 8月, 2018 5 次提交