1. 20 12月, 2018 2 次提交
  2. 19 12月, 2018 3 次提交
    • N
      xfs: Fix x32 ioctls when cmd numbers differ from ia32. · a9d25bde
      Nick Bowler 提交于
      Several ioctl structs change size between native 32-bit (ia32) and x32
      applications, because x32 follows the native 64-bit (amd64) integer
      alignment rules and uses 64-bit time_t.  In these instances, the ioctl
      number changes so userspace simply gets -ENOTTY.  This scenario can be
      handled by simply adding more cases.
      
      Looking at the different ioctls implemented here:
      
      - All the ones marked 'No size or alignment issue on any arch' should
        presumably all be fine.
      
      - All the ones under BROKEN_X86_ALIGNMENT are different under integer
        alignment rules.  Since x32 matches amd64 here, we just need both
        sets of cases handled.
      
      - XFS_IOC_SWAPEXT has both integer alignment differences and time_t
        differences.  Since x32 matches amd64 here, we need to add a case
        which calls the native implementation.
      
      - The remaining ioctls have neither 64-bit integers nor time_t, so
        x32 matches ia32 here and no change is required at this level.  The
        bulkstat ioctl implementations have some pointer chasing which is
        handled separately.
      Signed-off-by: NNick Bowler <nbowler@draconx.ca>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      a9d25bde
    • N
      xfs: Fix bulkstat compat ioctls on x32 userspace. · 7ca860e3
      Nick Bowler 提交于
      The bulkstat family of ioctls are problematic on x32, because there is
      a mixup of native 32-bit and 64-bit conventions.  The xfs_fsop_bulkreq
      struct contains pointers and 32-bit integers so that matches the native
      32-bit layout, and that means the ioctl implementation goes into the
      regular compat path on x32.
      
      However, the 'ubuffer' member of that struct in turn refers to either
      struct xfs_inogrp or xfs_bstat (or an array of these).  On x32, those
      structures match the native 64-bit layout.  The compat implementation
      writes out the 32-bit version of these structures.  This is not the
      expected format for x32 userspace, causing problems.
      
      Fortunately the functions which actually output these xfs_inogrp and
      xfs_bstat structures have an easy way to select which output format
      is required, so we just need a little tweak to select the right format
      on x32.
      Signed-off-by: NNick Bowler <nbowler@draconx.ca>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      7ca860e3
    • N
      xfs: Align compat attrlist_by_handle with native implementation. · c456d644
      Nick Bowler 提交于
      While inspecting the ioctl implementations, I noticed that the compat
      implementation of XFS_IOC_ATTRLIST_BY_HANDLE does not do exactly the
      same thing as the native implementation.  Specifically, the "cursor"
      does not appear to be written out to userspace on the compat path,
      like it is on the native path.
      
      This adjusts the compat implementation to copy out the cursor just
      like the native implementation does.  The attrlist cursor does not
      require any special compat handling.  This fixes xfstests xfs/269
      on both IA-32 and x32 userspace, when running on an amd64 kernel.
      Signed-off-by: NNick Bowler <nbowler@draconx.ca>
      Fixes: 0facef7f ("xfs: in _attrlist_by_handle, copy the cursor back to userspace")
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      c456d644
  3. 14 12月, 2018 1 次提交
  4. 13 12月, 2018 13 次提交
    • O
      xfs: cache minimum realtime summary level · 355e3532
      Omar Sandoval 提交于
      The realtime summary is a two-dimensional array on disk, effectively:
      
      u32 rsum[log2(number of realtime extents) + 1][number of blocks in the bitmap]
      
      rsum[log][bbno] is the number of extents of size 2**log which start in
      bitmap block bbno.
      
      xfs_rtallocate_extent_near() uses xfs_rtany_summary() to check whether
      rsum[log][bbno] != 0 for any log level. However, the summary array is
      stored in row-major order (i.e., like an array in C), so all of these
      entries are not adjacent, but rather spread across the entire summary
      file. In the worst case (a full bitmap block), xfs_rtany_summary() has
      to check every level.
      
      This means that on a moderately-used realtime device, an allocation will
      waste a lot of time finding, reading, and releasing buffers for the
      realtime summary. In particular, one of our storage services (which runs
      on servers with 8 very slow CPUs and 15 8 TB XFS realtime filesystems)
      spends almost 5% of its CPU cycles in xfs_rtbuf_get() and
      xfs_trans_brelse() called from xfs_rtany_summary().
      
      One solution would be to also store the summary with the dimensions
      swapped. However, this would require a disk format change to a very old
      component of XFS.
      
      Instead, we can cache the minimum size which contains any extents. We do
      so lazily; rather than guaranteeing that the cache contains the precise
      minimum, it always contains a loose lower bound which we tighten when we
      read or update a summary block. This only uses a few kilobytes of memory
      and is already serialized via the realtime bitmap and summary inode
      locks, so the cost is minimal. With this change, the same workload only
      spends 0.2% of its CPU cycles in the realtime allocator.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      355e3532
    • D
      xfs: count inode blocks correctly in inobt scrub · 2c2d9d3a
      Darrick J. Wong 提交于
      A big block filesystem might require more than one inobt record to cover
      all the inodes in the block.  In these cases it is not correct to round
      the irec count up to the nearest block because this causes us to
      overestimate the number of inode blocks we expect to find.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      2c2d9d3a
    • D
      xfs: precalculate cluster alignment in inodes and blocks · c1b4a321
      Darrick J. Wong 提交于
      Store the inode cluster alignment information in units of inodes and
      blocks in the mount data so that we don't have to keep recalculating
      them.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      c1b4a321
    • D
      xfs: precalculate inodes and blocks per inode cluster · 83dcdb44
      Darrick J. Wong 提交于
      Store the number of inodes and blocks per inode cluster in the mount
      data so that we don't have to keep recalculating them.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      83dcdb44
    • D
      xfs: add a block to inode count converter · 43004b2a
      Darrick J. Wong 提交于
      Add new helpers to convert units of fs blocks into inodes, and AG blocks
      into AG inodes, respectively.  Convert all the open-coded conversions
      and XFS_OFFBNO_TO_AGINO(, , 0) calls to use them, as appropriate.  The
      OFFBNO_TO_AGINO macro is retained for xfs_repair.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      43004b2a
    • D
      xfs: remove xfs_rmap_ag_owner and friends · 7280feda
      Darrick J. Wong 提交于
      Owner information for static fs metadata can be defined readonly at
      build time because it never changes across filesystems.  This enables us
      to reduce stack usage (particularly in scrub) because we can use the
      statically defined oinfo structures.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      7280feda
    • D
      xfs: const-ify xfs_owner_info arguments · 66e3237e
      Darrick J. Wong 提交于
      Only certain functions actually change the contents of an
      xfs_owner_info; the rest can accept a const struct pointer.  This will
      enable us to save stack space by hoisting static owner info types to
      be const global variables.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      66e3237e
    • D
      xfs: streamline defer op type handling · 02b100fb
      Darrick J. Wong 提交于
      There's no need to bundle a pointer to the defer op type into the defer
      op control structure.  Instead, store the defer op type enum, which
      enables us to shorten some of the lines.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      02b100fb
    • D
      xfs: idiotproof defer op type configuration · bc9f2b7c
      Darrick J. Wong 提交于
      Recently, we forgot to port a new defer op type to xfsprogs, which
      caused us some userspace pain.  Reorganize the way we make libxfs
      clients supply defer op type information so that all type information
      has to be provided at build time instead of risky runtime dynamic
      configuration.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      bc9f2b7c
    • D
      xfs: zero length symlinks are not valid · 43feeea8
      Dave Chinner 提交于
      A log recovery failure has been reproduced where a symlink inode has
      a zero length in extent form. It was caused by a shutdown during a
      combined fstress+fsmark workload.
      
      The underlying problem is the issue in xfs_inactive_symlink(): the
      inode is unlocked between the symlink inactivation/truncation and
      the inode being freed. This opens a window for the inode to be
      written to disk before it xfs_ifree() removes it from the unlinked
      list, marks it free in the inobt and zeros the mode.
      
      For shortform inodes, the fix is simple. xfs_ifree() clears the data
      fork state, so there's no need to do it in xfs_inactive_symlink().
      This means the shortform fork verifier will not see a zero length
      data fork as it mirrors the inode size through to xfs_ifree()), and
      hence if the inode gets written back and the fork verifiers are run
      they will still see a fork that matches the on-disk inode size.
      
      For extent form (remote) symlinks, it is a little more tricky. Here
      we explicitly set the inode size to zero, so the above race can lead
      to zero length symlinks on disk. Because the inode is unlinked at
      this point (i.e. on the unlinked list) and unreferenced, it can
      never be seen again by a user. Hence when we set the inode size to
      zeor, also change the type to S_IFREG. xfs_ifree() expects S_IFREG
      inodes to be of zero length, and so this avoids all the problems of
      zero length symlinks ever hitting the disk. It also avoids the
      problem of needing to handle zero length symlink inodes in log
      recovery to replay the extent free intents and the remaining
      deferops to free the extents the symlink used.
      
      Also add a couple of asserts to warn us if zero length symlinks end
      up in either the symlink create or inactivation paths.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      43feeea8
    • C
      xfs: clean up indentation issues, remove an unwanted space · 8c4ce794
      Colin Ian King 提交于
      There is a statement that has an unwanted space in the indentation.
      Remove it.
      Signed-off-by: NColin Ian King <colin.king@canonical.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      8c4ce794
    • P
      xfs: libxfs: move xfs_perag_put late · fe5ed6c2
      Pan Bian 提交于
      The function xfs_alloc_get_freelist calls xfs_perag_put to drop the
      reference. However, pag->pagf_btreeblks is read and written after the
      put operation. This patch moves the put operation later.
      Signed-off-by: NPan Bian <bianpan2016@163.com>
      Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com>
      [darrick: minor changelog edits]
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      fe5ed6c2
    • D
      xfs: split up the xfs_reflink_end_cow work into smaller transactions · d6f215f3
      Darrick J. Wong 提交于
      In xfs_reflink_end_cow, we allocate a single transaction for the entire
      end_cow operation and then loop the CoW fork mappings to move them to
      the data fork.  This design fails on a heavily fragmented filesystem
      where an inode's data fork has exactly one more extent than would fit in
      an extents-format fork, because the unmap can collapse the data fork
      into extents format (freeing the bmbt block) but the remap can expand
      the data fork back into a (newly allocated) bmbt block.  If the number
      of extents we end up remapping is large, we can overflow the block
      reservation because we reserved blocks assuming that we were adding
      mappings into an already-cleared area of the data fork.
      
      Let's say we have 8 extents in the data fork, 8 extents in the CoW fork,
      and the data fork can hold at most 7 extents before needing to convert
      to btree format; and that blocks A-P are discontiguous single-block
      extents:
      
         0......7
      D: ABCDEFGH
      C: IJKLMNOP
      
      When a write to file blocks 0-7 completes, we must remap I-P into the
      data fork.  We start by removing H from the btree-format data fork.  Now
      we have 7 extents, so we convert the fork to extents format, freeing the
      bmbt block.   We then move P into the data fork and it now has 8 extents
      again.  We must convert the data fork back to btree format, requiring a
      block allocation.  If we repeat this sequence for blocks 6-5-4-3-2-1-0,
      we'll need a total of 8 block allocations to remap all 8 blocks.  We
      reserved only enough blocks to handle one btree split (5 blocks on a 4k
      block filesystem), which means we overflow the block reservation.
      
      To fix this issue, create a separate helper function to remap a single
      extent, and change _reflink_end_cow to call it in a tight loop over the
      entire range we're completing.  As a side effect this also removes the
      size restrictions on how many extents we can end_cow at a time, though
      nobody ever hit that.  It is not reasonable to reserve N blocks to remap
      N blocks.
      
      Note that this can be reproduced after ~320 million fsx ops while
      running generic/938 (long soak directio fsx exerciser):
      
      XFS: Assertion failed: tp->t_blk_res >= tp->t_blk_res_used, file: fs/xfs/xfs_trans.c, line: 116
      <machine registers snipped>
      Call Trace:
       xfs_trans_dup+0x211/0x250 [xfs]
       xfs_trans_roll+0x6d/0x180 [xfs]
       xfs_defer_trans_roll+0x10c/0x3b0 [xfs]
       xfs_defer_finish_noroll+0xdf/0x740 [xfs]
       xfs_defer_finish+0x13/0x70 [xfs]
       xfs_reflink_end_cow+0x2c6/0x680 [xfs]
       xfs_dio_write_end_io+0x115/0x220 [xfs]
       iomap_dio_complete+0x3f/0x130
       iomap_dio_rw+0x3c3/0x420
       xfs_file_dio_aio_write+0x132/0x3c0 [xfs]
       xfs_file_write_iter+0x8b/0xc0 [xfs]
       __vfs_write+0x193/0x1f0
       vfs_write+0xba/0x1c0
       ksys_write+0x52/0xc0
       do_syscall_64+0x50/0x160
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      d6f215f3
  5. 05 12月, 2018 2 次提交
  6. 27 11月, 2018 1 次提交
  7. 22 11月, 2018 2 次提交
    • D
      xfs: delalloc -> unwritten COW fork allocation can go wrong · 9230a0b6
      Dave Chinner 提交于
      Long saga. There have been days spent following this through dead end
      after dead end in multi-GB event traces. This morning, after writing
      a trace-cmd wrapper that enabled me to be more selective about XFS
      trace points, I discovered that I could get just enough essential
      tracepoints enabled that there was a 50:50 chance the fsx config
      would fail at ~115k ops. If it didn't fail at op 115547, I stopped
      fsx at op 115548 anyway.
      
      That gave me two traces - one where the problem manifested, and one
      where it didn't. After refining the traces to have the necessary
      information, I found that in the failing case there was a real
      extent in the COW fork compared to an unwritten extent in the
      working case.
      
      Walking back through the two traces to the point where the CWO fork
      extents actually diverged, I found that the bad case had an extra
      unwritten extent in it. This is likely because the bug it led me to
      had triggered multiple times in those 115k ops, leaving stray
      COW extents around. What I saw was a COW delalloc conversion to an
      unwritten extent (as they should always be through
      xfs_iomap_write_allocate()) resulted in a /written extent/:
      
      xfs_writepage:        dev 259:0 ino 0x83 pgoff 0x17000 size 0x79a00 offset 0 length 0
      xfs_iext_remove:      dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/2 offset 32 block 152 count 20 flag 1 caller xfs_bmap_add_extent_delay_real
      xfs_bmap_pre_update:  dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/1 offset 1 block 4503599627239429 count 31 flag 0 caller xfs_bmap_add_extent_delay_real
      xfs_bmap_post_update: dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/1 offset 1 block 121 count 51 flag 0 caller xfs_bmap_add_ex
      
      Basically, Cow fork before:
      
      	0 1            32          52
      	+H+DDDDDDDDDDDD+UUUUUUUUUUU+
      	   PREV		RIGHT
      
      COW delalloc conversion allocates:
      
      	  1	       32
      	  +uuuuuuuuuuuu+
      	  NEW
      
      And the result according to the xfs_bmap_post_update trace was:
      
      	0 1            32          52
      	+H+wwwwwwwwwwwwwwwwwwwwwwww+
      	   PREV
      
      Which is clearly wrong - it should be a merged unwritten extent,
      not an unwritten extent.
      
      That lead me to look at the LEFT_FILLING|RIGHT_FILLING|RIGHT_CONTIG
      case in xfs_bmap_add_extent_delay_real(), and sure enough, there's
      the bug.
      
      It takes the old delalloc extent (PREV) and adds the length of the
      RIGHT extent to it, takes the start block from NEW, removes the
      RIGHT extent and then updates PREV with the new extent.
      
      What it fails to do is update PREV.br_state. For delalloc, this is
      always XFS_EXT_NORM, while in this case we are converting the
      delayed allocation to unwritten, so it needs to be updated to
      XFS_EXT_UNWRITTEN. This LF|RF|RC case does not do this, and so
      the resultant extent is always written.
      
      And that's the bug I've been chasing for a week - a bmap btree bug,
      not a reflink/dedupe/copy_file_range bug, but a BMBT bug introduced
      with the recent in core extent tree scalability enhancements.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      9230a0b6
    • D
      xfs: flush removing page cache in xfs_reflink_remap_prep · 2c307174
      Dave Chinner 提交于
      On a sub-page block size filesystem, fsx is failing with a data
      corruption after a series of operations involving copying a file
      with the destination offset beyond EOF of the destination of the file:
      
      8093(157 mod 256): TRUNCATE DOWN        from 0x7a120 to 0x50000 ******WWWW
      8094(158 mod 256): INSERT 0x25000 thru 0x25fff  (0x1000 bytes)
      8095(159 mod 256): COPY 0x18000 thru 0x1afff    (0x3000 bytes) to 0x2f400
      8096(160 mod 256): WRITE    0x5da00 thru 0x651ff        (0x7800 bytes) HOLE
      8097(161 mod 256): COPY 0x2000 thru 0x5fff      (0x4000 bytes) to 0x6fc00
      
      The second copy here is beyond EOF, and it is to sub-page (4k) but
      block aligned (1k) offset. The clone runs the EOF zeroing, landing
      in a pre-existing post-eof delalloc extent. This zeroes the post-eof
      extents in the page cache just fine, dirtying the pages correctly.
      
      The problem is that xfs_reflink_remap_prep() now truncates the page
      cache over the range that it is copying it to, and rounds that down
      to cover the entire start page. This removes the dirty page over the
      delalloc extent from the page cache without having written it back.
      Hence later, when the page cache is flushed, the page at offset
      0x6f000 has not been written back and hence exposes stale data,
      which fsx trips over less than 10 operations later.
      
      Fix this by changing xfs_reflink_remap_prep() to use
      xfs_flush_unmap_range().
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      2c307174
  8. 21 11月, 2018 4 次提交
    • D
      xfs: extent shifting doesn't fully invalidate page cache · 7f9f71be
      Dave Chinner 提交于
      The extent shifting code uses a flush and invalidate mechainsm prior
      to shifting extents around. This is similar to what
      xfs_free_file_space() does, but it doesn't take into account things
      like page cache vs block size differences, and it will fail if there
      is a page that it currently busy.
      
      xfs_flush_unmap_range() handles all of these cases, so just convert
      xfs_prepare_shift() to us that mechanism rather than having it's own
      special sauce.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      7f9f71be
    • D
      xfs: finobt AG reserves don't consider last AG can be a runt · c0876897
      Dave Chinner 提交于
      The last AG may be very small comapred to all other AGs, and hence
      AG reservations based on the superblock AG size may actually consume
      more space than the AG actually has. This results on assert failures
      like:
      
      XFS: Assertion failed: xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved + xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved <= pag->pagf_freeblks + pag->pagf_flcount, file: fs/xfs/libxfs/xfs_ag_resv.c, line: 319
      [   48.932891]  xfs_ag_resv_init+0x1bd/0x1d0
      [   48.933853]  xfs_fs_reserve_ag_blocks+0x37/0xb0
      [   48.934939]  xfs_mountfs+0x5b3/0x920
      [   48.935804]  xfs_fs_fill_super+0x462/0x640
      [   48.936784]  ? xfs_test_remount_options+0x60/0x60
      [   48.937908]  mount_bdev+0x178/0x1b0
      [   48.938751]  mount_fs+0x36/0x170
      [   48.939533]  vfs_kern_mount.part.43+0x54/0x130
      [   48.940596]  do_mount+0x20e/0xcb0
      [   48.941396]  ? memdup_user+0x3e/0x70
      [   48.942249]  ksys_mount+0xba/0xd0
      [   48.943046]  __x64_sys_mount+0x21/0x30
      [   48.943953]  do_syscall_64+0x54/0x170
      [   48.944835]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Hence we need to ensure the finobt per-ag space reservations take
      into account the size of the last AG rather than treat it like all
      the other full size AGs.
      
      Note that both refcountbt and rmapbt already take the size of the AG
      into account via reading the AGF length directly.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      c0876897
    • D
      xfs: fix transient reference count error in xfs_buf_resubmit_failed_buffers · d43aaf16
      Dave Chinner 提交于
      When retrying a failed inode or dquot buffer,
      xfs_buf_resubmit_failed_buffers() clears all the failed flags from
      the inde/dquot log items. In doing so, it also drops all the
      reference counts on the buffer that the failed log items hold. This
      means it can drop all the active references on the buffer and hence
      free the buffer before it queues it for write again.
      
      Putting the buffer on the delwri queue takes a reference to the
      buffer (so that it hangs around until it has been written and
      completed), but this goes bang if the buffer has already been freed.
      
      Hence we need to add the buffer to the delwri queue before we remove
      the failed flags from the log items attached to the buffer to ensure
      it always remains referenced during the resubmit process.
      Reported-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      d43aaf16
    • D
      xfs: uncached buffer tracing needs to print bno · d61fa8cb
      Dave Chinner 提交于
      Useless:
      
      xfs_buf_get_uncached: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
      xfs_buf_unlock:       dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
      xfs_buf_submit:       dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
      xfs_buf_hold:         dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
      xfs_buf_iowait:       dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
      xfs_buf_iodone:       dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
      xfs_buf_iowait_done:  dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
      xfs_buf_rele:         dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
      
      Useful:
      
      
      xfs_buf_get_uncached: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
      xfs_buf_unlock:       dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
      xfs_buf_submit:       dev 253:32 bno 0x200b5 nblks 0x1 ...
      xfs_buf_hold:         dev 253:32 bno 0x200b5 nblks 0x1 ...
      xfs_buf_iowait:       dev 253:32 bno 0x200b5 nblks 0x1 ...
      xfs_buf_iodone:       dev 253:32 bno 0x200b5 nblks 0x1 ...
      xfs_buf_iowait_done:  dev 253:32 bno 0x200b5 nblks 0x1 ...
      xfs_buf_rele:         dev 253:32 bno 0x200b5 nblks 0x1 ...
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      d61fa8cb
  9. 20 11月, 2018 2 次提交
    • E
      xfs: make xfs_file_remap_range() static · da034bcc
      Eric Biggers 提交于
      xfs_file_remap_range() is only used in fs/xfs/xfs_file.c, so make it
      static.
      
      This addresses a gcc warning when -Wmissing-prototypes is enabled.
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      da034bcc
    • B
      xfs: fix shared extent data corruption due to missing cow reservation · 59e42931
      Brian Foster 提交于
      Page writeback indirectly handles shared extents via the existence
      of overlapping COW fork blocks. If COW fork blocks exist, writeback
      always performs the associated copy-on-write regardless if the
      underlying blocks are actually shared. If the blocks are shared,
      then overlapping COW fork blocks must always exist.
      
      fstests shared/010 reproduces a case where a buffered write occurs
      over a shared block without performing the requisite COW fork
      reservation.  This ultimately causes writeback to the shared extent
      and data corruption that is detected across md5 checks of the
      filesystem across a mount cycle.
      
      The problem occurs when a buffered write lands over a shared extent
      that crosses an extent size hint boundary and that also happens to
      have a partial COW reservation that doesn't cover the start and end
      blocks of the data fork extent.
      
      For example, a buffered write occurs across the file offset (in FSB
      units) range of [29, 57]. A shared extent exists at blocks [29, 35]
      and COW reservation already exists at blocks [32, 34]. After
      accommodating a COW extent size hint of 32 blocks and the existing
      reservation at offset 32, xfs_reflink_reserve_cow() allocates 32
      blocks of reservation at offset 0 and returns with COW reservation
      across the range of [0, 34]. The associated data fork extent is
      still [29, 35], however, which isn't fully covered by the COW
      reservation.
      
      This leads to a buffered write at file offset 35 over a shared
      extent without associated COW reservation. Writeback eventually
      kicks in, performs an overwrite of the underlying shared block and
      causes the associated data corruption.
      
      Update xfs_reflink_reserve_cow() to accommodate the fact that a
      delalloc allocation request may not fully cover the extent in the
      data fork. Trim the data fork extent appropriately, just as is done
      for shared extent boundaries and/or existing COW reservations that
      happen to overlap the start of the data fork extent. This prevents
      shared/010 failures due to data corruption on reflink enabled
      filesystems.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      59e42931
  10. 06 11月, 2018 3 次提交
    • D
      xfs: fix overflow in xfs_attr3_leaf_verify · 837514f7
      Dave Chinner 提交于
      generic/070 on 64k block size filesystems is failing with a verifier
      corruption on writeback or an attribute leaf block:
      
      [   94.973083] XFS (pmem0): Metadata corruption detected at xfs_attr3_leaf_verify+0x246/0x260, xfs_attr3_leaf block 0x811480
      [   94.975623] XFS (pmem0): Unmount and run xfs_repair
      [   94.976720] XFS (pmem0): First 128 bytes of corrupted metadata buffer:
      [   94.978270] 000000004b2e7b45: 00 00 00 00 00 00 00 00 3b ee 00 00 00 00 00 00  ........;.......
      [   94.980268] 000000006b1db90b: 00 00 00 00 00 81 14 80 00 00 00 00 00 00 00 00  ................
      [   94.982251] 00000000433f2407: 22 7b 5c 82 2d 5c 47 4c bb 31 1c 37 fa a9 ce d6  "{\.-\GL.1.7....
      [   94.984157] 0000000010dc7dfb: 00 00 00 00 00 81 04 8a 00 0a 18 e8 dd 94 01 00  ................
      [   94.986215] 00000000d5a19229: 00 a0 dc f4 fe 98 01 68 f0 d8 07 e0 00 00 00 00  .......h........
      [   94.988171] 00000000521df36c: 0c 2d 32 e2 fe 20 01 00 0c 2d 58 65 fe 0c 01 00  .-2.. ...-Xe....
      [   94.990162] 000000008477ae06: 0c 2d 5b 66 fe 8c 01 00 0c 2d 71 35 fe 7c 01 00  .-[f.....-q5.|..
      [   94.992139] 00000000a4a6bca6: 0c 2d 72 37 fc d4 01 00 0c 2d d8 b8 f0 90 01 00  .-r7.....-......
      [   94.994789] XFS (pmem0): xfs_do_force_shutdown(0x8) called from line 1453 of file fs/xfs/xfs_buf.c. Return address = ffffffff815365f3
      
      This is failing this check:
      
                      end = ichdr.freemap[i].base + ichdr.freemap[i].size;
                      if (end < ichdr.freemap[i].base)
      >>>>>                   return __this_address;
                      if (end > mp->m_attr_geo->blksize)
                              return __this_address;
      
      And from the buffer output above, the freemap array is:
      
      	freemap[0].base = 0x00a0
      	freemap[0].size = 0xdcf4	end = 0xdd94
      	freemap[1].base = 0xfe98
      	freemap[1].size = 0x0168	end = 0x10000
      	freemap[2].base = 0xf0d8
      	freemap[2].size = 0x07e0	end = 0xf8b8
      
      These all look valid - the block size is 0x10000 and so from the
      last check in the above verifier fragment we know that the end
      of freemap[1] is valid. The problem is that end is declared as:
      
      	uint16_t	end;
      
      And (uint16_t)0x10000 = 0. So we have a verifier bug here, not a
      corruption. Fix the verifier to use uint32_t types for the check and
      hence avoid the overflow.
      
      Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=201577Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      837514f7
    • D
      xfs: print buffer offsets when dumping corrupt buffers · bdec055b
      Darrick J. Wong 提交于
      Use DUMP_PREFIX_OFFSET when printing hex dumps of corrupt buffers
      because modern Linux now prints a 32-bit hash of our 64-bit pointer when
      using DUMP_PREFIX_ADDRESS:
      
      00000000b4bb4297: 00 00 00 00 00 00 00 00 3b ee 00 00 00 00 00 00  ........;.......
      00000005ec77e26: 00 00 00 00 02 d0 5a 00 00 00 00 00 00 00 00 00  ......Z.........
      000000015938018: 21 98 e8 b4 fd de 4c 07 bc ea 3c e5 ae b4 7c 48  !.....L...<...|H
      
      This is totally worthless for a sequential dump since we probably only
      care about tracking the buffer offsets and afaik there's no way to
      recover the actual pointer from the hashed value.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      bdec055b
    • C
      xfs: Fix error code in 'xfs_ioc_getbmap()' · 132bf672
      Christophe JAILLET 提交于
      In this function, once 'buf' has been allocated, we unconditionally
      return 0.
      However, 'error' is set to some error codes in several error handling
      paths.
      Before commit 232b5194 ("xfs: simplify the xfs_getbmap interface")
      this was not an issue because all error paths were returning directly,
      but now that some cleanup at the end may be needed, we must propagate the
      error code.
      
      Fixes: 232b5194 ("xfs: simplify the xfs_getbmap interface")
      Signed-off-by: NChristophe JAILLET <christophe.jaillet@wanadoo.fr>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      132bf672
  11. 30 10月, 2018 7 次提交