1. 07 Jul, 2020 1 commit
  2. 13 Apr, 2020 1 commit
  3. 27 Jan, 2020 1 commit
  4. 21 Jan, 2020 1 commit
  5. 15 Jan, 2020 1 commit
  6. 24 Oct, 2019 1 commit
    • B
      xfs: don't set bmapi total block req where minleft is · da781e64
      Brian Foster committed
      xfs_bmapi_write() takes a total block requirement parameter that is
      passed down to the block allocation code and is used to specify the
      total block requirement of the associated transaction. This is used
      to try and select an AG that can not only satisfy the requested
      extent allocation, but can also accommodate subsequent allocations
      that might be required to complete the transaction. For example,
      additional bmbt block allocations may be required on insertion of
      the resulting extent to an inode data fork.
      
      While it's important for callers to calculate and reserve such extra
      blocks in the transaction, it is not necessary to pass the total
      value to xfs_bmapi_write() in all cases. The latter automatically
      sets minleft to ensure that sufficient free blocks remain after the
      allocation attempt to expand the format of the associated inode
      (e.g., extent to btree conversion, btree splits, etc.).
      Therefore, any callers that pass a total block requirement of the
      bmap mapping length plus worst case bmbt expansion essentially
      specify the additional reservation requirement twice. These callers
      can pass a total of zero to rely on the bmapi minleft policy.
      
      Beyond being superfluous, the primary motivation for this change is
      that the total reservation logic in the bmbt code is dubious in
      scenarios where minlen < maxlen and a maxlen extent cannot be
      allocated (which is more common for data extent allocations where
      contiguity is not required). The total value is based on maxlen in
      the xfs_bmapi_write() caller. If the bmbt code falls back to an
      allocation between minlen and maxlen, that allocation will not
      succeed until total is reset to minlen, which essentially throws
      away any additional reservation included in total by the caller. In
      addition, the total value is not reset until after alignment is
      dropped, which means that such callers drop alignment far more
      aggressively than necessary.
      
      Update all callers of xfs_bmapi_write() that pass a total block
      value of the mapping length plus bmbt reservation to instead pass
      zero and rely on xfs_bmapi_minleft() to enforce the bmbt reservation
      requirement. This trades off slightly less conservative AG selection
      for the ability to preserve alignment in more scenarios.
      xfs_bmapi_write() callers that incorporate unrelated or additional
      reservations in total beyond what is already included in minleft
      must continue to use the former.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  7. 22 Oct, 2019 3 commits
  8. 21 Oct, 2019 1 commit
  9. 28 Aug, 2019 2 commits
  10. 19 Aug, 2019 1 commit
    • D
      xfs: fix reflink source file racing with directio writes · 5d888b48
      Darrick J. Wong committed
      While trawling through the dedupe file comparison code trying to fix
      page deadlocking problems, Dave Chinner noticed that the reflink code
      only takes shared IOLOCK/MMAPLOCKs on the source file.  Because
      page_mkwrite and directio writes do not take the EXCL versions of those
      locks, this means that reflink can race with writer processes.
      
      For pure remapping this can lead to undefined behavior and file
      corruption; for dedupe this means that we cannot be sure that the
      contents are identical when we decide to go ahead with the remapping.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
  11. 01 Jul, 2019 1 commit
  12. 29 Jun, 2019 1 commit
  13. 26 Feb, 2019 2 commits
  14. 21 Feb, 2019 4 commits
    • C
      xfs: introduce an always_cow mode · 66ae56a5
      Christoph Hellwig committed
      Add a mode where XFS never overwrites existing blocks in place.  This
      is to aid debugging our COW code, and also put infrastructure in place
      for things like possible future support for zoned block devices, which
      can't support overwrites.
      
      This mode is enabled globally by doing a:
      
          echo 1 > /sys/fs/xfs/debug/always_cow
      
      Note that the parameter is global to allow running all tests in xfstests
      easily in this mode, which would not easily be possible with a per-fs
      sysfs file.
      
      In always_cow mode persistent preallocations are disabled, and fallocate
      will fail when called with a 0 mode (with or without
      FALLOC_FL_KEEP_SIZE), and will not create unwritten extents for zeroed space
      when called with FALLOC_FL_ZERO_RANGE or FALLOC_FL_UNSHARE_RANGE.
      
      There are a few interesting xfstests failures when run in always_cow
      mode:
      
       - generic/392 fails because the bytes used in the file used to test
         hole punch recovery are less after the log replay.  This is
         because the blocks written and then punched out are only freed
         with a delay due to the logging mechanism.
       - xfs/170 will fail as the already fragile file streams mechanism
         doesn't seem to interact well with the COW allocator
       - xfs/180 xfs/182 xfs/192 xfs/198 xfs/204 and xfs/208 will claim
         the file system is badly fragmented, but there is not much we
         can do to avoid that when always writing out of place
       - xfs/205 fails because overwriting a file in always_cow mode
         will require new space allocation and the assumptions in the
         test thus don't hold anymore.
       - xfs/326 fails to modify the file at all in always_cow mode after
         injecting the refcount error, leading to an unexpected md5sum
         after the remount, but that again is expected
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • C
      xfs: make COW fork unwritten extent conversions more robust · 26b91c72
      Christoph Hellwig committed
      If we have racing buffered and direct I/O, COW fork extents under
      writeback can have been moved to the data fork by the time we call
      xfs_reflink_convert_cow from xfs_submit_ioend.  This would be mostly
      harmless as the block numbers don't change by this move, except for
      the fact that xfs_bmapi_write will crash or trigger asserts when
      not finding existing extents, even despite trying to paper over this
      with the XFS_BMAPI_CONVERT_ONLY flag.
      
      Instead of special casing non-transaction conversions in the already
      way too complicated xfs_bmapi_write just add a new helper for the much
      simpler non-transactional COW fork case, which simply ignores not
      found extents.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • C
      xfs: merge COW handling into xfs_file_iomap_begin_delay · db46e604
      Christoph Hellwig committed
      Besides simplifying the code a bit, this allows us to actually implement
      the behavior of using COW preallocation for non-COW data mentioned
      in the current comments.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • C
      xfs: don't use delalloc extents for COW on files with extsize hints · 78f0cc9d
      Christoph Hellwig committed
      While using delalloc for extsize hints is generally a good idea, the
      current code that does so only for COW doesn't help us much and creates
      a lot of special cases.  Switch it to use real allocations like we
      do for direct I/O.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  15. 18 Feb, 2019 1 commit
  16. 13 Dec, 2018 1 commit
    • D
      xfs: split up the xfs_reflink_end_cow work into smaller transactions · d6f215f3
      Darrick J. Wong committed
      In xfs_reflink_end_cow, we allocate a single transaction for the entire
      end_cow operation and then loop over the CoW fork mappings to move them to
      the data fork.  This design fails on a heavily fragmented filesystem
      where an inode's data fork has exactly one more extent than would fit in
      an extents-format fork, because the unmap can collapse the data fork
      into extents format (freeing the bmbt block) but the remap can expand
      the data fork back into a (newly allocated) bmbt block.  If the number
      of extents we end up remapping is large, we can overflow the block
      reservation because we reserved blocks assuming that we were adding
      mappings into an already-cleared area of the data fork.
      
      Let's say we have 8 extents in the data fork, 8 extents in the CoW fork,
      and the data fork can hold at most 7 extents before needing to convert
      to btree format; and that blocks A-P are discontiguous single-block
      extents:
      
         0......7
      D: ABCDEFGH
      C: IJKLMNOP
      
      When a write to file blocks 0-7 completes, we must remap I-P into the
      data fork.  We start by removing H from the btree-format data fork.  Now
      we have 7 extents, so we convert the fork to extents format, freeing the
      bmbt block.   We then move P into the data fork and it now has 8 extents
      again.  We must convert the data fork back to btree format, requiring a
      block allocation.  If we repeat this sequence for blocks 6-5-4-3-2-1-0,
      we'll need a total of 8 block allocations to remap all 8 blocks.  We
      reserved only enough blocks to handle one btree split (5 blocks on a 4k
      block filesystem), which means we overflow the block reservation.
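
The arithmetic in the example above can be checked with a small toy model. This is only a sketch under stated assumptions: the function name `count_bmbt_allocations` and the constant `MAX_EXTENTS_FMT` are hypothetical, every extent is a discontiguous single block (so no merging occurs), and the only cost counted is the bmbt block needed each time the fork converts back to btree format. It does not model the real XFS code paths.

```python
# Toy model of the unmap/remap sequence described in the commit message.
# Names here are hypothetical and chosen for illustration only.

MAX_EXTENTS_FMT = 7  # data fork holds at most 7 extents in extents format


def count_bmbt_allocations(data_extents: int, cow_extents: int) -> int:
    """Count bmbt block allocations needed to remap every CoW extent."""
    allocations = 0
    btree_format = data_extents > MAX_EXTENTS_FMT
    for _ in range(cow_extents):
        # Unmap one data fork extent. The fork may collapse into extents
        # format, which frees its bmbt block.
        data_extents -= 1
        if btree_format and data_extents <= MAX_EXTENTS_FMT:
            btree_format = False
        # Map the CoW extent in its place. The fork may have to convert
        # back to btree format, which needs a newly allocated bmbt block.
        data_extents += 1
        if not btree_format and data_extents > MAX_EXTENTS_FMT:
            btree_format = True
            allocations += 1
    return allocations


needed = count_bmbt_allocations(data_extents=8, cow_extents=8)
print(needed)  # 8 allocations, more than the 5 blocks reserved
```

With 8 data fork extents and a 7-extent limit, every remap crosses the format boundary twice, so the cost grows with the number of extents remapped rather than staying within a single btree split.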
      
      To fix this issue, create a separate helper function to remap a single
      extent, and change _reflink_end_cow to call it in a tight loop over the
      entire range we're completing.  As a side effect this also removes the
      size restrictions on how many extents we can end_cow at a time, though
      nobody ever hit that.  It is not reasonable to reserve N blocks to remap
      N blocks.
      
      Note that this can be reproduced after ~320 million fsx ops while
      running generic/938 (long soak directio fsx exerciser):
      
      XFS: Assertion failed: tp->t_blk_res >= tp->t_blk_res_used, file: fs/xfs/xfs_trans.c, line: 116
      <machine registers snipped>
      Call Trace:
       xfs_trans_dup+0x211/0x250 [xfs]
       xfs_trans_roll+0x6d/0x180 [xfs]
       xfs_defer_trans_roll+0x10c/0x3b0 [xfs]
       xfs_defer_finish_noroll+0xdf/0x740 [xfs]
       xfs_defer_finish+0x13/0x70 [xfs]
       xfs_reflink_end_cow+0x2c6/0x680 [xfs]
       xfs_dio_write_end_io+0x115/0x220 [xfs]
       iomap_dio_complete+0x3f/0x130
       iomap_dio_rw+0x3c3/0x420
       xfs_file_dio_aio_write+0x132/0x3c0 [xfs]
       xfs_file_write_iter+0x8b/0xc0 [xfs]
       __vfs_write+0x193/0x1f0
       vfs_write+0xba/0x1c0
       ksys_write+0x52/0xc0
       do_syscall_64+0x50/0x160
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
  17. 22 Nov, 2018 1 commit
    • D
      xfs: flush removing page cache in xfs_reflink_remap_prep · 2c307174
      Dave Chinner committed
      On a sub-page block size filesystem, fsx is failing with a data
      corruption after a series of operations involving copying a file
      with the destination offset beyond EOF of the destination file:
      
      8093(157 mod 256): TRUNCATE DOWN        from 0x7a120 to 0x50000 ******WWWW
      8094(158 mod 256): INSERT 0x25000 thru 0x25fff  (0x1000 bytes)
      8095(159 mod 256): COPY 0x18000 thru 0x1afff    (0x3000 bytes) to 0x2f400
      8096(160 mod 256): WRITE    0x5da00 thru 0x651ff        (0x7800 bytes) HOLE
      8097(161 mod 256): COPY 0x2000 thru 0x5fff      (0x4000 bytes) to 0x6fc00
      
      The second copy here is beyond EOF, and it is to a sub-page (4k) but
      block aligned (1k) offset. The clone runs the EOF zeroing, landing
      in a pre-existing post-eof delalloc extent. This zeroes the post-eof
      extents in the page cache just fine, dirtying the pages correctly.
      
      The problem is that xfs_reflink_remap_prep() now truncates the page
      cache over the range that it is copying it to, and rounds that down
      to cover the entire start page. This removes the dirty page over the
      delalloc extent from the page cache without having written it back.
      Hence later, when the page cache is flushed, the page at offset
      0x6f000 has not been written back and hence exposes stale data,
      which fsx trips over less than 10 operations later.
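
The offsets in this description can be verified with a few lines of arithmetic. This is a sketch only: the helper `page_start` is a hypothetical name, and the 4k page and 1k block sizes are taken from the commit message.

```python
# Offset arithmetic for the second COPY in the fsx trace above.
PAGE_SIZE = 4096   # 4k page, per the commit message
BLOCK_SIZE = 1024  # 1k filesystem block, per the commit message


def page_start(offset: int) -> int:
    """Round a byte offset down to the start of its page."""
    return offset & ~(PAGE_SIZE - 1)


dest = 0x6fc00  # destination offset of the second COPY

block_aligned = dest % BLOCK_SIZE == 0
page_aligned = dest % PAGE_SIZE == 0
start = page_start(dest)

print(block_aligned, page_aligned)  # True False
print(hex(start))                   # 0x6f000
print(hex(dest - start))            # 0xc00
```

The 0xc00 bytes between the page start and the copy destination sit in the page that the rounded-down truncation discards, which matches the page at offset 0x6f000 named in the commit message.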
      
      Fix this by changing xfs_reflink_remap_prep() to use
      xfs_flush_unmap_range().
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  18. 20 Nov, 2018 1 commit
    • B
      xfs: fix shared extent data corruption due to missing cow reservation · 59e42931
      Brian Foster committed
      Page writeback indirectly handles shared extents via the existence
      of overlapping COW fork blocks. If COW fork blocks exist, writeback
      always performs the associated copy-on-write regardless of whether the
      underlying blocks are actually shared. If the blocks are shared,
      then overlapping COW fork blocks must always exist.
      
      fstests shared/010 reproduces a case where a buffered write occurs
      over a shared block without performing the requisite COW fork
      reservation.  This ultimately causes writeback to the shared extent
      and data corruption that is detected across md5 checks of the
      filesystem across a mount cycle.
      
      The problem occurs when a buffered write lands over a shared extent
      that crosses an extent size hint boundary and that also happens to
      have a partial COW reservation that doesn't cover the start and end
      blocks of the data fork extent.
      
      For example, a buffered write occurs across the file offset (in FSB
      units) range of [29, 57]. A shared extent exists at blocks [29, 35]
      and COW reservation already exists at blocks [32, 34]. After
      accommodating a COW extent size hint of 32 blocks and the existing
      reservation at offset 32, xfs_reflink_reserve_cow() allocates 32
      blocks of reservation at offset 0 and returns with COW reservation
      across the range of [0, 34]. The associated data fork extent is
      still [29, 35], however, which isn't fully covered by the COW
      reservation.
      
      This leads to a buffered write at file offset 35 over a shared
      extent without associated COW reservation. Writeback eventually
      kicks in, performs an overwrite of the underlying shared block and
      causes the associated data corruption.
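
The block numbers in this example can be traced with a toy model. This is a sketch under stated assumptions: the helper names are hypothetical, ranges are inclusive FSB ranges, and deriving the new allocation by rounding the write offset down to the extent size hint is an assumption made for illustration (the commit message only states that 32 blocks are allocated at offset 0).

```python
# Toy model of the COW reservation coverage gap; not the XFS implementation.
EXTSIZE_HINT = 32  # COW extent size hint in blocks, per the commit message


def uncovered_shared_blocks(shared, cow_ranges):
    """Return the shared blocks that have no COW reservation."""
    covered = set()
    for start, end in cow_ranges:
        covered.update(range(start, end + 1))
    return [b for b in range(shared[0], shared[1] + 1) if b not in covered]


def trim_to_reservation(shared, cow_ranges):
    """Trim the data fork extent to its leading run of covered blocks."""
    missing = uncovered_shared_blocks(shared, cow_ranges)
    if not missing:
        return shared
    return (shared[0], missing[0] - 1)


shared_extent = (29, 35)  # shared data fork extent
existing_cow = (32, 34)   # pre-existing COW reservation

# Assumed: the new allocation starts at the write offset rounded down to
# the hint and spans one hint's worth of blocks.
new_start = (29 // EXTSIZE_HINT) * EXTSIZE_HINT
new_cow = (new_start, new_start + EXTSIZE_HINT - 1)

cow = [new_cow, existing_cow]
print(new_cow)                                      # (0, 31)
print(uncovered_shared_blocks(shared_extent, cow))  # [35]
print(trim_to_reservation(shared_extent, cow))      # (29, 34)
```

Block 35 is the shared block left without reservation, and trimming the data fork extent to [29, 34] keeps the write from treating it as covered.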
      
      Update xfs_reflink_reserve_cow() to accommodate the fact that a
      delalloc allocation request may not fully cover the extent in the
      data fork. Trim the data fork extent appropriately, just as is done
      for shared extent boundaries and/or existing COW reservations that
      happen to overlap the start of the data fork extent. This prevents
      shared/010 failures due to data corruption on reflink enabled
      filesystems.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  19. 30 Oct, 2018 12 commits
  20. 18 Oct, 2018 3 commits