1. 26 Apr, 2017 5 commits
  2. 09 Mar, 2017 2 commits
    • xfs: try any AG when allocating the first btree block when reflinking · 2fcc319d
      Christoph Hellwig committed
      When a reflink operation causes the bmap code to allocate a btree block
      we're currently doing single-AG allocations due to having ->firstblock
      set and then try any higher AG due to a little reflink quirk we've put in
      when adding the reflink code.  But given that we do not have a minleft
      reservation of any kind in this AG we can still not have any space in
      the same or higher AG even if the file system has enough free space.
      To fix this use an XFS_ALLOCTYPE_FIRST_AG allocation in this fallback
      path instead.
      
      [And yes, we need to redo this properly instead of piling hacks over
       hacks.  I'm working on that, but it's not going to be a small series.
       In the meantime this fixes the customer reported issue]
      
      Also add a warning for failing allocations to make it easier to debug.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: use iomap new flag for newly allocated delalloc blocks · f65e6fad
      Brian Foster committed
      Commit fa7f138a ("xfs: clear delalloc and cache on buffered write
      failure") fixed one regression in the iomap error handling code and
      exposed another. The fundamental problem is that if a buffered write
      is a rewrite of preexisting delalloc blocks and the write fails, the
      failure handling code can punch out preexisting blocks with valid
      file data.
      
      This was reproduced directly by sub-block writes in the LTP
      kernel/syscalls/write/write03 test. A first 100 byte write allocates
      a single block in a file. A subsequent 100 byte write fails and
      punches out the block, including the data successfully written by
      the previous write.
      
      To address this problem, update the ->iomap_begin() handler to
      distinguish newly allocated delalloc blocks from preexisting
      delalloc blocks via the IOMAP_F_NEW flag. Use this flag in the
      ->iomap_end() handler to decide when a failed or short write should
      punch out delalloc blocks.
      
      This introduces the subtle requirement that ->iomap_begin() should
      never combine newly allocated delalloc blocks with existing blocks
      in the resulting iomap descriptor. This can occur when a new
      delalloc reservation merges with a neighboring extent that is part
      of the current write, for example. Therefore, drop the
      post-allocation extent lookup from xfs_bmapi_reserve_delalloc() and
      just return the record inserted into the fork. This ensures only new
      blocks are returned and thus that preexisting delalloc blocks are
      always handled as "found" blocks and not punched out on a failed
      rewrite.
      Reported-by: Xiong Zhou <xzhou@redhat.com>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
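
      The punch-out decision this commit describes can be sketched in plain C.
      This is a simplified userspace model, not the kernel code: the struct,
      the flag value and the function name are hypothetical stand-ins for the
      iomap descriptor, IOMAP_F_NEW and the ->iomap_end() handler.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's IOMAP_F_NEW flag (value is illustrative). */
#define SKETCH_IOMAP_F_NEW 0x01u

/* Hypothetical, heavily reduced model of an iomap descriptor. */
struct sketch_iomap {
    unsigned int flags;
};

/*
 * Decide whether the end-of-write handler should punch out delalloc blocks.
 * Only a failed or short write over blocks that this write newly allocated
 * qualifies; preexisting ("found") delalloc blocks are left alone.
 */
static bool sketch_should_punch(const struct sketch_iomap *map,
                                size_t length, size_t written)
{
    bool short_write = written < length;
    bool newly_allocated = (map->flags & SKETCH_IOMAP_F_NEW) != 0;

    return short_write && newly_allocated;
}
```

      In this model the failed second 100 byte write from the write03 case
      sees a mapping without the flag, so the block holding the first write's
      data is not punched out.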
  3. 17 Feb, 2017 3 commits
    • xfs: tune down agno asserts in the bmap code · 410d17f6
      Christoph Hellwig committed
      In various places we currently assert that xfs_bmap_btalloc allocates
      from the same AG as the firstblock value passed in, unless it's either
      NULLAGNO or the dop_low flag is set.  But the reflink code does not
      fully follow this convention as it passes in firstblock purely as
      a hint for the allocator without actually having previous allocations
      in the transaction, and without having a minleft check on the current
      AG, leading to the assert firing on a very full and heavily used
      file system.  As even the reflink code only allocates from equal or
      higher AGs for now we can simplify the check to always allow for equal
      or higher AGs.
      
      Note that we need to eventually split the two meanings of the firstblock
      value.  At that point we can also allow the reflink code to allocate
      from any AG instead of limiting it in any way.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: split indlen reservations fairly when under reserved · 75d65361
      Brian Foster committed
      Certain workloads that punch holes into speculative preallocation can
      cause delalloc indirect reservation splits when the delalloc extent is
      split in two. If further splits occur, an already short-handed extent
      can be split into two in a manner that leaves zero indirect blocks for
      one of the two new extents. This occurs because the shortage is large
      enough that the xfs_bmap_split_indlen() algorithm completely drains the
      requested indlen of one of the extents before it honors the existing
      reservation.
      
      This ultimately results in a warning from xfs_bmap_del_extent(). This
      has been observed during file copies of large, sparse files using 'cp
      --sparse=always'.
      
      To avoid this problem, update xfs_bmap_split_indlen() to explicitly
      apply the reservation shortage fairly between both extents. This smooths
      out the overall indlen shortage and defers the situation where we end up
      with a delalloc extent with zero indlen reservation to extreme
      circumstances.
      Reported-by: Patrick Dung <mpatdung@gmail.com>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
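
      One way to picture "applying the shortage fairly" is a proportional
      split. The sketch below is an illustration under that assumption and is
      not the actual xfs_bmap_split_indlen() code; the function name is
      hypothetical, and the multiplication assumes block counts small enough
      not to overflow 64 bits.

```c
#include <stdint.h>

/*
 * Split `avail` reserved blocks between two extents that would like
 * *len1 and *len2 blocks. When there is a shortage, each extent gets a
 * share proportional to its request instead of one extent being drained.
 */
static void sketch_split_fair(uint64_t avail, uint64_t *len1, uint64_t *len2)
{
    uint64_t want1 = *len1;
    uint64_t want2 = *len2;
    uint64_t total = want1 + want2;

    if (total <= avail)
        return;                 /* no shortage, both requests are honored */

    /* Scale both requests down by the same ratio (rounding down). */
    uint64_t got1 = want1 * avail / total;
    uint64_t got2 = want2 * avail / total;
    uint64_t left = avail - got1 - got2;

    /* Hand out blocks lost to rounding, never exceeding a request. */
    while (left > 0) {
        if (got1 < want1) {
            got1++;
            left--;
        }
        if (left > 0 && got2 < want2) {
            got2++;
            left--;
        }
    }

    *len1 = got1;
    *len2 = got2;
}
```

      With 3 blocks available for two extents that each want 4, this yields 2
      and 1 rather than 3 and 0. An extent only ends up with zero when the
      reservation is nearly exhausted, which matches the "extreme
      circumstances" wording above.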
    • xfs: handle indlen shortage on delalloc extent merge · 0e339ef8
      Brian Foster committed
      When a delalloc extent is created, it can be merged with pre-existing,
      contiguous, delalloc extents. When this occurs,
      xfs_bmap_add_extent_hole_delay() merges the extents along with the
      associated indirect block reservations. The expectation here is that the
      combined worst case indlen reservation is always less than or equal to
      the indlen reservation for the individual extents.
      
      This is not always the case, however, as existing extents can have less than
      the expected indlen reservation if the extent was previously split due
      to a hole punch. If a new extent merges with such an extent, the total
      indlen requirement may be larger than the sum of the indlen reservations
      held by both extents.
      
      xfs_bmap_add_extent_hole_delay() assumes that the worst case indlen
      reservation is always available and assigns it to the merged extent
      without consideration for the indlen held by the pre-existing extent. As
      a result, the subsequent xfs_mod_fdblocks() call can attempt an
      unintentional allocation rather than a free (indicated by an ASSERT()
      failure). Further, if the allocation happens to fail in this context,
      the failure goes unhandled and creates a filesystem wide block
      accounting inconsistency.
      
      Fix xfs_bmap_add_extent_hole_delay() to function as designed. Cap the
      indlen reservation assigned to the merged extent to the sum of the
      indlen reservations held by each of the individual extents.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
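
      The cap described in the last paragraph of this commit amounts to taking
      a minimum. A minimal sketch, with a hypothetical function name and plain
      integers standing in for the kernel's block-count types:

```c
#include <stdint.h>

/*
 * Reservation to assign to a merged delalloc extent: the worst-case
 * requirement, capped to what the two extents actually hold between them.
 */
static uint64_t sketch_merged_indlen(uint64_t worst_case,
                                     uint64_t held_old, uint64_t held_new)
{
    uint64_t held = held_old + held_new;

    return worst_case < held ? worst_case : held;
}
```

      Because the result never exceeds the held total, the difference between
      the two is always a non-negative number of blocks to free, so the
      follow-up accounting call cannot turn into an allocation.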
  4. 07 Feb, 2017 1 commit
    • xfs: go straight to real allocations for direct I/O COW writes · a14234c7
      Christoph Hellwig committed
      When we allocate COW fork blocks for direct I/O writes we currently first
      create a delayed allocation, and then convert it to a real allocation
      once we've got the delayed one.
      
      As there is no good reason for that, this patch instead makes us call
      xfs_bmapi_write from the COW allocation path.  The only interesting bits
      are a few tweaks to the low-level allocator to allow for this, most notably
      the need to remove the call to xfs_bmap_extsize_align for the cowextsize
      in xfs_bmap_btalloc - for the existing convert case it's a no-op, but
      for the direct allocation case it would blow up our block reservation
      way beyond what we reserved for the transaction.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  5. 03 Feb, 2017 2 commits
    • xfs: allow unwritten extents in the CoW fork · 05a630d7
      Darrick J. Wong committed
      In the data fork, we only allow extents to perform the following state
      transitions:
      
      delay -> real <-> unwritten
      
      There's no way to move directly from a delalloc reservation to an
      /unwritten/ allocated extent.  However, for the CoW fork we want to be
      able to do the following to each extent:
      
      delalloc -> unwritten -> written -> remapped to data fork
      
      This will help us to avoid a race in the speculative CoW preallocation
      code between a first thread that is allocating a CoW extent and a second
      thread that is remapping part of a file after a write.  In order to do
      this, however, we need two things: first, we have to be able to
      transition from delalloc to unwritten, and second the function that converts
      between real and unwritten has to be made aware of the CoW fork.  Do
      both of those things.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
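
      The two transition diagrams in this commit can be written down as a
      small validity check. This is an illustrative model only: the enum and
      function names are hypothetical, and it assumes the CoW fork keeps the
      data fork transitions and merely adds delalloc -> unwritten. The final
      "remapped to data fork" step moves the extent between forks and is not
      modeled as a state.

```c
#include <stdbool.h>

enum sketch_state { SK_DELALLOC, SK_REAL, SK_UNWRITTEN };
enum sketch_fork { SK_DATA_FORK, SK_COW_FORK };

/* Return true if an extent in `fork` may move from state `from` to `to`. */
static bool sketch_transition_ok(enum sketch_fork fork,
                                 enum sketch_state from,
                                 enum sketch_state to)
{
    /* delay -> real, allowed in both forks */
    if (from == SK_DELALLOC && to == SK_REAL)
        return true;

    /* real <-> unwritten, allowed in both forks */
    if ((from == SK_REAL && to == SK_UNWRITTEN) ||
        (from == SK_UNWRITTEN && to == SK_REAL))
        return true;

    /* delalloc -> unwritten, the transition this commit adds for CoW */
    if (fork == SK_COW_FORK && from == SK_DELALLOC && to == SK_UNWRITTEN)
        return true;

    return false;
}
```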
    • xfs: filter out obviously bad btree pointers · d5a91bae
      Darrick J. Wong committed
      Don't let anybody load an obviously bad btree pointer.  Since the values
      come from disk, we must return an error, not just ASSERT.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
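
      The point about returning an error instead of asserting can be shown
      with a small userspace sketch. The names, the null sentinel and the
      range check are assumptions made for illustration; the kernel uses its
      own pointer checks and its own corruption error code, for which EINVAL
      stands in here.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical "no block" sentinel for this sketch. */
#define SKETCH_NULL_PTR UINT64_MAX

/*
 * Validate a block pointer read from disk. On-disk values are untrusted
 * input, so a bad value is reported to the caller as an error rather
 * than being treated as a programming bug with an assertion.
 */
static int sketch_check_btree_ptr(uint64_t ptr, uint64_t nblocks)
{
    if (ptr == SKETCH_NULL_PTR || ptr >= nblocks)
        return -EINVAL;

    return 0;
}
```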
  6. 31 Jan, 2017 2 commits
  7. 26 Jan, 2017 1 commit
  8. 24 Jan, 2017 1 commit
    • xfs: fix COW writeback race · d2b3964a
      Christoph Hellwig committed
      Due to the way xfs_iomap_write_allocate tries to convert the whole
      found extents from delalloc to real space we can run into a race
      condition with multiple threads doing writes to the same extent.
      For the non-COW case that is harmless as the only thing that can happen
      is that we call xfs_bmapi_write on an extent that has already been
      converted to a real allocation.  For COW writes where we move the extent
      from the COW to the data fork after I/O completion the race is, however,
      not quite as harmless.  In the worst case we are now calling
      xfs_bmapi_write on a region that contains a hole in the COW fork, which
      will trip up an assert in debug builds or lead to file system corruption
      in non-debug builds.  This seems to be reproducible with workloads of
      small O_DSYNC writes, although so far I've not managed to come up with
      an isolated reproducer.
      
      The fix for the issue is relatively simple:  tell xfs_bmapi_write
      that we are only asked to convert delayed allocations and skip holes
      in that case.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
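
      The fix described above, converting only delayed allocations and
      skipping holes, can be modeled on an array of extent kinds. This is a
      toy model with hypothetical names, not the xfs_bmapi_write interface.

```c
#include <stdbool.h>
#include <stddef.h>

enum wb_kind { WB_HOLE, WB_DELALLOC, WB_REAL };

/*
 * Turn extents into real allocations and return how many were changed.
 * With convert_only set, holes are skipped, so a hole left behind by a
 * racing thread is never filled with a fresh allocation.
 */
static size_t wb_convert(enum wb_kind *ext, size_t n, bool convert_only)
{
    size_t changed = 0;

    for (size_t i = 0; i < n; i++) {
        if (ext[i] == WB_REAL)
            continue;           /* already converted, harmless */
        if (ext[i] == WB_HOLE && convert_only)
            continue;           /* the fix: leave holes alone */
        ext[i] = WB_REAL;
        changed++;
    }

    return changed;
}
```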
  9. 10 Jan, 2017 1 commit
  10. 05 Dec, 2016 4 commits
  11. 28 Nov, 2016 2 commits
    • xfs: track preallocation separately in xfs_bmapi_reserve_delalloc() · 974ae922
      Brian Foster committed
      Speculative preallocation is currently processed entirely by the callers
      of xfs_bmapi_reserve_delalloc(). The caller determines how much
      preallocation to include, adjusts the extent length and passes down the
      resulting request.
      
      While this works fine for post-eof speculative preallocation, it is not
      as reliable for COW fork preallocation. COW fork preallocation is
      implemented via the cowextszhint, which aligns the start offset as well
      as the length of the extent. Further, it is difficult for the caller to
      accurately identify when preallocation occurs because the returned
      extent could have been merged with neighboring extents in the fork.
      
      To simplify this situation and facilitate further COW fork preallocation
      enhancements, update xfs_bmapi_reserve_delalloc() to take a separate
      preallocation parameter to incorporate into the allocation request. The
      preallocation blocks value is tacked onto the end of the request and
      adjusted to accommodate neighboring extents and extent size limits.
      Since xfs_bmapi_reserve_delalloc() now knows precisely how much
      preallocation was included in the allocation, it can also tag the inodes
      appropriately to support preallocation reclaim.
      
      Note that xfs_bmapi_reserve_delalloc() callers are not yet updated to
      use the preallocation mechanism. This patch should not change behavior
      outside of correctly tagging reflink inodes when start offset
      preallocation occurs (which the caller does not handle correctly).
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
    • xfs: factor rmap btree size into the indlen calculations · fd26a880
      Darrick J. Wong committed
      When we're estimating the amount of space it's going to take to satisfy
      a delalloc reservation, we need to include the space that we might need
      to grow the rmapbt.  This helps us to avoid running out of space later
      when _iomap_write_allocate needs more space than we reserved.  Eryu Guan
      observed this happening on generic/224 when sunit/swidth were set.
      Reported-by: Eryu Guan <eguan@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
  12. 24 Nov, 2016 7 commits
  13. 08 Nov, 2016 2 commits
  14. 20 Oct, 2016 5 commits
  15. 06 Oct, 2016 2 commits
    • xfs: try other AGs to allocate a BMBT block · 90e2056d
      Darrick J. Wong committed
      Prior to the introduction of reflink, allocating a block and mapping
      it into a file was performed in a single transaction with a single
      block reservation, and the allocator was supposed to find enough
      blocks to allocate the extent and any BMBT blocks that might be
      necessary (unless we're low on space).
      
      However, due to the way copy on write works, allocation and mapping
      have been split into two transactions, which means that we must be
      able to handle the case where we allocate an extent for CoW but that
      AG runs out of free space before the blocks can be mapped into a file,
      and the mapping requires a new BMBT block.  When this happens, look in
      one of the other AGs for a BMBT block instead of taking the FS down.
      
      The same applies to the functions that convert a data fork to extents
      and later btree format.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: create a separate cow extent size hint for the allocator · f7ca3522
      Darrick J. Wong committed
      Create a per-inode extent size allocator hint for copy-on-write.  This
      hint is separate from the existing extent size hint so that CoW can
      take advantage of the fragmentation-reducing properties of extent size
      hints without disabling delalloc for regular writes.
      
      The extent size hint that's fed to the allocator during a copy on
      write operation is the greater of the cowextsize and regular extsize
      hint.
      
      During reflink, if we're sharing the entire source file to the entire
      destination file and the destination file doesn't already have a
      cowextsize hint, propagate the source file's cowextsize hint to the
      destination file.
      
      Furthermore, zero the bulkstat buffer prior to setting the fields
      so that we don't copy kernel memory contents into userspace.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
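
      The rule that the allocator is fed "the greater of the cowextsize and
      regular extsize hint" is a plain maximum. A minimal sketch with a
      hypothetical function name, taking both hints as block counts and
      ignoring any default the kernel may apply when neither hint is set:

```c
#include <stdint.h>

/* Extent size hint to use for a copy-on-write allocation. */
static uint32_t sketch_cow_extsize_hint(uint32_t extsize, uint32_t cowextsize)
{
    return cowextsize > extsize ? cowextsize : extsize;
}
```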