1. 04 Nov, 2019 (3 commits)
  2. 30 Oct, 2019 (2 commits)
  3. 28 Oct, 2019 (1 commit)
  4. 24 Oct, 2019 (1 commit)
    • xfs: don't set bmapi total block req where minleft is · da781e64
      Authored by Brian Foster
      xfs_bmapi_write() takes a total block requirement parameter that is
      passed down to the block allocation code and is used to specify the
      total block requirement of the associated transaction. This is used
      to try and select an AG that can not only satisfy the requested
      extent allocation, but can also accommodate subsequent allocations
      that might be required to complete the transaction. For example,
      additional bmbt block allocations may be required on insertion of
      the resulting extent to an inode data fork.
      
      While it's important for callers to calculate and reserve such extra
      blocks in the transaction, it is not necessary to pass the total
      value to xfs_bmapi_write() in all cases. The latter automatically
      sets minleft to ensure that sufficient free blocks remain after the
      allocation attempt to expand the format of the associated inode
      (i.e., such as extent to btree conversion, btree splits, etc).
      Therefore, any callers that pass a total block requirement of the
      bmap mapping length plus worst case bmbt expansion essentially
      specify the additional reservation requirement twice. These callers
      can pass a total of zero to rely on the bmapi minleft policy.
      
      Beyond being superfluous, the primary motivation for this change is
      that the total reservation logic in the bmbt code is dubious in
      scenarios where minlen < maxlen and a maxlen extent cannot be
      allocated (which is more common for data extent allocations where
      contiguity is not required). The total value is based on maxlen in
      the xfs_bmapi_write() caller. If the bmbt code falls back to an
      allocation between minlen and maxlen, that allocation will not
      succeed until total is reset to minlen, which essentially throws
      away any additional reservation included in total by the caller. In
      addition, the total value is not reset until after alignment is
      dropped, which means that such callers drop alignment far more
      aggressively than necessary.
      
      Update all callers of xfs_bmapi_write() that pass a total block
      value of the mapping length plus bmbt reservation to instead pass
      zero and rely on xfs_bmapi_minleft() to enforce the bmbt reservation
      requirement. This trades off slightly less conservative AG selection
      for the ability to preserve alignment in more scenarios.
      xfs_bmapi_write() callers that incorporate unrelated or additional
      reservations in total beyond what is already included in minleft
      must continue to use the former.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      da781e64
  5. 22 Oct, 2019 (11 commits)
  6. 21 Oct, 2019 (3 commits)
  7. 18 Oct, 2019 (1 commit)
    • iomap: iomap that extends beyond EOF should be marked dirty · 7684e2c4
      Authored by Dave Chinner
      When doing a direct IO that spans the current EOF, and there are
      written blocks beyond EOF that extend beyond the current write, the
      only metadata update that needs to be done is a file size extension.
      
      However, we don't mark such iomaps as IOMAP_F_DIRTY to indicate that
      there are IO completion metadata updates required, and hence we may
      fail to correctly sync file size extensions made in IO completion
      when O_DSYNC writes are being used and the hardware supports FUA.
      
      Hence when setting IOMAP_F_DIRTY, we need to also take into account
      whether the iomap spans the current EOF. If it does, then we need to
      mark it dirty so that IO completion will call generic_write_sync()
      to flush the inode size update to stable storage correctly.
      
      Fixes: 3460cac1 ("iomap: Use FUA for pure data O_DSYNC DIO writes")
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      [darrick: removed the ext4 part; they'll handle it separately]
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      7684e2c4
  8. 03 Sep, 2019 (1 commit)
  9. 01 Jul, 2019 (1 commit)
  10. 29 Jun, 2019 (1 commit)
  11. 26 Feb, 2019 (2 commits)
  12. 21 Feb, 2019 (6 commits)
  13. 18 Feb, 2019 (4 commits)
  14. 12 Feb, 2019 (2 commits)
    • xfs: use the latest extent at writeback delalloc conversion time · c2b31643
      Authored by Brian Foster
      The writeback delalloc conversion code is racy with respect to
      changes in the currently cached file mapping outside of the current
      page. This is because the ilock is cycled between the time the
      caller originally looked up the mapping and across each real
      allocation of the provided file range. This code has collected
      various hacks over the years to help combat the symptoms of these
      races (i.e., truncate race detection, allocation into hole
      detection, etc.), but none address the fundamental problem that the
      imap may not be valid at allocation time.
      
      Rather than continue to use race detection hacks, update writeback
      delalloc conversion to a model that explicitly converts the delalloc
      extent backing the current file offset being processed. The current
      file offset is the only block we can trust to remain once the ilock
      is dropped because any operation that can remove the block
      (truncate, hole punch, etc.) must flush and discard pagecache pages
      first.
      
      Modify xfs_iomap_write_allocate() to use the xfs_bmapi_delalloc()
      mechanism to request allocation of the entire delalloc extent
      backing the current offset instead of assuming the extent passed by
      the caller is unchanged. Record the range specified by the caller
      and apply it to the resulting allocated extent so previous checks by
      the caller for COW fork overlap are not lost. Finally, overload the
      bmapi delalloc flag with the range reval flag behavior since this is
      the only use case for both.
      
      This ensures that writeback always picks up the correct
      and current extent associated with the page, regardless of races
      with other extent modifying operations. If operating on a data fork
      and the COW overlap state has changed since the ilock was cycled,
      the caller revalidates against the COW fork sequence number before
      using the imap for the next block.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      c2b31643
    • xfs: validate writeback mapping using data fork seq counter · d9252d52
      Authored by Brian Foster
      The writeback code caches the current extent mapping across multiple
      xfs_do_writepage() calls to avoid repeated lookups for sequential
      pages backed by the same extent. This is known to be slightly racy
      with extent fork changes in certain difficult to reproduce
      scenarios. The cached extent is trimmed to within EOF to help avoid
      the most common vector for this problem via speculative
      preallocation management, but this is a band-aid that does not
      address the fundamental problem.
      
      Now that we have an xfs_ifork sequence counter mechanism used to
      facilitate COW writeback, we can use the same mechanism to validate
      consistency between the data fork and cached writeback mappings. On
      its face, this is somewhat of a big hammer approach because any
      change to the data fork invalidates any mapping currently cached by
      a writeback in progress regardless of whether the data fork change
      overlaps with the range under writeback. In practice, however, the
      impact of this approach is minimal in most cases.
      
      First, data fork changes (delayed allocations) caused by sustained
      sequential buffered writes are amortized across speculative
      preallocations. This means that a cached mapping won't be
      invalidated by each buffered write of a common file copy workload,
      but rather only on less frequent allocation events. Second, the
      extent tree is always entirely in-core so an additional lookup of a
      usable extent mostly costs a shared ilock cycle and in-memory tree
      lookup. This means that a cached mapping reval is relatively cheap
      compared to the I/O itself. Third, spurious invalidations don't
      impact ioend construction. This means that even if the same extent
      is revalidated multiple times across multiple writepage instances,
      we still construct and submit the same size ioend (and bio) if the
      blocks are physically contiguous.
      
      Update struct xfs_writepage_ctx with a new field to hold the
      sequence number of the data fork associated with the currently
      cached mapping. Check the wpc seqno against the data fork when the
      mapping is validated and reestablish the mapping whenever the fork
      has changed since the mapping was cached. This ensures that
      writeback always uses a valid extent mapping and thus prevents lost
      writebacks and stale delalloc block problems.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      d9252d52
  15. 18 Oct, 2018 (1 commit)