1. 10 5月, 2018 3 次提交
    • B
      xfs: add bmapi nodiscard flag · fcb762f5
      Brian Foster 提交于
      Freed extents are unconditionally discarded when online discard is
      enabled. Define XFS_BMAPI_NODISCARD to allow callers to bypass
      discards when unnecessary. For example, this will be useful for
      eofblocks trimming.
      
      This patch does not change behavior.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      fcb762f5
    • B
      xfs: defer agfl block frees when dfops is available · f8f2835a
      Brian Foster 提交于
      The AGFL fixup code executes before every block allocation/free and
      rectifies the AGFL based on the current, dynamic allocation
      requirements of the fs. The AGFL must hold a minimum number of
      blocks to satisfy a worst case split of the free space btrees caused
      by the impending allocation operation. The AGFL is also updated to
      maintain the implicit requirement for a minimum number of free slots
      to satisfy a worst case join of the free space btrees.
      
      Since the AGFL caches individual blocks, AGFL reduction typically
      involves multiple, single block frees. We've had reports of
      transaction overrun problems during certain workloads that boil down
      to AGFL reduction freeing multiple blocks and consuming more space
      in the log than was reserved for the transaction.
      
      Since the objective of freeing AGFL blocks is to ensure free AGFL
      free slots are available for the upcoming allocation, one way to
      address this problem is to release surplus blocks from the AGFL
      immediately but defer the free of those blocks (similar to how
      file-mapped blocks are unmapped from the file in one transaction and
      freed via a deferred operation) until the transaction is rolled.
      This turns AGFL reduction into an operation with predictable log
      reservation consumption.
      
      Add the capability to defer AGFL block frees when a deferred ops
      list is available to the AGFL fixup code. Add a dfops pointer to the
      transaction to carry dfops through various contexts to the allocator
      context. Deferring AGFL frees is  conditional behavior based on
      whether the transaction pointer is populated. The long term
      objective is to reuse the transaction pointer to clean up all
      unrelated callchains that pass dfops on the stack along with a
      transaction and in doing so, consistently defer AGFL blocks from the
      allocator.
      
      A bit of customization is required to handle deferred completion
      processing because AGFL blocks are accounted against a per-ag
      reservation pool and AGFL blocks are not inserted into the extent
      busy list when freed (they are inserted when used and released back
      to the AGFL). Reuse the majority of the existing deferred extent
      free infrastructure and customize it appropriately to handle AGFL
      blocks.
      
      Note that this patch only adds infrastructure. It does not change
      behavior because no callers have been updated to pass ->t_agfl_dfops
      into the allocation code.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      f8f2835a
    • B
      xfs: create agfl block free helper function · 4223f659
      Brian Foster 提交于
      Refactor the AGFL block free code into a new helper such that it can
      be invoked from deferred context. No functional changes.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      4223f659
  2. 10 4月, 2018 1 次提交
  3. 24 3月, 2018 1 次提交
    • B
      xfs: detect agfl count corruption and reset agfl · a27ba260
      Brian Foster 提交于
      The struct xfs_agfl v5 header was originally introduced with
      unexpected padding that caused the AGFL to operate with one less
      slot than intended. The header has since been packed, but the fix
      left an incompatibility for users who upgrade from an old kernel
      with the unpacked header to a newer kernel with the packed header
      while the AGFL happens to wrap around the end. The newer kernel
      recognizes one extra slot at the physical end of the AGFL that the
      previous kernel did not. The new kernel will eventually attempt to
      allocate a block from that slot, which contains invalid data, and
      cause a crash.
      
      This condition can be detected by comparing the active range of the
      AGFL to the count. While this detects a padding mismatch, it can
      also trigger false positives for unrelated flcount corruption. Since
      we cannot distinguish a size mismatch due to padding from unrelated
      corruption, we can't trust the AGFL enough to simply repopulate the
      empty slot.
      
      Instead, avoid unnecessarily complex detection logic and and use a
      solution that can handle any form of flcount corruption that slips
      through read verifiers: distrust the entire AGFL and reset it to an
      empty state. Any valid blocks within the AGFL are intentionally
      leaked. This requires xfs_repair to rectify (which was already
      necessary based on the state the AGFL was found in). The reset
      mitigates the side effect of the padding mismatch problem from a
      filesystem crash to a free space accounting inconsistency. The
      generic approach also means that this patch can be safely backported
      to kernels with or without a packed struct xfs_agfl.
      
      Check the AGF for an invalid freelist count on initial read from
      disk. If detected, set a flag on the xfs_perag to indicate that a
      reset is required before the AGFL can be used. In the first
      transaction that attempts to use a flagged AGFL, reset it to empty,
      warn the user about the inconsistency and allow the freelist fixup
      code to repopulate the AGFL with new blocks. The xfs_perag flag is
      cleared to eliminate the need for repeated checks on each block
      allocation operation.
      
      This allows kernels that include the packing fix commit 96f859d5
      ("libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct")
      to handle older unpacked AGFL formats without a filesystem crash.
      Suggested-by: NDave Chinner <david@fromorbit.com>
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by Dave Chiluk <chiluk+linuxxfs@indeed.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      a27ba260
  4. 12 3月, 2018 3 次提交
    • B
      xfs: account only rmapbt-used blocks against rmapbt perag res · 0ab32086
      Brian Foster 提交于
      The rmapbt perag metadata reservation reserves blocks for the
      reverse mapping btree (rmapbt). Since the rmapbt uses blocks from
      the agfl and perag accounting is updated as blocks are allocated
      from the allocation btrees, the reservation actually accounts blocks
      as they are allocated to (or freed from) the agfl rather than the
      rmapbt itself.
      
      While this works for blocks that are eventually used for the rmapbt,
      not all agfl blocks are destined for the rmapbt. Blocks that are
      allocated to the agfl (and thus "reserved" for the rmapbt) but then
      used by another structure leads to a growing inconsistency over time
      between the runtime tracking of rmapbt usage vs. actual rmapbt
      usage. Since the runtime tracking thinks all agfl blocks are rmapbt
      blocks, it essentially believes that less future reservation is
      required to satisfy the rmapbt than what is actually necessary.
      
      The inconsistency is rectified across mount cycles because the perag
      reservation is initialized based on the actual rmapbt usage at mount
      time. The problem, however, is that the excessive drain of the
      reservation at runtime opens a window to allocate blocks for other
      purposes that might be required for the rmapbt on a subsequent
      mount. This problem can be demonstrated by a simple test that runs
      an allocation workload to consume agfl blocks over time and then
      observe the difference in the agfl reservation requirement across an
      unmount/mount cycle:
      
        mount ...: xfs_ag_resv_init: ... resv 3193 ask 3194 len 3194
        ...
        ...      : xfs_ag_resv_alloc_extent: ... resv 2957 ask 3194 len 1
        umount...: xfs_ag_resv_free: ... resv 2956 ask 3194 len 0
        mount ...: xfs_ag_resv_init: ... resv 3052 ask 3194 len 3194
      
      As the above tracepoints show, the reservation requirement reduces
      from 3194 blocks to 2956 blocks as the workload runs.  Without any
      other changes in the filesystem, the same reservation requirement
      jumps from 2956 to 3052 blocks over a umount/mount cycle.
      
      To address this divergence, update the RMAPBT reservation to account
      blocks used for the rmapbt only rather than all blocks filled into
      the agfl. This patch makes several high-level changes toward that
      end:
      
      1.) Reintroduce an AGFL reservation type to serve as an accounting
          no-op for blocks allocated to (or freed from) the AGFL.
      2.) Invoke RMAPBT usage accounting from the actual rmapbt block
          allocation path rather than the AGFL allocation path.
      
      The first change is required because agfl blocks are considered free
      blocks throughout their lifetime. The perag reservation subsystem is
      invoked unconditionally by the allocation subsystem, so we need a
      way to tell the perag subsystem (via the allocation subsystem) to
      not make any accounting changes for blocks filled into the AGFL.
      
      The second change causes the in-core RMAPBT reservation usage
      accounting to remain consistent with the on-disk state at all times
      and eliminates the risk of leaving the rmapbt reservation
      underfilled.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      0ab32086
    • B
      xfs: rename agfl perag res type to rmapbt · 21592863
      Brian Foster 提交于
      The AGFL perag reservation type accounts all allocations that feed
      into (or are released from) the allocation group free list (agfl).
      The purpose of the reservation is to support worst case conditions
      for the reverse mapping btree (rmapbt). As such, the agfl
      reservation usage accounting only considers rmapbt usage when the
      in-core counters are initialized at mount time.
      
      This implementation inconsistency leads to divergence of the in-core
      and on-disk usage accounting over time. In preparation to resolve
      this inconsistency and adjust the AGFL reservation into an rmapbt
      specific reservation, rename the AGFL reservation type and
      associated accounting fields to something more rmapbt-specific. Also
      fix up a couple tracepoints that incorrectly use the AGFL
      reservation type to pass the agfl state of the associated extent
      where the raw reservation type is expected.
      
      Note that this patch does not change perag reservation behavior.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      21592863
    • D
      xfs: convert XFS_AGFL_SIZE to a helper function · a78ee256
      Dave Chinner 提交于
      The AGFL size calculation is about to get more complex, so lets turn
      the macro into a function first and remove the macro.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      [darrick: forward port to newer kernel, simplify the helper]
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      a78ee256
  5. 29 1月, 2018 1 次提交
    • C
      Split buffer's b_fspriv field · fb1755a6
      Carlos Maiolino 提交于
      By splitting the b_fspriv field into two different fields (b_log_item
      and b_li_list). It's possible to get rid of an old ABI workaround, by
      using the new b_log_item field to store xfs_buf_log_item separated from
      the log items attached to the buffer, which will be linked in the new
      b_li_list field.
      
      This way, there is no more need to reorder the log items list to place
      the buf_log_item at the beginning of the list, simplifying a bit the
      logic to handle buffer IO.
      
      This also opens the possibility to change buffer's log items list into a
      proper list_head.
      
      b_log_item field is still defined as a void *, because it is still used
      by the log buffers to store xlog_in_core structures, and there is no
      need to add an extra field on xfs_buf just for xlog_in_core.
      Signed-off-by: NCarlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: NBill O'Donnell <billodo@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      [darrick: minor style changes]
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      fb1755a6
  6. 18 1月, 2018 1 次提交
  7. 09 1月, 2018 4 次提交
  8. 22 12月, 2017 1 次提交
    • D
      xfs: always honor OWN_UNKNOWN rmap removal requests · 33df3a9c
      Darrick J. Wong 提交于
      Calling xfs_rmap_free with an unknown owner is supposed to remove any
      rmaps covering that range regardless of owner.  This is used by the EFI
      recovery code to say "we're freeing this, it mustn't be owned by
      anything anymore", but for whatever reason xfs_free_ag_extent filters
      them out.
      
      Therefore, remove the filter and make xfs_rmap_unmap actually treat it
      as a wildcard owner -- free anything that's already there, and if
      there's no owner at all then that's fine too.
      
      There are two existing callers of bmap_add_free that take care the rmap
      deferred ops themselves and use OWN_UNKNOWN to skip the EFI-based rmap
      cleanup; convert these to use OWN_NULL (via helpers), and now we really
      require that an RUI (if any) gets added to the defer ops before any EFI.
      
      Lastly, now that xfs_free_extent filters out OWN_NULL rmap free requests,
      growfs will have to consult directly with the rmap to ensure that there
      aren't any rmaps in the grown region.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      33df3a9c
  9. 02 11月, 2017 1 次提交
  10. 27 10月, 2017 1 次提交
  11. 12 10月, 2017 1 次提交
  12. 28 6月, 2017 1 次提交
  13. 20 6月, 2017 1 次提交
  14. 04 4月, 2017 2 次提交
  15. 18 2月, 2017 1 次提交
  16. 10 2月, 2017 1 次提交
  17. 10 1月, 2017 4 次提交
    • C
      xfs: don't rely on ->total in xfs_alloc_space_available · 12ef8301
      Christoph Hellwig 提交于
      ->total is a bit of an odd parameter passed down to the low-level
      allocator all the way from the high-level callers.  It's supposed to
      contain the maximum number of blocks to be allocated for the whole
      transaction [1].
      
      But in xfs_iomap_write_allocate we only convert existing delayed
      allocations and thus only have a minimal block reservation for the
      current transaction, so xfs_alloc_space_available can't use it for
      the allocation decisions.  Use the maximum of args->total and the
      calculated block requirement to make a decision.  We probably should
      get rid of args->total eventually and instead apply ->minleft more
      broadly, but that will require some extensive changes all over.
      
      [1] which creates lots of confusion as most callers don't decrement it
      once doing a first allocation.  But that's for a separate series.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      12ef8301
    • C
      xfs: adjust allocation length in xfs_alloc_space_available · 54fee133
      Christoph Hellwig 提交于
      We must decide in xfs_alloc_fix_freelist if we can perform an
      allocation from a given AG is possible or not based on the available
      space, and should not fail the allocation past that point on a
      healthy file system.
      
      But currently we have two additional places that second-guess
      xfs_alloc_fix_freelist: xfs_alloc_ag_vextent tries to adjust the
      maxlen parameter to remove the reservation before doing the
      allocation (but ignores the various minium freespace requirements),
      and xfs_alloc_fix_minleft tries to fix up the allocated length
      after we've found an extent, but ignores the reservations and also
      doesn't take the AGFL into account (and thus fails allocations
      for not matching minlen in some cases).
      
      Remove all these later fixups and just correct the maxlen argument
      inside xfs_alloc_fix_freelist once we have the AGF buffer locked.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      54fee133
    • C
      xfs: fix bogus minleft manipulations · 255c5162
      Christoph Hellwig 提交于
      We can't just set minleft to 0 when we're low on space - that's exactly
      what we need minleft for: to protect space in the AG for btree block
      allocations when we are low on free space.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      255c5162
    • C
      xfs: bump up reserved blocks in xfs_alloc_set_aside · 5149fd32
      Christoph Hellwig 提交于
      Setting aside 4 blocks globally for bmbt splits isn't all that useful,
      as different threads can allocate space in parallel.  Bump it to 4
      blocks per AG to allow each thread that is currently doing an
      allocation to dip into it separately.  Without that we may no have
      enough reserved blocks if there are enough parallel transactions
      in an almost out space file system that all run into bmap btree
      splits.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      5149fd32
  18. 05 12月, 2016 1 次提交
  19. 04 10月, 2016 4 次提交
  20. 26 9月, 2016 1 次提交
    • D
      xfs: remote attribute blocks aren't really userdata · 292378ed
      Dave Chinner 提交于
      When adding a new remote attribute, we write the attribute to the
      new extent before the allocation transaction is committed. This
      means we cannot reuse busy extents as that violates crash
      consistency semantics. Hence we currently treat remote attribute
      extent allocation like userdata because it has the same overwrite
      ordering constraints as userdata.
      
      Unfortunately, this also allows the allocator to incorrectly apply
      extent size hints to the remote attribute extent allocation. This
      results in interesting failures, such as transaction block
      reservation overruns and in-memory inode attribute fork corruption.
      
      To fix this, we need to separate the busy extent reuse configuration
      from the userdata configuration. This changes the definition of
      XFS_BMAPI_METADATA slightly - it now means that allocation is
      metadata and reuse of busy extents is acceptible due to the metadata
      ordering semantics of the journal. If this flag is not set, it
      means the allocation is that has unordered data writeback, and hence
      busy extent reuse is not allowed. It no longer implies the
      allocation is for user data, just that the data write will not be
      strictly ordered. This matches the semantics for both user data
      and remote attribute block allocation.
      
      As such, This patch changes the "userdata" field to a "datatype"
      field, and adds a "no busy reuse" flag to the field.
      When we detect an unordered data extent allocation, we immediately set
      the no reuse flag. We then set the "user data" flags based on the
      inode fork we are allocating the extent to. Hence we only set
      userdata flags on data fork allocations now and consider attribute
      fork remote extents to be an unordered metadata extent.
      
      The result is that remote attribute extents now have the expected
      allocation semantics, and the data fork allocation behaviour is
      completely unchanged.
      
      It should be noted that there may be other ways to fix this (e.g.
      use ordered metadata buffers for the remote attribute extent data
      write) but they are more invasive and difficult to validate both
      from a design and implementation POV. Hence this patch takes the
      simple, obvious route to fixing the problem...
      Reported-and-tested-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      292378ed
  21. 19 9月, 2016 1 次提交
    • D
      xfs: set up per-AG free space reservations · 3fd129b6
      Darrick J. Wong 提交于
      One unfortunate quirk of the reference count and reverse mapping
      btrees -- they can expand in size when blocks are written to *other*
      allocation groups if, say, one large extent becomes a lot of tiny
      extents.  Since we don't want to start throwing errors in the middle
      of CoWing, we need to reserve some blocks to handle future expansion.
      The transaction block reservation counters aren't sufficient here
      because we have to have a reserve of blocks in every AG, not just
      somewhere in the filesystem.
      
      Therefore, create two per-AG block reservation pools.  One feeds the
      AGFL so that rmapbt expansion always succeeds, and the other feeds all
      other metadata so that refcountbt expansion never fails.
      
      Use the count of how many reserved blocks we need to have on hand to
      create a virtual reservation in the AG.  Through selective clamping of
      the maximum length of allocation requests and of the length of the
      longest free extent, we can make it look like there's less free space
      in the AG unless the reservation owner is asking for blocks.
      
      In other words, play some accounting tricks in-core to make sure that
      we always have blocks available.  On the plus side, there's nothing to
      clean up if we crash, which is contrast to the strategy that the rough
      draft used (actually removing extents from the freespace btrees).
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      3fd129b6
  22. 26 8月, 2016 1 次提交
  23. 17 8月, 2016 2 次提交
  24. 03 8月, 2016 2 次提交