1. 04 Nov, 2019 3 commits
  2. 30 Oct, 2019 4 commits
  3. 24 Oct, 2019 2 commits
    • xfs: don't set bmapi total block req where minleft is · da781e64
      Brian Foster committed
      xfs_bmapi_write() takes a total block requirement parameter that is
      passed down to the block allocation code and is used to specify the
      total block requirement of the associated transaction. This is used
      to try and select an AG that can not only satisfy the requested
      extent allocation, but can also accommodate subsequent allocations
      that might be required to complete the transaction. For example,
      additional bmbt block allocations may be required on insertion of
      the resulting extent to an inode data fork.
      
      While it's important for callers to calculate and reserve such extra
      blocks in the transaction, it is not necessary to pass the total
      value to xfs_bmapi_write() in all cases. The latter automatically
      sets minleft to ensure that sufficient free blocks remain after the
      allocation attempt to expand the format of the associated inode
      (e.g., extent to btree conversion, btree splits, etc.).
      Therefore, any callers that pass a total block requirement of the
      bmap mapping length plus worst case bmbt expansion essentially
      specify the additional reservation requirement twice. These callers
      can pass a total of zero to rely on the bmapi minleft policy.
      
      Beyond being superfluous, the primary motivation for this change is
      that the total reservation logic in the bmbt code is dubious in
      scenarios where minlen < maxlen and a maxlen extent cannot be
      allocated (which is more common for data extent allocations where
      contiguity is not required). The total value is based on maxlen in
      the xfs_bmapi_write() caller. If the bmbt code falls back to an
      allocation between minlen and maxlen, that allocation will not
      succeed until total is reset to minlen, which essentially throws
      away any additional reservation included in total by the caller. In
      addition, the total value is not reset until after alignment is
      dropped, which means that such callers drop alignment far more
      aggressively than necessary.
      
      Update all callers of xfs_bmapi_write() that pass a total block
      value of the mapping length plus bmbt reservation to instead pass
      zero and rely on xfs_bmapi_minleft() to enforce the bmbt reservation
      requirement. This trades off slightly less conservative AG selection
      for the ability to preserve alignment in more scenarios.
      xfs_bmapi_write() callers that incorporate unrelated or additional
      reservations in total beyond what is already included in minleft
      must continue to use the former.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: cap longest free extent to maximum allocatable · 1c743574
      Dave Chinner committed
      Cap longest extent to the largest we can allocate based on limits
      calculated at mount time. Dynamic state (such as finobt blocks)
      can result in the longest free extent exceeding the size we can
      allocate, and that results in failure to align full AG allocations
      when the AG is empty.
      
      Result:
      
      xfs_io-4413  [003]   426.412459: xfs_alloc_vextent_loopfailed: dev 8:96 agno 0 agbno 32 minlen 243968 maxlen 244000 mod 0 prod 1 minleft 1 total 262148 alignment 32 minalignslop 0 len 0 type NEAR_BNO otype START_BNO wasdel 0 wasfromfl 0 resv 0 datatype 0x5 firstblock 0xffffffffffffffff
      
      minlen and maxlen are now separated by the alignment size, and
      allocation fails because args.total > free space in the AG.
      
      [bfoster: Added xfs_bmap_btalloc() changes.]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
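The way `total` and `minleft` interact with the free space in an AG can be illustrated with a small C sketch. This is a simplified, hypothetical model and not the kernel's space check: the function name `ag_space_ok` and its exact rule are assumptions made for illustration. It assumes an AG qualifies only if, after `minleft` blocks are set aside, the remaining free blocks cover the larger of `total` and the aligned minimum length.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical, simplified per-AG space check (not kernel code).
 * Assumes alignment >= 1.
 */
static bool ag_space_ok(uint32_t free_blocks, uint32_t minlen,
                        uint32_t alignment, uint32_t minleft,
                        uint32_t total)
{
    /* Minimum length plus worst-case alignment slop. */
    uint32_t alloc_len = minlen + (alignment - 1);
    /* The request must cover whichever requirement is larger. */
    uint32_t need = total > alloc_len ? total : alloc_len;

    if (free_blocks < minleft)
        return false;
    return free_blocks - minleft >= need;
}
```

With the minlen, alignment, minleft and total values from the trace above and an invented AG holding 250000 free blocks, `ag_space_ok(250000, 243968, 32, 1, 262148)` is false in this model, while the same request with a total of zero passes because only the aligned minimum length and `minleft` have to fit.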
  4. 22 Oct, 2019 12 commits
    • xfs: fix inode fork extent count overflow · 3f8a4f1d
      Dave Chinner committed
      [commit message is verbose for discussion purposes - will trim it
      down later. Some questions about implementation details at the end.]
      
      Zorro Lang recently ran a new test to stress single inode extent
      counts now that they are no longer limited by memory allocation.
      The test was simply:
      
      # xfs_io -f -c "falloc 0 40t" /mnt/scratch/big-file
      # ~/src/xfstests-dev/punch-alternating /mnt/scratch/big-file
      
      This test uncovered a problem where the hole punching operation
      appeared to finish with no error, but apparently only created 268M
      extents instead of the 10 billion it was supposed to.
      
      Further, trying to punch out extents that should have been present
      resulted in success, but no change in the extent count. It looked
      like a silent failure.
      
      While running the test and observing the behaviour in real time,
      I observed the extent count growing at ~2M extents/minute, and saw
      this after about an hour:
      
      # xfs_io -f -c "stat" /mnt/scratch/big-file |grep next ; \
      > sleep 60 ; \
      > xfs_io -f -c "stat" /mnt/scratch/big-file |grep next
      fsxattr.nextents = 127657993
      fsxattr.nextents = 129683339
      #
      
      And a few minutes later this:
      
      # xfs_io -f -c "stat" /mnt/scratch/big-file |grep next
      fsxattr.nextents = 4177861124
      #
      
      Ah, what? Where did those 4 billion extra extents suddenly come from?
      
      Stop the workload, unmount, mount:
      
      # xfs_io -f -c "stat" /mnt/scratch/big-file |grep next
      fsxattr.nextents = 166044375
      #
      
      And it's back at the expected number. i.e. the extent count is
      correct on disk, but it's screwed up in memory. I loaded up the
      extent list, and immediately:
      
      # xfs_io -f -c "stat" /mnt/scratch/big-file |grep next
      fsxattr.nextents = 4192576215
      #
      
      It's bad again. So, where does that number come from?
      xfs_fill_fsxattr():
      
                      if (ip->i_df.if_flags & XFS_IFEXTENTS)
                              fa->fsx_nextents = xfs_iext_count(&ip->i_df);
                      else
                              fa->fsx_nextents = ip->i_d.di_nextents;
      
      And that's the behaviour I just saw in a nutshell. The on disk count
      is correct, but once the tree is loaded into memory, it goes whacky.
      Clearly there's something wrong with xfs_iext_count():
      
      inline xfs_extnum_t xfs_iext_count(struct xfs_ifork *ifp)
      {
              return ifp->if_bytes / sizeof(struct xfs_iext_rec);
      }
      
      Simple enough, but 134M extents is 2**27, and that's right about
      where things went wrong. A struct xfs_iext_rec is 16 bytes in size,
      which means 2**27 * 2**4 = 2**31 and we're right on target for an
      integer overflow. And, sure enough:
      
      struct xfs_ifork {
              int                     if_bytes;       /* bytes in if_u1 */
      ....
      
      Once we get 2**27 extents in a file, we overflow if_bytes and the
      in-core extent count goes wrong. And when we reach 2**28 extents,
      if_bytes wraps back to zero and things really start to go wrong
      there. This is where the silent failure comes from - only the first
      2**28 extents can be looked up directly due to the overflow, all the
      extents above this index wrap back to somewhere in the first 2**28
      extents. Hence with a regular pattern, trying to punch a hole in the
      range that didn't have holes mapped to a hole in the first 2**28
      extents and so "succeeded" without changing anything. Hence "silent
      failure"...
      
      Fix this by converting if_bytes to an int64_t and converting all the
      index variables and size calculations to use int64_t types to avoid
      overflows in future. Signed integers are still used to enable easy
      detection of extent count underflows. This enables scalability of
      extent counts to the limits of the on-disk format - MAXEXTNUM
      (2**31) extents.
      
      Current testing is at over 500M extents and still going:
      
      fsxattr.nextents = 517310478
      Reported-by: Zorro Lang <zlang@redhat.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
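The arithmetic behind the overflow can be checked with a short C sketch. The helper names and the `IEXT_REC_SIZE` constant are hypothetical and are not kernel code. The sketch assumes only what the commit message states (a 16-byte record and a 32-bit signed byte count), plus the assumption that the reported extent count passes through an unsigned 32-bit field. The wrap is modelled explicitly so the example does not depend on signed overflow.

```c
#include <stdint.h>

/* Assumed from the commit message: one in-core extent record is 16 bytes. */
#define IEXT_REC_SIZE 16

/* Reinterpret the low 32 bits of v as a signed two's-complement value. */
static int64_t wrap_to_int32(uint64_t v)
{
    uint32_t low = (uint32_t)v;

    return low < 0x80000000u ? (int64_t)low : (int64_t)low - 4294967296LL;
}

/* Hypothetical model of the old behaviour: byte count held in a 32-bit int. */
static int64_t count_from_bytes32(int64_t nextents)
{
    int64_t if_bytes = wrap_to_int32((uint64_t)nextents * IEXT_REC_SIZE);

    return if_bytes / IEXT_REC_SIZE;
}

/* Same calculation with a 64-bit byte count, as the fix describes. */
static int64_t count_from_bytes64(int64_t nextents)
{
    int64_t if_bytes = nextents * IEXT_REC_SIZE;

    return if_bytes / IEXT_REC_SIZE;
}

/* Assumption: the count is reported through an unsigned 32-bit field. */
static uint32_t reported_nextents(int64_t nextents)
{
    return (uint32_t)count_from_bytes32(nextents);
}
```

Under these assumptions, `reported_nextents(166044375)` evaluates to 4192576215, which matches the two values quoted in the commit message: the count read after remount and the count seen once the extent list was loaded. At 2**28 extents the modelled 32-bit byte count is exactly zero, and `count_from_bytes64()` returns the input unchanged.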
    • xfs: optimize near mode bnobt scans with concurrent cntbt lookups · dc8e69bd
      Brian Foster committed
      The near mode fallback algorithm consists of a left/right scan of
      the bnobt. This algorithm has very poor breakdown characteristics
      under worst case free space fragmentation conditions. If a suitable
      extent is far enough from the locality hint, each allocation may
      scan most or all of the bnobt before it completes. This causes
      pathological behavior and extremely high allocation latencies.
      
      While locality is important to near mode allocations, it is not so
      important as to incur pathological allocation latency to provide the
      absolute best available locality for every allocation. If the
      allocation is large enough or far enough away, there is a point of
      diminishing returns. As such, we can bound the overall operation by
      including an iterative cntbt lookup in the broader search. The cntbt
      lookup is optimized to immediately find the extent with best
      locality for the given size on each iteration. Since the cntbt is
      indexed by extent size, the lookup repeats with a variably
      aggressive increasing search key size until it runs off the edge of
      the tree.
      
      This approach provides a natural balance between the two algorithms
      for various situations. For example, the bnobt scan is able to
      satisfy smaller allocations such as for inode chunks or btree blocks
      more quickly where the cntbt search may have to search through a
      large set of extent sizes when the search key starts off small
      relative to the largest extent in the tree. On the other hand, the
      cntbt search more deterministically covers the set of suitable
      extents for larger data extent allocation requests that the bnobt
      scan may have to search the entire tree to locate.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
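The idea of repeating a by-size lookup with a growing key until it runs off the edge of the tree can be sketched in C as follows. This is a hypothetical model and not the kernel algorithm. A sorted array stands in for the cntbt, the names `free_extent`, `lookup_ge` and `near_lookup` are invented, and doubling the key is an assumed growth policy. The sketch also looks at only one record per key, whereas the commit message describes a lookup that finds the extent with best locality for a given size.

```c
#include <stddef.h>
#include <stdint.h>

struct free_extent {
    uint32_t start;   /* first block of the free extent */
    uint32_t len;     /* length in blocks */
};

/* Index of the first record with len >= key in an array sorted by len, or n. */
static size_t lookup_ge(const struct free_extent *recs, size_t n, uint64_t key)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (recs[mid].len < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

static uint32_t distance(uint32_t a, uint32_t b)
{
    return a > b ? a - b : b - a;
}

/*
 * Repeat the by-size lookup with a growing key until it runs off the end.
 * Returns 1 and fills *best with the closest candidate seen, or 0 if no
 * record is at least minlen blocks long.
 */
static int near_lookup(const struct free_extent *recs, size_t n,
                       uint32_t hint, uint32_t minlen,
                       struct free_extent *best)
{
    uint64_t key = minlen ? minlen : 1;   /* a zero key would never grow */
    uint32_t best_diff = UINT32_MAX;
    int found = 0;

    for (;;) {
        size_t i = lookup_ge(recs, n, key);

        if (i == n)
            break;                        /* ran off the edge */
        if (!found || distance(recs[i].start, hint) < best_diff) {
            best_diff = distance(recs[i].start, hint);
            *best = recs[i];
            found = 1;
        }
        key = (uint64_t)recs[i].len * 2;  /* assumed growth policy */
    }
    return found;
}
```

Because the key at least doubles on every iteration, the number of lookups in this model is bounded by the logarithm of the largest extent length rather than by the number of records, which is the kind of bound the commit message contrasts with a full left/right bnobt scan.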
    • xfs: factor out tree fixup logic into helper · d2968825
      Brian Foster committed
      Lift the btree fixup path into a helper function.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: refactor near mode alloc bnobt scan into separate function · 0e26d5ca
      Brian Foster committed
      In preparation for enhancing the near mode allocation bnobt scan algorithm,
      lift it into a separate function. No functional changes.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: refactor and reuse best extent scanning logic · 78d7aabd
      Brian Foster committed
      The bnobt "find best" helper implements a simple btree walker
      function. This general pattern, or a subset thereof, is reused in
      various parts of a near mode allocation operation. For example, the
      bnobt left/right scans are each iterative btree walks along with the
      cntbt lastblock scan.
      
      Rework this function into a generic btree walker, add a couple of
      parameters to control termination behavior from various contexts and
      reuse it where applicable.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: refactor allocation tree fixup code · 4a65b7c2
      Brian Foster committed
      Both algorithms duplicate the same btree allocation code. Eliminate
      the duplication and reuse the fallback algorithm codepath.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: reuse best extent tracking logic for bnobt scan · fec0afda
      Brian Foster committed
      The near mode bnobt scan searches left and right in the bnobt
      looking for the closest free extent to the allocation hint that
      satisfies minlen. Once such an extent is found, the left/right
      search terminates, we search one more time in the opposite direction
      and finish the allocation with the best overall extent.
      
      The left/right and find best searches are currently controlled via a
      combination of cursor state and local variables. Clean up this code
      and prepare for further improvements to the near mode fallback
      algorithm by reusing the allocation cursor best extent tracking
      mechanism. Update the tracking logic to deactivate bnobt cursors
      when out of allocation range and replace open-coded extent checks with
      calls to the common helper. In doing so, rename some misnamed local
      variables in the top-level near mode allocation function.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: refactor cntbt lastblock scan best extent logic into helper · 396bbf3c
      Brian Foster committed
      The cntbt lastblock scan checks the size, alignment, locality, etc.
      of each free extent in the block and compares it with the current
      best candidate. This logic will be reused by the upcoming optimized
      cntbt algorithm, so refactor it into a separate helper. Note that
      acur->diff is now initialized to -1 (unsigned) instead of 0 to
      support the more granular comparison logic in the new helper.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
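One plausible reading of the note about initializing the diff to -1 (unsigned) is that an all-ones value works as a "no candidate yet" sentinel: any real distance compares smaller, and a candidate sitting exactly on the hint (distance 0) is not confused with an empty state. The C sketch below illustrates that reading only. The struct and function names are hypothetical, and the tie-break on length is an assumption, not the kernel's comparison logic.

```c
#include <stdint.h>

/* Hypothetical stand-in for the best-extent fields of an allocation cursor. */
struct best_extent {
    uint32_t start;
    uint32_t len;
    uint32_t diff;   /* distance from the hint; all ones means "none yet" */
};

static void best_init(struct best_extent *b)
{
    b->start = 0;
    b->len = 0;
    b->diff = (uint32_t)-1;   /* sentinel: larger than or equal to any real distance */
}

/* Returns 1 if the candidate replaced the current best, 0 otherwise. */
static int best_consider(struct best_extent *b, uint32_t start, uint32_t len,
                         uint32_t hint)
{
    uint32_t diff = start > hint ? start - hint : hint - start;

    if (diff > b->diff)
        return 0;             /* farther from the hint than the current best */
    if (diff == b->diff && len <= b->len)
        return 0;             /* same distance, but not longer (assumed tie-break) */

    b->start = start;
    b->len = len;
    b->diff = diff;
    return 1;
}
```

With this layout the first candidate is always accepted without a separate "have we found anything" flag, which keeps the comparison in a single code path.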
    • xfs: track best extent from cntbt lastblock scan in alloc cursor · c62321a2
      Brian Foster committed
      If the size lookup lands in the last block of the by-size btree, the
      near mode algorithm scans the entire block for the extent with best
      available locality. In preparation for similar best available
      extent tracking across both btrees, extend the allocation cursor
      with best extent data and lift the associated state from the cntbt
      last block scan code. No functional changes.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: track allocation busy state in allocation cursor · d6d3aff2
      Brian Foster committed
      Extend the allocation cursor to track extent busy state for an
      allocation attempt. No functional changes.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: introduce allocation cursor data structure · f5e7dbea
      Brian Foster committed
      Introduce a new allocation cursor data structure to encapsulate the
      various states and structures used to perform an extent allocation.
      This structure will eventually be used to track overall allocation
      state across different search algorithms on both free space btrees.
      
      To start, include the three btree cursors (one for the cntbt and two
      for the bnobt left/right search) used by the near mode allocation
      algorithm and refactor the cursor setup and teardown code into
      helpers. This slightly changes cursor memory allocation patterns,
      but otherwise makes no functional changes to the allocation
      algorithm.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      [darrick: fix sparse complaints]
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: track active state of allocation btree cursors · f6b428a4
      Brian Foster committed
      The upcoming allocation algorithm update searches multiple
      allocation btree cursors concurrently. As such, it requires an
      active state to track when a particular cursor should continue
      searching. While active state will be modified based on higher level
      logic, we can define base functionality based on the result of
      allocation btree lookups.
      
      Define an active flag in the private area of the btree cursor.
      Update it based on the result of lookups in the existing allocation
      btree helpers. Finally, provide a new helper to query the current
      state.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  5. 21 Oct, 2019 1 commit
  6. 09 Oct, 2019 3 commits
    • xfs: move local to extent inode logging into bmap helper · aeea4b75
      Brian Foster committed
      The callers of xfs_bmap_local_to_extents_empty() log the inode
      external to the function, yet this function is where the on-disk
      format value is updated. Push the inode logging down into the
      function itself to help prevent future mistakes.
      
      Note that internal bmap callers track the inode logging flags
      independently and thus may log the inode core twice due to this
      change. This is harmless, so leave this code around for consistency
      with the other attr fork conversion functions.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: remove broken error handling on failed attr sf to leaf change · 603efebd
      Brian Foster committed
      xfs_attr_shortform_to_leaf() attempts to put the shortform fork back
      together after a failed attempt to convert from shortform to leaf
      format. While this code reallocates and copies back the shortform
      attr fork data, it never resets the inode format field back to local
      format. Further, now that the inode is properly logged after the
      initial switch from local format, any error that triggers the
      recovery code will eventually abort the transaction and shut down the
      fs. Therefore, remove the broken and unnecessary error handling
      code.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: log the inode on directory sf to block format change · 0b10d8a8
      Brian Foster committed
      When a directory changes from shortform (sf) to block format, the sf
      format is copied to a temporary buffer, the inode format is modified
      and the updated format filled with the dentries from the temporary
      buffer. If the inode format is modified and the attempt to grow the
      inode fails (due to I/O error, for example), it is possible to
      return an error while leaving the directory in an inconsistent state
      and with an otherwise clean transaction. This results in corruption
      of the associated directory and leads to xfs_dabuf_map() errors as
      subsequent lookups cannot accurately determine the format of the
      directory. This problem is reproduced occasionally by generic/475.
      
      The fundamental problem is that xfs_dir2_sf_to_block() changes the
      on-disk inode format without logging the inode. The inode is
      eventually logged by the bmapi layer in the common case, but error
      checking introduces the possibility of failing the high level
      request before this happens.
      
      Update both of the dir2 and attr callers of
      xfs_bmap_local_to_extents_empty() to log the inode core as
      consistent with the bmap local to extent format change codepath.
      This ensures that any subsequent errors after the format has changed
      cause the transaction to abort.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  7. 07 Oct, 2019 1 commit
  8. 24 Sep, 2019 3 commits
  9. 04 Sep, 2019 1 commit
  10. 03 Sep, 2019 1 commit
  11. 31 Aug, 2019 9 commits