1. 27 Dec 2021, 2 commits
  2. 07 Oct 2020, 1 commit
    • xfs: periodically relog deferred intent items · 4e919af7
      Authored by Darrick J. Wong
      There's a subtle design flaw in the deferred log item code that can lead
      to pinning the log tail.  Taking up the defer ops chain examples from
      the previous commit, we can get trapped in sequences like this:
      
      Caller hands us a transaction t0 with D0-D3 attached.  The defer ops
      chain will look like the following if the transaction rolls succeed:
      
      t1: D0(t0), D1(t0), D2(t0), D3(t0)
      t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
      t3: d5(t1), D1(t0), D2(t0), D3(t0)
      ...
      t9: d9(t7), D3(t0)
      t10: D3(t0)
      t11: d10(t10), d11(t10)
      t12: d11(t10)
      
      In transaction 9, we finish d9 and try to roll to t10 while holding onto
      an intent item for D3 that we logged in t0.
      
      The previous commit changed the order in which we place new defer ops in
      the defer ops processing chain to reduce the maximum chain length.  Now
      make xfs_defer_finish_noroll capable of relogging the entire chain
      periodically so that we can always move the log tail forward.  Most
      chains will never get relogged, except for operations that generate very
      long chains (large extents containing many blocks with different sharing
      levels) or are on filesystems with small logs and a lot of ongoing
      metadata updates.
      
      Callers are now required to ensure that the transaction reservation is
      large enough to handle logging done items and new intent items for the
      maximum possible chain length.  Most callers are careful to keep the
      chain lengths low, so the overhead should be minimal.
      
      The decision to relog an intent item is made based on whether the intent
      was logged in a previous checkpoint, since there's no point in relogging
      an intent into the same checkpoint.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      4e919af7
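The relog policy described in this commit message can be sketched as a simple checkpoint-sequence comparison. The names and types below are illustrative inventions for the sketch, not the kernel's actual API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative sketch: an intent logged under checkpoint sequence
 * item_seq only needs relogging if the CIL has since moved on to a
 * newer checkpoint -- relogging an intent into the same checkpoint
 * cannot move the log tail forward.
 */
static bool intent_needs_relog(uint64_t item_seq, uint64_t cur_cil_seq)
{
    return item_seq < cur_cil_seq;
}
```
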
  3. 16 Sep 2020, 2 commits
  4. 07 Sep 2020, 1 commit
  5. 29 Jul 2020, 8 commits
  6. 07 Jul 2020, 1 commit
    • xfs: redesign the reflink remap loop to fix blkres depletion crash · 00fd1d56
      Authored by Darrick J. Wong
      The existing reflink remapping loop has some structural problems that
      need addressing:
      
      The biggest problem is that we create one transaction for each extent in
      the source file without accounting for the number of mappings there are
      for the same range in the destination file.  In other words, we don't
      know the number of remap operations that will be necessary and we
      therefore cannot guess the block reservation required.  On highly
      fragmented filesystems (e.g. ones with active dedupe) we guess wrong,
      run out of block reservation, and fail.
      
      The second problem is that we don't actually use the bmap intents to
      their full potential -- instead of calling bunmapi directly and having
      to deal with its backwards operation, we could call the deferred ops
      xfs_bmap_unmap_extent and xfs_refcount_decrease_extent instead.  This
      makes the frontend loop much simpler.
      
      Solve both of these problems by refactoring the remapping loops so that
      we only perform one remapping operation per transaction, and each
      operation only tries to remap a single extent from source to dest.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reported-by: Edwin Török <edwin@etorok.net>
      Tested-by: Edwin Török <edwin@etorok.net>
      00fd1d56
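The reworked loop shape can be sketched as follows. Everything here (the struct, the helper name, the comments standing in for transaction calls) is invented for illustration; the point is only the one-transaction-per-extent structure:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch: each iteration is its own "transaction" with a
 * fixed per-extent block reservation, so the reservation can never be
 * depleted by an unknown number of destination mappings. */
struct extent { long long off, len; };

static size_t remap_range(const struct extent *src, size_t nextents)
{
    size_t transactions = 0;

    for (size_t i = 0; i < nextents; i++) {
        /* tp = xfs_trans_alloc(...per-extent reservation...); */
        /* remap src[i] into the destination, then commit tp */
        transactions++;
    }
    return transactions;   /* exactly one transaction per extent */
}
```
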
  7. 20 May 2020, 2 commits
  8. 27 Mar 2020, 2 commits
  9. 18 Mar 2020, 3 commits
    • xfs: support bulk loading of staged btrees · 60e3d707
      Authored by Darrick J. Wong
      Add a new btree function that enables us to bulk load a btree cursor.
      This will be used by the upcoming online repair patches to generate new
      btrees.  This avoids the programmatic inefficiency of calling
      xfs_btree_insert in a loop (which generates a lot of log traffic) in
      favor of stamping out new btree blocks with ordered buffers, and then
      committing both the new root and scheduling the removal of the old btree
      blocks in a single transaction commit.
      
      The design of this new generic code is based off the btree rebuilding
      code in xfs_repair's phase 5 code, with the explicit goal of enabling us
      to share that code between scrub and repair.  It has the additional
      feature of being able to control btree block loading factors.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      60e3d707
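The "loading factor" idea mentioned above can be sketched with a small helper. The helper name and signature are invented for the sketch; it only shows the arithmetic of leaving slack in freshly stamped blocks:

```c
#include <assert.h>

/* Illustrative sketch: when bulk loading stamps out new leaf blocks,
 * leaving `slack` free record slots per block trades some space for
 * fewer splits on future inserts. */
static unsigned leaf_blocks_needed(unsigned nrecs, unsigned maxrecs,
                                   unsigned slack)
{
    unsigned per_block = maxrecs - slack;   /* records loaded per leaf */

    return (nrecs + per_block - 1) / per_block;   /* ceiling division */
}
```
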
    • xfs: introduce fake roots for inode-rooted btrees · 349e1c03
      Authored by Darrick J. Wong
      Create an in-core fake root for inode-rooted btree types so that callers
      can generate a whole new btree using the upcoming btree bulk load
      function without making the new tree accessible from the rest of the
      filesystem.  It is up to the individual btree type to provide a function
      to create a staged cursor (presumably with the appropriate callouts to
      update the fakeroot) and then commit the staged root back into the
      filesystem.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      349e1c03
    • xfs: introduce fake roots for ag-rooted btrees · e06536a6
      Authored by Darrick J. Wong
      Create an in-core fake root for AG-rooted btree types so that callers
      can generate a whole new btree using the upcoming btree bulk load
      function without making the new tree accessible from the rest of the
      filesystem.  It is up to the individual btree type to provide a function
      to create a staged cursor (presumably with the appropriate callouts to
      update the fakeroot) and then commit the staged root back into the
      filesystem.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      e06536a6
  10. 03 Mar 2020, 4 commits
  11. 19 Dec 2019, 1 commit
    • xfs: don't commit sunit/swidth updates to disk if that would cause repair failures · 13eaec4b
      Authored by Darrick J. Wong
      Alex Lyakas reported[1] that mounting an xfs filesystem with new sunit
      and swidth values could cause xfs_repair to fail loudly.  The problem
      here is that repair calculates where mkfs should have allocated the
      root inode, based on the superblock geometry.  The allocation decisions
      depend on sunit, which means that we really can't go updating sunit if
      it would lead to a subsequent repair failure on an otherwise correct
      filesystem.
      
      Port from xfs_repair some code that computes the location of the root
      inode and teach mount to skip the ondisk update if it would cause
      problems for repair.  Along the way we'll update the documentation,
      provide a function for computing the minimum AGFL size instead of
      open-coding it, and cut down some indenting in the mount code.
      
      Note that we allow the mount to proceed (and new allocations will
      reflect this new geometry) because we've never screened this kind of
      thing before.  We'll have to wait for a new future incompat feature to
      enforce correct behavior, alas.
      
      Note that the geometry reporting always uses the superblock values, not
      the incore ones, so that is what xfs_info and xfs_growfs will report.
      
      [1] https://lore.kernel.org/linux-xfs/20191125130744.GA44777@bfoster/T/#m00f9594b511e076e2fcdd489d78bc30216d72a7d
      Reported-by: Alex Lyakas <alex@zadara.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      13eaec4b
  12. 27 Nov 2019, 1 commit
    • ftrace: Rework event_create_dir() · 04ae87a5
      Authored by Peter Zijlstra
      Rework event_create_dir() to use an array of static data instead of
      function pointers where possible.
      
      The problem is that it would call the function pointer on module load
      before parse_args(), possibly even before jump_labels were initialized.
      Luckily the generated functions don't use jump_labels but it still seems
      fragile. It also gets in the way of changing when we make the module map
      executable.
      
      The generated functions are basically calling trace_define_field() with a
      bunch of static arguments. So instead of a function, capture these
      arguments in a static array, avoiding the function call.
      
      Now there are a number of cases where the fields are dynamic (syscall
      arguments, kprobes and uprobes), in which case a static array does not
      work, for these we preserve the function call. Luckily all these cases
      are not related to modules and so we can retain the function call for
      them.
      
      Also fix up all broken tracepoint definitions that now generate a
      compile error.
      Tested-by: Alexei Starovoitov <ast@kernel.org>
      Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20191111132458.342979914@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      04ae87a5
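The function-pointer-to-static-data move described above can be sketched like this. The struct layout, field values, and walker are all illustrative stand-ins, not the kernel's actual definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch: instead of a generated function that calls
 * trace_define_field() per field at module load, describe the fields
 * in a static array that generic code can walk whenever it chooses
 * (e.g. safely after parse_args()). */
struct field_def {
    const char *type;
    const char *name;
    int         offset;
    int         size;
};

static const struct field_def example_event_fields[] = {
    { "char[16]", "prev_comm", 0,  16 },
    { "int",      "prev_pid",  16, 4  },
    { "int",      "next_pid",  20, 4  },
};

static int define_fields(const struct field_def *defs, size_t n)
{
    int defined = 0;

    for (size_t i = 0; i < n; i++) {
        /* generic code would call trace_define_field(defs[i]...) here */
        defined++;
    }
    return defined;
}
```
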
  13. 30 Oct 2019, 1 commit
  14. 22 Oct 2019, 4 commits
    • xfs: optimize near mode bnobt scans with concurrent cntbt lookups · dc8e69bd
      Authored by Brian Foster
      The near mode fallback algorithm consists of a left/right scan of
      the bnobt. This algorithm has very poor breakdown characteristics
      under worst case free space fragmentation conditions. If a suitable
      extent is far enough from the locality hint, each allocation may
      scan most or all of the bnobt before it completes. This causes
      pathological behavior and extremely high allocation latencies.
      
      While locality is important to near mode allocations, it is not so
      important as to incur pathological allocation latency to provide the
      absolute best available locality for every allocation. If the
      allocation is large enough or far enough away, there is a point of
      diminishing returns. As such, we can bound the overall operation by
      including an iterative cntbt lookup in the broader search. The cntbt
      lookup is optimized to immediately find the extent with best
      locality for the given size on each iteration. Since the cntbt is
      indexed by extent size, the lookup repeats with a variably
      aggressive increasing search key size until it runs off the edge of
      the tree.
      
      This approach provides a natural balance between the two algorithms
      for various situations. For example, the bnobt scan is able to
      satisfy smaller allocations such as for inode chunks or btree blocks
      more quickly where the cntbt search may have to search through a
      large set of extent sizes when the search key starts off small
      relative to the largest extent in the tree. On the other hand, the
      cntbt search more deterministically covers the set of suitable
      extents for larger data extent allocation requests that the bnobt
      scan may have to search the entire tree to locate.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      dc8e69bd
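The bound this gives can be sketched numerically. The doubling policy below is a stand-in for the commit's "variably aggressive" key growth, chosen only to show why the walk terminates quickly:

```c
#include <assert.h>

/* Illustrative sketch: the cntbt walk retries with an increasingly
 * aggressive minimum-length key, so the number of lookups grows with
 * the log of the largest free extent rather than the width of the
 * bnobt. Doubling here stands in for the kernel's actual heuristic. */
static unsigned cntbt_lookups(unsigned start_len, unsigned largest_free)
{
    unsigned lookups = 0;

    for (unsigned key = start_len; key <= largest_free; key *= 2)
        lookups++;   /* one best-locality lookup per key size */

    return lookups;  /* loop "runs off the edge of the tree" */
}
```
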
    • xfs: factor out tree fixup logic into helper · d2968825
      Authored by Brian Foster
      Lift the btree fixup path into a helper function.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      d2968825
    • xfs: reuse best extent tracking logic for bnobt scan · fec0afda
      Authored by Brian Foster
      The near mode bnobt scan searches left and right in the bnobt
      looking for the closest free extent to the allocation hint that
      satisfies minlen. Once such an extent is found, the left/right
      search terminates, we search one more time in the opposite direction
      and finish the allocation with the best overall extent.
      
      The left/right and find best searches are currently controlled via a
      combination of cursor state and local variables. Clean up this code
      and prepare for further improvements to the near mode fallback
      algorithm by reusing the allocation cursor best extent tracking
      mechanism. Update the tracking logic to deactivate bnobt cursors
      when out of allocation range and replace open-coded extent checks with
      calls to the common helper. In doing so, rename some misnamed local
      variables in the top-level near mode allocation function.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      fec0afda
    • xfs: refactor cntbt lastblock scan best extent logic into helper · 396bbf3c
      Authored by Brian Foster
      The cntbt lastblock scan checks the size, alignment, locality, etc.
      of each free extent in the block and compares it with the current
      best candidate. This logic will be reused by the upcoming optimized
      cntbt algorithm, so refactor it into a separate helper. Note that
      acur->diff is now initialized to -1 (unsigned) instead of 0 to
      support the more granular comparison logic in the new helper.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      396bbf3c
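The (unsigned) -1 initialization mentioned above can be sketched in isolation. The helper name is invented; the point is only the sentinel arithmetic:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch: seeding the best-locality delta with
 * (uint32_t)-1 (i.e. UINT32_MAX) guarantees the first real candidate
 * wins the comparison, which is why a start of 0 would never admit
 * any candidate under a strict less-than test. */
static uint32_t track_best(uint32_t best_diff, uint32_t cand_diff)
{
    return cand_diff < best_diff ? cand_diff : best_diff;
}
```
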
  15. 21 Oct 2019, 2 commits
  16. 27 Aug 2019, 2 commits
    • xfs: add kmem_alloc_io() · f8f9ee47
      Authored by Dave Chinner
      Memory we use to submit for IO needs strict alignment to the
      underlying driver constraints. Worst case, this is 512 bytes. Given
      that all allocations for IO are always a power of 2 multiple of 512
      bytes, the kernel heap provides natural alignment for objects of
      these sizes and that suffices.
      
      Until, of course, memory debugging of some kind is turned on (e.g.
      red zones, poisoning, KASAN) and then the alignment of the heap
      objects is thrown out the window. Then we get weird IO errors and
      data corruption problems because drivers don't validate alignment
      and do the wrong thing when passed unaligned memory buffers in bios.
      
      To fix this, introduce kmem_alloc_io(), which will guarantee at least
      512 byte alignment of buffers for IO, even if memory debugging
      options are turned on. It is assumed that the minimum allocation
      size will be 512 bytes, and that sizes will be power of 2 multiples
      of 512 bytes.
      
      Use this everywhere we allocate buffers for IO.
      
      This no longer fails with log recovery errors when KASAN is enabled
      due to the brd driver not handling unaligned memory buffers:
      
      # mkfs.xfs -f /dev/ram0 ; mount /dev/ram0 /mnt/test
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      f8f9ee47
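The alignment guarantee can be illustrated with a userspace analogue. This is not the kernel's kmem_alloc_io() implementation, just a sketch of the same contract using POSIX posix_memalign():

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace sketch of the idea behind kmem_alloc_io(): force at least
 * 512-byte alignment so the buffer is safe to hand to a driver even
 * when heap debugging would otherwise skew object alignment. */
static void *alloc_io(size_t size)
{
    void *p = NULL;

    if (posix_memalign(&p, 512, size) != 0)
        return NULL;
    return p;
}
```
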
    • xfs: add kmem allocation trace points · 0ad95687
      Authored by Dave Chinner
      When trying to correlate XFS kernel allocations to memory reclaim
      behaviour, it is useful to know what allocations XFS is actually
      attempting. This information is not directly available from
      tracepoints in the generic memory allocation and reclaim
      tracepoints, so these new trace points provide a high level
      indication of what the XFS memory demand actually is.
      
      There is no per-filesystem context in this code, so we just trace
      the type of allocation, the size and the allocation constraints.
      The kmem code also doesn't include much of the common XFS headers,
      so there are a few definitions that need to be added to the trace
      headers and a couple of types that need to be made common to avoid
      needing to include the whole world in the kmem code.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      0ad95687
  17. 03 Jul 2019, 2 commits
  18. 29 Jun 2019, 1 commit
    • xfs: split iop_unlock · ddf92053
      Authored by Christoph Hellwig
      The iop_unlock method is called when committing or cancelling a
      transaction.  In the latter case, the transaction may or may not be
      aborted.  While there is no known problem with the current code in
      practice, this implementation is limited in that any log item
      implementation that might want to differentiate between a commit and a
      cancellation must rely on the aborted state.  The aborted bit is only
      set when the cancelled transaction is dirty, however.  This means that
      there is no way to distinguish between a commit and a clean transaction
      cancellation.
      
      For example, intent log items currently rely on this distinction.  The
      log item is either transferred to the CIL on commit or released on
      transaction cancel. There is currently no possibility for a clean intent
      log item in a transaction, but if that state is ever introduced a cancel
      of such a transaction will immediately result in memory leaks of the
      associated log item(s).  This is an interface deficiency and landmine.
      
      To clean this up, replace the iop_unlock method with an iop_release
      method that is specific to transaction cancel.  The existing
      iop_committing method occurs at the same time as iop_unlock in the
      commit path and there is no need for two separate callbacks here.
      Overload the iop_committing method with the current commit time
      iop_unlock implementations to eliminate the need for the latter and
      further simplify the interface.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      ddf92053
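The split interface can be sketched as an ops table. The struct and signatures below are simplified illustrations, not the kernel's xfs_item_ops definition:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch of the reworked callback table: transaction
 * cancel and commit take distinct hooks, so an implementation no
 * longer has to infer which path it is on from an aborted flag. */
struct log_item_ops {
    void (*iop_release)(void *item);                   /* cancel path  */
    void (*iop_committing)(void *item, uint64_t seq);  /* commit path,
                                                          absorbs the old
                                                          iop_unlock work */
};

static int releases, commits;

static void demo_release(void *item)    { (void)item; releases++; }
static void demo_committing(void *item, uint64_t seq)
{
    (void)item; (void)seq; commits++;
}

static const struct log_item_ops demo_ops = {
    .iop_release    = demo_release,
    .iop_committing = demo_committing,
};
```
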