1. 13 Feb 2023, 14 commits
    • xfs: convert xfs_alloc_vextent_iterate_ags() to use perag walker · 3432ef61
      Committed by Dave Chinner
      Now that the AG iteration code in the core allocation code has been
      cleaned up, we can easily convert it to use a for_each_perag..()
      variant that takes active references and skips AGs that it can't get
      active references on.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
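
      As a standalone illustration (editor's sketch: every type and helper
      below is a simplified stand-in for the for_each_perag..() machinery
      the commit refers to, not the kernel's code), the wrapping,
      reference-taking walk has this shape:

        #include <stddef.h>

        struct xfs_perag { int agno; int active_refs; };
        struct xfs_mount { int m_agcount; struct xfs_perag *m_pags; };

        /* Returns NULL when an active reference cannot be taken. */
        static struct xfs_perag *perag_grab(struct xfs_mount *mp, int agno)
        {
                struct xfs_perag *pag = &mp->m_pags[agno];

                if (pag->active_refs < 0)       /* AG being torn down */
                        return NULL;
                pag->active_refs++;
                return pag;
        }

        static void perag_rele(struct xfs_perag *pag)
        {
                pag->active_refs--;
        }

        /* Visit each AG once, starting at start_agno, wrapping at the end. */
        static int iterate_ags(struct xfs_mount *mp, int start_agno)
        {
                for (int i = 0; i < mp->m_agcount; i++) {
                        int agno = (start_agno + i) % mp->m_agcount;
                        struct xfs_perag *pag = perag_grab(mp, agno);

                        if (!pag)
                                continue;       /* skip: no active reference */
                        /* ... attempt the allocation from this AG ... */
                        perag_rele(pag);
                }
                return -1;                      /* no AG could satisfy it */
        }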
    • xfs: move the minimum agno checks into xfs_alloc_vextent_check_args · 8b813568
      Committed by Dave Chinner
      All of the allocation functions now extract the minimum allowed AG
      from the transaction and then use it in some way. The allocation
      functions that are restricted to a single AG all check whether the
      requested AG can be allocated from, and return an error if it
      cannot. These all set args->agno appropriately.
      
      All the allocation functions that iterate AGs use it to calculate
      the scan start AG. args->agno is not set until the iterator starts
      walking AGs.
      
      Hence we can easily set up a conditional check against the minimum
      AG allowed in xfs_alloc_vextent_check_args() based on whether
      args->agno contains NULLAGNUMBER or not and move all the repeated
      setup code to xfs_alloc_vextent_check_args(), further simplifying
      the allocation functions.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
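
      A minimal sketch of that conditional (simplified types; the error
      value and struct layout are assumptions, only the NULLAGNUMBER
      convention comes from the text above):

        #include <stdint.h>

        #define NULLAGNUMBER    ((uint32_t)-1)

        struct xfs_alloc_arg {
                uint32_t agno;  /* pre-set only by single-AG callers */
        };

        static int check_args(struct xfs_alloc_arg *args, uint32_t minimum_agno)
        {
                /*
                 * Single-AG allocations set args->agno before getting here;
                 * AG-iterating allocations leave it at NULLAGNUMBER until
                 * the walker assigns it, so one shared check covers both.
                 */
                if (args->agno != NULLAGNUMBER && args->agno < minimum_agno)
                        return -1;      /* kernel code would return an errno */
                return 0;
        }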
    • xfs: fold xfs_alloc_ag_vextent() into callers · 230e8fe8
      Committed by Dave Chinner
      We don't need the multiplexing xfs_alloc_ag_vextent() provided
      anymore - we can just call the exact/near/size variants directly.
      This allows us to remove args->type completely and stop using
      args->fsbno as an input to the allocator algorithms.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
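
      Illustrative shape of the change (editor's sketch; the three strategy
      names are from this series, the prototypes are simplified):

        struct xfs_alloc_arg;

        /* The strategies the removed multiplexer used to dispatch to: */
        int xfs_alloc_ag_vextent_exact(struct xfs_alloc_arg *args);
        int xfs_alloc_ag_vextent_near(struct xfs_alloc_arg *args);
        int xfs_alloc_ag_vextent_size(struct xfs_alloc_arg *args);

        /*
         * Before, callers set args->type and a dispatcher switched on it;
         * now a caller simply names the strategy it wants:
         */
        static int alloc_anywhere_in_ag(struct xfs_alloc_arg *args)
        {
                return xfs_alloc_ag_vextent_size(args); /* no args->type */
        }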
    • xfs: move allocation accounting to xfs_alloc_vextent_set_fsbno() · e4d17426
      Committed by Dave Chinner
      Move it from xfs_alloc_ag_vextent() so we can get rid of that layer.
      Rename xfs_alloc_vextent_set_fsbno() to xfs_alloc_vextent_finish()
      to indicate that its job is to finish off the allocation that
      we've run, now that it contains much more functionality.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: introduce xfs_alloc_vextent_prepare() · 74b9aa63
      Committed by Dave Chinner
      Now that we have wrapper functions for each type of allocation we
      can ask for, we can start unravelling xfs_alloc_ag_vextent(). That
      is essentially just a prepare stage, the allocation multiplexer,
      and a post-allocation accounting step if the allocation proceeded.
      
      The current xfs_alloc_vextent*() wrappers all have a prepare stage,
      the allocation operation and a post-allocation accounting step.
      
      We can consolidate this by moving the AG alloc prep code into the
      wrapper functions, moving the accounting code into the wrapper
      accounting functions, and cutting out the multiplexer layer
      entirely.
      
      This patch consolidates the AG preparation stage.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
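
      The consolidated wrapper shape, sketched (the function names are from
      this series; the prototypes, especially passing the error into the
      finish step, are assumptions):

        struct xfs_alloc_arg;

        int xfs_alloc_vextent_prepare(struct xfs_alloc_arg *args);
        int xfs_alloc_ag_vextent_near(struct xfs_alloc_arg *args);
        int xfs_alloc_vextent_finish(struct xfs_alloc_arg *args, int error);

        /* Every xfs_alloc_vextent_*() wrapper ends up with three phases: */
        static int wrapper_shape(struct xfs_alloc_arg *args)
        {
                int error = xfs_alloc_vextent_prepare(args);     /* prep */

                if (!error)
                        error = xfs_alloc_ag_vextent_near(args); /* policy */
                return xfs_alloc_vextent_finish(args, error);    /* account */
        }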
    • xfs: introduce xfs_alloc_vextent_exact_bno() · 5f36b2ce
      Committed by Dave Chinner
      Two of the callers to xfs_alloc_vextent_this_ag() actually want
      exact block number allocation, not anywhere-in-ag allocation. Split
      this out from _this_ag() as a first class citizen so no external
      extent allocation code needs to care about args->type anymore.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
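
      A hedged sketch of the new entry point (only the function names come
      from the commit; the prototype and internals here are assumptions):

        typedef unsigned long long xfs_fsblock_t;

        struct xfs_alloc_arg {
                xfs_fsblock_t fsbno;    /* target in, result out, for now */
        };

        int xfs_alloc_vextent_this_ag(struct xfs_alloc_arg *args);

        /* "Exactly this block or fail", named by the function itself: */
        static int alloc_exact_bno(struct xfs_alloc_arg *args,
                                   xfs_fsblock_t target)
        {
                args->fsbno = target;
                /* the real helper also picks the exact-match strategy */
                return xfs_alloc_vextent_this_ag(args);
        }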
    • xfs: introduce xfs_alloc_vextent_near_bno() · db4710fd
      Committed by Dave Chinner
      The remaining callers of xfs_alloc_vextent() are all doing NEAR_BNO
      allocations. We can replace that function with a new
      xfs_alloc_vextent_near_bno() function that does this explicitly.
      
      We also multiplex NEAR_BNO allocations through
      xfs_alloc_vextent_this_ag via args->type. Replace all of these with
      direct calls to xfs_alloc_vextent_near_bno(), too.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
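
      Rough before/after of one such caller conversion (argument setup is
      heavily abbreviated and the struct layout is invented):

        typedef unsigned long long xfs_fsblock_t;
        enum xfs_alloctype { XFS_ALLOCTYPE_NEAR_BNO };  /* being removed */

        struct xfs_alloc_arg {
                enum xfs_alloctype type;
                xfs_fsblock_t fsbno;
        };

        int xfs_alloc_vextent(struct xfs_alloc_arg *args);
        int xfs_alloc_vextent_near_bno(struct xfs_alloc_arg *args,
                                       xfs_fsblock_t target);

        static int converted_caller(struct xfs_alloc_arg *args,
                                    xfs_fsblock_t target)
        {
                /*
                 * Before:
                 *      args->type = XFS_ALLOCTYPE_NEAR_BNO;
                 *      args->fsbno = target;
                 *      error = xfs_alloc_vextent(args);
                 */
                return xfs_alloc_vextent_near_bno(args, target);
        }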
    • xfs: use xfs_alloc_vextent_start_bno() where appropriate · 2a7f6d41
      Committed by Dave Chinner
      Change obvious callers of single AG allocation to use
      xfs_alloc_vextent_start_bno(). Callers no longer need to specify
      XFS_ALLOCTYPE_START_BNO, and so the type can be driven inward and
      removed.
      
      While doing this, also pass the allocation target fsb as a parameter
      rather than encoding it in args->fsbno.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
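
      Sketch of the calling-convention change (types simplified; NULLFSBLOCK
      stands in for the usual "no block" sentinel):

        typedef unsigned long long xfs_fsblock_t;
        #define NULLFSBLOCK     ((xfs_fsblock_t)-1)

        struct xfs_alloc_arg {
                xfs_fsblock_t fsbno;    /* OUT: start of allocated extent */
        };

        int xfs_alloc_vextent_start_bno(struct xfs_alloc_arg *args,
                                        xfs_fsblock_t target);

        static xfs_fsblock_t alloc_with_hint(struct xfs_alloc_arg *args,
                                             xfs_fsblock_t hint)
        {
                /* The hint travels as a parameter, not via args->fsbno... */
                if (xfs_alloc_vextent_start_bno(args, hint))
                        return NULLFSBLOCK;
                /* ...so fsbno only ever carries the result back out. */
                return args->fsbno;
        }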
    • xfs: use xfs_alloc_vextent_first_ag() where appropriate · 319c9e87
      Committed by Dave Chinner
      Change obvious callers of single AG allocation to use
      xfs_alloc_vextent_first_ag(). This gets rid of
      XFS_ALLOCTYPE_FIRST_AG, as the type used within
      xfs_alloc_vextent_first_ag() during iteration is _THIS_AG. Hence we
      can remove the setting of args->type from all the callers of
      _first_ag() and remove the alloctype.
      
      While doing this, pass the allocation target fsb as a parameter
      rather than encoding it in args->fsbno. This starts the process
      of making args->fsbno an output only variable rather than
      input/output.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: use xfs_alloc_vextent_this_ag() where appropriate · 74c36a86
      Committed by Dave Chinner
      Change obvious callers of single AG allocation to use
      xfs_alloc_vextent_this_ag(). Drive the per-ag grabbing out to the
      callers, too, so that callers with active references don't need
      to do new lookups just for an allocation in a context that already
      has a perag reference.
      
      The only remaining caller that does single AG allocation through
      xfs_alloc_vextent() is xfs_bmap_btalloc() with
      XFS_ALLOCTYPE_NEAR_BNO. That is going to need more untangling before
      it can be converted cleanly.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
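
      Sketch of a caller holding its own active reference (the grab/rele
      names follow the active-reference work in this series; all prototypes
      are simplified):

        struct xfs_mount;
        struct xfs_perag;

        struct xfs_alloc_arg {
                struct xfs_perag *pag;  /* active ref supplied by caller */
        };

        struct xfs_perag *xfs_perag_grab(struct xfs_mount *mp,
                                         unsigned int agno);
        void xfs_perag_rele(struct xfs_perag *pag);
        int xfs_alloc_vextent_this_ag(struct xfs_alloc_arg *args,
                                      unsigned int agno);

        static int alloc_in_one_ag(struct xfs_mount *mp, unsigned int agno,
                                   struct xfs_alloc_arg *args)
        {
                int error;

                /* One lookup here; contexts already holding a perag
                 * reference can pass it straight in instead. */
                args->pag = xfs_perag_grab(mp, agno);
                if (!args->pag)
                        return -1;
                error = xfs_alloc_vextent_this_ag(args, agno);
                xfs_perag_rele(args->pag);
                return error;
        }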
    • xfs: combine __xfs_alloc_vextent_this_ag and xfs_alloc_ag_vextent · 4811c933
      Committed by Dave Chinner
      There's a bit of a recursive conundrum around
      xfs_alloc_ag_vextent(). We can't call xfs_alloc_ag_vextent()
      without first preparing the AGFL for the allocation, and preparing
      the AGFL itself calls xfs_alloc_ag_vextent() to fill the free list.
      This "double allocation" requirement is not really clear from the
      current xfs_alloc_fix_freelist() calls that are sprinkled through
      the allocation code.
      
      It's not helped that xfs_alloc_ag_vextent() can actually allocate
      from the AGFL itself, but there's special code to prevent AGFL prep
      allocations from allocating from the free list it's trying to prep.
      The naming is also not consistent: args->wasfromfl is true when we
      allocated _from_ the free list, but the indication that we are
      allocating _for_ the free list is via checking that (args->resv ==
      XFS_AG_RESV_AGFL).
      
      So, let's make this "allocation required for allocation" situation
      clear by moving it all inside xfs_alloc_ag_vextent(). The freelist
      allocation is a specific XFS_ALLOCTYPE_THIS_AG allocation, which
      translates directly to a xfs_alloc_ag_vextent_size() allocation.
      
      This enables us to replace __xfs_alloc_vextent_this_ag() with a call
      to xfs_alloc_ag_vextent(), and we drive the freelist fixing further
      into the per-ag allocation algorithm.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
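
      The "allocation required for allocation" flow, sketched (the AGFL
      marker field and prototypes are stand-ins; only the function names
      used in the text are real):

        struct xfs_alloc_arg {
                int for_agfl;   /* stand-in for resv == XFS_AG_RESV_AGFL */
        };

        /* May itself call back into the allocator to refill the AGFL. */
        int xfs_alloc_fix_freelist(struct xfs_alloc_arg *args);
        int xfs_alloc_ag_vextent_size(struct xfs_alloc_arg *args);

        static int ag_vextent_shape(struct xfs_alloc_arg *args)
        {
                if (!args->for_agfl) {
                        /* Ensure the AGFL can feed any btree splits this
                         * allocation causes; refilling it is itself a
                         * THIS_AG (size) allocation - the recursion. */
                        int error = xfs_alloc_fix_freelist(args);

                        if (error)
                                return error;
                }
                /* then run the requested exact/near/size strategy;
                 * only the size case is shown here */
                return xfs_alloc_ag_vextent_size(args);
        }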
    • xfs: factor xfs_alloc_vextent_this_ag() for _iterate_ags() · 2edf06a5
      Committed by Dave Chinner
      The core of the per-ag iteration is effectively doing a "this ag"
      allocation on one AG at a time. Use the same code to implement the
      core "this ag" allocation in both xfs_alloc_vextent_this_ag()
      and xfs_alloc_vextent_iterate_ags().
      
      This means we only call xfs_alloc_ag_vextent() from one place so we
      can easily collapse the call stack in future patches.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: rework xfs_alloc_vextent() · ecd788a9
      Committed by Dave Chinner
      It's a multiplexing mess that can be greatly simplified, and really
      needs to be simplified to allow active per-ag references to
      propagate from the initial AG selection code to the bmapi code.
      
      This splits the code out into a separate parameter checking
      function, an iterator function, and allocation completion functions,
      and then implements the individual policies using these functions.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
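
      Skeleton of the decomposition (the three function names appear in the
      later commits above; the prototypes are assumptions):

        struct xfs_alloc_arg;

        int xfs_alloc_vextent_check_args(struct xfs_alloc_arg *args);
        int xfs_alloc_vextent_iterate_ags(struct xfs_alloc_arg *args,
                                          unsigned int start_agno);
        int xfs_alloc_vextent_set_fsbno(struct xfs_alloc_arg *args,
                                        int error);

        /* One allocation policy, expressed via the shared pieces: */
        static int policy_shape(struct xfs_alloc_arg *args,
                                unsigned int start_agno)
        {
                int error = xfs_alloc_vextent_check_args(args);

                if (!error)
                        error = xfs_alloc_vextent_iterate_ags(args,
                                                              start_agno);
                return xfs_alloc_vextent_set_fsbno(args, error);
        }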
    • xfs: perags need atomic operational state · 7ac2ff8b
      Committed by Dave Chinner
      We currently don't have any flags or operational state in the
      xfs_perag except for the pagf_init and pagi_init flags. And the
      agflreset flag. Oh, there's also the pagf_metadata and pagi_inodeok
      flags, too.
      
      For controlling per-ag operations, we are going to need some atomic
      state flags. Hence add an opstate field similar to what we already
      have in the mount and log, and convert all these state flags across
      to atomic bit operations.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
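
      Sketch of the conversion (kernel-style; the flag names here are
      illustrative rather than necessarily the kernel's, but
      set_bit()/test_bit() are the real atomic bitops):

        #include <linux/bitops.h>
        #include <linux/types.h>

        enum {
                XFS_AGSTATE_AGF_INIT = 0,       /* was pag->pagf_init */
                XFS_AGSTATE_AGI_INIT,           /* was pag->pagi_init */
                XFS_AGSTATE_AGFL_RESET,         /* was the agflreset flag */
                XFS_AGSTATE_METADATA,           /* was pag->pagf_metadata */
                XFS_AGSTATE_INODEOK,            /* was pag->pagi_inodeok */
        };

        struct perag_sketch {
                unsigned long pag_opstate;      /* one word of state bits */
        };

        static inline bool agf_initialised(struct perag_sketch *pag)
        {
                return test_bit(XFS_AGSTATE_AGF_INIT, &pag->pag_opstate);
        }

        static inline void set_agf_initialised(struct perag_sketch *pag)
        {
                /* atomic read-modify-write: no lock needed */
                set_bit(XFS_AGSTATE_AGF_INIT, &pag->pag_opstate);
        }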
  2. 11 Feb 2023, 2 commits
    • xfs: t_firstblock is tracking AGs not blocks · 692b6cdd
      Committed by Dave Chinner
      The tp->t_firstblock field is now really tracking the highest AG we
      have locked, not the block number of the highest allocation we've
      made. Its purpose is to prevent AGF locking deadlocks, so rename it
      to "highest AG" and simplify the implementation to just track the
      agno rather than a fsbno.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: fix low space alloc deadlock · 1dd0510f
      Committed by Dave Chinner
      I've recently encountered an ABBA deadlock with g/476. The upcoming
      changes seem to make this much easier to hit, but the underlying
      problem is a pre-existing one.
      
      Essentially, if we select an AG for allocation, then lock the AGF
      and then fail to allocate for some reason (e.g. minimum length
      requirements cannot be satisfied), then we drop out of the
      allocation with the AGF still locked.
      
      The caller then modifies the allocation constraints - usually
      loosening them up - and tries again. This can result in trying to
      access AGFs that are lower than the AGF we already have locked from
      the failed attempt. e.g. the failed attempt skipped several AGs
      before failing, so we have locked an AG higher than the start AG.
      Retrying the allocation from the start AG then causes us to violate
      AGF lock ordering and this can lead to deadlocks.
      
      The deadlock exists even if allocation succeeds - we can do
      followup allocations in the same transaction for BMBT blocks that
      aren't guaranteed to be in the same AG as the original, and can move
      into higher AGs. Hence we really need to move the tp->t_firstblock
      tracking down into xfs_alloc_vextent() where it can be set when we
      exit with a locked AG.
      
      xfs_alloc_vextent() can also check there whether the requested
      allocation falls within the allowed range of AGs set by
      tp->t_firstblock. If we can't allocate within the range set, we have
      to fail the allocation. If we are allowed to do non-blocking AGF
      locking, we can ignore the AG locking order limitations, as we can
      use try-locks for the first iteration over the requested AG range.
      
      This invalidates a set of post allocation asserts that check that
      the allocation is always above tp->t_firstblock if it is set.
      Because we can use try-locks to avoid the deadlock in some
      circumstances, having a pre-existing locked AGF doesn't always
      prevent allocation from lower order AGFs. Hence those ASSERTs need
      to be removed.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
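
      A toy model of the ordering rule and the try-lock escape hatch
      (pthread mutexes stand in for AGF buffer locks; nothing below is
      kernel code):

        #include <pthread.h>
        #include <stdbool.h>

        #define AGCOUNT 4
        static pthread_mutex_t agf_lock[AGCOUNT] = {
                PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
                PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
        };

        /*
         * highest_locked plays the role of tp->t_firstblock: the highest
         * AGF this transaction already holds.  Blocking on a lower AGF
         * after that can ABBA-deadlock against a task locking upwards,
         * so lower AGFs may only ever be try-locked.
         */
        static bool lock_agf(int agno, int highest_locked)
        {
                if (highest_locked >= 0 && agno < highest_locked)
                        return pthread_mutex_trylock(&agf_lock[agno]) == 0;
                pthread_mutex_lock(&agf_lock[agno]);
                return true;
        }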
  3. 06 Feb 2023, 1 commit
  4. 18 Nov 2022, 1 commit
  5. 31 Oct 2022, 1 commit
  6. 12 Oct 2022, 1 commit
    • treewide: use prandom_u32_max() when possible, part 1 · 81895a65
      Committed by Jason A. Donenfeld
      Rather than incurring a division or requesting too many random bytes for
      the given range, use the prandom_u32_max() function, which only takes
      the minimum required bytes from the RNG and avoids divisions. This was
      done mechanically with this coccinelle script:
      
      @basic@
      expression E;
      type T;
      identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
      typedef u64;
      @@
      (
      - ((T)get_random_u32() % (E))
      + prandom_u32_max(E)
      |
      - ((T)get_random_u32() & ((E) - 1))
      + prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2)
      |
      - ((u64)(E) * get_random_u32() >> 32)
      + prandom_u32_max(E)
      |
      - ((T)get_random_u32() & ~PAGE_MASK)
      + prandom_u32_max(PAGE_SIZE)
      )
      
      @multi_line@
      identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
      identifier RAND;
      expression E;
      @@
      
      -       RAND = get_random_u32();
              ... when != RAND
      -       RAND %= (E);
      +       RAND = prandom_u32_max(E);
      
      // Find a potential literal
      @literal_mask@
      expression LITERAL;
      type T;
      identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
      position p;
      @@
      
              ((T)get_random_u32()@p & (LITERAL))
      
      // Add one to the literal.
      @script:python add_one@
      literal << literal_mask.LITERAL;
      RESULT;
      @@
      
      value = None
      if literal.startswith('0x'):
              value = int(literal, 16)
      elif literal[0] in '123456789':
              value = int(literal, 10)
      if value is None:
              print("I don't know how to handle %s" % (literal))
              cocci.include_match(False)
      elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1:
              print("Skipping 0x%x for cleanup elsewhere" % (value))
              cocci.include_match(False)
      elif value & (value + 1) != 0:
              print("Skipping 0x%x because it's not a power of two minus one" % (value))
              cocci.include_match(False)
      elif literal.startswith('0x'):
              coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1))
      else:
              coccinelle.RESULT = cocci.make_expr("%d" % (value + 1))
      
      // Replace the literal mask with the calculated result.
      @plus_one@
      expression literal_mask.LITERAL;
      position literal_mask.p;
      expression add_one.RESULT;
      identifier FUNC;
      @@
      
      -       (FUNC()@p & (LITERAL))
      +       prandom_u32_max(RESULT)
      
      @collapse_ret@
      type T;
      identifier VAR;
      expression E;
      @@
      
       {
      -       T VAR;
      -       VAR = (E);
      -       return VAR;
      +       return E;
       }
      
      @drop_var@
      type T;
      identifier VAR;
      @@
      
       {
      -       T VAR;
              ... when != VAR
       }
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Yury Norov <yury.norov@gmail.com>
      Reviewed-by: KP Singh <kpsingh@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap
      Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
      Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
      Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
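
      A kernel-style example of the conversion the script performs (the two
      APIs are real; the helper around them is invented):

        #include <linux/prandom.h>      /* prandom_u32_max() */
        #include <linux/random.h>       /* get_random_u32() */

        static u32 pick_slot(u32 nr_slots)
        {
                /* Before: return get_random_u32() % nr_slots;
                 * which incurs a division. */

                /* After: multiply-and-shift, no division: */
                return prandom_u32_max(nr_slots); /* in [0, nr_slots) */
        }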
  7. 23 Jul 2022, 1 commit
  8. 07 Jul 2022, 7 commits
  9. 21 Apr 2022, 1 commit
  10. 11 Apr 2022, 1 commit
  11. 22 Mar 2022, 1 commit
  12. 23 Oct 2021, 4 commits
  13. 20 Oct 2021, 3 commits
    • xfs: compute absolute maximum nlevels for each btree type · 0ed5f735
      Committed by Darrick J. Wong
      Add code for all five btree types so that we can compute the absolute
      maximum possible btree height for each btree type.  This is a setup for
      the next patch, which makes every btree type have its own cursor cache.
      
      The functions are exported so that we can have xfs_db report the
      absolute maximum btree heights for each btree type, rather than making
      everyone run their own ad-hoc computations.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
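
      A standalone model of the computation involved (patterned after the
      kernel's xfs_btree_compute_maxlevels(); the fanout constants below
      are invented):

        #include <stdio.h>

        /*
         * limits[0]: minimum records per leaf block, limits[1]: minimum
         * records per interior node - the worst-case fanout of a tree in
         * which every block is only minimally full.
         */
        static unsigned int compute_maxlevels(const unsigned int *limits,
                                              unsigned long long records)
        {
                unsigned long long capacity = limits[0];
                unsigned int height = 1;

                /* Each level multiplies worst-case capacity by the fanout. */
                while (records > capacity) {
                        capacity *= limits[1];
                        height++;
                }
                return height;
        }

        int main(void)
        {
                unsigned int limits[2] = { 255, 170 }; /* invented fanout */

                printf("max height = %u\n",
                       compute_maxlevels(limits, 0xFFFFFFFFull));
                return 0;
        }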
    • xfs: rename m_ag_maxlevels to m_allocbt_maxlevels · 7cb3efb4
      Committed by Darrick J. Wong
      Years ago when XFS was thought to be much more simple, we introduced
      m_ag_maxlevels to specify the maximum btree height of per-AG btrees for
      a given filesystem mount.  Then we observed that inode btrees don't
      actually have the same height and split that off; and now we have rmap
      and refcount btrees with much different geometries and separate
      maxlevels variables.
      
      The 'ag' part of the name doesn't make much sense anymore, so rename
      this to m_alloc_maxlevels to reinforce that this is the maximum height
      of the *free space* btrees.  This sets us up for the next patch, which
      will add a variable to track the maximum height of all AG btrees.
      
      (Also take the opportunity to improve adjacent comments and fix minor
      style problems.)
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: prepare xfs_btree_cur for dynamic cursor heights · 6ca444cf
      Committed by Darrick J. Wong
      Split out the btree level information into a separate struct and put it
      at the end of the cursor structure as a VLA.  Files with huge data forks
      (and in the future, the realtime rmap btree) will require the ability to
      support many more levels than a per-AG btree cursor, which means that
      we're going to create per-btree type cursor caches to conserve memory
      for the more common case.
      
      Note that a subsequent patch actually introduces dynamic cursor heights.
      This one merely rearranges the structure to prepare for that.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
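
      A sketch of the rearrangement (field names invented; the point is the
      split-out per-level struct and the flexible array member at the end):

        #include <stdlib.h>

        struct btree_level_sketch {
                void *bp;       /* buffer for the block at this level */
                int ptr;        /* current slot within that block */
        };

        struct btree_cur_sketch {
                unsigned int nlevels;   /* height this cursor can address */
                /* ... fixed-size cursor state ... */
                struct btree_level_sketch levels[]; /* sized per btree type */
        };

        static struct btree_cur_sketch *cursor_alloc(unsigned int maxlevels)
        {
                /* Allocate only the levels this btree type can need. */
                struct btree_cur_sketch *cur =
                        calloc(1, sizeof(*cur) +
                               maxlevels * sizeof(struct btree_level_sketch));

                if (cur)
                        cur->nlevels = maxlevels;
                return cur;
        }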
  14. 15 Oct 2021, 1 commit
  15. 20 Aug 2021, 1 commit