1. 14 July 2022, 4 commits
    • xfs: double link the unlinked inode list · 2fd26cc0
      Committed by Dave Chinner
      Now that we have forwards traversal via the incore inode in place,
      we need to add back pointers to the incore inode to entirely
      replace the back reference cache. We use the same lookup semantics
      and constraints as for the forwards pointer lookups during unlinks,
      so we can look up any inode in the unlinked list directly and
      update the list pointers, forwards or backwards, at any time.
      
      The only wrinkle in converting the unlinked list manipulations to
      use in-core previous pointers is that log recovery doesn't have the
      incore inode state built up, so it can't just read in an inode and
      release it to finish off the unlink. Hence we need to modify the
      traversal in recovery to read one inode ahead before we release the
      inode at the head of the list. This populates the next->prev
      relationship sufficiently to replay the unlinked list, and hence
      greatly simplifies the runtime code.
      
      This recovery algorithm also requires that we actually remove inodes
      from the unlinked list one at a time as background inode
      inactivation will result in unlinked list removal racing with the
      building of the in-memory unlinked list state. We could serialise
      this by holding the AGI buffer lock when constructing the in-memory
      state, but all that does is lockstep background processing with
      list building. It is much simpler to flush the inodegc immediately
      after releasing the inode, so that it is unlinked immediately and
      there are no races at all.
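
      As an illustration of the one-ahead walk (xfs_irele(),
      xfs_inodegc_flush() and the i_next_unlinked field are real names
      from this series; the iget_unlinked() helper and the locals are
      hypothetical):

        do {
                struct xfs_inode *next_ip = NULL;

                /* Read the next inode ahead so its incore next->prev
                 * links exist before we release the list head. */
                if (ip->i_next_unlinked != NULLAGINO)
                        next_ip = iget_unlinked(mp, ip->i_next_unlinked);

                /* Releasing the head queues it for inactivation, which
                 * takes it off the on-disk unlinked list. */
                xfs_irele(ip);

                /* Flush inodegc so removal completes one inode at a
                 * time and cannot race with list building. */
                xfs_inodegc_flush(mp);

                ip = next_ip;
        } while (ip);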
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: refactor xlog_recover_process_iunlinks() · 04755d2e
      Committed by Dave Chinner
      For upcoming changes to the way inode unlinked list processing is
      done, the structure of recovery needs to change slightly. We also
      really need to untangle the messy error handling in list recovery
      so that actions like emptying the bucket on inode lookup failure
      are associated with the bucket list walk failing, not failing
      to look up the inode.
      
      Refactor the recovery code now to keep the re-organisation separate
      from the algorithm changes.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: track the iunlink list pointer in the xfs_inode · 4fcc94d6
      Committed by Dave Chinner
      Having direct access to the i_next_unlinked pointer in unlinked
      inodes greatly simplifies the processing of inodes on the unlinked
      list. We no longer need to look up the inode buffer just to find
      the next inode in the list if the xfs_inode is in memory. These
      improvements will be realised over upcoming patches as other
      dependencies on the inode buffer for unlinked list processing are
      removed.
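
      A minimal sketch of the change described (surrounding fields
      elided; only the new pointer is the point here):

        struct xfs_inode {
                /* ... existing incore inode fields ... */
                xfs_agino_t     i_next_unlinked; /* incore copy of the
                                                  * on-disk next-unlinked
                                                  * agino */
        };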
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: flush inode gc workqueue before clearing agi bucket · 04a98a03
      Committed by Zhang Yi
      In the procedure of recovering AGI unlinked lists, if something bad
      happens on one of the unlinked inodes in the bucket list, we would
      call xlog_recover_clear_agi_bucket() to clear the whole unlinked
      bucket list, not just the unlinked inodes after the bad one. If we
      have already added some inodes to the gc workqueue before the bad
      inode in the list, we could get the error below when freeing those
      inodes, and finally fail to complete the log recovery procedure:

       XFS (ram0): Internal error xfs_iunlink_remove at line 2456 of file
       fs/xfs/xfs_inode.c.  Caller xfs_ifree+0xb0/0x360 [xfs]

      The problem is that xlog_recover_clear_agi_bucket() clears the
      bucket list, so the gc worker fails the agino check in
      xfs_verify_agino(). Fix this by flushing the gc workqueue before
      clearing the bucket.
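
      A sketch of the fix as described (the exact
      xlog_recover_clear_agi_bucket() arguments are assumptions):

        /* Drain queued inode inactivations while the bucket list is
         * still intact and xfs_verify_agino() can succeed... */
        xfs_inodegc_flush(mp);

        /* ...then it is safe to empty the whole bucket. */
        xlog_recover_clear_agi_bucket(mp, agno, bucket);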
      
      Fixes: ab23a776 ("xfs: per-cpu deferred inode inactivation queues")
      Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  2. 07 July 2022, 2 commits
    • xfs: Pre-calculate per-AG agbno geometry · 0800169e
      Committed by Dave Chinner
      There is a lot of overhead in functions like xfs_verify_agbno() that
      repeatedly calculate the geometry limits of an AG. These can be
      pre-calculated as they are static and the verification context has
      a per-ag context it can quickly reference.
      
      In the case of xfs_verify_agbno(), we now always have a perag
      context handy, so we can store the AG length and the minimum valid
      block in the AG in the perag. This means we don't have to calculate
      it on every call and it can be inlined in callers if we move it
      to xfs_ag.h.
      
      Move xfs_ag_block_count() to xfs_ag.c because it's really a
      per-ag function and not an XFS type function. We need a little
      bit of rework that is specific to xfs_initialise_perag() to allow
      growfs to calculate the new perag sizes before we've updated the
      primary superblock during the grow (chicken/egg situation).
      
      Note that we leave the original xfs_verify_agbno in place in
      xfs_types.c as a static function, as other callers in that file do
      not have per-ag contexts so still need to go the long way. It's
      been renamed to xfs_verify_agno_agbno() to indicate it takes both
      an agno and an agbno, to differentiate it from the new function.
      
      Future commits will make similar changes for other per-ag geometry
      validation functions.
      
      Further:
      
      $ size --totals fs/xfs/built-in.a
      	   text    data     bss     dec     hex filename
      before	1483006	 329588	    572	1813166	 1baaae	(TOTALS)
      after	1482185	 329588	    572	1812345	 1ba779	(TOTALS)
      
      This rework reduces the binary size by ~820 bytes, indicating
      that much less work is being done to bounds check the agbno values
      against the per-ag geometry information.
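
      A sketch of the inlined check this enables, assuming the perag
      caches the AG block count and the first valid block (field names
      illustrative):

        static inline bool
        xfs_verify_agbno(struct xfs_perag *pag, xfs_agblock_t agbno)
        {
                if (agbno >= pag->block_count)  /* beyond end of AG */
                        return false;
                if (agbno <= pag->min_block)    /* inside AG headers */
                        return false;
                return true;
        }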
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: pass perag to xfs_read_agi · 61021deb
      Committed by Dave Chinner
      We have the perag in most places we call xfs_read_agi, so pass the
      perag instead of a mount/agno pair.
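
      The shape of the interface change, as a sketch (exact prototypes
      may differ):

        /* before: mount/agno pair, perag looked up internally */
        int xfs_read_agi(struct xfs_mount *mp, struct xfs_trans *tp,
                         xfs_agnumber_t agno, struct xfs_buf **agibpp);

        /* after: callers pass the perag they already hold */
        int xfs_read_agi(struct xfs_perag *pag, struct xfs_trans *tp,
                         struct xfs_buf **agibpp);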
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
  3. 27 May 2022, 2 commits
  4. 22 May 2022, 1 commit
  5. 04 May 2022, 1 commit
    • xfs: Set up infrastructure for log attribute replay · fd920008
      Committed by Allison Henderson
      Currently attributes are modified directly across one or more
      transactions. But they are not logged or replayed in the event of an
      error. The goal of log attr replay is to enable logging and replaying
      of attribute operations using the existing delayed operations
      infrastructure.  This will later enable the attributes to become part of
      larger multi part operations that also must first be recorded to the
      log.  This is mostly of interest in the scheme of parent pointers which
      would need to maintain an attribute containing parent inode information
      any time an inode is moved, created, or removed.  Parent pointers would
      then be of interest to any feature that would need to quickly derive an
      inode path from the mount point. Online scrub, nfs lookups and fs grow
      or shrink operations are all features that could take advantage of this.
      
      This patch adds two new log item types for setting or removing
      attributes as deferred operations.  The xfs_attri_log_item will log an
      intent to set or remove an attribute.  The corresponding
      xfs_attrd_log_item holds a reference to the xfs_attri_log_item and is
      freed once the transaction is done.  Both log items use a generic
      xfs_attr_log_format structure that contains the attribute name, value,
      flags, inode, and an op_flag that indicates whether the operation is a
      set or a remove.
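
      As a hedged sketch of the intent log format described (field names
      and layout are illustrative, modelled only on the description
      above):

        struct xfs_attri_log_format {
                uint16_t        alfi_type;      /* log item type */
                uint16_t        alfi_size;      /* item size */
                uint32_t        __pad;
                uint64_t        alfi_id;        /* intent identifier */
                uint64_t        alfi_ino;       /* inode being modified */
                uint32_t        alfi_op_flags;  /* set or remove */
                uint32_t        alfi_name_len;  /* attr name length */
                uint32_t        alfi_value_len; /* attr value length */
                uint32_t        alfi_attr_flags;/* attr fork flags */
        };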
      
      [dchinner: added extra little bits needed for intent whiteouts]
      Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  6. 30 March 2022, 2 commits
    • xfs: log shutdown triggers should only shut down the log · b5f17bec
      Committed by Dave Chinner
      We've got a mess on our hands.
      
      1. xfs_trans_commit() cannot cancel transactions because the mount is
      shut down - that causes dirty, aborted, unlogged log items to sit
      unpinned in memory and potentially get written to disk before the
      log is shut down. Hence xfs_trans_commit() can only abort
      transactions when xlog_is_shutdown() is true.
      
      2. xfs_force_shutdown() is used in places to cause the current
      modification to be aborted via xfs_trans_commit() because it may be
      impractical or impossible to cancel the transaction directly, and
      hence xfs_trans_commit() must cancel transactions when
      xfs_is_shutdown() is true in this situation. But we can't do that
      because of #1.
      
      3. Log IO errors cause log shutdowns by calling xfs_force_shutdown()
      to shut down the mount and then the log from log IO completion.
      
      4. xfs_force_shutdown() can result in a log force being issued,
      which has to wait for log IO completion before it will mark the log
      as shut down. If #3 races with some other shutdown trigger that runs
      a log force, we rely on xfs_force_shutdown() silently ignoring #3
      and avoiding shutting down the log until the failed log force
      completes.
      
      5. To ensure #2 always works, we have to ensure that
      xfs_force_shutdown() does not return until the log is shut down.
      But in the case of #4, this will result in a deadlock because log
      IO completion will block waiting for a log force to complete,
      which is blocked waiting for log IO to complete....
      
      So the very first thing we have to do here to untangle this mess is
      dissociate log shutdown triggers from mount shutdowns. We already
      have xlog_force_shutdown(), which will atomically transition the
      log to a shutdown state. Due to internal asserts it cannot be
      called multiple times, but that was done simply because the only
      place that could call it was xfs_do_force_shutdown() (i.e. the
      mount shutdown!) and that could only call it once and once only.
      So the first thing we do is remove the asserts.
      
      We then convert all the internal log shutdown triggers to call
      xlog_force_shutdown() directly instead of xfs_force_shutdown().
      This allows the log shutdown triggers to shut down the log without
      needing to care about mount based shutdown constraints. This means
      we shut down the log independently of the mount, and the mount may
      not notice this until its next attempt to read or modify metadata.
      At that point (e.g. xfs_trans_commit()) it will see that the log is
      shutdown, error out and shut down the mount.
      
      To ensure that all the unmount behaviours and asserts track
      correctly as a result of a log shutdown, propagate the shutdown up
      to the mount if it is not already set. This keeps the mount and log
      state in sync, and saves a huge amount of hassle where code fails
      because of a log shutdown but only checks for mount shutdowns and
      hence ends up doing the wrong thing. Cleaning up that mess is
      an exercise for another day.
      
      This enables us to address the other problems noted above in
      followup patches.
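
      A hedged sketch of the resulting call pattern (the shutdown flag
      and the exact propagation site are illustrative):

        /* log-internal trigger: shut down only the log */
        xlog_force_shutdown(log, SHUTDOWN_LOG_IO_ERROR);

        /* later, when the mount trips over the dead log: */
        if (xlog_is_shutdown(log) && !xfs_is_shutdown(mp))
                xfs_force_shutdown(mp, SHUTDOWN_LOG_IO_ERROR);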
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: shutdown in intent recovery has non-intent items in the AIL · ab9c81ef
      Committed by Dave Chinner
      generic/388 triggered a failure in RUI recovery due to a corrupted
      btree record and the system then locked up hard due to a subsequent
      assert failure while holding a spinlock cancelling intents:
      
       XFS (pmem1): Corruption of in-memory data (0x8) detected at xfs_do_force_shutdown+0x1a/0x20 (fs/xfs/xfs_trans.c:964).  Shutting down filesystem.
       XFS (pmem1): Please unmount the filesystem and rectify the problem(s)
       XFS: Assertion failed: !xlog_item_is_intent(lip), file: fs/xfs/xfs_log_recover.c, line: 2632
       Call Trace:
        <TASK>
        xlog_recover_cancel_intents.isra.0+0xd1/0x120
        xlog_recover_finish+0xb9/0x110
        xfs_log_mount_finish+0x15a/0x1e0
        xfs_mountfs+0x540/0x910
        xfs_fs_fill_super+0x476/0x830
        get_tree_bdev+0x171/0x270
        ? xfs_init_fs_context+0x1e0/0x1e0
        xfs_fs_get_tree+0x15/0x20
        vfs_get_tree+0x24/0xc0
        path_mount+0x304/0xba0
        ? putname+0x55/0x60
        __x64_sys_mount+0x108/0x140
        do_syscall_64+0x35/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Essentially, there's dirty metadata in the AIL from intent recovery
      transactions, so when we go to cancel the remaining intents we assume
      that all objects after the first non-intent log item in the AIL are
      not intents.
      
      This is not true. Intent recovery can log new intents to continue
      the operations the original intent could not complete in a single
      transaction. The new intents are committed before they are deferred,
      which means if the CIL commits in the background they will get
      inserted into the AIL at the head.
      
      Hence if we shut down the filesystem while processing intent
      recovery, the AIL may have new intents active at the current head.
      Hence this check:
      
                      /*
                       * We're done when we see something other than an intent.
                       * There should be no intents left in the AIL now.
                       */
                      if (!xlog_item_is_intent(lip)) {
      #ifdef DEBUG
                              for (; lip; lip = xfs_trans_ail_cursor_next(ailp, &cur))
                                      ASSERT(!xlog_item_is_intent(lip));
      #endif
                              break;
                      }
      
      in both xlog_recover_process_intents() and
      xlog_recover_cancel_intents() is simply not valid. It was valid
      back when we only had EFI/EFD intents and didn't chain intents, but
      it hasn't been valid ever since intent recovery could create and
      commit new intents.
      
      Given that crashing the mount task like this pretty much prevents
      diagnosing what went wrong that led to the initial failure that
      triggered intent cancellation, just remove the checks altogether.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  7. 07 January 2022, 1 commit
  8. 22 December 2021, 1 commit
    • xfs: only run COW extent recovery when there are no live extents · 7993f1a4
      Committed by Darrick J. Wong
      As part of multiple customer escalations due to file data corruption
      after copy on write operations, I wrote some fstests that use fsstress
      to hammer on COW to shake things loose.  Regrettably, I caught some
      filesystem shutdowns due to incorrect rmap operations with the following
      loop:
      
      mount <filesystem>				# (0)
      fsstress <run only readonly ops> &		# (1)
      while true; do
      	fsstress <run all ops>
      	mount -o remount,ro			# (2)
      	fsstress <run only readonly ops>
      	mount -o remount,rw			# (3)
      done
      
      When (2) happens, notice that (1) is still running.  xfs_remount_ro will
      call xfs_blockgc_stop to walk the inode cache to free all the COW
      extents, but the blockgc mechanism races with (1)'s reader threads to
      take IOLOCKs and loses, which means that it doesn't clean them all out.
      Call such a file (A).
      
      When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
      walks the ondisk refcount btree and frees any COW extent that it finds.
      This function does not check the inode cache, which means that the
      incore COW fork of inode (A) is now inconsistent with the ondisk
      metadata.  If one of those former COW extents is allocated and mapped
      into another file (B) and someone triggers a COW to the stale
      reservation in (A), A's dirty data will be written into (B) and, once
      that's done, those blocks will be transferred to (A)'s data fork
      without bumping the refcount.
      
      The results are catastrophic -- file (B) and the refcount btree are now
      corrupt.  In the first patch, we fixed the race condition in (2) so that
      (A) will always flush the COW fork.  In this second patch, we move the
      _recover_cow call to the initial mount call in (0) for safety.
      
      As mentioned previously, xfs_reflink_recover_cow walks the refcount
      btree looking for COW staging extents, and frees them.  This was
      intended to be run at mount time (when we know there are no live inodes)
      to clean up any leftover staging events that may have been left behind
      during an unclean shutdown.  As a time "optimization" for readonly
      mounts, we deferred this to the ro->rw transition, not realizing that
      any failure to clean all COW forks during a rw->ro transition would
      result in catastrophic corruption.
      
      Therefore, remove this optimization and only run the recovery routine
      when we're guaranteed not to have any COW staging extents anywhere,
      which means we always run this at mount time.  While we're at it, move
      the callsite to xfs_log_mount_finish because any refcount btree
      expansion (however unlikely given that we're removing records from the
      right side of the index) must be fed by a per-AG reservation, which
      doesn't exist in its current location.
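
      A sketch of the new placement (error handling abbreviated):

        /* in xfs_log_mount_finish(), per the description above */
        error = xfs_reflink_recover_cow(mp);
        if (error)
                xfs_alert(mp,
                          "failed to recover leftover CoW staging extents");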
      
      Fixes: 174edb0e ("xfs: store in-progress CoW allocations in the refcount btree")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
  9. 15 October 2021, 1 commit
    • xfs: port the defer ops capture and continue to resource capture · 512edfac
      Committed by Darrick J. Wong
      When log recovery tries to recover a transaction that had log intent
      items attached to it, it has to save certain parts of the transaction
      state (reservation, dfops chain, inodes with no automatic unlock) so
      that it can finish single-stepping the recovered transactions before
      finishing the chains.
      
      This is done with the xfs_defer_ops_capture and xfs_defer_ops_continue
      functions.  Right now they open-code this functionality, so let's port
      this to the formalized resource capture structure that we introduced in
      the previous patch.  This enables us to hold up to two inodes and two
      buffers during log recovery, the same way we do for regular runtime.
      
      With this patch applied, we'll be ready to support atomic extent swap
      which holds two inodes; and logged xattrs which holds one inode and one
      xattr leaf buffer.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
  10. 20 August 2021, 4 commits
    • xfs: introduce xfs_sb_is_v5 helper · d6837c1a
      Committed by Dave Chinner
      Rather than open coding XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5
      checks everywhere, add a simple wrapper to encapsulate this and make
      the code easier to read.
      
      This allows us to remove the xfs_sb_version_has_v3inode() wrapper
      which is only used in xfs_format.h now and is just a version number
      check.
      
      There are a couple of places where we should be checking the mount
      feature bits rather than the superblock version (e.g. remount), so
      those are converted to use xfs_has_crc(mp) instead.
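
      The helper is essentially a one-liner wrapping the open-coded
      check quoted above:

        static inline bool xfs_sb_is_v5(struct xfs_sb *sbp)
        {
                return XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5;
        }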
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: convert remaining mount flags to state flags · 2e973b2c
      Committed by Dave Chinner
      The remaining mount flags kept in m_flags are actually runtime state
      flags. These change dynamically, so they really should be updated
      atomically so we don't potentially lose an update due to racing
      modifications.
      
      Convert these remaining flags to be stored in m_opstate and use
      atomic bitops to set and clear the flags. This also adds a couple of
      simple wrappers for common state checks - read only and shutdown.
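
      A sketch of the pattern (bit numbers are illustrative; m_opstate
      is an unsigned long so the standard atomic bitops apply):

        #define XFS_OPSTATE_READONLY    0
        #define XFS_OPSTATE_SHUTDOWN    1

        static inline bool xfs_is_readonly(struct xfs_mount *mp)
        {
                return test_bit(XFS_OPSTATE_READONLY, &mp->m_opstate);
        }

        /* state changes use atomic bitops, e.g.
         * set_bit(XFS_OPSTATE_SHUTDOWN, &mp->m_opstate); */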
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: replace xfs_sb_version checks with feature flag checks · 38c26bfd
      Committed by Dave Chinner
      Convert the xfs_sb_version_hasfoo() checks to checks against
      mp->m_features. Checks of the superblock itself during disk
      operations (e.g. in the read/write verifiers and the to/from disk
      formatters) are not converted - they operate purely on the
      superblock state. Everything else should use the mount features.
      
      Large parts of this conversion were done with sed with commands like
      this:
      
      for f in `git grep -l xfs_sb_version_has fs/xfs/*.c`; do
      	sed -i -e 's/xfs_sb_version_has\(.*\)(&\(.*\)->m_sb)/xfs_has_\1(\2)/' $f
      done
      
      With manual cleanups for things like "xfs_has_extflgbit" and other
      little inconsistencies in naming.
      
      The result is a lot less typing to check features and an XFS binary
      size reduced by a bit over 3kB:

      $ size -t fs/xfs/built-in.a
      	text	   data	    bss	    dec	    hex	filename
      before	1130866  311352     484 1442702  16038e (TOTALS)
      after	1127727  311352     484 1439563  15f74b (TOTALS)
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: reflect sb features in xfs_mount · a1d86e8d
      Committed by Dave Chinner
      Currently, on-disk feature checks require decoding the superblock
      fields and so can be non-trivial. We have almost 400 individual
      feature checks in the XFS code, so this is a significant amount of
      code. To reduce runtime check overhead, pre-process all the version
      flags into a features field in the xfs_mount at mount time so we
      can convert all the feature checks to a simple flag check.
      
      There is also a need to convert the dynamic feature flags to update
      the m_features field. This is required for attr, attr2 and quota
      features. New xfs_mount based wrappers are added for this.
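
      Sketch of the mechanism: decode the version bits once at mount
      time, then every check is a flag test (the bit value and the
      xfs_set_features() name are illustrative):

        #define XFS_FEAT_CRC    (1ULL << 0)     /* illustrative bit */

        static inline bool xfs_has_crc(struct xfs_mount *mp)
        {
                return mp->m_features & XFS_FEAT_CRC;
        }

        static void xfs_set_features(struct xfs_mount *mp)
        {
                if (xfs_sb_is_v5(&mp->m_sb))
                        mp->m_features |= XFS_FEAT_CRC;
                /* ...and so on for each on-disk feature bit */
        }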
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  11. 17 August 2021, 3 commits
  12. 10 August 2021, 5 commits
    • xfs: refactor xfs_iget calls from log intent recovery · 4bc61983
      Committed by Darrick J. Wong
      Hoist the code from xfs_bui_item_recover that igets an inode and marks
      it as being part of log intent recovery.  The next patch will want a
      common function.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
    • xfs: allow setting and clearing of log incompat feature flags · 908ce71e
      Committed by Darrick J. Wong
      Log incompat feature flags in the superblock exist for one purpose: to
      protect the contents of a dirty log from replay on a kernel that isn't
      prepared to handle those dirty contents.  This means that they can be
      cleared if (a) we know the log is clean and (b) we know that there
      aren't any other threads in the system that might be setting or relying
      upon a log incompat flag.
      
      Therefore, clear the log incompat flags when we've finished recovering
      the log, when we're unmounting cleanly, remounting read-only, or
      freezing; and provide a function so that subsequent patches can start
      using this.
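
      A sketch of the clearing rule (helper names are assumptions):

        /* only safe with a clean, quiesced log */
        if (xfs_sb_has_incompat_log_feature(&mp->m_sb,
                                XFS_SB_FEAT_INCOMPAT_LOG_ALL))
                xfs_clear_incompat_log_features(mp);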
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
    • xfs: replace kmem_alloc_large() with kvmalloc() · d634525d
      Committed by Dave Chinner
      There is no reason for this wrapper to exist anymore. All the
      places that use KM_NOFS allocation are within transaction contexts
      and hence covered by memalloc_nofs_save/restore contexts. Hence we
      don't need any special handling of vmalloc for large IOs anymore,
      so special casing this code isn't necessary.
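
      The replacement pattern is just the stock kernel API (the wrapper
      function name here is hypothetical):

        static void *xfs_alloc_large(size_t size)
        {
                /* was kmem_alloc_large(size, KM_NOFS); NOFS now comes
                 * from the transaction's memalloc_nofs_save() scope */
                return kvmalloc(size, GFP_KERNEL);
        }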
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: remove kmem_alloc_io() · 98fe2c3c
      Committed by Dave Chinner
      Since commit 59bb4798 ("mm, sl[aou]b: guarantee natural alignment
      for kmalloc(power-of-two)"), the core slab code guarantees slab
      alignment in all situations sufficient for IO purposes (i.e. a
      minimum of 512 byte alignment for >= 512 byte sized heap
      allocations), so we no longer need the workaround in the XFS code
      to provide this guarantee.
      
      Replace the use of kmem_alloc_io() with kmem_alloc() or
      kmem_alloc_large() appropriately, and remove the kmem_alloc_io()
      interface altogether.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • mm: Add kvrealloc() · de2860f4
      Committed by Dave Chinner
      During log recovery of an XFS filesystem with 64kB directory
      buffers, rebuilding a buffer split across two log records results
      in a memory allocation warning from krealloc like this:
      
      xfs filesystem being mounted at /mnt/scratch supports timestamps until 2038 (0x7fffffff)
      XFS (dm-0): Unmounting Filesystem
      XFS (dm-0): Mounting V5 Filesystem
      XFS (dm-0): Starting recovery (logdev: internal)
      ------------[ cut here ]------------
      WARNING: CPU: 5 PID: 3435170 at mm/page_alloc.c:3539 get_page_from_freelist+0xdee/0xe40
      .....
      RIP: 0010:get_page_from_freelist+0xdee/0xe40
      Call Trace:
       ? complete+0x3f/0x50
       __alloc_pages+0x16f/0x300
       alloc_pages+0x87/0x110
       kmalloc_order+0x2c/0x90
       kmalloc_order_trace+0x1d/0x90
       __kmalloc_track_caller+0x215/0x270
       ? xlog_recover_add_to_cont_trans+0x63/0x1f0
       krealloc+0x54/0xb0
       xlog_recover_add_to_cont_trans+0x63/0x1f0
       xlog_recovery_process_trans+0xc1/0xd0
       xlog_recover_process_ophdr+0x86/0x130
       xlog_recover_process_data+0x9f/0x160
       xlog_recover_process+0xa2/0x120
       xlog_do_recovery_pass+0x40b/0x7d0
       ? __irq_work_queue_local+0x4f/0x60
       ? irq_work_queue+0x3a/0x50
       xlog_do_log_recovery+0x70/0x150
       xlog_do_recover+0x38/0x1d0
       xlog_recover+0xd8/0x170
       xfs_log_mount+0x181/0x300
       xfs_mountfs+0x4a1/0x9b0
       xfs_fs_fill_super+0x3c0/0x7b0
       get_tree_bdev+0x171/0x270
       ? suffix_kstrtoint.constprop.0+0xf0/0xf0
       xfs_fs_get_tree+0x15/0x20
       vfs_get_tree+0x24/0xc0
       path_mount+0x2f5/0xaf0
       __x64_sys_mount+0x108/0x140
       do_syscall_64+0x3a/0x70
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Essentially, we are taking a multi-order allocation from
      kmem_alloc() (which has an open-coded no-fail, no-warn loop) and
      then reallocating it out to 64kB using krealloc(__GFP_NOFAIL), and
      that is what triggers the above warning.

      This is a regression caused by converting this code from an
      open-coded no-fail/no-warn reallocation loop to using __GFP_NOFAIL.

      What we actually need here is kvrealloc(), so that if contiguous
      page allocation fails we fall back to vmalloc() and we don't get
      nasty warnings happening in XFS.
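
      A minimal sketch of the kvrealloc() semantics being added (the
      real helper may handle more cases; treat this as illustrative):

        void *kvrealloc(const void *p, size_t oldsize, size_t newsize,
                        gfp_t flags)
        {
                void *newp;

                if (oldsize >= newsize)
                        return (void *)p;       /* still big enough */
                newp = kvmalloc(newsize, flags);
                if (!newp)
                        return NULL;
                memcpy(newp, p, oldsize);
                kvfree(p);      /* frees kmalloc or vmalloc memory */
                return newp;
        }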
      
      Fixes: 771915c4 ("xfs: remove kmem_realloc()")
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  13. 07 August 2021, 1 commit
    • xfs: per-cpu deferred inode inactivation queues · ab23a776
      Committed by Dave Chinner
      Move inode inactivation to background work contexts so that it no
      longer runs in the context that releases the final reference to an
      inode. This will allow process work that ends up blocking on
      inactivation to continue doing work while the filesystem processes
      the inactivation in the background.
      
      A typical demonstration of this is unlinking an inode with lots of
      extents. The extents are removed during inactivation, so this blocks
      the process that unlinked the inode from the directory structure. By
      moving the inactivation to the background process, the userspace
      application can keep working (e.g. unlinking the next inode in the
      directory) while the inactivation work on the previous inode is
      done by a different CPU.
      
      The implementation of the queue is relatively simple. We use a
      per-cpu lockless linked list (llist) to queue inodes for
      inactivation without requiring serialisation mechanisms, and a work
      item to allow the queue to be processed by a CPU bound worker
      thread. We also keep a count of the queue depth so that we can
      trigger work after a number of deferred inactivations have been
      queued.
      
      The use of a bound workqueue with a single work depth allows the
      workqueue to run one work item per CPU. We queue the work item on
      the CPU we are currently running on, and so this essentially gives
      us affine per-cpu worker threads for the per-cpu queues. This
      maintains the effective CPU affinity that occurs within XFS at the
      AG level due to all objects in a directory being local to an AG.
      Hence inactivation work tends to run on the same CPU that last
      accessed all the objects that inactivation accesses and this
      maintains hot CPU caches for unlink workloads.
      
      A depth of 32 inodes was chosen to match the number of inodes in an
      inode cluster buffer. This hopefully allows sequential
      allocation/unlink behaviours to defer inactivation of all the
      inodes in a single cluster buffer at a time, further helping
      maintain hot CPU and buffer cache accesses while running
      inactivations.
      
      A hard per-cpu queue throttle of 256 inodes has been set to avoid
      runaway queueing when inodes that take a long time to inactivate are
      being processed. For example, when unlinking inodes with large
      numbers of extents that can take a lot of processing to free.
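
      A sketch of the queueing scheme described (struct layout and names
      are illustrative; the real code also handles throttling and CPU
      hotplug):

        struct xfs_inodegc {
                struct llist_head       list;   /* lockless queue */
                struct work_struct      work;   /* bound worker */
                unsigned int            items;  /* queue depth */
        };

        static void xfs_inodegc_queue(struct xfs_inode *ip)
        {
                struct xfs_mount        *mp = ip->i_mount;
                struct xfs_inodegc      *gc;

                gc = get_cpu_ptr(mp->m_inodegc);
                llist_add(&ip->i_gclist, &gc->list);

                /* kick the worker at a cluster buffer's worth */
                if (++gc->items >= 32)
                        queue_work(mp->m_inodegc_wq, &gc->work);
                put_cpu_ptr(mp->m_inodegc);
        }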
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      [djwong: tweak comments and tracepoints, convert opflags to state bits]
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  14. 22 June 2021, 1 commit
    • xfs: force the log offline when log intent item recovery fails · 4e6b8270
      Committed by Darrick J. Wong
      If any part of log intent item recovery fails, we should shut down the
      log immediately to stop the log from writing a clean unmount record to
      disk, because the metadata is not consistent.  The inability to cancel a
      dirty transaction catches most of these cases, but there are a few
      things that have slipped through the cracks, such as ENOSPC from a
      transaction allocation, or runtime errors that result in cancellation of
      a non-dirty transaction.
      
      This solves some weird behaviors reported by customers where a system
      goes down, the first mount fails, the second succeeds, but then the fs
      goes down later because of inconsistent metadata.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
  15. 02 June 2021, 2 commits
  16. 08 April 2021, 2 commits
  17. 26 March 2021, 1 commit
  18. 17 December 2020, 1 commit
  19. 10 December 2020, 1 commit
  20. 22 October 2020, 1 commit
    • xfs: cancel intents immediately if process_intents fails · 2e76f188
      Committed by Darrick J. Wong
      If processing recovered log intent items fails, we need to cancel all
      the unprocessed recovered items immediately so that a subsequent AIL
      push in the bail out path won't get wedged on the pinned intent items
      that didn't get processed.
      
      This can happen if the log contains (1) an intent that gets and releases
      an inode, (2) an intent that cannot be recovered successfully, and (3)
      some third intent item.  When recovery of (2) fails, we leave (3) pinned
      in memory.  Inode reclamation is called in the error-out path of
      xfs_mountfs before xfs_log_cancel_mount.  Reclamation calls
      xfs_ail_push_all_sync, which gets stuck waiting for (3).
      
      Therefore, call xlog_recover_cancel_intents if _process_intents fails.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
  21. 07 October 2020, 3 commits
    • xfs: fix an incore inode UAF in xfs_bui_recover · ff4ab5e0
      Committed by Darrick J. Wong
      In xfs_bui_item_recover, there exists a use-after-free bug with regards
      to the inode that is involved in the bmap replay operation.  If the
      mapping operation does not complete, we call xfs_bmap_unmap_extent to
      create a deferred op to finish the unmapping work, and we retain a
      pointer to the incore inode.
      
      Unfortunately, the very next thing we do is commit the transaction and
      drop the inode.  If reclaim tears down the inode before we try to finish
      the defer ops, we dereference garbage and blow up.  Therefore, create a
      way to join inodes to the defer ops freezer so that we can maintain the
      xfs_inode reference until we're done with the inode.
      
      Note: This imposes the requirement that there be enough memory to keep
      every incore inode in memory throughout recovery.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: xfs_defer_capture should absorb remaining transaction reservation · 929b92f6
      Committed by Darrick J. Wong
      When xfs_defer_capture extracts the deferred ops and transaction state
      from a transaction, it should record the transaction reservation type
      from the old transaction so that when we continue the dfops chain, we
      still use the same reservation parameters.
      
      Doing this means that the log item recovery functions get to determine
      the transaction reservation instead of abusing tr_itruncate in yet
      another part of xfs.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: xfs_defer_capture should absorb remaining block reservations · 4f9a60c4
      Committed by Darrick J. Wong
      When xfs_defer_capture extracts the deferred ops and transaction state
      from a transaction, it should record the remaining block reservations so
      that when we continue the dfops chain, we can reserve the same number of
      blocks to use.  We capture the reservations for both data and realtime
      volumes.
      
      This adds the requirement that every log intent item recovery function
      must be careful to reserve enough blocks to handle both itself and all
      defer ops that it can queue.  On the other hand, this enables us to do
      away with the handwaving block estimation nonsense that was going on in
      xlog_finish_defer_ops.
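
      Sketch of what capture records (field names are illustrative): the
      unused part of the old transaction's block reservations, for both
      the data and realtime devices.

        /* inside the capture step, per the description */
        dfc->dfc_blkres = tp->t_blk_res - tp->t_blk_res_used;
        dfc->dfc_rtxres = tp->t_rtx_res - tp->t_rtx_res_used;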
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>