1. 15 Oct 2021, 1 commit
    • xfs: port the defer ops capture and continue to resource capture · 512edfac
      Authored by Darrick J. Wong
      When log recovery tries to recover a transaction that had log intent
      items attached to it, it has to save certain parts of the transaction
      state (reservation, dfops chain, inodes with no automatic unlock) so
      that it can finish single-stepping the recovered transactions before
      finishing the chains.
      
      This is done with the xfs_defer_ops_capture and xfs_defer_ops_continue
      functions.  Right now they open-code this functionality, so let's port
      this to the formalized resource capture structure that we introduced in
      the previous patch.  This enables us to hold up to two inodes and two
      buffers during log recovery, the same way we do for regular runtime.
      
      With this patch applied, we'll be ready to support atomic extent
      swap, which holds two inodes, and logged xattrs, which hold one
      inode and one xattr leaf buffer.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
  2. 20 Aug 2021, 4 commits
    • xfs: introduce xfs_sb_is_v5 helper · d6837c1a
      Authored by Dave Chinner
      Rather than open coding XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5
      checks everywhere, add a simple wrapper to encapsulate this and make
      the code easier to read.
      
      This allows us to remove the xfs_sb_version_has_v3inode() wrapper
      which is only used in xfs_format.h now and is just a version number
      check.
      
      There are a couple of places where we should be checking the mount
      feature bits rather than the superblock version (e.g. remount), so
      those are converted to use xfs_has_crc(mp) instead.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: convert remaining mount flags to state flags · 2e973b2c
      Authored by Dave Chinner
      The remaining mount flags kept in m_flags are actually runtime state
      flags. These change dynamically, so they really should be updated
      atomically so we don't potentially lose an update due to racing
      modifications.
      
      Convert these remaining flags to be stored in m_opstate and use
      atomic bitops to set and clear the flags. This also adds a couple of
      simple wrappers for common state checks - read only and shutdown.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: replace xfs_sb_version checks with feature flag checks · 38c26bfd
      Authored by Dave Chinner
      Convert the xfs_sb_version_hasfoo() to checks against
      mp->m_features. Checks of the superblock itself during disk
      operations (e.g. in the read/write verifiers and the to/from disk
      formatters) are not converted - they operate purely on the
      superblock state. Everything else should use the mount features.
      
      Large parts of this conversion were done with sed with commands like
      this:
      
      for f in `git grep -l xfs_sb_version_has fs/xfs/*.c`; do
      	sed -i -e 's/xfs_sb_version_has\(.*\)(&\(.*\)->m_sb)/xfs_has_\1(\2)/' $f
      done
      
      With manual cleanups for things like "xfs_has_extflgbit" and other
      little inconsistencies in naming.
      
      The result is a lot less typing to check features and an XFS binary
      size reduced by a bit over 3kB:
      
      $ size -t fs/xfs/built-in.a
      	text	   data	    bss	    dec	    hex	filename
      before	1130866  311352     484 1442702  16038e (TOTALS)
      after	1127727  311352     484 1439563  15f74b (TOTALS)
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: reflect sb features in xfs_mount · a1d86e8d
      Authored by Dave Chinner
      Currently, on-disk feature checks require decoding superblock
      fields and so can be non-trivial. We have almost 400 individual
      feature checks in the XFS code, so this is a significant
      amount of code. To reduce runtime check overhead, pre-process all
      the version flags into a features field in the xfs_mount at mount
      time so we can convert all the feature checks to a simple flag
      check.
      
      There is also a need to convert the dynamic feature flags to update
      the m_features field. This is required for attr, attr2 and quota
      features. New xfs_mount based wrappers are added for this.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  3. 17 Aug 2021, 3 commits
  4. 10 Aug 2021, 5 commits
    • xfs: refactor xfs_iget calls from log intent recovery · 4bc61983
      Authored by Darrick J. Wong
      Hoist the code from xfs_bui_item_recover that igets an inode and marks
      it as being part of log intent recovery.  The next patch will want a
      common function.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
    • xfs: allow setting and clearing of log incompat feature flags · 908ce71e
      Authored by Darrick J. Wong
      Log incompat feature flags in the superblock exist for one purpose: to
      protect the contents of a dirty log from replay on a kernel that isn't
      prepared to handle those dirty contents.  This means that they can be
      cleared if (a) we know the log is clean and (b) we know that there
      aren't any other threads in the system that might be setting or relying
      upon a log incompat flag.
      
      Therefore, clear the log incompat flags when we've finished recovering
      the log, when we're unmounting cleanly, remounting read-only, or
      freezing; and provide a function so that subsequent patches can start
      using this.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
    • xfs: replace kmem_alloc_large() with kvmalloc() · d634525d
      Authored by Dave Chinner
      There is no reason for this wrapper to exist anymore. All the places
      that use KM_NOFS allocation are within transaction contexts and
      hence covered by memalloc_nofs_save/restore contexts. Hence we no
      longer need any special handling of vmalloc for large IOs, and
      special-casing this code isn't necessary.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: remove kmem_alloc_io() · 98fe2c3c
      Authored by Dave Chinner
      Since commit 59bb4798 ("mm, sl[aou]b: guarantee natural alignment
      for kmalloc(power-of-two)"), the core slab code guarantees slab
      alignment sufficient for IO purposes in all situations (i.e. a
      minimum of 512-byte alignment for heap allocations of >= 512
      bytes), so we no longer need the workaround in the XFS code to
      provide this guarantee.
      
      Replace the use of kmem_alloc_io() with kmem_alloc() or
      kmem_alloc_large() appropriately, and remove the kmem_alloc_io()
      interface altogether.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • mm: Add kvrealloc() · de2860f4
      Authored by Dave Chinner
      During log recovery of an XFS filesystem with 64kB directory
      buffers, rebuilding a buffer split across two log records results
      in a memory allocation warning from krealloc like this:
      
      xfs filesystem being mounted at /mnt/scratch supports timestamps until 2038 (0x7fffffff)
      XFS (dm-0): Unmounting Filesystem
      XFS (dm-0): Mounting V5 Filesystem
      XFS (dm-0): Starting recovery (logdev: internal)
      ------------[ cut here ]------------
      WARNING: CPU: 5 PID: 3435170 at mm/page_alloc.c:3539 get_page_from_freelist+0xdee/0xe40
      .....
      RIP: 0010:get_page_from_freelist+0xdee/0xe40
      Call Trace:
       ? complete+0x3f/0x50
       __alloc_pages+0x16f/0x300
       alloc_pages+0x87/0x110
       kmalloc_order+0x2c/0x90
       kmalloc_order_trace+0x1d/0x90
       __kmalloc_track_caller+0x215/0x270
       ? xlog_recover_add_to_cont_trans+0x63/0x1f0
       krealloc+0x54/0xb0
       xlog_recover_add_to_cont_trans+0x63/0x1f0
       xlog_recovery_process_trans+0xc1/0xd0
       xlog_recover_process_ophdr+0x86/0x130
       xlog_recover_process_data+0x9f/0x160
       xlog_recover_process+0xa2/0x120
       xlog_do_recovery_pass+0x40b/0x7d0
       ? __irq_work_queue_local+0x4f/0x60
       ? irq_work_queue+0x3a/0x50
       xlog_do_log_recovery+0x70/0x150
       xlog_do_recover+0x38/0x1d0
       xlog_recover+0xd8/0x170
       xfs_log_mount+0x181/0x300
       xfs_mountfs+0x4a1/0x9b0
       xfs_fs_fill_super+0x3c0/0x7b0
       get_tree_bdev+0x171/0x270
       ? suffix_kstrtoint.constprop.0+0xf0/0xf0
       xfs_fs_get_tree+0x15/0x20
       vfs_get_tree+0x24/0xc0
       path_mount+0x2f5/0xaf0
       __x64_sys_mount+0x108/0x140
       do_syscall_64+0x3a/0x70
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Essentially, we are taking a multi-order allocation from kmem_alloc()
      (which has an open coded no fail, no warn loop) and then
      reallocating it out to 64kB using krealloc(__GFP_NOFAIL) and that is
      then triggering the above warning.
      
      This is a regression caused by converting this code from an open
      coded no fail/no warn reallocation loop to using __GFP_NOFAIL.
      
      What we actually need here is kvrealloc(), so that if contiguous
      page allocation fails we fall back to vmalloc() and we don't
      get nasty warnings happening in XFS.
      
      Fixes: 771915c4 ("xfs: remove kmem_realloc()")
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  5. 07 Aug 2021, 1 commit
    • xfs: per-cpu deferred inode inactivation queues · ab23a776
      Authored by Dave Chinner
      Move inode inactivation to background work contexts so that it no
      longer runs in the context that releases the final reference to an
      inode. This will allow process work that ends up blocking on
      inactivation to continue doing work while the filesystem processes
      the inactivation in the background.
      
      A typical demonstration of this is unlinking an inode with lots of
      extents. The extents are removed during inactivation, so this blocks
      the process that unlinked the inode from the directory structure. By
      moving the inactivation to the background process, the userspace
      application can keep working (e.g. unlinking the next inode in the
      directory) while the inactivation work on the previous inode is
      done by a different CPU.
      
      The implementation of the queue is relatively simple. We use a
      per-cpu lockless linked list (llist) to queue inodes for
      inactivation without requiring serialisation mechanisms, and a work
      item to allow the queue to be processed by a CPU bound worker
      thread. We also keep a count of the queue depth so that we can
      trigger work after a number of deferred inactivations have been
      queued.
      
      The use of a bound workqueue with a single work depth allows the
      workqueue to run one work item per CPU. We queue the work item on
      the CPU we are currently running on, and so this essentially gives
      us affine per-cpu worker threads for the per-cpu queues. This
      maintains the effective CPU affinity that occurs within XFS at the
      AG level due to all objects in a directory being local to an AG.
      Hence inactivation work tends to run on the same CPU that last
      accessed all the objects that inactivation accesses and this
      maintains hot CPU caches for unlink workloads.
      
      A depth of 32 inodes was chosen to match the number of inodes in an
      inode cluster buffer. This hopefully allows sequential
      allocation/unlink behaviours to defer inactivation of all the
      inodes in a single cluster buffer at a time, further helping
      maintain hot CPU and buffer cache accesses while running
      inactivations.
      
      A hard per-cpu queue throttle of 256 inodes has been set to avoid
      runaway queuing when inodes that take a long time to inactivate are
      being processed. For example, when unlinking inodes with large
      numbers of extents that can take a lot of processing to free.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      [djwong: tweak comments and tracepoints, convert opflags to state bits]
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  6. 22 Jun 2021, 1 commit
    • xfs: force the log offline when log intent item recovery fails · 4e6b8270
      Authored by Darrick J. Wong
      If any part of log intent item recovery fails, we should shut down the
      log immediately to stop the log from writing a clean unmount record to
      disk, because the metadata is not consistent.  The inability to cancel a
      dirty transaction catches most of these cases, but there are a few
      things that have slipped through the cracks, such as ENOSPC from a
      transaction allocation, or runtime errors that result in cancellation of
      a non-dirty transaction.
      
      This solves some weird behaviors reported by customers where a system
      goes down, the first mount fails, the second succeeds, but then the fs
      goes down later because of inconsistent metadata.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
  7. 02 Jun 2021, 2 commits
  8. 08 Apr 2021, 2 commits
  9. 26 Mar 2021, 1 commit
  10. 17 Dec 2020, 1 commit
  11. 10 Dec 2020, 1 commit
  12. 22 Oct 2020, 1 commit
    • xfs: cancel intents immediately if process_intents fails · 2e76f188
      Authored by Darrick J. Wong
      If processing recovered log intent items fails, we need to cancel all
      the unprocessed recovered items immediately so that a subsequent AIL
      push in the bail out path won't get wedged on the pinned intent items
      that didn't get processed.
      
      This can happen if the log contains (1) an intent that gets and releases
      an inode, (2) an intent that cannot be recovered successfully, and (3)
      some third intent item.  When recovery of (2) fails, we leave (3) pinned
      in memory.  Inode reclamation is called in the error-out path of
      xfs_mountfs before xfs_log_cancel_mount.  Reclamation calls
      xfs_ail_push_all_sync, which gets stuck waiting for (3).
      
      Therefore, call xlog_recover_cancel_intents if _process_intents fails.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
  13. 07 Oct 2020, 5 commits
    • xfs: fix an incore inode UAF in xfs_bui_recover · ff4ab5e0
      Authored by Darrick J. Wong
      In xfs_bui_item_recover, there exists a use-after-free bug with regards
      to the inode that is involved in the bmap replay operation.  If the
      mapping operation does not complete, we call xfs_bmap_unmap_extent to
      create a deferred op to finish the unmapping work, and we retain a
      pointer to the incore inode.
      
      Unfortunately, the very next thing we do is commit the transaction and
      drop the inode.  If reclaim tears down the inode before we try to finish
      the defer ops, we dereference garbage and blow up.  Therefore, create a
      way to join inodes to the defer ops freezer so that we can maintain the
      xfs_inode reference until we're done with the inode.
      
      Note: This imposes the requirement that there be enough memory to keep
      every incore inode in memory throughout recovery.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: xfs_defer_capture should absorb remaining transaction reservation · 929b92f6
      Authored by Darrick J. Wong
      When xfs_defer_capture extracts the deferred ops and transaction state
      from a transaction, it should record the transaction reservation type
      from the old transaction so that when we continue the dfops chain, we
      still use the same reservation parameters.
      
      Doing this means that the log item recovery functions get to determine
      the transaction reservation instead of abusing tr_itruncate in yet
      another part of xfs.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: xfs_defer_capture should absorb remaining block reservations · 4f9a60c4
      Authored by Darrick J. Wong
      When xfs_defer_capture extracts the deferred ops and transaction state
      from a transaction, it should record the remaining block reservations so
      that when we continue the dfops chain, we can reserve the same number of
      blocks to use.  We capture the reservations for both data and realtime
      volumes.
      
      This adds the requirement that every log intent item recovery function
      must be careful to reserve enough blocks to handle both itself and all
      defer ops that it can queue.  On the other hand, this enables us to do
      away with the handwaving block estimation nonsense that was going on in
      xlog_finish_defer_ops.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
    • xfs: proper replay of deferred ops queued during log recovery · e6fff81e
      Authored by Darrick J. Wong
      When we replay unfinished intent items that have been recovered from the
      log, it's possible that the replay will cause the creation of more
      deferred work items.  As outlined in commit 50995582 ("xfs: log
      recovery should replay deferred ops in order"), later work items have an
      implicit ordering dependency on earlier work items.  Therefore, recovery
      must replay the items (both recovered and created) in the same order
      that they would have been during normal operation.
      
      For log recovery, we enforce this ordering by using an empty transaction
      to collect deferred ops that get created in the process of recovering a
      log intent item to prevent them from being committed before the rest of
      the recovered intent items.  After we finish committing all the
      recovered log items, we allocate a transaction with an enormous block
      reservation, splice our huge list of created deferred ops into that
      transaction, and commit it, thereby finishing all those ops.
      
      This is /really/ hokey -- it's the one place in XFS where we allow
      nested transactions; the splicing of the defer ops list is inelegant
      and has to be done twice per recovery function; and the broken way we
      handle inode pointers and block reservations causes subtle use-after-free
      and allocator problems that will be fixed by this patch and the two
      patches after it.
      
      Therefore, replace the hokey empty transaction with a structure designed
      to capture each chain of deferred ops that are created as part of
      recovering a single unfinished log intent.  Finally, refactor the loop
      that replays those chains to do so using one transaction per chain.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: remove XFS_LI_RECOVERED · 901219bb
      Authored by Darrick J. Wong
      The ->iop_recover method of a log intent item removes the recovered
      intent item from the AIL by logging an intent done item and committing
      the transaction, so it's superfluous to have this flag check.  Nothing
      else uses it, so get rid of the flag entirely.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
  14. 26 Sep 2020, 1 commit
  15. 24 Sep 2020, 1 commit
  16. 23 Sep 2020, 1 commit
  17. 16 Sep 2020, 5 commits
  18. 07 Sep 2020, 1 commit
  19. 05 Aug 2020, 1 commit
  20. 07 Jul 2020, 1 commit
  21. 08 May 2020, 1 commit