1. 02 Jul 2022 (4 commits)
    • xfs: introduce per-cpu CIL tracking structure · af1c2146
      Authored by Dave Chinner
      The CIL push lock is highly contended on larger machines, becoming a
      hard bottleneck at about 700,000 transaction commits/s on >16p
      machines. To address this, start moving the CIL tracking
      infrastructure to utilise per-CPU structures.
      
      We need to track the space used, the amount of log reservation space
      reserved to write the CIL, the log items in the CIL and the busy
      extents that need to be completed by the CIL commit.  This requires
      a couple of per-cpu counters, an unordered per-cpu list and a
      globally ordered per-cpu list.
      
      Create a per-cpu structure to hold these and all the management
      interfaces needed, as well as the hooks to handle hotplug CPUs.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      
      af1c2146
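      As a hedged illustration only, the kind of per-cpu tracking structure and
      hotplug hook described above might look like the sketch below. All names
      (cil_pcp, cil_pcp_dead, etc.) are hypothetical, not the structures that
      were actually merged.

      #include <linux/percpu.h>
      #include <linux/list.h>
      #include <linux/cpuhotplug.h>
      #include <linux/types.h>

      struct cil_pcp {
          int32_t          space_used;     /* bytes added to the CIL from this CPU */
          int32_t          space_reserved; /* log reservation taken on this CPU */
          struct list_head busy_extents;   /* unordered per-cpu busy extent list */
          struct list_head log_items;      /* aggregated into global order at push */
      };

      struct cil {
          struct cil_pcp __percpu *pcp;
      };

      /*
       * Teardown hook: a dying CPU's state must be folded back into the CIL.
       * The callback only receives the CPU number, so the real code needs a
       * global registry to find the CIL(s) to drain from here.
       */
      static int cil_pcp_dead(unsigned int cpu)
      {
          return 0;
      }

      static int cil_pcp_init(struct cil *cil)
      {
          int cpu, ret;

          cil->pcp = alloc_percpu(struct cil_pcp);
          if (!cil->pcp)
              return -ENOMEM;

          for_each_possible_cpu(cpu) {
              struct cil_pcp *p = per_cpu_ptr(cil->pcp, cpu);

              INIT_LIST_HEAD(&p->busy_extents);
              INIT_LIST_HEAD(&p->log_items);
          }

          /* hotplug hook so a dying CPU's state is drained into the CIL */
          ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "xfs/cil:dead",
                                          NULL, cil_pcp_dead);
          return ret < 0 ? ret : 0;
      }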
    • xfs: rework per-iclog header CIL reservation · 31151cc3
      Authored by Dave Chinner
      For every iclog that a CIL push will use up, we need to ensure we
      have space reserved for the iclog header in each iclog. It is
      extremely difficult to do this accurately with a per-cpu counter
      without expensive summing of the counter in every commit. However,
      we know what the maximum CIL size is going to be because of the
      hard space limit we have, and hence we know exactly how many iclogs
      we are going to need to write out the CIL.
      
      We are constrained by the requirement that small transactions only
      have reservation space for a single iclog header built into them.
      At commit time we don't know how much of the current transaction
      reservation is made up of iclog header reservations as calculated by
      xfs_log_calc_unit_res() when the ticket was reserved. As larger
      reservations have multiple header spaces reserved, we can steal
      more than one iclog header reservation at a time, but we only steal
      the exact number needed for the given log vector size delta.
      
      As a result, we don't know exactly when we are going to steal iclog
      header reservations, nor do we know exactly how many we are going to
      need for a given CIL.
      
      To make things simple, start by calculating the worst case number of
      iclog headers a full CIL push will require. Record this into an
      atomic variable in the CIL. Then add a byte counter to the log
      ticket that records exactly how much iclog header space has been
      reserved in this ticket by xfs_log_calc_unit_res(). This tells us
      exactly how much space we can steal from the ticket at transaction
      commit time.
      
      Now, at transaction commit time, we can check if the CIL has a full
      iclog header reservation and, if not, steal the entire reservation
      the current ticket holds for iclog headers. This minimises the
      number of times we need to do atomic operations in the fast path,
      but still guarantees we get all the reservations we need.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      31151cc3
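      A simplified sketch of the "steal at commit time" accounting described
      above; the struct layout and names are illustrative and the bookkeeping
      is reduced to the bare idea:

      #include <linux/atomic.h>

      struct cil_ctx {
          atomic_t iclog_hdrs;        /* worst-case headers still unfunded */
      };

      struct log_ticket {
          int iclog_hdrs;             /* header reservations carried by this ticket */
          int iclog_hdr_bytes;        /* bytes xfs_log_calc_unit_res() added for them */
      };

      /* Returns the bytes stolen from the ticket to fund CIL iclog headers. */
      static int steal_iclog_hdr_space(struct cil_ctx *ctx, struct log_ticket *tic)
      {
          int stolen;

          /* Fast path: the CIL already holds its full header reservation. */
          if (atomic_read(&ctx->iclog_hdrs) <= 0 || !tic->iclog_hdrs)
              return 0;

          /* Steal the entire header reservation this ticket carries. */
          atomic_sub(tic->iclog_hdrs, &ctx->iclog_hdrs);
          stolen = tic->iclog_hdr_bytes;
          tic->iclog_hdrs = 0;
          tic->iclog_hdr_bytes = 0;
          return stolen;
      }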
    • xfs: lift init CIL reservation out of xc_cil_lock · 12380d23
      Authored by Dave Chinner
      The xc_cil_lock is the most highly contended lock in XFS now. To
      start the process of getting rid of it, lift the initial reservation
      of the CIL log space out from under the xc_cil_lock.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      12380d23
    • xfs: use the CIL space used counter for emptiness checks · 88591e7f
      Authored by Dave Chinner
      In the next patches we are going to make the CIL list itself
      per-cpu, and so we cannot use list_empty() to check if the list is
      empty. Replace the list_empty() checks with a flag in the CIL to
      indicate we have committed at least one transaction to the CIL and
      hence the CIL is not empty.
      
      We need this flag to be an atomic so that we can clear it without
      holding any locks in the commit fast path, but we also need to be
      careful to avoid atomic operations in the fast path. Hence we use
      the fact that test_bit() is not an atomic op to first check if the
      flag is set and then run the atomic test_and_clear_bit() operation
      to clear it and steal the initial unit reservation for the CIL
      context checkpoint.
      
      When we are switching to a new context in a push, we place the
      setting of the XLOG_CIL_EMPTY flag under the xc_push_lock. This
      allows all the other places that need to check whether the CIL is
      empty to use test_bit() and still be serialised correctly with the
      CIL context swaps that set the bit.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      88591e7f
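      The fast-path pattern described above, roughly: a plain test_bit() read
      keeps atomics off the common path, and only the first commit into an
      empty CIL pays for the atomic test_and_clear_bit(). This is a sketch,
      not the upstream code:

      #include <linux/bitops.h>

      #define XLOG_CIL_EMPTY    0       /* bit in the CIL flags word */

      static unsigned int maybe_steal_unit_res(unsigned long *cil_flags,
                                               unsigned int unit_res)
      {
          /* Non-atomic check first; almost every commit bails out here. */
          if (test_bit(XLOG_CIL_EMPTY, cil_flags) &&
              test_and_clear_bit(XLOG_CIL_EMPTY, cil_flags))
              return unit_res;        /* first commit funds the checkpoint */
          return 0;
      }

      The flag is set again under the xc_push_lock when the push switches to a
      new CIL context, which is what keeps the other emptiness checks serialised.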
  2. 12 May 2022 (1 commit)
    • xfs: can't use kmem_zalloc() for attribute buffers · 45ff8b47
      Authored by Dave Chinner
      Because heap allocation of 64kB buffers will fail:
      
      ....
       XFS: fs_mark(8414) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
       XFS: fs_mark(8417) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
       XFS: fs_mark(8409) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
       XFS: fs_mark(8428) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
       XFS: fs_mark(8430) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
       XFS: fs_mark(8437) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
       XFS: fs_mark(8433) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
       XFS: fs_mark(8406) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
       XFS: fs_mark(8412) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
       XFS: fs_mark(8432) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
       XFS: fs_mark(8424) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
      ....
      
      I'd use kvmalloc() instead, but....
      
      - 48.19% xfs_attr_create_intent
        - 46.89% xfs_attri_init
           - kvmalloc_node
      	- 46.04% __kmalloc_node
      	   - kmalloc_large_node
      	      - 45.99% __alloc_pages
      		 - 39.39% __alloc_pages_slowpath.constprop.0
      		    - 38.89% __alloc_pages_direct_compact
      		       - 38.71% try_to_compact_pages
      			  - compact_zone_order
      			  - compact_zone
      			     - 21.09% isolate_migratepages_block
      				  10.31% PageHuge
      				  5.82% set_pfnblock_flags_mask
      				  0.86% get_pfnblock_flags_mask
      			     - 4.48% __reset_isolation_suitable
      				  4.44% __reset_isolation_pfn
      			     - 3.56% __pageblock_pfn_to_page
      				  1.33% pfn_to_online_page
      			       2.83% get_pfnblock_flags_mask
      			     - 0.87% migrate_pages
      				  0.86% compaction_alloc
      			       0.84% find_suitable_fallback
      		 - 6.60% get_page_from_freelist
      		      4.99% clear_page_erms
      		    - 1.19% _raw_spin_lock_irqsave
      		       - do_raw_spin_lock
      			    __pv_queued_spin_lock_slowpath
      	- 0.86% __vmalloc_node_range
      	     0.65% __alloc_pages_bulk
      
      .... this is just yet another reminder of how much kvmalloc() sucks.
      So lift xlog_cil_kvmalloc(), rename it to xlog_kvmalloc() and use
      that instead....
      
      We also clean up the attribute name and value lengths as they no
      longer need to be rounded out to sizes compatible with log vectors.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      45ff8b47
  3. 04 May 2022 (2 commits)
    • xfs: intent item whiteouts · 0d227466
      Authored by Dave Chinner
      When we log modifications based on intents, we add both intent
      and intent done items to the modification being made. These get
      written to the log to ensure that the operation is re-run if the
      intent done is not found in the log.
      
      However, for operations that complete wholly within a single
      checkpoint, the change in the checkpoint is atomic and will never
      need replay. In this case, we don't need to actually write the
      intent and intent done items to the journal because log recovery
      will never need to manually restart this modification.
      
      Log recovery currently handles intent/intent done matching by
      inserting the intent into the AIL, then removing it when a matching
      intent done item is found. Hence for all the intent-based operations
      that complete within a checkpoint, we spend all that time parsing
      the intent/intent done items just to cancel them and do nothing with
      them.
      
      Hence it follows that the only time we actually need intents in the
      log is when the modification crosses checkpoint boundaries in the
      log and so may only be partially complete in the journal. Hence if
      we commit an intent done item to the CIL and the intent item is in
      the same checkpoint, we don't actually have to write them to the
      journal because log recovery will always cancel the intents.
      
      We've never really worried about the overhead of logging intents
      unnecessarily like this because the intents we log are generally
      very much smaller than the change being made. e.g. freeing an extent
      involves modifying at least two freespace btree blocks and the AGF,
      so the EFI/EFD overhead is only a small increase in space and
      processing time compared to the overall cost of freeing an extent.
      
      However, delayed attributes change this cost equation dramatically,
      especially for inline attributes. In the case of adding an inline
      attribute, we only log the inode core and attribute fork at present.
      With delayed attributes, we now log the attr intent which includes
      the name and value, the inode core and attr fork, and finally the
      attr intent done item. We increase the number of items we log from 1
      to 3, and the number of log vectors (regions) goes up from 3 to 7.
      Hence we triple the number of objects that the CIL has to process,
      and more than double the number of log vectors that need to be
      written to the journal.
      
      At scale, this means delayed attributes cause a non-pipelined CIL to
      become CPU bound processing all the extra items, resulting in a > 40%
      performance degradation on 16-way file+xattr create workloads.
      Pipelining the CIL (as per 5.15) reduces the performance degradation
      to 20%, but now the limitation is the rate at which the log items
      can be written to the iclogs and iclogs be dispatched for IO and
      completed.
      
      Even log IO completion is slowed down by these intents, because it
      now has to process 3x the number of items in the checkpoint.
      Processing completed intents is especially inefficient here, because
      we first insert the intent into the AIL, then remove it from the AIL
      when the intent done is processed. IOWs, we are also doing expensive
      operations in log IO completion we could completely avoid if we
      didn't log completed intent/intent done pairs.
      
      Enter log item whiteouts.
      
      When an intent done is committed, we can check to see if the
      associated intent is in the same checkpoint as we are currently
      committing the intent done to. If so, we can mark the intent log
      item with a whiteout and immediately free the intent done item
      rather than committing it to the CIL. We can basically skip the
      entire formatting and CIL insertion steps for the intent done item.
      
      However, we cannot remove the intent item from the CIL at this point
      because the unlocked per-cpu CIL item lists do not permit removal
      without holding the CIL context lock exclusively. Transaction commit
      only holds the context lock shared, hence the best we can do is mark
      the intent item with a whiteout so that the CIL push can release it
      rather than writing it to the log.
      
      This means we never write the intent to the log if the intent done
      has also been committed to the same checkpoint, but we'll always
      write the intent if the intent done has not been committed or has
      been committed to a different checkpoint. This will result in
      correct log recovery behaviour in all cases, without the overhead of
      logging unnecessary intents.
      
      This intent whiteout concept is generic - we can apply it to all
      intent/intent done pairs that have a direct 1:1 relationship. The
      way deferred ops iterate and relog intents mean that all intents
      currently have a 1:1 relationship with their done intent, and hence
      we can apply this cancellation to all existing intent/intent done
      implementations.
      
      For delayed attributes with a 16-way 64kB xattr create workload,
      whiteouts reduce the amount of journalled metadata from ~2.5GB/s
      down to ~600MB/s and improve the creation rate from 9000/s to
      14000/s.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      0d227466
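      A sketch of the whiteout decision described above, with hypothetical
      types and a sequence-number comparison standing in for the real
      "same checkpoint" test:

      #include <linux/bitops.h>
      #include <linux/slab.h>
      #include <linux/types.h>

      #define LI_WHITEOUT    0          /* bit in the log item flags */

      struct cil_state {
          u64 current_seq;              /* checkpoint currently being committed to */
      };

      struct log_item {
          unsigned long   flags;
          u64             seq;          /* checkpoint this item was committed to */
          struct log_item *intent;      /* paired intent, for an intent done item */
      };

      /* Returns true if the intent done item was whited out and freed. */
      static bool try_whiteout(struct cil_state *cil, struct log_item *done)
      {
          struct log_item *intent = done->intent;

          /* Intent lives in an earlier checkpoint: both items must be logged. */
          if (!intent || intent->seq != cil->current_seq)
              return false;

          /*
           * Same checkpoint: flag the intent so the CIL push releases it
           * instead of writing it, and never insert the intent done at all.
           */
          set_bit(LI_WHITEOUT, &intent->flags);
          kfree(done);                  /* stand-in for the item release method */
          return true;
      }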
    • xfs: factor and move some code in xfs_log_cil.c · 22b1afc5
      Authored by Dave Chinner
      In preparation for adding support for intent item whiteouts.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      22b1afc5
  4. 21 Apr 2022 (8 commits)
  5. 18 Apr 2022 (1 commit)
  6. 30 Mar 2022 (2 commits)
    • xfs: drop async cache flushes from CIL commits. · 919edbad
      Authored by Dave Chinner
      Jan Kara reported a performance regression in dbench that he
      bisected down to commit bad77c37 ("xfs: CIL checkpoint
      flushes caches unconditionally").
      
      Whilst developing the journal flush/fua optimisations this cache
      flush was part of, it appeared to make a significant difference to
      performance. However, now that this patchset has settled and all the
      correctness issues fixed, there does not appear to be any
      significant performance benefit to asynchronous cache flushes.
      
      In fact, the opposite is true on some storage types and workloads,
      where additional cache flushes that can occur from fsync heavy
      workloads have measurable and significant impact on overall
      throughput.
      
      Local dbench testing shows little difference on dbench runs with
      sync vs async cache flushes on either fast or slow SSD storage, and
      no difference in streaming concurrent async transaction workloads
      like fs-mark.
      
      Fast NVME storage.
      
      From `dbench -t 30`, CIL scale:
      
      clients		async			sync
      		BW	Latency		BW	Latency
      1		 935.18   0.855		 915.64   0.903
      8		2404.51   6.873		2341.77   6.511
      16		3003.42   6.460		2931.57   6.529
      32		3697.23   7.939		3596.28   7.894
      128		7237.43  15.495		7217.74  11.588
      512		5079.24  90.587		5167.08  95.822
      
      fsmark, 32 threads, create w/ 64 byte xattr w/32k logbsize
      
      	create		chown		unlink
      async   1m41s		1m16s		2m03s
      sync	1m40s		1m19s		1m54s
      
      Slower SATA SSD storage:
      
      From `dbench -t 30`, CIL scale:
      
      clients		async			sync
      		BW	Latency		BW	Latency
      1		  78.59  15.792		  83.78  10.729
      8		 367.88  92.067		 404.63  59.943
      16		 564.51  72.524		 602.71  76.089
      32		 831.66 105.984		 870.26 110.482
      128		1659.76 102.969		1624.73  91.356
      512		2135.91 223.054		2603.07 161.160
      
      fsmark, 16 threads, create w/32k logbsize
      
      	create		unlink
      async   5m06s		4m15s
      sync	5m00s		4m22s
      
      And on Jan's test machine:
      
                         5.18-rc8-vanilla       5.18-rc8-patched
      Amean     1        71.22 (   0.00%)       64.94 *   8.81%*
      Amean     2        93.03 (   0.00%)       84.80 *   8.85%*
      Amean     4       150.54 (   0.00%)      137.51 *   8.66%*
      Amean     8       252.53 (   0.00%)      242.24 *   4.08%*
      Amean     16      454.13 (   0.00%)      439.08 *   3.31%*
      Amean     32      835.24 (   0.00%)      829.74 *   0.66%*
      Amean     64     1740.59 (   0.00%)     1686.73 *   3.09%*
      
      Performance and cache flush behaviour is restored to pre-regression
      levels.
      
      As such, we can now consider the async cache flush mechanism an
      unnecessary exercise in premature optimisation and hence we can
      now remove it and the infrastructure it requires completely.
      
      Fixes: bad77c37 ("xfs: CIL checkpoint flushes caches unconditionally")
      Reported-and-tested-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      919edbad
    • xfs: log shutdown triggers should only shut down the log · b5f17bec
      Authored by Dave Chinner
      We've got a mess on our hands.
      
      1. xfs_trans_commit() cannot cancel transactions because the mount is
      shut down - that causes dirty, aborted, unlogged log items to sit
      unpinned in memory and potentially get written to disk before the
      log is shut down. Hence xfs_trans_commit() can only abort
      transactions when xlog_is_shutdown() is true.
      
      2. xfs_force_shutdown() is used in places to cause the current
      modification to be aborted via xfs_trans_commit() because it may be
      impractical or impossible to cancel the transaction directly, and
      hence xfs_trans_commit() must cancel transactions when
      xfs_is_shutdown() is true in this situation. But we can't do that
      because of #1.
      
      3. Log IO errors cause log shutdowns by calling xfs_force_shutdown()
      to shut down the mount and then the log from log IO completion.
      
      4. xfs_force_shutdown() can result in a log force being issued,
      which has to wait for log IO completion before it will mark the log
      as shut down. If #3 races with some other shutdown trigger that runs
      a log force, we rely on xfs_force_shutdown() silently ignoring #3
      and avoiding shutting down the log until the failed log force
      completes.
      
      5. To ensure #2 always works, we have to ensure that
      xfs_force_shutdown() does not return until the log is shut down.
      But in the case of #4, this will result in a deadlock because the
      log IO completion will block waiting for a log force to complete
      which is blocked waiting for log IO to complete....
      
      So the very first thing we have to do here to untangle this mess is
      dissociate log shutdown triggers from mount shutdowns. We already
      have xlog_force_shutdown(), which will atomically transition the
      log to a shutdown state. Due to internal asserts it cannot be called
      multiple times, but was done simply because the only place that
      could call it was xfs_do_force_shutdown() (i.e. the mount shutdown!)
      and that could only call it once and once only.  So the first thing
      we do is remove the asserts.
      
      We then convert all the internal log shutdown triggers to call
      xlog_force_shutdown() directly instead of xfs_force_shutdown(). This
      allows the log shutdown triggers to shut down the log without
      needing to care about mount based shutdown constraints. This means
      we shut down the log independently of the mount and the mount may
      not notice this until its next attempt to read or modify metadata.
      At that point (e.g. xfs_trans_commit()) it will see that the log is
      shutdown, error out and shutdown the mount.
      
      To ensure that all the unmount behaviours and asserts track
      correctly as a result of a log shutdown, propagate the shutdown up
      to the mount if it is not already set. This keeps the mount and log
      state in sync, and saves a huge amount of hassle where code fails
      because of a log shutdown but only checks for mount shutdowns and
      hence ends up doing the wrong thing. Cleaning up that mess is
      an exercise for another day.
      
      This enables us to address the other problems noted above in
      followup patches.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      b5f17bec
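      A minimal sketch of the log/mount shutdown split described above. The
      state bits and helper names are illustrative, not the upstream code:

      #include <linux/bitops.h>
      #include <linux/types.h>

      #define LOG_SHUTDOWN    0         /* bit in log->state */
      #define MOUNT_SHUTDOWN  0         /* bit in mount->state */

      struct mount_state { unsigned long state; };
      struct log_state   { unsigned long state; struct mount_state *mount; };

      static bool log_is_shutdown(struct log_state *log)
      {
          return test_bit(LOG_SHUTDOWN, &log->state);
      }

      /* Log IO errors and other log-internal triggers call this directly. */
      static bool log_force_shutdown(struct log_state *log)
      {
          /* First caller wins; later triggers see the log already dead. */
          if (test_and_set_bit(LOG_SHUTDOWN, &log->state))
              return false;

          /*
           * Propagate up so code that only checks the mount still does the
           * right thing; otherwise the mount wouldn't notice until its next
           * metadata modification fails against the dead log.
           */
          set_bit(MOUNT_SHUTDOWN, &log->mount->state);
          return true;
      }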
  7. 20 Mar 2022 (2 commits)
    • xfs: log items should have a xlog pointer, not a mount · d86142dd
      Authored by Dave Chinner
      Log items belong to the log, not the xfs_mount. Convert the mount
      pointer in the log item to a xlog pointer in preparation for
      upcoming log centric changes to the log items.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      d86142dd
    • xfs: async CIL flushes need pending pushes to be made stable · 70447e0a
      Authored by Dave Chinner
      When the AIL tries to flush the CIL, it relies on the CIL push
      ending up on stable storage without having to wait for and
      manipulate iclog state directly. However, if there is already a
      pending CIL push when the AIL tries to flush the CIL, it won't set
      the cil->xc_push_commit_stable flag and so the CIL push will not
      actively flush the commit record iclog.
      
      generic/530 when run on a single CPU test VM can trigger this fairly
      reliably. This test exercises unlinked inode recovery, and can
      result in inodes being pinned in memory by ongoing modifications to
      the inode cluster buffer to record unlinked list modifications. As a
      result, the first inode unlinked in a buffer can pin the tail of the
      log whilst the inode cluster buffer is pinned by the current
      checkpoint that has been pushed but isn't on stable storage because
      the cil->xc_push_commit_stable flag was not set. This results in
      the log/AIL effectively deadlocking until something triggers the
      commit record iclog to be pushed to stable storage (i.e. the
      periodic log worker calling xfs_log_force()).
      
      The fix is two-fold - first we should always set the
      cil->xc_push_commit_stable when xlog_cil_flush() is called,
      regardless of whether there is already a pending push or not.
      
      Second, if the CIL is empty, we should trigger an iclog flush to
      ensure that the iclogs of the last checkpoint have actually been
      submitted to disk as that checkpoint may not have been run under
      stable completion constraints.
      Reported-and-tested-by: Matthew Wilcox <willy@infradead.org>
      Fixes: 0020a190 ("xfs: AIL needs asynchronous CIL forcing")
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      70447e0a
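      The two-part fix reads roughly like the sketch below (hypothetical
      names, with the push and iclog-force operations abstracted away):

      #include <linux/types.h>

      struct cil_state {
          bool push_commit_stable;   /* commit record iclog must reach disk */
          bool empty;                /* no transactions in the current context */
      };

      static void cil_flush(struct cil_state *cil,
                            void (*push_now)(struct cil_state *),
                            void (*force_iclogs)(struct cil_state *))
      {
          /* Always request a stable push, even if a push is already pending. */
          cil->push_commit_stable = true;
          push_now(cil);

          /*
           * Empty CIL: the previous checkpoint may still be sitting in an
           * ACTIVE iclog, so push the iclogs out to stable storage as well.
           */
          if (cil->empty)
              force_iclogs(cil);
      }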
  8. 07 Jan 2022 (1 commit)
    • xfs: reduce kvmalloc overhead for CIL shadow buffers · 8dc9384b
      Authored by Dave Chinner
      Oh, let me count the ways that the kvmalloc API sucks dog eggs.
      
      The problem is when we are logging lots of large objects, we hit
      kvmalloc really damn hard with costly order allocations, and
      behaviour utterly sucks:
      
           - 49.73% xlog_cil_commit
      	 - 31.62% kvmalloc_node
      	    - 29.96% __kmalloc_node
      	       - 29.38% kmalloc_large_node
      		  - 29.33% __alloc_pages
      		     - 24.33% __alloc_pages_slowpath.constprop.0
      			- 18.35% __alloc_pages_direct_compact
      			   - 17.39% try_to_compact_pages
      			      - compact_zone_order
      				 - 15.26% compact_zone
      				      5.29% __pageblock_pfn_to_page
      				      3.71% PageHuge
      				    - 1.44% isolate_migratepages_block
      					 0.71% set_pfnblock_flags_mask
      				   1.11% get_pfnblock_flags_mask
      			   - 0.81% get_page_from_freelist
      			      - 0.59% _raw_spin_lock_irqsave
      				 - do_raw_spin_lock
      				      __pv_queued_spin_lock_slowpath
      			- 3.24% try_to_free_pages
      			   - 3.14% shrink_node
      			      - 2.94% shrink_slab.constprop.0
      				 - 0.89% super_cache_count
      				    - 0.66% xfs_fs_nr_cached_objects
      				       - 0.65% xfs_reclaim_inodes_count
      					    0.55% xfs_perag_get_tag
      				   0.58% kfree_rcu_shrink_count
      			- 2.09% get_page_from_freelist
      			   - 1.03% _raw_spin_lock_irqsave
      			      - do_raw_spin_lock
      				   __pv_queued_spin_lock_slowpath
      		     - 4.88% get_page_from_freelist
      			- 3.66% _raw_spin_lock_irqsave
      			   - do_raw_spin_lock
      				__pv_queued_spin_lock_slowpath
      	    - 1.63% __vmalloc_node
      	       - __vmalloc_node_range
      		  - 1.10% __alloc_pages_bulk
      		     - 0.93% __alloc_pages
      			- 0.92% get_page_from_freelist
      			   - 0.89% rmqueue_bulk
      			      - 0.69% _raw_spin_lock
      				 - do_raw_spin_lock
      				      __pv_queued_spin_lock_slowpath
      	   13.73% memcpy_erms
      	 - 2.22% kvfree
      
      On this workload, that's almost a dozen CPUs all trying to compact
      and reclaim memory inside kvmalloc_node at the same time. Yet it is
      regularly falling back to vmalloc despite all that compaction, page
      and shrinker reclaim that direct reclaim is doing. Copying all the
      metadata is taking far less CPU time than allocating the storage!
      
      Direct reclaim should be considered extremely harmful.
      
      This is a high frequency, high throughput, CPU usage and latency
      sensitive allocation. We've got memory there, and we're using
      kvmalloc to allow memory allocation to avoid doing lots of work to
      try to do contiguous allocations.
      
      Except it still does *lots of costly work* that is unnecessary.
      
      Worse: the only way to avoid the slowpath page allocation trying to
      do compaction on costly allocations is to turn off direct reclaim
      (i.e. remove __GFP_DIRECT_RECLAIM from the gfp flags).
      
      Unfortunately, the stupid kvmalloc API then says "oh, this isn't a
      GFP_KERNEL allocation context, so you only get kmalloc!". This
      cuts off the vmalloc fallback, and this leads to almost instant OOM
      problems which ends up in filesystems deadlocks, shutdowns and/or
      kernel crashes.
      
      I want some basic kvmalloc behaviour:
      
      - kmalloc for a contiguous range with fail fast semantics - no
        compaction direct reclaim if the allocation enters the slow path.
      - run normal vmalloc (i.e. GFP_KERNEL) if kmalloc fails
      
      The really, really stupid part about this is these kvmalloc() calls
      are run under memalloc_nofs task context, so all the allocations are
      always reduced to GFP_NOFS regardless of the fact that kvmalloc
      requires GFP_KERNEL to be passed in. IOWs, we're already telling
      kvmalloc to behave differently to the gfp flags we pass in, but it
      still won't allow vmalloc to be run with anything other than
      GFP_KERNEL.
      
      So, this patch open codes the kvmalloc() in the commit path to have
      the above described behaviour. The result is we more than halve the
      CPU time spent doing kvmalloc() in this path, and the rate of
      transaction commits with 64kB objects in them more than doubles.
      i.e. we get ~5x
      reduction in CPU usage per costly-sized kvmalloc() invocation and
      the profile looks like this:
      
        - 37.60% xlog_cil_commit
      	16.01% memcpy_erms
            - 8.45% __kmalloc
      	 - 8.04% kmalloc_order_trace
      	    - 8.03% kmalloc_order
      	       - 7.93% alloc_pages
      		  - 7.90% __alloc_pages
      		     - 4.05% __alloc_pages_slowpath.constprop.0
      			- 2.18% get_page_from_freelist
      			- 1.77% wake_all_kswapds
      ....
      				    - __wake_up_common_lock
      				       - 0.94% _raw_spin_lock_irqsave
      		     - 3.72% get_page_from_freelist
      			- 2.43% _raw_spin_lock_irqsave
            - 5.72% vmalloc
      	 - 5.72% __vmalloc_node_range
      	    - 4.81% __get_vm_area_node.constprop.0
      	       - 3.26% alloc_vmap_area
      		  - 2.52% _raw_spin_lock
      	       - 1.46% _raw_spin_lock
      	      0.56% __alloc_pages_bulk
            - 4.66% kvfree
      	 - 3.25% vfree
      	    - __vfree
      	       - 3.23% __vunmap
      		  - 1.95% remove_vm_area
      		     - 1.06% free_vmap_area_noflush
      			- 0.82% _raw_spin_lock
      		     - 0.68% _raw_spin_lock
      		  - 0.92% _raw_spin_lock
      	 - 1.40% kfree
      	    - 1.36% __free_pages
      	       - 1.35% __free_pages_ok
      		  - 1.02% _raw_spin_lock_irqsave
      
      It's worth noting that over 50% of the CPU time spent allocating
      these shadow buffers is now spent on spinlocks. So the shadow buffer
      allocation overhead is greatly reduced by getting rid of direct
      reclaim from kmalloc, and could probably be made even less costly if
      vmalloc() didn't use global spinlocks to protect its structures.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      8dc9384b
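      The open-coded allocation behaviour described above amounts to
      something like this sketch (the function name is illustrative; freeing
      still goes through kvfree() since the buffer may be either kmalloc or
      vmalloc memory):

      #include <linux/gfp.h>
      #include <linux/slab.h>
      #include <linux/vmalloc.h>

      static void *shadow_buf_alloc(size_t size)
      {
          gfp_t flags = GFP_KERNEL | __GFP_NOWARN;
          void *p;

          /*
           * Strip direct reclaim from the kmalloc attempt: if a contiguous
           * allocation isn't immediately available, fail fast rather than
           * compacting and reclaiming memory in the commit path.
           */
          flags &= ~__GFP_DIRECT_RECLAIM;

          do {
              p = kmalloc(size, flags);
              if (!p)
                  p = vmalloc(size);   /* plain GFP_KERNEL vmalloc fallback */
          } while (!p);

          return p;
      }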
  9. 23 Dec 2021 (1 commit)
    • xfs: prevent UAF in xfs_log_item_in_current_chkpt · f8d92a66
      Authored by Darrick J. Wong
      While I was running with KASAN and lockdep enabled, I stumbled upon a
      KASAN report about a UAF to a freed CIL checkpoint.  Looking at the
      comment for xfs_log_item_in_current_chkpt, it seems pretty obvious to me
      that the original patch to xfs_defer_finish_noroll should have done
      something to lock the CIL to prevent it from switching the CIL contexts
      while the predicate runs.
      
      For upper level code that needs to know if a given log item is new
      enough not to need relogging, add a new wrapper that takes the CIL
      context lock long enough to sample the current CIL context.  This is
      kind of racy in that the CIL can switch the contexts immediately after
      sampling, but that's ok because the consequence is that the defer ops
      code is a little slow to relog items.
      
       ==================================================================
       BUG: KASAN: use-after-free in xfs_log_item_in_current_chkpt+0x139/0x160 [xfs]
       Read of size 8 at addr ffff88804ea5f608 by task fsstress/527999
      
       CPU: 1 PID: 527999 Comm: fsstress Tainted: G      D      5.16.0-rc4-xfsx #rc4
       Call Trace:
        <TASK>
        dump_stack_lvl+0x45/0x59
        print_address_description.constprop.0+0x1f/0x140
        kasan_report.cold+0x83/0xdf
        xfs_log_item_in_current_chkpt+0x139/0x160
        xfs_defer_finish_noroll+0x3bb/0x1e30
        __xfs_trans_commit+0x6c8/0xcf0
        xfs_reflink_remap_extent+0x66f/0x10e0
        xfs_reflink_remap_blocks+0x2dd/0xa90
        xfs_file_remap_range+0x27b/0xc30
        vfs_dedupe_file_range_one+0x368/0x420
        vfs_dedupe_file_range+0x37c/0x5d0
        do_vfs_ioctl+0x308/0x1260
        __x64_sys_ioctl+0xa1/0x170
        do_syscall_64+0x35/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
       RIP: 0033:0x7f2c71a2950b
       Code: 0f 1e fa 48 8b 05 85 39 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff
      ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01
      f0 ff ff 73 01 c3 48 8b 0d 55 39 0d 00 f7 d8 64 89 01 48
       RSP: 002b:00007ffe8c0e03c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
       RAX: ffffffffffffffda RBX: 00005600862a8740 RCX: 00007f2c71a2950b
       RDX: 00005600862a7be0 RSI: 00000000c0189436 RDI: 0000000000000004
       RBP: 000000000000000b R08: 0000000000000027 R09: 0000000000000003
       R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000005a
       R13: 00005600862804a8 R14: 0000000000016000 R15: 00005600862a8a20
        </TASK>
      
       Allocated by task 464064:
        kasan_save_stack+0x1e/0x50
        __kasan_kmalloc+0x81/0xa0
        kmem_alloc+0xcd/0x2c0 [xfs]
        xlog_cil_ctx_alloc+0x17/0x1e0 [xfs]
        xlog_cil_push_work+0x141/0x13d0 [xfs]
        process_one_work+0x7f6/0x1380
        worker_thread+0x59d/0x1040
        kthread+0x3b0/0x490
        ret_from_fork+0x1f/0x30
      
       Freed by task 51:
        kasan_save_stack+0x1e/0x50
        kasan_set_track+0x21/0x30
        kasan_set_free_info+0x20/0x30
        __kasan_slab_free+0xed/0x130
        slab_free_freelist_hook+0x7f/0x160
        kfree+0xde/0x340
        xlog_cil_committed+0xbfd/0xfe0 [xfs]
        xlog_cil_process_committed+0x103/0x1c0 [xfs]
        xlog_state_do_callback+0x45d/0xbd0 [xfs]
        xlog_ioend_work+0x116/0x1c0 [xfs]
        process_one_work+0x7f6/0x1380
        worker_thread+0x59d/0x1040
        kthread+0x3b0/0x490
        ret_from_fork+0x1f/0x30
      
       Last potentially related work creation:
        kasan_save_stack+0x1e/0x50
        __kasan_record_aux_stack+0xb7/0xc0
        insert_work+0x48/0x2e0
        __queue_work+0x4e7/0xda0
        queue_work_on+0x69/0x80
        xlog_cil_push_now.isra.0+0x16b/0x210 [xfs]
        xlog_cil_force_seq+0x1b7/0x850 [xfs]
        xfs_log_force_seq+0x1c7/0x670 [xfs]
        xfs_file_fsync+0x7c1/0xa60 [xfs]
        __x64_sys_fsync+0x52/0x80
        do_syscall_64+0x35/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
       The buggy address belongs to the object at ffff88804ea5f600
        which belongs to the cache kmalloc-256 of size 256
       The buggy address is located 8 bytes inside of
        256-byte region [ffff88804ea5f600, ffff88804ea5f700)
       The buggy address belongs to the page:
       page:ffffea00013a9780 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88804ea5ea00 pfn:0x4ea5e
       head:ffffea00013a9780 order:1 compound_mapcount:0
       flags: 0x4fff80000010200(slab|head|node=1|zone=1|lastcpupid=0xfff)
       raw: 04fff80000010200 ffffea0001245908 ffffea00011bd388 ffff888004c42b40
       raw: ffff88804ea5ea00 0000000000100009 00000001ffffffff 0000000000000000
       page dumped because: kasan: bad access detected
      
       Memory state around the buggy address:
        ffff88804ea5f500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
        ffff88804ea5f580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       >ffff88804ea5f600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                             ^
        ffff88804ea5f680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ffff88804ea5f700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       ==================================================================
      
      Fixes: 4e919af7 ("xfs: periodically relog deferred intent items")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      f8d92a66
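      A sketch of the wrapper described above (hypothetical field names):
      hold the CIL context lock shared just long enough to sample the live
      context, so the comparison can never dereference a freed checkpoint:

      #include <linux/rwsem.h>
      #include <linux/types.h>

      struct cil_state {
          struct rw_semaphore ctx_lock;  /* taken exclusive on context switch */
          u64                 current_seq;
      };

      struct log_item {
          u64 seq;                       /* checkpoint the item was committed to */
      };

      static bool item_in_current_chkpt(struct cil_state *cil, struct log_item *lip)
      {
          bool in_current;

          down_read(&cil->ctx_lock);
          in_current = (lip->seq == cil->current_seq);
          up_read(&cil->ctx_lock);

          /*
           * Deliberately racy: the CIL may switch contexts as soon as the
           * lock is dropped, which only makes deferred ops relog a little
           * later than strictly necessary.
           */
          return in_current;
      }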
  10. 20 Aug 2021 (1 commit)
  11. 17 Aug 2021 (11 commits)
    • xfs: move the CIL workqueue to the CIL · 33c0dd78
      Authored by Dave Chinner
      We only use the CIL workqueue in the CIL, so it makes no sense to
      hang it off the xfs_mount and have to walk multiple pointers back up
      to the mount when we have the CIL structures right there.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      33c0dd78
    • xfs: CIL work is serialised, not pipelined · 39823d0f
      Authored by Dave Chinner
      Because we use a single work structure attached to the CIL rather
      than the CIL context, we can only queue a single work item at a
      time. This results in the CIL being single threaded and limits
      performance when it becomes CPU bound.
      
      The design of the CIL is that it is pipelined and multiple commits
      can be running concurrently, but the way the work is currently
      implemented means that it is not pipelining as it was intended. The
      critical work to switch the CIL context can take a few milliseconds
      to run, but the rest of the CIL context flush can take hundreds of
      milliseconds to complete. The context switching is the serialisation
      point of the CIL, once the context has been switched the rest of the
      context push can run asynchronously with all other context pushes.
      
      Hence we can move the work to the CIL context so that we can run
      multiple CIL pushes at the same time and spread the majority of
      the work out over multiple CPUs. We can keep the per-cpu CIL commit
      state on the CIL rather than the context, because the context is
      pinned to the CIL until the switch is done and we aggregate and
      drain the per-cpu state held on the CIL during the context switch.
      
      However, because we no longer serialise the CIL work, we can have
      effectively unlimited CIL pushes in progress. We don't want to do
      this - not only does it create contention on the iclogs and the
      state machine locks, we can run the log right out of space with
      outstanding pushes. Instead, limit the work concurrency to 4
      concurrent works being processed at a time. This is enough
      concurrency to remove the CIL from being a CPU bound bottleneck but
      not enough to create new contention points or unbound concurrency
      issues.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      39823d0f
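      The bounded pipelining described above can be sketched like this; the
      names and flags are illustrative, and the max_active value of 4 follows
      the commit message rather than the exact upstream code:

      #include <linux/workqueue.h>

      struct cil_ctx {
          struct work_struct push_work;  /* one push work per checkpoint context */
      };

      static void cil_push_work(struct work_struct *work)
      {
          struct cil_ctx *ctx = container_of(work, struct cil_ctx, push_work);

          /* format and write this checkpoint; other contexts may push in parallel */
          (void)ctx;
      }

      static struct workqueue_struct *cil_wq_alloc(const char *fsname)
      {
          /* max_active = 4 caps the number of concurrent CIL pushes */
          return alloc_workqueue("xfs-cil/%s", WQ_MEM_RECLAIM | WQ_UNBOUND,
                                 4, fsname);
      }

      static void cil_ctx_queue_push(struct workqueue_struct *wq, struct cil_ctx *ctx)
      {
          INIT_WORK(&ctx->push_work, cil_push_work);
          queue_work(wq, &ctx->push_work);
      }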
    • xfs: AIL needs asynchronous CIL forcing · 0020a190
      Authored by Dave Chinner
      The AIL pushing is stalling on log forces when it comes across
      pinned items. This is happening on removal workloads where the AIL
      is dominated by stale items that are removed from AIL when the
      checkpoint that marks the items stale is committed to the journal.
      This results in relatively few items in the AIL, but those that
      remain are often pinned, as the directories that items are being
      removed from are still being logged.
      
      As a result, many push cycles through the CIL will first issue a
      blocking log force to unpin the items. This can take some time to
      complete, with tracing regularly showing push delays of half a
      second and sometimes up into the range of several seconds. Sequences
      like this aren't uncommon:
      
      ....
       399.829437:  xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20
      <wanted 20ms, got 270ms delay>
       400.099622:  xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0
       400.099623:  xfsaild: first lsn 0x11002f3600
       400.099679:  xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50
      <wanted 50ms, got 500ms delay>
       400.589348:  xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0
       400.589349:  xfsaild: first lsn 0x1100305000
       400.589595:  xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50
      <wanted 50ms, got 460ms delay>
       400.950341:  xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0
       400.950343:  xfsaild: first lsn 0x1100317c00
       400.950436:  xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20
      <wanted 20ms, got 200ms delay>
       401.142333:  xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0
       401.142334:  xfsaild: first lsn 0x110032e600
       401.142535:  xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10
      <wanted 10ms, got 10ms delay>
       401.154323:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000
       401.154328:  xfsaild: first lsn 0x1100353000
       401.154389:  xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20
      <wanted 20ms, got 300ms delay>
       401.451525:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
       401.451526:  xfsaild: first lsn 0x1100353000
       401.451804:  xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50
      <wanted 50ms, got 500ms delay>
       401.933581:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
      ....
      
      In each of these cases, every AIL pass saw 101 log items stuck on
      the AIL (pinned) with very few other items being found. Each pass, a
      log force was issued, and delay between last/first is the sleep time
      + the sync log force time.
      
      Some of these 101 items pinned the tail of the log. The tail of the
      log does slowly creep forward (first lsn), but the problem is that
      the log is actually out of reservation space because it has been
      running so many transactions that generate stale items which never
      reach the AIL but still consume log space. Hence we have a largely
      empty AIL, with
      long term pins on items that pin the tail of the log that don't get
      pushed frequently enough to keep log space available.
      
      The problem is the hundreds of milliseconds that we block in the log
      force pushing the CIL out to disk. The AIL should not be stalled
      like this - it needs to run and flush items that are at the tail of
      the log with minimal latency. What we really need to do is trigger a
      log flush, but then not wait for it at all - we've already done our
      waiting for stuff to complete when we backed off prior to the log
      force being issued.
      
      Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
      still do a blocking flush of the CIL and that is what is causing the
      issue. Hence we need a new interface for the CIL to trigger an
      immediate background push of the CIL to get it moving faster but not
      to wait on that to occur. While the CIL is pushing, the AIL can also
      be pushing.
      
      We already have an internal interface to do this -
      xlog_cil_push_now() - but we need a wrapper for it to be used
      externally. xlog_cil_force_seq() can easily be extended to do what
      we need as it already implements the synchronous CIL push via
      xlog_cil_push_now(). Add the necessary flags and "push current
      sequence" semantics to xlog_cil_force_seq() and convert the AIL
      pushing to use it.
      
      One of the complexities here is that the CIL push does not guarantee
      that the commit record for the CIL checkpoint is written to disk.
      The current log force ensures this by submitting the current ACTIVE
      iclog that the commit record was written to. We need the CIL to
      actually write this commit record to disk for an async push to
      ensure that the checkpoint actually makes it to disk and unpins the
      pinned items in the checkpoint on completion. Hence we need to pass
      down to the CIL push that we are doing an async flush so that it can
      switch out the commit_iclog if necessary to get written to disk when
      the commit iclog is finally released.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      0020a190
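      A sketch of the non-blocking interface described above (hypothetical
      names, with the background push reduced to a stub): the caller asks for
      the current sequence to be pushed with a stable commit record and
      returns immediately instead of waiting in a synchronous log force:

      #include <linux/types.h>

      struct cil_state {
          u64  current_seq;           /* sequence of the open CIL context */
          bool push_commit_stable;    /* commit record iclog must reach disk */
      };

      /* Stand-in for queueing the CIL push work for @seq in the background. */
      static void cil_queue_push(struct cil_state *cil, u64 seq)
      {
          (void)cil;
          (void)seq;
      }

      /* Async flush: trigger the push, never wait for the checkpoint IO. */
      static void cil_flush_async(struct cil_state *cil)
      {
          cil->push_commit_stable = true;
          cil_queue_push(cil, cil->current_seq);
      }

      The AIL push path would then call something like cil_flush_async() where
      it previously issued a blocking log force, letting AIL pushing and the
      CIL push run concurrently.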
    • xfs: order CIL checkpoint start records · 68a74dca
      Authored by Dave Chinner
      Because log recovery depends on strictly ordered start records as
      well as strictly ordered commit records.
      
      This is a zero day bug in the way XFS writes pipelined transactions
      to the journal which is exposed by fixing the zero day bug that
      prevents the CIL from pipelining checkpoints. This re-introduces
      explicit concurrent commits back into the on-disk journal and hence
      out of order start records.
      
      The XFS journal commit code has never ordered start records and we
      have relied on strict commit record ordering for correct recovery
      ordering of concurrently written transactions. Unfortunately, root
      cause analysis uncovered the fact that log recovery uses the LSN of
      the start record for transaction commit processing. Hence, whilst
      the commits are processed in strict order by recovery, the LSNs
      associated with the commits can be out of order and so recovery may
      stamp incorrect LSNs into objects and/or misorder intents in the AIL
      for later processing. This can result in log recovery failures
      and/or on disk corruption, sometimes silent.
      
      Because this is a long standing log recovery issue, we can't just
      fix log recovery and call it good. This still leaves older kernels
      susceptible to recovery failures and corruption when replaying a log
      from a kernel that pipelines checkpoints. There is also the issue
      that in-memory ordering for AIL pushing and data integrity
      operations are based on checkpoint start LSNs, and if the start LSN
      is incorrect in the journal, it is also incorrect in memory.
      
      Hence there's really only one choice for fixing this zero-day bug:
      we need to strictly order checkpoint start records in ascending
      sequence order in the log, the same way we already strictly order
      commit records.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      68a74dca
    • xfs: attach iclog callbacks in xlog_cil_set_ctx_write_state() · caa80090
      Authored by Dave Chinner
      Now that we have a mechanism to guarantee that the callbacks
      attached to an iclog are owned by the context that attaches them
      until they drop their reference to the iclog via
      xlog_state_release_iclog(), we can attach callbacks to the iclog at
      any time we have an active reference to the iclog.
      
      xlog_state_get_iclog_space() always guarantees that the commit
      record will fit in the iclog it returns, so we can move this IO
      callback setting to xlog_cil_set_ctx_write_state(), record the
      commit iclog in the context and remove the need for the commit iclog
      to be returned by xlog_write() altogether.
      
      This, in turn, allows us to move the wakeup for ordered commit
      record writes up into xlog_cil_set_ctx_write_state(), too, because
      we have been guaranteed that this commit record will be physically
      located in the iclog before any waiting commit record at a higher
      sequence number will be granted iclog space.
      
      This further cleans up the post commit record write processing in
      the CIL push code, especially as xlog_state_release_iclog() will now
      clean up the context when shutdown errors occur.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      caa80090
    • xfs: factor out log write ordering from xlog_cil_push_work() · bf034bc8
      Authored by Dave Chinner
      So we can use it for start record ordering as well as commit record
      ordering in future.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      bf034bc8
    • xfs: pass a CIL context to xlog_write() · c45aba40
      Authored by Dave Chinner
      Pass the CIL context to xlog_write() rather than a pointer to a LSN
      variable. Only the CIL checkpoint calls to xlog_write() need to know
      about the start LSN of the writes, so rework xlog_write to directly
      write the LSNs into the CIL context structure.
      
      This removes the commit_lsn variable from xlog_cil_push_work(), so
      now we only have to issue the commit record ordering wakeup from
      there.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      c45aba40
    • xfs: move xlog_commit_record to xfs_log_cil.c · 2ce82b72
      Authored by Dave Chinner
      It is only used by the CIL checkpoints, and is the counterpart to
      start record formatting and writing that is already local to
      xfs_log_cil.c.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      2ce82b72
    • xfs: don't run shutdown callbacks on active iclogs · 502a01fa
      Authored by Dave Chinner
      When the log is shutdown, it currently walks all the iclogs and runs
      callbacks that are attached to the iclogs, regardless of whether the
      iclog is queued for IO completion or not. This creates a problem for
      contexts attaching callbacks to iclogs in that a racing shutdown can
      run the callbacks even before the attaching context has finished
      processing the iclog and releasing it for IO submission.
      
      If the callback processing of the iclog frees the structure that is
      attached to the iclog, then this leads to an UAF scenario that can
      only be protected against by holding the icloglock from the point
      callbacks are attached through to the release of the iclog. While we
      currently do this, it is not practical or sustainable.
      
      Hence we need to make shutdown processing the responsibility of the
      context that holds active references to the iclog. We know that the
      contexts attaching callbacks to the iclog must have active
      references to the iclog, and that means they must be in either
      ACTIVE or WANT_SYNC states. xlog_state_do_callback() will skip over
      iclogs in these states -except- when the log is shut down.
      
      xlog_state_do_callback() checks the state of the iclogs while
      holding the icloglock, therefore the reference count/state change
      that occurs in xlog_state_release_iclog() after the callbacks are
      attached is atomic w.r.t. shutdown processing.
      
      We can't push the responsibility of callback cleanup onto the CIL
      context because we can have ACTIVE iclogs that have callbacks
      attached that have already been released. Hence we really need to
      internalise the cleanup of callbacks into xlog_state_release_iclog()
      processing.
      
      Indeed, we already have that internalisation via:
      
      xlog_state_release_iclog
        drop last reference
          ->SYNCING
        xlog_sync
          xlog_write_iclog
            if (log_is_shutdown)
              xlog_state_done_syncing()
      	  xlog_state_do_callback()
      	    <process shutdown on iclog that is now in SYNCING state>
      
      The problem is that xlog_state_release_iclog() aborts before doing
      anything if the log is already shut down. It assumes that the
      callbacks have already been cleaned up, and it doesn't need to do
      any cleanup.
      
      Hence the fix is to remove the xlog_is_shutdown() check from
      xlog_state_release_iclog() so that reference counts are correctly
      released from the iclogs, and when the reference count is zero we
      always transition to SYNCING if the log is shut down. Hence we'll
      always enter the xlog_sync() path in a shutdown and eventually end
      up erroring out the iclog IO and running xlog_state_do_callback() to
      process the callbacks attached to the iclog.
      
      This allows us to stop processing referenced ACTIVE/WANT_SYNC iclogs
      directly in the shutdown code, and in doing so gets rid of the UAF
      vector that currently exists. This then decouples the adding of
      callbacks to the iclogs from xlog_state_release_iclog() as we
      guarantee that xlog_state_release_iclog() will process the callbacks
      if the log has been shut down before xlog_state_release_iclog() has
      been called.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      502a01fa
    • xfs: XLOG_STATE_IOERROR must die · 5112e206
      Authored by Dave Chinner
      We don't need an iclog state field to tell us the log has been shut
      down. We can just check xlog_is_shutdown() instead. This avoids
      the need to have shutdown overwrite the current iclog state while
      it is being actively used by the log code, and so having to ensure
      that every iclog state check handles XLOG_STATE_IOERROR appropriately.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      5112e206
    • xfs: convert XLOG_FORCED_SHUTDOWN() to xlog_is_shutdown() · 2039a272
      Authored by Dave Chinner
      Make it less shouty and a static inline before adding more calls
      through the log code.
      
      Also convert internal log code that uses XFS_FORCED_SHUTDOWN(mount)
      to use xlog_is_shutdown(log) as well.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      2039a272
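      The helper reads roughly as below; the flag and field names here are
      illustrative of the idea rather than copied from the tree at the time:

      #include <linux/bitops.h>
      #include <linux/types.h>

      #define XLOG_IO_ERROR    0        /* "log has been shut down" state bit */

      struct xlog_state { unsigned long opstate; };

      static inline bool xlog_is_shutdown(struct xlog_state *log)
      {
          return test_bit(XLOG_IO_ERROR, &log->opstate);
      }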
  12. 10 Aug 2021 (1 commit)
  13. 30 Jul 2021 (1 commit)
    • xfs: fix ordering violation between cache flushes and tail updates · 0dc8f7f1
      Authored by Dave Chinner
      There is a race between the new CIL async data device metadata IO
      completion cache flush and the log tail in the iclog the flush
      covers being updated. This can be seen by repeating generic/482 in a
      loop and eventually log recovery fails with a failures such as this:
      
      XFS (dm-3): Starting recovery (logdev: internal)
      XFS (dm-3): bad inode magic/vsn daddr 228352 #0 (magic=0)
      XFS (dm-3): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x37c00 xfs_inode_buf_verify
      XFS (dm-3): Unmount and run xfs_repair
      XFS (dm-3): First 128 bytes of corrupted metadata buffer:
      00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      XFS (dm-3): metadata I/O error in "xlog_recover_items_pass2+0x55/0xc0" at daddr 0x37c00 len 32 error 117
      
      Analysis of the logwrite replay shows that there were no writes to
      the data device between the FUA @ write 124 and the FUA @ write
      125, but log recovery @ 125 failed. The difference was the one log
      write @ 125 moved the tail of the log forwards from (1,8) to (1,32)
      and so the inode create intent in (1,8) was not replayed and so the
      inode cluster was zero on disk when replay of the first inode item
      in (1,32) was attempted.
      
      What this meant was that the journal write @ 125 did not ensure that
      the metadata which had completed before the iclog was written was
      actually on stable storage. The tail of the log moved forward, so IO
      must have completed between the two iclog writes. This
      means that there is a race condition between the unconditional async
      cache flush in the CIL push work and the tail LSN that is written to
      the iclog. This happens like so:
      
      CIL push work				AIL push work
      -------------				-------------
      Add to committing list
      start async data dev cache flush
      .....
      <flush completes>
      <all writes to old tail lsn are stable>
      xlog_write
        ....					push inode create buffer
      					<start IO>
      					.....
      xlog_write(commit record)
        ....					<IO completes>
        					log tail moves
        					  xlog_assign_tail_lsn()
      start_lsn == commit_lsn
        <no iclog preflush!>
      xlog_state_release_iclog
        __xlog_state_release_iclog()
          <writes *new* tail_lsn into iclog>
        xlog_sync()
          ....
          submit_bio()
      <tail in log moves forward without flushing written metadata>
      
      Essentially, this can only occur if the commit iclog is issued
      without a cache flush. If the iclog bio is submitted with
      REQ_PREFLUSH, then it will guarantee that all the completed IO is
      on stable storage before the iclog bio with the new tail LSN in it
      is written to the log.
      
      IOWs, the tail lsn that is written to the iclog needs to be sampled
      *before* we issue the cache flush that guarantees all IO up to that
      LSN has been completed.
      
      To fix this without giving up the performance advantage of the
      flush/FUA optimisations (e.g. g/482 runtime halves with 5.14-rc1
      compared to 5.13), we need to ensure that we always issue a cache
      flush if the tail LSN changes between the initial async flush and
      the commit record being written. This requires sampling the tail_lsn
      before we start the flush, and then passing the sampled tail LSN to
      xlog_state_release_iclog() so it can determine if the tail LSN has
      changed while writing the checkpoint. If the tail LSN has changed,
      then it needs to set the NEED_FLUSH flag on the iclog and we'll issue
      another cache flush before writing the iclog. A rough sketch of this
      approach follows this entry.
      
      Fixes: eef983ff ("xfs: journal IO cache flush reductions")
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      0dc8f7f1
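      A rough sketch of the idea described above; the helper and flag names
      below follow common XFS naming but should be read as assumptions, not
      as the exact code in the patch:

      	/*
      	 * Sample the tail before issuing the async cache flush, then
      	 * re-check it when the commit iclog is released. If the tail
      	 * moved while the checkpoint was being written, force another
      	 * cache flush ahead of the iclog carrying the new tail LSN.
      	 */
      	xfs_lsn_t	preflush_tail_lsn;

      	preflush_tail_lsn = xlog_assign_tail_lsn(log->l_mp);
      	/* ... issue the async data device cache flush, write the CIL ... */

      	/* later, when releasing the commit iclog (illustrative): */
      	if (preflush_tail_lsn != atomic64_read(&log->l_tail_lsn))
      		iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;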
  14. 26 June, 2021 2 commits
    • D
      xfs: don't wait on future iclogs when pushing the CIL · 1effb72a
      Committed by Dave Chinner
      The iclogbuf ring attached to the struct xlog is circular, hence the
      first and last iclogs in the ring can only be determined by
      comparing them against the log->l_iclog pointer.
      
      In xfs_cil_push_work(), we want to wait on previous iclogs that were
      issued so that we can flush them to stable storage with the commit
      record write, and it simply waits on the previous iclog in the ring.
      This, however, leads to CIL push hangs in generic/019 like so:
      
      task:kworker/u33:0   state:D stack:12680 pid:    7 ppid:     2 flags:0x00004000
      Workqueue: xfs-cil/pmem1 xlog_cil_push_work
      Call Trace:
       __schedule+0x30b/0x9f0
       schedule+0x68/0xe0
       xlog_wait_on_iclog+0x121/0x190
       ? wake_up_q+0xa0/0xa0
       xlog_cil_push_work+0x994/0xa10
       ? _raw_spin_lock+0x15/0x20
       ? xfs_swap_extents+0x920/0x920
       process_one_work+0x1ab/0x390
       worker_thread+0x56/0x3d0
       ? rescuer_thread+0x3c0/0x3c0
       kthread+0x14d/0x170
       ? __kthread_bind_mask+0x70/0x70
       ret_from_fork+0x1f/0x30
      
      With other threads blocking in either xlog_state_get_iclog_space()
      waiting for iclog space or xlog_grant_head_wait() waiting for log
      reservation space.
      
      The problem here is that the previous iclog on the ring might
      actually be a future iclog. That is, if log->l_iclog points at
      commit_iclog, commit_iclog is the first (oldest) iclog in the ring
      and there are no previous iclogs pending as they have all completed
      their IO and been activated again. IOWs, commit_iclog->ic_prev
      points to an iclog that will be written in the future, not one that
      has been written in the past.
      
      Hence, in this case, waiting on the ->ic_prev iclog is incorrect
      behaviour, and depending on the state of the future iclog, we can
      end up with a circular ABA wait cycle and we hang.
      
      The fix is made more complex by the fact that many iclog states
      cannot be used to determine if the iclog is a past or future iclog.
      Hence we have to determine past iclogs by checking the LSN of the
      iclog rather than their state. A past ACTIVE iclog will have an LSN
      of zero, while a future ACTIVE iclog will have an LSN greater than
      the current iclog. We don't wait on either of these cases.
      
      Similarly, a future iclog that hasn't completed IO will have an LSN
      greater than the current iclog and so we don't wait on them. A past
      iclog that is still undergoing IO completion will have a LSN less
      than the current iclog and those are the only iclogs that we need to
      wait on.
      
      Hence we can use the iclog LSN to determine which iclogs we need to
      wait on here; a sketch of that test follows this entry.
      
      Fixes: 5fd9256ce156 ("xfs: separate CIL commit record IO")
      Reported-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      1effb72a
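      The LSN test described above might look something like this sketch;
      the helper name is invented for illustration, while XFS_LSN_CMP() and
      the iclog header LSN are used as the commit text describes them:

      	/*
      	 * Sketch: should the CIL push wait on this earlier iclog?
      	 * A past ACTIVE iclog has an LSN of zero, a future iclog has
      	 * an LSN greater than (or equal to) the commit iclog's LSN;
      	 * only a past iclog still under IO needs to be waited on.
      	 */
      	static bool
      	xlog_prev_iclog_needs_wait(struct xlog_in_core *prev, xfs_lsn_t commit_lsn)
      	{
      		xfs_lsn_t	prev_lsn = be64_to_cpu(prev->ic_header.h_lsn);

      		if (!prev_lsn)
      			return false;	/* already completed and reactivated */
      		if (XFS_LSN_CMP(prev_lsn, commit_lsn) >= 0)
      			return false;	/* future iclog, don't wait */
      		return true;		/* past iclog still in flight */
      	}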
    • D
      xfs: Fix a CIL UAF by getting rid of the iclog callback lock · a1bb8505
      Committed by Dave Chinner
      The iclog callback chain has its own lock. That was added way back
      in 2008 by myself to alleviate severe lock contention on the
      icloglock in commit 114d23aa ("[XFS] Per iclog callback chain
      lock"). This was long before delayed logging took the icloglock out
      of the hot transaction commit path and removed all contention on it.
      Hence the separate ic_callback_lock doesn't serve any scalability
      purpose anymore, and hasn't for close on a decade.
      
      Further, we only attach callbacks to iclogs in one place where we
      are already taking the icloglock soon after attaching the callbacks.
      We also have to drop the icloglock to run callbacks and grab it
      immediately afterwards again. So given that the icloglock is no
      longer hot, making it cover callbacks again doesn't really change
      the locking patterns very much at all.
      
      We also need to extend the icloglock to cover callback addition to
      fix a zero-day UAF in the CIL push code. This occurs when shutdown
      races with xlog_cil_push_work() and the shutdown runs the callbacks
      before the push releases the iclog. This results in the CIL context
      structure attached to the iclog being freed by the callback before
      the CIL push has finished referencing it, leading to UAF bugs.
      
      Hence, to avoid this UAF, we need the callback attachment to be
      atomic with post processing of the commit iclog and references to
      the structures being attached to the iclog. This requires holding
      the icloglock as that's the only way to serialise iclog state
      against a shutdown in progress.
      
      The result is we need to be using the icloglock to protect the
      callback list addition and removal and serialise them with shutdown
      (see the sketch after this entry). That makes the ic_callback_lock
      redundant and so it can be removed.
      
      Fixes: 71e330b5 ("xfs: Introduce delayed logging core code")
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      a1bb8505
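      As a sketch of the serialisation being described here; the field
      names follow normal XFS conventions but are assumptions made for
      illustration, not the patch itself:

      	/*
      	 * Attach the commit callbacks while holding the icloglock so a
      	 * racing shutdown cannot run them before this context has
      	 * finished referencing the iclog.
      	 */
      	spin_lock(&log->l_icloglock);
      	list_add_tail(&ctx->iclog_entry, &commit_iclog->ic_callbacks);
      	/* ... finish post-processing of the commit iclog ... */
      	spin_unlock(&log->l_icloglock);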
  15. 22 June, 2021 2 commits
    • D
      xfs: xfs_log_force_lsn isn't passed a LSN · 5f9b4b0d
      Committed by Dave Chinner
      In doing an investigation into AIL push stalls, I was looking at the
      log force code to see if an async CIL push could be done instead.
      This led me to xfs_log_force_lsn() and looking at how it works.
      
      xfs_log_force_lsn() is only called from inode synchronisation
      contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn
      value as the LSN to sync the log to. This gets passed to
      xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the
      journal, and then used by xfs_log_force_lsn() to flush the iclogs to
      the journal.
      
      The problem is that ip->i_itemp->ili_last_lsn does not store a
      log sequence number. What it stores is passed to it from the
      ->iop_committing method, which is called by xfs_log_commit_cil().
      The value this passes to the iop_committing method is the CIL
      context sequence number that the item was committed to.
      
      As it turns out, xlog_cil_force_lsn() converts the sequence to an
      actual commit LSN for the related context and returns that to
      xfs_log_force_lsn(). xfs_log_force_lsn() overwrites its "lsn"
      variable that contained a sequence with an actual LSN and then uses
      that to sync the iclogs.
      
      This caused me some confusion for a while, even though I originally
      wrote all this code a decade ago. ->iop_committing is only used by
      a couple of log item types, and only inode items use the sequence
      number it is passed.
      
      Let's clean up the API, CIL structures and inode log item to call it
      a sequence number, and make it clear that the high level code is
      using CIL sequence numbers and not on-disk LSNs for integrity
      synchronisation purposes. A rough sketch of the resulting API shape
      follows this entry.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NAllison Henderson <allison.henderson@oracle.com>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      5f9b4b0d
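      The resulting API shape might look roughly like the following; the
      xfs_csn_t type and the ili_commit_seq field name are assumptions used
      here to illustrate the sequence-vs-LSN distinction:

      	/* A CIL sequence number, not an on-disk LSN. */
      	xfs_csn_t	seq = ip->i_itemp->ili_commit_seq;
      	xfs_lsn_t	commit_lsn;

      	/* The CIL maps the sequence to the LSN of its commit record. */
      	commit_lsn = xlog_cil_force_seq(log, seq);
      	/* ... the iclogs are then flushed out to commit_lsn as before ... */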
    • D
      xfs: Fix CIL throttle hang when CIL space used going backwards · 19f4e7cc
      Committed by Dave Chinner
      A hang with tasks stuck on the CIL hard throttle was reported and
      largely diagnosed by Donald Buczek, who discovered that it was a
      result of the CIL context space usage decrementing in committed
      transactions once the hard throttle limit had been hit and processes
      were already blocked.  This resulted in the CIL push not waking up
      those waiters because the CIL context was no longer over the hard
      throttle limit.
      
      The surprising aspect of this was the CIL space usage going
      backwards regularly enough to trigger this situation. Assumptions
      had been made in design that the relogging process would only
      increase the size of the objects in the CIL, and so that space would
      only increase.
      
      This change and its commit message fix the issue and document the
      results of an audit of the triggers that can cause the CIL space to
      go backwards, how large the backwards steps tend to be, the
      frequency with which they occur, and what the impact on the CIL
      accounting code is.
      
      Even though the CIL ctx->space_used can go backwards, it will only
      do so if the log item is already logged to the CIL and contains a
      space reservation for its entire logged state. This is tracked by
      the shadow buffer state on the log item. If the item is not
      previously logged in the CIL it has no shadow buffer nor log vector,
      and hence the entire size of the logged item copied to the log
      vector is accounted to the CIL space usage. i.e.  it will always go
      up in this case.
      
      If the item has a log vector (i.e. already in the CIL) and the size
      decreases, then the existing log vector will be overwritten and the
      space usage will go down. This is the only condition where the space
      usage reduces, and it can only occur when an item is already tracked
      in the CIL. Hence we are safe from CIL space usage underruns as a
      result of log items decreasing in size when they are relogged.
      
      Typically this reduction in CIL usage occurs from metadata blocks
      being freed, such as when a btree block merge occurs or a directory
      entry/xattr entry is removed and the da-tree is reduced in size.
      This generally results in a reduction in size of around a single
      block in the CIL, but also tends to increase the number of log
      vectors because the parent and sibling nodes in the tree need to be
      updated when a btree block is removed. If a multi-level merge
      occurs, then we see a reduction in size of 2+ blocks, but again the
      log vector count goes up.
      
      The other vector is inode fork size changes, which only log the
      current size of the fork and ignore the previously logged size when
      the fork is relogged. Hence if we are removing items from the inode
      fork (dir/xattr removal in shortform, extent record removal in
      extent form, etc) the relogged size of the inode fork can decrease.
      
      No other log items can decrease in size either because they are a
      fixed size (e.g. dquots) or they cannot be relogged (e.g. relogging
      an intent actually creates a new intent log item and doesn't relog
      the old item at all.) Hence the only two vectors for CIL context
      size reduction are relogging inode forks and marking buffers active
      in the CIL as stale.
      
      Long story short: the majority of the code does the right thing and
      handles the reduction in log item size correctly, and only the CIL
      hard throttle implementation is problematic and needs fixing. This
      patch makes that fix, and also adds comments to the log item code
      paths that can result in items shrinking in size when relogged, as a
      clear reminder that this can and does happen frequently.
      
      The throttle fix is based upon the change Donald proposed, though it
      goes further to ensure that once the throttle is activated, it
      captures all tasks until the CIL push issues a wakeup, regardless of
      whether the CIL space used has gone back under the throttle
      threshold.
      
      This ensures that we prevent tasks reducing the CIL slightly under
      the throttle threshold and then making more changes that push it
      well over the throttle limit. This is achieved by checking if the
      throttle wait queue is already active as a condition of throttling.
      Hence once we start throttling, we continue to apply the throttle
      until the CIL context push wakes everything on the wait queue.
      
      We can use waitqueue_active() for the waitqueue manipulations and
      checks as they are all done under the ctx->xc_push_lock. Hence the
      waitqueue has external serialisation and we can safely peek inside
      the wait queue without holding the internal waitqueue locks. A rough
      sketch of the throttle check follows this entry.
      
      Many thanks to Donald for his diagnostic and analysis work to
      isolate the cause of this hang.
      Reported-and-tested-by: NDonald Buczek <buczek@molgen.mpg.de>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NAllison Henderson <allison.henderson@oracle.com>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      19f4e7cc
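      A rough sketch of the "sticky" throttle check described above; the
      limit macro and wait queue names mirror the commit's description and
      should be treated as assumptions rather than the exact patch code:

      	/*
      	 * Once anyone is sleeping on the push wait queue, keep
      	 * throttling new committers even if space_used has dipped back
      	 * under the blocking limit. waitqueue_active() is safe here
      	 * because everything runs under cil->xc_push_lock.
      	 */
      	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log) ||
      	    waitqueue_active(&cil->xc_push_wait)) {
      		/* xlog_wait() drops xc_push_lock and sleeps until the push wakes us. */
      		xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
      		return;
      	}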